2020-06-21

Learn more about RDD in Spark

RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable, partitioned collection of elements that can be computed in parallel. RDD has the characteristics of a data-flow model: automatic fault tolerance, location-aware scheduling, and scalability. RDD lets users explicitly cache a working set in memory when running multiple queries; subsequent queries can reuse that working set, which greatly improves query speed. Today, let's talk briefly about RDD in Spark. The RDD API will be covered in detail in the next chapter.
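As a quick illustration of reusing a cached working set, here is a minimal sketch that could be run in spark-shell, where the SparkContext is already available as sc; the log path and filter strings are only placeholders.

    val logs   = sc.textFile("hdfs:///tmp/app.log")           // placeholder path
    val errors = logs.filter(_.contains("ERROR")).cache()     // keep the working set in memory
    errors.count()                                            // first action computes and caches the RDD
    errors.filter(_.contains("timeout")).count()              // later queries reuse the cached data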

RDD Introduction

RDD can be regarded as a Spark object that itself lives in memory. For example, reading a file produces an RDD, a computation over a file is an RDD, and the result set is also an RDD; different shards, data dependencies, and key-value map data can all be regarded as RDDs. (Note: from Baidu Encyclopedia.) We will not say much more about RDD itself here; let's just talk about its two kinds of operations.

Two types of RDD (operator) operations: Transformation and Action

Transformation

Literally, a transformation turns one RDD into a new RDD through an operation. Transformation code is not executed immediately: only when an action appears in the code is the chain of operations actually run. This design lets Spark run more efficiently.
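A small, self-contained sketch of this behavior; the application name and local master are arbitrary choices for illustration.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("lazy-demo").setMaster("local[*]"))
    val nums    = sc.parallelize(1 to 1000000)
    val evens   = nums.filter(_ % 2 == 0)   // transformation: nothing is executed yet
    val doubled = evens.map(_ * 2)          // still only builds the lineage (DAG)
    val total   = doubled.count()           // action: the job is actually run here
    sc.stop()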

Transformation API

The following table lists some of the common transformations supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.


union(otherDataset)
Return a new dataset that contains the union of the elements in the source dataset and the argument.

intersection(otherDataset)
Return a new RDD that contains the intersection of elements in the source dataset and the argument.

distinct([numPartitions])
Return a new dataset that contains the distinct elements of the source dataset.

groupByKey([numPartitions])
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: if you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. Note: by default, the level of parallelism in the output depends on the number of partitions of the parent RDD; you can pass an optional numPartitions argument to set a different number of tasks.
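For example (assuming the sc from spark-shell and made-up data), grouping and then summing shuffles every value, while reduceByKey combines values before the shuffle:

    val pets = sc.parallelize(Seq(("cat", 1), ("dog", 1), ("cat", 2)))
    val grouped = pets.groupByKey()          // RDD[(String, Iterable[Int])]
    grouped.mapValues(_.sum).collect()       // per-key sums, but every value crosses the network
    pets.reduceByKey(_ + _).collect()        // same totals with map-side combining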

reduceByKey(func, [numPartitions])
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
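The classic word count is a minimal sketch of reduceByKey in use (assuming sc and a placeholder input path):

    val counts = sc.textFile("hdfs:///tmp/input.txt")   // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                               // partial sums are combined within each partition first
    counts.take(10).foreach(println)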

aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions])
When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
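A common use is computing a per-key average, where the aggregated type (sum, count) differs from the input Int. A sketch assuming sc and made-up data:

    val scores = sc.parallelize(Seq(("a", 3), ("a", 5), ("b", 4)))
    val sumCount = scores.aggregateByKey((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),              // seqOp: fold a value into (sum, count) within a partition
      (x, y)   => (x._1 + y._1, x._2 + y._2)             // combOp: merge (sum, count) pairs across partitions
    )
    val averages = sumCount.mapValues { case (sum, n) => sum.toDouble / n }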

sortByKey([ascending], [numPartitions])
When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

join(otherDataset, [numPartitions])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
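A small sketch of join and leftOuterJoin on two pair RDDs (the names and values are made up; assumes sc):

    val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
    val orders = sc.parallelize(Seq((1, 9.99), (1, 4.50), (3, 7.00)))
    users.join(orders).collect()            // only keys present in both sides: key 1 with ("alice", amount)
    users.leftOuterJoin(orders).collect()   // keeps key 2 as ("bob", None)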

cogroup(otherDataset, [numPartitions])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.

cartesian(otherDataset)
When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).

pipe(command, [envVars])
Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.

coalesce(numPartitions)
Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.

repartition(numPartitions)
Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
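A sketch contrasting coalesce and repartition after a heavy filter (assumes sc; the path is a placeholder):

    val filtered = sc.textFile("hdfs:///tmp/events/*")   // placeholder path
      .filter(_.contains("ERROR"))                       // leaves many nearly empty partitions
    val compact    = filtered.coalesce(8)                // shrinks the partition count without a full shuffle
    val rebalanced = filtered.repartition(100)           // full shuffle; can grow or shrink the partition count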

repartitionAndSortWithinPartitions(partitioner)
Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.
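A minimal sketch with a HashPartitioner (assumes sc; the data is made up):

    import org.apache.spark.HashPartitioner

    val pairs  = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b"), (1, "z")))
    val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))
    // Each of the 2 output partitions now holds its keys in sorted order,
    // sorted during the shuffle rather than in a separate pass afterwards.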

Action

Execution actually happens when the program reaches an action, and every piece of Spark code needs at least one action operation.
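For example, the pipeline below does nothing until the actions at the end are called (assumes sc; the paths are placeholders):

    val upper = sc.textFile("hdfs:///tmp/in.txt").map(_.toUpperCase)   // no job has run yet
    println(upper.count())                                             // action: triggers a job
    upper.saveAsTextFile("hdfs:///tmp/out")                            // action: triggers another job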

Actions API

The following table lists some of the common actions supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) for details.


reduce(func)
Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel. Internally, the elements within each partition are reduced first, and the partial results are then combined across partitions.
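For example (assuming sc):

    sc.parallelize(1 to 100).reduce(_ + _)   // 5050: each partition is summed locally, then the partial sums are combined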

collect()
Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

count()
Return the number of elements in the dataset.

first()
Return the first element of the dataset (similar to take(1)).

take(n)
Return an array with the first n elements of the dataset.

takeSample(withReplacement, num, [seed])
Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.

takeOrdered(n, [ordering])
Return the first n elements of the RDD using either their natural order or a custom comparator.
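A quick sketch of takeSample and takeOrdered together (assumes sc; the data is made up):

    val data = sc.parallelize(Seq(5, 1, 4, 2, 3))
    data.takeSample(false, 3, seed = 42L)        // 3 random elements without replacement
    data.takeOrdered(3)                          // Array(1, 2, 3) by natural order
    data.takeOrdered(3)(Ordering[Int].reverse)   // Array(5, 4, 3) with a custom ordering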

saveAsTextFile(path)
Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.

saveAsSequenceFile(path) (Java and Scala)
Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).

saveAsObjectFile(path) (Java and Scala)
Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().

countByKey()
Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.

foreach(func)
Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems. Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.
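A sketch of the accumulator pattern the note refers to (assumes sc and Spark 2.x or later, where sc.longAccumulator is available; the data is made up):

    val errorCount = sc.longAccumulator("errorCount")
    val lines = sc.parallelize(Seq("ok", "ERROR a", "ok", "ERROR b"))
    lines.foreach { line =>
      if (line.startsWith("ERROR")) errorCount.add(1)   // safe: accumulators are designed for updates inside actions
    }
    println(errorCount.value)                           // 2, read back on the driver
    // A plain local variable incremented inside foreach would not be reflected on the driver.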
