
Apache Spark RDDs - RDD transformations and RDD actions

Definitions:

RDD - a resilient distributed dataset

DAG of operations - a directed acyclic graph of the requested RDD operations. Spark builds it under the hood; it tells Spark which RDD operations the code wants to perform on the input RDD and in what order it wants to perform them.
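As a loose illustration of this idea (this is plain Python, not Spark's actual internal representation), a small "plan" object can queue up operations and only run them when a result is finally requested:

```python
# A toy plan of operations, evaluated only on demand -- loosely analogous
# to the DAG Spark builds. The Plan class is hypothetical, for illustration.
class Plan:
    def __init__(self, data, ops=()):
        self.data, self.ops = data, list(ops)

    def map(self, f):
        # Queue a map step; nothing is computed yet.
        return Plan(self.data, self.ops + [("map", f)])

    def filter(self, p):
        # Queue a filter step; nothing is computed yet.
        return Plan(self.data, self.ops + [("filter", p)])

    def collect(self):
        # Only now are the queued operations actually executed, in order.
        out = self.data
        for kind, fn in self.ops:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

plan = Plan([1, 2, 3, 4]).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(len(plan.ops))   # 2 -- two queued operations, nothing computed yet
print(plan.collect())  # [4, 16]
```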


Example (PySpark code):

>>> sc.parallelize([1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8]).filter(lambda x : x % 2 == 0).filter(lambda x : x < 6).map(lambda x : x*x).distinct().collect()

The result of this PySpark code (Python code for Spark) is the list [16, 4] (the order of elements after distinct is not guaranteed).

The input RDD in this case is the result of this call:

sc.parallelize([1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8])

This call just turns a simple list of elements into an RDD.
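The same pipeline can be traced step by step in plain Python (no Spark needed) to see what each operation contributes:

```python
# A plain-Python trace of the pipeline above -- each step mirrors one
# RDD operation from the example.
data = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8]

evens = [x for x in data if x % 2 == 0]   # filter(lambda x: x % 2 == 0)
small = [x for x in evens if x < 6]       # filter(lambda x: x < 6)
squares = [x * x for x in small]          # map(lambda x: x * x)
result = list(dict.fromkeys(squares))     # distinct()

print(evens)    # [2, 2, 4, 4, 6, 6, 8, 8]
print(small)    # [2, 2, 4, 4]
print(squares)  # [4, 4, 16, 16]
print(result)   # [4, 16] -- same elements as Spark's [16, 4], order aside
```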


Explanations:

An RDD is an immutable distributed collection of data elements, partitioned across a set of nodes/hosts of the cluster. A lost partition can be recomputed, which provides fault tolerance.

For programming purposes it can be thought of as one large collection; its elements are simply not stored on a single node/host but are instead partitioned and stored across multiple nodes/hosts.
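To make the partitioning idea concrete, here is a rough sketch of splitting a collection into roughly equal chunks. The `partition` helper is hypothetical; Spark's actual partitioners work differently, this only illustrates the concept:

```python
# Illustrative only: split a list into num_partitions roughly equal chunks.
def partition(elements, num_partitions):
    size, extra = divmod(len(elements), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        # The first `extra` chunks get one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(elements[start:end])
        start = end
    return chunks

data = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8]
print(partition(data, 4))  # four chunks of four elements each
```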

The RDD API provides two fundamentally different types of operations that can be performed on an RDD. 


RDD Transformations

When called, they construct new RDDs from existing ones; their results are these new RDDs. The new/resulting RDDs are not immediately evaluated, which is called lazy evaluation. So when an RDD transformation is called, it just adds an operation (a transformation) to a DAG of operations. This DAG of operations describes what operations we want to perform on our input RDD. Here is a list of the most frequently used RDD transformations: map, flatMap, filter, distinct, sample, union, intersection, subtract, cartesian, zip
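Plain-Python analogues can show what each of these transformations does to its elements. This is not the PySpark API (these run eagerly on ordinary lists, and Spark's union keeps duplicates unless distinct is applied), just an illustration:

```python
# Eager, plain-Python analogues of common RDD transformations.
from itertools import product

data = [1, 2, 3]
other = [3, 4]

mapped = [x * 10 for x in data]             # map: one output per input
flat = [y for x in data for y in (x, -x)]   # flatMap: each input may yield several outputs
kept = [x for x in data if x > 1]           # filter: keep elements passing a predicate
deduped = sorted(set([1, 1, 2, 2]))         # distinct: drop duplicates
both = data + other                         # union: concatenation (keeps duplicates)
common = sorted(set(data) & set(other))     # intersection
left_only = sorted(set(data) - set(other))  # subtract
pairs = list(product(data, other))          # cartesian: all element pairs
zipped = list(zip(data, [10, 20, 30]))      # zip: pair up elements positionally

print(flat)   # [1, -1, 2, -2, 3, -3]
print(pairs)  # [(1, 3), (1, 4), (2, 3), (2, 4), (3, 3), (3, 4)]
```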


RDD Actions

When called, they actually trigger the computations/evaluations described by the DAG of operations that was previously built. They return the result of these computations, which can be a single value or a collection of values. Here is a list of the most frequently used RDD actions: collect, count, countByValue, take, top, reduce
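Again as an illustration (not the PySpark API), each of these actions corresponds to a value you could compute eagerly in plain Python:

```python
# Plain-Python analogues of common RDD actions; each line computes the
# kind of value the corresponding action would return.
from collections import Counter
from functools import reduce

data = [5, 1, 3, 3, 2]

collected = list(data)                    # collect: all elements
count = len(data)                         # count: number of elements
by_value = dict(Counter(data))            # countByValue: element -> occurrence count
first_three = data[:3]                    # take(3): first n elements
top_two = sorted(data, reverse=True)[:2]  # top(2): n largest elements
total = reduce(lambda a, b: a + b, data)  # reduce: fold elements with a function

print(by_value)  # {5: 1, 1: 1, 3: 2, 2: 1}
print(total)     # 14
```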


Nothing is really computed in our Spark driver program until an RDD action is called.
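Python generators give a rough feel for this behavior (they are not how Spark implements it): building the pipeline runs nothing, and only consuming it triggers the work.

```python
# Building the generator queues the work; consuming it performs the work.
def traced_evens(items):
    for x in items:
        print(f"examining {x}")  # side effect to show when work actually happens
        if x % 2 == 0:
            yield x

lazy = traced_evens([1, 2, 3, 4])  # nothing printed yet: no work done
result = list(lazy)                # list() plays the role of an action here
print(result)  # [2, 4]
```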

In the example above the operations filter, filter, map, distinct are RDD transformations and the final operation collect is an RDD action. 

