Fold in Spark
Fold is a very powerful operation in Spark which allows you to calculate many important values in O(n) time. If you are familiar with Scala collections, it will feel like using the fold operation on a collection. Even if you have not used fold in Scala, this post will make you comfortable using it.
###Syntax
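On an RDD[T], fold has the following signature in the Spark API:

```scala
def fold(zeroValue: T)(op: (T, T) => T): T
```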
The above is a high-level view of the fold API. It has the following three parts:
- T is the data type of the RDD
- zeroValue is the starting accumulator of type T, which will also be the return value of the fold operation
- op is a function which is called for each element in the RDD together with the previous accumulator
Let’s see some examples of fold.
###Finding max in a given RDD
Let’s first build an RDD.
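A minimal sketch, assuming a SparkContext named sc is available and representing each employee as a hypothetical (name, salary) tuple:

```scala
// Hypothetical employee data as (name, salary) pairs
val employeeData = List(("Jack", 1000.0), ("Bob", 2000.0), ("Carl", 7000.0))

// Distribute the local collection as an RDD
val employeeRDD = sc.makeRDD(employeeData)
```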
Now we want to find the employee with the maximum salary. We can do that using fold.
To use fold we need a start value. The following code defines a dummy employee as the starting accumulator.
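One possible dummy value, assuming salaries are non-negative so the dummy never wins the comparison:

```scala
// A dummy employee with a salary lower than any real salary,
// so it is a neutral starting value for the max comparison
val dummyEmployee = ("dummy", 0.0)
```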
Now, using fold, we can find the employee with the maximum salary.
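A sketch of the fold call over the RDD built above:

```scala
// Keep whichever of the accumulator and the current employee
// has the higher salary (the second element of the tuple)
val maxSalaryEmployee = employeeRDD.fold(dummyEmployee)(
  (acc, employee) => if (acc._2 < employee._2) employee else acc
)
println("employee with maximum salary is " + maxSalaryEmployee)
```

Note that the operation passed to fold should be associative, because Spark folds each partition in parallel and then combines the per-partition results.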
###Fold by key
In Map/Reduce, the key plays the role of grouping values. We can use the foldByKey operation to aggregate values based on keys.
In this example, employees are grouped by department name. If we want to find the maximum salary in each department, we can use the following code.
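A sketch under the same assumptions as above, with hypothetical (department, (name, salary)) pairs:

```scala
// Hypothetical employees keyed by department name
val deptEmployees = List(
  ("cs", ("jack", 1000.0)),
  ("cs", ("bruce", 1500.0)),
  ("phy", ("sam", 2200.0)),
  ("phy", ("ronaldo", 1500.0))
)
val deptEmployeeRDD = sc.makeRDD(deptEmployees)

// foldByKey folds the values of each key separately,
// so each department gets its own max-salary employee
val maxByDept = deptEmployeeRDD.foldByKey(("dummy", 0.0))(
  (acc, employee) => if (acc._2 < employee._2) employee else acc
)

println("maximum salaries in each dept " + maxByDept.collect().toList)
```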