Evaluating Spark RDDs for side effects
Accumulators in Spark are highly useful for performing side-effect-based operations. For example, the following code calculates both the sum and the sum of squares as a side effect.
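A minimal sketch of such code, assuming a local SparkContext `sc` and the Spark 2.x accumulator API (the accumulator and RDD names here are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("rdd-side-effects").setMaster("local[*]")
val sc = new SparkContext(conf)

val data = sc.parallelize(1 to 100)

// Accumulators to be updated as a side effect of map
val sumAccumulator = sc.longAccumulator("sum")
val squaredSumAccumulator = sc.longAccumulator("squaredSum")

// Each element updates both accumulators as it flows through map
val sumRDD = data.map { value =>
  sumAccumulator.add(value)
  squaredSumAccumulator.add(value.toLong * value)
  value
}

println("sum = " + sumAccumulator.value)
println("squared sum = " + squaredSumAccumulator.value)
```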
The code looks good, but it will print zero as the sum. This is because map is a lazy operation; nothing runs until an action triggers it. Here we want to evaluate sumRDD only to update the accumulators. Normally we would use collect or count to trigger the calculation, but collect unnecessarily loads every partition into driver memory, and count performs counting and aggregation work we don't need.
So we need an operation which just evaluates the RDD for its side effects without actually returning any value.
##Evaluating an RDD
The following function takes an RDD and evaluates it:
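Here is a sketch of such a function, built on `runJob` (explained below):

```scala
import org.apache.spark.rdd.RDD

// Evaluate an RDD purely for its side effects, returning nothing
def evaluate[T](rdd: RDD[T]): Unit = {
  rdd.sparkContext.runJob(rdd, (iterator: Iterator[T]) => {
    // Walk the iterator so every element is computed,
    // discarding the values themselves
    while (iterator.hasNext) iterator.next()
  })
}
```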
We are using the runJob API on the SparkContext, which triggers the evaluation. The API takes the RDD to be evaluated and a function that is applied to each partition.
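Simplified from Spark's SparkContext, the overload used here has this form:

```scala
def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U]
```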
We pass a function which just walks over the iterator without producing any value. This lets us update the accumulators without collecting or counting anything.
##Using evaluate
Now we use the evaluate function to evaluate our sumRDD and read our accumulator values.
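Continuing the sketch from above:

```scala
// Force the computation purely for its side effects
evaluate(sumRDD)

println("sum = " + sumAccumulator.value)               // 5050 for 1 to 100
println("squared sum = " + squaredSumAccumulator.value) // 338350 for 1 to 100
```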
Now it prints the correct values for the sum and the squared sum.