Statistical Data Exploration using Spark 2.0  Part 1 : Five Number Summary
Data exploration is an important part of data analysis to understand nature of data. Data scientists use various mathematical and statistics techniques to understand the distribution and shape of the data which comes handy to draw conclusions.
Recently I started going through the coursera course “Making Sense of Data” videos to understand the data exploration techniques. Its an excellent course which explains all the basics of theory and practical aspects of data exploration. Currently the course is not available at the coursera. But you can find recording of earlier course on youtube.
Most of the examples of course are explained using R programming language. As Spark 2.0 and R share dataframe as common abstraction, I thought it will be interesting to explore possibility of using Spark dataframe/datasets abstractions to do explore the data.
This series of blog posts are focused on the data exploration using spark. I will be comparing the R dataframe capabilities with spark ones. I will be using Spark 2.0 version with Scala API and Zeppelin notebooks for visualizations.This is the first blog in series where we will be discussing how to derive summary statistics of a dataset. You can find all other blogs in the series here.
TL;DR All code examples available on github.
Loading Dataset
For our example, we will be using life expectancy dataset. This is a dataset which has average life expectancy of all countries across the world. Its space separated file with following three fields.
 Country
 Life Expectancy
 Region
The below code shows to how to read the data in R
We can load the same data in spark as below
We have to do little bit more work in spark as sparkcsv package doesn’t handle double spaces properly.
Now we have data ready to explore.
Five Number Summary
Five number summary is one of the basic data exploration technique where we will find how values of dataset columns are distributed. In our example, we are interested to know the summary of “LifeExp” column.
Five Number Summary Contains following information
 Min  Minimum value of the column
 First Quantile  The 25% th data
 Median  Middle Value
 Third Quartile  75% of the value
 Max  maximum value
The above values gives a fair idea about how the values are distributed for the column. For categorical columns it will be different. We will discuss about that in future blogs.
The below code in R allows to compute above values
We get below result
R shows mean also as part of the summary.
Five Number Summary in Spark
In Spark, we can get same information using describe method.
The below is the output
If you observe the result, rather than giving quantiles values and median, spark gives standard deviation. The reason is, median and quantiles are costly to compute on large data. Both values need data to be in sorted order and result in skewed calculations.
Calculating Quantiles in Spark
We can calculate quartiles using approxQuantile method introduced in spark 2.0. This allows us to find 25%, median and 75% values like R. The name of the method suggests that we can get approximate values whenever we specify the error rate. This makes calculations much faster compared to absolute value.
In our example we will be calculating the exact values so that we can compare to R.
In above code, the below are the parameter to approxQuantile method
 Name of the column

Array of values signifying which quantile we want. In our example we are calculating 25%, 50% and 75%. 50% is same as median
 The last parameter signifies the error rate. 0.0 signifies we want exact value.
The below is the result
If you compare the result, it matches with the R values.
You can access complete code on github.
Conclusion
In this post we discussed how to get started with exploring data using statistics in spark 2.0. In future blogs we will discuss further more techniques and API’s.