Introduction to Spark 2.0 - Part 3: Porting Code from RDD API to Dataset API
Spark 2.0 is the next major release of Apache Spark. This release brings major changes to the abstractions, APIs and libraries of the platform, and it sets the tone for the framework's direction over the next year. So understanding these features is critical for anyone who wants to make use of all the advances in this new release. In this series of blog posts, I will be discussing the different improvements landing in Spark 2.0.
This is the third blog in the series, where I will be discussing how to port your RDD based code to the Dataset API. You can access all the posts in the series here.
TL;DR All code examples are available on GitHub.
## RDD to Dataset
The Dataset API combines the best of the RDD and DataFrame APIs in a single API. Many APIs in Dataset mimic the RDD API, though they differ a lot in implementation. So most RDD code can be ported to the Dataset API quite easily. In this post, I will be sharing a few code snippets that show how code written against the RDD API can be written with the Dataset API.
## 1. Loading Text Data
### RDD
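The snippets in this post are minimal sketches rather than the exact code from the repository, and they assume a hypothetical local text file at `src/main/resources/data.txt`. With the RDD API we start from a `SparkContext`:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("rddToDataset").setMaster("local[*]")
val sc = new SparkContext(conf)

// load the file as an RDD[String], one element per line
val rddData = sc.textFile("src/main/resources/data.txt")
```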
### Dataset
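With the Dataset API, the entry point is the new `SparkSession`. `read.text` gives a `DataFrame`, and `as[String]` turns it into a `Dataset[String]`:

```scala
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .appName("rddToDataset")
  .master("local[*]")
  .getOrCreate()

// the implicits bring in the encoders needed by the typed operations below
import sparkSession.implicits._

val dsData = sparkSession.read.text("src/main/resources/data.txt").as[String]
```

The snippets in the following sections reuse `sc`, `sparkSession`, `rddData` and `dsData` defined here.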
## 2. Calculating count
### RDD
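`count` has the same shape in both APIs:

```scala
val rddCount = rddData.count()
```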
### Dataset
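And on the Dataset:

```scala
val dsCount = dsData.count()
```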
## 3. WordCount Example
### RDD
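A sketch of the classic word count, splitting the lines loaded earlier on spaces:

```scala
val wordsRDD = rddData.flatMap(line => line.split(" "))

// pair each word with 1 and sum the counts per word
val wordCountRDD = wordsRDD.map(word => (word, 1)).reduceByKey(_ + _)
```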
### Dataset
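The Dataset API has no `reduceByKey`; one way to express the same computation is `groupByKey` followed by `count`:

```scala
val wordsDs = dsData.flatMap(line => line.split(" "))

// group by the word itself; count() yields a Dataset[(String, Long)]
val wordCountDs = wordsDs.groupByKey(word => word).count()
```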
## 4. Caching
### RDD
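Caching looks identical in both APIs:

```scala
rddData.cache()
```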
### Dataset
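Though the call is the same, a cached Dataset is stored in Spark's compact binary format rather than as Java objects:

```scala
dsData.cache()
```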
## 5. Filter
### RDD
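A sketch with a hypothetical predicate that keeps the lines containing the word "spark":

```scala
val sparkLinesRDD = rddData.filter(line => line.contains("spark"))
```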
### Dataset
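The same typed lambda works unchanged on the Dataset:

```scala
val sparkLinesDs = dsData.filter(line => line.contains("spark"))
```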
## 6. Map Partitions
### RDD
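A sketch that counts the number of lines in each partition:

```scala
val linesPerPartitionRDD = rddData.mapPartitions(iterator => List(iterator.size).iterator)
```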
### Dataset
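The Dataset version takes the same `Iterator => Iterator` function; the encoder for the resulting `Int` comes from `sparkSession.implicits._`:

```scala
val linesPerPartitionDs = dsData.mapPartitions(iterator => List(iterator.size).iterator)
```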
## 7. reduceByKey
### RDD
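A sketch that, purely for illustration, keys each line by its length and concatenates the lines that share a length:

```scala
val keyedRDD = rddData.map(line => (line.length, line))
val reducedRDD = keyedRDD.reduceByKey((a, b) => a + " " + b)
```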
### Dataset
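Dataset has no direct `reduceByKey`; `groupByKey` followed by `reduceGroups` expresses the same operation:

```scala
val keyedDs = dsData.groupByKey(line => line.length)

// reduceGroups yields a Dataset[(Int, String)]
val reducedDs = keyedDs.reduceGroups((a, b) => a + " " + b)
```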
## 8. Conversions
### RDD
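Going from a Dataset back to an RDD is a one-liner:

```scala
val rddFromDs = dsData.rdd
```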
### Dataset
Converting an RDD to a Dataset takes a little more work, as we need to specify the schema in the form of an encoder. Here we show how to convert an RDD[String] to a Dataset[String].
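A sketch using `createDataset`; the `String` encoder is picked up from `sparkSession.implicits._`:

```scala
val dsFromRdd = sparkSession.createDataset(rddData)
```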
## 9. Double Based Operations
### RDD
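An `RDD[Double]` picks up numeric helpers such as `sum` and `mean` through `DoubleRDDFunctions`. A sketch over some hypothetical numbers:

```scala
val doubleRDD = sc.parallelize(List(1.0, 2.0, 3.0))

val rddSum = doubleRDD.sum()
val rddMean = doubleRDD.mean()
```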
### Dataset
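On a `Dataset[Double]` the same aggregations go through the `agg` API; the single column of a primitive Dataset is named "value":

```scala
import org.apache.spark.sql.functions.{avg, sum}

val doubleDs = List(1.0, 2.0, 3.0).toDS()

// produces a single-row result holding the sum and the average
val dsAggregates = doubleDs.agg(sum("value"), avg("value"))
dsAggregates.show()
```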
## 10. Reduce API
### RDD
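`reduce` is identical in both APIs. Summing the doubles from the previous section on the RDD:

```scala
val rddTotal = doubleRDD.reduce((a, b) => a + b)
```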
### Dataset
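And on the Dataset:

```scala
val dsTotal = doubleDs.reduce((a, b) => a + b)
```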
You can access the complete code here.
The above code samples show how to move your RDD based code base to the new Dataset API. Though they don't cover the complete RDD API, they should give you a fair idea of how the RDD and Dataset APIs are related.