Migrating to Spark 2.0 - Part 9 : Hive Integration
Spark 2.0 brings significant changes to the abstractions and APIs of the Spark platform. Along with a performance boost, this version has made some backward-incompatible changes to the framework. To keep up to date with the latest updates, you need to migrate your Spark 1.x code base to 2.x. In the last few weeks, I was involved in migrating a fairly large code base and found it quite an involved process. In this series of posts, I will be documenting my experience of the migration, so it may help everyone out there who is planning to do the same.
This is the ninth post in the series. In this post, we will discuss Hive integration in Spark. You can access all the posts in the series here.
TL;DR You can access all the code on github.
Hive Integration in Spark
From the very beginning of Spark SQL, Spark has had good integration with Hive. Hive was primarily used for SQL parsing in 1.3, and for the metastore and catalog APIs in later versions. In Spark 1.x, we needed to use HiveContext to access HiveQL and the Hive metastore.
From Spark 2.0, there is no extra context to create. Hive support integrates directly with the Spark session. Also, the catalog API, which we discussed in the last post, is available for the Hive metastore as well.
In the sections below, we will discuss how to use Hive with the Spark 2.0 APIs. This will help you migrate your HiveContext code to the new API.
Enabling Hive Support
By default, the Spark session is not configured to connect to Hive. We have to enable it explicitly using enableHiveSupport at the time of session creation.
The Spark session looks for hive-site.xml on the classpath to connect to the Hive metastore.
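A minimal sketch of creating such a session is shown below; the application name and master are placeholders for illustration.

```scala
import org.apache.spark.sql.SparkSession

// create a spark session with hive support enabled.
// without enableHiveSupport, the session uses the default in-memory catalog.
val sparkSession = SparkSession.builder()
  .appName("hive-integration-example") // placeholder application name
  .master("local[*]")                  // placeholder master; point to your cluster as needed
  .enableHiveSupport()
  .getOrCreate()
```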
Hive State
Before we start running different operations on hive, make sure that you have hive installed and running. Also make sure you have hive-site.xml in the spark classpath.
Currently my Hive installation has a single table, sales, which contains the sales.csv data that we have used in earlier posts. We can observe the same from the hive command line as below.
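For reference, listing the tables from the hive shell would look something like this, sales being the only table here:

```
hive> show tables;
OK
sales
```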
Catalog to List Tables
The first operation is to list the tables in Hive. We can use the Spark catalog listTables API to list them from the Hive metastore.
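A sketch of listing the Hive tables through the catalog API, reusing the session created above:

```scala
// list all tables registered in the hive metastore for the current database
val tables = sparkSession.catalog.listTables()
tables.show(false)
```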
Output
As you can observe, the output is quite different from what we saw when querying in-memory tables in the last post. When we connect to Hive, the additional information below is filled in:
- database - Name of the database in Hive
- tableType - MANAGED means it is a native Hive table. It will be EXTERNAL for external tables.
- isTemporary - For a Spark view it is set to true. Since this table is loaded from Hive, it's false.
Loading Table
Once we have listed the tables, we can load a table from Hive. We use the table API on the Spark session to do the same.
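A sketch of loading the sales table, again using the session created earlier:

```scala
// load the hive table as a dataframe using the table API on the spark session
val salesDf = sparkSession.table("sales")
salesDf.printSchema()
salesDf.show()
```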
Output
Saving Dataframe as Hive Table
We can also save a dataframe to Hive as a table. It will create the table metadata in the Hive metastore and save the data in parquet format.
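A sketch of writing the dataframe back to Hive, reusing the salesDf loaded above; the table name sales_parquet is just a placeholder for illustration:

```scala
// save the dataframe as a managed hive table.
// by default the data is written in parquet format to the hive warehouse.
salesDf.write
  .mode("overwrite")            // overwrite if the table already exists
  .saveAsTable("sales_parquet") // placeholder table name
```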
Observe the output in Hive.
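Back in the hive shell, the newly created table should show up alongside the original one (assuming the placeholder name used in the sketch above):

```
hive> show tables;
OK
sales
sales_parquet
```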
Complete Code
You can access the complete code on github.
Conclusion
Spark 2.0 unifies the Hive integration with the Spark session and catalog API. We no longer need to create a separate context or use different APIs to access Hive.