Data Modeling in Apache Spark - Part 1: Date Dimension
Data modeling is one of the most important aspects of data analysis. Having the right kind of model allows users to ask business questions easily. Data modeling techniques have been the bedrock of SQL data warehouses for the past few decades.
As Apache Spark has become the new warehousing technology, we should be able to use these earlier data modeling techniques in Spark too. This makes Spark data pipelines much more effective.
In this series of posts, I will be discussing different data modeling techniques in the context of Spark. This is the first post in the series, and it discusses using a date dimension. You can access all the posts in the series here.
Importance of Date and Time in Data Analysis
Most of the data that we analyse typically contains a date or timestamp. For example, it may be
• Trading date of a stock
• Time of transactions in a POS system
Much of the analysis we do revolves around that date or time. We typically want to slice and dice the data using it.
Date Analysis using Built-In Spark Date Functions
This section of the document talks about how to do date analysis using built-in Spark date functions.
Apple Stocks Data
For this example, we will be using Apple stocks data. Below is a sample of the data:
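The exact file is not reproduced here; the layout below is an assumption based on the standard Yahoo Finance stock export, and the values shown are illustrative only:

```
Date,Open,High,Low,Close,Volume,Adj Close
2013-12-31,554.17,561.28,554.00,561.02,55771100,76.30
...
```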
Loading into a Spark Dataframe
The below code loads the data into a Spark dataframe:
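A minimal sketch of the load, assuming the file lives at src/main/resources/applestock.csv (the path, and the yyyy-MM-dd date format, are assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("date-dimension")
  .getOrCreate()

// Read the CSV as strings, then convert the columns used in the analysis
val stocksDf = spark.read
  .option("header", "true")
  .csv("src/main/resources/applestock.csv")
  .withColumn("Date", to_date(col("Date"), "yyyy-MM-dd"))
  .withColumn("Close", col("Close").cast("double"))
```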
Date Analysis
In this section, let’s see how we can answer date-related questions.
Are there any records which belong to the weekend?
This analysis is typically done to check the quality of the data. There should not be any records for weekends, as no trading is done on weekends.
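A sketch using the built-in dayofweek function (available since Spark 2.3):

```scala
// dayofweek returns 1 for Sunday through 7 for Saturday
val weekendDf = stocksDf.filter(dayofweek(col("Date")).isin(1, 7))
weekendDf.count() // expect 0, since there is no weekend trading
```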
In the above code, 1 signifies Sunday and 7 signifies Saturday. As we can see, the code is not readable unless we know how to decode these magic numbers.
Show Quarterly Max Price
This analysis finds the maximum price for each quarter.
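A sketch using the built-in year and quarter functions, assuming the maximum is taken over the Close column:

```scala
// Derive year and quarter from the trading date, then aggregate
val quarterlyMaxDf = stocksDf
  .groupBy(year(col("Date")).as("year"), quarter(col("Date")).as("quarter"))
  .agg(max("Close").as("max_close"))
  .orderBy("year", "quarter")
```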
Challenges with Date Analysis using Spark Date Functions
Even though we can do the above analysis using Spark's built-in date functions, writing them is tricky. They also cannot be easily expressed from external BI solutions, where the end users are typically business analysts. So we need an easier and better way to do the above.
Date Dimension
A date dimension is a static dataset which lists the different properties of a given date in its columns. The sample dataset schema looks as below:
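A minimal sketch of the schema; only full_date, year, and quarter are described in this post, so the remaining columns are typical date dimension fields and are assumptions:

```
root
 |-- date_key: integer
 |-- full_date: timestamp
 |-- day_of_week: integer
 |-- day_name: string
 |-- day_is_weekend: string
 |-- month: integer
 |-- quarter: integer
 |-- year: integer
 ...
```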
In the above schema, some of the important columns are
- full_date - timestamp for the given day
- year - the year of the date
- quarter - the quarter to which the given date belongs
etc.
This static dataset can be generated for multiple years and kept available. The sample we are using in this example can be downloaded from the below link.
Date Analysis using Date Dimension
This section of the document talks about how to do the above analysis using the date dimension.
Loading the Data into a Spark Dataframe
We can create a dataframe for our date dataset as below:
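A sketch of the load, assuming the dimension is a CSV named dates.csv whose full_date column is a yyyy-MM-dd string (both the file name and the format are assumptions):

```scala
val dateDf = spark.read
  .option("header", "true")
  .csv("src/main/resources/dates.csv")
  // Convert the full_date string into Spark's native date type
  .withColumn("full_date", to_date(col("full_date"), "yyyy-MM-dd"))
```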
In the above code, preprocessing is done to convert the string into the Spark date datatype.
Joining with Stocks Data
We can combine the stocks data with the date dimension using a Spark join:
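A sketch of the join, matching each trading date to its single row in the dimension:

```scala
// Inner join on the date key; every stock record matches exactly one
// dimension row, so the row count stays the same
val joinedDf = stocksDf.join(dateDf, stocksDf("Date") === dateDf("full_date"))
```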
This join doesn’t increase the size of the data: it’s an inner join, and each stock record matches exactly one row in the date dimension.
Analysis
This section shows how the same analysis can be done without using complex Spark date functions.
Are there any records which belong to the weekend?
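With the dimension joined in, the weekend check becomes a readable filter. The day_is_weekend column and its yes/no values are assumptions carried over from the schema sketch above:

```scala
// Filter on a self-describing flag instead of magic numbers
// (day_is_weekend and its "yes"/"no" values are assumed semantics)
val weekendWithDimDf = joinedDf.filter(col("day_is_weekend") === "yes")
weekendWithDimDf.count() // expect 0, since there is no weekend trading
```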
Show Quarterly Max Price
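The quarterly aggregation now groups directly on the precomputed year and quarter columns of the dimension, with no date functions needed:

```scala
// year and quarter come straight from the date dimension columns
val quarterlyMaxWithDimDf = joinedDf
  .groupBy("year", "quarter")
  .agg(max("Close").as("max_close"))
  .orderBy("year", "quarter")
```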
Advantages of Date Dimension
This section discusses the advantages of a date dimension.
Reuse Across Different Analysis
The same dataset can be used for different analyses. Rather than writing special functions in each query or adding these columns to the dataset itself, having a standard date dimension helps standardise all date analysis.
Scalable
Users can add more properties to the date dimension, such as regional holidays. This enriches the analysis for everyone, with no additional querying needed.
User Friendly
The queries generated using the date dimension are easier to understand.
Code
You can access the complete code on GitHub.