Analysing Kaggle Titanic Survival Data using Spark ML
Kaggle is one of the biggest data and code repositories for data science. Recently I have started learning various Python data science tools like scikit-learn, TensorFlow, etc. To practice these tools, I started exploring Kaggle datasets. On Kaggle, you will find many notebooks (kernels) which explore the data in various ways. Most of them are written in Python and R.
Kaggle has an introductory dataset called the Titanic survivor dataset for learning the basics of the machine learning process. In this post, I have taken some ideas for analysing this dataset from Kaggle kernels and implemented them using Spark ML. As part of the analysis, I will discuss preprocessing the data, handling null values and running cross validation to get optimal performance.
Titanic Survivor Dataset
The Titanic survivor dataset captures various details of the people who survived or did not survive the shipwreck. Using this data, you need to build a model which predicts the probability of someone’s survival based on attributes like sex, cabin etc. It’s a classification problem.
You can learn more about the dataset on Kaggle.
The rest of the post will discuss various steps to build the model.
TL;DR You can access all the code on github.
1. Preprocessing
The first part of the data analysis is to load the data and preprocess it for machine learning.
1.1 Loading CSV data
First, we load the data using the Spark data source API.
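A minimal sketch of the load step; the session setup and the train.csv path are placeholders for your environment:

```scala
import org.apache.spark.sql.SparkSession

// Create a Spark session (local mode here; adjust the master for a cluster)
val sparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("titanic-survival")
  .getOrCreate()

// Load Kaggle's train.csv with the CSV data source, inferring column types
val df = sparkSession.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("train.csv")
```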
Below is the schema of the data.
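With inferSchema enabled, df.printSchema() shows:

```
root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)
```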
The Survived column is the target column.
1.2 Handling Missing Values
In many data science cases, we need to handle missing values. These are values which are not observed, or are absent due to issues in the data capture process.
In our dataset, the age of some passengers is not known. So we use the na.fill API to fill in the mean age for all the missing age values.
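A sketch of the imputation, computing the mean over the observed ages and filling the nulls with it:

```scala
import org.apache.spark.sql.functions.avg

// avg ignores nulls, so this is the mean over the observed ages only
val meanAge = df.select(avg("Age")).first().getDouble(0)

// Replace missing values in the Age column with the mean
val imputedDf = df.na.fill(meanAge, Seq("Age"))
```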
1.3 Split Data for Train and Holdout
Once we have prepared our data, we split it into training and holdout sets. We use Spark’s randomSplit method to do this.
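A sketch of the split; the 80/20 ratio and the seed are assumptions here, and the label is cast to double up front for the ML APIs:

```scala
import org.apache.spark.sql.functions.col

// Cast the label to double (as the classifiers and evaluators expect)
// and split 80/20; the fixed seed makes the split reproducible
val Array(trainDf, holdoutDf) = imputedDf
  .withColumn("Survived", col("Survived").cast("double"))
  .randomSplit(Array(0.8, 0.2), seed = 42L)
```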
1.4 Handling Categorical Variables
In our dataset, many columns like Sex and Embarked are categorical variables. So we one-hot encode them using the Spark ML pipeline APIs. In this example, we use StringIndexer and OneHotEncoder to do that.
The below code creates stages for all the categorical variables.
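A sketch that builds an indexer and an encoder stage per categorical column; the handleInvalid setting is my addition, to tolerate null or unseen categories:

```scala
import org.apache.spark.ml.PipelineStage
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val categoricalColumns = Seq("Sex", "Embarked")

// For each categorical column: map strings to indices, then one-hot encode them
val categoricalStages: Seq[PipelineStage] = categoricalColumns.flatMap { column =>
  val indexer = new StringIndexer()
    .setInputCol(column)
    .setOutputCol(s"${column}Index")
    .setHandleInvalid("keep") // keep null/unseen categories in an extra bucket
  val encoder = new OneHotEncoder()
    .setInputCol(s"${column}Index")
    .setOutputCol(s"${column}Vec")
  Seq(indexer, encoder)
}
```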
2. Classification using RandomForest
Random forest is one of the best algorithms for classification. So we will use it for our problem.
2.1 Create ML Pipeline with RandomForest Classifier
The below code creates a vector assembler stage which accumulates all the features we are using for training. Then we create a random forest stage for the classification.
Finally, we create an ML pipeline with all the stages.
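A sketch of these stages; the exact feature list is my reading of the schema above:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler

// Combine the numeric columns and the encoded categorical vectors
// into the single features vector Spark ML expects
val assembler = new VectorAssembler()
  .setInputCols(Array("Pclass", "Age", "SibSp", "Parch", "Fare", "SexVec", "EmbarkedVec"))
  .setOutputCol("features")

// Random forest classifier predicting the Survived label
val randomForest = new RandomForestClassifier()
  .setLabelCol("Survived")
  .setFeaturesCol("features")

// Chain categorical encoding, feature assembly and classification into one pipeline
val pipeline = new Pipeline()
  .setStages((categoricalStages :+ assembler :+ randomForest).toArray)
```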
2.2 Fit the model
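Training the whole pipeline is a single call on the training split:

```scala
// Runs every stage in order and returns a fitted PipelineModel
val model = pipeline.fit(trainDf)
```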
2.3 Accuracy Score
Once the model is trained, we need to know how it is performing. So we use accuracy as our evaluation metric. The below code shows how to calculate it.
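A sketch using Spark’s MulticlassClassificationEvaluator, which supports accuracy as a metric:

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Score the holdout set with the fitted pipeline
val predictions = model.transform(holdoutDf)

// Accuracy = fraction of holdout rows whose label was predicted correctly
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("Survived")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")

val accuracy = evaluator.evaluate(predictions)
println(s"holdout accuracy: $accuracy")
```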
2.4 Results
Below are the results.
3. Cross Validation and Hyper Parameter Tuning
Random forest comes with many parameters which we can tune. Tuning them manually is a lot of work. Instead, we can use the cross validation facility provided by Spark ML to search through the parameter space and come up with the best parameters for our data.
3.1 Specifying the parameter grid
Below are the parameters we want to search over.
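A sketch of the grid; the parameters chosen and their candidate values below are illustrative, not the only sensible ones:

```scala
import org.apache.spark.ml.tuning.ParamGridBuilder

// Candidate values for some of the random forest hyperparameters
val paramGrid = new ParamGridBuilder()
  .addGrid(randomForest.maxDepth, Array(4, 6, 8))
  .addGrid(randomForest.numTrees, Array(20, 50, 100))
  .addGrid(randomForest.impurity, Array("gini", "entropy"))
  .build()
```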
3.2 Cross Validation
Once we define the parameters, we define a cross validation stage to search through them. Cross validation also makes sure that we don’t overfit the data.
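A sketch wiring the pipeline, evaluator and grid into Spark’s CrossValidator; the fold count of 3 is an assumption:

```scala
import org.apache.spark.ml.tuning.CrossValidator

// k-fold cross validation over every parameter combination in the grid
val crossValidator = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// fit picks the best combination and refits it on the full training split
val cvModel = crossValidator.fit(trainDf)

val cvAccuracy = evaluator.evaluate(cvModel.transform(holdoutDf))
println(s"holdout accuracy with cross validation: $cvAccuracy")
```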
3.3 Results
As you can see, we got much better results using cross validation.
4. Generating Submit File
As the data used in this example is part of a Kaggle competition, I generated results for their test data to submit, using the below code.
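A sketch, assuming Kaggle’s test.csv is available locally; the output path is a placeholder, and the extra Fare fill covers the missing fare value in the test file:

```scala
import org.apache.spark.sql.functions.col

// Load the competition test data and impute missing values
// the same way as for training
val testDf = sparkSession.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("test.csv")
  .na.fill(meanAge, Seq("Age"))
  .na.fill(0.0, Seq("Fare")) // test.csv has a missing Fare value

// Score with the cross validated model and keep the two columns Kaggle expects
cvModel.transform(testDf)
  .withColumn("Survived", col("prediction").cast("int"))
  .select("PassengerId", "Survived")
  .coalesce(1) // single output file
  .write
  .option("header", "true")
  .csv("titanic_submission") // placeholder output directory
```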
You can access the complete code on github.
Kaggle Ranking
Using the above code, I got an accuracy of 0.81 and a rank of 250 on Kaggle.