Converting Matlab files to Spark RDDs
Research data in data science is often distributed in Matlab format. To analyze that data with Spark, you need a way to convert Matlab files to Spark RDDs. In this post I discuss using the open source JMatIO library to do that conversion.
## JMatIO - Matlab’s MAT-file I/O in Java
JMatIO is an open source library for reading Matlab files in Java, so we can also use it to read Matlab files in Spark. You can download the jar or, if you are using Maven, add it as a dependency.
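A Maven dependency for JMatIO might look like the following. The group ID, artifact ID and version shown here are assumptions, so check them against the repository you use before relying on them.

```xml
<!-- Assumed coordinates; confirm group, artifact and version before use -->
<dependency>
  <groupId>net.sourceforge.jmatio</groupId>
  <artifactId>jmatio</artifactId>
  <version>1.0</version>
</dependency>
```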
## Reading a mat file
We are going to use the MNIST mat file for this example. It contains the following four matrices:
- train_x - train data x features
- train_y - train data labels
- test_x - test data x features
- test_y - test data labels
Follow these steps to read the matrices and store them as RDDs.
- Reading mat file using JMatIO
- Getting specific Matlab variable from content
Casting to MLUInt8 indicates that the array holds unsigned 8-bit integers.
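The two steps above can be sketched roughly as follows. This is a minimal example, not a definitive implementation: the file path and the `ReadMat` object name are hypothetical, and it assumes the variables are stored as uint8 matrices under the names listed earlier.

```scala
import com.jmatio.io.MatFileReader
import com.jmatio.types.MLUInt8

object ReadMat {
  def main(args: Array[String]): Unit = {
    // Hypothetical path; point this at your own .mat file
    val reader = new MatFileReader("data/mnist_uint8.mat")

    // getContent returns a java.util.Map from variable name to MLArray
    val content = reader.getContent

    // uint8 matrices come back as MLUInt8; getArray exposes them as rows of bytes
    val trainX: Array[Array[Byte]] = content.get("train_x").asInstanceOf[MLUInt8].getArray
    val trainY: Array[Array[Byte]] = content.get("train_y").asInstanceOf[MLUInt8].getArray

    println(s"train_x: ${trainX.length} rows, train_y: ${trainY.length} rows")
  }
}
```

The cast fails with a `ClassCastException` if a variable was saved with a different Matlab type, for example as doubles, so it is worth checking how the file was written.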
- Converting Matlab arrays to a Spark LabeledPoint RDD
- toList method
Here we take both the X and Y arrays and convert each row to a Spark LabeledPoint, which is used by most of the classification algorithms.
“toDouble” is used because a Spark LabeledPoint expects all values to be doubles.
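A `toList` helper along these lines might look like the sketch below. The `MatToPoints` object name is hypothetical, and the code assumes the label sits in the first column of each Y row. If the labels are one-hot encoded across several columns, take the index of the non-zero column instead.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object MatToPoints {
  // Pairs each feature row with the label row at the same index.
  def toList(xValue: Array[Array[Byte]], yValue: Array[Array[Byte]]): List[LabeledPoint] = {
    require(xValue.length == yValue.length, "x and y must have the same number of rows")
    xValue.zip(yValue).map { case (row, labelRow) =>
      // Mask with 0xFF first: JVM bytes are signed, uint8 values are 0..255
      val features = row.map(b => (b & 0xFF).toDouble)
      // Assumption: the label is stored in the first column of each y row
      val label = (labelRow(0) & 0xFF).toDouble
      LabeledPoint(label, Vectors.dense(features))
    }.toList
  }
}
```

The `& 0xFF` mask matters because a plain `toDouble` on a byte above 127 yields a negative number, which would corrupt pixel intensities.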
- Saving the RDD for further processing
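As a rough sketch of this last step, the list of labeled points can be turned into an RDD and written out as an object file. The `SavePoints` object, the `save` helper and its parameters are hypothetical names used for illustration only.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

object SavePoints {
  // Distributes an in-memory list of points and writes it out as serialized objects.
  def save(sc: SparkContext, points: List[LabeledPoint], path: String): RDD[LabeledPoint] = {
    val rdd = sc.makeRDD(points)
    rdd.saveAsObjectFile(path)
    rdd
  }
}
```

A later job can load the data back with `sc.objectFile[LabeledPoint](path)`. Because the mat file is read on the driver before being distributed, this approach suits files that fit in driver memory.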