Research data used in data science is often distributed in MATLAB format. To analyze that data with Spark, you need a way to convert MATLAB files into Spark RDDs. In this post I discuss using the open source JMatIO library to do that conversion.

##JMatIO - Matlab’s MAT-file I/O in JAVA

JMatIO is an open source library for reading MATLAB files in Java. We can use it to read MATLAB files in Spark as well. You can download the jar directly or, if you are using Maven, add the following dependency:

  <dependency>
    <groupId>net.sourceforge.jmatio</groupId>
    <artifactId>jmatio</artifactId>
    <version>1.0</version>
  </dependency>
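If the project is built with sbt rather than Maven, the same coordinates can be declared as shown below. This is a minimal sketch that reuses only the group, artifact, and version from the Maven snippet above, and it assumes the artifact resolves from the repositories your build already uses.

```scala
// build.sbt: same coordinates as the Maven dependency above.
// JMatIO is a plain Java library, so a single % is used (no Scala version suffix).
libraryDependencies += "net.sourceforge.jmatio" % "jmatio" % "1.0"
```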

##Reading a mat file

We are going to use the MNIST mat file for this example. It contains the following four matrices:

  1. train_x - training data features
  2. train_y - training data labels
  3. test_x - test data features
  4. test_y - test data labels

Follow these steps to read the matrices and store them as an RDD.

  • Read the mat file using JMatIO
import com.jmatio.io.MatFileReader

val file = new MatFileReader("src/main/resources/mnist_uint8.mat")
val content = file.getContent
  • Get a specific MATLAB variable from the content
 import com.jmatio.types.MLUInt8

 val train_x = content.get("train_x").asInstanceOf[MLUInt8].getArray
 val train_y = content.get("train_y").asInstanceOf[MLUInt8].getArray

Casting to MLUInt8 tells us that the variable holds unsigned 8-bit integers; getArray returns its contents as an Array[Array[Byte]].
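One detail worth keeping in mind: the JVM has no unsigned byte type, so a uint8 value above 127 shows up as a negative Byte in the array. The sketch below shows one way to recover the original 0 to 255 range. The helper names `toUnsigned` and `rowToDoubles` are made up for illustration and are not part of JMatIO.

```scala
// Hypothetical helpers (not part of JMatIO) for reading uint8 data stored in JVM bytes.

// Masking with 0xFF widens the byte to an Int and drops the sign extension.
def toUnsigned(b: Byte): Int = b & 0xFF

// Converts one row of raw bytes to doubles in the range 0.0 to 255.0.
def rowToDoubles(row: Array[Byte]): Array[Double] =
  row.map(b => toUnsigned(b).toDouble)

// A pixel stored as 200 in MATLAB arrives as the byte -56 on the JVM.
val raw: Array[Byte] = Array(0.toByte, 127.toByte, (-56).toByte, (-1).toByte)
val pixels: Array[Double] = rowToDoubles(raw)
```

Calling `.toDouble` directly on the raw byte would keep the negative sign, which is usually not what you want for pixel intensities.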

  • Convert the MATLAB arrays to an RDD of labeled points
val trainList = toList(train_x,train_y)
val trainRDD = sparkContext.makeRDD(trainList)
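  • toList method
    Here we take both the X and Y matrices and convert each row into a (label, features) pair, the shape of a Spark labeled point, which most of the classification algorithms use.
 import org.apache.spark.mllib.linalg.{Vector, Vectors}

 // Pairs each feature row with its label as (label, features).
 def toList(xValue: Array[Array[Byte]], yValue: Array[Array[Byte]]): Array[(Double, Vector)] = {
   xValue.zipWithIndex.map {
     case (row, rowIndex) =>
       // uint8 values arrive as signed JVM bytes; mask with 0xFF to recover 0-255
       val features = row.map(value => (value & 0xFF).toDouble)
       // the label is read from the first column of the matching Y row
       val label = (yValue(rowIndex)(0) & 0xFF).toDouble
       (label, Vectors.dense(features))
   }
 }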

“toDouble” is used because a Spark labeled point expects the label and all feature values to be doubles.
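The label lookup above reads only the first column of each Y row. Some copies of the MNIST mat file store each label as a one-hot row of ten columns instead of a single digit. If that is the case for your file (an assumption worth checking by printing the length of one row), the label would be the index of the nonzero column. A rough sketch, using the hypothetical helper name `oneHotToLabel`:

```scala
// Hypothetical helper: turns a one-hot row such as 0 0 1 0 0 0 0 0 0 0 into 2.0.
// Assumes exactly one nonzero entry per row and fails fast otherwise.
def oneHotToLabel(row: Array[Byte]): Double = {
  val hot = row.indexWhere(_ != 0)
  require(hot >= 0 && row.count(_ != 0) == 1, "expected exactly one nonzero entry")
  hot.toDouble
}

val exampleRow: Array[Byte] = Array[Byte](0, 0, 1, 0, 0, 0, 0, 0, 0, 0)
val exampleLabel: Double = oneHotToLabel(exampleRow)
```

If your labels are already single digits, the first-column lookup is enough and this helper is not needed.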

  • Save the RDD for further processing
trainRDD.saveAsObjectFile("mnsit")