In an earlier series of posts, we discussed the new data source API that was introduced in Spark 2.3. In 2.3, the API was in beta. In 2.4, the API has been updated and marked as stable. Some of the updates are breaking changes, so if you are porting your data sources to 2.4, you need to update your code to match the latest API.

In this post, we discuss the changes that were introduced in the API. All the code from the earlier series has also been ported to the new version, so you can see how to update your own code.

Reader API

In this section, we look at the changes that were introduced in the Reader API.

DataSourceReader Interface

In the DataSourceReader interface, the function

def createDataReaderFactories: List[DataReaderFactory[Row]]

is replaced by

def planInputPartitions: List[InputPartition[InternalRow]]

As you can see, the factory is replaced by a partition, and the element type changes from Row to InternalRow.

DataReaderFactory to InputPartition

The DataReaderFactory interface is renamed to InputPartition.

Also, the method

def createDataReader: DataReader[Row]

is replaced by

def createPartitionReader: InputPartitionReader[InternalRow]

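DataReader to InputPartitionReader

The DataReader interface is renamed to InputPartitionReader. Its own methods (next, get and close) keep their names.

What changes is the method on InputPartition that creates it, as covered in the previous section:

def createDataReader: DataReader[Row]

is replaced by

def createPartitionReader: InputPartitionReader[InternalRow]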
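
To make the three renames concrete, here is a minimal sketch of a reader written against the 2.4 interfaces. The class names SimpleReader, SimplePartition and SimplePartitionReader and the hard-coded values are hypothetical and exist only for illustration; they are not the code from the earlier series. The sketch assumes the Spark 2.4 SQL jars are on the classpath.

```scala
import java.util.{List => JList}

import scala.collection.JavaConverters._

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, InputPartitionReader}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical reader that serves a fixed set of strings in two partitions.
class SimpleReader extends DataSourceReader {
  override def readSchema(): StructType =
    StructType(Seq(StructField("value", StringType)))

  // In 2.3 this method was createDataReaderFactories and returned factories of Row.
  override def planInputPartitions(): JList[InputPartition[InternalRow]] = {
    val partitions: Seq[InputPartition[InternalRow]] =
      Seq(new SimplePartition(Array("a", "b")), new SimplePartition(Array("c")))
    partitions.asJava
  }
}

// In 2.3 this class would have extended DataReaderFactory[Row].
class SimplePartition(values: Array[String]) extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] =
    new SimplePartitionReader(values)
}

// In 2.3 this class would have extended DataReader[Row].
class SimplePartitionReader(values: Array[String]) extends InputPartitionReader[InternalRow] {
  private var index = -1

  // Advance to the next value; returns false once the values are exhausted.
  override def next(): Boolean = {
    index += 1
    index < values.length
  }

  // Strings are stored as UTF8String inside an InternalRow.
  override def get(): InternalRow = InternalRow(UTF8String.fromString(values(index)))

  override def close(): Unit = ()
}
```

The partition is what Spark serializes and ships to executors, so it only carries the data needed to build the reader; the reader itself is created on the executor.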

Row to InternalRow

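In the earlier API, the interfaces used Row. All of them now use InternalRow, which is the row representation Spark SQL uses internally. This allows data sources to load data more efficiently, since the rows they produce are already in Spark's internal format.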
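
One practical consequence when porting is that values have to be supplied in Spark's internal types; for example, strings are held as UTF8String rather than java.lang.String. The following is a small sketch of building and reading an InternalRow. The object name InternalRowExample and the helper toInternalRow are hypothetical names used only for this illustration.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.unsafe.types.UTF8String

object InternalRowExample {
  // Build a (name, age) record in Spark's internal format.
  // The string must be wrapped in UTF8String; the Int is stored as is.
  def toInternalRow(name: String, age: Int): InternalRow =
    InternalRow(UTF8String.fromString(name), age)

  def main(args: Array[String]): Unit = {
    val row = toInternalRow("alice", 30)
    // Fields are read back by ordinal with typed getters.
    println(row.getUTF8String(0).toString)
    println(row.getInt(1))
  }
}
```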

Writer API

Compared to the Reader API, the Writer API has undergone only minimal changes. In this section, we discuss those changes.

DataWriterFactory[Row] to DataWriterFactory[InternalRow]

As discussed in the earlier section, Row is changed to InternalRow, so the records passed to a writer are now InternalRow as well.

DataWriterFactory Interface

In the DataWriterFactory interface, the method

def createDataWriter(partitionId: Int, attemptNumber: Int)

is changed to

def createDataWriter(partitionId: Int, taskId: Long, epochId: Long)

taskId replaces attemptNumber, and epochId is added to support Spark Structured Streaming.
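
As an illustration of the new signature, here is a minimal sketch of a writer factory that only counts the rows it receives. The names CountingWriterFactory, CountingWriter and RowCountMessage are hypothetical and are not part of Spark or of the earlier series; the sketch assumes the Spark 2.4 SQL jars are on the classpath.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.writer.{DataWriter, DataWriterFactory, WriterCommitMessage}

// Hypothetical commit message reporting how many rows one task wrote.
case class RowCountMessage(partitionId: Int, taskId: Long, epochId: Long, rows: Long)
  extends WriterCommitMessage

class CountingWriterFactory extends DataWriterFactory[InternalRow] {
  // In 2.3 the parameters were (partitionId: Int, attemptNumber: Int).
  override def createDataWriter(
      partitionId: Int,
      taskId: Long,
      epochId: Long): DataWriter[InternalRow] =
    new CountingWriter(partitionId, taskId, epochId)
}

class CountingWriter(partitionId: Int, taskId: Long, epochId: Long)
  extends DataWriter[InternalRow] {

  private var rows = 0L

  // A real writer would persist the record; this sketch only counts it.
  override def write(record: InternalRow): Unit = {
    rows += 1
  }

  // Called once all records of the task have been written successfully.
  override def commit(): WriterCommitMessage =
    RowCountMessage(partitionId, taskId, epochId, rows)

  // Called when the task fails; discard whatever was accumulated.
  override def abort(): Unit = {
    rows = 0L
  }
}
```

Carrying taskId in the commit message lets the driver side tell apart multiple attempts at the same partition.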

Code

You can access the ported version of the earlier code on GitHub.

Conclusion

Spark 2.4 has updated the recently introduced data source API. The API is now marked as stable and is going to serve Spark developers for a long time.