Introduction to Spark 3.0 - Part 1 : Multi Character Delimiter in CSV Source
Spark 3.0 is the next major release of Apache Spark. This release brings major changes to the abstractions, APIs and libraries of the platform, and it sets the tone for the framework's direction in the coming year. So understanding these features is critical for anyone who wants to make use of all the advances in this new release. In this series of blog posts, I will be discussing the different improvements landing in Spark 3.0.
This is the first post in the series, where I am going to talk about the improvements in the built-in CSV source. You can access all the posts in this series here.
TL;DR All code examples are available on GitHub.
CSV Source
CSV is one of the most used data sources in Apache Spark. So from Spark 2.0, it has been a built-in source.
Spark 3.0 brings an important improvement to this source by allowing the user to specify a multi-character delimiter.
Delimiter Support in Spark 2.x
Until Spark 3.0, Spark allowed only a single character as the delimiter in CSV.
Let's try to load the below CSV, which has || as its delimiter.
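A minimal example of such a file (the column names and values here are made up for illustration):

```
id||name||city
1||John||Bangalore
2||Maria||Chennai
```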
We can try to read it using the below code.
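The following is a sketch of the read; the file path is an assumption, and `sep` is the standard CSV delimiter option:

```scala
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("multi-character-delimiter")
  .getOrCreate()

// try to read the file using "||" as the separator
val df = sparkSession.read
  .option("header", "true")
  .option("sep", "||")
  .csv("src/main/resources/multi_character_delimiter.csv")

df.show()
```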
When you run the above code in Spark 2.x, you will get the below exception.
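In Spark 2.x the read fails with an IllegalArgumentException along these lines (the exact message may vary slightly by version):

```
java.lang.IllegalArgumentException: Delimiter cannot be more than one character: ||
```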
As you can see from the exception, Spark supports only a single character as the delimiter. This forced users to preprocess such CSV files outside of Spark, which is highly inconvenient.
Multiple Character Delimiter Support in Spark 3.0
Spark 3.0 has added an improvement to support multi-character delimiters in the CSV source. So when you run the same code in 3.0, you will get the below output.
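With the example file and code above, the show() call now prints something like this:

```
+--+-----+---------+
|id| name|     city|
+--+-----+---------+
| 1| John|Bangalore|
| 2|Maria|  Chennai|
+--+-----+---------+
```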
Even though it looks like a small improvement, it removes the need to do this kind of processing outside of Spark, which is a huge improvement for larger datasets.
Code
You can access the complete code on GitHub.