Apache Spark supports programming in multiple languages like Scala, Java, Python and R. This multi-language support has made Spark widely accessible to a variety of users and use cases.
Not all the languages supported by Spark have equal API support. Scala and Java support the complete user-facing and library-development APIs. Python and R are mostly restricted to the user-facing APIs. This discrepancy exists because adding support for a new API in a language is a lot of work, so only the essential APIs are ported to all languages.
What if we want to add Spark support for a new language? In the traditional approach that would be a lot of work. But with GraalVM we can get access to the complete set of Spark libraries from a completely new language with minimal effort.
GraalVM is a polyglot VM which allows users to run multiple languages on the same VM. Not only does it support multiple languages, it also lets users bring libraries from different languages onto a single platform. You can read more about GraalVM here.
Let’s see how we go about it.
Setup for Running Node.js on GraalVM
In this section of the post we discuss how to set up Node.js on GraalVM.
Download GraalVM Binaries
To run Node.js programs on GraalVM, we need to download the GraalVM binaries. You can download the appropriate one from the link below.
Start Node.js Interpreter
Once you have downloaded GraalVM, you can start the Node.js interpreter using the command below.
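Assuming GraalVM's `bin` directory is on your `PATH` (otherwise use the full path to the binary), the interpreter can be started like this:

```shell
# --jvm runs Node.js on the JVM, which enables the polyglot features
node --jvm
```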
The `--jvm` option says that we want to run in JVM mode. If we don't specify the mode, it runs in native mode, which is more optimised but doesn't have the polyglot features.
Once you run the above command, you should see the Node.js REPL prompt.
Run Sample Node Code
Once you have the Node interpreter, you can run a hello-world program to verify that you are really in a Node.js environment.
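Any small script will do; for example:

```javascript
// Classic hello world to confirm the Node.js environment works
const message = 'Hello World from Node.js on GraalVM';
console.log(message);
```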
If the greeting prints, we are running Node.js on the JVM.
Setting Up Spark for Node.js environment
Once the Node.js environment is set up, we need to set up the Spark environment as well. This section of the document walks through the steps.
Download Spark Binary
We need to download the Spark binary from the link below and set its path as SPARK_HOME.
You can check whether SPARK_HOME is set using the command below.
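The two steps above can be sketched as follows; the install path `/opt/spark` is an assumption, so point it at wherever you extracted the Spark binary:

```shell
# /opt/spark is an assumed install location; adjust for your system
export SPARK_HOME=/opt/spark

# Prints the path if SPARK_HOME is set; empty output means it is not
echo $SPARK_HOME
```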
Adding all the Spark JARS to the classpath
To access Spark from Node.js, we need to add all of its jars to the JVM classpath. Currently GraalVM doesn't allow us to add a directory to the classpath, so we will use the shell script below to generate a string containing all the jars in the Spark binary.
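One way to build that string is to glob the jars directory and join the paths with `:` (a sketch; it assumes the jar paths contain no spaces):

```shell
# Join every jar under $SPARK_HOME/jars with ':' and export it as CLASSPATH
export CLASSPATH=$(echo "$SPARK_HOME"/jars/*.jar | tr ' ' ':')
```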
The above command generates a string of all the jar paths and stores it in the CLASSPATH environment variable.
Passing Classpath to Node.js
Once the CLASSPATH variable is ready, we can pass the classpath to GraalVM as below.
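The classpath is passed with GraalVM's `--vm.cp` option (on older GraalVM releases the flag was spelled `--jvm.classpath`, so check the version you downloaded):

```shell
# Start Node.js in JVM mode with all the Spark jars on the classpath
node --jvm --vm.cp=$CLASSPATH
```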
Now we have the environment ready for Spark.
Loading SparkSession Class
The first step of any Spark program is to create a Spark session.
Once the SparkSession class is imported, we can create the session using the code below.
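A minimal sketch of both steps: GraalVM's `Java.type` looks up the Java class from the jars on the classpath, and the builder creates the session (the app name and local master here are placeholders):

```javascript
// Look up the Spark Java class from the classpath
const SparkSession = Java.type('org.apache.spark.sql.SparkSession');

// Build a Spark session running in local mode
const spark = SparkSession.builder()
  .appName('spark-node')
  .master('local[*]')
  .getOrCreate();
```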
Once we have created the Spark session, we can use it to load the data. Replace the path with a CSV file from your system.
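Loading the CSV and printing a sample might look like this (a sketch; the path is a placeholder and `spark` is the session created above):

```javascript
// Load a CSV file as a DataFrame; replace the path with a CSV on your system
const df = spark.read()
  .format('csv')
  .option('header', 'true')
  .load('/path/to/data.csv');

// Print a sample of the loaded data
df.show();
```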
Once the data is loaded, the show method prints a sample of it.
Running the Example
Save the above code in a file named server.js, then run the command below.
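As before, the script is launched in JVM mode with the Spark jars on the classpath:

```shell
# Run the script on GraalVM's Node.js with Spark available
node --jvm --vm.cp=$CLASSPATH server.js
```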
Now you can see Spark running inside Node.js and printing a sample of your CSV.
Serving the Schema Over a Node.js HTTP Server
So far we have written only Spark code. Let's mix it with Node.js code; this shows the real power of the integration. The code below prints the schema of the dataframe when the user makes a GET request to the Node.js server.
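A sketch of that handler, assuming the `df` dataframe from the earlier step is in scope; Spark's `schema().treeString()` renders the schema as a readable tree:

```javascript
const http = require('http');

// Respond to every request with the dataframe's schema
http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end(df.schema().treeString());
}).listen(8000);
```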
Adding the above code to server.js and running it again starts a web server on port 8000. When you access http://127.0.0.1:8000/ you will see the schema of your dataset.
This shows how we can mix Node code with Spark on the same VM.
You can access the complete code on GitHub.