Apache Spark is the de facto framework of the big data world: virtually every serious organization that deals with big data uses it. It has its caveats, though. It is hard to use, and getting started can be confusing even for those with a solid technical background. That is true even for simple use cases like basic data transformation. Hence, I decided to write this article to go through one simple, yet very common scenario: transforming a MySQL table with millions of records to JSON with only 10 lines of Apache Spark code.
A year ago I wrote an article about High performance data streaming with JPA. The post gained some popularity, and since its publication thousands of people have read it every month. Recently I had some extra time to revisit it. One of the suggestions I made there was to use Apache Spark to deal with big data. But I didn't explain how.
So here we are, more than a year later, and I finally have the time. I walk you through the process of achieving high-performance big data transformation with Apache Spark, and, surprisingly, in 10 lines of code.
Let's assume we have a flat MySQL table with millions of records, and we want to transform its content to JSON in a reasonable timeframe. To make things clearer, I use the Library app analogy: we have a flat books table from a library app with the following data structure,
CREATE TABLE IF NOT EXISTS books (
    id INTEGER AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(512) NOT NULL,
    description VARCHAR(2048) NOT NULL,
    author VARCHAR(256) NOT NULL,
    year INTEGER NOT NULL,
    edition INTEGER NOT NULL,
    publisher VARCHAR(512) NOT NULL,
    inserted DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB;
We want to convert this to JSON with the following structure,
{
    "id": 10,
    "title": "2a151700-e491-4548-af1f-100ecf7f4d80",
    "description": "56ee059d-f213-4410-b421-dc6202451ddf",
    "author": "21cbd76e-2dd5-4659-aff2-073a065acc1b",
    "year": 2014,
    "edition": 5,
    "publisher": "ebdf07ee-a7b4-4bea-984c-ae9e09b4815b"
}
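To make the target mapping concrete, here is a plain-Java sketch of what the per-row transformation amounts to. The BookJsonSketch class and its toJson helper are my own illustration, not part of the pipeline: Spark performs this serialization itself, and a real implementation would use a proper JSON library to handle escaping.

```java
public class BookJsonSketch {

    // Illustrative only: builds one JSON object matching the target
    // structure above. In the real pipeline Spark serializes each Row.
    static String toJson(long id, String title, String description,
                         String author, int year, int edition, String publisher) {
        return String.format(
            "{\"id\":%d,\"title\":\"%s\",\"description\":\"%s\","
          + "\"author\":\"%s\",\"year\":%d,\"edition\":%d,\"publisher\":\"%s\"}",
            id, title, description, author, year, edition, publisher);
    }

    public static void main(String[] args) {
        System.out.println(toJson(10, "t", "d", "a", 2014, 5, "p"));
        // {"id":10,"title":"t","description":"d","author":"a","year":2014,"edition":5,"publisher":"p"}
    }
}
```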
The workflow looks like this: Spark reads the books table from MySQL over JDBC in parallel partitions, then writes each partition out as JSON files.
Adding Spark and MySQL dependencies
The first thing to do is add the Spark and MySQL dependencies. For a Maven project, that would be,
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>2.4.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>2.4.5</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.19</version>
    </dependency>
</dependencies>
Configuring the Spark cluster
Now we need to configure the Spark cluster. For simplicity's sake, we use a local master, so the configuration would be,
SparkSession spark = SparkSession.builder()
        .config("spark.master", "local")
        .appName("Java Spark SQL basic example")
        .getOrCreate();
Implementing data transformation
This is where we write the 10 lines of code. First, we have to craft the query. Since we need all the columns except inserted, the query looks like this:
SELECT id, title, description, author, year, edition, publisher FROM books ORDER BY id
Keep in mind that the ORDER BY clause is necessary for partitioning and for ensuring the data is retrieved correctly. Otherwise, MySQL returns the rows in no particular order, which messes up the entire transformation process.
Then we need to configure which data source Spark needs to read the data from:
private void transformToJson() {
    SparkSession spark = SparkSession.builder()
            .config("spark.master", "local")
            .appName("Java Spark SQL basic example")
            .getOrCreate();
    String query = "SELECT id, title, description, author, year, edition, publisher FROM books ORDER BY id";
    Dataset<Row> jdbcDF = spark.read().format("jdbc")
            .option("url", propertyLoader.getJdbcConnectionString())
            .option("user", propertyLoader.getDatabaseUserName())
            .option("password", propertyLoader.getDatabasePassword())
            .option("numPartitions", "10")
            .option("partitionColumn", "id")
            .option("lowerBound", "0")
            .option("upperBound", Long.MAX_VALUE)
            .option("driver", "com.mysql.cj.jdbc.Driver") // driver class for Connector/J 8
            .option("dbtable", String.format("(%s) AS tmp", query))
            .load();
    jdbcDF.write().format("json").mode("append").save("books_json");
    spark.close();
}
There are a couple of options here that I need to explain:
format: the data source type, here jdbc
driver: the database driver class, here MySQL's JDBC driver
url: the JDBC connection URL
user: the MySQL database username
password: the MySQL database password
numPartitions: in how many partitions the data should be retrieved, here 10
partitionColumn: which column to use to partition the data; its type can be either number or datetime
lowerBound: the lowest value of the partitionColumn; for datetime pass a string like 2010-01-01 00:00:00
upperBound: the highest value of the partitionColumn
dbtable: the JDBC table to read the data from; since we want to read the full table with a custom query, we wrap it as a subquery
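Under the hood, Spark uses numPartitions, lowerBound, and upperBound to build one WHERE predicate per partition, each covering a stride of the id range, so the ten partitions can be fetched in parallel. The splitting can be sketched roughly like this (my own simplified illustration, not Spark's actual code):

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionSketch {

    // Simplified version of how Spark derives one JDBC predicate per
    // partition from lowerBound, upperBound and numPartitions.
    static List<String> predicates(String col, long lower, long upper, int numPartitions) {
        long stride = upper / numPartitions - lower / numPartitions;
        List<String> where = new ArrayList<>();
        long current = lower;
        for (int i = 0; i < numPartitions; i++) {
            String lo = i == 0 ? null : col + " >= " + current;
            current += stride;
            String hi = i == numPartitions - 1 ? null : col + " < " + current;
            if (lo == null) where.add(hi);       // first partition: open at the lower end
            else if (hi == null) where.add(lo);  // last partition: open at the upper end
            else where.add(lo + " AND " + hi);
        }
        return where;
    }

    public static void main(String[] args) {
        // With lowerBound=0, upperBound=100 and 4 partitions, the stride is 25:
        predicates("id", 0, 100, 4).forEach(System.out::println);
        // id < 25
        // id >= 25 AND id < 50
        // id >= 50 AND id < 75
        // id >= 75
    }
}
```

This is also why the bounds do not filter rows: they only decide how the range is split, and the first and last partitions are open-ended so no ids fall through.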
Then the final line of the method specifies the output format (json), the save mode (append, in case data already exists), and the path to save the data to.
That's all. As you can see, we managed to transform a MySQL table to JSON with 10 lines of Spark code. It's super easy for simple transformations.
You can find the fully working example on GitHub at the link below,
https://github.com/kasramp/sample-spark-java
The example first loads 3 million records into the database and then starts the transformation process. Of course, the complete project is more than 10 lines of code 😉
Read more: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html