Apache Spark is the de facto framework of the big data world: virtually every serious organization that deals with big data uses it. It has its caveats, though. It is hard to use, and getting started can be confusing even for those with a solid technical background. That is true even for simple use cases like basic data transformation. Hence, I decided to write this article to go through one simple, yet very common scenario: transforming a MySQL table with millions of records to JSON with only 10 lines of Apache Spark code.
A year ago I wrote an article about High performance data streaming with JPA. The post gained some popularity, and since its publication thousands of people have read it every month. Recently I had some extra time to revisit it. One of the suggestions I made there was to use Apache Spark to deal with big data. But I didn't explain how.
So here we are, more than a year later, and I finally have the time. I walk you through the process of achieving high-performance big data transformation with Apache Spark, and, surprisingly, in 10 lines of code.
Let's assume we have a flat MySQL table with millions of records, and we want to transform its content to JSON in a reasonable timeframe. To make things clearer, I use the Library app analogy: we have a flat books table from a library app with the following data structure,
CREATE TABLE IF NOT EXISTS books (
    id INTEGER AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(512) NOT NULL,
    description VARCHAR(2048) NOT NULL,
    author VARCHAR(256) NOT NULL,
    year INTEGER NOT NULL,
    edition INTEGER NOT NULL,
    publisher VARCHAR(512) NOT NULL,
    inserted DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB;
We want to convert this to JSON with the following structure,
{
    "id": 10,
    "title": "2a151700-e491-4548-af1f-100ecf7f4d80",
    "description": "56ee059d-f213-4410-b421-dc6202451ddf",
    "author": "21cbd76e-2dd5-4659-aff2-073a065acc1b",
    "year": 2014,
    "edition": 5,
    "publisher": "ebdf07ee-a7b4-4bea-984c-ae9e09b4815b"
}
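To make the target mapping concrete, here is a plain-Java sketch of what the per-row transformation amounts to. The BookJsonSketch class and its toJson helper are my own illustration, not part of the pipeline: Spark performs this serialization itself, and a real implementation would use a proper JSON library to handle escaping.

```java
public class BookJsonSketch {

    // Illustrative only: builds one JSON object matching the target
    // structure above. In the real pipeline Spark serializes each Row.
    static String toJson(long id, String title, String description,
                         String author, int year, int edition, String publisher) {
        return String.format(
            "{\"id\":%d,\"title\":\"%s\",\"description\":\"%s\","
          + "\"author\":\"%s\",\"year\":%d,\"edition\":%d,\"publisher\":\"%s\"}",
            id, title, description, author, year, edition, publisher);
    }

    public static void main(String[] args) {
        System.out.println(toJson(10, "t", "d", "a", 2014, 5, "p"));
        // {"id":10,"title":"t","description":"d","author":"a","year":2014,"edition":5,"publisher":"p"}
    }
}
```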
The workflow looks like this: Spark reads the books table from MySQL over JDBC in parallel partitions, then writes each partition out as JSON files.
Adding Spark and MySQL dependencies
The first thing to do is add the Spark and MySQL dependencies. For a Maven project, that would be,
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>2.4.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>2.4.5</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.19</version>
    </dependency>
</dependencies>
Configuring the Spark cluster
Now we need to configure the Spark cluster. For simplicity's sake, we use a local master, so the configuration would be,
SparkSession spark = SparkSession.builder()
        .config("spark.master", "local")
        .appName("Java Spark SQL basic example")
        .getOrCreate();
Implementing data transformation
This is where we write the 10 lines of code. First, we have to craft the query. Since we need all the columns except inserted, the query looks like this:
SELECT id, title, description, author, year, edition, publisher FROM books ORDER BY id
Keep in mind that the ORDER BY clause is necessary for partitioning and for ensuring the data is retrieved correctly. Otherwise, MySQL returns the rows in no particular order, which messes up the entire transformation process.
Then we need to configure which data source Spark needs to read the data from:
private void transformToJson() {
    SparkSession spark = SparkSession.builder()
            .config("spark.master", "local")
            .appName("Java Spark SQL basic example")
            .getOrCreate();
    String query = "SELECT id, title, description, author, year, edition, publisher FROM books ORDER BY id";
    Dataset<Row> jdbcDF = spark.read().format("jdbc")
            .option("url", propertyLoader.getJdbcConnectionString())
            .option("user", propertyLoader.getDatabaseUserName())
            .option("password", propertyLoader.getDatabasePassword())
            .option("numPartitions", "10")
            .option("partitionColumn", "id")
            .option("lowerBound", "0")
            .option("upperBound", Long.MAX_VALUE)
            .option("driver", "com.mysql.cj.jdbc.Driver") // driver class for Connector/J 8
            .option("dbtable", String.format("(%s) AS tmp", query))
            .load();
    jdbcDF.write().format("json").mode("append").save("books_json");
    spark.close();
}
There are a couple of options here that I need to explain:
format: the data source type, here jdbc
driver: the database driver class, here MySQL's JDBC driver
url: the JDBC connection URL
user: the MySQL database username
password: the MySQL database password
numPartitions: in how many partitions the data should be retrieved, here 10
partitionColumn: which column to use to partition the data; its type can be either number or datetime
lowerBound: the lowest value of the partitionColumn; for datetime pass a string like 2010-01-01 00:00:00
upperBound: the highest value of the partitionColumn
dbtable: the JDBC table to read the data from; since we want to read the full table with a custom query, we wrap it as a subquery
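Under the hood, Spark uses numPartitions, lowerBound, and upperBound to build one WHERE predicate per partition, each covering a stride of the id range, so the ten partitions can be fetched in parallel. The splitting can be sketched roughly like this (my own simplified illustration, not Spark's actual code):

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionSketch {

    // Simplified version of how Spark derives one JDBC predicate per
    // partition from lowerBound, upperBound and numPartitions.
    static List<String> predicates(String col, long lower, long upper, int numPartitions) {
        long stride = upper / numPartitions - lower / numPartitions;
        List<String> where = new ArrayList<>();
        long current = lower;
        for (int i = 0; i < numPartitions; i++) {
            String lo = i == 0 ? null : col + " >= " + current;
            current += stride;
            String hi = i == numPartitions - 1 ? null : col + " < " + current;
            if (lo == null) where.add(hi);       // first partition: open at the lower end
            else if (hi == null) where.add(lo);  // last partition: open at the upper end
            else where.add(lo + " AND " + hi);
        }
        return where;
    }

    public static void main(String[] args) {
        // With lowerBound=0, upperBound=100 and 4 partitions, the stride is 25:
        predicates("id", 0, 100, 4).forEach(System.out::println);
        // id < 25
        // id >= 25 AND id < 50
        // id >= 50 AND id < 75
        // id >= 75
    }
}
```

This is also why the bounds do not filter rows: they only decide how the range is split, and the first and last partitions are open-ended so no ids fall through.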
Then the final line of the method specifies the output format (json), the save mode (append, in case data already exists), and the path to save the data to.
That's all. As you can see, we managed to transform a MySQL table to JSON with 10 lines of Spark code. It's super easy for simple transformations.
You can find the fully working example on GitHub at the link below,
https://github.com/kasramp/sample-spark-java
The example first loads 3 million records into the database and then starts the transformation process. Of course, the complete project is more than 10 lines of code 😉
Read more: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html