Java 8 brought a number of functional goodies to the platform. One of the most useful features is the Stream API.
Like many other frameworks, Spring started to utilize Java Streams and soon adopted them. The most notable streaming feature in Spring, in my opinion, is Data JPA Stream. It brings a significant performance boost to data fetching, particularly in databases with millions of records.
In this post, I discuss how to use Spring Boot for high-performance data fetching using Spring Data JPA Stream and pull millions of records from MySQL in a short period of time. But before that, I explain a little about the traditional approaches and their shortcomings.
Traditional large data fetching approaches
#1 Pull all data once in memory
In most small to medium-sized databases, this is the de facto approach: just pull all the necessary data into memory and lazy-load the relations. For instance, in a small library system we can pull all the books into memory, use a simple caching mechanism like memoization, and lazy-load each author object.
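The memoization idea can be sketched as a plain in-memory cache. This is a minimal illustration, not part of the library system itself; the `AuthorCache` class and the `loader` function standing in for a database lookup are assumptions:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical memoizing cache: each author is loaded lazily and at most once.
class AuthorCache {
    private final Map<Long, String> cache = new ConcurrentHashMap<>();
    private final Function<Long, String> loader; // stands in for a DB lookup

    AuthorCache(Function<Long, String> loader) {
        this.loader = loader;
    }

    String authorFor(long authorId) {
        // computeIfAbsent invokes the loader only on the first request per id
        return cache.computeIfAbsent(authorId, loader::apply);
    }
}
```

Repeated lookups for the same author id hit the map instead of the database.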
When it comes to large databases with millions of records, bringing everything into memory results in an out-of-memory error. In such a case, using a distributed caching system such as Redis or Hazelcast seems reasonable. But what if we just want to read all this data from source A (MySQL, for instance), apply a simple transformation, and write it to source B (Elasticsearch)? In that case caching is not a good solution, since we aim to read each record only once.
#2 Results pagination
A remedy for the out-of-memory error in the previous approach is pagination. We can rely on the limit/offset combination to read rows page by page. This is a very reasonable approach, and Spring Data supports it out of the box. All we need is to change the intended repository method to accept a Pageable parameter.
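A minimal sketch of such a repository; the `Book` entity and the `BookRepository` name are assumptions carried over from the library example:

```java
import org.springframework.data.domain.Page;
import org.springframework.data.domain.Pageable;
import org.springframework.data.repository.PagingAndSortingRepository;

// Spring Data derives the paginated query from the Pageable parameter;
// no implementation is needed.
public interface BookRepository extends PagingAndSortingRepository<Book, Long> {
    Page<Book> findAll(Pageable pageable);
}
```

Callers then iterate with `PageRequest.of(pageNumber, pageSize)` until the returned page reports no next page.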
The return type of our method is
Page<Book>, which carries the results as well as the total number of pages. However, keep in mind that knowing the total number of pages comes at a price: for each page of results, Hibernate executes an additional count query to calculate the total, which slows down the process significantly. So to speed things up we can simply change the return type to Slice<Book>, which skips the count query.
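Assuming the same hypothetical `Book` entity, the change is just the return type:

```java
import org.springframework.data.domain.Pageable;
import org.springframework.data.domain.Slice;
import org.springframework.data.repository.Repository;

public interface BookRepository extends Repository<Book, Long> {
    // A Slice only knows whether a next page exists, so Hibernate issues
    // no extra count query per page.
    Slice<Book> findAll(Pageable pageable);
}
```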
But things do not end here. Pagination has a huge disadvantage in large databases: the further we page, the slower processing gets, because the database takes longer to return results for queries with a larger offset (it still has to scan past all the skipped rows). So if time is a big constraint, this approach will not work well.
#3 Apache Spark
A reliable big data solution such as Apache Spark is a very suitable candidate for streaming millions of rows easily. It's resilient and highly scalable. The only disadvantage is the effort of setting up a Spark cluster and knowing exactly what to do with it. I personally highly recommend this approach and believe that the initial time invested in learning it will pay off well in the long run as the ecosystem grows.
Streaming data using Spring Data JPA
However, if you don’t want to invest in Spark, or have other constraints or reservations, there is one last option: Spring Data JPA Stream, which I discuss in the following section.
In order to make Spring Data JPA Stream work, we need to modify our repository to return
Stream<Book> like this:
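A minimal sketch; the `Book` entity and the `getAll` method name are illustrative:

```java
import java.util.stream.Stream;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.Repository;

public interface BookRepository extends Repository<Book, Long> {
    // Returning a Stream lets Hibernate scroll over the result set
    // instead of materializing every row at once.
    @Query("select b from Book b")
    Stream<Book> getAll();
}
```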
Pay attention: if we run the code like this, it will not work and will throw an exception at runtime. To fix it, we need to add
@QueryHints on top, like below:
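A sketch of the hinted repository method. The `Book` entity and `getAll` name are illustrative; the hint names are Hibernate's, and the `Integer.MIN_VALUE` fetch size is the documented trick to make MySQL's JDBC driver stream rows one by one instead of buffering the whole result set:

```java
import java.util.stream.Stream;
import javax.persistence.QueryHint;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.jpa.repository.QueryHints;
import org.springframework.data.repository.Repository;

public interface BookRepository extends Repository<Book, Long> {
    @QueryHints(value = {
        // MySQL Connector/J streams row by row only with this special fetch size
        @QueryHint(name = "org.hibernate.fetchSize", value = "" + Integer.MIN_VALUE),
        // skip the second-level cache; each entity is read exactly once
        @QueryHint(name = "org.hibernate.cacheable", value = "false"),
        // tell Hibernate not to track changes; the entities are read-only
        @QueryHint(name = "org.hibernate.readOnly", value = "true")
    })
    @Query("select b from Book b")
    Stream<Book> getAll();
}
```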
If you pay attention, you will see that we disable second-level caching and hint to Hibernate that the entities will be read-only. In this example, we assume we need to read each row only once, process it, and save it somewhere else. If your requirements are different, make sure to change those settings accordingly.
Now we can start reading the database records like this:
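A first attempt might look like the sketch below. `BookService`, `processAllBooks`, and `process` are illustrative names; the stream is consumed inside a try-with-resources so the underlying result set is closed:

```java
import java.util.stream.Stream;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class BookService {

    private final BookRepository bookRepository;

    public BookService(BookRepository bookRepository) {
        this.bookRepository = bookRepository;
    }

    @Transactional(readOnly = true)
    public void processAllBooks() {
        try (Stream<Book> books = bookRepository.getAll()) {
            books.forEach(this::process);
        }
    }

    private void process(Book book) {
        // transformation and write to the target data source go here
    }
}
```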
The @Transactional annotation is necessary to make streaming work; otherwise, Spring throws an exception complaining that the stream must be consumed inside a surrounding transaction that keeps the connection open. It is also a good idea to add
readOnly = true to the annotation, since we are not aiming to modify the entities.
But that’s not all of it. If we run the code for 30 minutes or so, depending on the heap size, we get an out-of-memory error. That is because Hibernate keeps every entity and its references in the persistence context. To fix that issue, after processing each row we need to detach the entity so the garbage collector can reclaim the memory.
The final version of the code will look like this:
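A sketch of the final version, with the EntityManager injected so each processed entity can be detached (the `BookService`, `processAllBooks`, and `process` names are illustrative):

```java
import java.util.stream.Stream;
import javax.persistence.EntityManager;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class BookService {

    private final BookRepository bookRepository;
    private final EntityManager entityManager;

    public BookService(BookRepository bookRepository, EntityManager entityManager) {
        this.bookRepository = bookRepository;
        this.entityManager = entityManager;
    }

    @Transactional(readOnly = true)
    public void processAllBooks() {
        try (Stream<Book> books = bookRepository.getAll()) {
            books.forEach(book -> {
                process(book);
                // evict the entity from the persistence context so the
                // garbage collector can reclaim its memory
                entityManager.detach(book);
            });
        }
    }

    private void process(Book book) {
        // transformation and write to the target data source go here
    }
}
```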
In this post, we discussed high-performance data fetching using the Spring Data JPA Stream API. If you are interested in how to do batch insertion using Spring Data, have a look at this post.