How to increase insert performance of Amazon AWS Redshift using JDBC

Redshift is one of the Amazon AWS services suited to big data, BI, and analytical workloads. It can take some getting used to, though, because of its limitations compared to other databases such as PostgreSQL. Nonetheless, Redshift provides tremendous benefits for large datasets. In this post, we demonstrate how to increase the insert performance of Amazon AWS Redshift using JDBC in Java.

Redshift copy command

As stated in the Redshift documentation, the best way to insert large datasets into Redshift is the COPY command: dump the data to a file, upload it to an S3 bucket, and then run COPY to load it into Redshift.

We benchmarked the COPY command with a dataset of 12K records; the insertion took less than ten seconds. If you can export your data to a file and upload it to S3, we highly recommend this approach. It is the fastest and the most efficient.
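For illustration, COPY is itself a plain SQL statement that can be issued over the same JDBC connection used throughout this post. The following is only a sketch: it assumes the data has already been uploaded to S3, and the bucket path and IAM role ARN are placeholders rather than values from our setup.

// Sketch only: loads a CSV file that has already been uploaded to S3.
// The bucket path and IAM role ARN below are placeholders.
public void copyStudentsFromS3() {
    String copy = "COPY student FROM 's3://my-bucket/students.csv' "
            + "IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role' "
            + "FORMAT AS CSV";
    try (Connection connection = config.getConnection();
         Statement statement = connection.createStatement()) {
        statement.execute(copy);
    } catch (SQLException e) {
        throw new RuntimeException(e);
    }
}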

Fast insertion with JDBC connection

On the other hand, we cannot always rely on the COPY command, especially when we want to insert live data into Redshift. In that case, we have to establish a connection to the Redshift instance from the application and insert data in the usual fashion, as if the Redshift instance were an ordinary RDBMS.

For simplicity’s sake, we selected the Java JDBC driver to avoid driver configuration complications. Additionally, we want to demonstrate how applying a few programming best practices can significantly boost performance, even with simple tools.
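The examples below obtain connections through a small config helper that we do not show in full. As a rough sketch, and assuming the Redshift JDBC driver is on the classpath, such a helper could be as simple as the following (the class name, endpoint, database name, and credentials are placeholders, not values from our setup):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Sketch only: endpoint, database, and credentials are placeholders.
public class RedshiftConfig {
    private static final String URL =
            "jdbc:redshift://my-cluster.abc123xyz.eu-west-1.redshift.amazonaws.com:5439/studiesdb";
    private static final String USER = "awsuser";
    private static final String PASSWORD = "secret";

    public Connection getConnection() throws SQLException {
        return DriverManager.getConnection(URL, USER, PASSWORD);
    }
}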

In the next section, we demonstrate five approaches to inserting data into a Redshift instance in Java via a JDBC connection. Each has its pros and cons, and some work better than others in certain scenarios. Hence, it is vital to know your requirements and pick the approach that suits your use case best.

#1 Single insert, getting a connection for each insert

In this approach, we get a connection from the connection pool for each insert or, worse, establish a new connection to Redshift every time.

public void insertStudents(List<Student> students) {
    String statement = "INSERT INTO student VALUES(?,?,?)";
    for (int i = 0; i < students.size(); i++) {
        Student student = students.get(i);
        // a connection is obtained from the pool (or newly established) for every single record
        try (Connection connection = config.getConnection()) {
            try (PreparedStatement ps = connection.prepareStatement(statement)) {
                int j = 0;
                ps.setString(++j, student.getGuid());
                ps.setString(++j, student.getName());
                ps.setInt(++j, student.getId());
                ps.executeUpdate();
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }
}

While this approach is simple, it does not scale since getting a new connection is costly. Additionally, there is a database round trip for each insertion. You should refrain from using this solution if you intend to insert more than a handful of records into the database.

#2 Single insert, keeping one connection alive for all insertions

Starting from the previous example, if we move the for-loop inside the prepared statement block, we achieve better throughput, since we no longer obtain a new connection for each insertion.

public void insertStudents(List<Student> students) {
    String statement = "INSERT INTO student VALUES(?,?,?)";
    // a single connection is obtained once and reused for all inserts
    try (Connection connection = config.getConnection()) {
        try (PreparedStatement ps = connection.prepareStatement(statement)) {
            for (int i = 0; i < students.size(); i++) {
                Student student = students.get(i);
                int j = 0;
                ps.setString(++j, student.getGuid());
                ps.setString(++j, student.getName());
                ps.setInt(++j, student.getId());
                ps.executeUpdate();
            }
        }
    } catch (SQLException e) {
        throw new RuntimeException(e);
    }
}

However, if a connection pool is in use, the benefit of this change over the previous approach is negligible, since pooled connections are cheap to obtain. Furthermore, we still have the round-trip issue: each record is inserted into Redshift in its own transaction.
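For context, a connection pool is what makes reusing connections cheap in the first place. Our pool setup is not part of this post, but a minimal sketch using HikariCP (one possible pool among many; the class name, URL, and credentials are placeholders) could look like this:

import javax.sql.DataSource;
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

// Sketch only: HikariCP is used as an example pool; URL and credentials are placeholders.
public class PooledRedshiftConfig {
    private final DataSource dataSource;

    public PooledRedshiftConfig(String url, String user, String password) {
        HikariConfig hikariConfig = new HikariConfig();
        hikariConfig.setJdbcUrl(url);
        hikariConfig.setUsername(user);
        hikariConfig.setPassword(password);
        hikariConfig.setMaximumPoolSize(10); // reuse at most 10 physical connections
        this.dataSource = new HikariDataSource(hikariConfig);
    }

    public java.sql.Connection getConnection() throws java.sql.SQLException {
        return dataSource.getConnection();
    }
}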

#3 Batch insertion

To avoid a database round trip per record and minimize overhead, we can batch multiple records and persist them together. That way, we achieve much higher throughput.

public void insertStudents(List<Student> students) {
    String statement = "INSERT INTO student VALUES(?,?,?)";
    try (Connection connection = config.getConnection()) {
        try (PreparedStatement ps = connection.prepareStatement(statement)) {
            for (int i = 0; i < students.size(); i++) {
                Student student = students.get(i);
                int j = 0;
                ps.setString(++j, student.getGuid());
                ps.setString(++j, student.getName());
                ps.setInt(++j, student.getId());
                ps.addBatch(); // buffer the record instead of executing it immediately
                if (i % 1000 == 0) {
                    // every 1000 records, send the buffered inserts in one round trip
                    ps.executeBatch();
                }
            }
            ps.executeBatch();
            System.out.println("Batch successfully inserted");
        }
    } catch (SQLException e) {
        throw new RuntimeException(e);
    }
}

The above code is still not as performant as the COPY command. Additionally, since the JDBC autocommit flag is true by default, each statement in the batch is treated as a separate transaction, as depicted below:

-- With autocommit enabled, every statement in the batch runs in its own transaction
BEGIN;
INSERT INTO student (guid, name, id) VALUES (value1_1, value1_2, value1_3);
COMMIT;

BEGIN;
INSERT INTO student (guid, name, id) VALUES (value2_1, value2_2, value2_3);
COMMIT;

BEGIN;
INSERT INTO student (guid, name, id) VALUES (value3_1, value3_2, value3_3);
COMMIT;

-- ... one transaction per remaining row ...

That means that, with the default JDBC autocommit configuration, the above code (batch size of 1000) executes 1000 transactions per batch.
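One way to avoid this, shown here as a sketch of the general JDBC technique rather than code from our benchmark, is to disable autocommit and commit explicitly once per executed batch, so all buffered inserts share a single transaction:

try (Connection connection = config.getConnection()) {
    connection.setAutoCommit(false); // one transaction per executeBatch/commit pair
    try (PreparedStatement ps = connection.prepareStatement("INSERT INTO student VALUES(?,?,?)")) {
        for (Student student : students) {
            ps.setString(1, student.getGuid());
            ps.setString(2, student.getName());
            ps.setInt(3, student.getId());
            ps.addBatch();
        }
        ps.executeBatch();
        connection.commit(); // all buffered inserts commit together
    } catch (SQLException e) {
        connection.rollback(); // undo the partial batch before propagating the error
        throw e;
    }
} catch (SQLException e) {
    throw new RuntimeException(e);
}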

#4 Multi-row batch insertion

The Amazon Redshift documentation also recommends inserting multiple rows with a single INSERT statement. Even with batching, the approach above still issues one single-row INSERT per record. To boost performance further, we can use multi-row batch insertion, that is, packing many records into one bulky SQL INSERT statement. A multi-row insert statement looks as follows:

INSERT INTO student (guid, name, id) 
VALUES 
  (value1_1, value1_2, value1_3),
  (value2_1, value2_2, value2_3),
  (value3_1, value3_2, value3_3),
  ...
  (valueN_1, valueN_2, valueN_3);

In theory, we can achieve the above by setting autocommit to false, using a prepared statement, and enabling the driver's batch-rewrite flags in the JDBC URL, for example:

datasource.url=jdbc:postgresql://my-cluster.abc123xyz.eu-west-1.redshift.amazonaws.com:5439/studiesdb?reWriteBatchedInserts=true&reWriteBatchedInsertsSize=1024

Unfortunately, for reasons that are not entirely clear, multi-row inserts through a JDBC prepared statement do not perform fast enough with Redshift. Hence, we must handcraft the SQL query ourselves, which is prone to SQL injection.

public void insertStudents(List<Student> students) {
    if (students.isEmpty()) {
        return;
    }
    String initialInsertStatement = "INSERT INTO student VALUES";
    StringBuilder query = new StringBuilder(initialInsertStatement);
    try (Connection connection = config.getConnection()) {
        try (Statement statement = connection.createStatement()) {
            for (int i = 0; i < students.size(); i++) {
                if (i == 0) {
                    query.append(asSqlQuery(students.get(i)));
                } else if (i % 200 != 0) {
                    // append another row to the current multi-row INSERT statement
                    query.append(",")
                        .append(asSqlQuery(students.get(i)));
                } else {
                    // every 200 rows, buffer the statement and start a new one
                    statement.addBatch(query.toString());
                    query = new StringBuilder(initialInsertStatement);
                    query.append(asSqlQuery(students.get(i)));
                }
                if (i > 0 && i % 1000 == 0) {
                    // every 1000 rows, send the buffered statements to Redshift
                    statement.executeBatch();
                    System.out.println("Batch successfully inserted");
                }
            }
            // flush the remaining rows
            statement.addBatch(query.toString());
            statement.executeBatch();
        }
    } catch (SQLException e) {
        throw new RuntimeException(e);
    }
}

private static String asSqlQuery(Student student) {
    return String.format("('%s','%s',%s)", student.getGuid(), student.getName(), student.getId());
}

Do NOT use the above code for production since it has an SQL injection issue. User input must be escaped accordingly. Otherwise, a malicious actor can infiltrate the database.

With multi-row inserts, we achieve a significant performance boost at the cost of handcrafting the SQL query manually, which is error-prone.
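If handcrafting SQL is unavoidable, string values should at the very least be escaped before they are interpolated. The helper below is our own minimal sketch, not a complete defense, and a vetted escaping utility or strict input validation should still be preferred:

// Sketch only: doubles single quotes so a value like "O'Brien" cannot break out of the literal.
// This is a minimal mitigation, not a full defense against SQL injection.
private static String escapeSqlLiteral(String value) {
    return value == null ? null : value.replace("'", "''");
}

private static String asSqlQuery(Student student) {
    return String.format("('%s','%s',%s)",
            escapeSqlLiteral(student.getGuid()),
            escapeSqlLiteral(student.getName()),
            student.getId());
}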

#5 Unnest insertion (for single-column tables)

PostgreSQL (Redshift's ancestor) has a function called unnest. It allows us to insert an array of values with a single, clean insert statement. The biggest drawback of unnest, however, is that it only works with a single-column table. The following code demonstrates how to use unnest in Java.

public void insertNames(List<String> names) {
    try (Connection connection = config.getConnection();
         PreparedStatement statement = connection.prepareStatement(
                 "INSERT INTO student (name) SELECT * FROM UNNEST(?)")) {
        // bind the whole list as a single SQL array parameter
        statement.setArray(1, connection.createArrayOf("text", names.toArray(new String[names.size()])));
        statement.execute();
    } catch (SQLException e) {
        throw new RuntimeException(e);
    }
}

This approach has the best performance compared to the other four. Additionally, it does not require any handcrafted SQL statement. Our benchmark reveals that we could insert 100K records per minute with unnest.

Conclusion

In this article, we covered how to increase the insert performance of Amazon AWS Redshift using JDBC. We presented five approaches and discussed the pros and cons of each. In our experience, the multi-row insert approach provides the best performance when inserting into tables with more than one column; its downside is that it is prone to SQL injection and requires manual SQL escaping. For a single-column table, the unnest approach yields the best performance.
