Persisting fast in database: LOAD DATA and COPY

By Jerónimo López

In previous articles, I talked about how to persist a lot of information as quickly as possible when you are constrained to JPA and how to do it only with JDBC.

Although JDBC is the basic building block for communicating with the database, there are ways to bypass it and, without leaving the JVM, persist the information even faster with the help of some non-standard methods in each database's JDBC driver.

The language for communicating with databases is SQL, and 99.99% of the time it is the mechanism we use to talk to them.

When we write a statement with multiple inserts like this one:

INSERT INTO persona (ssn, name, age) VALUES
('987-65-4321', 'Alberto Zaplana', 26),
('123-45-6789', 'Zoe Alarcón', 62);

if we remove the SQL syntax, what we have left is a CSV

'987-65-4321','Alberto Zaplana',26
'123-45-6789','Zoe Alarcón',62

which is much easier to parse, interpret and insert into a table with the same structure.

That is why, for big imports, databases prefer to receive data in CSV format rather than as SQL statements. To import files in CSV format, MySQL has the LOAD DATA command, Postgres the COPY command, and SQL Server the BULK INSERT command.

The biggest problem is defining the format of the CSV, because you need to decide things like:

  • Which character will be the column separator
  • Whether strings need to be enclosed in quotation marks, and with which character
  • Which character will mark the end of a line/record
  • Which character will represent the null value (in CSV, no value is an empty string, which is different from null)
  • Which escape character to use when one of the previous characters appears inside a value
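These decisions can be captured in a small formatting helper. A minimal sketch (the method name field and the constants are mine; the concrete choices — comma separator, backslash escape, \N as null marker — mirror MySQL's LOAD DATA defaults):

```java
public class CsvFormat {

    static final char SEPARATOR = ',';       // column separator
    static final char ESCAPE = '\\';         // escape character
    static final String NULL_MARKER = "\\N"; // null value; "" stays an empty string

    // Escapes the separator, the escape character and line breaks inside a
    // value, and distinguishes null from the empty string.
    static String field(String value) {
        if (value == null) {
            return NULL_MARKER;
        }
        StringBuilder sb = new StringBuilder(value.length());
        for (char c : value.toCharArray()) {
            if (c == SEPARATOR || c == ESCAPE || c == '\n' || c == '\r') {
                sb.append(ESCAPE);
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(field("a,b"));  // prints a\,b
        System.out.println(field(null));   // prints \N
    }
}
```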

An example of each command, over a file myfile.csv in CSV format ready to be consumed, with the dataset used in the past articles and the same table, would be:

In MySQL:

  LOAD DATA INFILE 'myfile.csv' INTO TABLE bike_trip
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
    (tripduration, starttime, stoptime, start_station_id, start_station_name,
     start_station_latitude, start_station_longitude, end_station_id,
     end_station_name, end_station_latitude, end_station_longitude,
     bike_id, user_type, birth_year, gender);

In Postgres:

  COPY bike_trip (tripduration, starttime, stoptime, start_station_id,
     start_station_name, start_station_latitude, start_station_longitude,
     end_station_id, end_station_name, end_station_latitude,
     end_station_longitude, bike_id, user_type, birth_year, gender)
    FROM 'myfile.csv' WITH (FORMAT csv);

Although these are special commands meant to be executed from the database console, both JDBC drivers include them as extensions outside the standard.

The functionality is the same in both cases: through the usual JDBC Connection, we execute a statement similar to the one we would write in the query console, and we attach an InputStream with the CSV content.

Each database requires different actions on the connection, and the best way to understand them is to look directly at real code. This time we'll have more code than usual.

The code will do the same as in the previous posts: persist the information in blocks of 1000 records with commits in each batch (in this case it will be in autocommit mode). To be able to compare, we will measure the number of persisted records per second.

From the connection returned by the data source, we must first obtain an object implementing MySQL's own com.mysql.jdbc.Connection interface. In essence, we cast the connection to this interface; if JDBC could not do it, it would throw an exception.

The MySQL Connection interface has a method called setAllowLoadLocalInfile, which enables the driver to use the command LOAD DATA.

Then we only need to create the CSV string from the objects we have in memory according to the format we defined in the LOAD DATA sentence:
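As a sketch of that step (the class and method names are mine, and only three of the fifteen columns are shown for brevity), the batch can be accumulated in a StringBuilder and exposed as an in-memory InputStream:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class CsvBatch {

    private final StringBuilder sb = new StringBuilder();

    // Appends one record following the format declared in the LOAD DATA
    // sentence: comma separator, \N for null, newline as record terminator.
    void add(int tripduration, String startStationName, Integer birthYear) {
        sb.append(tripduration).append(',')
          .append(startStationName).append(',')
          .append(birthYear == null ? "\\N" : birthYear.toString())
          .append('\n');
    }

    // Current content, useful for inspection.
    String content() {
        return sb.toString();
    }

    // Wraps the accumulated CSV in an in-memory InputStream ready for the
    // driver; a FileInputStream over a temporary file would work the same.
    InputStream toInputStream() {
        return new ByteArrayInputStream(content().getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        CsvBatch batch = new CsvBatch();
        batch.add(364, "Mott St & Prince St", null);
        System.out.print(batch.content()); // prints 364,Mott St & Prince St,\N
    }
}
```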

When we reach the corresponding batch size, we send the CSV using a special method of MySQL's Statement implementation, which also needs to be cast.

The setLocalInfileInputStream method stores, in an internal variable of the Statement object, the InputStream from which to read the CSV that will be sent to the database. Because the information is generated on the fly, I put it in an in-memory InputStream. You could also store it in a file and use a FileInputStream.

Finally, we execute the LOAD DATA statement as a usual JDBC Statement. The database internally responds that it needs the information, and the driver sends it the entire content of the InputStream.
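Putting the MySQL-specific steps together, the whole flow could look like this. This is a sketch under the assumption that MySQL Connector/J 5.x is on the classpath (in Connector/J 8 the equivalent interfaces are com.mysql.cj.jdbc.JdbcConnection and JdbcStatement); the method name sendBatch and the reduced column list are mine:

```java
import java.io.InputStream;
import java.sql.Connection;
import java.sql.Statement;

public class MySqlLoader {

    // The file name after LOCAL INFILE is required by the syntax, but the
    // driver ignores it when an InputStream is injected. Columns trimmed
    // for brevity.
    static final String LOAD_DATA_SQL =
        "LOAD DATA LOCAL INFILE 'ignored.csv' INTO TABLE bike_trip "
        + "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' "
        + "(tripduration, start_station_name, birth_year)";

    static void sendBatch(Connection connection, InputStream csv) throws Exception {
        // Obtain the driver's own interface; unwrap throws an SQLException
        // if the connection does not come from the MySQL driver.
        com.mysql.jdbc.Connection mysqlConnection =
            connection.unwrap(com.mysql.jdbc.Connection.class);
        mysqlConnection.setAllowLoadLocalInfile(true);

        try (Statement statement = connection.createStatement()) {
            com.mysql.jdbc.Statement mysqlStatement =
                statement.unwrap(com.mysql.jdbc.Statement.class);
            // The driver reads the CSV from this stream instead of a file.
            mysqlStatement.setLocalInfileInputStream(csv);
            mysqlStatement.execute(LOAD_DATA_SQL);
        }
    }
}
```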

In the Postgres case, the implementation is nicer and we don't have to modify a special Connection flag to do something exceptional in certain statements. We just have to ask for a special object that takes care of everything: the CopyManager.

We also have to create the CSV with the values to persist according to the selected format (in this example I will use the TEXT format):
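A minimal sketch of a row formatter for that format (the method name row is mine; COPY's TEXT format uses a tab separator, \N for null, and backslash escape sequences for tabs, line breaks and backslashes):

```java
public class PgTextFormat {

    // Formats one row in COPY's TEXT format: tab-separated columns,
    // \N for null values, and escaped control characters.
    static String row(Object... values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append('\t');
            }
            Object value = values[i];
            if (value == null) {
                sb.append("\\N");
            } else {
                for (char c : value.toString().toCharArray()) {
                    switch (c) {
                        case '\\': sb.append("\\\\"); break;
                        case '\t': sb.append("\\t"); break;
                        case '\n': sb.append("\\n"); break;
                        case '\r': sb.append("\\r"); break;
                        default:   sb.append(c);
                    }
                }
            }
        }
        return sb.append('\n').toString();
    }

    public static void main(String[] args) {
        System.out.print(row("987-65-4321", "Alberto Zaplana", null));
    }
}
```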

and finally, when we reach the number of records in the batch, we send the CSV to the database with the CopyManager, again using an in-memory InputStream:
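A sketch of that last step, assuming the Postgres JDBC driver is on the classpath (the method name sendBatch is mine; CopyManager.copyIn returns the number of rows written, and COPY ... FROM STDIN without options expects the TEXT format used above):

```java
import java.io.InputStream;
import java.sql.Connection;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class PgLoader {

    static final String COPY_SQL = "COPY bike_trip FROM STDIN";

    // Streams the batch content through the COPY protocol and returns
    // the number of rows inserted.
    static long sendBatch(Connection connection, InputStream data) throws Exception {
        CopyManager copyManager = connection.unwrap(PGConnection.class).getCopyAPI();
        return copyManager.copyIn(COPY_SQL, data);
    }
}
```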

After executing the code (available here) with the same methodology as in the past articles, we get the following values:

For the first time we break the 50,000 records per second barrier!

But in order to know how much we have managed to improve, let’s compare with the results of the last posts:

Not bad! Around a 50% performance improvement compared to the more complex JDBC version, and between 80% and 130% faster than JPA.

There are more techniques to improve insertion speed; some that I've seen around (and I haven't tried all of them) are:

  • Test with different batch sizes. Each table will ask you for a different size, depending on its data types and sizes.
  • If you are loading into an empty table, deactivate the indexes at the beginning and reactivate them at the end. Building the indexes once over the entire table costs less than updating them while inserting the data.
  • The same applies to Foreign Keys: better not to have them when you’re making inserts. If your business logic allows it, you will save a lot of time if you don’t have to validate each reference.
  • If your data already contains the Primary Key, MySQL prefers that you insert it sorted by the Primary Key.

You can also tune the database configuration to minimize disk writes, or the memory dedicated to specific insert tasks.

Regarding the last two points, be well informed and test before doing anything, because the consequences can be catastrophic.

I also recommend reviewing your database configuration with someone who knows the subject, because the default configuration that comes with the typical apt-get install .... is usually very conservative and needs to be adapted to your server's memory and CPU, the type of workload, and even the file system underneath.