Although JDBC is the standard way of communicating with the database, there are ways to bypass the usual INSERT path and, without leaving the JVM, persist the information even faster with the help of some non-standard methods that each database's JDBC driver provides.
The language for communicating with databases is SQL, and 99.99% of the time it is the mechanism we use to talk to them.
When we write a statement with multiple inserts like this one:
INSERT INTO persona (ssn, name, age) VALUES
('987-65-4321', 'Alberto Zaplana', 26),
('123-45-6789', 'Zoe Alarcón', 62);
if we strip away the SQL syntax, what we are left with is a CSV,
which is much easier to parse, interpret and insert into a table with the same structure.
That’s the reason why databases prefer to receive data in CSV format for big imports, rather than as queries. To import files in CSV format, MySQL has the
LOAD DATA command, Postgres the
COPY command, and SQL Server the
BULK INSERT command.
The biggest problem is defining the format of the CSV, because it is necessary to set things like:
- Which character separates the columns
- Whether strings need to be enclosed in quotation marks, and with which character
- Which character marks the end of a line/record
- Which character represents the null value (in CSV, no value is an empty string, which is different from null)
- Which escape character to use when one of the previous characters appears inside the data
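As a side note, these escaping rules can be captured in a small helper. This is only an illustrative sketch (not from the original code), assuming a tab-separated format with \N for nulls, as MySQL and Postgres use in their text formats:

```java
public class CsvEscaper {
    // Escape a single value for a tab-separated CSV:
    // null becomes \N; backslash, tab and newline are backslash-escaped.
    public static String escape(String value) {
        if (value == null) {
            return "\\N";
        }
        return value.replace("\\", "\\\\")  // escape the escape character first
                    .replace("\t", "\\t")   // column separator
                    .replace("\n", "\\n");  // record separator
    }
}
```

Note that the backslash must be escaped before the other characters, otherwise the escapes themselves would be double-escaped.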
An example of each command, using the dataset and table from the previous articles, would be:
- Given a file
myfile.csv in CSV format, ready to be consumed:
LOAD DATA LOCAL INFILE 'myfile.csv' INTO TABLE bike_trip CHARACTER SET UTF8 FIELDS TERMINATED BY '\t' ENCLOSED BY '' ESCAPED BY '\\' LINES TERMINATED BY '\n' STARTING BY ''
(tripduration, starttime, stoptime, start_station_id, start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, end_station_longitude, bike_id, user_type, birth_year, gender)
COPY bike_trip (tripduration, starttime, stoptime, start_station_id,
start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, end_station_longitude, bike_id, user_type, birth_year, gender)
FROM 'myfile.csv' WITH (FORMAT TEXT, ENCODING 'UTF-8', DELIMITER '\t', HEADER false)
Although these are special commands meant to be executed in their respective command-line consoles, both JDBC driver implementations have included them as extensions outside the standard.
The functionality is the same in both cases: with the usual JDBC Connection we execute a statement similar to the one we would write in the query console, and we attach an
InputStream with the CSV:
In each case, you need to do different actions on the connection, but the best way is to see it directly with real code. This time we’ll have more code than usual.
The code will do the same as in the previous posts: persist the information in blocks of 1,000 records, with a commit per batch (in this case the connection will run in autocommit mode). To be able to compare, we will measure the number of persisted records per second.
From the connection returned by the data source, we must first obtain an object implementing MySQL’s own
com.mysql.jdbc.Connection interface, by casting (or unwrapping) the connection to it. If the driver could not provide that interface, it would throw an exception.
The MySQL Connection interface has a method called
setAllowLoadLocalInfile, which enables the driver to use the LOAD DATA LOCAL INFILE command.
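A minimal sketch of that step, assuming MySQL Connector/J 5.x is on the classpath (the DataSource is whatever pool your application uses):

```java
import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

public class MySqlBulkSetup {
    // Unwrap the vendor interface from the standard java.sql.Connection
    // and allow the driver to use LOAD DATA LOCAL INFILE.
    public static com.mysql.jdbc.Connection prepare(DataSource dataSource)
            throws SQLException {
        Connection connection = dataSource.getConnection();
        com.mysql.jdbc.Connection mysqlConnection =
                connection.unwrap(com.mysql.jdbc.Connection.class);
        mysqlConnection.setAllowLoadLocalInfile(true);
        return mysqlConnection;
    }
}
```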
Then we only need to build the CSV string from the objects we have in memory, following the format we defined in the LOAD DATA statement:
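As an illustration, a sketch of how such a record could be built (the helper is hypothetical, not the article's code; the format matches the LOAD DATA example above: tab-separated fields, newline-terminated lines, \N for null):

```java
public class CsvLineBuilder {
    // Build one record in the format declared in the LOAD DATA statement:
    // fields terminated by \t, lines terminated by \n, \N for null values.
    public static String toCsvLine(Object... fields) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                line.append('\t');
            }
            Object field = fields[i];
            if (field == null) {
                line.append("\\N");
            } else {
                // Escape backslash, tab and newline inside the value
                line.append(field.toString()
                        .replace("\\", "\\\\")
                        .replace("\t", "\\t")
                        .replace("\n", "\\n"));
            }
        }
        return line.append('\n').toString();
    }
}
```

Each persisted entity would produce one such line, and the lines are concatenated until the batch is full.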
When we reach the corresponding batch size, we send the CSV using a special method of MySQL’s
Statement implementation, which also needs to be cast:
setLocalInfileInputStream stores, in an internal variable of the object, the
InputStream from which the driver will read the CSV to send to the database. Because the information is generated on the fly, I put it in an in-memory
InputStream. You can also store it in a file and use a FileInputStream.
Finally, we execute the
LOAD DATA statement as a usual JDBC
Statement. The database will internally respond that it needs the information, and the driver will then send the entire content of the
InputStream to the database.
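Putting those steps together, a sketch of the sending side, again assuming Connector/J 5.x (the csv and loadDataSql arguments are hypothetical; loadDataSql would be the LOAD DATA LOCAL INFILE statement shown earlier):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.sql.SQLException;

public class MySqlBulkSender {
    // Send an in-memory CSV batch through LOAD DATA LOCAL INFILE.
    public static void sendBatch(com.mysql.jdbc.Connection mysqlConnection,
                                 String csv, String loadDataSql) throws SQLException {
        InputStream csvStream =
                new ByteArrayInputStream(csv.getBytes(StandardCharsets.UTF_8));
        // The vendor Statement implementation exposes setLocalInfileInputStream
        com.mysql.jdbc.Statement statement = mysqlConnection
                .createStatement()
                .unwrap(com.mysql.jdbc.Statement.class);
        statement.setLocalInfileInputStream(csvStream);
        // Executing LOAD DATA makes the driver stream the InputStream content
        statement.execute(loadDataSql);
        statement.close();
    }
}
```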
In the Postgres case, the implementation is better and we don’t have to toggle a special Connection flag to do something exceptional in certain statements. We just have to request a special object that takes care of everything: the CopyManager.
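A minimal sketch of obtaining it, assuming the Postgres JDBC driver is on the classpath:

```java
import java.sql.Connection;
import java.sql.SQLException;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class PgCopySetup {
    // Unwrap the vendor interface and ask it for the CopyManager,
    // which handles the whole COPY protocol for us.
    public static CopyManager copyManager(Connection connection) throws SQLException {
        PGConnection pgConnection = connection.unwrap(PGConnection.class);
        return pgConnection.getCopyAPI();
    }
}
```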
We also have to create the CSV with the values to persist according to the selected format (in this example I will use the TEXT format):
and finally, when we reach the number of records in the batch, we send the CSV to the database with the
CopyManager, again using an in-memory
InputStream:
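A sketch of that call (the csv and copySql arguments are hypothetical; note that when sent through the CopyManager the COPY statement reads FROM STDIN instead of from a file):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.sql.SQLException;
import org.postgresql.copy.CopyManager;

public class PgCopySender {
    // Stream an in-memory CSV batch to the database through COPY ... FROM STDIN.
    // Returns the number of rows inserted.
    public static long sendBatch(CopyManager copyManager, String csv, String copySql)
            throws SQLException, IOException {
        InputStream csvStream =
                new ByteArrayInputStream(csv.getBytes(StandardCharsets.UTF_8));
        return copyManager.copyIn(copySql, csvStream);
    }
}
```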
For the first time we break the 50,000 records per second barrier!
But in order to know how much we have managed to improve, let’s compare with the results of the last posts:
Not bad! Around 50% performance improvement compared to the most elaborate JDBC version, and between 80% and 130% faster than JPA.
There are more techniques to improve insertion speed; some that I’ve seen around (I haven’t tried all of them) are:
- Test different batch sizes. Each table will call for a different size, depending on its data types and row sizes.
- If you are loading into an empty table, deactivate the indexes at the beginning and reactivate them at the end. Creating the indexes once over the full table costs less than updating them while inserting the data.
- The same applies to Foreign Keys: better not to have them when you’re making inserts. If your business logic allows it, you will save a lot of time if you don’t have to validate each reference.
- If your data already contains the Primary Key, MySQL prefers that you insert it sorted by the Primary Key.
You can also tune the database configuration to minimize disk writes, or to increase the memory dedicated to insert tasks:
- In Postgres, modifying values such as
max_wal_size, or deactivating WAL writing
- In MySQL, tuning variables such as
innodb_autoinc_lock_mode or the size of the log file
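As an illustration only (the values below are hypothetical and must be sized for your own server), this kind of tuning lives in postgresql.conf and my.cnf:

```ini
# postgresql.conf (hypothetical value)
max_wal_size = 4GB

# my.cnf (hypothetical values)
innodb_autoinc_lock_mode = 2
innodb_log_file_size = 1G
```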
Regarding the last two points, be well informed and test before doing anything, because the consequences can be catastrophic.
I also recommend reviewing your database configuration with someone who knows the subject, because the default configuration you get after the typical
apt-get install .... is very conservative, and needs to be adapted to the memory and CPU of your server, the type of workload, and even the file system underneath.