blog dds


2018.08.05

I was trying to run a simple join query on MariaDB (MySQL) and its performance was horrendous. Here's how I cut down the query's run time from over 380 hours to under 12 hours by executing part of it with two simple Unix commands.

Below is the query, with forms part of a more complex GHTorrent analysis I implemented using the simple-rolap relational online analytical processing framework.

select distinct project_commits.project_id, date_format(created_at, '%x%v1') as week_commit from project_commits left join commits on project_commits.commit_id = commits.id;

Both join fields are indexed. However, MariaDB implements the join with a full scan of project_commits and an index lookup on commits. This can be seen in the output of EXPLAIN.

+------+-------------+-----------------+--------+---------------+---------+ | id | select_type | table | type | possible_keys | key | +------+-------------+-----------------+--------+---------------+---------+ | 1 | SIMPLE | project_commits | ALL | NULL | NULL | | 1 | SIMPLE | commits | eq_ref | PRIMARY | PRIMARY | +------+-------------+-----------------+--------+---------------+---------+ +---------+-------------------------------------+------------+-----------------+ | key_len | ref | rows | Extra | +---------+-------------------------------------+------------+-----------------+ | NULL | NULL | 5417294109 | Using temporary | | 4 | ghtorrent.project_commits.commit_id | 1 | | +---------+-------------------------------------+------------+-----------------+

The sizes of the two tables are relatively large: project_commits contains 5 billion rows and commits 847 million rows. Also the server's memory size is relatively small (16GB). This probably means that index lookups hit the (unfortunately magnetic) disk, and therefore performance took a heavy hit. According to the output of pmonitor run on the generated temporary table, the query, which at that point had already run for more than half a day, would take another 373 hours to complete.

 /home/mysql/ghtorrent/project_commits#P#p0.MYD 6.68% ETA 373:38:11

To me this seemed excessive, because the I/O time requirements for a sort-merge join were orders of magnitude below the projected execution time. ( An answer I got from a question I posted on dba.stackexchange.com helpfully offered numerous things to try, but no conviction regarding their efficacy. I tried the first suggestion, but the results weren't promising. As experimenting with each suggestion could easily take at least half a day, I proceeded with a way I knew would work efficiently and reliably.

I exported the two tables into files, joined them with with the Unix join command, piped the result to uniq to remove the duplicate rows, and imported the result back into the database. The process started at 20:41 at the import (including index building) had finished by 9:53 of the next day. Here are the precise steps I followed.

1. Export the database tables as text files

I first exported the fields of the two tables I wanted to join sorted on the join field. To ensure the sort order was compatible with that used by the Unix tools, I cast the field into a character type.

I saved the output of the following SQL query into the file commits_week.txt.

select cast(id as char) as cid, date_format(created_at, '%x%v1') as week_commit from commits order by cid;

I also saved the output of the following SQL query into the file project_commits.txt:

select cast(commit_id as char) as cid, project_id from project_commits order by cid;

This generated the following files.

-rw-r--r-- 1 dds dds 15G Aug 4 21:09 commits_week.txt -rw-r--r-- 1 dds dds 93G Aug 5 00:36 project_commits.txt

Crucially, I run the mysql client with the --quick option to avoid running out of memory as the client tried to gather all results before outputting them.

Second, I joined the two text files using the Unix join command. This scans linearly through both files and combines the records whose first field matches. As the files are already sorted this can be done very efficiently: at the speed of I/O. I also piped the output of join into uniq to eliminate duplicate records. This handled the SQL distinct clause in the original query. Again, on already sorted output this can be done through a simple linear scan.

Here is the Unix shell command I run.

join commits_week.txt project_commits.txt | uniq >joined_commits.txt

After the processing, which took just an hour, I had the desired result in a file.

-rw-r--r-- 1 dds dds 133G Aug 5 01:40 joined_commits.txt

3. Import the text file back into the database

Finally, I imported the text file back into the database as a table.

create table half_life.week_commits_all ( project_id INT(11) not null, week_commit CHAR(7)) ENGINE=MyISAM; load data local infile 'joined_commits.txt' into table half_life.week_commits_all fields terminated by ' ';

Moral

Ideally, MariaDB should support sort-merge joins and its optimizer should employ them when the runtime of alternative strategies is projected to be excessive. Until that time, using Unix shell commands designed in the 1970s can provide a huge performance boost.

Read and post comments, or share through