This post is the final part of a three-part miniseries that looks at how we improved join performance in the CrateDB 3.0 release.
In part one of this miniseries, I went over the reasons we chose to implement the hash join algorithm as an alternative to the nested loop algorithm. With that initial set of changes in place, we were able to make joins up to two thousand times faster.
In part two, I explained how we addressed the memory limitations of the basic hash join algorithm with a switch to block-based processing. That is, dividing a large dataset up into smaller blocks that can be worked on separately. This change improved our performance gains by another 50%.
This brings us to the final set of changes.
With the work so far, the CrateDB node that handles the client request is responsible for planning and executing the entire join.
However, because we now have blocks that can be worked on separately, we can, in theory, distribute those blocks across the CrateDB cluster and execute the join in parallel using multiple nodes for increased performance and load distribution.
To put it another way, what we needed to do was modify our existing implementation so that it (as evenly as possible) distributes the work required for the block hash join.
In this post, I will show you how we modified our existing work to create a distributed block hash join algorithm. I will then present the results of our final benchmarks.
Please note: join algorithms and their performance characteristics are common knowledge for RDBMS developers. This miniseries is written for people who want a behind the scenes look as we implement these algorithms and adapt them for CrateDB and our distributed SQL query execution engine.
We identified two candidate approaches for this work:
- The locality approach
- The modulo distribution approach
In this section, I will explain each approach, and then present the results of our prototyping benchmarks.
The Locality Approach
In an ideal situation, a CrateDB table is split up into primary shards that are distributed evenly across each node in the cluster.
If we have a table with eight shards, and a cluster with three nodes, we might expect a distribution that looks like this:
Ideally, tables also have replication configured, which means that each primary shard should have one or more replicas, which can also be used for read operations. However, I have omitted this from the diagram above to keep the explanation simple.
If you are joining two tables, you have two such tables, and two sets of shards that are distributed across the cluster.
For distributed query execution, ideally, each node in the cluster should perform a portion of the work required for the join.
One of the consequences of the block hash join algorithm (as detailed in part two) is that the left (outer) table is split into blocks and the right (inner) table must be read once for every block. To minimize the total number of rows read, we swap the tables if necessary to ensure that the right-hand table is the smallest of the two.
So, for the locality approach, each node must:
Identify the left-hand table shards that are kept locally and then split those shards into blocks.
Broadcast the right-hand table shards that are kept locally to all other nodes.
Use the complete set of rows for the right-hand table, taken from the shards residing locally on the node and the ones received from other nodes.
Produce a partial result set by running the hash join algorithm on the local left-hand blocks using the local copy of the right-hand table.
Partial result sets are then transmitted to the handling node for merging and further processing (e.g., applying operations like
OFFSET, and so on) before being returned to the client.
Here's what that looks like:
The Modulo Distribution Approach
The modulo the hashes to partition the matching rows across the cluster.
This method ensures that every right-hand row is located on the node that also holds every potentially matching left-hand row. As a result, every node can split its assigned subset of left-hand rows into blocks and execute a hash join using only its assigned subset of right-hand rows.
Take the following query:
SELECT * FROM t1 JOIN t2 ON t1.a = t2.a + t2.b AND t1.b < t2.c
As discussed in part one, we're interested in equality operators. So in this example, this is what we care about:
t1.a = t2.a + t2.b
Using this equality operator, we would calculate the hash for rows in
t1 using the value of
a, and a hash for rows in
t2 using the value of
a + b.
Because we're using an equality operator, matching rows from
t2 will have identical hashes.
From here, we can assign each row to a node using the modulo operator, like so:
node = hash_value % total_nodes
Statistically speaking, the modulo operator will assign approximately the same number of rows to each node in the cluster.
So, to summarize, for the modulo distribution approach, each node must:
Calculate the appropriate hash for every row for every local shard for both tables participating in the join.
Assign every row to a node by taking a modulo of the hash.
Transmit rows to their assigned nodes, as necessary.
Assemble the subset of assigned left-hand and right-hand rows from the rows held locally and the rows received from other nodes in the cluster.
Produce a partial result set by running the block hash join algorithm on the assigned subsets of left-hand and right-hand rows.
As before, partial result sets are then transmitted to the handling node for merging and further processing before being returned to the client.
Here's what the whole thing looks like:
Making a Decision
To test the two approaches we benchmarked two prototyped implementations on a five node CrateDB cluster.
For our benchmarks, the left (outer) table,
t1, held 20 million rows, and the right (inner) table,
t2, held five million rows.
We ran the following query:
SELECT * FROM t1 INNER JOIN t2 ON t1 = t2
Using three algorithms:
- The original single-node block hash join algorithm
- The locality prototype
- The modulo distribution prototype
With four different table configurations:
- Three primary shards, no replicas
- Five primary shards, no replicas
- Five primary shards, one replica each
- 10 primary shards, one replica each
And three different LIMIT values:
Here's what we saw:
|Execution Time (seconds)|
|Three primaries, no replicas||1,000||1.465||5.658||1.210|
|Five primaries, no replicas||1,000||1.610||3.985||1.332|
|Five primaries, one replica each||1,000||1.488||5.999||1.409|
|10 primaries, one replica each||1,000||1.413||3.286||1.324|
From this, we can observe that:
The locality approach would not improve performance over the original single-node block hash join algorithm. In fact, it worsens performance.
The modulo distribution approach improves performance for all tested configurations.
The results showed that the locality approach wouldn’t improve performance over the single node execution but on the contrary, would make it slower. On the other hand, the modulo distribution would improve performance for all tested configurations over the single node execution.
This was in line with what we were expecting, for two reasons:
The locality approach can easily generate a lot of internal traffic because every node must receive all right-hand rows that exist on other nodes.
The modulo distribution approach reduces the total number of row reads because each node only has to compare its assigned subsets of left-hand and right-hand rows. Contrast this with the locality approach which still requires the hash join algorithm to compare against the whole right-hand table for every block.
With these results in hand, we decided to go with the modulo distribution approach.
Thankfully for us, CrateDB's existing distributed query infrastructure made light work of this implementation, so details of the code changes we made can be skipped for this post.
A Precautionary Limitation
We identified one potential issue during implementation.
If you are joining a subselect, and that subselect has a
LIMIT or an
OFFSET clause, the distributed algorithm may perform worse than the single-node version.
To understand why let's take an example query:
SELECT * FROM ( SELECT a FROM t1 LIMIT 100) AS tt1, JOIN t2 ON tt1.a = t2.b
When a single node is executing this query, it only needs to collect 100 rows for
t1 before being able to complete subselect and proceed with the join.
Here's what that might look like:
However, we can't do this with the distributed block hash join algorithm, because each node is only processing a subset of the data.
For us to complete the subselect as before, one of the nodes would have to complete the subselect by receiving all data for
t1 and apply the limit of 100 before distributing the rows across the cluster to proceed with the execution of the distributed block hash join algorithm.
We decided this was out of the scope for this work, and so instead we applied a limitation so that the single-node version of the block hash join algorithm is used instead of the distributed version for any query joining a subselect with a
LIMIT or an
With the implementation of the distributed block hash join algorithm complete, we performed a final set of benchmarks to compare it with the single-node version.
Like before, we used a five node CrateDB cluster.
This time, we created two tables,
t2, configured with five primary shards each and no replica shards.
For the first test iteration, both tables started out with 10 thousand rows, and this was increased for every subsequent iteration.
For each iteration, we tested two queries using three algorithms:
- The original nested loop algorithm
- The single-node block hash join algorithm
- The distributed block hash join algorithm
The first query matches all rows:
SELECT COUNT(*) FROM t1 INNER JOIN t2 ON t1.a = t2.a
The second query was designed to match one-fifth of all rows using a second column that was populated accordingly:
SELECT COUNT(*) FROM t1 INNER JOIN t2 ON t1.a = t2.a AND t1.b = t2.b
Every combination of query, table size, and algorithm was run multiple times and the average execution time was taken.
The results were as follows:
|Execution Time (seconds)||Modulo Improvement|
|Rows||Match||Nested Loop||Single Node||Modulo||vs. Single Node||vs. Nested Loop|
Note: tests marked N/A ran for too long and were terminated.
From this, we can observe that:
The original nested loop algorithm performs reasonably for tables with up to 500 thousand rows.
For tables with more than 50 thousand rows, the distributed block hash join algorithm is significantly faster than the single-node block hash join algorithm.
For tables with less than 50 thousand rows, the performance benefits of the distributed block hash join algorithm do not outweigh much the associated performance costs, i.e., distributing rows internally, and so on.
The distributed block hash join algorithm makes join operations up to 23 thousand times faster than they are with the nested loop algorithm.
We didn't want to stop here. So, we devised some additional benchmarks.
Specifically, we used the same basic setup for the tests, but this time we ran the queries individually and then 10 times simultaneously to simulate multiple concurrent users. We also increased the size of the test tables in steps.
The results were as follows:
|Rows||Match||Concurrency||Execution Time |
Interpreting these results is more subjective because we're not comparing them with any previous results. We are, however, happy with this performance profile because many of the typical real-world CrateDB deployments have tables with millions of rows.
In parts one and two of this miniseries, we walked you through our implementation of the hash join algorithm, and how we subsequently improved on that work with the block hash join algorithm.
In this post, I showed you how we took the block hash join algorithm and modified it to run in parallel on multiple nodes.
We shipped this work in CrateDB 3.0, and the result is that inner joins with equality operators are significantly faster in all cases.
23 thousand times faster in some cases.
There are some potential improvements we could make:
We don't have anything roadmapped at the moment, but as always, stay tuned.