How to Read 100s of Millions of Records per Second from a Single Disk

By clemenswinter

This article gives an overview of the implementation and performance of two recent additions to LocustDB, an extremely fast open-source analytics database built in Rust. The first addition is support for persistent storage, the second is an lz4 compression pass for data stored on disk or cached in memory. Benchmarking LocustDB on a dataset of 1.46 billion taxi rides demonstrates excellent performance for cold queries, reaching > 95% of sequential read speed on SSD and > 70% of sequential read speed on HDD. Comparing to ClickHouse, queries with similar data volume are as fast or faster, and disk space usage is reduced by 40%.

Implementation

A large fraction of the work went into finding a better abstraction for in-memory data and encoding schemes that allows for easy serialization. LocustDB splits up tables into a set of partitions with a fixed number of rows. Within each partition, the data for each column is stored sequentially in one contiguous block of memory/disk. LocustDB automatically applies a variety of compression schemes to data which among other benefits allows it to store datasets with billions of entries completely within RAM.

Before this work, I abstracted over the in-memory encoding schemes using Column and Codec traits that had a distinct implementation for each different encoding scheme. I was able to simplify this significantly, and now both Column and Codec are just basic structs that can be easily serialized and actually make it even easier to support any number of different encoding schemes.

The Column struct consists of some metadata (name, length of column, range of possible values) a list of data sections (an enum for a predefined set of Vecs storing primitive values) and a Codec. The Codec struct provides a uniform and flexible way of defining a variety of different compression schemes. Basically, it is a simple stack machine program composed of a list of well-defined CodecOP instructions that can be converted into a query plan that will decompress the data.

To illustrate, we might encode an integer column by subtracting 413372 from all values, casting them into 2 byte unsigned integers (u16), and then applying a final lz4 pass. The decoder can be expressed as [LZ4(output_type=u16), Add(413372, input_type=u16)]. Other codecs might read additional data sections. For example, dictionary compressed string columns, which store each value as an integer indexing into the string dictionary, have 3 different data sections. The first is the (lz4 compressed) list of indices for each value, the second stores the length and offset for each string, and the third is an array of bytes that contains the actual string data. This is how we can express the decoding operation as a stack machine: [LZ4(output_type=u8), PushDataSection(1), PushDataSection(2), DictLookup(index_type=u8)].

This design has a number of nice properties:

  1. Columns use essentially the same format on disk and in memory and can be straighforwardly and cheaply serialized and deseralized.
  2. It is possible to add any number of new encoding schemes while still maintaining backwards compatibility for existing data.
  3. Decoders can be turned into the same class of operations as user queries, and can benefit from any existing and future optimization passes applied by the query engine.
  4. Often we only want to partially decode data. For example, you might have a query like SELECT country, count(0) FROM users that counts the number of users broken down by a dictionary encoded country column. In this case we can compute the counts directly on the dictionary indices and only convert back to string values at the very end, which speeds up the query by more than an order of magnitude. This kind of optimization can now be automatically derived for any decoding scheme, all we need to do is specify what operations commute with each decoding layer (e.g. does it preserve width? ordering? summation?).

Background: RocksDB

RocksDB is an embeddable persistent key-value store for fast storage based on log structured merge trees. It has replaced InnoDB as the storage backend for all of Facebook’s MySQL clusters, resulting in significant performance gains, and is employed across a wide range of applications and workloads within and outside of Facebook. The interface exposed by RocksDB is very simple at it’s core and has three key operations that can be simplified as follows (&[u8] is Rust for a reference to an array of bytes):

fn put(key: &[u8], value: &[u8]);
fn get(key: &[u8]) -> &[u8];
fn scan(first_key: &[u8]) -> Iterator<Item=(&[u8], &[u8])>;

The put and get operations store and retrieve a value at a key respectively, where both keys and values are arbitrary sequences of bytes. RocksDB internally stores data sorted by key (within each level), and scan provides an efficient way to iterate over ranges of adjacent keys and their respective values.

Disk storage schema

The storage backend for LocustDB is very simple (or rather, all of the actual complexity is taken care for us by RocksDB). Point in case, the RocksDB specific code fits into a single 350 line file, 250 lines of which are imports and serialization boilerplate.

LocustDB uses two separate key spaces. The first one stores only meta data about what partitions and columns exist. This is a small amount of data that can be loaded quickly in full at start-up. The second key space stores the actual column data sorted first by column name and then by a monotonically increasing partition id. On SSD, the data for a column/partition is currently just loaded with a point lookup (get) as required by query workers, which works as long as partition sizes are large enough. On HDD, a single reader thread loads column data in larger contiguous chunks (scan). There’s a little bit of additional complexity in scheduling disk reads and synchronizing worker threads with minimal locking, but Rust’s strong support for concurrency makes this quite manageable. To allow for small partition sizes on SSDs and make good use of very fast HDDs or systems with multiple disks this approach will require some refinement, but at least on my hardware it already works well enough.

Performance

For a variety of reasons, the results I present should be taken with a grain of salt. LocustDB is optimized for bulk loads, while ClickHouse will also work for continuous ingestion. The data set and queries I used for the benchmarks is not particularly suited to evaluate cold reads or compression, it’s just what I had handy. The ingestion processes for LocustDB and especially Clickhouse apply various data transformations to the 52 different columns, and I have not verified that the resulting tables are completely comparable in every respect.

If you want to know what performance looks like on your own data (or just enjoy cryptic error messages and program crashes), you will be pleased to hear that LocustDB can now be downloaded as a binary that will run on most x64 Linux systems without any additional setup.

Setup

All results were produced on a dataset of 1.46 billion taxi rides. For more details on the dataset and queries see this blogpost. The HDD was a 2TB Seagate BarraCuda and the SSD a 2014 Crucial MX100 512GB. For additional details, see the methodology section.

Cold read performance

Measurements of query runtime when none of the data is cached in memory. On HDD and after purging caches LocustDB can take a few seconds to start, and the ClickHouse server similarly can take a few seconds to become responsive after restart. This time is not included in the results. For ClickHouse there are two sets of HDD results collected with--max-threads=1 and --max-threads=12. The former significantly improves cold query times on HDD (presumably because it reduces the number of separate data streams and therefore disk seeks), but would of course hurt queries for data cached in-memory in a major way. A superficial scan of ClickHouse docs didn’t turn up any way to just limit the number of disk-reading threads, but I may well have missed the right setting.

#QueryLocustDB (SSD)/sClickHouse (SSD)/sLocustDB (HDD)/sClickHouse (HDD, 1 thread)/sClickHouse (HDD, 12 threads)/s
1Simple Count0.661.00.786.46.3
2Average6.89.02739128
3Count by 2114.1443528
4Count by 318116876147
5Sparse Filter9.4203852268
6Top N13185156236
7Reducible Cardinality6.78.1176498

Drawing strong conclusions from these results is inadvisable because they conflate ingestion schema differences, compression ratios and querying efficiency. Query 1 is very fast because it reads from a single highly compressible column that is reduced to a couple of megabytes by lz4. Queries 2 and 3 are reading a year from a timestamp column, and judging from bandwidth reported by ClickHouse and dstat ClickHouse seems to be able to read a dedicated, highly compressible 2 byte width year column whereas LocustDB reads the full UNIX timestamp. LocustDB enjoys an advantage on query 5 which reads from a delta encoded trip_id column. Data volume on the other queries should be roughly the same, but probably somewhat smaller for LocustDB for which various floating point columns were (mostly losslessly) transformed into integers at ingestion time which helps compression, and which is also able to automatically choose a different compression scheme for each partition.

The main bottleneck for cold queries is IO, and we can judge the storage backend by looking at how close it comes to achieving sequential read speeds (550MiB/s for the SSD and 180MiB/s for the HDD according to fio). Below table shows (over)estimates of the read speed of LocustDB during the queries, computed as the size LocustDB reports for the referenced columns divided by runtime:

Query1234567
Bandwidth (SSD)/MB/s11580560560580580350
Bandwidth (HDD)/MB/s10150140150140150140
Approximate data volume/GiB0.0073.75.99.35.07.12.2

The bandwidths are off by about 7%, maybe because of memory allocator overhead or some other inaccuracy in how LocustDB reports memory usage. The actual bandwidths as gleaned from looking at dstat output are closer to 530MB/s on SSD and 130MB/s on HDD. Either way, these calculations do show that query time is directly proportional to data volume and at least on SSD we are very close to hitting the IO limit. The only outliers are query 1 which reads very little data, and query 7 which reads 2.2GiB of data but is computationally expensive.

Size on disk

The full uncompressed data set encoded as CSV is 602GiB, and reduces to 114GiB after applying gzip. According to du, the dataset takes up 80GiB  and 132GiB on disk when converted to the internal storage formats for LocustDB and ClickHouse respectively. This is roughly as expected. Both LocustDB and ClickHouse use efficient columnar data layouts which are further compressed with lz4. In addition to this, LocustDB automatically applies dictionary encoding to string columns as appropriate which results in significant additional gains, and an additional 5GiB is shaved off by delta encoding the trips column.

Imagine you are running a 100 node ClickHouse cluster. Look at your cluster, now look at me, now back to your cluster. It is now a 60 node LocustDB cluster. You’re welcome, and please send 10% of your cash savings my way. It’s only fair! (Of course you might be constrained on something other than disk space, and in any case LocustDB is not and may never be production ready.)

LocustDB implements a :memtree command that displays full breakdowns of memory usage and encoding schemes. The full dataset does not quite fit into memory on my machine and the output below is from a non-representative sample of 45% of partitions (click the tabs for increasingly detailed breakdowns):

trips 32.0GiB 1464785771 rows
trips 32.0GiB 1464785771 rows
├─ average_wind_speed 45% resident 0.61GiB 1.9% 0.99B/row
├─ cab_type 45% resident 3.2MiB 0.01% 5.1mB/row
├─ dropoff 45% resident 5.9GiB 19% 9.8B/row
├─ dropoff_borocode 45% resident 40MiB 0.12% 65mB/row
├─ dropoff_boroct2010 45% resident 0.18GiB 0.55% 0.29B/row
├─ dropoff_boroname 45% resident 41MiB 0.12% 65mB/row
├─ dropoff_cdeligibil 45% resident 51MiB 0.15% 81mB/row
├─ dropoff_ct2010 45% resident 0.17GiB 0.53% 0.28B/row
├─ dropoff_ctlabel 45% resident 0.13GiB 0.39% 0.21B/row
├─ dropoff_datetime 45% resident 2.3GiB 7.2% 3.8B/row
├─ dropoff_latitude 45% resident 11MiB 0.03% 18mB/row
├─ dropoff_longitude 45% resident 40MiB 0.12% 63mB/row
├─ dropoff_ntacode 45% resident 84MiB 0.26% 0.13B/row
├─ dropoff_ntaname 45% resident 86MiB 0.26% 0.14B/row
├─ dropoff_nyct2010_gid 45% resident 0.13GiB 0.40% 0.21B/row
├─ dropoff_puma 45% resident 92MiB 0.28% 0.15B/row
├─ ehail_fee 45% resident 3.1MiB 0.01% 5.0mB/row
├─ extra 45% resident 0.41GiB 1.3% 0.67B/row
├─ fare_amount 45% resident 1.1GiB 3.3% 1.8B/row
├─ improvement_surcharge 45% resident 6.9MiB 0.02% 11mB/row
├─ max_temperature 45% resident 0.45GiB 1.4% 0.73B/row
├─ min_temperature 45% resident 0.44GiB 1.4% 0.73B/row
├─ mta_tax 45% resident 15MiB 0.04% 24mB/row
├─ passenger_count 45% resident 0.30GiB 0.95% 0.50B/row
├─ payment_type 45% resident 0.31GiB 0.96% 0.50B/row
├─ pickup 45% resident 5.9GiB 19% 9.8B/row
├─ pickup_borocode 45% resident 0.11GiB 0.35% 0.18B/row
├─ pickup_boroct2010 45% resident 1.1GiB 3.5% 1.8B/row
├─ pickup_boroname 45% resident 0.11GiB 0.35% 0.18B/row
├─ pickup_cdeligibil 45% resident 0.17GiB 0.52% 0.27B/row
├─ pickup_ct2010 45% resident 1.1GiB 3.5% 1.8B/row
├─ pickup_ctlabel 45% resident 0.89GiB 2.8% 1.5B/row
├─ pickup_datetime 45% resident 2.1GiB 6.6% 3.4B/row
├─ pickup_latitude 45% resident 16MiB 0.05% 25mB/row
├─ pickup_longitude 45% resident 0.17GiB 0.55% 0.29B/row
├─ pickup_ntacode 45% resident 0.45GiB 1.4% 0.73B/row
├─ pickup_ntaname 45% resident 0.45GiB 1.4% 0.74B/row
├─ pickup_nyct2010_gid 45% resident 0.89GiB 2.8% 1.5B/row
├─ pickup_puma 45% resident 0.45GiB 1.4% 0.73B/row
├─ precipitation 45% resident 0.36GiB 1.1% 0.58B/row
├─ rate_code_id 45% resident 63MiB 0.19% 0.10B/row
├─ snow_depth 45% resident 94MiB 0.29% 0.15B/row
├─ snowfall 45% resident 71MiB 0.22% 0.11B/row
├─ store_and_fwd_flag 45% resident 0.16GiB 0.51% 0.27B/row
├─ tip_amount 45% resident 0.88GiB 2.7% 1.4B/row
├─ tolls_amount 45% resident 0.14GiB 0.45% 0.23B/row
├─ total_amount 45% resident 1.4GiB 4.4% 2.3B/row
├─ trip_distance 45% resident 1.6GiB 5.0% 2.7B/row
├─ trip_id 45% resident 26MiB 0.08% 41mB/row
├─ trip_type 45% resident 51KiB 0.00% 81μB/row
└─ vendor_id 45% resident 0.30GiB 0.94% 0.50B/row
trips 32.0GiB 1464785771 rows
├─ average_wind_speed 45% resident 0.61GiB 1.9% 0.99B/row
│ ├─ LZ4(U16) ToI64(U16) 0.61GiB 100% 0.99B/row 100%
│ └─ LZ4(U8) Add(U8) 2.0KiB 0.00% 6.6mB/row 0.05%
├─ cab_type 45% resident 3.2MiB 0.01% 5.1mB/row
│ └─ LZ4(U8) Data(1) Data(2) Dict(U8) 3.2MiB 100% 5.1mB/row 100%
├─ dropoff 45% resident 5.9GiB 19% 9.8B/row
│ ├─ LZ4(U8) StrHexUnpack 5.9GiB 100% 13B/row 73%
│ ├─ LZ4(U16) Data(1) Data(2) Dict(U16) 8.2MiB 0.13% 8.2B/row 0.16%
│ └─ 12KiB 0.00% 72μB/row 27%
├─ dropoff_borocode 45% resident 40MiB 0.12% 65mB/row
│ ├─ LZ4(U8) ToI64(U8) 40MiB 100% 88mB/row 73%
│ └─ 14KiB 0.03% 80μB/row 27%
├─ dropoff_boroct2010 45% resident 0.18GiB 0.55% 0.29B/row
│ ├─ LZ4(U32) ToI64(U32) 0.18GiB 100% 0.39B/row 73%
│ └─ 16KiB 0.01% 96μB/row 27%
├─ dropoff_boroname 45% resident 41MiB 0.12% 65mB/row
│ ├─ LZ4(U8) Data(1) Data(2) Dict(U8) 41MiB 100% 89mB/row 73%
│ └─ 14KiB 0.03% 80μB/row 27%
├─ dropoff_cdeligibil 45% resident 51MiB 0.15% 81mB/row
│ ├─ LZ4(U8) Data(1) Data(2) Dict(U8) 51MiB 100% 0.11B/row 73%
│ └─ 16KiB 0.03% 96μB/row 27%
├─ dropoff_ct2010 45% resident 0.17GiB 0.53% 0.28B/row
│ ├─ LZ4(U32) ToI64(U32) 0.17GiB 100% 0.38B/row 73%
│ └─ 14KiB 0.01% 80μB/row 27%
├─ dropoff_ctlabel 45% resident 0.13GiB 0.39% 0.21B/row
│ ├─ LZ4(U16) ToI64(U16) 0.13GiB 100% 0.28B/row 73%
│ └─ 14KiB 0.01% 80μB/row 27%
├─ dropoff_datetime 45% resident 2.3GiB 7.2% 3.8B/row
│ ├─ ToI64(U32) 1.8GiB 76% 4.0B/row 72%
│ ├─ LZ4(U32) ToI64(U32) 0.53GiB 23% 3.2B/row 27%
│ ├─ LZ4(I64) 20MiB 0.85% 5.0B/row 0.64%
│ └─ Add(U32) 8.0MiB 0.34% 4.0B/row 0.32%
├─ dropoff_latitude 45% resident 11MiB 0.03% 18mB/row
│ ├─ LZ4(U8) ToI64(U8) 5.3MiB 47% 22mB/row 39%
│ ├─ LZ4(U16) Add(U16) 2.5MiB 22% 33mB/row 12%
│ ├─ LZ4(U8) Add(U8) 2.4MiB 21% 23mB/row 16%
│ ├─ LZ4(U16) ToI64(U16) 1.1MiB 9.9% 28mB/row 6.4%
│ └─ 14KiB 0.12% 80μB/row 27%
├─ dropoff_longitude 45% resident 40MiB 0.12% 63mB/row
│ ├─ LZ4(U8) Add(U8) 26MiB 66% 80mB/row 53%
│ ├─ LZ4(U16) Add(U16) 13MiB 33% 0.10B/row 20%
│ └─ 16KiB 0.04% 96μB/row 27%
├─ dropoff_ntacode 45% resident 84MiB 0.26% 0.13B/row
│ ├─ LZ4(U8) Data(1) Data(2) Dict(U8) 84MiB 100% 0.18B/row 73%
│ └─ 14KiB 0.02% 80μB/row 27%
├─ dropoff_ntaname 45% resident 86MiB 0.26% 0.14B/row
│ ├─ LZ4(U8) Data(1) Data(2) Dict(U8) 86MiB 100% 0.19B/row 73%
│ └─ 14KiB 0.02% 80μB/row 27%
├─ dropoff_nyct2010_gid 45% resident 0.13GiB 0.40% 0.21B/row
│ ├─ LZ4(U16) ToI64(U16) 0.13GiB 100% 0.29B/row 73%
│ └─ 16KiB 0.01% 96μB/row 27%
├─ dropoff_puma 45% resident 92MiB 0.28% 0.15B/row
│ ├─ LZ4(U16) ToI64(U16) 91MiB 99% 0.20B/row 73%
│ └─ LZ4(U8) ToI64(U8) 0.83MiB 0.90% 5.0mB/row 27%
├─ ehail_fee 45% resident 3.1MiB 0.01% 5.0mB/row
│ └─ LZ4(U8) ToI64(U8) 3.1MiB 100% 5.0mB/row 100%
├─ extra 45% resident 0.41GiB 1.3% 0.67B/row
│ ├─ LZ4(U16) ToI64(U16) 0.19GiB 46% 0.87B/row 35%
│ ├─ LZ4(U16) Add(U16) 0.18GiB 45% 0.52B/row 58%
│ ├─ LZ4(U32) Add(U32) 20MiB 4.9% 1.1B/row 2.9%
│ ├─ LZ4(U8) ToI64(U8) 16MiB 3.9% 0.62B/row 4.2%
│ └─ LZ4(U8) Add(U8) 0.22MiB 0.05% 0.33B/row 0.11%
├─ fare_amount 45% resident 1.1GiB 3.3% 1.8B/row
│ ├─ LZ4(U32) Add(U32) 0.63GiB 59% 1.8B/row 59%
│ ├─ LZ4(U16) ToI64(U16) 0.41GiB 39% 1.8B/row 39%
│ ├─ LZ4(U16) Add(U16) 20MiB 1.8% 1.8B/row 1.8%
│ ├─ LZ4(U32) ToI64(U32) 3.5MiB 0.32% 1.8B/row 0.32%
│ └─ Add(U16) 0.16MiB 0.01% 2.1B/row 0.01%
├─ improvement_surcharge 45% resident 6.9MiB 0.02% 11mB/row
│ ├─ LZ4(U8) ToI64(U8) 4.0MiB 58% 15mB/row 44%
│ ├─ LZ4(U8) Add(U8) 2.9MiB 41% 8.3mB/row 55%
│ └─ LZ4(U16) Add(U16) 48KiB 0.68% 12mB/row 0.64%
├─ max_temperature 45% resident 0.45GiB 1.4% 0.73B/row
│ ├─ ToI64(U8) 0.45GiB 100% 1.0B/row 73%
│ └─ LZ4(U8) ToI64(U8) 1.1MiB 0.24% 6.7mB/row 27%
├─ min_temperature 45% resident 0.44GiB 1.4% 0.73B/row
│ ├─ ToI64(U8) 0.41GiB 92% 1.0B/row 67%
│ ├─ LZ4(U8) ToI64(U8) 22MiB 4.9% 0.12B/row 31%
│ └─ Add(U8) 12MiB 2.7% 1.0B/row 1.9%
├─ mta_tax 45% resident 15MiB 0.04% 24mB/row
│ ├─ LZ4(U16) Add(U16) 7.6MiB 52% 30mB/row 40%
│ ├─ LZ4(U8) ToI64(U8) 3.7MiB 25% 15mB/row 39%
│ ├─ LZ4(U8) Add(U8) 3.4MiB 23% 26mB/row 21%
│ └─ LZ4(U16) ToI64(U16) 40KiB 0.27% 20mB/row 0.32%
├─ passenger_count 45% resident 0.30GiB 0.95% 0.50B/row
│ └─ LZ4(U8) ToI64(U8) 0.30GiB 100% 0.50B/row 100%
├─ payment_type 45% resident 0.31GiB 0.96% 0.50B/row
│ └─ LZ4(U8) Data(1) Data(2) Dict(U8) 0.31GiB 100% 0.50B/row 100%
├─ pickup 45% resident 5.9GiB 19% 9.8B/row
│ ├─ LZ4(U8) StrHexUnpack 5.9GiB 100% 13B/row 73%
│ ├─ LZ4(U16) Data(1) Data(2) Dict(U16) 8.2MiB 0.13% 8.2B/row 0.16%
│ └─ 12KiB 0.00% 72μB/row 27%
├─ pickup_borocode 45% resident 0.11GiB 0.35% 0.18B/row
│ ├─ LZ4(U8) ToI64(U8) 0.11GiB 100% 0.25B/row 73%
│ └─ 14KiB 0.01% 80μB/row 27%
├─ pickup_boroct2010 45% resident 1.1GiB 3.5% 1.8B/row
│ ├─ LZ4(U32) ToI64(U32) 1.1GiB 100% 2.5B/row 73%
│ └─ 16KiB 0.00% 96μB/row 27%
├─ pickup_boroname 45% resident 0.11GiB 0.35% 0.18B/row
│ ├─ LZ4(U8) Data(1) Data(2) Dict(U8) 0.11GiB 100% 0.25B/row 73%
│ └─ 14KiB 0.01% 80μB/row 27%
├─ pickup_cdeligibil 45% resident 0.17GiB 0.52% 0.27B/row
│ ├─ LZ4(U8) Data(1) Data(2) Dict(U8) 0.17GiB 100% 0.37B/row 73%
│ └─ 16KiB 0.01% 96μB/row 27%
├─ pickup_ct2010 45% resident 1.1GiB 3.5% 1.8B/row
│ ├─ LZ4(U32) ToI64(U32) 1.1GiB 100% 2.5B/row 73%
│ └─ 14KiB 0.00% 80μB/row 27%
├─ pickup_ctlabel 45% resident 0.89GiB 2.8% 1.5B/row
│ ├─ ToI64(U16) 0.89GiB 100% 2.0B/row 73%
│ ├─ LZ4(U16) ToI64(U16) 0.19MiB 0.02% 0.19B/row 0.16%
│ └─ 14KiB 0.00% 80μB/row 27%
├─ pickup_datetime 45% resident 2.1GiB 6.6% 3.4B/row
│ ├─ ToI64(U32) 1.8GiB 85% 4.0B/row 73%
│ ├─ LZ4(U32) ToI64(U32) 0.32GiB 15% 1.9B/row 27%
│ ├─ LZ4(U16) Add(U16) 0.36MiB 0.02% 1.2B/row 0.05%
│ └─ Add(U16) 0.16MiB 0.01% 2.1B/row 0.01%
├─ pickup_latitude 45% resident 16MiB 0.05% 25mB/row
│ ├─ LZ4(U8) ToI64(U8) 9.4MiB 60% 34mB/row 45%
│ ├─ LZ4(U8) Add(U8) 3.5MiB 22% 33mB/row 17%
│ ├─ LZ4(U16) Add(U16) 2.6MiB 16% 40mB/row 10%
│ ├─ LZ4(U16) ToI64(U16) 0.25MiB 1.6% 35mB/row 1.1%
│ └─ 14KiB 0.08% 80μB/row 27%
├─ pickup_longitude 45% resident 0.17GiB 0.55% 0.29B/row
│ ├─ LZ4(U8) Add(U8) 0.13GiB 74% 0.37B/row 57%
│ ├─ LZ4(U16) Add(U16) 47MiB 26% 0.46B/row 16%
│ └─ 14KiB 0.01% 80μB/row 27%
├─ pickup_ntacode 45% resident 0.45GiB 1.4% 0.73B/row
│ ├─ Data(1) Data(2) Dict(U8) 0.45GiB 100% 1.0B/row 73%
│ ├─ LZ4(U8) Data(1) Data(2) Dict(U8) 99KiB 0.02% 97mB/row 0.16%
│ └─ 14KiB 0.00% 80μB/row 27%
├─ pickup_ntaname 45% resident 0.45GiB 1.4% 0.74B/row
│ ├─ Data(1) Data(2) Dict(U8) 0.45GiB 100% 1.0B/row 73%
│ └─ LZ4(U8) Data(1) Data(2) Dict(U8) 0.94MiB 0.20% 5.6mB/row 27%
├─ pickup_nyct2010_gid 45% resident 0.89GiB 2.8% 1.5B/row
│ ├─ ToI64(U16) 0.89GiB 100% 2.0B/row 73%
│ ├─ LZ4(U16) ToI64(U16) 0.19MiB 0.02% 0.19B/row 0.16%
│ └─ 16KiB 0.00% 96μB/row 27%
├─ pickup_puma 45% resident 0.45GiB 1.4% 0.73B/row
│ ├─ LZ4(U16) ToI64(U16) 0.45GiB 100% 1.00B/row 73%
│ └─ LZ4(U8) ToI64(U8) 0.83MiB 0.18% 5.0mB/row 27%
├─ precipitation 45% resident 0.36GiB 1.1% 0.58B/row
│ ├─ LZ4(U16) ToI64(U16) 0.35GiB 98% 0.66B/row 87%
│ └─ LZ4(U8) ToI64(U8) 6.1MiB 1.7% 76mB/row 13%
├─ rate_code_id 45% resident 63MiB 0.19% 0.10B/row
│ └─ LZ4(U8) ToI64(U8) 63MiB 100% 0.10B/row 100%
├─ snow_depth 45% resident 94MiB 0.29% 0.15B/row
│ ├─ LZ4(U16) ToI64(U16) 92MiB 98% 0.56B/row 27%
│ └─ LZ4(U8) ToI64(U8) 2.3MiB 2.4% 5.0mB/row 73%
├─ snowfall 45% resident 71MiB 0.22% 0.11B/row
│ ├─ LZ4(U16) ToI64(U16) 66MiB 92% 0.42B/row 25%
│ └─ LZ4(U8) ToI64(U8) 5.4MiB 7.6% 12mB/row 75%
├─ store_and_fwd_flag 45% resident 0.16GiB 0.51% 0.27B/row
│ └─ LZ4(U8) Data(1) Data(2) Dict(U8) 0.16GiB 100% 0.27B/row 100%
├─ tip_amount 45% resident 0.88GiB 2.7% 1.4B/row
│ ├─ LZ4(U16) Add(U16) 0.46GiB 52% 1.5B/row 49%
│ ├─ LZ4(U16) ToI64(U16) 0.30GiB 34% 1.3B/row 39%
│ └─ LZ4(U32) Add(U32) 0.12GiB 14% 1.7B/row 11%
├─ tolls_amount 45% resident 0.14GiB 0.45% 0.23B/row
│ ├─ LZ4(U32) Add(U32) 59MiB 40% 0.30B/row 31%
│ ├─ LZ4(U16) ToI64(U16) 47MiB 32% 0.19B/row 41%
│ ├─ LZ4(U16) Add(U16) 38MiB 26% 0.22B/row 27%
│ └─ LZ4(U32) ToI64(U32) 2.2MiB 1.5% 0.28B/row 1.3%
├─ total_amount 45% resident 1.4GiB 4.4% 2.3B/row
│ ├─ LZ4(U32) Add(U32) 0.92GiB 65% 2.5B/row 60%
│ ├─ ToI64(U16) 0.47GiB 33% 2.0B/row 38%
│ ├─ LZ4(U32) ToI64(U32) 13MiB 0.88% 2.5B/row 0.81%
│ └─ Add(U16) 6.3MiB 0.44% 2.0B/row 0.50%
├─ trip_distance 45% resident 1.6GiB 5.0% 2.7B/row
│ ├─ LZ4(U32) ToI64(U32) 1.3GiB 80% 2.5B/row 84%
│ ├─ LZ4(I64) 0.32GiB 20% 3.5B/row 15%
│ ├─ ToI64(U16) 2.7MiB 0.16% 2.1B/row 0.20%
│ └─ LZ4(U32) Add(U32) 2.5MiB 0.15% 2.5B/row 0.16%
├─ trip_id 45% resident 26MiB 0.08% 41mB/row
│ ├─ LZ4(U32) Add(U32) Delta(I64) 26MiB 100% 41mB/row 100%
│ └─ LZ4(U32) Delta(U32) 1.6KiB 0.01% 21mB/row 0.01%
├─ trip_type 45% resident 51KiB 0.00% 81μB/row
│ └─ 51KiB 100% 81μB/row 100%
└─ vendor_id 45% resident 0.30GiB 0.94% 0.50B/row └─ LZ4(U8) Data(1) Data(2) Dict(U8) 0.30GiB 100% 0.50B/row 100%
trips 32.0GiB 1464785771 rows
├─ average_wind_speed 45% resident 0.61GiB 1.9% 0.99B/row
│ ├─ LZ4(U16) ToI64(U16) 0.61GiB 100% 0.99B/row 100%
│ └─ LZ4(U8) Add(U8) 2.0KiB 0.00% 6.6mB/row 0.05%
├─ cab_type 45% resident 3.2MiB 0.01% 5.1mB/row
│ └─ LZ4(U8) Data(1) Data(2) Dict(U8) 3.2MiB 100% 5.1mB/row 100%
│ ├─ .0 3.0MiB
│ ├─ .1 5.1KiB
│ └─ .2 5.1KiB
├─ dropoff 45% resident 5.9GiB 19% 9.8B/row
│ ├─ LZ4(U8) StrHexUnpack 5.9GiB 100% 13B/row 73%
│ ├─ LZ4(U16) Data(1) Data(2) Dict(U16) 8.2MiB 0.13% 8.2B/row 0.16%
│ │ ├─ .0 0.19MiB
│ │ ├─ .1 1.0MiB
│ │ └─ .2 7.0MiB
│ └─ 12KiB 0.00% 72μB/row 27%
├─ dropoff_borocode 45% resident 40MiB 0.12% 65mB/row
│ ├─ LZ4(U8) ToI64(U8) 40MiB 100% 88mB/row 73%
│ └─ 14KiB 0.03% 80μB/row 27%
├─ dropoff_boroct2010 45% resident 0.18GiB 0.55% 0.29B/row
│ ├─ LZ4(U32) ToI64(U32) 0.18GiB 100% 0.39B/row 73%
│ └─ 16KiB 0.01% 96μB/row 27%
├─ dropoff_boroname 45% resident 41MiB 0.12% 65mB/row
│ ├─ LZ4(U8) Data(1) Data(2) Dict(U8) 41MiB 100% 89mB/row 73%
│ │ ├─ .0 41MiB
│ │ ├─ .1 30KiB
│ │ └─ .2 30KiB
│ └─ 14KiB 0.03% 80μB/row 27%
├─ dropoff_cdeligibil 45% resident 51MiB 0.15% 81mB/row
│ ├─ LZ4(U8) Data(1) Data(2) Dict(U8) 51MiB 100% 0.11B/row 73%
│ │ ├─ .0 50MiB
│ │ ├─ .1 15KiB
│ │ └─ .2 3.8KiB
│ └─ 16KiB 0.03% 96μB/row 27%
├─ dropoff_ct2010 45% resident 0.17GiB 0.53% 0.28B/row
│ ├─ LZ4(U32) ToI64(U32) 0.17GiB 100% 0.38B/row 73%
│ └─ 14KiB 0.01% 80μB/row 27%
├─ dropoff_ctlabel 45% resident 0.13GiB 0.39% 0.21B/row
│ ├─ LZ4(U16) ToI64(U16) 0.13GiB 100% 0.28B/row 73%
│ └─ 14KiB 0.01% 80μB/row 27%
├─ dropoff_datetime 45% resident 2.3GiB 7.2% 3.8B/row
│ ├─ ToI64(U32) 1.8GiB 76% 4.0B/row 72%
│ ├─ LZ4(U32) ToI64(U32) 0.53GiB 23% 3.2B/row 27%
│ ├─ LZ4(I64) 20MiB 0.85% 5.0B/row 0.64%
│ └─ Add(U32) 8.0MiB 0.34% 4.0B/row 0.32%
├─ dropoff_latitude 45% resident 11MiB 0.03% 18mB/row
│ ├─ LZ4(U8) ToI64(U8) 5.3MiB 47% 22mB/row 39%
│ ├─ LZ4(U16) Add(U16) 2.5MiB 22% 33mB/row 12%
│ ├─ LZ4(U8) Add(U8) 2.4MiB 21% 23mB/row 16%
│ ├─ LZ4(U16) ToI64(U16) 1.1MiB 9.9% 28mB/row 6.4%
│ └─ 14KiB 0.12% 80μB/row 27%
├─ dropoff_longitude 45% resident 40MiB 0.12% 63mB/row
│ ├─ LZ4(U8) Add(U8) 26MiB 66% 80mB/row 53%
│ ├─ LZ4(U16) Add(U16) 13MiB 33% 0.10B/row 20%
│ └─ 16KiB 0.04% 96μB/row 27%
├─ dropoff_ntacode 45% resident 84MiB 0.26% 0.13B/row
│ ├─ LZ4(U8) Data(1) Data(2) Dict(U8) 84MiB 100% 0.18B/row 73%
│ │ ├─ .0 82MiB
│ │ ├─ .1 0.95MiB
│ │ └─ .2 0.47MiB
│ └─ 14KiB 0.02% 80μB/row 27%
├─ dropoff_ntaname 45% resident 86MiB 0.26% 0.14B/row
│ ├─ LZ4(U8) Data(1) Data(2) Dict(U8) 86MiB 100% 0.19B/row 73%
│ │ ├─ .0 82MiB
│ │ ├─ .1 0.95MiB
│ │ └─ .2 2.3MiB
│ └─ 14KiB 0.02% 80μB/row 27%
├─ dropoff_nyct2010_gid 45% resident 0.13GiB 0.40% 0.21B/row
│ ├─ LZ4(U16) ToI64(U16) 0.13GiB 100% 0.29B/row 73%
│ └─ 16KiB 0.01% 96μB/row 27%
├─ dropoff_puma 45% resident 92MiB 0.28% 0.15B/row
│ ├─ LZ4(U16) ToI64(U16) 91MiB 99% 0.20B/row 73%
│ └─ LZ4(U8) ToI64(U8) 0.83MiB 0.90% 5.0mB/row 27%
├─ ehail_fee 45% resident 3.1MiB 0.01% 5.0mB/row
│ └─ LZ4(U8) ToI64(U8) 3.1MiB 100% 5.0mB/row 100%
├─ extra 45% resident 0.41GiB 1.3% 0.67B/row
│ ├─ LZ4(U16) ToI64(U16) 0.19GiB 46% 0.87B/row 35%
│ ├─ LZ4(U16) Add(U16) 0.18GiB 45% 0.52B/row 58%
│ ├─ LZ4(U32) Add(U32) 20MiB 4.9% 1.1B/row 2.9%
│ ├─ LZ4(U8) ToI64(U8) 16MiB 3.9% 0.62B/row 4.2%
│ └─ LZ4(U8) Add(U8) 0.22MiB 0.05% 0.33B/row 0.11%
├─ fare_amount 45% resident 1.1GiB 3.3% 1.8B/row
│ ├─ LZ4(U32) Add(U32) 0.63GiB 59% 1.8B/row 59%
│ ├─ LZ4(U16) ToI64(U16) 0.41GiB 39% 1.8B/row 39%
│ ├─ LZ4(U16) Add(U16) 20MiB 1.8% 1.8B/row 1.8%
│ ├─ LZ4(U32) ToI64(U32) 3.5MiB 0.32% 1.8B/row 0.32%
│ └─ Add(U16) 0.16MiB 0.01% 2.1B/row 0.01%
├─ improvement_surcharge 45% resident 6.9MiB 0.02% 11mB/row
│ ├─ LZ4(U8) ToI64(U8) 4.0MiB 58% 15mB/row 44%
│ ├─ LZ4(U8) Add(U8) 2.9MiB 41% 8.3mB/row 55%
│ └─ LZ4(U16) Add(U16) 48KiB 0.68% 12mB/row 0.64%
├─ max_temperature 45% resident 0.45GiB 1.4% 0.73B/row
│ ├─ ToI64(U8) 0.45GiB 100% 1.0B/row 73%
│ └─ LZ4(U8) ToI64(U8) 1.1MiB 0.24% 6.7mB/row 27%
├─ min_temperature 45% resident 0.44GiB 1.4% 0.73B/row
│ ├─ ToI64(U8) 0.41GiB 92% 1.0B/row 67%
│ ├─ LZ4(U8) ToI64(U8) 22MiB 4.9% 0.12B/row 31%
│ └─ Add(U8) 12MiB 2.7% 1.0B/row 1.9%
├─ mta_tax 45% resident 15MiB 0.04% 24mB/row
│ ├─ LZ4(U16) Add(U16) 7.6MiB 52% 30mB/row 40%
│ ├─ LZ4(U8) ToI64(U8) 3.7MiB 25% 15mB/row 39%
│ ├─ LZ4(U8) Add(U8) 3.4MiB 23% 26mB/row 21%
│ └─ LZ4(U16) ToI64(U16) 40KiB 0.27% 20mB/row 0.32%
├─ passenger_count 45% resident 0.30GiB 0.95% 0.50B/row
│ └─ LZ4(U8) ToI64(U8) 0.30GiB 100% 0.50B/row 100%
├─ payment_type 45% resident 0.31GiB 0.96% 0.50B/row
│ └─ LZ4(U8) Data(1) Data(2) Dict(U8) 0.31GiB 100% 0.50B/row 100%
│ ├─ .0 0.31GiB
│ ├─ .1 30KiB
│ └─ .2 11KiB
├─ pickup 45% resident 5.9GiB 19% 9.8B/row
│ ├─ LZ4(U8) StrHexUnpack 5.9GiB 100% 13B/row 73%
│ ├─ LZ4(U16) Data(1) Data(2) Dict(U16) 8.2MiB 0.13% 8.2B/row 0.16%
│ │ ├─ .0 0.19MiB
│ │ ├─ .1 1.0MiB
│ │ └─ .2 7.0MiB
│ └─ 12KiB 0.00% 72μB/row 27%
├─ pickup_borocode 45% resident 0.11GiB 0.35% 0.18B/row
│ ├─ LZ4(U8) ToI64(U8) 0.11GiB 100% 0.25B/row 73%
│ └─ 14KiB 0.01% 80μB/row 27%
├─ pickup_boroct2010 45% resident 1.1GiB 3.5% 1.8B/row
│ ├─ LZ4(U32) ToI64(U32) 1.1GiB 100% 2.5B/row 73%
│ └─ 16KiB 0.00% 96μB/row 27%
├─ pickup_boroname 45% resident 0.11GiB 0.35% 0.18B/row
│ ├─ LZ4(U8) Data(1) Data(2) Dict(U8) 0.11GiB 100% 0.25B/row 73%
│ │ ├─ .0 0.11GiB
│ │ ├─ .1 30KiB
│ │ └─ .2 30KiB
│ └─ 14KiB 0.01% 80μB/row 27%
├─ pickup_cdeligibil 45% resident 0.17GiB 0.52% 0.27B/row
│ ├─ LZ4(U8) Data(1) Data(2) Dict(U8) 0.17GiB 100% 0.37B/row 73%
│ │ ├─ .0 0.17GiB
│ │ ├─ .1 15KiB
│ │ └─ .2 3.8KiB
│ └─ 16KiB 0.01% 96μB/row 27%
├─ pickup_ct2010 45% resident 1.1GiB 3.5% 1.8B/row
│ ├─ LZ4(U32) ToI64(U32) 1.1GiB 100% 2.5B/row 73%
│ └─ 14KiB 0.00% 80μB/row 27%
├─ pickup_ctlabel 45% resident 0.89GiB 2.8% 1.5B/row
│ ├─ ToI64(U16) 0.89GiB 100% 2.0B/row 73%
│ ├─ LZ4(U16) ToI64(U16) 0.19MiB 0.02% 0.19B/row 0.16%
│ └─ 14KiB 0.00% 80μB/row 27%
├─ pickup_datetime 45% resident 2.1GiB 6.6% 3.4B/row
│ ├─ ToI64(U32) 1.8GiB 85% 4.0B/row 73%
│ ├─ LZ4(U32) ToI64(U32) 0.32GiB 15% 1.9B/row 27%
│ ├─ LZ4(U16) Add(U16) 0.36MiB 0.02% 1.2B/row 0.05%
│ └─ Add(U16) 0.16MiB 0.01% 2.1B/row 0.01%
├─ pickup_latitude 45% resident 16MiB 0.05% 25mB/row
│ ├─ LZ4(U8) ToI64(U8) 9.4MiB 60% 34mB/row 45%
│ ├─ LZ4(U8) Add(U8) 3.5MiB 22% 33mB/row 17%
│ ├─ LZ4(U16) Add(U16) 2.6MiB 16% 40mB/row 10%
│ ├─ LZ4(U16) ToI64(U16) 0.25MiB 1.6% 35mB/row 1.1%
│ └─ 14KiB 0.08% 80μB/row 27%
├─ pickup_longitude 45% resident 0.17GiB 0.55% 0.29B/row
│ ├─ LZ4(U8) Add(U8) 0.13GiB 74% 0.37B/row 57%
│ ├─ LZ4(U16) Add(U16) 47MiB 26% 0.46B/row 16%
│ └─ 14KiB 0.01% 80μB/row 27%
├─ pickup_ntacode 45% resident 0.45GiB 1.4% 0.73B/row
│ ├─ Data(1) Data(2) Dict(U8) 0.45GiB 100% 1.0B/row 73%
│ │ ├─ .0 0.45GiB
│ │ ├─ .1 0.93MiB
│ │ └─ .2 0.46MiB
│ ├─ LZ4(U8) Data(1) Data(2) Dict(U8) 99KiB 0.02% 97mB/row 0.16%
│ │ ├─ .0 96KiB
│ │ ├─ .1 2.0KiB
│ │ └─ .2 1.0KiB
│ └─ 14KiB 0.00% 80μB/row 27%
├─ pickup_ntaname 45% resident 0.45GiB 1.4% 0.74B/row
│ ├─ Data(1) Data(2) Dict(U8) 0.45GiB 100% 1.0B/row 73%
│ │ ├─ .0 0.45GiB
│ │ ├─ .1 0.93MiB
│ │ └─ .2 2.0MiB
│ └─ LZ4(U8) Data(1) Data(2) Dict(U8) 0.94MiB 0.20% 5.6mB/row 27%
│ ├─ .0 0.91MiB
│ ├─ .1 3.4KiB
│ └─ .2 4.0KiB
├─ pickup_nyct2010_gid 45% resident 0.89GiB 2.8% 1.5B/row
│ ├─ ToI64(U16) 0.89GiB 100% 2.0B/row 73%
│ ├─ LZ4(U16) ToI64(U16) 0.19MiB 0.02% 0.19B/row 0.16%
│ └─ 16KiB 0.00% 96μB/row 27%
├─ pickup_puma 45% resident 0.45GiB 1.4% 0.73B/row
│ ├─ LZ4(U16) ToI64(U16) 0.45GiB 100% 1.00B/row 73%
│ └─ LZ4(U8) ToI64(U8) 0.83MiB 0.18% 5.0mB/row 27%
├─ precipitation 45% resident 0.36GiB 1.1% 0.58B/row
│ ├─ LZ4(U16) ToI64(U16) 0.35GiB 98% 0.66B/row 87%
│ └─ LZ4(U8) ToI64(U8) 6.1MiB 1.7% 76mB/row 13%
├─ rate_code_id 45% resident 63MiB 0.19% 0.10B/row
│ └─ LZ4(U8) ToI64(U8) 63MiB 100% 0.10B/row 100%
├─ snow_depth 45% resident 94MiB 0.29% 0.15B/row
│ ├─ LZ4(U16) ToI64(U16) 92MiB 98% 0.56B/row 27%
│ └─ LZ4(U8) ToI64(U8) 2.3MiB 2.4% 5.0mB/row 73%
├─ snowfall 45% resident 71MiB 0.22% 0.11B/row
│ ├─ LZ4(U16) ToI64(U16) 66MiB 92% 0.42B/row 25%
│ └─ LZ4(U8) ToI64(U8) 5.4MiB 7.6% 12mB/row 75%
├─ store_and_fwd_flag 45% resident 0.16GiB 0.51% 0.27B/row
│ └─ LZ4(U8) Data(1) Data(2) Dict(U8) 0.16GiB 100% 0.27B/row 100%
│ ├─ .0 0.16GiB
│ ├─ .1 14KiB
│ └─ .2 5.1KiB
├─ tip_amount 45% resident 0.88GiB 2.7% 1.4B/row
│ ├─ LZ4(U16) Add(U16) 0.46GiB 52% 1.5B/row 49%
│ ├─ LZ4(U16) ToI64(U16) 0.30GiB 34% 1.3B/row 39%
│ └─ LZ4(U32) Add(U32) 0.12GiB 14% 1.7B/row 11%
├─ tolls_amount 45% resident 0.14GiB 0.45% 0.23B/row
│ ├─ LZ4(U32) Add(U32) 59MiB 40% 0.30B/row 31%
│ ├─ LZ4(U16) ToI64(U16) 47MiB 32% 0.19B/row 41%
│ ├─ LZ4(U16) Add(U16) 38MiB 26% 0.22B/row 27%
│ └─ LZ4(U32) ToI64(U32) 2.2MiB 1.5% 0.28B/row 1.3%
├─ total_amount 45% resident 1.4GiB 4.4% 2.3B/row
│ ├─ LZ4(U32) Add(U32) 0.92GiB 65% 2.5B/row 60%
│ ├─ ToI64(U16) 0.47GiB 33% 2.0B/row 38%
│ ├─ LZ4(U32) ToI64(U32) 13MiB 0.88% 2.5B/row 0.81%
│ └─ Add(U16) 6.3MiB 0.44% 2.0B/row 0.50%
├─ trip_distance 45% resident 1.6GiB 5.0% 2.7B/row
│ ├─ LZ4(U32) ToI64(U32) 1.3GiB 80% 2.5B/row 84%
│ ├─ LZ4(I64) 0.32GiB 20% 3.5B/row 15%
│ ├─ ToI64(U16) 2.7MiB 0.16% 2.1B/row 0.20%
│ └─ LZ4(U32) Add(U32) 2.5MiB 0.15% 2.5B/row 0.16%
├─ trip_id 45% resident 26MiB 0.08% 41mB/row
│ ├─ LZ4(U32) Add(U32) Delta(I64) 26MiB 100% 41mB/row 100%
│ └─ LZ4(U32) Delta(U32) 1.6KiB 0.01% 21mB/row 0.01%
├─ trip_type 45% resident 51KiB 0.00% 81μB/row
│ └─ 51KiB 100% 81μB/row 100%
└─ vendor_id 45% resident 0.30GiB 0.94% 0.50B/row └─ LZ4(U8) Data(1) Data(2) Dict(U8) 0.30GiB 100% 0.50B/row 100% ├─ .0 0.30GiB ├─ .1 10KiB └─ .2 5.1KiB

What does mB stand for you ask? Why, millibyte of course. Encountering this exotic unit is a surprise, to be sure, but a welcome one.

LZ4 decompression speed

LocustDB is able to keep memory-cached data lz4 compressed as well, which significantly reduces memory usage at the cost of query times. The following table show results for query runtimes in LocustDB for memory-cached data with and without lz4 compression:

Query1234567
Runtime (no lz4)/s0.250.550.543.11.60.224.2
Runtime (lz4)/s0.310.880.723.82.41.34.8

LZ4 overhead is fairly modest for most queries. Queries 5 and 6 select a small number of rows using filter/top n which can be done with a simple index lookup when the referenced columns are fixed width, but requires decoding the entire column when they are lz4 encoded.

Startup time

Loading all 80GiB of data from disk in bulk takes only 164s on SSD (523MB/s) and 625s on HDD (137MB/s). This is very unfortunate because it deprives me of the opportunity to write a paper about fancy techniques for quickly restarting your database when reading from disk is too slow.

Limitations and future work

The storage backend already works well enough for bulk load, but to support continuous ingestion (at reasonable latencies) we would need to store much smaller partitions when data first comes in, have an additional mechanism to periodically merge small partitions into larger ones, time partition data to allow it to be expired, and maybe find ways to reduce overhead from RocksdDB compactions which is quite significant. The fact that obtaining full sequential read speed on HDD has so far eluded me is very trying, but my resolution not to write a custom persistent database remains strong (how hard can it be though, really?). Some additional tuning and other improvements could probably close the gap, and either way RocksDB already provides a strong baseline.

Smarter partitioning

At least on SSD, we are already close to maxing out sequential read speed. So are there still any avenues for improvement (other than better compression)? The answer is yes, there are many situations where we can avoid reading a partition entirely. For example, you might have a query that filters down to the last hour of data in which case scanning a full 30 days is quite wasteful. This can be avoided by partitioning the data by timestamp, and storing the range of timestamps as part of the column metadata. Similarly, we might use bloom filters to quickly check whether a specific string we are filtering on is present in a column.

Compression

The compression ratios in LocustDB are starting to approach the limits of what can be achieved with current methods. I see three remaining avenues for further improvements.

1. Entropy encoding

LZ4 only eliminates redundancy at the byte level, and additional compression can be achieved by applying an entropy encoder that uses less bits for common symbols (Huffman coding or ANS seem like good candidates). This starts to add significant decoding overhead, but might still be strictly beneficial for data stored on disk. In fact, ClickHouse already has (experimental) support for compressing data stored on disk with Zstandard.

2. LZ4 high compression

LZ4 has a configurable parameter that governs the trade off between CPU time spent on compression and compression ratio. LocustDB currently uses the default, but using the high compression option might be reasonable if you’re not bottlenecked by ingestion (decoding speed is unaffected).

3. Domain specific compression

Since LocustDB has very flexible support for compression, it would be quite reasonable to add additional compression schemes for a variety of data distributions. These would only apply to specific datasets, but yield spectacular gains when they do. Examples include the timeseries compression scheme used by Gorilla, and the way stack traces are encoded in Scuba.

What’s next

Compression and disk storage were the two remaining features I wanted to prove out. LocustDB is still missing any support for running on (large) clusters, but getting this to production quality for general use cases would likely require a small team working on LocustDB full-time for a few months and actually operating a deployment. This exceeds the resources available to me. So my next goal will be to add the polish and missing features that will turn LocustDB into a useful command line tool, a sort of Swiss Army knife for ad-hoc data analysis. Think initial configuration overhead of awk or grep, speed of optimized Rust, expressive power of SQL, and the ability to scale to multi-terabyte datasets on a single machine. If you like the sound of that and are interested in speeding development along, or want to cheer from the sidelines, then join us on github.

Methodology

Summary of pertinent facts:

  • In between cold queries, I restarted the db processes and ran sudo sh -c "sync; sudo echo 1 > /proc/sys/vm/drop_caches" to clear OS page caches.
  • Results for cold query times are the medians of three runs.
  • The results for lz4/no lz4 comparison were run with sudo cpupower frequency-set -g performance.
  • Results for lz4/no lz4 comparison are the medians of the last three of four runs.
  • I used a LocustDB partition size of 65536 for HDD results and 1048576 for SSD and in-memory results (the larger partition size results in modest speedups for memory-cached queries, and significant speedups for reading from SSD using point lookups).
  • After ingestion, I restarted LocustDB and waited a few minutes for RocksDB background compactions to complete before collecting results.
  • I used ClickHouse version 1.1.54388 in all benchmarks.

200K lines worth of transcripts for all the terminal sessions that produced the results, including printouts of ClickHouse config files, commit hashes and various system properties: