Another cardinal rule for a database: it cannot lose or corrupt your data. This is a dimension where TimescaleDB and InfluxDB have taken starkly different approaches, with real implications for reliability.
At its start, InfluxDB sought to write an entire database from scratch in Go. In fact, it doubled down on this decision with its 0.9 release, which once again completely rewrote the backend storage engine (earlier versions of InfluxDB had been moving toward a pluggable backend supporting LevelDB, RocksDB, and others). There are real benefits to this approach: e.g., you can build domain-specific compression algorithms better suited to a particular use case, as InfluxDB has done with its use of Facebook's Gorilla encoding.
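To make that concrete, here is a simplified sketch (in Go, InfluxDB's language) of the delta-of-delta idea at the heart of Gorilla's timestamp compression. This is only the core intuition; the real encoder also bit-packs these values into variable-width buckets:

```go
package main

import "fmt"

// deltaOfDelta computes the second-order differences of a timestamp
// series. For metrics that arrive at a regular interval, almost every
// value here is zero, which a Gorilla-style encoder can then bit-pack
// into roughly one bit per timestamp.
func deltaOfDelta(ts []int64) []int64 {
	out := make([]int64, 0, len(ts))
	var prev, prevDelta int64
	for i, t := range ts {
		if i == 0 {
			out = append(out, t) // first timestamp stored verbatim
		} else {
			delta := t - prev
			out = append(out, delta-prevDelta)
			prevDelta = delta
		}
		prev = t
	}
	return out
}

func main() {
	// A metric sampled every 10 seconds:
	ts := []int64{1000, 1010, 1020, 1030, 1040, 1050}
	fmt.Println(deltaOfDelta(ts)) // [1000 10 0 0 0 0]
}
```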
Yet these design decisions have significant implications for reliability. First, InfluxDB has to implement the full suite of fault-tolerance mechanisms on its own, including replication, high availability, and backup/restore. Second, InfluxDB is responsible for its own on-disk reliability, e.g., making sure all its data structures are both durable and resistant to corruption across failures (and even failures that occur while recovering from other failures).
TimescaleDB, on the other hand, made the architectural decision to rely on the 25+ years of hard, careful engineering work that the entire PostgreSQL community has done to build a rock-solid database that can support truly mission-critical applications.
In fact, this was at the core of my co-founder's launch post about TimescaleDB: When Boring is Awesome. Stateless microservices may crash and reboot, or trivially scale up and down; this is the entire "recovery-oriented computing" philosophy, as well as the thinking behind the new "serverless" design pattern. But your database needs to actually persist data, and should not wake you up at 3am because it's in some broken state.
So let us return to these two aspects of reliability.
First, programs can crash, servers can encounter hardware or power failures, and disks can fail or experience corruption. You can mitigate this risk, e.g., with robust software engineering practices, uninterruptible power supplies, and disk RAID, but never eliminate it; it's a fact of life for systems. In response, databases have developed an array of mechanisms to further reduce such risk, including streaming replication to replicas, full-snapshot backup and recovery, streaming backups, robust data export tools, etc.
Given TimescaleDB's design, it's able to leverage the full complement of tools that the Postgres ecosystem offers and has rigorously tested, all of them available in open source: streaming replication for high availability and read-only replicas, pg_dump and pg_restore for full database snapshots, pg_basebackup and log shipping / streaming for incremental backups and arbitrary point-in-time recovery, WAL-E for continuous archiving to cloud storage, and robust COPY FROM and COPY TO commands for quickly importing/exporting data in a variety of formats.
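As a quick illustration of that last point, bulk loading through COPY takes only a few lines from a Go program. This sketch uses the github.com/lib/pq driver's CopyIn helper; the connection string, the metrics table, and its columns are illustrative:

```go
package main

import (
	"database/sql"
	"log"
	"time"

	"github.com/lib/pq"
)

func main() {
	// Connection string and the "metrics" table/columns are illustrative.
	db, err := sql.Open("postgres", "postgres://localhost/tsdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	txn, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}

	// pq.CopyIn builds a COPY "metrics" (...) FROM STDIN statement, so rows
	// stream to the server instead of being sent as row-by-row INSERTs.
	stmt, err := txn.Prepare(pq.CopyIn("metrics", "time", "device_id", "value"))
	if err != nil {
		log.Fatal(err)
	}
	for i := 0; i < 1000; i++ {
		if _, err := stmt.Exec(time.Now(), i%10, float64(i)); err != nil {
			log.Fatal(err)
		}
	}
	if _, err := stmt.Exec(); err != nil { // an empty Exec flushes the buffered rows
		log.Fatal(err)
	}
	if err := stmt.Close(); err != nil {
		log.Fatal(err)
	}
	if err := txn.Commit(); err != nil {
		log.Fatal(err)
	}
}
```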
InfluxDB, on the other hand, has had to build all these tools from scratch, and many of these capabilities it doesn't offer even today. It initially offered replication and high availability in its open source product, but subsequently pulled this capability out of open source and into its enterprise product. Its backup tools can perform a full snapshot and recover to that point in time, and it only recently added some support for a manual form of incremental backups. (That said, its approach of performing incremental backups based on database time ranges seems quite risky from a correctness perspective, given that timestamped data may arrive out of order, and thus an incremental backup taken since some time period would not reflect this late data.) And its ability to easily and safely export large volumes of data is also quite limited. We've heard from many users (including Timescale engineers in their past careers) who had to write custom scripts to safely export data; asking for more than a few tens of thousands of datapoints would cause the database to hit an out-of-memory error and crash.
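To see why out-of-order data breaks time-range incremental backups, consider this toy model in Go (hypothetical, not InfluxDB's actual implementation): a point can describe an event time that an earlier backup already covered, and a scheme that only looks forward never picks it up.

```go
package main

import "fmt"

// A point carries the event timestamp it describes and the wall-clock
// time it arrived at the database.
type point struct {
	eventTime   int64
	arrivalTime int64
}

// incrementalBackup copies every point whose *event* timestamp falls in
// (since, until], modeling a backup keyed on database time ranges.
func incrementalBackup(db []point, since, until int64) []point {
	var out []point
	for _, p := range db {
		if p.eventTime > since && p.eventTime <= until {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	var backup []point

	// First backup covers event times (0, 100]; both points are captured.
	db := []point{{10, 11}, {50, 52}}
	backup = append(backup, incrementalBackup(db, 0, 100)...)

	// A late point arrives at wall-clock time 150 but describes event time 40.
	db = append(db, point{40, 150})

	// The second backup covers (100, 200] and never revisits the old range,
	// so the late point is silently missing from the backup set.
	backup = append(backup, incrementalBackup(db, 100, 200)...)

	fmt.Printf("db has %d points, backup has %d\n", len(db), len(backup))
	// Output: db has 3 points, backup has 2
}
```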
Second, databases need to provide strong on-disk reliability and durability, so that once a database has committed to storing a write, it is safely persisted to disk. In fact, for very large data volumes, the same argument even applies to indexing structures, which could otherwise take hours or days to recover; there’s good reason that file systems have moved from painful fsck recovery to journaling mechanisms.
In TimescaleDB, we made the conscious decision not to change the lowest levels of PostgreSQL storage, nor to interfere with the proper function of its write-ahead log. (The WAL ensures that as soon as a write is accepted, it gets written to an on-disk log for safety and durability, even before the data is written to its final location and all its indexes are safely updated.) These data structures are critical for ensuring consistency and atomicity; they prevent data from becoming lost or corrupted, and ensure safe recovery. This is something the database community (and PostgreSQL) has worked hard to get right: for example, what happens if your database crashes while it's already in the middle of recovering from another crash, and then tries to recover again?
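The core WAL discipline is simple to state, even if getting it right in a production database takes years. Here is a minimal sketch in Go of the log-then-apply-then-replay idea only; PostgreSQL's actual WAL is far more involved (checksums, LSNs, full-page writes, checkpoints):

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

// A minimal write-ahead log: an append-only file that is fsynced before
// any write is acknowledged to the client.
type wal struct {
	f *os.File
}

// append durably records an entry BEFORE it is applied: first write to
// the log, then fsync so the entry survives a crash.
func (w *wal) append(entry string) error {
	if _, err := fmt.Fprintln(w.f, entry); err != nil {
		return err
	}
	return w.f.Sync() // only after fsync may we report "committed"
}

// replay re-applies every logged entry after a crash, restoring state
// up to the last acknowledged write.
func replay(path string, apply func(string)) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		apply(s.Text())
	}
	return s.Err()
}

func main() {
	f, err := os.OpenFile("demo.wal", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	w := &wal{f: f}

	state := map[string]bool{} // the "database"
	for _, e := range []string{"insert k1", "insert k2"} {
		if err := w.append(e); err != nil { // log first...
			log.Fatal(err)
		}
		state[e] = true // ...then update the main data structures
	}

	// After a crash, the log rebuilds the state:
	recovered := map[string]bool{}
	if err := replay("demo.wal", func(e string) { recovered[e] = true }); err != nil {
		log.Fatal(err)
	}
	fmt.Println("recovered entries:", len(recovered))
}
```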
InfluxDB had to design and implement all this functionality itself from scratch. This is a notoriously hard problem in databases that typically takes many years or even decades to get correct. Some metrics stores might be okay with occasionally losing data; we see TimescaleDB being used in settings where this is not acceptable. In fact, across all our users and deployments, we’ve had only one report of data being corrupted, which on investigation turned out to be the fault of the commercial SAN the user was employing, not TimescaleDB (and their recovery from backup was successful). InfluxDB forums, on the other hand, are rife with such complaints: “DB lost after restart”, “data loss during high ingest rate”, “data lost from InfluxDB databases”, “unresponsive due to corruption after disk disaster”, “data messed up after restoring multiple databases”, and so on.
These challenges and problems are not unique to InfluxDB; every developer of a reliable, stateful service must grapple with them. Every database goes through a period when it sometimes loses data, because it's really, really hard to get all the corner cases right, and eventually those corner cases come back to haunt some operator. But PostgreSQL went through this period in the 1990s, while InfluxDB still needs to figure these things out.
These architectural decisions have thus allowed TimescaleDB to provide a level of reliability far beyond its years, as it stands on the proverbial "shoulders of giants". Indeed, just one month after we first released TimescaleDB in April 2017, it was deployed to run the operator-facing dashboards in 47 power plants across Europe and Latin America. And so while InfluxDB (2013) was released several years before TimescaleDB (2017), we believe it still requires many years of dedicated engineering effort just to catch up, specifically because it was built from scratch.