Anyone who’s ever taken an information-theory class probably knows of
ACID properties and the CAP theorem, but most people seem to think that this is a problem that only distributed system engineers need to be concerned about…
After all isn’t my database supposed to guarantee ACID properties for reads and writes across time and space?
Why should I as an application developer concern myself with making sure that the transactions made by my application are propagated consistently and atomically across shards?
With changing data models however, the definition of what constitutes a “database” has shifted and the guarantees of ACID can no longer be taken for granted. Scale-out data models are faced with unique challenges with no perfect solutions and lots of trade-offs…and often the brunt of dealing with the consequences falls on the application programmer.
There are schools of thought on both sides of the equation, with eventual-consistency advocates arguing that not all applications need strong ACID guarantees across documents and the speed gains by giving up on ACID are worth it for a wide variety of use-cases. Other industry leaders and think-tanks advocate for strong-consistency in the data layer simply because it allows application developers to write scalable, bug free code without being distributed system experts themselves. In the MillWheel paper, Google advocates that the Data Layer must be strongly-consistent and be able to make distributed ACID transactions.
By allowing users to focus solely on their application logic, this kind of programming model allows users to reason about the semantics of their system without being distributed systems experts. In particular, users are able to depend on framework-level correctness and fault-tolerance guarantees as axiomatic, vastly restricting the surface area over which bugs and errors can manifest.
But why is this such a hard problem?
Well, one of the problems is time itself.. it’s hard to get different systems to agree on a time, physical clocks hardly ever agree with each other.. and using networked time means that it takes time to send time. How do you maintain a sequential execution schedule when the nodes cannot reach a consensus regarding the order of operations? Logical clocks can be useful in globally sequencing events but often have no relation to physical time.
Your start-up is now a unicorn and you start seeing orders coming in from all over the world. To cater to the growing customer-base, you scale out your cool NoSQL engine to additional regions and maybe even across public-clouds. Furthermore, to ensure faster reads across a single document, your database is now partitioned across clusters. How do you ensure that a transaction being committed on a node is propagated atomically to other nodes and no conflicting transactions are happening on the other node at the same time? How do you trade-off between availability and consistency in case of network-partition (node being down or heart-beat timeout)?
Multi-phase commit protocols and pessimistic read-write locking are solutions for ensuring atomicity in distributed transactions but they don’t help you much in case of network partitions.
Let us consider a sample application with both optimistic and pessimistic data models.
Earth-to-Echo Travel Services — providing holiday packages to premium Martian destinations.
Your micro-services based app allows honeymooners and business travelers alike to book shuttle tickets to Martian resorts. Obviously since Bitcoin is the currency that is acceptable in both Earth and Mars, you allow bookings to be made via Bitcoin.
In the optimistic data model, the application asynchronously fires updates on the payments and tickets database and marks the transaction as success without waiting for any confirmations from the blockchain or the ticketing service.
The pessimistic data model creates a blocking operation which holds the lock until the required number of confirmations have been received from the blockchain and then further holds it until ticketing confirmation has been received. If any one of the operation fails, It rolls-back the transaction and issues a refund if part 1 has already finished.
Practical models lie somewhere in between where provisional records created while transaction is still pending may allow nested transactions to continue while waiting for confirmation.
All Or Nothing guarantees.
Every business event must result in a single synchronous commit and the database must be the source of truth(Or as close to the truth as possible). In order for a Distributed Transaction to be atomic, it must be rolled-back if consensus cannot be achieved, so as to not leave the system in an inconsistent state.
Or how to read a record without holding a lock on the artifact.
MVCC is a protocol that allows multiple timestamped versions of the same artifact to be stored in the db. This in turn allows consistent reads from snapshots at the most recent timestamps to be made without holding locks on the writes. The most commonly used isolation model used with MVCC is the Snapshot Isolation Model.
By reconciling the time received between Atomic Clocks and GPS systems, the database can significantly reduce the bounded error on a global synchronized clock. A daemon in every node is checking in with masters trying to reach a consensus about the global time. A method known as Commit-Wait, then ensures that commits of related transactions are separated by at-least the bounded error. This method is known as TrueTime — a flagship idea of Google.
Another common method of achieving globally-consistent timestamp consensus on distributed systems is by using Hybrid Time which combines the advantages of physical and logical clocks.
The Hybrid Time algorithm does not need to commit-wait for error bound in most cases and ensures that —
Events connected by a causal chain of the form “A happens before B on the same server” or “A happens on one server, which then sends an RPC to another server, where B happens”, always get assigned hybrid timestamps in an increasing order.
All nodes must see the same results of an operation at any given time.
Distributed Transactions on eventually-consistent models have rightly been called the icebergs of micro-services. As written by the author —
For those who think mostly about the happy paths in their system, distributed transactions will lie in wait, showing no harm day by day, until a little glitch happens in your network and suddenly all hell breaks loose.
In an excellent explanation about the pitfalls of eventually-consistent and timeline-consistent models, Karthik, CTO of YugaByte and one of the original authors of Cassandra dispels myths about scalability of eventual-consistent models and an even bigger myth that strongly consistent (or CP systems) sacrifice performance.
Let’s consider an example from the banking domain. The Iron-Bank of Braavos was known as the richest bank ever. We draw reference from the conversations between Cercei and Tycho. The premise is that Iron Bank wanted confirmation that Jamie had arrived with the gold before increasing Cercei’s credit line and was simply not willing to take Cercei’s promise that the gold was on it’s way and will be repaid soon.
The fact that the banker was not willing to make a weakly-consistent transaction that could violate the ACID properties of the ledger prevented what could have possibly led to the downfall of the Lannisters and a swift ending to the war for the Iron-Throne :)
Let us now consider what would have happened if the Iron Bank had an weak consistency model. Using Quorum Reads and Writes in an eventually consistent model will give inconsistent results in failure scenarios.
Consensus is the process by which multiple nodes agree on a single result to guarantee consistency among them. In the paper Impossibility of distributed consensus with one faulty process the authors state that no asynchronous protocol can always reach consensus in a bounded time, in the event of even a single node failure.
Perhaps one of the most widely used distributed consensus algorithms, this contains two phases — The Voting Phase and the Commit phase.
In the voting phase, the transaction manager asks every node for the result, then decides the correct value based on majority consensus and gives it back to the nodes. If everyone agrees, the transaction manager contacts every participants again to let them know about the final value. Otherwise, contact every participant to abort the consensus. The participants do not have to agree on a value but have to agree on whether or not to agree on the value provided by the Transaction manager.
PAXOS is a method of achieving consensus in a network with unreliable nodes. It contains the following roles — a Client which issues a request to the distributed system, Acceptors which form a quorum, a Proposer is an advocate for the client request trying to convince acceptors to accept it, a learner takes action once a request has been accepted, a leader is who is also a Distinguished Proposer and a Distinguished Learner. The leader then chooses the “valid result” and sends it to all the nodes, the nodes can reply with an accept or a reject. If a majority of nodes accept then the value is committed. Apache Zookeeper and Google Spanner use the PAXOS algorithm to achieve consensus.
The RAFT consensus algorithm is similar to PAXOS in guarantees of performance and efficiency. It contains three actors — leader, followers and candidates. Leader election happens through heartbeats and random timeouts. The leader accepts log entries from clients and propagates them across clusters. YugaByte uses RAFT to achieve consensus and arbitrarily add, remove or change nodes in the cluster in an online fashion. It can also tolerate the failure of a minority partition as long as enough nodes are online to hold a leader-election.
PBFT is a consensus protocol that can tolerate Byzantine faults. Byzantine faults occur due to system failures or when all the actors are not behaving in altruistic ways. For eg. a byzantine fault may occur when an actor does not return any result, a deliberately misleading result or returns different results to different parts of the system.
In a POW type algorithm, a random miner tries to mine the next block by spending huge computing resources. The block is then accepted and added to the chain. The majority is decided by the longest chain. The difficulty of mining is constantly changing to control block generation rate. Bitcoin, Ethereum and countless other digital currencies acquire consensus amongst nodes via POW. Since this a 1-CPU-1-Vote model, the attacker needs 51% of computing power to compromise the network.
The more stake a candidate has the greater the chances of him/her mining the next block. POS implementations can be either Chain based on BFT based. Neo and Cardano are two currencies using POS to achieve consensus in the network. There are also proposals to move Ethereum to a POS consensus system in the Casper update.
- Snapshot Isolation
This isolation level guarantees that all the reads made on a particular timestamp are consistent across nodes and the transaction itself will successfully commit only if no conflicting updates are being made. MongoDB version 4.2 is likely to support Snapshot Isolation for consistent reads across documents, culminating a multi-year engineering effort.
- Serializable Isolation
This isolation level guarantees that all transactions would run in a linear(serializable) schedule.
YugaByte supports snapshot-isolation with serializable isolation support being on the roadmap.
For durability in distributed systems, the TimeStamps (Hybrid or Atomic) of each update need to be recorded, so that it is possible to recover the state of the document from any point in the past. These versioned updates need to subsequently be garbage collected when there are no transactions reading a snapshot at which the old value would be visible.
YugaByte uses a Log Structure Merge Tree based key-value store with updates being replicated via the RAFT protocol. The mutation of records are versioned using timestamps and uses bloom-filters and range-query optimization for faster reads.
In the article Practical Tradeoffs in Google Cloud Spanner, Azure Cosmos DB and YugaByte DB, the author does quite well in explaining the trade-offs that need to be considered when choosing between the three scale-out databases. Here I will summarize some highlights from the text briefly —
Google Spanner internally divides the data into chunks called splits and assigns each split to a different node. It then replicates each split and distributes them amongst the nodes using the PAXOS consensus algorithm.
Global clock synchronization is done via TrueTime which provides minimum bounds of timestamp error. Using commit-wait for error bounds, it is able to achieve External-Consistency meaning all transactions will follow a global serializable schedule even on nodes in different regions. Google Spanner however, is proprietary, works on specialized GCP hardware (cloud lock-in) and only supports a single API (SQL).
Azure Cosmos DB is a multi-model, polyglot distributed data solution with 5 different consistency levels to choose from. It supports the Cassandra, Mongo and Graph APIs amongst others and uses proprietary replication and consensus algorithms. To prevent write unavailability in the case of partitions for the Strong-consistency mode of operations, it limits accounts with strong consistency guarantees to a single Azure region. It is not open-source and works only on Azure hardware (Cloud lock-in).
YugaByte is an infrastructure-agnostic, distributed data-store which can be deployed across Public Cloud and OnPrem datacenters. YugaByte DB shards the data into a configurable number of tablets, which are then distributed evenly amongst the nodes in the cluster. Each tablet is further replicated in the other nodes via the RAFT consensus algorithm.
It provides Strong Consistency guarantees across regions and zones, uses hybrid timestamps for global timekeeping and provides snapshot isolation via MVCC with serializable isolation being in the works. The codebase is open-source and compatible with Redis, Cassandra and Gremlin APIs and SQL support being in the works.
Any distributed data-store — be it a database, a messaging queue or a blockchain must behave in consistent and predictable ways even in cases of network failures and as application developers, we must be mindful of the choices that we make when choosing a data model because it will come to define the life of the product you are building and everything surrounding it.
If you liked this article there’s 50 ways(claps) to show you appreciation :)