Building on the ideas of Leslie Lamport in his seminal 1982 paper “The Byzantine Generals Problem” (also recently discussed by Vitalik Buterin on his website), Radix uses a consensus algorithm that can withstand up to a theoretical 99% dishonest nodes, while still being able to determine the truthful events from the malicious ones.
Our last article looked at how the shards and data architecture of Radix allows for linear scalability. While this is important, scalability is useless if we cannot reach consensus. This is what the next two articles will focus on, first looking at simple conflict resolution before moving on to discuss more exotic conflict resolution scenarios.
As mentioned in the previous article, the permissionless consensus system employed by Radix applies to every transaction at the shard level. Radix uses logical clocks to order and validate transactions, as well as to determine the earliest transaction in the event of two (or more) conflicting transactions, with the conflicting transaction(s) ordered and discarded.
We call this process Asynchronous Byzantine Fault Detection - a review of the basics and why Fault Detection (when properly implemented) is so effective, please see this excellent paper by Haeberlen et Al - The Case for Byzantine Fault Detection
This relies on the network always being able to identify the difference between a main net and a sub net - the details of how we achieve this will be the subject of a later blog post. This blog post will simply give an overview of how we achieve event ordering on a shard, assuming no malicious subnets.
What are logical clocks?
In a Radix Universe, an “event” is a general term for something that the ledger has to do at the request of a user. This could be a transaction, it could be a message, or it could be the updating of a state in a Smart Contract. A Node is a computer running part or all of the Radix universe - we will discuss these components in more depth in another article.
Logical clocks are essentially a counter measuring events seen by nodes; every time a node sees an unseen and valid event it adds ‘1’ to its logical clock counter as follows:
- Node receives a new event
- Node checks its own database to check if it has seen it before, as well as doing basic cryptographic validity checks
- If it is a new event for the Node, it increments its logical clock by 1 (e.g. from 0 to 1 if it is the first new transaction it has ever seen) and associates that value with the event in it’s own database
- It then stamps the event with that logical clock value, as well as its own signature (so that it cannot be forged by other nodes) and then passes that event on to other nodes in the network
- As the event moves through the network it collects the signatures and Logical Clock time stamps of all the Nodes that it touches until a threshold (depending on network size)
To jump into this a bit more, see this explainer video by Dan:
Determining relative order
Because the way the ledger is constructed, a spend from an address can NEVER be from two different shards - a wallet address determines the shard that wallet may spend from. This means that all related events (such as a double spend) MUST always be on the same shard.
Nodes are constantly gossiping these events to each other. This gossiping of logical clock values is used in conjunction with a node’s own logical clock values to allow nodes to adjust the absolute order of transactions and slot in new events into the right place.
In most scenarios, no further action is required by the Node. A Node hears about a new event, adds that event to their database with their current logical clock value, and gossips that out. If they hear no conflicting information back, then the transaction may be considered to be truthful. It is an incredibly efficient, passive consensus mechanism that only requires the comparison of two events in the case that there are two events that conflict.
A simple example of a conflict would be two spends of the same consumable (e.g. a double spend). In 95% of cases, this is simply a matter of comparing the collected logical clock values from all the nodes both transactions touched, and the transactions that came before and after that specific transaction.
For example, event X and Y conflict. If a node knows transaction X took place before transaction Z, but does not know when Y occured. It simply has to find a node that saw both Y and Z. If Y is found to have occured after Z, then Y may be discarded as the later transaction and the invalid second spend.
If Z does not exist, or Y occured before Z, then the second stages of Radix consensus are engaged. This second stage is called Mass, and will be the subject of future blog posts, but a brief introduction can be found at the end of our Alpha launch video here:
This system also allows new nodes to ascertain the correct order quickly. Imagine Node A has a logical clock value of 10 and the newly joined Node B has seen no events. Node B would ask Node A for the values from its ledger and therefore see all of the events Node A has in the order they occurred. It would then add that data to its own ledger. This can be cross checked against multiple other Nodes, allowing Node B to quickly build a complete relative ordering of past events before it can start participating in the new event consensus.
Keeping nodes honest
Although this allows us to establish relative order, it does not prevent nodes from lying about when they saw a transaction. Each logical clock is locally owned and the node could falsify the timing or the order of events.
To solve this, actions are associated with the node responsible for creating it to prove the node is faulty in the event of a failure (which provides aBFD). Radix uses what are known as Commitments to audit the logical clock output of nodes, ensure accountability and prevent tampering. A Commitment is a unique identifier created by five bits of information:
- The node’s logical clock value at time of Commitment
- The ID of the node submitting the Commitment
- The timestamp for that node at time of Commitment
- A signature from the node to prove it produced the Commitment
- A Merkle hash
The initial four are fairly self-evident but the Merkle hash may require explanation. Each protocol event will have a unique reference known as a hash (a long string of text and numbers). We essentially add a number of these hashes together to give us a unique Merkle hash.
We do this as it allows for nodes to check the work of the submitting node. Not all nodes will have seen the relevant events included but still need to know if the Commitment is valid. To make sure the node isn’t lying, other nodes need a means to examine the data. The Merkle Hash is what allows them to accomplish this.
For example, assume a node is reviewing a Commitment from another node which has four events in it, but the reviewing node only knows about Event A. Before validating the Commitment, it first asks neighboring nodes for information on Events B, C and D. This will allow the node to know the order events were seen in (by their logical clock values).
The node can then check the Merkle Hash to make sure that the first node was telling the truth about when Event A occurred and that the four events occurred in that order. If these conditions aren’t met then the Merkle Hash will be different and as such so will the Commitment. This means the submitting node may have lied or tampered with the logical clocks of when it had seen events.
Furthermore, each new Commitment incorporates the information from the previous Commitment into it, which makes it harder to lie about past transactions. Ultimately, this simple technique means that all nodes MUST strictly increment their logical clocks by 1 for each new event as anything else is trivially detectable by the other nodes.
To dive into this a bit more, here is Dan again!
The final problem we face is that nodes could pick and choose when to submit Commitments or not produce Commitments at all. This would make it harder to know when they actually saw events. This is solved by forced periodic Commitments where nodes must share information. This means a node is unable to say it saw Event Z at a particular moment in time and then later lie, ensuring logical clocks increment strictly in order.
As mentioned above, this is a very passive consensus mechanism. It only needs to operate in the case of a conflict. In a network not under attack, this is less than 1 in 10,000 transactions (non-malicious faults can still cause issues). In 95% of cases, the above simple consensus mechanism resolves the conflict in a asynchronous byzantine fault detecting way.
The remaining 5% of non-malicious faults (1 in 200,000) or in the case of some forms of malicious attacks require a more exotic conflict resolution using Mass. This will be the subject of a later blog post.
To learn more about the simple Radix consensus mechanism, please see our white paper here: https://papers.radixdlt.com/tempo/latest/