Understanding How Apache Pulsar Works


Let’s look at the scenarios in which RabbitMQ and Kafka can lose acknowledged writes and see whether they apply to Pulsar.

RabbitMQ. Split-brain with either ignore or autoheal mode.

The losing side of the partition loses any messages delivered since the partition began that were not consumed.

Split-brain of the storage layer is not possible with Apache Pulsar, even in theory.

Apache Kafka. acks=1 and the broker hosting the leader replica dies.

Failover to a follower in the ISR occurs, with potential message loss: the ack was sent once the leader had persisted the message, but possibly before any follower had fetched it.
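
For reference, acks is a client-side producer setting. A minimal sketch of a Java producer configured with acks=1 might look like the following; the broker address and topic name are placeholders.

```java
// Minimal sketch of a Kafka producer using acks=1.
// "kafka-1:9092" and "orders" are placeholder names.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksOneProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-1:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // acks=1: the leader acknowledges as soon as it has written the message
        // locally, without waiting for any follower to fetch it.
        props.put("acks", "1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "key", "value"));
        }
    }
}
```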

Apache Pulsar has no leader storage node. Given a replication factor (Qw) of 2 or more, there simply is no way for a single node failure to cause message loss.

Let’s consider two scenarios.

Scenario 1. E = 3, Qw = 2, Qa = 1. The broker sends the write to two bookies. Bookie 1 and Bookie 2 both return an ack to the broker, which then sends an ack to its client. Now, to produce message loss, we’ll need both Bookie 1 and Bookie 2 to fail. If any single bookie dies, the AutoRecovery protocol will kick in and re-replicate its data.

Scenario 2. E = 3, Qw = 2, Qa = 1. The broker sends the write to two bookies. Bookie 1 returns an ack to the broker, which then sends an ack to its client. Bookie 2 has not responded yet. Now, to produce message loss, we’ll need both the broker and Bookie 1 to fail and Bookie 2 to have not successfully made the write. If only Bookie 1 dies, the broker will still end up writing the message to a second bookie.

The only way a failure of a single node could cause message loss is if Qw is 1 (which means no redundancy). Then the only copy of a message can be lost when its bookie fails. So if you want to avoid message loss, make sure you have redundancy (Qw >= 2).
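
In Pulsar, E, Qw and Qa are configured per namespace. Assuming the Pulsar Java admin client, a sketch of setting the values used in the scenarios above could look like this; the admin URL and namespace name are placeholders.

```java
// Sketch of setting ensemble size (E), write quorum (Qw) and ack quorum (Qa)
// for a namespace via the Pulsar Java admin client.
// "http://localhost:8080" and "public/default" are placeholders.
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.PersistencePolicies;

public class SetPersistence {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {
            // E=3, Qw=2, Qa=1 as in the scenarios above; the final argument is the
            // mark-delete rate limit (0 disables throttling).
            admin.namespaces().setPersistence("public/default",
                    new PersistencePolicies(3, 2, 1, 0.0));
        }
    }
}
```

Raising Qa to 2 would require both bookies to have persisted an entry before the producer receives its ack, at the cost of higher write latency.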

Apache Kafka. The broker hosting the leader replica is isolated from ZooKeeper.

This causes short-term split-brain in Kafka.

With acks=1, the leader continues to accept writes until it realizes it cannot talk to ZooKeeper, at which point it stops accepting writes. Meanwhile, a follower has been promoted to leader. Any messages persisted to the original leader during that window are lost when the original leader rejoins as a follower.

With acks=all, if the followers fall behind and are removed from the ISR, then the ISR consists of the leader alone. If the leader then becomes isolated from ZooKeeper, it continues to accept acks=all writes for a short while, even after a follower has been promoted to leader. The messages received in that short window are lost when the original leader rejoins as a follower.
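
For comparison, here is the same minimal producer sketch with acks=all (placeholder names again). The key point is that acks=all waits only for the replicas currently in the ISR, which is why a shrunken ISR weakens the guarantee.

```java
// Minimal sketch of a Kafka producer using acks=all.
// "kafka-1:9092" and "orders" are placeholder names.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksAllProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-1:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // acks=all: wait for every replica currently in the ISR. If the ISR has
        // shrunk to the leader alone, the leader's own write is enough for an ack.
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "key", "value"));
        }
    }
}
```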

Apache Pulsar cannot have split-brain of the storage layer. If the current broker owner gets isolated from ZooKeeper, suffers a long GC pause, or has its VM suspended, and another broker becomes the owner, still only one broker can write to the topic. The new owner fences the ledger, preventing the original owner from making any writes that could later be lost.
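
To illustrate the fencing idea (a hypothetical sketch of the concept, not BookKeeper’s actual implementation): once the new owner fences a ledger on a bookie, any further add requests from the previous owner are rejected.

```java
// Hypothetical sketch of ledger fencing on a bookie: after a new owner fences
// a ledger, add requests from the previous owner are rejected. This illustrates
// the concept only; it is not BookKeeper's actual code.
import java.util.HashSet;
import java.util.Set;

class FencingSketch {
    private final Set<Long> fencedLedgers = new HashSet<>();

    // Called by the new topic owner while recovering the ledger.
    synchronized void fence(long ledgerId) {
        fencedLedgers.add(ledgerId);
    }

    // Called for every add request. Once fenced, the old owner's writes fail,
    // so no acknowledged-but-orphaned entries can accumulate on the old owner.
    synchronized void addEntry(long ledgerId, byte[] entry) {
        if (fencedLedgers.contains(ledgerId)) {
            throw new IllegalStateException("Ledger " + ledgerId + " is fenced; write rejected");
        }
        // ... append the entry to the journal and ledger storage ...
    }
}
```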

Apache Kafka. acks=all with leader failure.

The followers fall behind and are removed from the ISR. Now the ISR consists of a single replica: the leader. The leader continues to ack messages, even with acks=all, because it is the only member of the ISR. Then the leader dies, and all unreplicated messages are lost.

Apache Pulsar uses a quorum-based approach where this cannot happen: an ack can only be sent once Qa bookies have persisted the message to disk.
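
A simplified, hypothetical sketch of that quorum logic (not the real BookKeeper client code): the entry goes to the Qw bookies of the write set, and the producer is acknowledged only once Qa of them have responded.

```java
// Simplified, hypothetical sketch of quorum-acknowledged writes: the entry is
// sent to Qw bookies and the producer's ack completes only after Qa of them
// have persisted it. Illustration only; not the actual BookKeeper client.
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

class QuorumWriteSketch {

    interface Bookie {
        CompletableFuture<Void> addEntryAsync(byte[] entry);  // hypothetical bookie API
    }

    CompletableFuture<Void> writeEntry(byte[] entry, List<Bookie> writeSet, int ackQuorum) {
        CompletableFuture<Void> producerAck = new CompletableFuture<>();
        AtomicInteger acks = new AtomicInteger(0);
        for (Bookie bookie : writeSet) {               // writeSet contains Qw bookies
            bookie.addEntryAsync(entry).thenRun(() -> {
                // Complete the producer's ack as soon as Qa bookies have persisted it.
                if (acks.incrementAndGet() == ackQuorum) {
                    producerAck.complete(null);
                }
            });
        }
        return producerAck;
    }
}
```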

Apache Kafka. Simultaneous loss of power (data center outage).

Kafka acknowledges a message once it has been written to memory (the page cache); it fsyncs to disk only periodically. When a data center suffers a power loss, all servers can go offline at the same time, so a message that exists only in memory on every replica is lost.
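
Kafka’s flush behaviour is tunable. For example, the topic-level flush.messages setting forces an fsync every N messages, at a significant throughput cost. Here is a sketch using the Java Admin client to force a flush on every message for one topic; the broker address and topic name are placeholders.

```java
// Sketch: force an fsync per message for one topic by setting flush.messages=1.
// "kafka-1:9092" and "orders" are placeholders; this trades throughput for durability.
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class ForceFlushPerMessage {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-1:9092");
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("flush.messages", "1"), AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> update =
                    Map.of(topic, Collections.singletonList(op));
            admin.incrementalAlterConfigs(update).all().get();
        }
    }
}
```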

Apache Pulsar only acks a message once Qa bookies have acknowledged it, and a bookie only acknowledges an entry once it has been persisted (fsynced) to its journal file on disk. A simultaneous power loss across all servers should not lose messages unless multiple disk failures also occur.
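
To illustrate why (a sketch of the idea, not BookKeeper’s actual write path): the bookie appends each entry to its journal and fsyncs before acknowledging, so an acknowledged entry survives a power loss.

```java
// Illustrative sketch of a bookie's journal write path: the entry is only
// acknowledged after being appended to the journal and fsynced to disk.
// Concept only; not BookKeeper's actual implementation.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class JournalSketch {
    private final FileChannel journal;

    JournalSketch(Path journalFile) throws IOException {
        this.journal = FileChannel.open(journalFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    // Returns only once the entry is durable on disk, so a power loss after the
    // ack cannot lose it (barring disk failure).
    void addEntry(byte[] entry) throws IOException {
        journal.write(ByteBuffer.wrap(entry));
        journal.force(true);  // fsync: flush the data (and metadata) to disk
    }
}
```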

So far Apache Pulsar is looking pretty robust. We’ll have to see how it fares in the chaos testing.

Conclusion

There are more details that I have either missed out or don’t yet know about. Apache Pulsar is significantly more complicated than Apache Kafka in terms of its protocols and storage model.

The two stand-out features of a Pulsar cluster are:

  • Separation of brokers from storage, combined with BookKeeper’s fencing functionality, elegantly avoids split-brain scenarios that could provoke data loss.

  • Breaking topics into ledgers and fragments, and distributing those across a cluster, allows Pulsar clusters to scale out with ease. New data automatically starts being written to new bookies. No rebalancing is required.

Plus, I haven’t even gotten to geo-replication and tiered storage, which are also amazing features.

My feeling is that Pulsar and BookKeeper are part of the next generation of data streaming systems. Their protocols are well thought out and rather elegant. But with added complexity comes added risk of bugs. In the next post we’ll start chaos testing an Apache Pulsar cluster and see if we can identify weaknesses in the protocols, and any implementation bugs or anomalies.