Advertisment

Understanding Paxos

author-image
PCQ Bureau
New Update

Paxos is a protocol for achieving consensus in a distributed system. Consensus is an important problem in distributed systems. By consensus, we mean agreeing on a single outcome and Paxos is an algorithm that helps in reaching an agreement for a single value at any point in time. The world today is abound with several varieties of distributed systems — the Facebook servers, Twitter systems and the new breed of cloud computing systems like Windows Azure, Google AppEngine and Amazon Elastic Cloud are just a few examples wherein large scale distributed systems and architecture are deployed. Central to these systems are several class of algorithms which when implemented produce a single system consisting of a massive collection of computing nodes which is not only scalable, but self reliant, fault tolerant and also highly available. In a distributed system, an individual functional entity- for example a file is distributed across a set of nodes. Distribution increases availability, fault-tolerance and also helps in balancing loads. But mere distribution is not enough to build a completely reliable system. To understand the problem of consensus and why it is required, let us think of distributing a single variable across a set of nodes. Several distinctive clients, each a different process, are interested in updating this single variable. Clearly, since the request for updates is from several places and there is no direct control on ordering of the updates, the single variable in question loses the ability to maintain consistency, resulting in unpredictable results. Consensus algorithms help in alleviating problems of such kind by automatically ordering the set of updates even under worst conditions of network failure or when processes die unexpectedly so that updates occurring to a single variable distributed across the set of nodes all appear same.

Advertisment

Paxos was originally proposed by Leslie Lamport. He is one of the forerunners in the areas of distributed system and to his credit lies several different algorithms distributed system related to Time and Clocks, Byzantine Fault tolerance, consistency, and snapshots among others. He is also particularly noted for his contribution to LATex. Paxos was invented while he was on in his vacation in the beautiful Greek inland, Paxos and the original paper actually is a description of a parliament session organized by a lost civilization. The characters in the papers are named after several computer scientists and depicted in a humorous way, unlike other serious computer science papers. It is however filled with complex maths and only after several years people started showing interest in this seminal work. Consequently, Lamport wrote one more paper titled 'Paxos Made Simple” and explained the core ingredients of the algorithm using every day English and minimalistic proof. In this article, we will try to understand what Paxos is all about and what the different varieties of Paxos that exists today are. And to do so we will proceed incrementally; understanding the simplest problem first and then gradually deep dive into the complexities that a distributed system poses. But before we move ahead, there are a few set of terminologies that we have to learn.

In Paxos, there are three prominent and different kinds of roles: 'Proposer', 'Acceptor' and 'Learner'. A Proposer proposes a new value, Acceptors accept the value sent by the Proposer or reject it and Learner simply learns from the acceptor that a new value is accepted. In most cases, however, the three roles as defined may not be separate processes in different machines but rather separate processes in the same machine. For example, acceptors may be learners as well and acceptors can be proposers also. In case of latter, this means that updates may arise from multiple different locations and the consensus algorithm some how has to select one among many. Apart from these roles, there is yet another role known as a 'leader'. A leader is a distinguished entity whose primary job is to act as a point of contact for other members of the same community and responsible for accepting, informing and executing instructions. For example, in case of a leader of the proposer community, it is the responsibility of the leader to choose the update among the set of updates received to the set of proposers and inform the acceptors. Similarly, in case of the learning community, a leader may be the sole point of contact which first receives an instruction to learn and subsequently convey it to its group members. Though, leadership brings about an aura of Server Centric computing but designers often choose to have leaders to reduce the complication of involving multiple partners of equality. Further, leaders are selected or elected by an election mechanism where participating members select one among a few based on some known criteria. We will not jump into this topic as this itself is a significant topic on its own and calls for a separate publication.

Advertisment

In an ideal scenario, there is a single proposer leader, a few acceptors and a set of learners. For the sake of simplicity, we will create a leader out of the acceptor set and a distinguished leader learner from the learner set. Once, we understand the basics we will relax these constraints and see how and where things can go wrong. We also assume that they are separate processes running in different workstations.

Figure 1 illustrates a very simple Paxos implementation. A proposer leader is assigned the task for all of the clients (C1, C2, Cn). The proposer sends a single proposal to a single Acceptor which carries the right of either rejecting a proposal or accepting it. And if accepted, it passes on the new agreed upon value to the learner leader. In this setup and also as per the basic Paxos protocol, there are two different passes required for each proposal to be agreed upon. First, the proposer proposes, the acceptor accepts or rejects and if accepted the proposer sends the value that it wants the learner to adapt to. The following defines the two different phases.

Advertisment

Phase I:

1. Proposer selects a Number N and sends a Prepare request to the majority of Acceptors.

2. When an acceptor accepts a Prepare message it promises to not accept any more request less than N but is free to accept messages above N. It send back a negative acknowledgment if it doesn't accept a prepare request or sends a positive acknowledgement with the highest number accepted so far along with its value.

Advertisment

Phase II:

1. Proposer collects all the messages and looks for majority. If the majority is not formed it can proceed and has to try once again. If the majority of the Acceptor has accepted the Prepare request, it selects the highest value so obtained in the previous promise message or chooses a new value if none such exists. Proposer sends an Accept! Message to the set of acceptors.

Advertisment

2. Acceptor receives the message and if it has still not answered a high ordered Prepare message, send the positive Accepted Message and broadcast to the set of learners of this new value. If, on the other hand, a high ordered Prepare message is accepted, a negative acknowledgment is sent and the proper is set to start once again the sequence.

Thus, for each proposal, the Proposer (1, in Figure 2) selects a natural number in increasing order and sends the request to the Acceptor (2, in Figure 2).

Advertisment

ending of request is known as 'Prepare'. The acceptor can only accept request in increasing order only and rejects proposals which are less than the accepted number. That is, if the proposal accepts a proposal with number N and subsequently the acceptor accepts a new proposal number M, where M < N, then the acceptor rejects the new proposal with number M. On, the other hand, if it is greater, it can accept and send a positive acknowledgment to the proposer. This positive acknowledgment is a 'Promise' and it includes the highest value seen so far. In Phase 2, after receiving a positive acknowledgment, the “Acceptor” either selects the highest value or chooses a new value and passes it on to the acceptor. The acceptor acknowledges the receipt, creates an Accepted Message and passes it on to the learners (3, in Figure 2) and to the proposer. A new value is learned now and subsequent reads will yield this new value.

Surprisingly, this is simple. But as we know, distributed system doesn't include a single proposer, acceptor or a learner. For a single system defined above, it is just the four tier architecture that we generally encounter in several different client server based enterprise architecture designs. There are failure points though, like for example what if the acceptor goes down after the first phase/round is complete and what if the proposer itself goes down after sending the accept message. However, such complications can be trivially resolved if we choose to use a stable storage to record what is sent, received, time outs and gracefully allow each of these entities to recover from a failure. But in a distributed system design, it is not that easy. There is no single acceptor, learner and a proposer but rather a multiple of them co-exists together. Generally, each of these group forms a 'Quorum' and every protocol use the majority to accept and deny much like the democratic society. In Paxos, such a group exists too and so when a proposer wants to send a prepare message, it sends them to all of acceptors (see figure 2). This is a selective broadcast and each such acceptor when receiving a message returns the same set of messages as described in the two phases above. If however, there are multiple proposers and each one of them starts proposing values at the same time, the original prepare fails and a new round for prepare-promise-accept-accepted needs to be initiated. Traditionally, this could go on and on but using a few tricks and modifying the existing protocol, this cycle of update dilemma can be solved. For every prepare message received by an acceptor and if there are conflicts, the proposers themselves can form a quorum and selectively order the set of prepare by electing a leader among themselves. Note the figure 2, there are a series of messages that get's generated and sometimes Paxos may thus create too much of messages for the two rounds as described above.

There are a large number of Paxos implementations that exists today and some of them are open source one's as well. Further, several companies and individuals have put forth their experience reports and results of implementing Paxos in their own environment. One interesting aspect is that there are several optimizations that exist today and these optimizations either reduce the time to reach consensus or deal with other aspect of distributed systems when Paxos is used. Accordingly, there are Cheap Paxos, which loosens the liveliness on and the number of active processes required, Fast Paxos which reduces the delay, the number of messages, Byzantine Paxos which takes care of faulty nodes etc. In most deployments however, a new optimized form of Paxos called Multi Paxos is used. Multi Paxos relaxes the stringent requirements by relaxing and eliminating a few message types and thus reducing the delay as well as the number of messages so generated. In fact, in practice the original Paxos is hardly used, most engineers prefer to use Multi-Paxos along with some more optimizations. In the references section, I have listed a few papers which are worth reading when you try to write your own Paxos system. In real life, you will find implementations of Paxos in several different products. Chubby (Google), ZooKeeper (Yahoo!), AutoPilot (Microsoft), Keyspace are some of the distinguishing products where Paxos is used in several forms.

One interesting aspect however that we want to put forth is the agility and principle topic which Paxos intends to solve. More or less, the consensus agreement is a social problem and we encounter such kind of problems sometimes in the meetings we do to reach a conclusion over a particular topic where a number of solutions exists and proposed by distinct individuals. Distributed Systems are like this natural systems only; a lot of problem statements still exists to improve and understanding of our every day life and the nature around us may perhaps provide us an interesting insight to the solutions for the hard problems we are trying to solve!.

Advertisment