Cheap Paxos. Leslie Lamport and Mike Massa. Appeared in The International Conference on Dependable Systems and Networks (DSN 2004).

Abstract

Asynchronous algorithms for implementing a fault-tolerant distributed system, which can make progress despite the failure of any F processors, require 2F + 1 processors. Cheap Paxos, a variant of the Paxos algorithm, guarantees liveness under the additional assumption that the set of nonfaulty processors does not jump around too fast, but uses only F + 1 main processors that actually execute the system and F auxiliary processors that are used only to handle the failure of a main processor. The auxiliary processors take part in reconfiguring the system to remove the failed processor, after which they can remain idle until another main processor fails.

1 Introduction

The state-machine approach consists of describing a system as a state machine that takes as input a sequence of client commands and produces a sequence of states and outputs [4, 10]. The state machine is implemented by a collection of servers. This reduces the problem of implementing a distributed system to that of having the servers choose a sequence of commands. Making the system reliable requires that all processors agree on each command in the sequence, despite the failure of some components. For asynchronous systems, we require that consistency be maintained in the face of any number of non-malicious (non-Byzantine) failures, and that progress be ensured when enough processors are nonfaulty and can communicate with one another in a timely manner [2]. The classic Paxos algorithm is an efficient, practical algorithm for achieving this [1, 5, 7].

Consider the problem of implementing a distributed system that can make progress if all but one processor is working. Previous algorithms, such as classic Paxos, require three processors.
Only two of those processors need maintain the system state, but a third processor must participate in choosing the sequence of commands. The following argument shows that this third processor is necessary. Suppose the system is implemented by only two processors, p and q, and suppose that q fails. The requirement that the system continue to make progress despite a single failed processor means that p must continue operating the system. Now suppose that p fails and then q is repaired. Since there is only one failed processor, q must be able to resume operating the system. But this is clearly impossible, since q does not know the current state of the system because it does not know what p did after q failed. Some third processor is needed: for example, a disk that can be accessed by both p and q.

Suppose we are willing to weaken the liveness requirement, so that if q fails and then p fails before q is repaired, then the system may halt until p is repaired. Two processors are still not enough if we require that consistency be maintained despite communication failure. With only two processors p and q, one processor cannot distinguish failure of the other processor from failure of the communication medium. Since consistency is lost if each processor continues operating the system by itself, the system cannot allow a processor to continue just because it thinks that the other processor has failed. A third processor is needed. However, that third processor does not have to participate in choosing the sequence of commands. It must take action only in case p or q fails, after which it does nothing while either p or q continues to operate the system by itself. The third processor can therefore be a small/slow/cheap one, or a processor primarily devoted to other tasks.
This argument suggests that there exists a method of implementing a one-fault-tolerant system, satisfying the consistency property of classic Paxos and a weaker liveness property, using two main processors plus a third auxiliary processor. This paper describes Cheap Paxos, a generalization of such an algorithm that tolerates F faults with F + 1 main processors and F auxiliary processors. It maintains liveness under a sort of amoeba assumption [3], under which the subnetwork of working main processors does not move around too quickly. The assumption can be described as follows. A nonfaulty processor maintains a certain knowledge of the system's state. When a faulty processor is repaired, it can, in a finite length of time, re-acquire this knowledge from any other processor that possesses it. Liveness is maintained as long as there is at least one main processor with knowledge of the system state and F + 1 processors (main or auxiliary) that are nonfaulty and can communicate with one another in a timely manner. Consistency is always maintained (assuming non-malicious failures).

There are two threads of previous work that superficially resemble Cheap Paxos. The first is the use of main processors that are replaced by spares if they fail [8]. Indeed, classic Paxos requires only F + 1 working processors to operate a system that tolerates F faults; the remaining F processors can be used as spares. However, unlike the auxiliary processors of Cheap Paxos, spares must have the necessary computing power to replace a failed main processor. The second thread is the use of dynamic quorum algorithms for maintaining multiple copies of a database. These algorithms can employ witness processors that need not maintain the data [9]. However, unlike the auxiliary processors of Cheap Paxos, these witnesses participate in each operation.

Two moderately recent developments in computing may make Cheap Paxos useful.
First, improvements to hardware and operating systems make computers less likely to crash. The weaker liveness guarantee of Cheap Paxos may therefore still provide sufficient reliability. Second, the widespread use of computers makes it more likely that an organization will have additional machines from which cycles can be stolen to implement the auxiliary processors. One might think that the low cost of computers would make Cheap Paxos uninteresting. However, we have observed that people are no more willing to use extra hardware to make a system simpler and more reliable than they were 40 years ago, even though that hardware has become orders of magnitude cheaper.

The following section reviews Paxos, and Section 3 describes Cheap Paxos. The obligatory conclusion follows.

2 A Review of Paxos

The Paxos algorithm for implementing a distributed state machine was introduced in [5]. We consider two versions of Paxos. In the basic version, to which we give the name Static Paxos, the set of servers is fixed. A variation that we call Dynamic Paxos, mentioned briefly in [5], uses state machine commands to change the set of servers. We begin by considering Static Paxos; Dynamic Paxos is explained in Section 2.3.

2.1 The Paxos Consensus Algorithm

To implement a distributed system as a state machine, the processors of the system must choose a sequence of commands. This is done by executing a sequence of instances of a consensus algorithm, the i-th instance choosing the i-th command in the sequence. We now review the Paxos consensus algorithm.

The goal of a consensus algorithm is for a collection of processes to agree upon a value. It is most convenient to phrase the consensus problem in terms of three classes of agents: proposers that propose values, acceptors that cooperate to choose a single proposed value, and learners that must learn what value has been chosen [6]. A single processor can act as more than one kind of agent.
The safety properties that a consensus algorithm must satisfy are:

Nontriviality: Only a value that has been proposed may be chosen.
Consistency: Only a single value may be chosen.
Conservatism: Only a chosen value may be learned.

There is also a liveness requirement that we do not try to state precisely; it is discussed informally below.

The Paxos consensus algorithm has been discussed elsewhere [1, 5, 6, 7], so we do not explain here exactly how it works. Instead, we just describe its actions. Paxos assumes an underlying procedure for selecting a leader. Safety is guaranteed even if no leader or multiple leaders are selected, but a unique leader is required to ensure progress. Proposers send their proposals to the leader. The consensus algorithm assumes predefined sets of acceptors called quorums. The only requirement on the quorums is that any two quorums have at least one acceptor in common. Paxos also assumes a set of ballot numbers, which for simplicity we take to be the natural numbers. The ballot numbers are partitioned among potential leaders, each potential leader having its own disjoint set of ballot numbers.

The consensus algorithm has two phases, each with two subphases. The algorithm's actions are described below. The algorithm sends messages between leaders and acceptors, and from acceptors to learners. Since the same processor may be playing multiple roles, it can send messages to itself.

Phase1a(l, b): Leader l chooses a number b from among its ballot numbers and sends ⟨1a, b⟩ messages to the acceptors.

Phase1b(a, b): When acceptor a receives a ⟨1a, b⟩ message from a leader l, if it has not received any message with a ballot number greater than b, then it replies to l with a ⟨1b, b, ...⟩ message, where the precise contents of the message do not concern us. If a has received a message with ballot number greater than b, it sends a reply to l indicating that it is ignoring the ⟨1a, b⟩ message.
(Upon receiving that message, l will perform a Phase1a(l, b′) action for some b′ > b, if it still believes itself to be the leader.)

Phase2a(l, b): If leader l has received ⟨1b, b, ...⟩ messages from a quorum of acceptors, then it sends a ⟨2a, b, v⟩ message to the acceptors where, depending on the contents of those 1b messages, either the value of v is determined by the 1b messages, or l chooses v arbitrarily from among the proposals it has received. This action may not be performed twice for different values of v (with the same b).

Phase2b(a, b, v): If acceptor a receives a ⟨2a, b, v⟩ message and it has not already received any message with a ballot number greater than b, it sends a ⟨2b, b, v⟩ message to every learner.

Learn(r, v, b): If learner r has received ⟨2b, b, v⟩ messages from a quorum of acceptors, then it learns that the value v has been chosen.

In normal execution, the actions occur in the order listed above, starting with the leader's Phase1a action. However, processes may fail, messages may be lost or delivered out of order, and several processors could simultaneously think they are the leader, causing 1a and 2a messages for several different ballot numbers to be sent concurrently. Nevertheless, the algorithm maintains its three safety properties: nontriviality, consistency, and conservatism. (We are assuming non-Byzantine failures, in which a process can halt but does not perform incorrect actions.) Moreover, if there is a single working processor l that believes itself to be the leader, has received a proposal, and can communicate with a quorum of acceptors, then some value will eventually be chosen. Any learner that can communicate with this quorum of acceptors will then learn the chosen value. We can allow failed processes to be restarted if they have stable storage that survives a failure.
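The phase actions described above can be sketched as a toy single-instance implementation. This is a sketch under simplifying assumptions (one leader, in-memory calls instead of messages, majority quorums); the class and function names are illustrative, not from the paper:

```python
# Toy single-instance Paxos consensus: one leader, in-memory acceptors.
# Names (Acceptor, run_instance) are illustrative, not from the paper.

class Acceptor:
    def __init__(self):
        self.max_ballot = -1   # highest ballot number seen
        self.accepted = None   # (ballot, value) accepted in Phase2b, if any

    def phase1b(self, b):
        """Promise to ignore ballots below b; report any accepted value."""
        if b > self.max_ballot:
            self.max_ballot = b
            return ('promise', self.accepted)
        return ('ignore', self.max_ballot)

    def phase2b(self, b, v):
        """Accept value v at ballot b unless a higher ballot was promised."""
        if b >= self.max_ballot:
            self.max_ballot = b
            self.accepted = (b, v)
            return True
        return False

def run_instance(acceptors, ballot, proposal):
    """Leader's Phase1a/Phase2a actions; returns the chosen value or None."""
    quorum = len(acceptors) // 2 + 1
    # Phase 1: collect 1b promises from a quorum.
    promises = [a.phase1b(ballot) for a in acceptors]
    ok = [acc for tag, acc in promises if tag == 'promise']
    if len(ok) < quorum:
        return None  # ballot was rejected; retry with a higher one
    # Phase 2: the 1b messages determine v, else use our own proposal.
    prior = [acc for acc in ok if acc is not None]
    value = max(prior)[1] if prior else proposal
    votes = sum(a.phase2b(ballot, value) for a in acceptors)
    return value if votes >= quorum else None

acceptors = [Acceptor() for _ in range(3)]
print(run_instance(acceptors, ballot=1, proposal='cmd-A'))  # cmd-A is chosen
# A later leader with a higher ballot must discover the same value:
print(run_instance(acceptors, ballot=2, proposal='cmd-B'))  # still cmd-A
```

The second call illustrates consistency: the 1b replies carry the previously accepted value, so a new leader is forced to re-propose it rather than its own proposal.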
Processes must maintain the following amounts of information in stable storage: an acceptor must keep two ballot numbers and one proposed value, and a leader must keep one ballot number (the largest one for which it has performed a Phase2a action). As described here, the algorithm never terminates. A leader can at any time perform a Phase1a action for a new ballot number. In an application, there will be some point at which enough processes have learned the chosen value, after which processes can forget all about this instance of the algorithm, erasing any information about it from their stable storage.

For later reference, we make the following observations.

O1. We can save messages at the cost of an extra message delay by having a single distinguished learner that informs the other learners when it finds out that a value has been chosen. Acceptors then send 2b messages only to the distinguished learner. In most applications, the roles of leader and distinguished learner are performed by the same processor.

O2. A leader can send its 1a and 2a messages just to a quorum of acceptors. As long as all acceptors in that quorum are working and can communicate with the leader and the learners, there is no need for acceptors not in the quorum to do anything.

O3. Acceptors do not care what value is chosen. They simply respond to 1a and 2a messages, using their stable storage to ensure that, despite failures, only a single value can be chosen. However, if an acceptor does learn what value has been chosen, it can store the value in stable storage and erase any other information it has saved there. If the acceptor later receives a 1a or 2a message, instead of performing its Phase1b or Phase2b action, it can simply inform the leader of the chosen value.

O4. Instead of sending the value v, the leader can send a hash of v to some acceptors in its 2a messages.
(A hash is a function H from values to a smaller set such that there is a negligible chance that H(v) equals H(v′) for two different values v and v′.) A learner will learn that v is chosen if it receives 2b messages for either v or its hash from a quorum of acceptors, and at least one of those messages contains v rather than its hash. However, a leader could receive 1b messages that tell it the hash of a value v that it must use in its Phase2a action without telling it the actual value of v. If that happens, the leader cannot execute its Phase2a action until it communicates with some process that knows v.

2.2 Implementing a State Machine

In the state machine approach, a set of servers execute commands submitted by clients. For simplicity, we assume that each server keeps in stable storage the entire sequence of state machine commands that have been chosen so far. In many applications, a server would keep only a recent checkpoint of the state machine's state and the commands after that checkpoint. In the traditional Paxos algorithm, the clients are the proposers, and each server acts as an acceptor, a learner, and a potential leader in each instance of the consensus algorithm. A quorum consists of a majority of the servers. The leader receives client commands, assigns each one a number, and tries to get the i-th command to be chosen by the i-th instance of the Paxos consensus algorithm.

To understand how Static Paxos works, suppose the system has been operating for a while when the leader fails. A new server l is then selected to be leader. Since l is a learner, it should know most of the commands that have already been chosen. Suppose it knows commands 1-134, 138, and 139; that is, the commands chosen in instances 1-134, 138, and 139 of the consensus algorithm. (Such a gap in its knowledge is possible because multiple instances of the consensus algorithm can be executed concurrently.)
Server l chooses a ballot number b that it believes to be greater than any ballot number used by previous leaders. (The election algorithm can be used to choose b as well as l.) It then simultaneously executes Phase1a(l, b) for instances 135-137 and for all instances greater than 139 of the consensus algorithm, sending 1a messages to all the servers. (Some of those messages are to itself, since the leader is chosen from among the servers.) It can obviously send these infinitely many virtual messages in a single physical message. Each server then simultaneously executes Phase1b actions in response to those virtual 1a messages, sending infinitely many virtual 1b messages back to l. Since those 1b messages contain information only for instances for which actions have been performed, those virtual messages will contain only a finite amount of information that can usually fit into a single real message. By observation O3 above, if a server knows that a command was already chosen by some instance, it responds with the chosen command rather than a 1b message for that instance.

Suppose that, from these messages, l learned:

- The command that was chosen in instance 135 (sent by some server instead of an instance-135 1b message).
- Commands v_137 and v_140 that it must use as the value v in its Phase2a(l, b) actions for instances 137 and 140, respectively.
- That for instance 136 and for all instances greater than 140, it can use any proposed command v in its Phase2a(l, b) action.

Leader l then does the following:

- It performs Phase2a(l, b) actions for instances 137 and 140, using the commands v_137 and v_140 determined by the 1b messages it received.
- It performs the Phase2a(l, b) action for instance 136, using as the command v a special no-op state machine command that does nothing.
- In some manner that does not concern us, it ensures that all servers know commands 1-135, 138, and 139.
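The new leader's decision procedure above (re-propose constrained values, fill other gaps with no-ops) can be sketched as a small planning step. The helper name and data representation are illustrative, not from the paper; the instance numbers mirror the example in the text:

```python
# Sketch of a new leader's gap-filling step after Phase 1 (section 2.2).
# plan_phase2 and its input encoding are illustrative, not from the paper.

NOOP = 'no-op'  # special state machine command that does nothing

def plan_phase2(chosen, constrained, next_free):
    """For each undecided instance below next_free, pick the Phase2a value:
    the value forced by the 1b messages if any, otherwise a no-op."""
    plan = {}
    for i in range(1, next_free):
        if i in chosen:
            continue  # already chosen; just disseminate, no Phase2a needed
        plan[i] = constrained.get(i, NOOP)
    return plan

chosen = set(range(1, 136)) | {138, 139}   # commands 1-135, 138, and 139
constrained = {137: 'v137', 140: 'v140'}   # forced by received 1b messages
print(plan_phase2(chosen, constrained, next_free=141))
# {136: 'no-op', 137: 'v137', 140: 'v140'}
```

Instance 141 (next_free) is where normal operation resumes with fresh client commands.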
If a majority of the servers are working, they will perform Phase2b actions for instances 136, 137, and 140, and all servers will learn the commands chosen for all instances of the consensus algorithm. However, even before that has happened, leader l can resume normal operation. It assigns the number 141 to the first client command it receives, and it executes Phase2a(l, b) for instance 141 using that command as the value v. It assigns number 142 to the next client command and executes Phase2a(l, b) for that instance and that command as value v. And so on.

Since each server is a learner, it learns the sequence of chosen commands. In most applications, the leader will act as the distinguished learner (mentioned in observation O1 above) to which 2b messages are sent. Once a server has learned what the i-th command is, it can delete all other information about the i-th instance of the consensus protocol from its storage.

When a failed server is repaired, it must be brought up to date so it knows all the commands that have already been chosen. In principle, this is a straightforward matter of having the newly repaired server obtain the information from some working server. If a server maintains only recent commands and a checkpoint of the state, then the repaired server must update its saved checkpoint. If the state machine maintains a large state, this must be done in such a way that only the part of the state that has changed is sent to the repaired server.

2.3 Dynamic Paxos

So far, we have described Static Paxos, in which the set of acceptors and the quorums are constant and fixed in advance. A system that must continue working despite the failure of any F processors then requires 2F + 1 servers. For example, with Static Paxos, it takes seven servers to tolerate three failures. In many systems, the best way to achieve the desired degree of fault tolerance is to reconfigure the system to replace failed servers by spares.
With reconfiguration, a system that uses three active servers and two spares can tolerate a total of three failures, if a failed server can be replaced by a spare before another failure occurs. Reconfiguration therefore allows fewer processors to tolerate the same total number of failures, though not the same number of simultaneous failures. (In most systems, simultaneous failures are much less likely than successive ones.)

In Dynamic Paxos, the set of acceptors and the quorums are determined by the state machine itself. Reconfiguration is performed by state machine commands. To explain how this works, let state_k be the state machine's state after executing command k. For k ≤ 0, define state_k to be the initial state. For some fixed constant α, we let the acceptors and quorums used for instance i of the consensus algorithm be determined by state_{i-α}. Before performing any action for instance i, a leader waits until it knows state_{i-α}. In other words, a leader must wait until it knows all commands through command n
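The rule that instance i's configuration is read from state_{i-α} can be sketched as a lookup that refuses to act until the needed state is known. This is a sketch under stated assumptions: the machine state is reduced to just the acceptor set, stored as a list indexed by command number; ALPHA and config_for_instance are illustrative names, not from the paper:

```python
# Sketch of Dynamic Paxos's alpha-window rule (section 2.3): the acceptor
# set for instance i is taken from the state after command i - alpha.
# The list-of-sets representation is illustrative, not from the paper.

ALPHA = 3  # fixed constant alpha from the text

def config_for_instance(i, states):
    """Return the acceptor set for instance i, or None if the leader does
    not yet know state i - alpha and must therefore wait."""
    k = max(i - ALPHA, 0)   # for k <= 0, use the initial state
    if k >= len(states):
        return None         # state not yet known: leader must wait
    return states[k]

# states[k] = acceptor set after command k; states[0] is the initial state.
states = [{'p1', 'p2', 'p3'}, {'p1', 'p2', 'p3'}, {'p1', 'p2', 'p4'}]
print(config_for_instance(2, states))  # initial state, since 2 - alpha <= 0
print(config_for_instance(5, states))  # state after command 2
print(config_for_instance(9, states))  # None: command 6 not yet known
```

The None case is exactly the waiting condition in the text: a leader cannot act for instance i until all commands through i - α have been learned.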