Raft: 寻找可理解的共识算法(1)

s09g

发布于 2022-07-06 15:05:15

4370

发布于 2022-07-06 15:05:15

文章被收录于专栏：s09g的技术博客

In Search of an Understandable Consensus Algorithm (Extended Version)

Diego Ongaro and John Ousterhout

Stanford University

Abstract

Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems. In order to enhance understandability, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Results from a user study demonstrate that Raft is easier for students to learn than Paxos. Raft also includes a new mechanism for changing the cluster membership, which uses overlapping majorities to guarantee safety.

摘要

Raft是一种用于管理复制日志的共识算法。它产生的结果等同于（多）Paxos，而且与Paxos一样高效，但其结构与Paxos不同；这使得Raft比Paxos更容易理解，也为构建实用系统提供了更好的基础。为了提高可理解性，Raft分离了共识的关键要素，如领导者选举、日志复制和安全，而且它执行了更强的一致性，以减少必须考虑的状态数量。一项用户研究的结果表明，Raft比Paxos更容易让学生学习。Raft还包括一个改变集群成员的新机制，它使用重叠多数来保证安全。

1 Introduction

Consensus algorithms allow a collection of machines to work as a coherent group that can survive the failures of some of its members. Because of this, they play a key role in building reliable large-scale software systems. Paxos [15, 16] has dominated the discussion of consensus algorithms over the last decade: most implementations of consensus are based on Paxos or influenced by it, and Paxos has become the primary vehicle used to teach students about consensus.

共识算法允许一组机器作为一个连贯的团体工作，可以承受一些成员的故障。正因为如此，它们在构建可靠的大规模软件系统中发挥了关键作用。Paxos[15, 16]在过去十年中主导了关于共识算法的讨论：大多数共识的实现都是基于Paxos或受其影响，而Paxos已经成为教学生了解共识的主要工具。

Unfortunately, Paxos is quite difficult to understand, in spite of numerous attempts to make it more approachable. Furthermore, its architecture requires complex changes to support practical systems. As a result, both system builders and students struggle with Paxos.

不幸的是，尽管有许多尝试使它更容易实践，Paxos还是相当难理解。此外，它的架构需要复杂的改动来支持实际的系统。因此，系统建设者和学生都在为Paxos而头疼。

After struggling with Paxos ourselves, we set out to find a new consensus algorithm that could provide a better foundation for system building and education. Our approach was unusual in that our primary goal was understandability: could we define a consensus algorithm for practical systems and describe it in a way that is significantly easier to learn than Paxos? Furthermore, we wanted the algorithm to facilitate the development of intuitions that are essential for system builders. It was important not just for the algorithm to work, but for it to be obvious why it works.

在与Paxos纠结许久之后，我们开始寻找一种新的共识算法，为系统建设和教育提供一个更好的基础。我们的方法是不寻常的，因为我们的主要目标是可理解性：我们能否为实用系统定义一个共识算法，并以一种明显比Paxos更容易学习的方式来描述它？此外，我们希望该算法能够促进直觉的发展，这对系统建设者来说是至关重要的。重要的是，不仅要让算法起作用，而且要让它的作用显而易见。

The result of this work is a consensus algorithm called Raft. In designing Raft we applied specific techniques to improve understandability, including decomposition (Raft separates leader election, log replication, and safety) and state space reduction (relative to Paxos, Raft reduces the degree of nondeterminism and the ways servers can be in consistent with each other). A user study with 43 students at two universities shows that Raft is significantly easier to understand than Paxos: after learning both algorithms, 33 of these students were able to answer questions about Raft better than questions about Paxos.

这项工作的结果是一种叫做Raft的共识算法。在设计Raft时，我们应用了特定的技术来提高可理解性，包括分解（Raft将领导者选举、日志复制和安全分开）和减少状态空间（相对于Paxos，Raft减少了非确定性的程度以及服务器之间的一致方式）。对两所大学的43名学生进行的用户研究表明，Raft明显比Paxos更容易理解：在学习了两种算法后，这些学生中有33人能够更好地回答关于Raft的问题，而不是关于Paxos的问题。

Raft is similar in many ways to existing consensus algorithms (most notably, Oki and Liskov’s Viewstamped Replication [29, 22]), but it has several novel features:

Raft在许多方面与现有的共识算法（最值得注意的是Oki和Liskov的Viewstamped Replication[29, 22]）相似，但它有几个新的特点。

• Strong leader: Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries only flow from the leader to other servers. This simplifies the management of the replicated log and makes Raft easier to understand.

强领导力。与其他共识算法相比，Raft使用了一种更强的领导形式。例如，日志条目只从领导者流向其他服务器。这简化了复制日志的管理，使Raft更容易理解。

• Leader election: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats already required for any consensus algorithm, while resolving conflicts simply and rapidly.

领袖选举。Raft使用随机的计时器来选举领导人。这在任何共识算法已经需要的心跳上只增加了少量的机制，同时简单而迅速地解决了冲突。

• Membership changes: Raft’s mechanism for changing the set of servers in the cluster uses a new joint consensus approach where the majorities of two different configurations overlap during transitions. This allows the cluster to continue operating normally during configuration changes.

成员变化。Raft改变集群中的服务器集的机制使用了一种新的联合共识方法，其中两个不同配置的多数在过渡期间重叠。这使得集群在配置变化期间能够继续正常运行。

We believe that Raft is superior to Paxos and other consensus algorithms, both for educational purposes and as a foundation for implementation. It is simpler and more understandable than other algorithms; it is described completely enough to meet the needs of a practical system; it has several open-source implementations and is used by several companies; its safety properties have been formally specified and proven; and its efficiency is comparable to other algorithms.

我们认为Raft比Paxos和其他共识算法更胜一筹，无论是出于教育目的还是作为实施的基础。它比其他算法更简单，更容易理解；它的描述足够完整，可以满足实际系统的需要；它有几个开源的实现，并被几个公司使用；它的安全属性已经被正式规定和证明；它的效率与其他算法相当。

The remainder of the paper introduces the replicated state machine problem (Section 2), discusses the strengths and weaknesses of Paxos (Section 3), describes our general approach to understandability (Section 4), presents the Raft consensus algorithm (Sections 5–8), evaluates Raft (Section 9), and discusses related work (Section 10).

本文的其余部分介绍了复制的状态机问题（第2节），讨论了Paxos的优点和缺点（第3节），描述了我们对可理解性的一般方法（第4节），介绍了Raft共识算法（第5-8节），评估了Raft（第9节），并讨论了相关工作（第10节）。

2 Replicated state machines

Figure 1: Replicated state machine architecture. The consensus algorithm manages a replicated log containing state machine commands from clients. The state machines process identical sequences of commands from the logs, so they produce the same outputs.

图1：复制的状态机架构。共识算法管理着一个包含来自客户的状态机命令的复制日志。状态机处理来自日志的相同的命令序列，因此它们产生相同的输出。

Consensus algorithms typically arise in the context of replicated state machines [37]. In this approach, state machines on a collection of servers compute identical copies of the same state and can continue operating even if some of the servers are down. Replicated state machines are used to solve a variety of fault tolerance problems in distributed systems. For example, large-scale systems that have a single cluster leader, such as GFS [8], HDFS [38], and RAMCloud [33], typically use a separate replicated state machine to manage leader election and store configuration information that must survive leader crashes. Examples of replicated state machines include Chubby [2] and ZooKeeper [11].

共识算法通常出现在复制状态机的背景下[37]。在这种方法中，服务器集合上的状态机计算相同状态的相同副本，即使一些服务器停机，也能继续运行。复制的状态机被用来解决分布式系统中的各种容错问题。例如，拥有单一集群领导者的大规模系统，如GFS[8]、HDFS[38]和RAMCloud[33]，通常使用单独的复制状态机来管理领导者的选举，并存储必须在领导者崩溃后生存的配置信息。复制状态机的例子包括Chubby [2] 和 ZooKeeper [11]。

Replicated state machines are typically implemented using a replicated log, as shown in Figure 1. Each server stores a log containing a series of commands, which its state machine executes in order. Each log contains the same commands in the same order, so each state machine processes the same sequence of commands. Since the state machines are deterministic, each computes the same state and the same sequence of outputs.

复制的状态机通常使用复制的日志来实现，如图1所示。每台服务器存储一个包含一系列命令的日志，其状态机按顺序执行这些命令。每个日志都包含相同顺序的命令，所以每个状态机处理相同的命令序列。由于状态机是确定性的，每个状态机都计算出相同的状态和相同的输出序列。

Keeping the replicated log consistent is the job of the consensus algorithm. The consensus module on a server receives commands from clients and adds them to its log. It communicates with the consensus modules on other servers to ensure that every log eventually contains the same requests in the same order, even if some servers fail. Once commands are properly replicated, each server’s state machine processes them in log order, and the outputs are returned to clients. As a result, the servers appear to form a single, highly reliable state machine.

保持复制的日志的一致性是共识算法的工作。服务器上的共识模块接收来自客户端的命令，并将它们添加到其日志中。它与其他服务器上的共识模块进行通信，以确保每条日志最终都包含相同顺序的请求，即使一些服务器失败了。一旦命令被正确复制，每个服务器的状态机就会按照日志顺序处理它们，并将输出结果返回给客户。因此，这些服务器似乎形成了一个单一的、高度可靠的状态机。

Consensus algorithms for practical systems typically have the following properties:

实用系统的共识算法通常具有以下特性：

• They ensure safety (never returning an incorrect result) under all non-Byzantine conditions, including network delays, partitions, and packet loss, duplication, and reordering.

在所有非拜占庭的条件下，包括网络延迟、分区以及数据包丢失、重复和重新排序，它们都能确保安全（永远不会返回错误的结果）。

• They are fully functional (available) as long as any majority of the servers are operational and can communicate with each other and with clients. Thus, a typical cluster of five servers can tolerate the failure of any two servers. Servers are assumed to fail by stopping; they may later recover from state on stable storage and rejoin the cluster.

只要任何大多数的服务器都在运行，并且能够相互之间以及与客户进行通信，它们就能完全发挥作用（可用）。因此，一个典型的由五个服务器组成的集群可以容忍任何两个服务器的故障。通过停止服务器假装失效；它们后来可以从稳定存储的状态中恢复并重新加入集群。

• They do not depend on timing to ensure the consistency of the logs: faulty clocks and extreme message delays can, at worst, cause availability problems.

它们不依赖时间来确保日志的一致性：故障时钟和极端的消息延迟在最坏的情况下会导致可用性问题。

• In the common case, a command can complete as soon as a majority of the cluster has responded to a single round of remote procedure calls; a minority of slow servers need not impact overall system performance.

在普通情况下，只要集群中的大多数个体对单轮RPC作出反应，就可以完成一个命令；少数慢的服务器不需要影响整个系统的性能。

3 What’s wrong with Paxos?

Over the last ten years, Leslie Lamport’s Paxos protocol [15] has become almost synonymous with consensus: it is the protocol most commonly taught in courses, and most implementations of consensus use it as a starting point. Paxos first defines a protocol capable of reaching agreement on a single decision, such as a single replicated log entry. We refer to this subset as single-decree Paxos. Paxos then combines multiple instances of this protocol to facilitate a series of decisions such as a log (multi-Paxos). Paxos ensures both safety and liveness, and it supports changes in cluster membership. Its correctness has been proven, and it is efficient in the normal case.

在过去的十年里，Leslie Lamport的Paxos协议[15]几乎成了共识的代名词：它是课程中最常教授的协议，大多数共识的实现都以它为起点。Paxos首先定义了一个能够在单一决策上达成协议的协议，例如单一复制的日志条目。我们把这个子集称为单决策的Paxos。然后，Paxos结合这个协议的多个实例，以促进一系列的决定，如日志（多Paxos）。Paxos既保证了安全性，又保证了有效性，而且它支持集群成员的变化。它的正确性已被证明，而且在正常情况下是有效的。

Unfortunately, Paxos has two significant drawbacks. The first drawback is that Paxos is exceptionally difficult to understand. The full explanation [15] is notoriously opaque; few people succeed in understanding it, and only with great effort. As a result, there have been several attempts to explain Paxos in simpler terms [16, 20, 21]. These explanations focus on the single-decree subset, yet they are still challenging. In an informal survey of attendees at NSDI 2012, we found few people who were comfortable with Paxos, even among seasoned researchers. We struggled with Paxos ourselves; we were not able to understand the complete protocol until after reading several simplified explanations and designing our own alter native protocol, a process that took almost a year.

不幸的是，Paxos有两个显著的缺点。第一个缺点是，Paxos特别难理解。完整的解释[15]是出了名的不透明；很少有人能成功地理解它，而且是在付出巨大努力之后。因此，已经有一些尝试用更简单的术语来解释Paxos [16, 20, 21]。这些解释集中在单决策子集上，然而它们仍然很难。在对NSDI 2012的与会者进行的非正式调查中，我们发现很少有人对Paxos感到满意，即使在经验丰富的研究人员中。我们自己也在为Paxos头疼；在阅读了几个简化的解释和设计了我们自己修改的朴素协议之后，我们才能够理解完整的协议，这个过程花了几乎一年时间。

We hypothesize that Paxos’ opaqueness derives from its choice of the single-decree subset as its foundation. Single-decree Paxos is dense and subtle: it is divided into two stages that do not have simple intuitive explanations and cannot be understood independently. Because of this, it is difficult to develop intuitions about why the single decree protocol works. The composition rules for multi Paxos add significant additional complexity and subtlety. We believe that the overall problem of reaching consensus on multiple decisions (i.e., a log instead of a single entry) can be decomposed in other ways that are more direct and obvious.

我们假设，Paxos的不透明性来自于它选择单决策子集作为基础。单决策Paxos是密集而微妙的：它分为两个阶段，没有简单的直观解释，不能独立理解。正因为如此，很难发展出关于单决策协议为何有效的直觉。多Paxos的组成规则大大增加了复杂性和微妙性。我们认为，就多个决定达成共识的整体问题（即一个日志而不是一个条目）可以用其他更直接和明显的方式进行分解。

The second problem with Paxos is that it does not provide a good foundation for building practical implementations. One reason is that there is no widely agreed-upon algorithm for multi-Paxos. Lamport’s descriptions are mostly about single-decree Paxos; he sketched possible approaches to multi-Paxos, but many details are missing. There have been several attempts to flesh out and optimize Paxos, such as [26], [39], and [13], but these differ from each other and from Lamport’s sketches. Systems such as Chubby [4] have implemented Paxos-like algorithms, but in most cases their details have not been published.

Paxos的第二个问题是，它没有为建立实际的实现提供一个良好的基础。原因之一是没有广泛认同的多Paxos的算法。Lamport的描述主要是关于单决策Paxos的；他勾画了多Paxos的可能方法，但许多细节都没有。已经有一些尝试来充实和优化Paxos，如[26]、[39]和[13]，但这些尝试相互之间以及与Lamport的草图不同。像Chubby[4]这样的系统已经实现了类似Paxos的算法，但在大多数情况下，它们的细节还没有被公布。

Furthermore, the Paxos architecture is a poor one for building practical systems; this is another consequence of the single-decree decomposition. For example, there is little benefit to choosing a collection of log entries independently and then melding them into a sequential log; this just adds complexity. It is simpler and more efficient to design a system around a log, where new entries are appended sequentially in a constrained order. Another problem is that Paxos uses a symmetric peer-to-peer approach at its core (though it eventually suggests a weak form of leadership as a performance optimization). This makes sense in a simplified world where only one decision will be made, but few practical systems use this approach. If a series of decisions must be made, it is simpler and faster to first elect a leader, then have the leader coordinate the decisions.

此外，Paxos架构对于构建实用系统来说是一个糟糕的架构；这是单决策分解的另一个后果。例如，独立地选择一个日志条目集合，然后将它们拼接成一个连续的日志，这没有什么好处，只会增加复杂性。围绕日志设计一个系统才更简单、有效，新的条目是以受限的顺序依次追加的。另一个问题是，Paxos的核心是使用对称的P2P方法（尽管它最终建议采用弱的领导形式作为性能优化）。这在一个只做一个决定的简化世界中是有意义的，但很少有实际的系统使用这种方法。如果必须做出一系列的决定，首先选举一个领导者，然后让领导者协调这些决定，这样做更简单、快捷。

As a result, practical systems bear little resemblance to Paxos. Each implementation begins with Paxos, covers the difficulties in implementing it, and then develops a significantly different architecture. This is time consuming and error-prone, and the difficulties of understanding Paxos exacerbate the problem. Paxos’ formulation may be a good one for proving theorems about its correctness, but real implementations are so different from Paxos that the proofs have little value. The following comment from the Chubby implementers is typical:

因此，实际的系统与Paxos几乎没有相似之处。每个实现都是从Paxos开始的，涵盖了实现它的困难，然后开发出一个明显不同的架构。这样做既费时又容易出错，而理解Paxos的困难又加剧了这个问题。Paxos的公式化对于证明其定理的正确性来说可能是一个很好的表述，但是真正的实现与Paxos差别很大，所以证明的价值不大。以下是来自Chubby实现者的典型评论：

There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. . . . the final system will be based on an unproven protocol [4].

在Paxos算法的描述和真实世界系统的需求之间存在着巨大的差距...最终的系统基于一个未经证实的协议[4]。

Because of these problems, we concluded that Paxos does not provide a good foundation either for system building or for education. Given the importance of consensus in large-scale software systems, we decided to see if we could design an alternative consensus algorithm with better properties than Paxos. Raft is the result of that experiment.

由于这些问题，我们得出结论，Paxos既不能为系统建设也不能为教育提供一个良好的基础。考虑到共识在大规模软件系统中的重要性，我们决定看看我们是否能设计出一种比Paxos具有更好特性的替代性共识算法。Raft就是这个实验的结果。

4 Designing for understandability

We had several goals in designing Raft: it must provide a complete and practical foundation for system building, so that it significantly reduces the amount of design work required of developers; it must be safe under all conditions and available under typical operating conditions; and it must be efficient for common operations. But our most important goal—and most difficult challenge—was understandability. It must be possible for a large audience to understand the algorithm comfortably. In addition, it must be possible to develop intuitions about the algorithm, so that system builders can make the extensions that are inevitable in real-world implementations.

我们在设计Raft时有几个目标：它必须为系统建设提供一个完整而实用的基础，以便大大减少开发人员所需的设计工作量；它必须在所有条件下都是安全的，并且在典型的操作条件下可用；它必须对普通操作高效。但我们最重要的目标和最困难的挑战是可理解性。必须让广大受众能够舒适地理解该算法。此外，还必须有可能发展关于该算法的直觉，以便系统构建者能够进行扩展，而这在现实世界的实施中是不可避免的。

There were numerous points in the design of Raft where we had to choose among alternative approaches. In these situations we evaluated the alternatives based on understandability: how hard is it to explain each alternative (for example, how complex is its state space, and does it have subtle implications?), and how easy will it be for a reader to completely understand the approach and its implications?

在Raft的设计中，有许多地方我们必须在备选方法中做出选择。在这些情况下，我们根据可理解性来评估替代方案：解释每个替代方案有多难（例如，它的状态空间有多复杂，是否有微妙的影响），以及读者完全理解该方法及其影响有多容易？

We recognize that there is a high degree of subjectivity in such analysis; nonetheless, we used two techniques that are generally applicable. The first technique is the well-known approach of problem decomposition: wherever possible, we divided problems into separate pieces that could be solved, explained, and understood relatively independently. For example, in Raft we separated leader election, log replication, safety, and membership changes.

我们认识到，这种分析有很大程度的主观性；然而，我们使用了两种普遍适用的技术。第一种技术是众所周知的问题分解方法：在可能的情况下，我们把问题分成独立的部分，可以相对独立地解决、解释和理解。例如，在Raft中，我们将领袖选举、日志复制、安全和成员变化分开。

Our second approach was to simplify the state space by reducing the number of states to consider, making the system more coherent and eliminating nondeterminism where possible. Specifically, logs are not allowed to have holes, and Raft limits the ways in which logs can become inconsistent with each other. Although in most cases we tried to eliminate nondeterminism, there are some situations where nondeterminism actually improves understandability. In particular, randomized approaches introduce nondeterminism, but they tend to reduce the state space by handling all possible choices in a similar fashion (“choose any; it doesn’t matter”). We used randomization to simplify the Raft leader election algorithm.

我们的第二个方法是通过减少需要考虑的状态数量来简化状态空间，使系统更加连贯，并尽可能消除非确定性。具体来说，不允许日志有漏洞，而且Raft限制了日志相互之间不一致的方式。尽管在大多数情况下，我们试图消除非确定性，但在某些情况下，非确定性实际上提高了可理解性。特别是，随机化的方法引入了非确定性，但它们倾向于通过以类似的方式处理所有可能的选择来减少状态空间（"choose any; it doesn’t matter 选择任何状态；这并不重要"）。我们使用随机化来简化Raft领袖选举算法。

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2021-08-14，如有侵权请联系 cloudcommunity@tencent.com 删除

raft