Raft

Raft is a consensus algorithm, it is equivalent to Paxos but is easier to understand.

Specifically, Raft considers a group of servers (called a Raft cluster) which collectively maintains a replicated log. The Raft algorithm is to make sure that the replicated log is consistent even if some of the servers fail.

From the systems perspective, a client sends commands to the Raft cluster, “the consensus module on a server receives commands from clients and adds them to its log. It communicates with the consensus modules on other servers to ensure that every log eventually contains the same requests in the same order, even if some servers fail”, “as a result, the servers appear to form a single and highly reliable state machine”.

A server in the Raft cluster can be in one of three states, leader, follower or candidate. There is one and only one leader in the cluster. The leader handle all client requests. If a follower receives the request from the client, it forwards the request to the leader. Raft algorithm ensures that there is at most one leader at a given time.

The Raft divides time into arbitrary-length units called terms. Each term begins with an election. Some elections fail, in which case the term ends without choosing a leader. “Different servers may observe the transitions between terms at different times, and in some situations a server may not observe an election or even entire terms. Each server stores a current term number, which increases monotonically over ti. Each server stores a current term number, which increases monotonically over time”

A follower can start a election by incrementing its current term and transitions to the candidate state. It then votes for itself and sends out a RequestVote RPC. The follower may go out of the candidate state if any of the following three cases happends: (a) it wins the election (b) another server wins the election (c) timeout. Once a candiate wins the election, it sends heartbeat messages to all of the other servers to establish its authority. If a candiate receives AppendEntries PRC, it means another server claims that it wins the election, the candiate then verify the case using comparing the number of the term. If the term number it receives is larger than the term number it maintains, then it consider the message is from a legitimate elected-leader and changes its state back to follower.

The safety (allow at most one winner per term) property is guranteed by the fact that each server gives only one vote per term and the majority is required to win the election. The liveness (some candidate must eventually win) is done by selecting election timeouts randomly.

The leader services its clients request once it is elected. The leader appends the command to the log. A log is called committed if the majority of the nodes have replicated the log. The leader will retry the AppendEntries PRC untill all followers succeed. If the log is not committed by a server, the server cannot be the future leader. To do that, each candidates must include index and term of last log entry in RequestVote RPCs, the voting server denies vote if its log is more up-to-date. The logs are ranked by <lastTerm, lastIndex>