Multi-Raft Consensus

NodeDB uses Multi-Raft — each vShard is its own independent Raft group with its own leader, log, and snapshot schedule. This avoids the bottleneck of a single Raft group for the entire cluster.

Per-vShard Raft

Each Raft group handles:

  • Leader election — automatic failover when the current leader becomes unreachable
  • Log replication — WAL entries replicated to followers before acknowledgement
  • Snapshots — periodic state snapshots to truncate the Raft log
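As a rough illustration of the snapshot point, the sketch below folds applied log entries into a snapshot and drops them from the log. `RaftLog`, `take_snapshot`, and the key/value entry shape are hypothetical names chosen for this example; they are not NodeDB's actual types.

```python
from dataclasses import dataclass, field


@dataclass
class RaftLog:
    """Toy in-memory Raft log for one vShard (hypothetical, not NodeDB's type)."""
    snapshot_index: int = 0                       # last log index covered by the snapshot
    snapshot_state: dict = field(default_factory=dict)
    entries: list = field(default_factory=list)   # entries after snapshot_index

    def append(self, key, value):
        """Append an entry and return its 1-based log index."""
        self.entries.append((key, value))
        return self.snapshot_index + len(self.entries)

    def take_snapshot(self, applied_index):
        """Fold entries up to applied_index into the snapshot, then drop them."""
        count = applied_index - self.snapshot_index
        if count < 0 or count > len(self.entries):
            raise ValueError("applied_index is outside the retained log")
        for key, value in self.entries[:count]:
            self.snapshot_state[key] = value
        self.entries = self.entries[count:]       # the truncation step
        self.snapshot_index = applied_index


log = RaftLog()
for i in range(1, 6):
    log.append(f"k{i}", i)
log.take_snapshot(applied_index=3)
print(log.snapshot_index, len(log.entries))  # 3 2
```

After the snapshot, only entries 4 and 5 remain in the log; a follower that falls behind index 3 would have to be caught up from the snapshot rather than from the log.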

Write Path (Replicated)

  1. Client sends write to the vShard leader
  2. Leader appends to local WAL
  3. Leader replicates to Raft followers
  4. Quorum acknowledges (majority of replicas)
  5. Leader commits and responds to client

Writes are linearizable within each Raft group.
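The write path above can be sketched as a single-process simulation. This is a minimal model under stated assumptions, not NodeDB's implementation: `Leader`, `Follower`, and `write` are hypothetical names, replication is a synchronous method call, and a failed quorum simply returns `False` instead of retrying.

```python
class Follower:
    """Hypothetical follower replica that may be unreachable."""

    def __init__(self, reachable=True):
        self.reachable = reachable
        self.wal = []

    def replicate(self, entry):
        if not self.reachable:
            return False
        self.wal.append(entry)
        return True


class Leader:
    """Hypothetical vShard leader following steps 2-5 of the write path."""

    def __init__(self, followers):
        self.followers = followers
        self.wal = []
        self.commit_index = 0

    def write(self, entry):
        # Step 2: append to the leader's local WAL.
        self.wal.append(entry)
        index = len(self.wal)
        # Step 3: replicate to followers; the leader's own copy counts as one ack.
        acks = 1 + sum(f.replicate(entry) for f in self.followers)
        # Step 4: a quorum is a strict majority of all replicas.
        quorum = (len(self.followers) + 1) // 2 + 1
        if acks < quorum:
            return False  # not committed; a real leader would keep retrying
        # Step 5: commit, then acknowledge the client.
        self.commit_index = index
        return True


# Three replicas with one follower down: 2 of 3 acks is still a majority.
leader = Leader([Follower(), Follower(reachable=False)])
print(leader.write("put a=1"))  # True
```

With both followers unreachable the same call returns `False`, because the leader's own ack alone is not a majority of three replicas.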

Raft group kinds

NodeDB runs three kinds of Raft groups simultaneously:

| Kind      | Purpose                                             | Count           |
|-----------|-----------------------------------------------------|-----------------|
| Data      | Replicates WAL entries for its vShard's data        | One per vShard  |
| Meta      | Cluster membership, catalog, schema                 | One per cluster |
| Sequencer | Cross-shard transaction ordering (Calvin epoch log) | One per cluster |

Each kind has independent leader election. A sequencer leader failure does not affect data-group leaders, and vice versa.

Sequencer Raft group

The sequencer group exists solely to produce a globally ordered log of cross-shard transaction batches (epochs). Its dedicated group ID lies outside the data-group range, so it can never alias a vShard. See Cross-Shard Transactions for how the sequencer group interacts with the scheduler and executor.

Single-shard writes never touch the sequencer group — they go directly through the relevant data-group's Raft.
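One way to picture the routing rule and the reserved group ID is the sketch below. Every concrete value here is an assumption made for illustration: the vShard count, the reserved IDs, the CRC32 key hash, and the function names `vshard_for` and `target_group` are not taken from NodeDB.

```python
import zlib

NUM_VSHARDS = 1024                 # assumed shard count, for illustration only
META_GROUP_ID = 2**64 - 2          # hypothetical reserved ID outside the data range
SEQUENCER_GROUP_ID = 2**64 - 1     # hypothetical reserved ID outside the data range


def vshard_for(key: str) -> int:
    """Map a key to a data-group ID in [0, NUM_VSHARDS). The hash is an assumption."""
    return zlib.crc32(key.encode()) % NUM_VSHARDS


def target_group(keys):
    """Pick the Raft group that first sees a write touching the given keys."""
    if not keys:
        raise ValueError("a write must touch at least one key")
    shards = {vshard_for(k) for k in keys}
    if len(shards) == 1:
        return shards.pop()        # single-shard: straight to that data group
    return SEQUENCER_GROUP_ID      # cross-shard: ordered by the sequencer group


print(target_group(["user:1"]) == vshard_for("user:1"))  # True
```

Because the reserved IDs sit far above `NUM_VSHARDS`, no key hash can ever produce them, which is the aliasing guarantee described above.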

Advantages of Multi-Raft

  • Independent leaders — different vShards can have leaders on different nodes, distributing write load
  • Parallel commits — vShards commit independently, no global ordering bottleneck
  • Granular failover — a node failure only triggers leader election for the vShards it led, not the entire cluster
  • Failure isolation — sequencer leader election is independent of data and meta group elections
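The granular-failover point can be made concrete with a small sketch. The leader map, node names, and the `groups_needing_election` helper are invented for this example and do not reflect NodeDB's internal bookkeeping.

```python
def groups_needing_election(leaders, failed_node):
    """Return the groups whose leader lived on the failed node.

    `leaders` maps a group name to the node currently leading it.
    """
    return sorted(g for g, node in leaders.items() if node == failed_node)


leaders = {
    "data/0": "node-a",
    "data/1": "node-b",
    "data/2": "node-c",
    "data/3": "node-a",
    "meta": "node-b",
    "sequencer": "node-c",
}
print(groups_needing_election(leaders, "node-a"))  # ['data/0', 'data/3']
```

Losing `node-a` triggers elections only in the two data groups it led; the other four groups, including meta and sequencer, keep their current leaders.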
Last updated on May 12, 2026 by Farhan Syah