Cross-Shard Transactions
When a write touches rows that hash to more than one vShard, NodeDB routes it through the Calvin sequencer — a dedicated Raft-backed coordination layer that guarantees every participating shard applies transactions in the same global order, with no mid-flight aborts.
Single-shard writes bypass this entirely and take the normal per-vShard Raft fast path.
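The routing decision is a set-cardinality check over the shards a write's keys hash to. A minimal sketch, assuming a hash-based key-to-vShard mapping — the hash function, shard count, and all names here are illustrative, not NodeDB's actual API:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

const VSHARD_COUNT: u64 = 256; // assumed shard count, for illustration only

#[derive(Debug, PartialEq)]
enum Route {
    SingleShardFastPath(u64), // normal per-vShard Raft commit
    Sequenced,                // routed through the Calvin sequencer
}

fn vshard_for(key: &str) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish() % VSHARD_COUNT
}

/// A write is sequenced iff its keys hash to more than one vShard.
fn route(write_keys: &[&str]) -> Route {
    let shards: HashSet<u64> = write_keys.iter().map(|k| vshard_for(k)).collect();
    if shards.len() == 1 {
        Route::SingleShardFastPath(*shards.iter().next().unwrap())
    } else {
        Route::Sequenced
    }
}
```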
Why Calvin, not 2PC
Two-phase commit with compensating actions was the earlier design. It was replaced because compensation creates observable intermediate states (a row exists, then disappears), compensation can itself fail, and every read path that crosses an in-flight transaction has to know about pending compensations. The concern spreads outward indefinitely.
Calvin (Thomson et al., SIGMOD 2012) eliminates the problem by validating the full read/write set before any shard touches the transaction. Every shard executes against a globally-ordered input log. There is no concept of "shard A committed, shard B failed" — either all shards execute the transaction or none do.
Architecture
Client / Control Plane
│ declares read/write set
▼
SEQUENCER (dedicated Raft group, Control Plane)
│ batches transactions into epochs (default 20 ms)
│ replicates each epoch via Raft — globally ordered
▼
SCHEDULER (per vShard, Control Plane)
│ acquires locks in deterministic global order
│ single-threaded per shard
▼
EXECUTOR (Data Plane, existing engine handlers)
│ executes deterministically using the sequenced batch
│ no application-level aborts — all constraint checks happen upstream
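Determinism is what keeps the scheduler stage simple: every replica sees the same epoch, so locks can be taken in a fixed global order with no deadlock detection. A sketch of that stage, assuming a hypothetical single-threaded LockTable (not NodeDB's real type):

```rust
use std::collections::BTreeSet;

/// Hypothetical per-vShard lock table; the scheduler is single-threaded,
/// so no synchronization primitives are needed.
struct LockTable {
    held: BTreeSet<Vec<u8>>,
}

impl LockTable {
    /// Try to take every lock a txn needs in one pass. Sorting the keys
    /// first means all replicas lock in the same global order; a txn
    /// that cannot get all its locks simply waits for the earlier txn
    /// in the epoch to release them — no deadlock is possible.
    fn try_acquire_all(&mut self, keys: &mut Vec<Vec<u8>>) -> bool {
        keys.sort();
        keys.dedup();
        if keys.iter().any(|k| self.held.contains(k)) {
            return false; // blocked behind an earlier txn in the log
        }
        self.held.extend(keys.iter().cloned());
        true
    }

    fn release_all(&mut self, keys: &[Vec<u8>]) {
        for k in keys {
            self.held.remove(k);
        }
    }
}
```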
The sequencer Raft group
The sequencer runs as its own independent Raft group (SEQUENCER_GROUP_ID), separate from the per-vShard data groups and the metadata group. This means:
- Failure isolation — sequencer leader election doesn't disrupt metadata reads, schema operations, or data writes
- Independent tuning — epoch duration (default 20 ms) is tuned separately from data-group commit latency
- Geographic placement — the sequencer triad can sit on the lowest-latency nodes without constraining where data groups live
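A sketch of the implied layout; the type names and tuning struct are assumptions, and only the three-way group separation and the 20 ms default come from the text above:

```rust
use std::time::Duration;

/// Raft groups in the cluster, kept deliberately independent so that a
/// leader election in one kind never stalls the others.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum RaftGroup {
    Metadata,  // schema operations, catalog reads
    Sequencer, // the SEQUENCER_GROUP_ID coordination group
    Data(u32), // one group per vShard
}

/// Independent tuning: the epoch window is a sequencer-only knob,
/// separate from data-group commit settings.
struct SequencerTuning {
    epoch_window: Duration,
}

impl Default for SequencerTuning {
    fn default() -> Self {
        Self { epoch_window: Duration::from_millis(20) }
    }
}
```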
Epochs
The sequencer leader batches incoming transactions into epoch windows (default 20 ms). At the end of each window it:
- Runs a pre-validation pass — detects intra-batch write-set conflicts; admits the first txn for a conflicting key and rejects the rest with SequencerConflict so the client retries (sketched below)
- Proposes the validated batch to the sequencer Raft group
- Once committed, fans the epoch out to each participating vShard's scheduler
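A minimal sketch of the pre-validation pass, under assumed Txn and Admission types: the first transaction to claim a key in the epoch wins, and later conflicting transactions are bounced back to the client with SequencerConflict:

```rust
use std::collections::HashSet;

struct Txn {
    id: u64,
    write_set: Vec<String>,
}

#[derive(Debug)]
enum Admission {
    Admitted,
    SequencerConflict, // client retries in a later epoch
}

/// First txn to claim a key wins; later conflicting txns are rejected.
fn prevalidate(batch: &[Txn]) -> Vec<(u64, Admission)> {
    let mut claimed: HashSet<&str> = HashSet::new();
    batch
        .iter()
        .map(|txn| {
            if txn.write_set.iter().any(|k| claimed.contains(k.as_str())) {
                (txn.id, Admission::SequencerConflict)
            } else {
                claimed.extend(txn.write_set.iter().map(|s| s.as_str()));
                (txn.id, Admission::Admitted)
            }
        })
        .collect()
}
```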
Determinism
All replicas must produce byte-identical WAL output for the same epoch. The executor is forbidden from using wall-clock time, non-seeded randomness, or non-deterministic map iteration on the cross-shard write path. System-time columns (bitemporal sys_from, KV TTL expiry, graph HLC ordinals) are seeded from the epoch timestamp so all replicas stamp the same value.
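A sketch of how a seeded timestamp might be derived; EpochHeader and the exact derivation are assumptions, while the invariant — no wall-clock reads on this path — is the rule stated above:

```rust
struct EpochHeader {
    epoch_id: u64,
    /// Chosen once by the sequencer leader and replicated via Raft;
    /// executors never consult the wall clock on the cross-shard path.
    epoch_ts_micros: u64,
}

/// Every replica derives the same sys_from / TTL-expiry / HLC value from
/// (epoch timestamp, position within the epoch), so WAL output is
/// byte-identical across replicas.
fn deterministic_ts(header: &EpochHeader, txn_position: u32) -> u64 {
    header.epoch_ts_micros + txn_position as u64
}
```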
Session variable
-- Default: route multi-vShard writes through the sequencer (atomic)
SET cross_shard_txn = 'strict';
-- Explicit opt-out: each shard commits independently (NOT atomic)
SET cross_shard_txn = 'best_effort_non_atomic';
SHOW cross_shard_txn;
best_effort_non_atomic is intended for bulk loads where atomicity is not required. It is deliberately named to discourage accidental use.
EXPLAIN output
EXPLAIN reports whether a query routes through the sequencer:
EXPLAIN INSERT INTO orders VALUES (...);
-- cross-shard: sequenced | vshards: [3, 7] | epoch: <assigned at admission> | position: <assigned at admission>
EXPLAIN INSERT INTO local_cache VALUES (...);
-- single-shard: vshard 3
OLLP — value-dependent predicates
For queries like UPDATE ... WHERE balance > 10000, where the write set depends on a scan result, NodeDB uses Optimistic Lock Location Prediction (OLLP):
- The planner runs the query optimistically once to capture the predicted write set
- Submits to the sequencer with the predicted set
- At execution time the executor re-scans and compares — if the set changed due to a concurrent commit, it returns OllpRetryRequired and retries transparently (the loop is sketched below)
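A hedged sketch of that loop; every name here is illustrative, and the real executor performs the re-scan inside the sequenced epoch rather than via a plain closure:

```rust
use std::collections::BTreeSet;

#[derive(Debug)]
enum OllpError {
    OllpRetryRequired, // predicted set went stale; retried transparently
}

/// Predict the write set with one optimistic scan, submit the prediction
/// to the sequencer, then re-scan at execution time and compare.
fn execute_with_ollp(
    scan: impl Fn() -> BTreeSet<String>, // evaluates the value-dependent predicate
    submit: impl Fn(&BTreeSet<String>),  // hands the predicted set to the sequencer
) -> Result<BTreeSet<String>, OllpError> {
    let predicted = scan(); // optimistic reconnaissance pass
    submit(&predicted);
    let actual = scan(); // deterministic re-scan inside the sequenced epoch
    if actual == predicted {
        Ok(actual)
    } else {
        Err(OllpError::OllpRetryRequired) // a concurrent commit changed the set
    }
}
```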
To prevent retry storms, each predicate class (the parsed predicate AST, ignoring bound parameters) carries three safeguards, sketched after this list:
- Adaptive backoff — starts at 10 ms, doubles up to 5 s
- Circuit breaker — opens after >50% retry rate over a 60 s window; half-opens after 30 s; closes after 4 consecutive successes
- Per-tenant budget — 1000 retries/min per tenant; excess returns OllpTenantBudgetExceeded
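A sketch of a per-predicate-class governor combining those safeguards; the thresholds come from the list above, while the state-machine shape is an assumption:

```rust
use std::time::Duration;

enum Circuit {
    Closed,
    Open,     // >50% retry rate over the 60 s window
    HalfOpen, // probing after the 30 s cool-down
}

struct RetryGovernor {
    backoff: Duration,
    circuit: Circuit,
    consecutive_successes: u32,
}

impl RetryGovernor {
    fn new() -> Self {
        Self {
            backoff: Duration::from_millis(10),
            circuit: Circuit::Closed,
            consecutive_successes: 0,
        }
    }

    fn on_retry(&mut self) {
        // Adaptive backoff: starts at 10 ms, doubles up to the 5 s ceiling.
        self.backoff = (self.backoff * 2).min(Duration::from_secs(5));
        self.consecutive_successes = 0;
    }

    fn on_success(&mut self) {
        self.backoff = Duration::from_millis(10);
        self.consecutive_successes += 1;
        // A half-open circuit closes after 4 consecutive successes.
        if matches!(self.circuit, Circuit::HalfOpen) && self.consecutive_successes >= 4 {
            self.circuit = Circuit::Closed;
        }
    }

    /// Called by the rate tracker when the 60 s retry rate exceeds 50%.
    fn trip(&mut self) {
        self.circuit = Circuit::Open;
    }

    /// Called after the 30 s cool-down to probe with one request.
    fn half_open(&mut self) {
        self.circuit = Circuit::HalfOpen;
        self.consecutive_successes = 0;
    }
}
```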
Static predicates (WHERE id IN (...), WHERE id = ?) compute a deterministic write set at parse time and never enter the OLLP path.
Admission limits
The sequencer enforces caps to prevent a single transaction from monopolising the cluster:
| Cap | Default | Behaviour on exceed |
| --- | --- | --- |
| max_plans_bytes_per_txn | 1 MiB | TxnTooLarge |
| max_participating_vshards_per_txn | 64 | FanoutTooWide |
| max_txns_per_epoch | 1024 | queued to next epoch |
| max_bytes_per_epoch | 16 MiB | queued to next epoch |
| Per-tenant inbox quota | inbox / 8 | TenantQuotaExceeded |
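A sketch of the per-transaction checks as one admission function; the constants mirror the table and the error names come from the text, but the struct shapes are assumed:

```rust
const MAX_PLANS_BYTES_PER_TXN: usize = 1 << 20; // 1 MiB
const MAX_PARTICIPATING_VSHARDS: usize = 64;

#[derive(Debug)]
enum AdmissionError {
    TxnTooLarge,
    FanoutTooWide,
}

struct TxnPlan {
    plan_bytes: usize,
    participating_vshards: Vec<u32>,
}

fn admit(txn: &TxnPlan) -> Result<(), AdmissionError> {
    if txn.plan_bytes > MAX_PLANS_BYTES_PER_TXN {
        return Err(AdmissionError::TxnTooLarge);
    }
    if txn.participating_vshards.len() > MAX_PARTICIPATING_VSHARDS {
        return Err(AdmissionError::FanoutTooWide);
    }
    // The epoch-level caps (txn count, byte budget) queue the txn to the
    // next epoch rather than erroring; see the table above.
    Ok(())
}
```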
Failure modes
| Failure | Behaviour |
| --- | --- |
| Sequencer leader dies | Raft re-elects; in-flight inbox submissions dropped; clients see Unavailable and retry |
| Sequencer follower dies | No client-visible impact; quorum remains |
| Scheduler shard crashes | Rebuilt from sequencer log on restart; shard unavailable until rebuild completes |
| Executor panic mid-apply | Locks held; shard restarts and replays the txn from WAL (determinism ensures identical result) |
| Network partition | Partitioned shard stops applying new epochs; serves reads with snapshot semantics; catches up on heal |
Metrics
| Metric | Type | Description |
| --- | --- | --- |
| nodedb_sequencer_epochs_total | counter | Epochs proposed |
| nodedb_sequencer_epoch_duration_ms | histogram | Time to drain + propose each epoch |
| nodedb_sequencer_admitted_txns_total{outcome} | counter | Per-outcome admission counts (admitted, rejected_conflict, rejected_inbox_full, rejected_txn_too_large, rejected_fanout_too_wide, rejected_tenant_quota, rejected_not_leader) |
| nodedb_sequencer_inbox_depth | gauge | Pending txns in the inbox |
| nodedb_calvin_scheduler_lock_wait_ms_total{vshard} | counter | Cumulative lock-wait time per shard |
| nodedb_calvin_executor_txn_duration_ms{vshard} | histogram | Per-shard execution time |
| nodedb_calvin_ollp_retries_total{predicate_class,outcome} | counter | OLLP retry outcomes |
| nodedb_calvin_ollp_circuit_state{predicate_class} | gauge | 0 = closed, 1 = half-open, 2 = open |
| nodedb_calvin_ollp_backoff_ms{predicate_class} | gauge | Current OLLP backoff delay |
| nodedb_calvin_infra_abort_total{reason} | counter | Infrastructure-level aborts (disk error, OOM, etc.) |