Production Checklist

Infrastructure

  • Linux kernel 5.11+ (full io_uring support)
  • NVMe storage for Data Plane I/O
  • Sufficient locked memory (ulimit -l unlimited)
  • Memory limit set appropriately (NODEDB_MEMORY_LIMIT)
  • Data Plane cores configured (NODEDB_DATA_PLANE_CORES)
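
Some of the infrastructure items above can be checked mechanically before a node starts. The following is a minimal preflight sketch, not part of NodeDB itself. It assumes NODEDB_MEMORY_LIMIT and NODEDB_DATA_PLANE_CORES are read from the process environment, and the helper names (kernel_ok, memlock_unlimited, missing_env, preflight) are hypothetical.

```python
import os
import platform
import re
import resource

MIN_KERNEL = (5, 11)  # minimum kernel version from the checklist
# Assumption: these settings are supplied as environment variables.
REQUIRED_ENV = ("NODEDB_MEMORY_LIMIT", "NODEDB_DATA_PLANE_CORES")


def parse_kernel_release(release):
    """Extract (major, minor) from a string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_ok(release, minimum=MIN_KERNEL):
    """True if the kernel release is at least the required (major, minor)."""
    return parse_kernel_release(release) >= minimum


def memlock_unlimited():
    """True if the soft locked-memory limit matches 'ulimit -l unlimited'."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft == resource.RLIM_INFINITY


def missing_env(env=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_ENV if not env.get(name)]


def preflight():
    """Collect human-readable problems; an empty list means all checks passed."""
    problems = []
    if not kernel_ok(platform.release()):
        problems.append(f"kernel {platform.release()} is older than 5.11")
    if not memlock_unlimited():
        problems.append("locked memory is limited (expected ulimit -l unlimited)")
    for name in missing_env():
        problems.append(f"{name} is not set")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("FAIL:", problem)
```

The script only reports problems and does not verify NVMe storage, which depends on the device layout of the host.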

Security

  • TLS enabled for all protocols
  • Authentication configured (SCRAM, JWT, API keys, or mTLS)
  • RBAC roles defined and assigned
  • RLS policies for multi-tenant data
  • Audit logging level set (at least standard)
  • Default passwords changed

Replication

  • Replication factor >= 3 for production data
  • Cluster has an odd number of nodes (3, 5, 7) for Raft quorum
  • Cross-region learner replicas configured where needed
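
The preference for odd cluster sizes follows from Raft's majority rule. The sketch below works through the arithmetic. It assumes standard Raft semantics, in which only voting members count toward the quorum and learner replicas do not vote. The function names are illustrative.

```python
def quorum_size(voters):
    """Smallest majority of a Raft group with the given number of voters."""
    if voters < 1:
        raise ValueError("a Raft group needs at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters):
    """Number of voters that can fail while a majority remains reachable."""
    return voters - quorum_size(voters)


for n in range(3, 8):
    print(f"{n} voters: quorum {quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
# 3 voters: quorum 2, tolerates 1 failure(s)
# 4 voters: quorum 3, tolerates 1 failure(s)
# 5 voters: quorum 3, tolerates 2 failure(s)
# 6 voters: quorum 4, tolerates 2 failure(s)
# 7 voters: quorum 4, tolerates 3 failure(s)
```

Going from 3 to 4 voters, or from 5 to 6, raises the quorum size without raising the number of failures the group survives, which is why the even sizes add cost but no fault tolerance.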

Monitoring

  • Prometheus scraping /metrics
  • Grafana dashboards configured
  • Health check endpoint monitored (/health/ready)
  • Alerting on key metrics (WAL fsync latency, replication lag, memory pressure)
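
For the health check item, an external probe can be as small as the sketch below. It assumes that /health/ready is served over HTTP and answers with status 200 when the node is ready. The response contract is an assumption here, and is_ready is a hypothetical helper name.

```python
import http.client
import urllib.error
import urllib.request


def is_ready(base_url, timeout=2.0):
    """Return True only if GET <base_url>/health/ready answers with HTTP 200.

    Assumption: a ready node replies 200. Any other status, a timeout,
    or a connection error is treated as "not ready".
    """
    url = base_url.rstrip("/") + "/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # HTTPError (4xx/5xx) is a subclass of URLError and lands here too.
        return False
```

A monitoring job would call is_ready with the node's base URL on an interval and raise an alert after several consecutive False results, so that a single slow response does not page anyone.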

Backup

  • Regular tenant backups scheduled
  • Backup validation (DRY RUN) tested
  • Restore procedure documented and tested
  • WAL archiving configured for PITR

Operations

  • Rolling upgrade procedure documented
  • Shard rebalancing tested
  • Failure recovery tested (single node, minority failure)