Monitoring
Prometheus Metrics
curl http://localhost:6480/metrics
70+ system metrics: per-engine, per-core, connection, query, replication, and storage. Latency histogram with 13 buckets.
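As a rough illustration of consuming the endpoint above, the sketch below parses Prometheus text exposition output into tuples. It assumes `/metrics` serves the standard text format; the sample scrape is hypothetical, and the parser is deliberately simplified (it does not unescape label values or read timestamps).

```python
import re

# One sample line: metric name, optional {labels}, then a numeric value.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank, HELP, and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical scrape output, for illustration only.
SAMPLE_SCRAPE = """\
# TYPE nodedb_database_connections gauge
nodedb_database_connections{database="orders"} 12
nodedb_database_connections{database="billing"} 3
"""

parsed = parse_metrics(SAMPLE_SCRAPE)
```

In practice the text would come from an HTTP GET against the `/metrics` URL shown above rather than from a literal string.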
PromQL Engine
Full Prometheus query engine at /obsv/api. Point Grafana at this URL as a Prometheus data source. Supports all Tier 1+2+3 functions (rate, irate, delta, histogram_quantile, holt_winters, etc.).
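For scripted access outside Grafana, a query URL can be assembled as sketched below. The `/v1/query` suffix is an assumption: it mirrors the `/obsv/api/v1/write` path used for remote write and the Prometheus HTTP API convention, and is not confirmed by this page. The metric and label in the example expression are taken from the per-database metrics table.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def instant_query_url(base_url, promql):
    """Build an instant-query URL; the /v1/query suffix is an assumed path."""
    return f"{base_url.rstrip('/')}/v1/query?{urlencode({'query': promql})}"


expr = 'rate(nodedb_database_qps{database="orders"}[5m])'
url = instant_query_url("http://localhost:6480/obsv/api", expr)

# Round-trip check: the encoded query decodes back to the original expression.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

`urlencode` percent-encodes the braces, quotes, and brackets in the PromQL expression, so the result is safe to pass to any HTTP client.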
Prometheus Remote Write/Read
Use NodeDB as a long-term Prometheus storage backend:
remote_write:
- url: "http://nodedb:6480/obsv/api/v1/write"
remote_read:
- url: "http://nodedb:6480/obsv/api/v1/read"
OpenTelemetry
- OTLP ingest — Metrics, traces, and logs via HTTP (4318) and gRPC (4317)
- OTLP export — Push NodeDB's own traces/metrics to any OTLP collector
Feature-gated: --features otel, --features promql, --features monitoring.
Health Checks
curl http://localhost:6480/healthz # k8s readiness — 503 until startup completes
curl http://localhost:6480/health/live # liveness probe
curl http://localhost:6480/health/ready # WAL recovered, ready for queries
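A deployment script can poll one of these endpoints before sending traffic. The sketch below assumes a probe returns HTTP 200 once the node is ready (only the 503 during startup is stated above); the helper names `wait_until_ready` and `http_probe` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout_s=2.0):
    """Return a callable that fetches `url` and yields its HTTP status code."""
    def probe():
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                return resp.status
        except urllib.error.HTTPError as err:
            return err.code  # e.g. 503 while startup is still in progress
        except OSError:
            return 0  # connection refused or timed out
    return probe


def wait_until_ready(probe, attempts=30, delay_s=1.0, sleep=time.sleep):
    """Poll `probe` until it returns 200; return the attempt number or None."""
    for attempt in range(1, attempts + 1):
        if probe() == 200:
            return attempt
        if attempt < attempts:
            sleep(delay_s)
    return None


# Demonstration with a fake probe so the example runs without a server.
statuses = iter([503, 503, 200])
attempts_used = wait_until_ready(lambda: next(statuses), sleep=lambda _s: None)
```

Against a live node the call would be `wait_until_ready(http_probe("http://localhost:6480/health/ready"))`.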
Key Metrics
| Metric Category | Examples |
| Engine | Per-engine memory, query count, latency |
| Core | Per-core CPU, queue depth, io_uring completions |
| Connection | Active connections, auth failures |
| Replication | Raft log lag, replication latency |
| Storage | WAL fsync latency, segment count, compaction debt |
| Database | Per-database resource usage and performance |
| Tenant | Per-tenant query rates and memory usage |
Per-Database & Per-Tenant Metrics
Database-scoped and tenant-scoped metrics enable precise monitoring of multi-tenant deployments.
Per-Database Metrics
All metrics labeled with database="<name>":
| Metric | Type | Description |
| nodedb_database_qps | counter | Queries per second for the database |
| nodedb_database_memory_bytes | gauge | Memory currently used by the database |
| nodedb_database_storage_bytes | gauge | Storage footprint of all collections in the database |
| nodedb_database_connections | gauge | Active connections to the database |
| nodedb_database_bridge_queue_depth | gauge | Pending requests in SPSC bridge for the database |
| nodedb_database_wal_commit_latency_p99 | histogram | 99th percentile WAL commit latency for writes to the database |
| nodedb_database_maintenance_cpu_seconds | counter | Cumulative CPU time spent on background maintenance tasks |
| nodedb_database_mirror_lag_ms | gauge | Replication lag for mirror databases (zero if not mirrored) |
Per-Tenant Metrics
All metrics labeled with database="<name>" and tenant="<name>":
| Metric | Type | Description |
| nodedb_tenant_qps | counter | Queries per second for the tenant |
| nodedb_tenant_memory_bytes | gauge | Memory currently used by the tenant |
| nodedb_tenant_storage_bytes | gauge | Storage footprint of the tenant's data |
The per-database and per-tenant metrics are emitted from the dispatch layer (qps), memory governor (memory), WAL (storage), connection listener (connections), and background scheduler (maintenance_cpu_seconds), which keeps attribution consistent across request processing.
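Because tenant metrics carry both a `database` and a `tenant` label, they can be rolled up along either dimension. The sketch below sums a metric by one label over already-parsed samples; the sample values and the database and tenant names are hypothetical.

```python
from collections import defaultdict


def sum_by_label(samples, metric, label):
    """Sum the values of `metric` grouped by the value of `label`."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        if name == metric and label in labels:
            totals[labels[label]] += value
    return dict(totals)


# Hypothetical samples as (name, labels, value) tuples.
SAMPLES = [
    ("nodedb_tenant_memory_bytes", {"database": "orders", "tenant": "acme"}, 1024.0),
    ("nodedb_tenant_memory_bytes", {"database": "orders", "tenant": "globex"}, 2048.0),
    ("nodedb_tenant_memory_bytes", {"database": "billing", "tenant": "acme"}, 512.0),
    ("nodedb_tenant_storage_bytes", {"database": "orders", "tenant": "acme"}, 4096.0),
]

per_database = sum_by_label(SAMPLES, "nodedb_tenant_memory_bytes", "database")
per_tenant = sum_by_label(SAMPLES, "nodedb_tenant_memory_bytes", "tenant")
```

The same roll-up in PromQL would be `sum by (database) (nodedb_tenant_memory_bytes)`.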
Cross-shard transaction metrics
| Metric | Type | Description |
| nodedb_sequencer_epochs_total | counter | Epochs proposed by the sequencer |
| nodedb_sequencer_epoch_duration_ms | histogram | Time to drain and propose each epoch |
| nodedb_sequencer_admitted_txns_total{outcome} | counter | Admission outcomes: admitted, rejected_conflict, rejected_inbox_full, rejected_txn_too_large, rejected_fanout_too_wide, rejected_tenant_quota, rejected_not_leader |
| nodedb_sequencer_inbox_depth | gauge | Pending transactions in the sequencer inbox |
| nodedb_calvin_scheduler_lock_wait_ms_total{vshard} | counter | Cumulative lock-wait time per shard |
| nodedb_calvin_executor_txn_duration_ms{vshard} | histogram | Per-shard execution time for cross-shard txns |
| nodedb_calvin_ollp_retries_total{predicate_class,outcome} | counter | OLLP retry outcomes (succeeded, retried, exhausted, circuit_open, tenant_budget_exceeded) |
| nodedb_calvin_ollp_circuit_state{predicate_class} | gauge | 0 = closed, 1 = half-open, 2 = open |
| nodedb_calvin_ollp_backoff_ms{predicate_class} | gauge | Current OLLP retry backoff delay |
| nodedb_calvin_infra_abort_total{reason} | counter | Infrastructure aborts (disk error, OOM, corruption) |
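Two of these metrics usually need a little post-processing before they are readable on a dashboard: the numeric circuit state and the per-outcome admission counters. The sketch below decodes the state gauge using the mapping from the table and computes a rejection ratio from per-outcome increases over some window. The function names and the example counts are hypothetical.

```python
# Mapping taken from the nodedb_calvin_ollp_circuit_state description.
CIRCUIT_STATES = {0: "closed", 1: "half-open", 2: "open"}


def circuit_state_name(value):
    """Translate the numeric gauge value into a readable state."""
    return CIRCUIT_STATES.get(int(value), "unknown")


def rejection_ratio(outcome_counts):
    """Fraction of transactions whose outcome starts with 'rejected_'."""
    total = sum(outcome_counts.values())
    if total == 0:
        return 0.0
    rejected = sum(
        count
        for outcome, count in outcome_counts.items()
        if outcome.startswith("rejected_")
    )
    return rejected / total


# Hypothetical per-outcome increases over one observation window.
window = {"admitted": 950, "rejected_conflict": 30, "rejected_inbox_full": 20}
ratio = rejection_ratio(window)
state = circuit_state_name(2.0)
```

Since the admission metric is a cumulative counter, the inputs should be increases over a window rather than raw counter values.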
Memory backpressure metrics
| Metric | Type | Description |
| nodedb_backpressure_critical_total{engine} | counter | Write handlers that entered the Critical-pressure flush path |
| nodedb_backpressure_emergency_total{engine} | counter | Write handlers rejected by Emergency-pressure |
IO priority metrics
| Metric | Type | Description |
| nodedb_io_queue_depth{priority} | gauge | Pending tasks per IO priority tier (background, normal, high, critical) |
| nodedb_io_wait_ns{priority} | histogram | Submission-to-completion latency per tier |
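Histogram metrics such as `nodedb_io_wait_ns` are normally summarized with `histogram_quantile`. To show what that estimate means, the sketch below applies the same linear-interpolation idea to cumulative buckets in plain Python. It is a simplified illustration rather than NodeDB's or Prometheus's implementation, and the bucket bounds and counts are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile `q` from [(upper_bound, cumulative_count), ...]."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for index, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break
    if math.isinf(upper):
        return buckets[-2][0]  # cap at the highest finite bound
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    in_bucket = cumulative - below
    if in_bucket == 0:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


# Hypothetical cumulative buckets for one priority tier, bounds in nanoseconds.
BUCKETS = [(1_000, 50), (10_000, 90), (100_000, 99), (math.inf, 100)]
p50 = histogram_quantile(0.50, BUCKETS)
p95 = histogram_quantile(0.95, BUCKETS)
```

The estimate is only as fine-grained as the bucket layout: any quantile that falls in the `+Inf` bucket is reported as the highest finite bound.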
CDC metrics
| Metric | Type | Description |
| nodedb_cdc_events_dropped_total{tenant,stream} | counter | Events dropped from a named stream's buffer due to overflow (per stream, not global) |
Alert when this counter increases for a stream whose consumer is active: it means the consumer is falling behind.
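The alert condition amounts to "the counter grew between two scrapes". A minimal sketch of that check follows, treating a drop in value as a counter reset in the way Prometheus-style counters are usually handled. The function names and the tenant and stream names are hypothetical.

```python
def counter_increase(previous, current):
    """Increase between two scrapes of a counter, treating a drop as a reset."""
    if current < previous:
        return current  # counter restarted from zero
    return current - previous


def streams_dropping_events(prev_scrape, curr_scrape):
    """Return {(tenant, stream): increase} for streams whose counter grew."""
    growing = {}
    for key, current in curr_scrape.items():
        # A series absent from the previous scrape is treated as starting at 0.
        increase = counter_increase(prev_scrape.get(key, 0), current)
        if increase > 0:
            growing[key] = increase
    return growing


# Hypothetical values of nodedb_cdc_events_dropped_total at two scrapes.
prev = {("acme", "orders_feed"): 10, ("acme", "audit"): 4}
curr = {("acme", "orders_feed"): 25, ("acme", "audit"): 4, ("globex", "orders_feed"): 2}
dropping = streams_dropping_events(prev, curr)
```

In an alerting rule the same condition could be written in PromQL as `increase(nodedb_cdc_events_dropped_total[5m]) > 0`, with the window chosen to suit the scrape interval.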
Corruption quarantine metrics
See Corruption Quarantine for the full quarantine runbook.
| Metric | Type | Description |
| nodedb_segments_quarantined_total{engine,collection} | counter | Cumulative segments quarantined since startup |
| nodedb_segments_quarantined_active{engine,collection} | gauge | Segments currently in quarantine |