Monitoring

Prometheus Metrics

curl http://localhost:6480/metrics

The endpoint exposes 70+ system metrics: per-engine, per-core, connection, query, replication, and storage. The latency histogram has 13 buckets.
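As a rough illustration of consuming this endpoint, the Python sketch below parses Prometheus text exposition output into a dictionary. The sample payload and its values are invented for illustration. `fetch_metrics` and `parse_metrics` are hypothetical helper names, and the parser keeps label sets as raw strings rather than handling every escaping rule.

```python
import urllib.request

# Illustrative payload only: the values and HELP text are made up.
SAMPLE = """\
# HELP nodedb_database_connections Active connections to the database
# TYPE nodedb_database_connections gauge
nodedb_database_connections{database="app"} 12
nodedb_sequencer_inbox_depth 3
"""


def fetch_metrics(url="http://localhost:6480/metrics", timeout=5.0):
    """Download the raw exposition text (not called in this sketch)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text):
    """Return {(metric_name, raw_label_string): value} for each sample line."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name, rest = line.split("{", 1)
            labels, rest = rest.rsplit("}", 1)
        else:
            name, rest = line.split(None, 1)
            labels = ""
        # The first token after the labels is the value; a timestamp may follow.
        samples[(name.strip(), labels)] = float(rest.split()[0])
    return samples


parsed = parse_metrics(SAMPLE)
```

Calling `parse_metrics(fetch_metrics())` against a running instance would yield the same shape of dictionary for the live metrics.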

PromQL Engine

Full Prometheus query engine at /obsv/api. Point Grafana at this URL as a Prometheus data source. Supports all Tier 1+2+3 functions (rate, irate, delta, histogram_quantile, holt_winters, etc.).
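For scripted access outside Grafana, a query URL can be built as sketched below. This assumes the engine serves the standard Prometheus instant-query route as `/v1/query` under `/obsv/api`, mirroring the `/obsv/api/v1/write` path used for remote write. That path is an assumption, not something this page confirms. `instant_query_url` is a hypothetical helper.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base URL from this page; the "/v1/query" suffix is an assumption.
BASE = "http://localhost:6480/obsv/api"


def instant_query_url(promql, base=BASE):
    """Build an instant-query URL with the PromQL expression URL-encoded."""
    return f"{base}/v1/query?{urlencode({'query': promql})}"


expr = "rate(nodedb_sequencer_epochs_total[5m])"
url = instant_query_url(expr)

# Round-trip check: decoding the query string recovers the expression.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```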

Prometheus Remote Write/Read

Use NodeDB as a long-term Prometheus storage backend:

remote_write:
  - url: "http://nodedb:6480/obsv/api/v1/write"
remote_read:
  - url: "http://nodedb:6480/obsv/api/v1/read"

OpenTelemetry

  • OTLP ingest — Metrics, traces, and logs via HTTP (4318) and gRPC (4317)
  • OTLP export — Push NodeDB's own traces/metrics to any OTLP collector

Feature-gated: --features otel, --features promql, --features monitoring.

Health Checks

curl http://localhost:6480/healthz        # k8s readiness — 503 until startup completes
curl http://localhost:6480/health/live    # liveness probe
curl http://localhost:6480/health/ready   # WAL recovered, ready for queries
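A deployment script might wrap these probes as in the sketch below. The 503-while-starting behaviour comes from the note on `/healthz` above. Treating any 2xx response as ready is an assumption, and `probe_status` and `describe` are hypothetical helper names.

```python
import urllib.error
import urllib.request


def probe_status(url, timeout=2.0):
    """Return the HTTP status code for url, or None if the node is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # non-2xx responses such as 503 land here
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout


def describe(status):
    """Map a probe result to a coarse state (2xx == ready is an assumption)."""
    if status is None:
        return "unreachable"
    if status == 503:
        return "starting"
    if 200 <= status < 300:
        return "ready"
    return "unexpected"
```

For example, `describe(probe_status("http://localhost:6480/healthz"))` could be polled in a loop until it returns `"ready"`.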

Key Metrics

| Metric Category | Examples |
| --- | --- |
| Engine | Per-engine memory, query count, latency |
| Core | Per-core CPU, queue depth, io_uring completions |
| Connection | Active connections, auth failures |
| Replication | Raft log lag, replication latency |
| Storage | WAL fsync latency, segment count, compaction debt |
| Database | Per-database resource usage and performance |
| Tenant | Per-tenant query rates and memory usage |

Per-Database & Per-Tenant Metrics

Database-scoped and tenant-scoped metrics enable precise monitoring of multi-tenant deployments.

Per-Database Metrics

All metrics are labeled with database="<name>":

| Metric | Type | Description |
| --- | --- | --- |
| nodedb_database_qps | counter | Queries per second for the database |
| nodedb_database_memory_bytes | gauge | Memory currently used by the database |
| nodedb_database_storage_bytes | gauge | Storage footprint of all collections in the database |
| nodedb_database_connections | gauge | Active connections to the database |
| nodedb_database_bridge_queue_depth | gauge | Pending requests in SPSC bridge for the database |
| nodedb_database_wal_commit_latency_p99 | histogram | 99th percentile WAL commit latency for writes to the database |
| nodedb_database_maintenance_cpu_seconds | counter | Cumulative CPU time spent on background maintenance tasks |
| nodedb_database_mirror_lag_ms | gauge | Replication lag for mirror databases (zero if not mirrored) |

Per-Tenant Metrics

All metrics are labeled with database="<name>" and tenant="<name>":

| Metric | Type | Description |
| --- | --- | --- |
| nodedb_tenant_qps | counter | Queries per second for the tenant |
| nodedb_tenant_memory_bytes | gauge | Memory currently used by the tenant |
| nodedb_tenant_storage_bytes | gauge | Storage footprint of the tenant's data |

These metrics are emitted from the dispatch layer (qps), memory governor (memory), WAL (storage), connection listener (connections), and background scheduler (maintenance_cpu_seconds), ensuring consistent attribution across request processing.
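Because tenant metrics carry both labels, they can be rolled up per database on the client side. The sketch below sums `nodedb_tenant_memory_bytes` samples by their `database` label. The sample label values and byte counts are invented, and `parse_labels` and `memory_by_database` are hypothetical helpers.

```python
import re
from collections import defaultdict

# Matches key="value" pairs, allowing backslash-escaped characters in values.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_labels(label_str):
    """Turn 'database="app",tenant="acme"' into a dict."""
    return dict(LABEL_RE.findall(label_str))


def memory_by_database(samples):
    """Sum (label_string, value) samples of nodedb_tenant_memory_bytes per database."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[parse_labels(labels)["database"]] += value
    return dict(totals)


# Illustrative samples only.
tenant_samples = [
    ('database="app",tenant="acme"', 100.0),
    ('database="app",tenant="globex"', 50.0),
    ('database="logs",tenant="acme"', 25.0),
]
totals = memory_by_database(tenant_samples)
```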

Cross-shard transaction metrics

| Metric | Type | Description |
| --- | --- | --- |
| nodedb_sequencer_epochs_total | counter | Epochs proposed by the sequencer |
| nodedb_sequencer_epoch_duration_ms | histogram | Time to drain and propose each epoch |
| nodedb_sequencer_admitted_txns_total{outcome} | counter | Admission outcomes: admitted, rejected_conflict, rejected_inbox_full, rejected_txn_too_large, rejected_fanout_too_wide, rejected_tenant_quota, rejected_not_leader |
| nodedb_sequencer_inbox_depth | gauge | Pending transactions in the sequencer inbox |
| nodedb_calvin_scheduler_lock_wait_ms_total{vshard} | counter | Cumulative lock-wait time per shard |
| nodedb_calvin_executor_txn_duration_ms{vshard} | histogram | Per-shard execution time for cross-shard txns |
| nodedb_calvin_ollp_retries_total{predicate_class,outcome} | counter | OLLP retry outcomes (succeeded, retried, exhausted, circuit_open, tenant_budget_exceeded) |
| nodedb_calvin_ollp_circuit_state{predicate_class} | gauge | 0 = closed, 1 = half-open, 2 = open |
| nodedb_calvin_ollp_backoff_ms{predicate_class} | gauge | Current OLLP retry backoff delay |
| nodedb_calvin_infra_abort_total{reason} | counter | Infrastructure aborts (disk error, OOM, corruption) |
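Two of these metrics usually need a little client-side interpretation: the numeric circuit-state gauge and the per-outcome admission counters. The sketch below decodes the gauge using the 0/1/2 mapping from the table and computes the share of rejected transactions. The counts are invented, and `circuit_state` and `rejection_ratio` are hypothetical helpers.

```python
# Mapping taken from the nodedb_calvin_ollp_circuit_state description.
CIRCUIT_STATES = {0: "closed", 1: "half-open", 2: "open"}


def circuit_state(value):
    """Decode the gauge value into a state name."""
    return CIRCUIT_STATES.get(int(value), "unknown")


def rejection_ratio(outcomes):
    """Fraction of sequencer admissions whose outcome starts with 'rejected_'.

    outcomes maps each {outcome} label value to its counter value (or to its
    increase over some window).
    """
    total = sum(outcomes.values())
    if total == 0:
        return 0.0
    rejected = sum(v for k, v in outcomes.items() if k.startswith("rejected_"))
    return rejected / total


# Illustrative counts only.
ratio = rejection_ratio(
    {"admitted": 75, "rejected_conflict": 20, "rejected_inbox_full": 5}
)
```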

Memory backpressure metrics

| Metric | Type | Description |
| --- | --- | --- |
| nodedb_backpressure_critical_total{engine} | counter | Write handlers that entered the Critical-pressure flush path |
| nodedb_backpressure_emergency_total{engine} | counter | Write handlers rejected by Emergency-pressure |

IO priority metrics

| Metric | Type | Description |
| --- | --- | --- |
| nodedb_io_queue_depth{priority} | gauge | Pending tasks per IO priority tier (background, normal, high, critical) |
| nodedb_io_wait_ns{priority} | histogram | Submission-to-completion latency per tier |
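To show how a latency percentile is estimated from a bucketed histogram such as `nodedb_io_wait_ns`, the sketch below applies linear interpolation within the bucket that contains the target rank. It is a simplified approximation of what PromQL's `histogram_quantile` does for classic histograms, not NodeDB's implementation. The bucket boundaries and counts are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The bucket list must include the +Inf bucket, as in the Prometheus
    exposition format.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be within [0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2:
        raise ValueError("need at least one finite bucket plus +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        return buckets[-2][0]  # cannot interpolate into +Inf
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    if count == below:
        return upper
    # Assume observations are spread uniformly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Illustrative buckets in nanoseconds.
io_wait = [(1_000, 50), (10_000, 90), (100_000, 100), (math.inf, 100)]
p75 = histogram_quantile(0.75, io_wait)
```

The estimate is only as fine-grained as the bucket layout: every quantile that falls into the same bucket is interpolated between that bucket's two boundaries.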

CDC metrics

| Metric | Type | Description |
| --- | --- | --- |
| nodedb_cdc_events_dropped_total{tenant,stream} | counter | Events dropped from a named stream's buffer due to overflow (per stream, not global) |

Alert when this counter increases for a stream whose consumer is active: it means the consumer is falling behind.
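The alert condition can be sketched as a check over a window of scraped counter values. The reset handling (treating a drop in value as a restart from zero) follows the usual Prometheus counter convention. How "consumer is active" is determined is left to the caller, and `counter_increase` and `should_alert` are hypothetical helpers.

```python
def counter_increase(samples):
    """Total increase across consecutive counter samples, tolerating resets."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        # A lower value than before means the counter restarted from zero.
        total += curr - prev if curr >= prev else curr
    return total


def should_alert(samples, consumer_active):
    """Fire only when events were dropped while the consumer was active."""
    return consumer_active and counter_increase(samples) > 0
```

For instance, `should_alert([10, 10, 14], consumer_active=True)` fires, while a flat series or an inactive consumer does not.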

Corruption quarantine metrics

See Corruption Quarantine for the full quarantine runbook.

| Metric | Type | Description |
| --- | --- | --- |
| nodedb_segments_quarantined_total{engine,collection} | counter | Cumulative segments quarantined since startup |
| nodedb_segments_quarantined_active{engine,collection} | gauge | Segments currently in quarantine |