Monitoring

Prometheus Metrics

curl http://localhost:6480/metrics

The endpoint exposes 70+ system metrics: per-engine, per-core, connection, query, replication, and storage. Latency is reported as a histogram with 13 buckets.
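
The endpoint output can be consumed by any Prometheus scraper. For ad hoc checks, the following is a minimal sketch of a parser for the Prometheus text exposition format. It is a simplification: it ignores timestamps and assumes label values contain no commas. The sample label values (`app`, `analytics`) and the helper names are illustrative, not taken from NodeDB.

```python
import urllib.request
from collections import defaultdict


def fetch_metrics(url="http://localhost:6480/metrics", timeout=5.0):
    """Download the raw exposition text (not called at import time)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text):
    """Parse exposition lines into {metric_name: [(labels, value), ...]}."""
    samples = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The sample value is the last space-separated field.
        series, _, value = line.rpartition(" ")
        name, _, rest = series.partition("{")
        labels = {}
        if rest:
            # Simplification: splits on commas, so label values must not contain them.
            for pair in rest.rstrip("}").split(","):
                if pair:
                    key, _, val = pair.partition("=")
                    labels[key.strip()] = val.strip().strip('"')
        samples[name.strip()].append((labels, float(value)))
    return dict(samples)


# Hypothetical sample output, for illustration only.
SAMPLE = """\
# TYPE nodedb_database_connections gauge
nodedb_database_connections{database="app"} 12
nodedb_database_connections{database="analytics"} 3
"""

parsed = parse_metrics(SAMPLE)
```

Against a running instance, `parse_metrics(fetch_metrics())` would return the same structure for the live metrics.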

PromQL Engine

A full Prometheus query engine is served at /obsv/api. To use it from Grafana, add this URL as a Prometheus data source. All Tier 1, 2, and 3 functions are supported (rate, irate, delta, histogram_quantile, holt_winters, etc.).
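
Because Grafana treats the endpoint as a Prometheus data source, it presumably follows the standard Prometheus HTTP API layout. The sketch below builds an instant-query URL under that assumption. The `/v1/query` path is inferred from the Prometheus API convention and is not confirmed by this page; `instant_query_url` is an illustrative helper.

```python
from urllib.parse import urlencode

# Base URL from this page; the /v1/query suffix below is an assumption.
BASE = "http://localhost:6480/obsv/api"


def instant_query_url(expr, base=BASE, time=None):
    """Build a Prometheus-style instant query URL for a PromQL expression."""
    params = {"query": expr}
    if time is not None:
        params["time"] = str(time)  # evaluation timestamp, Unix seconds
    return f"{base}/v1/query?{urlencode(params)}"


url = instant_query_url("rate(nodedb_database_qps[5m])")
```

The resulting URL can be fetched with `curl` or any HTTP client. `urlencode` takes care of escaping the parentheses and brackets in the expression.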

Prometheus Remote Write/Read

Use NodeDB as a long-term Prometheus storage backend:

remote_write:
  - url: "http://nodedb:6480/obsv/api/v1/write"
remote_read:
  - url: "http://nodedb:6480/obsv/api/v1/read"

OpenTelemetry

OTLP ingest: metrics, traces, and logs via HTTP (port 4318) and gRPC (port 4317)
OTLP export: push NodeDB's own traces and metrics to any OTLP collector

These capabilities are feature-gated: --features otel, --features promql, --features monitoring.

Health Checks

curl http://localhost:6480/healthz        # k8s readiness — 503 until startup completes
curl http://localhost:6480/health/live    # liveness probe
curl http://localhost:6480/health/ready   # WAL recovered, ready for queries
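
For scripted checks, the probes can be wrapped in a small helper. This is a sketch under two assumptions: a 2xx status means the check passed on all three endpoints, and 503 means startup is still in progress (documented here only for /healthz). The names `probe` and `classify` are illustrative.

```python
import urllib.error
import urllib.request


def probe(url, timeout=2.0, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # non-2xx responses, e.g. 503 during startup
    except OSError:
        return None  # connection refused, timeout, DNS failure, ...


def classify(status):
    """Map a status code to a coarse health state (assumed semantics)."""
    if status is None:
        return "unreachable"
    if 200 <= status < 300:
        return "ok"
    if status == 503:
        return "starting"
    return "unhealthy"
```

A typical use would be `classify(probe("http://localhost:6480/health/ready"))` in a deployment script that waits for the "ok" state before sending traffic.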

Graph Statistics

For per-collection edge counts and graph cardinality, use SHOW GRAPH STATS:

SHOW GRAPH STATS 'collection_name' VERBOSE;
SHOW GRAPH STATS 'collection_name' AS OF SYSTEM TIME <ms>;

See Graph Engine for complete SHOW GRAPH STATS documentation.

Key Metrics

Metric Category	Examples
Engine	Per-engine memory, query count, latency
Core	Per-core CPU, queue depth, io_uring completions
Connection	Active connections, auth failures
Replication	Raft log lag, replication latency
Storage	WAL fsync latency, segment count, compaction debt
Database	Per-database resource usage and performance
Tenant	Per-tenant query rates and memory usage

Per-Database & Per-Tenant Metrics

Database-scoped and tenant-scoped metrics enable precise monitoring of multi-tenant deployments.

Per-Database Metrics

All metrics labeled with database="<name>":

Metric	Type	Description
`nodedb_database_qps`	counter	Queries per second for the database
`nodedb_database_memory_bytes`	gauge	Memory currently used by the database
`nodedb_database_storage_bytes`	gauge	Storage footprint of all collections in the database
`nodedb_database_connections`	gauge	Active connections to the database
`nodedb_database_bridge_queue_depth`	gauge	Pending requests in SPSC bridge for the database
`nodedb_database_wal_commit_latency_p99`	histogram	99th percentile WAL commit latency for writes to the database
`nodedb_database_maintenance_cpu_seconds`	counter	Cumulative CPU time spent on background maintenance tasks
`nodedb_database_mirror_lag_ms`	gauge	Replication lag for mirror databases (zero if not mirrored)

Per-Tenant Metrics

All metrics labeled with database="<name>" and tenant="<name>":

Metric	Type	Description
`nodedb_tenant_qps`	counter	Queries per second for the tenant
`nodedb_tenant_memory_bytes`	gauge	Memory currently used by the tenant
`nodedb_tenant_storage_bytes`	gauge	Storage footprint of the tenant's data

These metrics are emitted at their source: the dispatch layer (qps), the memory governor (memory), the WAL (storage), the connection listener (connections), and the background scheduler (maintenance_cpu_seconds). This keeps attribution consistent across request processing.
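
The qps metrics are listed with type counter. If they behave like standard Prometheus counters (a cumulative count that only resets on restart), a per-second rate has to be derived from successive samples. The sketch below shows that calculation under this assumption. It is a simplified version of what PromQL's `rate()` does and omits Prometheus's boundary extrapolation.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter.

    samples: list of (unix_seconds, counter_value) tuples, oldest first.
    Returns None when fewer than two samples or no elapsed time.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter was reset; count the new value from zero.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical scrape results at 15-second intervals.
qps = per_second_rate([(0, 100), (15, 400), (30, 700)])
```

In PromQL the equivalent would be an expression such as `rate(nodedb_tenant_qps[5m])`, again assuming cumulative counter semantics.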

Cross-shard transaction metrics

Metric	Type	Description
`nodedb_sequencer_epochs_total`	counter	Epochs proposed by the sequencer
`nodedb_sequencer_epoch_duration_ms`	histogram	Time to drain and propose each epoch
`nodedb_sequencer_admitted_txns_total{outcome}`	counter	Admission outcomes: `admitted`, `rejected_conflict`, `rejected_inbox_full`, `rejected_txn_too_large`, `rejected_fanout_too_wide`, `rejected_tenant_quota`, `rejected_not_leader`
`nodedb_sequencer_inbox_depth`	gauge	Pending transactions in the sequencer inbox
`nodedb_calvin_scheduler_lock_wait_ms_total{vshard}`	counter	Cumulative lock-wait time per shard
`nodedb_calvin_executor_txn_duration_ms{vshard}`	histogram	Per-shard execution time for cross-shard txns
`nodedb_calvin_ollp_retries_total{predicate_class,outcome}`	counter	OLLP retry outcomes (`succeeded`, `retried`, `exhausted`, `circuit_open`, `tenant_budget_exceeded`)
`nodedb_calvin_ollp_circuit_state{predicate_class}`	gauge	0 = closed, 1 = half-open, 2 = open
`nodedb_calvin_ollp_backoff_ms{predicate_class}`	gauge	Current OLLP retry backoff delay
`nodedb_calvin_infra_abort_total{reason}`	counter	Infrastructure aborts (disk error, OOM, corruption)
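
As an illustration of how the admission and circuit metrics might be read together, the sketch below computes the rejected share of sequencer admissions from per-outcome counts and decodes the circuit state gauge. The counts are made-up sample values; in practice one would feed in counter increases over a window rather than lifetime totals.

```python
# Gauge encoding of nodedb_calvin_ollp_circuit_state, as listed in the table above.
CIRCUIT_STATES = {0: "closed", 1: "half-open", 2: "open"}


def rejection_breakdown(counts):
    """Summarize admission outcomes.

    counts: {outcome: count} for nodedb_sequencer_admitted_txns_total.
    Returns (rejected_fraction, {rejected_outcome: share_of_total}).
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0, {}
    shares = {
        outcome: count / total
        for outcome, count in counts.items()
        if outcome != "admitted"
    }
    return sum(shares.values()), shares


# Hypothetical sample counts.
fraction, shares = rejection_breakdown(
    {"admitted": 90, "rejected_conflict": 6, "rejected_inbox_full": 4}
)
state = CIRCUIT_STATES.get(2, "unknown")
```

A rising `rejected_inbox_full` share alongside a growing `nodedb_sequencer_inbox_depth` would point at sequencer saturation rather than data conflicts.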

Memory backpressure metrics

Metric	Type	Description
`nodedb_backpressure_critical_total{engine}`	counter	Write handlers that entered the Critical-pressure flush path
`nodedb_backpressure_emergency_total{engine}`	counter	Write handlers rejected under Emergency pressure

IO priority metrics

Metric	Type	Description
`nodedb_io_queue_depth{priority}`	gauge	Pending tasks per IO priority tier (`background`, `normal`, `high`, `critical`)
`nodedb_io_wait_ns{priority}`	histogram	Submission-to-completion latency per tier
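
Histogram metrics such as `nodedb_io_wait_ns` are usually summarized with `histogram_quantile`. The sketch below reproduces the core of that estimate, linear interpolation inside the bucket that contains the target rank, to show how a percentile is derived from cumulative bucket counts. The bucket boundaries are hypothetical, and the actual bucket layout is not documented here.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), including (math.inf, total).
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0 or not 0 <= q <= 1:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            in_bucket = count - lower_count
            if math.isinf(upper_bound) or in_bucket == 0:
                # Cannot interpolate: fall back to the highest known bound.
                return lower_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical buckets in nanoseconds: (le, cumulative count).
BUCKETS = [(1_000, 50), (10_000, 90), (100_000, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, BUCKETS)
```

The estimate is only as precise as the bucket boundaries allow: every observation is assumed to be spread evenly within its bucket.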

CDC metrics

Metric	Type	Description
`nodedb_cdc_events_dropped_total{tenant,stream}`	counter	Events dropped from a named stream's buffer due to overflow (counted per stream, not globally)

Alert when this counter increases for a stream whose consumer is active: a rising value means the consumer is falling behind.
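
The alert condition can be sketched as a comparison between two scrapes of the counter. How "consumer is active" is determined is not specified here, so the sketch assumes the caller supplies that set. The tenant and stream names are illustrative.

```python
def streams_dropping_events(previous, current, active_streams):
    """Find active streams whose dropped-event counter increased.

    previous, current: {(tenant, stream): nodedb_cdc_events_dropped_total value}
    active_streams: set of (tenant, stream) keys with a live consumer (caller-supplied).
    Returns a sorted list of ((tenant, stream), newly_dropped).
    """
    alerts = []
    for key, now in sorted(current.items()):
        # A series seen for the first time has no baseline, so it does not alert.
        before = previous.get(key, now)
        if key in active_streams and now > before:
            alerts.append((key, now - before))
    return alerts


# Hypothetical values from two consecutive scrapes.
before = {("t1", "orders"): 10, ("t1", "audit"): 0}
after = {("t1", "orders"): 25, ("t1", "audit"): 4}
alerts = streams_dropping_events(before, after, {("t1", "orders")})
```

Here only `orders` alerts: `audit` also dropped events, but it has no active consumer, which matches the guidance above.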

Corruption quarantine metrics

See Corruption Quarantine for the full quarantine runbook.

Metric	Type	Description
`nodedb_segments_quarantined_total{engine,collection}`	counter	Cumulative segments quarantined since startup
`nodedb_segments_quarantined_active{engine,collection}`	gauge	Segments currently in quarantine