--- # What is NodeDB (docs/introduction/what-is-nodedb) --- title: What is NodeDB description: A distributed hybrid database with eight peer engines for multi-modal AI and agentic workloads. --- # What is NodeDB NodeDB is a single Rust binary that provides eight peer engines — Document (schemaless), Document (strict), Key-Value, Columnar, Timeseries, Spatial, Vector, and Array — sharing one storage core, plus Graph and Full-Text Search as cross-engine overlays on any collection. Seven of the eight are selected per-collection via `WITH (engine='…')`; Array uses its own `CREATE ARRAY` DDL family. Each engine is built with purpose-specific data structures. All share the same storage, memory, and query planner. Cross-engine queries execute in one process with zero network hops. ## The Problem Modern applications don't fit in one database. A healthcare app needs patient records (relational), medical imaging embeddings (vector), care team relationships (graph), device telemetry (timeseries), and offline-first sync for field workers (CRDT). The industry answer is a polyglot stack: PostgreSQL for relational, Qdrant for vectors, Neo4j for graphs, ClickHouse for timeseries, Redis for caching. Each system has its own protocol, deployment, and failure modes. Cross-database queries require application-level joins. Some databases claim to solve this by bolting on capabilities — "graph" as recursive JOINs, "timeseries" without columnar compression or continuous aggregation. The features exist in name but not in performance. ## Engines and Overlays **Vector** — HNSW index with SQ8, PQ, and IVF-PQ quantization. SIMD-accelerated distance math. Adaptive bitmap pre-filtering. **Graph** (overlay) — CSR adjacency index with 13 native algorithms (PageRank, WCC, Louvain, SSSP, etc.) and a Cypher-subset MATCH pattern engine. GraphRAG fusion with vector search. **Document** — Two modes, chosen per collection. Schemaless: MessagePack blobs with CRDT sync. Strict: Binary Tuples with O(1) field extraction and 3-4x cache density over BSON. **Columnar / Timeseries / Spatial** — Three peer engines sharing one compressed-column storage core (ALP, FastLanes, FSST, Gorilla, LZ4 codecs; block statistics; predicate pushdown). `columnar` for general analytics; `timeseries` adds append-only ingest, retention, and continuous aggregation; `spatial` adds R*-tree, geohash, and H3 indexing. **Key-Value** — Hash-indexed O(1) point lookups with typed value fields. Native TTL, secondary indexes, atomic INCR/CAS, and predicate-filtered scans. SQL-queryable and joinable. **Full-Text Search** (overlay) — Block-Max WAND optimized BM25 with 16 Snowball stemmers, 27-language stop words, CJK bigram tokenization, posting compression, fuzzy matching, and native hybrid vector fusion. **Array** — ND sparse multi-dimensional engine with tile-based storage, Z-order indexing, per-tile MBR statistics, and bitemporal cells. Replaces TileDB / Zarr / SciDB / Rasdaman for genomics, single-cell biology, raster cubes, and climate models. **CRDT** (sync layer) — Loro-backed conflict-free replicated data types. AP on the edge, CP in the cloud. SQL constraint validation at sync time with compensation hints. ## Three Deployment Modes **Origin (server)** — Full distributed database. Multi-Raft consensus, Thread-per-Core Data Plane with io_uring, PostgreSQL-compatible SQL over pgwire. Horizontal scaling with automatic shard rebalancing. **Origin (local)** — Same binary, single-node. No cluster overhead. Like running PostgreSQL locally.
**NodeDB-Lite (embedded)** — In-process library for phones, browsers (WASM), and desktops. All eight engines run locally with sub-millisecond reads. CRDT sync to Origin via WebSocket. ## PostgreSQL Compatible Connect with `psql`, any PostgreSQL driver, or ORM. Six wire protocols are available: - **pgwire** — PostgreSQL wire protocol (port 6432) - **HTTP** — REST, SSE, WebSocket (port 6480) - **NDB** — Native MessagePack protocol (port 6433) - **RESP** — Redis-compatible KV protocol (optional) - **ILP** — InfluxDB Line Protocol for timeseries ingest (optional) - **Sync** — WebSocket sync for NodeDB-Lite clients (port 9090) ## What NodeDB Replaces The combination of PostgreSQL + pgvector + Redis + Neo4j + ClickHouse + Elasticsearch — unified into one binary with shared storage and zero network hops between engines. --- # Quick Start (docs/introduction/quickstart) --- title: Quick Start description: Start NodeDB and run your first queries across multiple engines in under five minutes. --- # Quick Start This guide takes you from zero to running multi-engine queries in five minutes. ## 1. Start NodeDB ```bash # Linux: prebuilt binary (recommended — best performance) gh release download --repo NodeDB-Lab/nodedb --pattern 'nodedb-*-linux-x64.tar.gz' \ && tar -xzf nodedb-*-linux-x64.tar.gz \ && ./nodedb # macOS / Windows / WSL2: Docker docker compose up -d # Or from source (any Linux with Rust 1.94+) ./target/release/nodedb ``` See [Installation](/docs/introduction/installation) for the full setup, configuration, and systemd unit examples. ## 2. Connect ```bash psql -h localhost -p 6432 ``` ## 3. Documents (Schemaless) ```sql CREATE COLLECTION users; INSERT INTO users (id, name, email, age) VALUES ('u1', 'Alice', 'alice@example.com', 30); INSERT INTO users (id, name, email, role) VALUES ('u2', 'Bob', 'bob@example.com', 'admin'); -- Object literal syntax also works INSERT INTO users { name: 'Charlie', email: 'charlie@example.com', age: 25 }; SELECT name, email FROM users WHERE age > 25; ``` ## 4. Strict Documents (Schema-Enforced) ```sql CREATE COLLECTION orders ( id TEXT PRIMARY KEY, customer_id TEXT, total FLOAT, status TEXT, created_at TIMESTAMP ) WITH (engine='document_strict'); INSERT INTO orders (id, customer_id, total, status) VALUES ('o1', 'u1', 99.99, 'pending'); SELECT * FROM orders WHERE status = 'pending' ORDER BY created_at DESC; ``` ## 5. Vector Search ```sql CREATE COLLECTION articles; CREATE VECTOR INDEX idx_embed ON articles METRIC cosine DIM 384; INSERT INTO articles (id, title, embedding) VALUES ('a1', 'Intro to AI', ARRAY[0.1, 0.2, 0.3, ...]); SEARCH articles USING VECTOR(embedding, ARRAY[0.1, 0.3, ...], 10); ``` ## 6. Graph ```sql CREATE COLLECTION social; INSERT INTO social (id, name) VALUES ('alice', 'Alice'); INSERT INTO social (id, name) VALUES ('bob', 'Bob'); GRAPH INSERT EDGE IN 'social' FROM 'alice' TO 'bob' TYPE 'knows' PROPERTIES { since: 2020 }; GRAPH TRAVERSE FROM 'alice' DEPTH 2; GRAPH ALGO PAGERANK ON social DAMPING 0.85 ITERATIONS 20 TOLERANCE 1e-7; ``` ## 7. Key-Value ```sql CREATE COLLECTION sessions (key TEXT PRIMARY KEY) WITH (engine='kv'); INSERT INTO sessions (key, value) VALUES ('sess_abc', 'token-abc'); SELECT * FROM sessions WHERE key = 'sess_abc'; ``` ## 8. Columnar (Analytics) ```sql CREATE COLLECTION web_events ( ts TIMESTAMP, user_id UUID, page VARCHAR, duration_ms INT ) WITH (engine='columnar'); SELECT page, AVG(duration_ms), COUNT(*) FROM web_events WHERE ts > now() - INTERVAL '7 days' GROUP BY page ORDER BY COUNT(*) DESC; ``` ## 9. 
Timeseries ```sql CREATE COLLECTION cpu_metrics ( ts TIMESTAMP TIME_KEY, host VARCHAR, cpu FLOAT ) WITH (engine='timeseries', partition_by='1h'); SELECT time_bucket('5 minutes', ts) AS bucket, host, AVG(cpu) FROM cpu_metrics WHERE ts > now() - INTERVAL '1 hour' GROUP BY bucket, host; ``` ## 10. Triggers ```sql CREATE TRIGGER notify_on_order AFTER INSERT ON orders FOR EACH ROW $$ BEGIN INSERT INTO notifications (id, user_id, message) VALUES (NEW.id || '_notif', NEW.customer_id, 'Order received'); END; $$; ``` ## What's Next - [Architecture](/docs/architecture/overview) — how the three-plane execution model works - [Data Modeling](/docs/data-modeling/collections) — collections, schemas, and engine selection - [SQL Reference](/docs/sql/overview) — full query language reference --- # Installation (docs/introduction/installation) --- title: Installation description: Install NodeDB via prebuilt binary, Docker, or from source. Linux kernel 5.1+ required. --- # Installation NodeDB requires Linux kernel 5.1+ (for io_uring), regardless of how you install it. There are three ways to install NodeDB: 1. [Prebuilt binary](#prebuilt-binary-linux) — **recommended on Linux.** Direct kernel access to io_uring, no virtualization overhead, best raw performance. 2. [Docker](/docs/introduction/docker) — **recommended on macOS / Windows / WSL2**, or when you want a one-command setup with zero host configuration. 3. [Build from source](#build-from-source) — for development or custom features. All three share the same [configuration](#configuration) and connection paths described below. ## Prebuilt binary (Linux) Each tagged release ships a static `nodedb` tarball on GitHub for `linux-x64` and `linux-arm64`. macOS and Windows users should use [Docker](/docs/introduction/docker) until those targets ship. ```bash # Resolve the latest tag and your architecture TAG=$(curl -fsSL https://api.github.com/repos/NodeDB-Lab/nodedb/releases/latest \ | grep '"tag_name"' | cut -d'"' -f4) ARCH=$(uname -m | sed 's/aarch64/arm64/; s/x86_64/x64/') # Download and extract curl -L -o nodedb.tar.gz \ "https://github.com/NodeDB-Lab/nodedb/releases/download/${TAG}/nodedb-${TAG#v}-linux-${ARCH}.tar.gz" tar -xzf nodedb.tar.gz # Optional: install system-wide sudo mv nodedb /usr/local/bin/ # Run with all defaults (data goes to ~/.nodedb/data) nodedb ``` If you have the [GitHub CLI](https://cli.github.com/) installed, this is one command: ```bash gh release download --repo NodeDB-Lab/nodedb --pattern 'nodedb-*-linux-x64.tar.gz' \ && tar -xzf nodedb-*-linux-x64.tar.gz ``` Run with a config file or a custom data directory: ```bash # Point at an explicit data dir NODEDB_DATA_DIR=/var/lib/nodedb nodedb # Or load a config file (env vars still override TOML keys) nodedb --config /etc/nodedb/nodedb.toml ``` For a long-running server, drop a unit file at `/etc/systemd/system/nodedb.service`: ```ini [Unit] Description=NodeDB After=network.target [Service] Type=simple User=nodedb Group=nodedb ExecStart=/usr/local/bin/nodedb --config /etc/nodedb/nodedb.toml Restart=on-failure LimitNOFILE=1048576 [Install] WantedBy=multi-user.target ``` Then `sudo systemctl enable --now nodedb`. The user/group must be able to read the config file and write `data_dir`. For a specific version or to browse changelogs, see the [GitHub releases page](https://github.com/NodeDB-Lab/nodedb/releases). The SQL surface is still pre-1.0 and changes between tags, so pin a version in production. ## Docker See [Docker](/docs/introduction/docker) for the full Compose and `docker run` setup.
The right choice on macOS, Windows, or any host where you don't want to manage a binary directly. ## Build from Source Requires Rust 1.94+ and a Linux host. ```bash git clone https://github.com/NodeDB-Lab/nodedb.git cd nodedb # Release build (all crates) cargo build --release # Run tests (use nextest — the cluster integration tests rely on # the test groups defined in .config/nextest.toml and will hang # under plain `cargo test`) cargo install cargo-nextest --locked # one-time cargo nextest run --all-features ``` The build produces two binaries: - `target/release/nodedb` — the database server - `target/release/ndb` — the terminal client (TUI with syntax highlighting, tab completion, history search) Start the server: ```bash ./target/release/nodedb # Or with a config file ./target/release/nodedb --config nodedb.toml ``` Default startup output: ``` 2026-01-01T00:00:00Z INFO nodedb: Starting NodeDB 2026-01-01T00:00:00Z INFO nodedb: Listening on 127.0.0.1:6433 (native) 2026-01-01T00:00:00Z INFO nodedb: Listening on 127.0.0.1:6432 (PostgreSQL) 2026-01-01T00:00:00Z INFO nodedb: Listening on 127.0.0.1:6480 (HTTP) 2026-01-01T00:00:00Z INFO nodedb: Data directory: ~/.nodedb/data ``` ## Configuration This section applies to **every** install method — prebuilt binary, Docker, and source builds all read the same TOML schema and respond to the same environment variables. Pick whichever is convenient: - **TOML file** — pass `--config /path/to/nodedb.toml` on the command line. Best for production / systemd / pre-baked images. - **Environment variables** — prefix `NODEDB_*`. Best for Docker (`-e`), Compose (`environment:`), and Kubernetes. Env vars **override** values from the TOML file when both are set. All protocols share one bind address (`host`); only the port differs per protocol. ```toml # nodedb.toml [server] host = "127.0.0.1" data_dir = "/var/lib/nodedb" memory_limit = "4GiB" data_plane_cores = 4 max_connections = 1024 log_format = "text" [server.ports] native = 6433 pgwire = 6432 http = 6480 resp = 6381 # Optional: set to enable ilp = 8086 # Optional: set to enable ``` | Config field | Environment variable | Default | | ------------------ | ------------------------- | ---------------------------------------------------- | | `host` | `NODEDB_HOST` | `127.0.0.1` | | `ports.native` | `NODEDB_PORT_NATIVE` | `6433` | | `ports.pgwire` | `NODEDB_PORT_PGWIRE` | `6432` | | `ports.http` | `NODEDB_PORT_HTTP` | `6480` | | `ports.resp` | `NODEDB_PORT_RESP` | disabled | | `ports.ilp` | `NODEDB_PORT_ILP` | disabled | | `data_dir` | `NODEDB_DATA_DIR` | `~/.nodedb/data` (binary), `/var/lib/nodedb` (Docker) | | `memory_limit` | `NODEDB_MEMORY_LIMIT` | `1GiB` | | `data_plane_cores` | `NODEDB_DATA_PLANE_CORES` | CPUs - 1 | | `max_connections` | `NODEDB_MAX_CONNECTIONS` | `4096` | | `log_format` | `NODEDB_LOG_FORMAT` | `text` | ## Connect ```bash # With the ndb TUI client (source build) ./target/release/ndb # With psql psql -h localhost -p 6432 # Health check via HTTP curl http://localhost:6480/health ``` ## System Requirements - **OS**: Linux (kernel 5.1+ for io_uring) — required for the binary and source builds; Docker users can run on any host that supports a Linux container with io_uring. - **Rust**: 1.94+ (source builds only) - **Memory**: 512 MiB minimum, 4+ GiB recommended - **Disk**: NVMe recommended for Data Plane I/O --- # Docker (docs/introduction/docker) --- title: Docker description: Run NodeDB with a single docker run command or Docker Compose. Single binary, all ports exposed. 
--- # Docker NodeDB is a single binary. One `docker run` command gets you a running server. Requires Linux kernel 5.1+ (for io_uring). ## Quick Start ```bash docker run -d --name nodedb \ -p 6432:6432 \ -p 6433:6433 \ -p 6480:6480 \ -p 9090:9090 \ -v nodedb-data:/var/lib/nodedb \ farhansyah/nodedb ``` That's it. Connect with `psql -h localhost -p 6432` or `curl http://localhost:6480/health`. ## Ports | Port | Protocol | Required | | ------ | --------------------------------- | -------- | | `6432` | PostgreSQL wire protocol (pgwire) | Yes | | `6433` | Native MessagePack protocol (NDB) | Yes | | `6480` | HTTP API (REST, SSE, WebSocket) | Yes | | `9090` | WebSocket sync (NodeDB-Lite) | Yes | RESP (Redis) and ILP (InfluxDB) are disabled by default. Enable via environment variables: ```bash docker run -d --name nodedb \ -p 6432:6432 \ -p 6433:6433 \ -p 6480:6480 \ -p 9090:9090 \ -p 6381:6381 \ -p 8086:8086 \ -v nodedb-data:/var/lib/nodedb \ -e NODEDB_PORT_RESP=6381 \ -e NODEDB_PORT_ILP=8086 \ farhansyah/nodedb ``` ## Docker Compose ```yaml services: nodedb: image: farhansyah/nodedb ports: - "6432:6432" # pgwire - "6433:6433" # native - "6480:6480" # HTTP - "9090:9090" # sync volumes: - nodedb-data:/var/lib/nodedb environment: NODEDB_MEMORY_LIMIT: "4GiB" # NODEDB_PORT_RESP: "6381" # uncomment to enable Redis protocol # NODEDB_PORT_ILP: "8086" # uncomment to enable ILP ingest volumes: nodedb-data: ``` ```bash docker compose up -d ``` ## Stop ```bash # Stop (data preserved) docker compose down # Stop and wipe all data docker compose down -v # Or with docker run docker stop nodedb && docker rm nodedb # Data persists in the nodedb-data volume ``` ## Verify ```bash curl http://localhost:6480/health ``` ## Environment Variables | Variable | Default | Description | | ------------------------- | ---------- | ---------------------------- | | `NODEDB_MEMORY_LIMIT` | 75% of RAM | Total memory budget | | `NODEDB_DATA_PLANE_CORES` | CPUs - 1 | Number of Data Plane threads | | `NODEDB_LOG_FORMAT` | `text` | `text` or `json` | | `NODEDB_HOST` | `0.0.0.0` | Bind address | | `NODEDB_PORT_RESP` | disabled | Set to enable Redis protocol | | `NODEDB_PORT_ILP` | disabled | Set to enable ILP ingest | | `NODEDB_DATA_DIR` | `/var/lib/nodedb` | Data directory inside container | ## Custom Port Mapping Remap any port on the host side. The container always listens on the same internal ports: ```bash docker run -d --name nodedb \ -p 5432:6432 \ -p 8080:6480 \ -v nodedb-data:/var/lib/nodedb \ farhansyah/nodedb ``` --- # Architecture Overview (docs/architecture/overview) --- title: Architecture Overview description: The three-plane execution model — Control Plane, Data Plane, and Event Plane connected by lock-free ring buffers. --- # Architecture Overview NodeDB splits work across three planes connected by lock-free ring buffers. Each plane does exactly what it is best at and nothing else. Mixing planes is a correctness bug.
``` ┌───────────────────────────────────────────┐ │ Control Plane (Tokio) │ │ SQL parsing, query planning, connections │ │ Send + Sync, async │ └─────────────┬──────────────┬──────────────┘ │ SPSC Bridge │ Event subscriptions │ │ ┌─────────────▼──────────┐ ┌▼────────────────────────────────────┐ │ Data Plane (TPC) │ │ Event Plane (Tokio) │ │ Physical execution ├─►│ AFTER trigger dispatch │ │ Storage I/O, SIMD │ │ CDC change streams │ │ !Send, io_uring │ │ Cron scheduler │ │ Emits WriteEvents │ │ Durable pub/sub, webhook delivery │ └────────────────────────┘ └─────────────────────────────────────┘ ``` ## Plane Boundaries | Plane | Does | Does not do | | ------------- | -------------------------------------------------------------- | ----------------------------------------------- | | Control Plane | SQL parsing, query planning, connection handling | Event processing, trigger execution, storage I/O | | Data Plane | Physical I/O, SIMD math, WAL append, BEFORE triggers | Event delivery, AFTER triggers, cross-shard work | | Event Plane | AFTER triggers, CDC, cron, webhooks, durable pub/sub | Query planning, storage I/O, TPC tasks | If code needs to cross a plane boundary, it goes through the SPSC bridge (Control-Data) or the Event Bus (Data-Event). ## Query Entry Paths **SQL path** — All user-facing interfaces accept SQL. The Control Plane parses via sqlparser, plans via EngineRules, and dispatches a `SqlPlan` through the SPSC bridge to the Data Plane. ``` psql / ndb CLI / HTTP /v1/query → SQL parser (sqlparser-rs) → EngineRules::plan_*() → SqlPlan → SPSC Bridge → Data Plane ``` **Native opcode path** — The Rust SDK and FFI/WASM bindings dispatch typed opcode messages over the NDB protocol. The Control Plane converts them directly to a plan, skipping SQL parsing. ``` nodedb-client / FFI / WASM → Native opcode + typed fields → build_plan() → PhysicalPlan → SPSC Bridge → Data Plane ``` Both paths produce the same plan and execute identically on the Data Plane. ## Cross-Engine Identity Every row in every engine — document, KV, columnar, timeseries, spatial, vector, array, graph node, FTS posting — carries a stable global `u32` **surrogate** allocated at insert from a WAL-durable, Raft-replicated monotonic counter. Every engine keys its internal indexes on the surrogate, so cross-engine prefilter and join reduce to roaring-bitmap intersections with **zero per-query translation**. A query like "find product cells whose embedding is near `$q`, that have FTS hits for `'memory leak'`, that live within 5km of point P, in tenant 42" turns into: ``` vector_index.search($q, k) → roaring bitmap A fts_index.match("memory leak") → roaring bitmap B spatial_index.dwithin(P, 5km) → roaring bitmap C metadata_index.tenant_id = 42 → roaring bitmap D A ∩ B ∩ C ∩ D → final candidate surrogate set ``` No `HashMap` translations between hops. Adding a new engine does not require new translation paths — it just allocates surrogates from the same counter. This is the mechanism behind every cross-engine query example you'll see in the SQL reference. --- # Control Plane (docs/architecture/control-plane) --- title: Control Plane description: Tokio-based async runtime for SQL parsing, query planning, and connection handling. --- # Control Plane The Control Plane runs on the standard Tokio thread pool. All types are `Send + Sync`. 
It handles three responsibilities: ## Connection Handling Accepts connections on all wire protocols: - **pgwire** (port 6432) — PostgreSQL wire protocol - **HTTP** (port 6480) — REST, SSE, WebSocket - **NDB** (port 6433) — Native MessagePack protocol - **RESP** — Redis-compatible (optional) - **ILP** — InfluxDB Line Protocol (optional) - **Sync** (port 9090) — WebSocket for NodeDB-Lite clients Each connection is a Tokio task. Connection state (prepared statements, session variables, transaction context) lives on the connection task. ## Query Planning SQL text arrives from any protocol and flows through: 1. **sqlparser-rs** — parses SQL text into an AST 2. **nodedb-sql EngineRules** — resolves the target engine and produces a `SqlPlan` 3. **Plan conversion** — `SqlPlan` becomes a `PhysicalPlan` for the Data Plane `EngineRules` is the single source of truth for what each engine supports. One implementation per engine type (vector, graph, document, columnar, kv, fts, crdt). The compiler enforces exhaustive handling — adding a new engine or operation requires implementing every trait method. ## Dispatch The Control Plane dispatches `PhysicalPlan` messages to Data Plane cores via the SPSC bridge. Routing is by vShard — each request targets a specific shard, which maps to a specific core. The Control Plane never touches storage, never runs SIMD math, and never processes events. Those responsibilities belong to the Data Plane and Event Plane respectively. --- # Data Plane (docs/architecture/data-plane) --- title: Data Plane description: Thread-per-Core execution with io_uring for physical query execution and storage I/O. --- # Data Plane The Data Plane uses a Thread-per-Core (TPC) architecture. Each CPU core runs an isolated, shared-nothing shard. Types are `!Send` by design — no data crosses core boundaries. ## Execution Model Each core owns: - A dedicated event loop (no Tokio — raw TPC) - io_uring submission and completion queues for NVMe I/O - A jemalloc arena (no allocator lock contention) - Lock-free telemetry ring buffers for metrics There are no locks, no atomics, and no cross-core sharing. The Data Plane achieves predictable latency by eliminating all sources of contention. ## What the Data Plane Does - Executes `PhysicalPlan` nodes dispatched from the Control Plane - Reads from NVMe via io_uring - Runs SIMD-accelerated vector distance math - Appends to the WAL (O_DIRECT) - Evaluates BEFORE triggers (synchronous, same transaction) - Emits `WriteEvent` records to the Event Plane via per-core ring buffers ## What the Data Plane Does Not Do - Spawn Tokio tasks - Handle HTTP or pgwire connections - Process AFTER triggers or CDC events - Coordinate across shards ## WriteEvent Emission After each successful WAL commit, the Data Plane emits a `WriteEvent` containing: - `sequence` — monotonic per-core counter - `collection` — target collection name - `op` — Insert, Update, or Delete - `row_id`, `lsn`, `tenant_id`, `vshard_id` - `source` — User, Trigger, RaftFollower, or CrdtSync - `new_value`, `old_value` — for trigger and CDC consumption Events are fire-and-forget — the Data Plane never blocks waiting for the Event Plane. If the ring buffer overflows, the Event Plane replays from the WAL. ## Page Fault Hazard A major page fault on an mmap region blocks the faulting TPC thread, stalling the entire shard's reactor. The Data Plane pre-fetches pages asynchronously via `io_uring IORING_OP_READ` or `madvise(MADV_WILLNEED)` before compute touches them. 
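Of the Data Plane responsibilities above, BEFORE triggers are worth a concrete sketch: they run synchronously on the shard's core, inside the writing transaction, so a failure rejects the write before WAL append. The example below assumes the `BEFORE` form mirrors the `AFTER` trigger syntax shown in the Quick Start and that a PostgreSQL-style `RAISE` is available — neither is confirmed elsewhere in these docs:

```sql
-- Hypothetical sketch: the BEFORE form and RAISE are assumptions.
CREATE TRIGGER validate_total
BEFORE INSERT ON orders
FOR EACH ROW
$$
BEGIN
  -- Evaluated on the Data Plane in the same transaction as the INSERT;
  -- raising here aborts the row before it reaches the WAL.
  IF NEW.total < 0 THEN
    RAISE EXCEPTION 'total must be non-negative';
  END IF;
END;
$$;
```

AFTER triggers, by contrast, are dispatched asynchronously by the Event Plane once the write has committed.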
--- # Event Plane (docs/architecture/event-plane) --- title: Event Plane description: Async event processing for triggers, CDC, cron scheduling, and durable pub/sub. --- # Event Plane The Event Plane is the third architectural layer. It runs on Tokio (`Send + Sync`) and handles all asynchronous, event-driven work. It never does storage I/O and never spawns TPC tasks. ## Responsibilities **AFTER trigger dispatch** — Trigger bodies execute asynchronously after the write commits. Retry with exponential backoff. Persistent failures go to a dead-letter queue. `EventSource` tagging prevents trigger cascades — the Event Plane skips non-User events. **CDC / change streams** — Routes `WriteEvent` records to matching change stream consumers. Per-partition offsets, consumer groups with rebalancing, and exactly-once transactional consumption. **Cron scheduler** — 1-second evaluation loop. Scheduled SQL is dispatched back through the Control Plane → Data Plane path. Per-collection affinity runs jobs on the collection's shard leader. **Durable pub/sub** — Named topics with consumer groups, offset tracking, and configurable retention. Consumers resume from last committed offset after disconnect. **Webhook delivery** — HTTP POST with exponential backoff retry and idempotency headers. ## Event Bus (Data → Event) The Event Bus consists of one bounded ring buffer per Data Plane core — no cross-core contention. - Data Plane emits `WriteEvent` records after WAL commit - Fire-and-forget: Data Plane never blocks on Event Plane - WAL-backed: if a ring buffer overflows, the Event Plane replays from the WAL LSN watermark ## Backpressure All Event Bus queues are bounded: - **> 85% utilization** — throttled, Data Plane continues but Event Plane signals pressure - **> 95% utilization** — suspended, Event Plane enters WAL Catchup Mode (replays from WAL instead of ring buffer) All backpressure transitions emit metrics and trace events. ## Side Effects When a trigger body or scheduled job produces a write, the Event Plane dispatches it back through the normal Control Plane → Data Plane path. The Event Plane handles routing and delivery, not compute. --- # SPSC Bridge (docs/architecture/spsc-bridge) --- title: SPSC Bridge description: Bounded lock-free ring buffers connecting the Control Plane and Data Plane. --- # SPSC Bridge The SPSC (Single-Producer, Single-Consumer) bridge is the only communication path between the Control Plane and Data Plane. No `Arc<Mutex<T>>`, no shared state — only bounded, lock-free ring buffers.
## Request Envelope Control Plane → Data Plane: | Field | Type | Purpose | | ------------------ | ---------------- | ---------------------------------- | | `request_id` | u64 | Correlates request with response | | `tenant_id` | u32 | Tenant scoping | | `vshard_id` | u16 | Routes to the correct core | | `plan` | PhysicalPlan | The physical execution plan | | `deadline_ns` | u64 | Absolute deadline (monotonic clock)| | `priority` | u8 | Scheduling priority | | `trace_id` | u64 | Distributed trace propagation | | `consistency` | ReadConsistency | Read consistency level | | `idempotency_key` | Option<…> | Dedup key for non-idempotent writes| | `event_source` | EventSource | User, Trigger, Raft, or CrdtSync | | `user_roles` | Vec<…> | Caller's RBAC roles | ## Response Envelope Data Plane → Control Plane: | Field | Type | Purpose | | ---------------- | -------- | -------------------------------- | | `request_id` | u64 | Matches the originating request | | `status` | u8 | Success, error, partial | | `attempt` | u32 | Retry attempt counter | | `partial` | bool | Whether more results follow | | `payload` | Payload | Response data | | `watermark_lsn` | u64 | Latest committed LSN | | `error_code` | u16 | Typed error (if status != ok) | ## Memory Ownership - Cross-plane payloads use `Arc<[u8]>` or slab IDs with explicit reclaim ACK - Producer owns buffer until ACK/NACK received - Zero-copy buffers remain valid until consumer ACK ## Ordering - Strict FIFO per `(connection_id, vshard_id)` stream - Best-effort ordering across streams - Cancellation is cooperative: Control Plane emits `CANCEL(request_id)`, Data Plane stops at the next safe point ## Backpressure | Queue utilization | Action | | ----------------- | ------------------------------ | | < 85% | Normal operation | | 85% - 95% | Reduce read depth | | > 95% | Suspend new reads | All transitions are observable via metrics and trace spans. --- # Storage Tiers (docs/architecture/storage-tiers) --- title: Storage Tiers description: Three-tier storage hierarchy — RAM, NVMe, and S3 — with temperature-aware data placement. --- # Storage Tiers NodeDB uses tiered storage to match data temperature to the right medium. ## Tiers | Tier | Medium | Contents | I/O Method | | --------- | ------ | ----------------------------------------------- | --------------------- | | L0 (hot) | RAM | Memtables, active CRDT states, incoming metrics | None (in-memory) | | L1 (warm) | NVMe | HNSW graphs, metadata indexes, segment files | mmap + madvise | | L2 (cold) | S3 | Historical logs, compressed vector layers | Parquet + HTTP range | | WAL | NVMe | Write-ahead log | O_DIRECT via io_uring | ## Critical Rules **WAL uses O_DIRECT.** Bypasses the kernel page cache entirely for deterministic write latency. Group commit batches multiple writes per io_uring submission for NVMe IOPS efficiency. **L1 indexes use mmap.** Zero-copy deserialization. SIMD reads directly from mapped pages. `madvise(MADV_WILLNEED)` pre-fetches before compute touches the data. **WAL and L1 never share page cache.** O_DIRECT (WAL) and mmap (L1) use fundamentally different I/O paths. Mixing them would cause cache coherency issues. ## Per-Core Memory Each Data Plane core is pinned to a dedicated jemalloc arena via `nodedb-mem`. This eliminates allocator lock contention in the TPC architecture. Memory budgets are enforced per engine — no single engine can starve others. ## Compaction L1 segment files undergo three-phase crash-safe compaction: 1. Write new merged segments to temporary files 2.
Atomically swap file references in the catalog 3. Delete old segments after all readers have released them Compaction preserves monotonic LSN ordering. Delete bitmaps (Roaring) track removed rows without rewriting segments. ## Cold Storage L2 uses Parquet format with predicate pushdown. A packed single-file format enables HTTP range requests for minimal egress from S3/GCS/Azure. --- # Write-Ahead Log (docs/architecture/wal) --- title: Write-Ahead Log description: O_DIRECT WAL with group commit, CRC32C checksums, and AES-256-GCM encryption. --- # Write-Ahead Log The WAL ensures durability. Every write is persisted to the WAL before being acknowledged. On crash recovery, the WAL replays to reconstruct any state not yet flushed to segments. ## Record Format ``` ┌──────────┬─────────────────┬────────────┬─────┬───────────┬───────────┬─────────────┬─────────┐ │ magic │ format_version │ record_type│ lsn │ tenant_id │ vshard_id │ payload_len │ crc32c │ │ 4 bytes │ 2 bytes │ 2 bytes │ 8B │ 4 bytes │ 2 bytes │ 4 bytes │ 4 bytes │ └──────────┴─────────────────┴────────────┴─────┴───────────┴───────────┴─────────────┴─────────┘ ``` ## Properties **O_DIRECT** — Bypasses the kernel page cache. Writes go directly to NVMe via io_uring. This provides deterministic write latency — no interference from page cache eviction or writeback. **Page size** — 4 KiB or 16 KiB, alignment-compatible with O_DIRECT requirements. **CRC32C** — Every page has a checksum for silent bit-rot detection. **Group commit** — Multiple writes batch into a single io_uring submission for NVMe IOPS efficiency. A double-write buffer ensures atomicity. **Segmented** — The WAL rolls over to a new segment file automatically. Old segments are eligible for cleanup once all records have been flushed to L1 segments. **Tombstone GC** — Each checkpoint garbage-collects WAL rows for collections that have been hard-deleted (tombstoned), so tombstone records do not accumulate across restarts. On replay, the startup path merges persisted tombstones with tombstones extracted from the WAL itself — a crash mid-purge cannot resurrect a dropped collection. **Encryption** — Optional AES-256-GCM encryption at the page level. Key management is external. ## Crash Recovery On startup, NodeDB: 1. Scans WAL segments from the last known checkpoint 2. Validates CRC32C checksums on each page 3. Replays valid records to reconstruct in-memory state 4. Discards any partially written records (torn writes) The Event Plane uses WAL LSN watermarks to resume event processing from the correct position after a crash. ## Write Path A write is acknowledged only after: 1. WAL append is persisted (O_DIRECT + fsync) 2. Raft quorum commit (for replicated namespaces) Single-node mode: linearizable writes for the shard leader. Replicated mode: linearizable writes within each Raft group. --- # Consistency Model (docs/architecture/consistency) --- title: Consistency Model description: Read consistency levels, snapshot isolation, and cross-engine query consistency. --- # Consistency Model ## Write Consistency A write is acknowledged only after: - **WAL append** is persisted (O_DIRECT + fsync) - **Raft quorum commit** (for replicated namespaces) Single-node: linearizable writes for the shard leader. Replicated: linearizable writes within each Raft group.
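Putting the write and read sides together: a write acknowledged under these rules is immediately visible to a `strong` read. A minimal sketch using the `orders` collection from the Quick Start (the levels themselves are defined in the next section):

```sql
-- Acknowledged only after WAL append — plus Raft quorum commit when replicated.
INSERT INTO orders (id, customer_id, total, status)
VALUES ('o42', 'u1', 25.00, 'pending');

-- A strong read routes to the leader and waits on the commit index,
-- so it is guaranteed to observe the acknowledged write.
SET read_consistency = 'strong';
SELECT status FROM orders WHERE id = 'o42';
```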
## Read Consistency Levels ```sql -- Strong: leader-read after commit index >= required LSN (default for metadata/constraints) SET read_consistency = 'strong'; -- Bounded staleness: follower read allowed if lag <= duration SET read_consistency = 'bounded_staleness_5s'; -- Eventual: local read for CRDT edge sync (monotonic convergence) SET read_consistency = 'eventual'; ``` | Level | Behavior | Use case | | ------------------- | -------------------------------------------- | --------------------------------- | | Strong | Leader read, waits for commit index | Metadata, constraints, OLTP | | Bounded Staleness | Follower read if lag within configured bound | Read replicas, analytics | | Eventual | Local read, monotonic convergence | CRDT edge sync, NodeDB-Lite | ## Snapshot Isolation Cross-engine queries execute against a shared snapshot watermark. A query that combines vector search, graph traversal, and document field access sees a consistent point-in-time view across all engines. Default isolation: Snapshot Isolation (SI) per query. Write skew detection is enforced for UNIQUE and FK constraints at commit. ## Cross-Model Queries RRF fusion and cross-engine joins include source watermarks in response metadata. This lets clients verify the freshness of results that span multiple engines. --- # Cluster Topology (docs/architecture/cluster-topology) --- title: Cluster Topology description: Multi-node deployment with vShards, Raft groups, and QUIC transport. --- # Cluster Topology NodeDB scales horizontally by distributing data across multiple nodes using virtual shards (vShards). ## Nodes and Roles Every node in a NodeDB cluster runs the same binary. Roles are determined by Raft group membership: - **Leader** — accepts writes and reads for its Raft groups - **Follower** — replicates from leader, serves stale reads (bounded staleness) - **Learner** — non-voting replica, used for read scaling or cross-region copies ## vShards Data is partitioned into virtual shards (vShards). Each vShard is assigned to a Raft group. A single node may host multiple vShards. Benefits of virtual sharding: - **Rebalancing** — move a vShard to a different node without reshuffling the entire dataset - **Granular replication** — each vShard can have a different replication factor - **Independent failure domains** — a vShard failure affects only the data in that shard ## Transport Cluster communication uses QUIC (via nexar + quinn). QUIC provides: - Multiplexed streams over a single connection - 0-RTT connection resumption - Built-in TLS 1.3 - Connection migration (IP address changes) ## Request Routing The Control Plane routes each request to the correct Data Plane core based on vShard assignment. If the local node does not own the target vShard, the request is forwarded to the correct node over QUIC. --- # Multi-Raft Consensus (docs/architecture/multi-raft) --- title: Multi-Raft Consensus description: Independent Raft groups per vShard for leader election, log replication, and snapshots. --- # Multi-Raft Consensus NodeDB uses Multi-Raft — each vShard is its own independent Raft group with its own leader, log, and snapshot schedule. This avoids the bottleneck of a single Raft group for the entire cluster. ## Per-vShard Raft Each Raft group handles: - **Leader election** — automatic failover when the current leader becomes unreachable - **Log replication** — WAL entries replicated to followers before acknowledgement - **Snapshots** — periodic state snapshots to truncate the Raft log ## Write Path (Replicated) 1. 
Client sends write to the vShard leader 2. Leader appends to local WAL 3. Leader replicates to Raft followers 4. Quorum acknowledges (majority of replicas) 5. Leader commits and responds to client Writes are linearizable within each Raft group. ## Advantages of Multi-Raft - **Independent leaders** — different vShards can have leaders on different nodes, distributing write load - **Parallel commits** — vShards commit independently, no global ordering bottleneck - **Granular failover** — a node failure only triggers leader election for the vShards it led, not the entire cluster --- # vShards (docs/architecture/vshards) --- title: vShards description: Virtual sharding for data distribution, rebalancing, and per-shard replication. --- # vShards Virtual shards (vShards) are NodeDB's unit of data distribution. Each collection's data is partitioned across vShards, and each vShard maps to a Raft group and a Data Plane core. ## How vShards Work - Data is hashed to a vShard based on the document ID (or a configured shard key) - Each vShard is assigned to a node and a specific Data Plane core on that node - The assignment is tracked in the cluster catalog and updated during rebalancing ## Rebalancing When nodes join or leave the cluster, vShards are redistributed: 1. The coordinator selects vShards to move based on load metrics 2. Target node begins replicating the vShard's data 3. Once caught up, ownership transfers atomically in the catalog 4. Old node drops the vShard data Rebalancing is transparent to clients — requests are forwarded to the new owner during migration. There is no downtime. ## Shard Count The number of vShards is set at cluster creation and does not change. Choose a count that is several times the expected maximum number of cores across all nodes — this gives the rebalancer enough granularity to distribute evenly. --- # Replication (docs/architecture/replication) --- title: Replication description: Raft-based replication with configurable replication factor per vShard. --- # Replication NodeDB replicates data via Raft log replication. Each vShard's Raft group independently replicates its WAL entries to followers. ## Replication Factor Configurable per collection or per vShard: - **RF=1** — no replication, single-node durability only - **RF=3** — default for production, tolerates one node failure - **RF=5** — high durability, tolerates two node failures ## How Replication Works 1. Leader receives a write and appends to its WAL 2. WAL entry is sent to all followers in the Raft group 3. Each follower appends to its own WAL and acknowledges 4. Leader waits for a quorum (majority) of acknowledgements 5. Leader commits the entry and responds to the client ## Follower Reads Followers can serve reads with bounded staleness: ```sql SET read_consistency = 'bounded_staleness_5s'; SELECT * FROM orders WHERE status = 'pending'; ``` The follower checks that its replication lag is within the configured bound before serving the read. If the lag exceeds the bound, the read is forwarded to the leader. ## Cross-Region Replication Learner replicas (non-voting Raft members) can be placed in remote regions for read scaling. They replicate asynchronously and do not participate in quorum decisions, so they do not add write latency. --- # Collections (docs/data-modeling/collections) --- title: Collections description: Collections are NodeDB's top-level data containers. Each collection has a storage engine chosen at creation time. 
--- # Collections A collection is NodeDB's top-level data container — analogous to a table in PostgreSQL or a collection in MongoDB. Each collection has a storage engine chosen at creation time. ## Creating Collections ```sql -- Schemaless document (default) CREATE COLLECTION users; -- Strict document (schema-enforced) CREATE COLLECTION orders ( id UUID DEFAULT gen_uuid_v7(), customer_id UUID NOT NULL, total DECIMAL NOT NULL, status STRING DEFAULT 'pending' ) WITH (engine='document_strict'); -- Columnar (analytics) CREATE COLLECTION events ( ts TIMESTAMP TIME_KEY, user_id UUID, event VARCHAR, duration_ms INT ) WITH (engine='columnar'); -- Key-Value CREATE COLLECTION sessions (key TEXT PRIMARY KEY) WITH (engine='kv'); ``` ## Storage Engines NodeDB has eight peer engines. Seven of them are picked per collection via `WITH (engine='…')` — the default (no `engine=`) is `document_schemaless`. The eighth, Array, uses its own `CREATE ARRAY` DDL family. See [Array Engine](../storage-engines/array) for that path. | Engine | Selector | Best for | | --------------------- | ----------------------------------- | ------------------------------------------------- | | `document_schemaless` | (default) | Flexible data, prototyping, agent state | | `document_strict` | `WITH (engine='document_strict')` | OLTP, transactions, known schemas | | `columnar` | `WITH (engine='columnar')` | Analytics, reporting, scan-heavy workloads | | `timeseries` | `WITH (engine='timeseries')` | Append-only metrics, retention, continuous aggs | | `spatial` | `WITH (engine='spatial')` | Geo-primary workloads, R*-tree, OGC predicates | | `kv` | `WITH (engine='kv')` | Sessions, caches, counters, key-dominant access | | `vector` | `WITH (engine='vector', vector_field='emb')` | Vector-primary collections | Column modifiers add capabilities to any engine: | Modifier | Effect | | ---------------- | ----------------------------------------------------------------- | | `TIME_KEY` | Marks the time column (used by timeseries engine, allowed elsewhere) | | `SPATIAL_INDEX` | R*-tree index on a `GEOMETRY` column | | `PRIMARY KEY` | Row identity column (inline only — table-level not supported) | ## Cross-Engine Indexes Any collection can have indexes from multiple engines: ```sql CREATE COLLECTION products; -- Add a vector index for semantic search CREATE VECTOR INDEX ON products METRIC cosine DIM 384; -- Add a full-text index for keyword search CREATE SEARCH INDEX ON products FIELDS title, description ANALYZER 'english'; -- Add a spatial index for location queries CREATE SPATIAL INDEX ON products FIELDS location; -- Add graph edges for relationships GRAPH INSERT EDGE IN 'products' FROM 'products:p1' TO 'products:p2' TYPE 'similar'; ``` ## Converting Between Engines Collections can be converted at any time without data loss: ```sql CONVERT COLLECTION users TO document_strict; CONVERT COLLECTION cache TO kv; CONVERT COLLECTION events TO document_schemaless; ``` NodeDB infers the schema from existing documents when converting to strict mode. --- # Schemas & Types (docs/data-modeling/schemas-types) --- title: Schemas & Types description: Data types, schema enforcement, typeguards, and schema evolution.
--- # Schemas & Types ## Data Types | Type | Description | Example | | ----------- | ------------------------------------ | ---------------------- | | `STRING` | UTF-8 text | `'hello'` | | `TEXT` | Alias for STRING | `'hello'` | | `VARCHAR` | Alias for STRING | `'hello'` | | `INT` | 64-bit signed integer | `42` | | `FLOAT` | 64-bit IEEE 754 | `3.14` | | `DECIMAL` | Arbitrary precision | `99.99` | | `BOOL` | Boolean | `true` | | `TIMESTAMP` | UTC timestamp (nanosecond precision) | `now()` | | `DATETIME` | Alias for TIMESTAMP | `'2026-01-01T00:00Z'` | | `UUID` | 128-bit UUID | `gen_uuid_v7()` | | `GEOMETRY` | OGC geometry (WKB internally) | `ST_Point(-73.9, 40.7)`| | `ARRAY` | Ordered list | `ARRAY[1, 2, 3]` | | `OBJECT` | Nested document | `{ name: 'Alice' }` | ## Schema Modes **Schemaless** — No schema required. Fields can vary between documents. Types are inferred on insert. This is the default when you `CREATE COLLECTION x`. **Strict** — Schema is defined at creation time and enforced on every write. O(1) field extraction via binary tuple format. Created with `(...) WITH (engine='document_strict')`. ## Typeguards (Schemaless Validation) Typeguards add write-time validation to schemaless collections without changing the storage format. Guarded fields are type-checked; unguarded fields pass freely. ```sql CREATE TYPEGUARD ON users ( email STRING REQUIRED CHECK (email LIKE '%@%.%'), age INT CHECK (age >= 0 AND age <= 150), role STRING DEFAULT 'user', updated_at TIMESTAMP VALUE now() ); ``` | Modifier | Behavior | | ---------- | ----------------------------------------------- | | `REQUIRED` | Field must be present and non-null | | `DEFAULT` | Inject a value when the field is absent | | `VALUE` | Always inject/overwrite (computed fields) | | `CHECK` | SQL boolean expression validated at write time | ```sql -- Modify guards ALTER TYPEGUARD ON users ADD score FLOAT CHECK (score >= 0); ALTER TYPEGUARD ON users DROP age; -- Audit existing data against guards VALIDATE TYPEGUARD ON users; -- Graduate to strict schema CONVERT COLLECTION users TO document_strict; ``` ## Schema Evolution (Strict Mode) ```sql -- Add a column with a default (zero-downtime, multi-version reads) ALTER COLLECTION orders ADD COLUMN region STRING DEFAULT 'us-east'; -- Drop a column ALTER COLLECTION orders DROP COLUMN region; ``` `ALTER ADD COLUMN` uses multi-version reads — existing rows are served with the default value without rewriting storage. New writes include the column natively. --- # Document Engine (docs/storage-engines/document) --- title: Document Engine description: Two modes — schemaless (MessagePack, CRDT sync) and strict (Binary Tuples, O(1) field extraction). --- # Document Engine NodeDB supports two document storage modes per collection. Choose based on your workload and convert at any time. ## Schemaless Documents Flexible JSON-like documents stored as MessagePack. No schema required — fields can vary between documents. ```sql CREATE COLLECTION users; INSERT INTO users { name: 'Alice', email: 'alice@example.com', age: 30 }; INSERT INTO users { name: 'Bob', role: 'admin', tags: ['ops', 'dev'] }; -- Standard SQL also works INSERT INTO users (id, name, email) VALUES ('u3', 'Charlie', 'charlie@example.com'); CREATE INDEX ON users FIELDS email; SELECT * FROM users WHERE age > 25; ``` **Best for:** Prototyping, AI agent state, user profiles, nested data, CRDT sync. ## Strict Documents Schema-enforced documents stored as Binary Tuples with O(1) field extraction. 
The engine jumps directly to the byte offset of any column — 3-4x better cache density than MessagePack or BSON. ```sql CREATE COLLECTION orders ( id UUID DEFAULT gen_uuid_v7(), customer_id UUID NOT NULL, total DECIMAL NOT NULL, status STRING DEFAULT 'pending', created_at TIMESTAMP DEFAULT now() ) WITH (engine='document_strict'); INSERT INTO orders (customer_id, total, status) VALUES ($cust_id, 149.99, 'shipped'); SELECT * FROM orders WHERE id = '...'; -- Schema evolution (zero-downtime) ALTER COLLECTION orders ADD COLUMN region STRING DEFAULT 'us-east'; ``` **Best for:** OLTP, transactions, known schemas, HTAP with columnar materialized views. ## Comparison | | Schemaless | Strict | | ------------- | ------------------------------ | ------------------------------- | | Schema | Flexible, evolves freely | Fixed, enforced on write | | Field access | Parse MessagePack | O(1) byte offset | | Cache density | Good | 3-4x better | | CRDT sync | Native | Via adapter | | HTAP | No | Yes (materialized views) | ## Typeguards Add validation to schemaless collections without changing to strict mode: ```sql CREATE TYPEGUARD ON users ( email STRING REQUIRED CHECK (email LIKE '%@%.%'), age INT CHECK (age >= 0 AND age <= 150), role STRING DEFAULT 'user' ); ``` See [Schemas & Types](/docs/data-modeling/schemas-types) for full typeguard reference. ## Converting ```sql CONVERT COLLECTION users TO document_strict; -- schemaless → strict CONVERT COLLECTION logs TO columnar; -- document → columnar CONVERT COLLECTION cache TO kv; -- document → key-value ``` --- # Key-Value Engine (docs/storage-engines/kv) --- title: Key-Value Engine description: Hash-indexed O(1) point lookups with TTL, atomic operations, sorted indexes, and rate gates. --- # Key-Value Engine Purpose-built hash-indexed store with O(1) point lookups, native TTL, and secondary indexes. Unlike a standalone KV store, data is SQL-queryable, joinable, and syncable via CRDTs. ## When to Use - Session state and tokens - Feature flags and configuration - Rate limiters and counters - Caching without an external cache - Leaderboards and rankings ## SQL Usage ```sql CREATE COLLECTION sessions (key TEXT PRIMARY KEY) WITH (engine='kv'); -- Insert with TTL; plain INSERT raises unique_violation (23505) on duplicate key. INSERT INTO sessions { key: 'sess_abc', user_id: 'alice', role: 'admin', ttl: 3600 }; -- Set-or-overwrite (Redis SET semantics) UPSERT INTO sessions { key: 'sess_abc', user_id: 'alice', role: 'admin', ttl: 3600 }; -- Set-if-absent (Redis SETNX semantics) INSERT INTO sessions { key: 'sess_abc', user_id: 'alice', role: 'admin', ttl: 3600 } ON CONFLICT DO NOTHING; -- Conditional merge: bump counter on conflict, EXCLUDED references the incoming row. 
INSERT INTO sessions (key, user_id, role, hits) VALUES ('sess_abc', 'alice', 'admin', 1) ON CONFLICT (key) DO UPDATE SET role = EXCLUDED.role, hits = sessions.hits + 1; -- Get by key SELECT * FROM sessions WHERE key = 'sess_abc'; -- Analytical queries work too SELECT role, COUNT(*) FROM sessions GROUP BY role; -- Join with other collections SELECT u.name, s.role FROM users u JOIN sessions s ON u.id = s.user_id; ``` ## Atomic Operations ```sql SELECT KV_INCR('counters', 'page_views', 1); SELECT KV_DECR('credits', 'user-123', 50); SELECT KV_CAS('state', 'player-1', 'idle', 'in_match'); SELECT KV_GETSET('token', 'sess-1', 'new-token'); SELECT KV_INCR('daily_logins', 'user-1', 1, TTL => 86400); ``` ## Sorted Indexes (Leaderboards) ```sql CREATE SORTED INDEX lb_global ON scores (score DESC, updated_at ASC) KEY player_id; SELECT RANK(lb_global, 'player-123'); SELECT * FROM TOPK(lb_global, 10); SELECT SORTED_COUNT(lb_global); -- Time-windowed leaderboard CREATE SORTED INDEX lb_daily ON scores (score DESC) KEY player_id WINDOW DAILY ON updated_at; ``` ## Rate Gates ```sql SELECT RATE_CHECK('attack_cooldown', 'player-123', 3, 10); -- {"allowed": true, "remaining": 2, "resets_in_ms": 7500} SELECT RATE_REMAINING('api_calls', 'tenant-1', 100, 60); SELECT RATE_RESET('attack_cooldown', 'player-123'); ``` ## Redis-Compatible Access (RESP) Enable by setting `ports.resp = 6381` in config. Any Redis client works: ```bash redis-cli -p 6381 SET sess_abc '{"user":"alice"}' EX 3600 GET sess_abc INCR page_views ZADD leaderboard 1500 player-123 ZRANK leaderboard player-123 ``` Supported: GET, SET, DEL, EXISTS, MGET, MSET, EXPIRE, TTL, SCAN, KEYS, HGET, HSET, INCR, DECR, ZADD, ZREM, ZRANK, ZRANGE, SUBSCRIBE, PUBLISH, and more. --- # Columnar Engine (docs/storage-engines/columnar) --- title: Columnar Engine description: Per-column compression with block statistics, predicate pushdown, and 20-40x compression ratios. --- # Columnar Engine Typed columns with per-column compression — the same approach as ClickHouse and DuckDB, living alongside your OLTP data. ## When to Use - Analytical queries (GROUP BY, aggregations, window functions) - Reporting dashboards - Data science workloads - Any scan-heavy workload reading few columns from many rows - HTAP: pair with strict documents for combined OLTP + OLAP ## Compression Codecs Each column gets a codec chain tuned for its data type: | Codec | Target | Approach | | ---------- | ----------------------------- | ----------------------------------- | | ALP | Floats | Lossless float-to-integer conversion | | FastLanes | Integers | SIMD bit-packing | | FSST | Strings | Substring dictionary compression | | Gorilla | Metrics | XOR-based compression | | Pcodec | Complex numerics | Adaptive numeric compression | | rANS | Cold-tier data | Entropy coding | | LZ4 | All | Terminal stage compression | Multi-stage pipeline achieves 20-40x compression on typical workloads. ## Block Statistics Data is stored in 1024-row blocks. Each block has min/max/null-count statistics. The query engine skips blocks that can't match the predicate — no decompression needed. 
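To see block skipping in practice, shape predicates around columns whose per-block min/max ranges are tight. A sketch against the `logs` collection defined under DDL below — time-ordered ingest keeps each block's `ts` range narrow, so a time-range predicate eliminates most blocks from statistics alone:

```sql
-- Blocks whose [min(ts), max(ts)] falls outside the range are skipped
-- without decompression; blocks whose level min/max excludes 'error'
-- are skipped the same way.
SELECT COUNT(*)
FROM logs
WHERE ts >= '2026-04-24T00:00:00Z'
  AND ts <  '2026-04-25T00:00:00Z'
  AND level = 'error';
```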
## DDL ```sql -- Plain columnar CREATE COLLECTION logs ( ts TIMESTAMP TIME_KEY, host VARCHAR, level VARCHAR, message VARCHAR ) WITH (engine='columnar'); -- Timeseries (peer engine — append-only, retention, continuous aggregates) CREATE COLLECTION metrics ( ts TIMESTAMP TIME_KEY, host VARCHAR, cpu FLOAT ) WITH (engine='timeseries', partition_by='1h'); -- Spatial (peer engine — R*-tree + geohash + OGC predicates) CREATE COLLECTION locations ( geom GEOMETRY SPATIAL_INDEX, name VARCHAR ) WITH (engine='spatial'); ``` ## Queries ```sql -- Point-get by primary key hits the segment PK index (not a full scan). SELECT * FROM logs WHERE ts = '2026-04-24T10:00:00Z'; SELECT level, COUNT(*) FROM logs WHERE ts > now() - INTERVAL '1 hour' GROUP BY level ORDER BY COUNT(*) DESC; -- ORDER BY is supported on columnar scans SELECT ts, host, message FROM logs WHERE level = 'error' ORDER BY ts DESC LIMIT 100; -- Window functions SELECT host, message, ROW_NUMBER() OVER (PARTITION BY host ORDER BY ts DESC) AS rank FROM logs; ``` Plain `INSERT` on a columnar collection raises `unique_violation` on primary-key conflict; use `UPSERT` or `INSERT ... ON CONFLICT (pk) DO UPDATE SET col = EXCLUDED.col` for overwrite semantics. ## HTAP Bridge Combine strict (OLTP) with columnar (OLAP): ```sql CREATE COLLECTION orders (...) WITH (engine='document_strict'); CREATE MATERIALIZED VIEW order_analytics AS SELECT status, DATE_TRUNC('day', created_at) AS day, COUNT(*), SUM(total) FROM orders GROUP BY status, day; -- Point lookups → strict engine. Scans → columnar engine. Automatic routing. ``` ## Delete Bitmaps Deleted rows are tracked with Roaring Bitmaps. Three-phase crash-safe compaction reclaims space: write new segments → swap references → delete old segments. --- # Timeseries Engine (docs/storage-engines/timeseries) --- title: Timeseries Engine description: Peer engine for time-ordered data with retention, continuous aggregation, ILP ingest, and PromQL. --- # Timeseries Engine Timeseries is a peer engine sharing storage with `columnar` and `spatial`. It adds retention policies, continuous aggregation, ILP ingest, and dedicated time-series SQL functions. Data is stored in the same compressed-column memtables with a `TIME_KEY` column driving partition-by-time and block-level skip. Selected via `WITH (engine='timeseries')` on `CREATE COLLECTION`, or the `CREATE TIMESERIES <name>` convenience alias.
## DDL ```sql CREATE COLLECTION cpu_metrics ( ts TIMESTAMP TIME_KEY, host VARCHAR, region VARCHAR, cpu_usage FLOAT, mem_usage FLOAT ) WITH (engine='timeseries', partition_by='1d', retention='90d'); -- Convenience alias CREATE TIMESERIES cpu_metrics; ``` ## Queries ```sql -- Time-bucketed aggregation SELECT time_bucket('5 minutes', ts) AS bucket, host, AVG(cpu_usage) AS avg_cpu FROM cpu_metrics WHERE ts > now() - INTERVAL '1 hour' GROUP BY bucket, host ORDER BY bucket DESC; -- Approximate aggregation (mergeable across shards) SELECT approx_count_distinct(host), approx_percentile(cpu_usage, 0.95) FROM cpu_metrics WHERE ts > now() - INTERVAL '24 hours'; ``` ## Continuous Aggregation Incrementally maintained views — no full re-scan on refresh: ```sql CREATE CONTINUOUS AGGREGATE cpu_hourly ON cpu_metrics AS SELECT time_bucket('1 hour', ts) AS hour, host, AVG(cpu_usage), ts_percentile(cpu_usage, 0.99) FROM cpu_metrics GROUP BY hour, host WITH (refresh_interval = '1m'); REFRESH CONTINUOUS AGGREGATE cpu_hourly; ``` ## Timeseries SQL Functions | Function | What it does | | ---------------------- | ------------------------------------ | | `ts_rate` | Per-second rate of change | | `ts_delta` | Difference between consecutive values| | `ts_moving_avg` | Moving average over a window | | `ts_ema` | Exponential moving average | | `ts_interpolate` | Gap-fill with interpolated values | | `ts_percentile` | Percentile calculation | | `ts_zscore` | Z-score anomaly detection | | `ts_bollinger_upper/lower/mid/width` | Bollinger Bands | | `ts_moving_percentile` | Rolling percentile | | `ts_correlate` | Correlation between two series | | `ts_lag` / `ts_lead` | Previous/next value in a series | ## ILP Ingest Enable with `ports.ilp = 8086`. Any ILP-compatible client (Telegraf, Vector) pushes metrics directly: ```bash echo "cpu,host=web-01 usage=72.5 1609459200000000000" | nc localhost 8086 ``` Adaptive batching and per-series core routing — self-tuning, no configuration needed. ## Grafana / PromQL NodeDB works as a native Grafana Prometheus data source at `http://nodedb:6480/v1/obsv/api`. Full PromQL engine (Tier 1+2+3 functions). Also supports Prometheus remote write/read for long-term storage. --- # Spatial Engine (docs/storage-engines/spatial) --- title: Spatial Engine description: Peer engine for geo-primary workloads — R*-tree index, OGC predicates, geohash, H3 hexagonal indexing, and hybrid spatial-vector search. --- # Spatial Engine Spatial is a peer engine sharing storage with `columnar` and `timeseries`. Collections selected via `WITH (engine='spatial')` and a `GEOMETRY SPATIAL_INDEX` column get automatic R*-tree indexing over the geometry column. 
## DDL

```sql
-- Spatial peer engine (R*-tree + geohash + OGC predicates)
CREATE COLLECTION restaurants (
  location GEOMETRY SPATIAL_INDEX,
  name VARCHAR,
  cuisine VARCHAR,
  rating FLOAT
) WITH (engine='spatial');

-- Or add a spatial index to any collection
CREATE COLLECTION restaurants;
CREATE SPATIAL INDEX ON restaurants FIELDS location;
```

## Queries

```sql
-- Find within 1km
SELECT name, ST_Distance(location, ST_Point(-73.990, 40.750)) AS dist
FROM restaurants
WHERE ST_DWithin(location, ST_Point(-73.990, 40.750), 1000)
ORDER BY dist;

-- Geofencing (point-in-polygon)
SELECT name FROM restaurants
WHERE ST_Within(location, ST_GeomFromGeoJSON('{
  "type": "Polygon",
  "coordinates": [[[-74.0, 40.7], [-73.9, 40.7], [-73.9, 40.8], [-74.0, 40.8], [-74.0, 40.7]]]
}'));

-- H3 hexagonal binning
SELECT h3_to_string(h3_encode(40.748, -73.985, 9)) AS hex;

-- Spatial join
SELECT r.name, z.zone_name
FROM restaurants r, delivery_zones z
WHERE ST_Contains(z.boundary, r.location);

-- Hybrid spatial-vector: nearby AND semantically similar
SELECT name, vector_distance(embedding, $query_vec) AS similarity
FROM restaurants
WHERE ST_DWithin(location, ST_Point(-73.990, 40.750), 2000)
ORDER BY embedding <-> $query_vec
LIMIT 10;
```

## OGC Predicates

`ST_Contains`, `ST_Intersects`, `ST_Within`, `ST_DWithin`, `ST_Distance`, `ST_Intersection`, `ST_Buffer`, `ST_Envelope`, `ST_Union`.

## Format Support

WKB, WKT, and GeoJSON interchange. GeoParquet v1.1.0 and GeoArrow metadata for bulk export.

## Combined with Timeseries

```sql
-- Fleet tracking: spatial + time
CREATE COLLECTION fleet (
  ts TIMESTAMP TIME_KEY,
  vehicle_id VARCHAR,
  position GEOMETRY SPATIAL_INDEX,
  speed FLOAT
) WITH (engine='timeseries', partition_by='1d');
```

---

# Vector Engine (docs/storage-engines/vector)

---
title: Vector Engine
description: HNSW index with quantization, adaptive pre-filtering, and SIMD-accelerated distance math for semantic search.
---

# Vector Engine

The vector engine powers semantic search — nearest-neighbor retrieval over high-dimensional embeddings. It uses a custom HNSW index with multiple quantization levels and hardware-accelerated distance math.

## When to Use

- Semantic search over embeddings (text, images, audio)
- RAG pipelines for AI agents
- Recommendation systems
- Similarity matching and deduplication

## Key Features

- **HNSW index** — Multi-layer proximity graph. Construction at full precision (FP32/FP16) for structural integrity; traversal on quantized payloads for cache residency.
- **Quantization** — SQ8 (~4x memory reduction), PQ (~4-8x), IVF-PQ (~16 bytes/vector for 100M+ datasets).
- **Adaptive pre-filtering** — Roaring Bitmap-based filtering. Automatic strategy selection: pre-filter (selective), post-filter (broad), or brute-force.
- **Distance metrics** — L2, cosine, inner product, Manhattan, Chebyshev, Hamming, Jaccard, Pearson.
- **Cross-engine fusion** — Combine with graph (GraphRAG), full-text (hybrid BM25+vector), or spatial filtering.

## SQL Usage

```sql
-- Create a collection with a vector index
CREATE COLLECTION articles;
CREATE VECTOR INDEX idx_embed ON articles METRIC cosine DIM 384;

-- Insert with embedding
INSERT INTO articles {
  title: 'Understanding Transformers',
  embedding: [0.12, -0.34, 0.56, ...]
}; -- Nearest neighbor search SEARCH articles USING VECTOR(embedding, ARRAY[0.1, 0.3, -0.2, ...], 10); -- Filtered vector search SELECT title, vector_distance(embedding, ARRAY[0.1, 0.3, ...]) AS score FROM articles WHERE category = 'machine-learning' AND id IN (SEARCH articles USING VECTOR(embedding, ARRAY[0.1, 0.3, ...], 10)); -- Hybrid BM25 + vector (RRF fusion) SELECT title, rrf_score( vector_distance(embedding, $query_vec), bm25_score(body, 'transformer attention') ) AS score FROM articles LIMIT 10; ``` ## Quantization Selection | Type | Memory (384d) | Recall | Best for | | ----------- | ------------- | ------- | -------------------------- | | HNSW (FP32) | ~1.5 KB | ~99% | < 1M vectors, max accuracy | | HNSW + SQ8 | ~384 B | ~98% | 1-10M vectors | | HNSW + PQ | ~96 B | ~95% | 10-50M vectors | | IVF-PQ | ~16 B | ~85-95% | 50M+ vectors | ## How It Works Vectors are indexed in the HNSW graph at full precision. During search, quantized copies are traversed for speed, then top candidates are re-ranked against full-precision vectors. When metadata filters are present, the engine builds a Roaring Bitmap of matching IDs and selects the optimal strategy based on selectivity. --- # Array Engine (docs/storage-engines/array) --- title: Array Engine description: ND sparse array engine — coordinate-tuple indexed, tile-based, bitemporal. Replaces TileDB / Zarr / SciDB / Rasdaman. --- # Array Engine NodeDB's array engine stores multi-dimensional sparse data with bitemporal support — system time (when the cell was written) and valid time (when the cell represents). Cells are coordinate-tuple indexed, grouped into tiles, compressed per-tile, and queryable via SQL table-valued functions. This is a peer engine but uses its own DDL family (`CREATE ARRAY`) — not `CREATE COLLECTION ... WITH (engine='array')`. 
## When to Use

- Genomics: `(chromosome × position × sample × allele)` — replaces TileDB-VCF
- Single-cell biology: `(gene × cell × condition × replicate)` — replaces TileDB-SOMA
- Earth observation: `(lat × lon × band × time)` raster cubes — replaces Zarr / TileDB-Geo
- Climate models: `(lat × lon × level × time × variable)` — replaces HDF5 + Dask
- Astronomy: `(RA × Dec × wavelength × time)` — replaces custom Zarr stacks
- Sparse ML features: `(user × item × context)` — replaces specialized matrix-factorization systems

## Key Features

- **ND coordinate-tuple keying** — arbitrary number of dimensions; only materialized cells are stored
- **Tile-based compression** — cells grouped into tiles; each tile independently compressed (ALP, FastLanes, Gorilla, LZ4 via `nodedb-codec`)
- **Z-order indexing** — Hilbert/Z-order curve linearization for spatial locality and fast range queries
- **Per-tile ND MBR statistics** — minimum bounding rectangle skip; queries prune entire tiles before decompressing
- **Bitemporal** — both system time (audit trail) and valid time (temporal semantics) tracked per tile
- **Row-major or column-major layout** — `cell_order` chosen at creation
- **Cross-engine surrogate identity** — array cells participate in cross-engine bitmap intersections alongside vector / graph / document / columnar
- **Distributed** — tiles vShard-routed; queries scatter-gather across cores and nodes
- **WAL-durable + Raft-replicated** — same durability guarantees as the rest of NodeDB
- **Tile-level retention** — `audit_retain_ms` enables GDPR / data-minimization compliance

## DDL Syntax

```sql
CREATE ARRAY spatial_grid
DIMS (
  x INT64 DOMAIN [0, 1000),
  y INT64 DOMAIN [0, 1000),
  z INT64 DOMAIN [0, 1000)
)
ATTRS (
  temperature FLOAT32,
  pressure FLOAT32,
  humidity FLOAT32
)
TILE_EXTENTS (64, 64, 64)
WITH (
  cell_order = 'Z-ORDER',
  audit_retain_ms = 86400000
);
```

| Parameter          | Required | Default     | Description                                                                                           |
| ------------------ | -------- | ----------- | ----------------------------------------------------------------------------------------------------- |
| `DIMS`             | Yes      | —           | Dimensions. Each has a name, type (`INT32`, `INT64`, `FLOAT64`), and half-open domain `[lo, hi)`.      |
| `ATTRS`            | Yes      | —           | Attributes (cell values). Each has a name and type (`FLOAT32`, `FLOAT64`, `INT32`, `INT64`, `STRING`). |
| `TILE_EXTENTS`     | Yes      | —           | Tile extent per dimension; all > 0. Determines cell locality and compression block granularity.        |
| `cell_order`       | No       | `'Z-ORDER'` | `'Z-ORDER'` (Hilbert curve) or `'ROW-MAJOR'`. Affects spatial cache locality.                          |
| `audit_retain_ms`  | No       | `NULL`      | Tiles older than `now - audit_retain_ms` (system time) become eligible for purge. `NULL` = keep all.   |

`ALTER NDARRAY <name> SET (audit_retain_ms = ...)` updates retention; `DROP ARRAY <name>` is two-phase like `DROP COLLECTION`.

## Insert

```sql
CREATE ARRAY elevation_map
DIMS (
  lon FLOAT64 DOMAIN [-180, 180),
  lat FLOAT64 DOMAIN [-90, 90)
)
ATTRS (height FLOAT32)
TILE_EXTENTS (256, 256);

INSERT INTO ARRAY elevation_map (lon, lat, height) VALUES
  (-73.5, 40.7, 10.5),
  (-73.6, 40.8, 12.3),
  (-73.7, 40.6, 8.9);

-- Force the in-memory tiles to durable storage
SELECT NDARRAY_FLUSH('elevation_map');
```

## Query Functions

Array queries are expressed as table-valued functions in `FROM`. System time and valid time apply via `AS OF` clauses.
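Because these functions are ordinary `FROM` items, standard SQL composes over their output. A minimal sketch using the `elevation_map` array from above and the `NDARRAY_SLICE` function documented just below — it assumes slice output exposes dimension columns alongside the projected attributes, as in the bitemporal examples later in these docs:

```sql
-- SQL-side aggregation over a slice (contrast with the engine-side NDARRAY_AGG reducer below)
SELECT lat, AVG(height) AS avg_height
FROM NDARRAY_SLICE('elevation_map', {lon: [-74.0, -73.0), lat: [40.0, 41.0)}, ['height'])
GROUP BY lat
ORDER BY lat;
```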
### `NDARRAY_SLICE` — multi-dimensional range ```sql SELECT * FROM NDARRAY_SLICE( 'elevation_map', {lon: [-74.0, -73.0), lat: [40.0, 41.0)}, ['height'], -- attribute projection (optional) 1000 -- max cells (optional) ); ``` | Parameter | Required | Type | Description | | --------- | -------- | -------------- | --------------------------------------------------------------- | | `array` | Yes | `STRING` | Array name | | `bounds` | Yes | `OBJECT` | `{ dim: [lo, hi) }`. Omitted dims = full range. | | `attrs` | No | `ARRAY[STRING]`| Attributes to project. `NULL` = all attributes. | | `limit` | No | `INT64` | Max cells returned. `NULL` = no limit. | ### `NDARRAY_PROJECT` — attribute projection ```sql SELECT * FROM NDARRAY_PROJECT('spatial_grid', ['temperature', 'pressure']); ``` ### `NDARRAY_AGG` — reduce a dimension Aggregates an attribute over a dimension, reducing dimensionality: ```sql -- Sum temperature over x; result keeps y and z SELECT * FROM NDARRAY_AGG('spatial_grid', 'temperature', 'SUM', 'x'); ``` Reducers: `'SUM'`, `'AVG'`, `'MIN'`, `'MAX'`, `'COUNT'`. ### `NDARRAY_ELEMENTWISE` — between two arrays of the same shape ```sql SELECT * FROM NDARRAY_ELEMENTWISE('current_grid', 'baseline_grid', 'SUBTRACT', 'temperature'); ``` ## Maintenance ```sql SELECT NDARRAY_FLUSH('spatial_grid'); -- force memtable flush SELECT NDARRAY_COMPACT('spatial_grid'); -- merge tile versions, reclaim space ``` `NDARRAY_FLUSH` always returns `{result: true}` on success; failure raises. Compaction also runs automatically in the background. ## Bitemporal Queries Every array cell carries two times: - **System time** — when the value was written (audit trail, compliance, point-in-time recovery) - **Valid time** — when the value represents (forecasts, backdated corrections, scientific replays) ```sql -- Read cells as the array existed in the past SELECT * FROM NDARRAY_SLICE('data', {x: [0, 100), y: [0, 100)}, ['value']) AS OF SYSTEM TIME 1700000000000; -- Read cells whose valid-time interval includes a given moment SELECT * FROM NDARRAY_SLICE('forecast', {x: [0, 100), y: [0, 100)}, ['temp']) AS OF VALID TIME 1700000000000; -- Both clauses combined SELECT * FROM NDARRAY_SLICE('forecast', {x: [0, 100), y: [0, 100)}, ['temp']) AS OF SYSTEM TIME 1700000000000 AS OF VALID TIME 1700000001000; ``` System-time–based retention is the path to GDPR and data-minimization compliance: `audit_retain_ms` makes tiles older than the window eligible for irreversible purge during compaction. ## Cross-Engine Queries Array cells participate in surrogate-identity bitmaps with the rest of the engines, so a single query can prefilter by vector neighborhood and slice an array: ```sql SELECT * FROM NDARRAY_SLICE('spatial_data', {x: [0, 1000), y: [0, 1000)}, ['attr1', 'attr2']) WHERE id IN ( SEARCH vectors USING VECTOR(embedding, $query, 100) ); ``` See [Architecture Overview](../architecture/overview) for the cross-engine identity model. ## Performance - **Tile-level parallelism** — each tile is read and processed on its own core - **Compression** — typical 5–20× depending on data homogeneity - **Range queries** — Z-order layout gives cache-friendly access; ND-MBR per-tile stats prune irrelevant tiles before decompression - **Sparse-friendly** — only materialized cells are stored; implicit zeros and empty regions cost nothing --- # Graph Engine (docs/storage-engines/graph) --- title: Graph Engine description: CSR adjacency index with 13 algorithms, Cypher-subset MATCH, and GraphRAG fusion. 
---

# Graph Engine

The graph engine uses a native CSR (Compressed Sparse Row) adjacency index — not recursive JOINs. At 1 billion edges, CSR uses ~10 GB vs ~60 GB for naive adjacency lists. Sub-millisecond multi-hop traversals, 13 native algorithms, Cypher-subset pattern matching, and GraphRAG fusion.

## When to Use

- Knowledge graphs and entity relationships
- Social networks and recommendation
- Fraud detection (pattern matching)
- Supply chain and dependency analysis
- RAG pipelines with graph context (GraphRAG)

## Graph is an Overlay

Graph edges are an overlay on document collections. Any collection can have graph edges — you don't create a separate "graph collection."

```sql
CREATE COLLECTION people;
INSERT INTO people (id, name) VALUES ('alice', 'Alice');
INSERT INTO people (id, name) VALUES ('bob', 'Bob');

GRAPH INSERT EDGE IN 'people' FROM 'alice' TO 'bob'
  TYPE 'knows' PROPERTIES { since: 2020, weight: 0.9 };

-- The IN '<collection>' clause is required — edges live on a named collection.
```

## Traversal

```sql
-- BFS traversal
GRAPH TRAVERSE FROM 'alice' DEPTH 3;
GRAPH TRAVERSE FROM 'alice' DEPTH 2 LABEL 'follows' DIRECTION out;

-- Immediate neighbors
GRAPH NEIGHBORS OF 'bob' LABEL 'follows' DIRECTION both;

-- Shortest path
GRAPH PATH FROM 'alice' TO 'charlie' MAX_DEPTH 5 LABEL 'knows';
```

## MATCH Pattern Queries

Cypher-subset pattern matching:

```sql
-- Friend-of-friend
MATCH (a:Person)-[:knows]->(b:Person)-[:knows]->(c:Person)
WHERE a.name = 'Alice'
RETURN b.name, c.name;

-- Variable-length paths
MATCH (u:User)-[:follows*2..3]->(recommended:User)
WHERE u.id = 'you'
RETURN DISTINCT recommended.id LIMIT 10;

-- Anti-join
MATCH (a:User)-[:follows]->(b:User)
WHERE NOT EXISTS { MATCH (b)-[:blocked_by]->(a) }
RETURN a.id, b.id;

-- OPTIONAL MATCH
MATCH (a:Person)-[:knows]->(b:Person)
OPTIONAL MATCH (b)-[:works_at]->(c:Company)
RETURN a.name, b.name, c.name;
```

## 13 Algorithms

```sql
GRAPH ALGO PAGERANK ON social DAMPING 0.85 ITERATIONS 20 TOLERANCE 1e-7;
GRAPH ALGO WCC ON knowledge_graph;
GRAPH ALGO SSSP ON routes FROM 'city:chicago';
GRAPH ALGO COMMUNITY ON products ITERATIONS 10 RESOLUTION 1.0;
GRAPH ALGO BETWEENNESS ON network SAMPLE 500;
GRAPH ALGO KCORE ON collaboration;
GRAPH ALGO TRIANGLES ON social MODE global;
GRAPH ALGO DIAMETER ON web;
```

| Algorithm         | Computes                                 |
| ----------------- | ---------------------------------------- |
| PageRank          | Node importance via link structure       |
| WCC               | Weakly connected components              |
| Label Propagation | Community detection via label spreading  |
| LCC               | Local clustering coefficient             |
| SSSP              | Single-source shortest path (Dijkstra)   |
| Betweenness       | Bridge node identification               |
| Closeness         | How close a node is to all others        |
| Harmonic          | Closeness for disconnected graphs        |
| Degree            | Connection count (in/out/both)           |
| Louvain           | Community detection via modularity       |
| Triangles         | Triangle count (per-node or global)      |
| Diameter          | Longest shortest path                    |
| k-Core            | Coreness decomposition                   |

## GraphRAG

Combines vector similarity with graph traversal in one query:

```sql
GRAPH RAG FUSION ON entities
  QUERY $embedding
  VECTOR_FIELD 'embedding'
  VECTOR_TOP_K 50
  EXPANSION_DEPTH 2
  EDGE_LABEL 'related_to'
  FINAL_TOP_K 10
  RRF_K (60.0, 35.0);
```

1. Vector search finds semantically similar seed nodes
2. BFS expands seeds along edges
3.
RRF merges vector rank with graph hop distance --- # Full-Text Search Engine (docs/storage-engines/fts) --- title: Full-Text Search Engine description: Block-Max WAND BM25 with 27-language support, CJK tokenization, fuzzy matching, and hybrid vector fusion. --- # Full-Text Search Engine Block-Max WAND (BMW) optimized BM25 ranking with 16 Snowball stemmers, 27-language stop words, CJK bigram tokenization, posting compression, fuzzy matching, and native hybrid fusion with vector search. ## When to Use - Text search across documents, articles, products, logs - Search-as-you-type with fuzzy matching - Multilingual content search (including CJK, Arabic, Hindi) - Hybrid retrieval: keyword matching + semantic similarity ## SQL Usage ```sql CREATE COLLECTION articles; CREATE SEARCH INDEX ON articles FIELDS title, body ANALYZER 'english' FUZZY true; -- Basic search SELECT title, bm25_score(body, 'distributed database') AS score FROM articles WHERE text_match(body, 'distributed database') ORDER BY score DESC LIMIT 20; -- Fuzzy search SELECT title FROM articles WHERE text_match(title, 'databse', { fuzzy: true, distance: 2 }); -- Hybrid BM25 + vector (RRF) SELECT title, rrf_score( vector_distance(embedding, $query_vec), bm25_score(body, 'distributed systems') ) AS score FROM articles LIMIT 10; -- Synonyms CREATE SYNONYM GROUP db_terms AS ('database', 'db', 'datastore'); ``` ## Analyzers | Analyzer | Behavior | | ------------ | ----------------------------------------------- | | `standard` | NFD normalize, lowercase, English stop/stem | | `simple` | Lowercase + whitespace split | | `keyword` | Entire input as a single token | | `cjk_bigram` | CJK bigram tokenization | | `ngram:2:4` | Character n-grams (min:max) | | `edge_ngram` | Prefix-anchored n-grams for autocomplete | 16 language-specific analyzers: `ar`, `da`, `nl`, `en`, `fi`, `fr`, `de`, `hu`, `it`, `no`, `pt`, `ro`, `ru`, `es`, `sv`, `tr`. CJK text is automatically routed to bigram tokenizer regardless of configured analyzer. Optional dictionary segmentation via feature gates: `lang-ja`, `lang-zh`, `lang-ko`, `lang-th`. ## Internals - **BMW scoring** — WAND pivot selection + 128-doc block pruning via precomputed upper bounds - **Posting compression** — Delta-encoded, variable-width bitpacked doc IDs with SIMD unpack (SSE2/NEON) - **SmallFloat fieldnorms** — 1-byte length quantization (4x space reduction) - **LSM storage** — In-memory memtable → immutable segments → level-based compaction (8x8 tiering) - **AND-first with OR fallback** — Tries AND; falls back to OR with coverage penalty if zero results - **Phrase proximity boost** — Consecutive tokens at consecutive positions get up to 3x score boost --- # CRDT Engine (docs/storage-engines/crdt) --- title: CRDT Engine description: Loro-backed conflict-free replicated data types for offline-first edge sync with cloud consistency. --- # CRDT Engine The CRDT engine provides conflict-free replication for offline-first applications. Locally, writes are available immediately (AP). Globally, deltas are committed through Multi-Raft consensus (CP). SQL constraints are enforced at sync time. ## How It Works 1. Device writes locally — available immediately, no network required 2. Write produces a CRDT delta (Loro) 3. When connectivity returns, delta syncs to Origin via WebSocket 4. Origin validates SQL constraints (UNIQUE, FK, CHECK) at Raft commit 5. If constraints pass, delta is committed and replicated 6. 
If constraints fail, Origin sends a typed `CompensationHint` back to the device

## Conflict Resolution

Declarative conflict policies per collection:

```sql
-- Last-writer-wins (default)
CREATE COLLECTION notes WITH (conflict_policy = 'lww');

-- Rename suffix for UNIQUE conflicts
CREATE COLLECTION profiles WITH (conflict_policy = 'rename_suffix');
```

The `PolicyRegistry` stores per-collection resolution strategies. Available policies: `lww` (last-writer-wins), `rename_suffix` (append suffix on UNIQUE conflict), `cascade_defer` (retry with backoff for FK violations), `custom` (webhook), `escalate_to_dlq` (send to dead-letter queue).

## Constraint Validation

SQL constraints are checked on Origin at sync time, not on the device:

- **UNIQUE** — duplicate key → `CompensationHint::Rename` or `CompensationHint::Merge`
- **FK** — dangling reference → `CompensationHint::CreateParent` or `CompensationHint::Discard`
- **CHECK** — constraint violation → `CompensationHint::Adjust`

The application handles compensation — no silent data loss.

## Dead-Letter Queue

Deltas that persistently fail constraint validation are routed to a dead-letter queue with diagnostic metadata. Operators can inspect, replay, or discard them.

## Related

- [CRDT Sync](/docs/crdt-sync/overview) — Full sync protocol and offline patterns
- [Document Engine](/docs/storage-engines/document) — Schemaless documents with CRDT sync

---

# HNSW Index (docs/indexes/hnsw-index)

---
title: HNSW Index
description: Hierarchical Navigable Small World graph for approximate nearest neighbor search on vector embeddings.
---

# HNSW Index

HNSW (Hierarchical Navigable Small World) is a multi-layer proximity graph for approximate nearest neighbor search. It provides logarithmic search complexity with high recall.

## Creating an HNSW Index

```sql
CREATE VECTOR INDEX idx_embed ON articles METRIC cosine DIM 384;

-- With explicit parameters
CREATE VECTOR INDEX idx_embed ON articles
  METRIC cosine DIM 384 M 16 EF_CONSTRUCTION 200;
```

## Parameters

| Parameter          | Default | Description                                                    |
| ------------------ | ------- | -------------------------------------------------------------- |
| `METRIC`           | —       | Distance metric: `l2`, `cosine`, `inner_product`, `manhattan`, `chebyshev`, `hamming`, `jaccard`, `pearson` |
| `DIM`              | —       | Vector dimension (must match your embeddings)                   |
| `M`                | 16      | Max connections per node per layer. Higher = more accurate, more memory |
| `EF_CONSTRUCTION`  | 200     | Search width during index build. Higher = slower build, better graph quality |

## How It Works

- **Construction** — Vectors inserted at full precision (FP32/FP16) to maintain structural integrity
- **Search** — Traverses quantized copies for speed, then re-ranks top candidates against full-precision vectors
- **Layers** — Upper layers provide coarse navigation; the bottom layer contains every vector

## Quantization

Add quantization to reduce memory:

| Quantization | Memory reduction | Recall impact |
| ------------ | ---------------- | ------------- |
| SQ8          | ~4x              | Minimal       |
| PQ           | ~4-8x            | ~5% loss      |
| IVF-PQ       | ~16 bytes/vector | ~5-15% loss   |

## Search

```sql
-- k-NN search
SEARCH articles USING VECTOR(embedding, ARRAY[0.1, 0.3, ...], 10);

-- With pre-filtering
SELECT * FROM articles
WHERE category = 'ml'
  AND id IN (SEARCH articles USING VECTOR(embedding, $vec, 10));
```

---

# CSR Index (docs/indexes/csr-index)

---
title: CSR Index
description: Compressed Sparse Row adjacency index for cache-resident graph traversal.
---

# CSR Index

CSR (Compressed Sparse Row) is the graph engine's core index format. It stores adjacency data in contiguous arrays for cache-resident traversal.

## Layout

```
CsrIndex (per tenant):
  out_offsets: Vec<u64>                 [num_nodes + 1] — offset into target array per node
  out_targets: DenseArray<u32>          [num_edges] — destination node IDs (contiguous, mmap-capable)
  out_labels:  DenseArray<u32>          [num_edges] — edge labels (parallel array)
  out_weights: Option<DenseArray<f32>>  — optional, allocated only when weighted
  in_offsets / in_targets / in_labels / in_weights — symmetric for inbound
```

## Tenant Partitioning

The in-memory index is `ShardedCsrIndex` — one `CsrIndex` per tenant. Algorithms and traversals receive a single tenant's partition; there is no lexical tenant prefix on node names. Each `CsrIndex` is assigned a unique partition tag at construction, and public APIs that return dense node indices hand out `LocalNodeId { id, partition_tag }`. Using a node id from one partition with another partition's API panics at the boundary.

## Memory Efficiency

At 1 billion edges, CSR uses ~10 GB vs ~60 GB for naive adjacency lists (6x improvement). Node IDs are interned as `u32`, labels as `u32`.

## Storage

Edges are persisted in a redb B-Tree with forward and reverse indexes, both keyed by `(tenant_id: u32, "src\x00label\x00dst")` tuples. Tenant isolation is structural (first-class key component), not lexical. The CSR index is built at query time for bulk operations. Writes go to a mutable buffer and become visible immediately. Compaction merges the buffer into dense CSR arrays when the buffer exceeds 10% of the dense size.

## Graph Operations

The CSR index supports all traversal and algorithm operations:

- BFS/DFS traversal
- Shortest path (Dijkstra)
- All 13 native graph algorithms
- MATCH pattern matching
- GraphRAG fusion

---

# R*-tree Index (docs/indexes/rtree-index)

---
title: R*-tree Index
description: Spatial index for range queries, nearest neighbor, and OGC predicate evaluation.
---

# R*-tree Index

The R*-tree is the spatial engine's primary index. It supports range queries, nearest neighbor search, and spatial predicate evaluation.

## Creating a Spatial Index

```sql
-- Automatic via SPATIAL_INDEX column modifier (spatial peer engine)
CREATE COLLECTION locations (
  geom GEOMETRY SPATIAL_INDEX,
  name VARCHAR
) WITH (engine='spatial');

-- Or add to any collection
CREATE SPATIAL INDEX ON restaurants FIELDS location;
```

## Operations

- **Range query** — `ST_DWithin`, `ST_Within`, `ST_Intersects`
- **Nearest neighbor** — `ORDER BY ST_Distance(geom, point) LIMIT k`
- **Bulk load** — Optimized for batch inserts
- **Spatial join** — R*-tree probe join between two collections

## Query Execution

When a spatial predicate is present, the R*-tree narrows the candidate set before the columnar sparse index does final refinement. Queries without spatial predicates read directly from the columnar memtable — the R*-tree is not involved.

```sql
-- R*-tree lookup → sparse refinement → result
SELECT name FROM locations
WHERE ST_DWithin(geom, ST_Point(-73.98, 40.75), 500);

-- No spatial predicate → direct columnar scan
SELECT name FROM locations WHERE name LIKE 'Park%';
```

---

# B-tree Index (docs/indexes/btree-index)

---
title: B-tree Index
description: redb B-Tree indexes for metadata lookups and secondary indexes on document collections.
---

# B-tree Index

B-tree indexes power metadata lookups and secondary indexes on document collections. Backed by redb's ACID B-Tree storage.
## Creating Indexes

```sql
-- Secondary index on a document collection
CREATE INDEX ON users FIELDS email;

-- Compound index
CREATE INDEX ON orders FIELDS customer_id, status;

-- Unique index
CREATE UNIQUE INDEX ON users FIELDS email;
```

## When Used

- Point lookups on document collections (`WHERE email = 'alice@example.com'`)
- Range scans (`WHERE age > 25 AND age < 40`)
- Sorting (`ORDER BY created_at DESC`)
- Constraint enforcement (UNIQUE)

## Graph Edge Storage

Graph edges are persisted in redb B-Trees with forward and reverse indexes, keyed by `(tenant_id: u32, "src\x00label\x00dst")` tuples. Tenant isolation is a first-class key component, not a lexical prefix; the composite portion enables prefix scans for outbound traversal within a tenant.

---

# Inverted Index (docs/indexes/inverted-index)

---
title: Inverted Index
description: LSM-based inverted index for full-text search with Block-Max WAND scoring.
---

# Inverted Index

The inverted index maps terms to document IDs for full-text search. It uses an LSM architecture with posting compression and Block-Max WAND scoring.

## Creating a Search Index

```sql
CREATE SEARCH INDEX ON articles FIELDS title, body ANALYZER 'english' FUZZY true;
```

## Architecture

**Memtable** — Writes accumulate in-memory (`HashMap<Term, Vec<Posting>>`). When the memtable exceeds the threshold (32M posting entries or 100K unique terms), it flushes to an immutable segment.

**Segments** — Compressed on-disk segments with delta-encoded, bitpacked posting lists. Level-based compaction (8 levels, 8 segments per level).

**Query merge** — Searches merge the active memtable with all persisted segments.

## Posting Compression

- Delta encoding for sorted doc IDs
- Variable-width bitpacking (3-byte header: `[count: u16][bit_width: u8]`)
- SIMD-accelerated unpack (SSE2 on x86_64, NEON on AArch64)
- SmallFloat fieldnorms (1 byte per document, 4x space reduction)

## Block-Max WAND

Posting lists are split into 128-document blocks with precomputed `block_max_tf` and `block_min_fieldnorm`. During scoring, blocks that can't beat the current top-k threshold are skipped entirely.

---

# Hash Index (docs/indexes/hash-index)

---
title: Hash Index
description: O(1) hash-based index for key-value point lookups.
---

# Hash Index

The hash index provides O(1) point lookups by user-defined key. It is the primary index for KV collections.

## How It Works

Keys are hashed to locate the value directly — no tree traversal. This provides constant-time reads regardless of collection size.

## Creating a KV Collection

```sql
CREATE COLLECTION sessions (key TEXT PRIMARY KEY) WITH (engine='kv');
```

The `PRIMARY KEY` column is automatically hash-indexed.

## Secondary Indexes

KV collections can also have secondary B-tree indexes on value fields:

```sql
CREATE INDEX ON sessions FIELDS role;
SELECT key, user_id FROM sessions WHERE role = 'admin';
```

## TTL

Native TTL with an index-backed expiry wheel. Keys expire automatically:

```sql
INSERT INTO sessions { key: 'sess_abc', user_id: 'alice', ttl: 3600 };
```

---

# Bitemporal Queries (docs/temporal/bitemporal)

---
title: Bitemporal Queries
description: Track data along system time (when written) and valid time (when valid). Audit trails, backdated corrections, and compliance-grade history across multiple engines.
---

# Bitemporal Queries

Bitemporal databases track data along two independent time dimensions:

- **System time** — when the database recorded the value. Stored *in the key* (structural), derived from the WAL LSN at Raft commit.
Used for audit trails and asking "what did the database think was true at moment X?".
- **Valid time** — when the value represents. Stored *in the value*, client/device-assigned. Used for forecasts, backdated corrections, and asking "what was true in the world at moment X?".

Closed-open intervals: `[from, to)`; an open upper bound is `i64::MAX`.

## Supported Engines

| Engine                                                        | System Time | Valid Time | Example use case                   |
| ------------------------------------------------------------- | :---------: | :--------: | ---------------------------------- |
| [Graph](../storage-engines/graph) (edges + nodes)              | Yes         | Yes        | Entity-relationship timelines      |
| [Document (strict)](../storage-engines/document)               | Yes         | Yes        | Versioned profiles, ledgers        |
| [Document (schemaless)](../storage-engines/document)           | Yes         | Yes        | Event logs with backdated entries  |
| [Columnar](../storage-engines/columnar)                        | Yes         | Yes        | Audit tables, corrected metrics    |
| [Timeseries](../storage-engines/timeseries)                    | Yes         | Yes        | Forecast corrections, data repair  |
| [Array](../storage-engines/array) (tile-level versioning)      | Yes         | Yes        | Historical spatial snapshots       |

**Index engines (Vector, Full-Text Search, Spatial R\*-tree, Key-Value)** do not carry temporal columns themselves — they index records that live in data-bearing collections. To query at a point in time, attach the index to a `bitemporal=true` collection; `AS OF` filtering happens at the collection layer, not the index. See [Index engines and temporal composition](#index-engines-and-temporal-composition).

## SQL Syntax

`AS OF` clauses sit in the `FROM` part of a query. Times are milliseconds since Unix epoch — use `extract(epoch from now()) * 1000` for current time:

```sql
-- Read the database state as of a past system time
SELECT * FROM collection AS OF SYSTEM TIME 1700000000000;

-- Read rows whose valid-time interval includes a given moment
SELECT * FROM collection AS OF VALID TIME 1700000000000;

-- Both: rows that were valid AND in the system at a point
SELECT * FROM collection
  AS OF SYSTEM TIME 1700000000000
  AS OF VALID TIME 1700000001000;
```

## Examples

### Audit Trail (System Time)

```sql
CREATE COLLECTION user_accounts (
  id UUID DEFAULT gen_uuid_v7(),
  email VARCHAR,
  balance DECIMAL,
  created_at TIMESTAMP DEFAULT now()
) WITH (engine='document_strict', bitemporal=true);

INSERT INTO user_accounts (email, balance) VALUES ('alice@example.com', 100.00);
UPDATE user_accounts SET balance = 150.00 WHERE email = 'alice@example.com';

-- Database state 10 minutes ago
SELECT email, balance FROM user_accounts
AS OF SYSTEM TIME (extract(epoch from now()) * 1000 - 600000);
-- → alice@example.com, 100.00

-- Current state
SELECT email, balance FROM user_accounts;
-- → alice@example.com, 150.00
```

### Backdated Corrections (Valid Time)

```sql
CREATE COLLECTION sensor_readings (
  ts TIMESTAMP TIME_KEY,
  location VARCHAR,
  temperature FLOAT
) WITH (engine='timeseries', bitemporal=true);

-- Original reading
INSERT INTO sensor_readings (ts, location, temperature, valid_time)
VALUES ('2026-04-01T10:00:00Z', 'warehouse-a', 21.5, '2026-04-01T10:00:00Z');

-- Discover it was wrong; insert a correction valid as-of April 2nd
INSERT INTO sensor_readings (ts, location, temperature, valid_time)
VALUES ('2026-04-01T10:00:00Z', 'warehouse-a', 22.3, '2026-04-02T15:30:00Z');

-- What we knew on April 1st (before correction)
SELECT location, temperature FROM sensor_readings
WHERE ts BETWEEN '2026-04-01' AND '2026-04-02'
AS OF VALID TIME 1775044800000;  -- 2026-04-01T12:00:00Z

-- What we know now (after correction)
SELECT location, temperature FROM sensor_readings
WHERE ts BETWEEN '2026-04-01' AND '2026-04-02'
AS OF VALID TIME 1775174400000;  -- 2026-04-03T00:00:00Z
```

### Bitemporal Array Snapshot

```sql
CREATE ARRAY climate_grid
DIMS (lon INT32 DOMAIN [-180, 180), lat INT32 DOMAIN [-90, 90))
ATTRS (temp_c FLOAT32)
TILE_EXTENTS (64, 64)
WITH (audit_retain_ms = 7776000000);  -- 90 days

-- Cells as committed yesterday
SELECT lon, lat, temp_c FROM NDARRAY_SLICE(
  'climate_grid',
  {lon: [-10, 10), lat: [0, 20)},
  ['temp_c']
) AS OF SYSTEM TIME (extract(epoch from now()) * 1000 - 86400000);
```

### Lineage and Compliance

```sql
SELECT status, system_time FROM transactions
WHERE id = 'txn-1'
AS OF SYSTEM TIME NULL  -- special: returns all versions in system-time order
ORDER BY system_time ASC;
```

## Index Engines and Temporal Composition

Vector, FTS, KV, and Spatial don't carry temporal columns — by design. They are **index engines** that point at records living in data-bearing engines. Indexes don't have time; the records they reference do.

To query any of these "as of", attach the index to a collection with `bitemporal=true`. The collection holds the payload + temporal columns, the index returns candidate IDs, and `AS OF` filters at the collection layer.

### Bitemporal Vector Search

```sql
CREATE COLLECTION product_embeddings (
  id UUID DEFAULT gen_uuid_v7(),
  product_id UUID,
  description TEXT,
  embedding FLOAT[384],
  updated_at TIMESTAMP DEFAULT now()
) WITH (engine='document_strict', bitemporal=true);

CREATE VECTOR INDEX idx_product_vec ON product_embeddings METRIC cosine DIM 384;

-- Nearest neighbors as of 30 days ago
SELECT p.product_id, p.description,
       vector_distance(p.embedding, $query_vec) AS score
FROM product_embeddings p
AS OF SYSTEM TIME (extract(epoch from now()) * 1000 - 2592000000)
WHERE p.id IN (
  SEARCH product_embeddings USING VECTOR(embedding, $query_vec, 20)
)
ORDER BY score
LIMIT 10;
```

The vector index narrows the candidate set; the `AS OF SYSTEM TIME` clause on the collection removes candidates that didn't exist at that system time.

### Bitemporal Full-Text Search

```sql
CREATE COLLECTION articles (
  id UUID DEFAULT gen_uuid_v7(),
  title VARCHAR,
  body TEXT,
  published_at TIMESTAMP
) WITH (engine='document_strict', bitemporal=true);

CREATE SEARCH INDEX ON articles FIELDS title, body ANALYZER 'english';

SELECT id, title FROM articles
AS OF SYSTEM TIME (extract(epoch from now()) * 1000 - 86400000)
WHERE text_match(body, 'distributed consensus raft')
ORDER BY bm25_score(body, 'distributed consensus raft') DESC
LIMIT 20;
```

### Bitemporal Spatial Queries

```sql
CREATE COLLECTION store_locations (
  id UUID DEFAULT gen_uuid_v7(),
  name VARCHAR,
  location GEOMETRY,
  opened_at TIMESTAMP,
  closed_at TIMESTAMP
) WITH (engine='document_strict', bitemporal=true);

CREATE SPATIAL INDEX ON store_locations FIELDS location;

SELECT id, name, ST_Distance(location, ST_Point(-73.990, 40.750)) AS dist_m
FROM store_locations
AS OF SYSTEM TIME 1704067200000
WHERE ST_DWithin(location, ST_Point(-73.990, 40.750), 5000)
ORDER BY dist_m;
```

### Temporal Key-Value: Use `document_strict`

The KV engine targets O(1) point lookups and doesn't carry temporal columns.
For versioned config, auditable feature flags, or time-stamped session state, use a `document_strict` collection with a unique index — O(1) lookup is preserved, temporal history comes from the document layer: ```sql CREATE COLLECTION config_entries ( id UUID DEFAULT gen_uuid_v7(), key VARCHAR, value TEXT, updated_at TIMESTAMP DEFAULT now() ) WITH (engine='document_strict', bitemporal=true); CREATE UNIQUE INDEX ON config_entries(key); INSERT INTO config_entries (key, value) VALUES ('feature_x_enabled', 'true') ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value; SELECT value FROM config_entries WHERE key = 'feature_x_enabled'; -- What was the value 7 days ago? SELECT value FROM config_entries AS OF SYSTEM TIME (extract(epoch from now()) * 1000 - 604800000) WHERE key = 'feature_x_enabled'; ``` ## GDPR and Data Minimization `audit_retain_ms` enforces automatic purge of old versions during compaction: ```sql CREATE COLLECTION user_activity ( user_id UUID, action VARCHAR, ts TIMESTAMP TIME_KEY ) WITH (engine='columnar', audit_retain_ms=2592000000); -- 30 days ``` Tiles or row-versions older than the window are irreversibly purged. Historical queries beyond the retention window return no rows for that range. ## Performance Notes - **System time queries** read from historical snapshots — performance depends on snapshot availability for the requested time - **Valid time queries** scan all versions and filter — slower than single-version reads - **Both clauses** intersect — slower still, but produce precise audit trails For large collections with heavy correction traffic: archive old versions to L2 (S3) cold storage periodically, reduce `audit_retain_ms` once the compliance window expires, and rely on columnar compression to keep storage overhead low. --- # SQL Overview (docs/sql/overview) --- title: SQL Overview description: NodeDB uses SQL as its primary query language, extended with engine-specific syntax for vectors, graphs, spatial, and more. --- # SQL Overview NodeDB uses SQL as its primary query language. Whether you connect via `ndb`, `psql`, or HTTP — the same SQL works everywhere. ## Query Execution ``` ndb CLI (NDB protocol) ──┐ psql (pgwire) ──┼──► SQL Parser ──► EngineRules ──► PhysicalPlan ──► Data Plane HTTP (REST/JSON) ──┘ ``` Three doors, one room. Same parser, same optimizer, same execution engine. ## Standard SQL Support NodeDB supports standard SQL with high PostgreSQL compatibility: - SELECT with WHERE, GROUP BY, HAVING, ORDER BY, LIMIT, OFFSET - JOINs (INNER, LEFT, RIGHT, CROSS, SEMI, ANTI) - Window functions (ROW_NUMBER, RANK, LAG, LEAD, SUM OVER) - CTEs (WITH, WITH RECURSIVE) - Subqueries and derived tables - Set operations (UNION, INTERSECT, EXCEPT) - Transactions with savepoints (BEGIN, COMMIT, ROLLBACK, SAVEPOINT) - Prepared statements ## Engine-Specific Extensions Beyond standard SQL, NodeDB extends the language for each engine: - **Vector** — `SEARCH ... USING VECTOR()`, `vector_distance()` - **Graph** — `GRAPH TRAVERSE`, `GRAPH ALGO`, `MATCH` patterns - **Full-Text** — `text_match()`, `bm25_score()`, `search_score()` - **Spatial** — `ST_DWithin()`, `ST_Contains()`, `ST_Distance()`, etc. - **Timeseries** — `time_bucket()`, `ts_rate()`, continuous aggregates - **KV** — `KV_INCR()`, `KV_CAS()`, sorted indexes, rate gates - **CRDT** — `crdt_state()`, `crdt_apply()` ## PostgreSQL Compatibility These standard Postgres surface elements work natively over pgwire — no rewriting, no extension. ### Identifiers vs. 
Literals

`"quoted"` is a case-preserved identifier; `'quoted'` is a string literal. `SELECT "userId"` projects the column named `userId` exactly; `SELECT 'userId'` returns the constant string.

### Pattern Matching

```sql
SELECT * FROM users WHERE name LIKE 'Ali%';            -- case-sensitive
SELECT * FROM users WHERE email ILIKE '%@EXAMPLE.COM'; -- case-insensitive
```

### JSON Operators

```sql
doc -> 'addr'             -- field as JSON
doc ->> 'name'            -- field as text
doc #> '{addr,city}'      -- nested path as JSON
doc #>> '{addr,city}'     -- nested path as text
doc @> '{"role":"admin"}' -- left contains right
doc <@ '{"role":"admin"}' -- left contained in right
doc ? 'email'             -- top-level key exists
```

### Full-Text Search (Postgres syntax)

```sql
SELECT * FROM articles
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'transformer & attention');
```

NodeDB also exposes `text_match()`, `bm25_score()`, and `search_score()` for finer control — see [Full-Text Search](fulltext-search).

### Vector Distance Operators

| Operator | Function                     | Metric                 |
| -------- | ---------------------------- | ---------------------- |
| `<->`    | `vector_distance`            | L2 (default)           |
| `<=>`    | `vector_cosine_distance`     | Cosine                 |
| `<#>`    | `vector_neg_inner_product`   | Negative inner product |

```sql
SELECT id FROM articles ORDER BY embedding <=> $query LIMIT 10;
```

## Isolation Level

Snapshot Isolation (SI). Reads see a consistent snapshot from `BEGIN` time. Write conflicts detected at `COMMIT`.

---

# DDL (docs/sql/ddl)

---
title: DDL
description: CREATE, ALTER, and DROP statements for collections, indexes, triggers, functions, and more.
---

# DDL (Data Definition Language)

## Collections

```sql
-- Schemaless document (default)
CREATE COLLECTION users;

-- Strict document
CREATE COLLECTION orders (
  id TEXT PRIMARY KEY,
  customer_id TEXT,
  total FLOAT,
  status TEXT,
  created_at TIMESTAMP
) WITH (engine='document_strict');

-- Key-Value
CREATE COLLECTION sessions (key TEXT PRIMARY KEY) WITH (engine='kv');

-- Columnar (plain analytics)
CREATE COLLECTION logs (ts TIMESTAMP TIME_KEY, host VARCHAR, level VARCHAR, message VARCHAR)
WITH (engine='columnar');

-- Timeseries (peer engine, append-only with retention + continuous aggregates)
CREATE COLLECTION metrics (ts TIMESTAMP TIME_KEY, host VARCHAR, cpu FLOAT)
WITH (engine='timeseries', partition_by='1h', retention='90d');

-- Spatial (peer engine, R*-tree + geohash + OGC predicates)
CREATE COLLECTION locations (geom GEOMETRY SPATIAL_INDEX, name VARCHAR) WITH (engine='spatial');

-- Convenience alias for timeseries
CREATE TIMESERIES metrics;

DROP COLLECTION users;        -- soft-delete: tombstoned, still restorable
UNDROP COLLECTION users;      -- restore while within the retention window
DROP COLLECTION users PURGE;  -- admin-only: immediate hard-delete, not restorable

SHOW COLLECTIONS;
DESCRIBE users;
```

`DROP COLLECTION` is two-phase. A plain drop marks the collection as tombstoned and enqueues background L2 cleanup after a configurable retention window; within that window, `UNDROP COLLECTION` restores the collection and its data. Admins can bypass the window with `DROP ... PURGE`.
```sql
-- Retention window knobs
ALTER SYSTEM SET collection_gc_retention = '7d';
ALTER TENANT acme SET collection_gc_retention = '24h';

-- Inspect the lifecycle
SELECT name, dropped_at, size_bytes_estimate FROM _system.dropped_collections;
SELECT * FROM _system.l2_cleanup_queue;
```

Dropping a collection also cascades downstream catalog state: dependent materialized views are unregistered, change-stream consumer groups are torn down, and the L2 cleanup worker reclaims segment files.

## Constraint Policy

NodeDB accepts a deliberately small subset of SQL constraints in `CREATE COLLECTION` / `CREATE TABLE`. Anything outside that subset is rejected with SQLSTATE `0A000` (`feature_not_supported`) — the parser tells you the canonical replacement instead of silently stripping the constraint.

**Accepted:** inline `<column> PRIMARY KEY` on a single column. The named column becomes the row identity; without one, NodeDB auto-injects `_rowid INT64 PRIMARY KEY`.

**Rejected at DDL time:**

| Form                                        | Replacement                                                  |
| ------------------------------------------- | ------------------------------------------------------------ |
| Table-level `PRIMARY KEY (col)`             | Inline form: `<column> PRIMARY KEY`                           |
| Named `CONSTRAINT pk_x PRIMARY KEY (col)`   | Inline form                                                   |
| Inline or table-level `UNIQUE`              | `CREATE UNIQUE INDEX ON tbl(col)`                             |
| Inline or table-level `CHECK (...)`         | Application-side validation (typeguards on schemaless docs)   |
| `FOREIGN KEY (...) REFERENCES ...`          | Application-side enforcement                                  |
| Column-level `REFERENCES other(col)`        | Application-side enforcement                                  |

These are launch-frozen choices. UNIQUE/CHECK/FK/REFERENCES would each pull in a fixed semantic that's expensive to walk back, and the canonical replacements (unique indexes, app-side validation) cover the same use cases without locking the engine into a particular cross-row enforcement strategy.

## Schema Evolution

```sql
ALTER TABLE orders ADD COLUMN priority INT;
ALTER COLLECTION orders ADD COLUMN region STRING DEFAULT 'us-east';
ALTER COLLECTION orders DROP COLUMN region;
```

## Storage Conversion

```sql
CONVERT COLLECTION cache TO kv;
CONVERT COLLECTION users TO document_strict;
CONVERT COLLECTION logs TO document_schemaless;
```

## Indexes

```sql
CREATE INDEX idx_email ON users(email);
CREATE UNIQUE INDEX ON users(username);
CREATE VECTOR INDEX ON articles METRIC cosine DIM 384 M 16 EF_CONSTRUCTION 200;
CREATE SEARCH INDEX ON articles FIELDS title, body ANALYZER 'english' FUZZY true;
CREATE SPATIAL INDEX ON locations(geom);
CREATE SORTED INDEX lb ON scores (score DESC) KEY player_id;
DROP INDEX idx_email;
```

## Triggers

```sql
-- ASYNC (default): Event Plane, zero write-latency impact
CREATE TRIGGER notify AFTER INSERT ON orders FOR EACH ROW $$
BEGIN
  INSERT INTO notifications (user_id, message) VALUES (NEW.customer_id, 'Order placed');
END;
$$;

-- SYNC: same transaction, ACID
CREATE TRIGGER enforce AFTER UPDATE ON accounts FOR EACH ROW WITH (EXECUTION = SYNC) $$
BEGIN
  IF NEW.balance < 0 THEN RAISE EXCEPTION 'Negative balance'; END IF;
END;
$$;

DROP TRIGGER notify ON orders;
SHOW TRIGGERS;
```

The op tag (INSERT vs UPDATE) is derived from storage prior-bytes, not the surface SQL verb. An `UPSERT` or `INSERT ... ON CONFLICT (pk) DO UPDATE` that overwrites an existing row fires `AFTER UPDATE`; the same statement against a non-existent key fires `AFTER INSERT`. `ON CONFLICT DO NOTHING` emits no event on conflict.
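A minimal sketch of that op-tag behavior (the collection, trigger, and audit names here are hypothetical):

```sql
CREATE COLLECTION carts (id TEXT PRIMARY KEY, items INT) WITH (engine='document_strict');

CREATE TRIGGER cart_overwritten AFTER UPDATE ON carts FOR EACH ROW $$
BEGIN
  INSERT INTO cart_audit (cart_id, note) VALUES (NEW.id, 'row overwritten');
END;
$$;

UPSERT INTO carts (id, items) VALUES ('c1', 1);  -- no prior bytes → AFTER INSERT; trigger silent
UPSERT INTO carts (id, items) VALUES ('c1', 2);  -- prior bytes exist → AFTER UPDATE; trigger fires
INSERT INTO carts (id, items) VALUES ('c1', 3)
ON CONFLICT DO NOTHING;                          -- conflict swallowed → no event, no trigger
```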
## Functions & Procedures

```sql
-- SQL expression (inlined, zero overhead)
CREATE FUNCTION full_name(first VARCHAR, last VARCHAR)
RETURNS VARCHAR LANGUAGE SQL IMMUTABLE
AS $$ first || ' ' || last $$;

-- Procedural
CREATE PROCEDURE transfer_funds(from_id UUID, to_id UUID, amount DECIMAL)
BEGIN
  UPDATE accounts SET balance = balance - amount WHERE id = from_id;
  UPDATE accounts SET balance = balance + amount WHERE id = to_id;
END;

CALL transfer_funds('acc_a', 'acc_b', 50.00);
```

## Change Streams & Topics

```sql
CREATE CHANGE STREAM order_events ON orders WITH (URL = 'https://hooks.example.com/orders');
CREATE CONSUMER GROUP processors ON order_events;

CREATE TOPIC alerts WITH (RETENTION = '1 hour');
PUBLISH TO alerts 'message';
```

## Materialized Views

```sql
CREATE MATERIALIZED VIEW order_stats AS
SELECT status, COUNT(*), SUM(total) FROM orders GROUP BY status;

REFRESH MATERIALIZED VIEW order_stats;

CREATE CONTINUOUS AGGREGATE cpu_hourly ON cpu_metrics AS
SELECT time_bucket('1 hour', ts) AS hour, host, AVG(cpu)
FROM cpu_metrics GROUP BY hour, host
WITH (refresh_interval = '1m');
```

## Cron Scheduler

```sql
CREATE SCHEDULE nightly_cleanup CRON '0 2 * * *' AS
BEGIN
  DELETE FROM sessions WHERE expires_at < now();
END;
```

## Backup & Restore

```sql
BACKUP TENANT acme TO '/backups/acme.bak';
RESTORE TENANT acme FROM '/backups/acme.bak';
RESTORE TENANT acme FROM '/backups/acme.bak' DRY RUN;
PURGE TENANT acme CONFIRM;
```

---

# SELECT (docs/sql/select)

---
title: SELECT
description: Query data with filtering, aggregation, window functions, CTEs, and subqueries.
---

# SELECT

```sql
SELECT [DISTINCT] <columns>
FROM <collection>
[WHERE <predicate>]
[GROUP BY <columns>]
[HAVING <predicate>]
[ORDER BY <column> [ASC|DESC], ...]
[LIMIT <n>]
[OFFSET <n>]
```

## Filtering

```sql
SELECT * FROM users WHERE age > 30 AND status = 'active';
SELECT * FROM users WHERE name LIKE 'Ali%';
SELECT * FROM users WHERE email ILIKE '%@EXAMPLE.COM';
SELECT * FROM orders WHERE total BETWEEN 10 AND 100;
SELECT * FROM users WHERE role IN ('admin', 'editor');
SELECT * FROM users WHERE deleted_at IS NULL;
```

## Aggregates

```sql
SELECT status, COUNT(*), AVG(age), MIN(salary), MAX(salary)
FROM employees
WHERE department = 'sales'
GROUP BY status
HAVING COUNT(*) > 5;

SELECT COUNT(DISTINCT user_id) FROM orders;
```

## Window Functions

```sql
SELECT id,
       ROW_NUMBER() OVER (ORDER BY created_at) AS rn,
       RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rank,
       LAG(salary, 1) OVER (ORDER BY created_at) AS prev_salary,
       SUM(amount) OVER (ORDER BY created_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM employees;
```

## CTEs & Subqueries

```sql
WITH active AS (SELECT id FROM users WHERE status = 'active')
SELECT * FROM orders WHERE user_id IN (SELECT id FROM active);

-- Recursive CTE
WITH RECURSIVE subordinates AS (
  SELECT id, name, manager_id FROM employees WHERE id = 'emp_root'
  UNION ALL
  SELECT e.id, e.name, e.manager_id
  FROM employees e JOIN subordinates s ON e.manager_id = s.id
)
SELECT * FROM subordinates;
```

## Set Operations

```sql
SELECT id FROM collection_a UNION ALL SELECT id FROM collection_b;
SELECT id FROM collection_a INTERSECT SELECT id FROM collection_b;
SELECT id FROM collection_a EXCEPT SELECT id FROM collection_b;
```

## Computed Columns

```sql
SELECT price * qty AS total,
       UPPER(name) AS name_upper,
       CASE WHEN price > 100 THEN 'expensive' ELSE 'cheap' END AS tier
FROM orders;
```

---

# INSERT & UPSERT (docs/sql/insert-upsert)

---
title: INSERT & UPSERT
description: Insert documents, upsert with merge semantics, and bulk import.
---

# INSERT & UPSERT

## INSERT

```sql
-- Single row
INSERT INTO users (id, name, email) VALUES ('u1', 'Alice', 'alice@example.com');

-- Multiple rows
INSERT INTO users (id, name) VALUES ('u1', 'Alice'), ('u2', 'Bob');

-- Object literal syntax (schemaless collections)
INSERT INTO users { name: 'Alice', email: 'alice@example.com', age: 30 };

-- INSERT ... SELECT
INSERT INTO archive SELECT * FROM orders WHERE created_at < '2025-01-01';
```

Plain `INSERT` is strict: a duplicate primary key raises `unique_violation` (SQLSTATE `23505`). Use `ON CONFLICT` or `UPSERT` for "insert if absent" or "insert or overwrite" semantics.

## INSERT ... ON CONFLICT

```sql
-- Skip rows that would collide with an existing PK (no error)
INSERT INTO users (id, name) VALUES ('u1', 'Alice')
ON CONFLICT DO NOTHING;

-- Overwrite selected fields on conflict. EXCLUDED refers to the incoming row;
-- bare column names refer to the existing row.
INSERT INTO users (id, name, login_count) VALUES ('u1', 'Alice', 1)
ON CONFLICT (id) DO UPDATE
SET name = EXCLUDED.name,
    login_count = users.login_count + EXCLUDED.login_count;
```

`ON CONFLICT DO UPDATE` fires `AFTER UPDATE` triggers (not `AFTER INSERT`) when a row is overwritten — the trigger op tag is derived from storage prior-bytes, not the surface SQL verb.

## UPSERT

Insert or overwrite if the primary key already exists:

```sql
UPSERT INTO users (id, name, role) VALUES ('u1', 'Alice', 'admin');

-- Object literal syntax
UPSERT INTO users { id: 'u1', name: 'Alice', role: 'admin' };
```

`UPSERT` is equivalent to `INSERT ... ON CONFLICT (<pk>) DO UPDATE SET <col> = EXCLUDED.<col>`. It fires `AFTER UPDATE` on overwrite and `AFTER INSERT` on first write.

## Bulk Import

```sql
-- Auto-detected format (NDJSON, JSON array, or CSV)
COPY users FROM '/path/to/users.ndjson';
COPY users FROM '/path/to/users.csv' WITH (FORMAT csv);
```

---

# UPDATE & DELETE (docs/sql/update-delete)

---
title: UPDATE & DELETE
description: Point and bulk mutations, transactions, and limitations.
---

# UPDATE & DELETE

## UPDATE

```sql
UPDATE users SET role = 'admin' WHERE id = 'u1';
UPDATE users SET status = 'inactive' WHERE last_login < '2025-01-01';
```

## DELETE

```sql
DELETE FROM users WHERE id = 'u1';
DELETE FROM orders WHERE status = 'cancelled';
TRUNCATE users;
```

## Transactions

```sql
BEGIN;
INSERT INTO orders (id, total) VALUES ('o1', 99.99);
UPDATE inventory SET stock = stock - 1 WHERE id = 'item1';
COMMIT;

-- Rollback
BEGIN;
DELETE FROM users WHERE id = 'u1';
ROLLBACK;

-- Savepoints
BEGIN;
SAVEPOINT sp1;
INSERT INTO users (id, name) VALUES ('u1', 'Alice');
ROLLBACK TO sp1;
COMMIT;
```

## Atomic Transfers

```sql
-- Fungible (currency, resources)
SELECT TRANSFER('wallets', 'player-A', 'player-B', 'gold', 500);

-- Non-fungible (unique items)
SELECT TRANSFER_ITEM('inventory', 'inventory', 'sword-of-doom', 'player-A', 'player-B');
```

## Limitations

- `UPDATE/DELETE ... JOIN` is not supported. Rewrite as subquery: `DELETE FROM orders WHERE user_id IN (SELECT id FROM users WHERE ...)`
- `UPDATE/DELETE` on timeseries collections is not supported (append-only). Use retention policies.

---

# JOINs & Aggregations (docs/sql/joins-aggregations)

---
title: JOINs & Aggregations
description: Inner, outer, semi, and anti joins. Aggregate functions and GROUP BY.
--- # JOINs & Aggregations ## JOINs ```sql -- Inner join SELECT u.name, o.total FROM users u JOIN orders o ON u.id = o.user_id; -- Left join SELECT u.name, o.total FROM users u LEFT JOIN orders o ON u.id = o.user_id; -- Cross join SELECT * FROM sizes CROSS JOIN colors; -- Semi join (EXISTS) SELECT * FROM users u WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id); -- Anti join (NOT EXISTS) SELECT * FROM users u WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id); ``` JOINs work across collections of different storage types — you can join a KV collection with a strict document collection. ## Aggregate Functions | Function | Description | | ----------------------- | ----------------------- | | `COUNT(*)` | Row count | | `COUNT(DISTINCT col)` | Distinct count | | `SUM(col)` | Sum | | `AVG(col)` | Average | | `MIN(col)` / `MAX(col)` | Minimum / Maximum | ## Approximate Aggregates Mergeable across shards, usable in continuous aggregation: | Function | Description | | ------------------------------- | ------------------ | | `approx_count_distinct(col)` | HyperLogLog | | `approx_percentile(col, p)` | t-digest | | `approx_topk(col, k)` | SpaceSaving | ## Weighted Random Selection ```sql SELECT * FROM WEIGHTED_PICK('loot_table', weight => 'drop_rate', count => 1); SELECT * FROM WEIGHTED_PICK('gacha_pool', weight => 'probability', count => 10, SEED => 'player:pull'); ``` --- # Functions Reference (docs/sql/functions) --- title: Functions Reference description: Built-in scalar, aggregate, and engine-specific SQL functions. --- # Functions Reference ## String Functions `LENGTH`, `SUBSTR`, `UPPER`, `LOWER`, `TRIM`, `LTRIM`, `RTRIM`, `CONCAT` (or `||`), `REPLACE`, `SPLIT` ## Numeric Functions `ABS`, `CEIL`, `FLOOR`, `ROUND`, `SQRT`, `POWER`, `MOD` (or `%`), `GREATEST`, `LEAST` ## Date/Time Functions | Function | Description | | ------------------------------- | -------------------------------------- | | `NOW()` | Current UTC timestamp | | `CURRENT_TIMESTAMP()` | Alias for NOW() | | `EXTRACT(field FROM ts)` | Extract year, month, day, hour, etc. | | `DATE_TRUNC(unit, ts)` | Truncate to unit boundary | | `DATE_FORMAT(ts, fmt)` | Format timestamp as string | | `time_bucket(interval, ts)` | Truncate to interval boundary | `time_bucket` accepts: `'1s'`, `'5m'`, `'15m'`, `'1h'`, `'6h'`, `'1d'`, `'1w'`, or ISO 8601 (`'PT5M'`, `'P1D'`). 
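A small sketch of the two interval spellings, reusing the `cpu_metrics` collection from the timeseries docs — the shorthand and ISO 8601 forms should bucket identically:

```sql
-- '5m' and 'PT5M' name the same five-minute bucket width
SELECT time_bucket('5m', ts) AS bucket, AVG(cpu_usage) AS avg_cpu
FROM cpu_metrics GROUP BY bucket ORDER BY bucket;

SELECT time_bucket('PT5M', ts) AS bucket, AVG(cpu_usage) AS avg_cpu
FROM cpu_metrics GROUP BY bucket ORDER BY bucket;
```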
## Type Functions `CAST(expr AS type)`, `TRY_CAST(expr AS type)`, `expr::type` ## ID Generation `gen_uuid_v7()` — time-ordered UUID v7 ## KV Atomic Functions | Function | Description | | -------------------------------------- | ------------------------------ | | `KV_INCR(coll, key, delta [, TTL])` | Atomic increment | | `KV_DECR(coll, key, delta)` | Atomic decrement | | `KV_INCR_FLOAT(coll, key, delta)` | Float increment | | `KV_CAS(coll, key, expected, new)` | Compare-and-swap | | `KV_GETSET(coll, key, new)` | Atomic get-and-set | ## Leaderboard Functions | Function | Description | | --------------------------- | ------------------------ | | `RANK(index, key)` | Player's rank | | `TOPK(index, k)` | Top-k entries (TVF) | | `RANGE(index, min, max)` | Score range query (TVF) | | `SORTED_COUNT(index)` | Total entries | ## Rate Limiting Functions | Function | Description | | ---------------------------------------------- | -------------------------- | | `RATE_CHECK(gate, key, max_count, window_secs)` | Check and consume | | `RATE_REMAINING(gate, key, max_count, window_secs)` | Check remaining budget | | `RATE_RESET(gate, key)` | Reset a cooldown | ## Transfer Functions | Function | Description | | ------------------------------------------------------ | -------------------- | | `TRANSFER(coll, source, dest, field, amount)` | Fungible transfer | | `TRANSFER_ITEM(src_coll, dst_coll, item, src, dst)` | Non-fungible transfer| ## Timeseries Functions `ts_rate`, `ts_delta`, `ts_moving_avg`, `ts_ema`, `ts_interpolate`, `ts_percentile`, `ts_correlate`, `ts_lag`, `ts_lead`, `ts_rank`, `ts_stddev`, `ts_derivative`, `ts_zscore`, `ts_bollinger_upper/lower/mid/width`, `ts_moving_percentile` ## Vector Functions `vector_distance(col, vec)`, `rrf_score(vector_distance, bm25_score [, k1, k2])` ## Full-Text Functions `text_match(col, query)`, `bm25_score(col, query)`, `search_score(col, query)` ## Document Navigation `doc_get(payload, path)`, `doc_exists(payload, path)`, `doc_array_contains(payload, path, value)` ## CRDT Functions `crdt_state(coll, id)`, `crdt_apply(coll, id, delta)` --- # Vector Search Queries (docs/sql/vector-search) --- title: Vector Search Queries description: Nearest neighbor search, filtered vector search, and multi-vector queries. --- # Vector Search Queries ## Nearest Neighbor ```sql SEARCH articles USING VECTOR(embedding, ARRAY[0.1, 0.3, -0.2, ...], 10); ``` Returns the 10 nearest neighbors by the metric configured on the index (l2, cosine, inner_product, manhattan, chebyshev, hamming, jaccard, or pearson). ## Filtered Vector Search ```sql SELECT title, vector_distance(embedding, ARRAY[0.1, 0.3, ...]) AS score FROM articles WHERE category = 'machine-learning' AND id IN (SEARCH articles USING VECTOR(embedding, ARRAY[0.1, 0.3, ...], 10)); ``` The engine builds a Roaring Bitmap of matching IDs and selects the optimal strategy: pre-filter (selective filters), post-filter (broad filters), or brute-force (very selective). 
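A sketch of how filter shape steers that choice. The strategy notes in the comments restate the heuristic above and are illustrative only, and the `published` / `author_id` columns are hypothetical:

```sql
-- Broad filter (matches most rows): post-filtering the ANN result is cheapest
SELECT title FROM articles
WHERE published = true
  AND id IN (SEARCH articles USING VECTOR(embedding, $query_vec, 10));

-- Very selective filter (a handful of matching IDs): brute-force distance over the bitmap
SELECT title FROM articles
WHERE author_id = 'a-rarely-seen-author'
  AND id IN (SEARCH articles USING VECTOR(embedding, $query_vec, 10));
```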
## Distance Function ```sql SELECT id, vector_distance(embedding, $query_vec) AS dist FROM articles ORDER BY dist LIMIT 10; ``` ### Operator Forms | Operator | Function | Metric | | -------- | ------------------------------ | ----------------------- | | `<->` | `vector_distance` | L2 | | `<=>` | `vector_cosine_distance` | Cosine | | `<#>` | `vector_neg_inner_product` | Negative inner product | ```sql SELECT id FROM articles ORDER BY embedding <=> $query_vec LIMIT 10; ``` ### ANN Tuning (Named Arguments) `vector_distance` and its cosine / inner-product peers accept named tuning arguments via `=>`. JSON-string options are not accepted — every option is a typed, closed-set named argument: ```sql SELECT id, vector_distance( embedding, $query_vec, quantization => 'rabitq', oversample => 4, ef_search => 128, target_recall => 0.95 ) AS dist FROM articles ORDER BY dist LIMIT 10; ``` | Argument | Type | Notes | | ------------------- | ------- | ------------------------------------------------------------------------------------- | | `quantization` | string | `none`, `sq8`, `pq`, `binary`, `ternary`, `rabitq`, `bbq`, `opq` | | `oversample` | u8 | Candidates fetched before re-ranking. Default `3`. Final rerank set is `oversample × ef_search`. | | `query_dim` | u32 | Coarse-to-fine on first-N dims of Matryoshka embeddings. `None` = full dimensionality. | | `meta_token_budget` | u8 | MetaEmbed multivec scoring budget for MaxSim / PLAID | | `ef_search` | u32 | HNSW / Vamana search-time beam width. Default `64`. | | `target_recall` | f32 | Adaptive recall target. The cost-model planner picks oversample / ef_search to hit this. | Unknown names, duplicate keys, positional 3rd args, or wrong operators (`=` instead of `=>`) all return typed errors that list the canonical names. ### Quantization Choice | Codec | Bits/dim | Recall (typ.) | When to pick it | | --------- | -------- | ------------- | -------------------------------------------------------------- | | `none` | 32 | 100% | Small index (< 1M vectors), latency not critical | | `sq8` | 8 | ~99% | Balanced default for medium index sizes | | `pq` | ~2 | ~95% | Large memory-bound indexes; classic Product Quantization | | `opq` | ~2 | ~96% | PQ + learned random rotation; minor accuracy bump over `pq` | | `rabitq` | 1 | ~97% | Frontier 1-bit with `O(1/√D)` error bound (SIGMOD 2024) | | `bbq` | 1 | ~98% | Centroid-asymmetric 1-bit + 14-byte corrective; oversample-rerank | | `binary` | 1 | ~85% | Hamming-only, no rerank — for ultra-cold tiers | | `ternary` | 1.58 | ~96% | BitNet-style `{-1, 0, +1}` — cold/hot pack for AVX-512 | The cost-model planner (`target_recall`) will pick `oversample` and `ef_search` automatically once you set the recall target. Manually set those two only when you need a hard latency ceiling. ## Index Implementations You don't pick the underlying index directly — it's chosen by the planner from collection metadata + workload signals: - **HNSW** — in-memory hierarchical graph, the default for moderate-size indexes - **Vamana / DiskANN** — flat-beam SSD-resident graph for billion-scale on a single node (`Tier 2` of the vector frontier) - **NaviX adaptive-local filtered traversal** (VLDB 2025) — switches per-hop between standard / directed / blind heuristics based on local selectivity. Replaces classic ACORN-1 filtered ANN. - **SIEVE workload-driven subindex collections** — the planner builds specialized HNSW subindices for stable predicates (e.g. `tenant_id`) and routes filtered queries to them. 
- **MetaEmbed multi-vector + ColBERT MaxSim + PLAID** (ICLR 2026) — learnable Meta Tokens replace per-token explosion; budgeted MaxSim at query time via `meta_token_budget`. - **Matryoshka adaptive-dim querying** — coarse-to-fine ranking on the first-N dimensions of MRL embeddings via `query_dim`. - **SPFresh streaming updates** (SOSP 2023) — LIRE topology-aware local rebalancing; no full-rebuild stalls when vectors are added/removed. ## Vector-Primary Collections By default, vectors are an *index* attached to a column on a normal collection — the document/strict store is the source of truth, and the vector index is a side path. For pure-vector workloads (RAG corpora, recommendation memory, embedding stores) you can flip a collection into **vector-primary** mode, where the vector index becomes the primary access path and the document store is a metadata sidecar: ```sql CREATE COLLECTION corpus ( id UUID DEFAULT gen_uuid_v7(), embedding FLOAT[384], title TEXT, tenant_id UUID, created_at TIMESTAMP DEFAULT now() ) WITH ( primary='vector', vector_field='embedding', dim=384, metric='cosine', quantization='rabitq', m=32, ef_construction=200, payload_indexes=['tenant_id', 'created_at'] ); ``` | Option | Notes | | ----------------- | -------------------------------------------------------------------- | | `primary` | `'document'` (default), `'strict'`, `'kv'`, or `'vector'` | | `vector_field` | Required when `primary='vector'`. Column name of the embedding. | | `dim` | Required. Embedding dimensionality. | | `metric` | `'cosine'`, `'l2'`, `'inner_product'`, etc. | | `quantization` | Storage-level codec (same vocabulary as the query-time arg above). | | `m`, `ef_construction` | HNSW build-time parameters. | | `payload_indexes` | Per-field equality / range / boolean indexes over the metadata sidecar for filtered ANN. Replaces Pinecone metadata filters. | In vector-primary mode the planner treats the vector index as the source of truth for IDs; metadata fetches only happen for hit IDs. Cross-engine queries, CRDT sync, and SQL semantics all keep working — `primary='vector'` is purely an *access-path* hint, not a different engine. Default `primary='document'` is unchanged: the existing `CREATE VECTOR INDEX ON ...` syntax continues to work for vector-as-side-index workloads. ## Hybrid Vector + Text (RRF) ```sql SELECT title, rrf_score( vector_distance(embedding, $query_vec), bm25_score(body, 'transformer attention') ) AS score FROM articles LIMIT 10; ``` Reciprocal Rank Fusion merges BM25 text results with vector similarity in a single pass. ## Multi-Vector Collections can have multiple vector indexes on different embedding columns: ```sql CREATE VECTOR INDEX idx_text ON products METRIC cosine DIM 384; CREATE VECTOR INDEX idx_image ON products METRIC cosine DIM 512; ``` --- # Full-Text Search Queries (docs/sql/fulltext-search) --- title: Full-Text Search Queries description: BM25 text matching, fuzzy search, highlighting, and synonyms in SQL. 
--- # Full-Text Search Queries ## Basic Search ```sql SELECT title, bm25_score(body, 'distributed database rust') AS score FROM articles WHERE text_match(body, 'distributed database rust') ORDER BY score DESC LIMIT 20; ``` ## Fuzzy Search ```sql SELECT title FROM articles WHERE text_match(title, 'databse', { fuzzy: true, distance: 2 }); ``` ## Synonyms ```sql CREATE SYNONYM GROUP db_terms AS ('database', 'db', 'datastore'); -- Searching for 'db performance' now also matches 'database performance' ``` ## CJK Search CJK text is automatically tokenized via character bigrams: ```sql SELECT title FROM articles WHERE text_match(body, '全文検索'); ``` ## Hybrid Search (BM25 + Vector) ```sql SELECT title, rrf_score( vector_distance(embedding, $query_vec), bm25_score(body, 'distributed systems') ) AS score FROM articles LIMIT 10; ``` --- # Graph Queries (docs/sql/graph-queries) --- title: Graph Queries description: GRAPH commands, MATCH pattern matching, and algorithm execution in SQL. --- # Graph Queries ## Edge Operations ```sql GRAPH INSERT EDGE IN 'edges' FROM 'alice' TO 'bob' TYPE 'knows' PROPERTIES { since: 2020 }; GRAPH DELETE EDGE IN 'edges' FROM 'alice' TO 'bob' TYPE 'knows'; ``` The `IN '<collection>'` clause is **required** — edges are overlays on a named document collection. Statements without `IN` fail to parse. ## Traversal ```sql GRAPH TRAVERSE FROM 'alice' DEPTH 3; GRAPH TRAVERSE FROM 'alice' DEPTH 2 LABEL 'follows' DIRECTION out; GRAPH NEIGHBORS OF 'bob' LABEL 'follows' DIRECTION both; GRAPH PATH FROM 'alice' TO 'charlie' MAX_DEPTH 5 LABEL 'knows'; ``` ## MATCH Pattern Queries Cypher-subset pattern matching: ```sql MATCH (a:Person)-[:knows]->(b:Person)-[:knows]->(c:Person) WHERE a.name = 'Alice' RETURN b.name, c.name; -- Variable-length paths MATCH (u:User)-[:follows*2..3]->(rec:User) WHERE u.id = 'you' RETURN DISTINCT rec.id LIMIT 10; -- Anti-join MATCH (a:User)-[:follows]->(b:User) WHERE NOT EXISTS { MATCH (b)-[:blocked_by]->(a) } RETURN a.id, b.id; -- OPTIONAL MATCH MATCH (a:Person)-[:knows]->(b:Person) OPTIONAL MATCH (b)-[:works_at]->(c:Company) RETURN a.name, b.name, c.name; ``` ## Algorithms ```sql GRAPH ALGO PAGERANK ON social DAMPING 0.85 ITERATIONS 20 TOLERANCE 1e-7; GRAPH ALGO WCC ON knowledge_graph; GRAPH ALGO SSSP ON routes FROM 'city:chicago'; GRAPH ALGO COMMUNITY ON products ITERATIONS 10 RESOLUTION 1.0; GRAPH ALGO BETWEENNESS ON network SAMPLE 500; GRAPH ALGO KCORE ON collaboration; GRAPH ALGO TRIANGLES ON social MODE global; GRAPH ALGO DIAMETER ON web; ``` Available: `pagerank`, `wcc`, `label_propagation`, `lcc`, `sssp`, `betweenness`, `closeness`, `harmonic`, `degree`, `louvain`, `triangles`, `diameter`, `kcore`. ## GraphRAG ```sql GRAPH RAG FUSION ON entities QUERY $embedding VECTOR_FIELD 'embedding' VECTOR_TOP_K 50 EXPANSION_DEPTH 2 EDGE_LABEL 'related_to' FINAL_TOP_K 10 RRF_K (60.0, 35.0); ``` --- # Spatial Queries (docs/sql/spatial-queries) --- title: Spatial Queries description: OGC spatial predicates, distance queries, geofencing, and spatial joins.
--- # Spatial Queries ## Proximity Search ```sql SELECT name, ST_Distance(location, ST_Point(-73.990, 40.750)) AS dist FROM restaurants WHERE ST_DWithin(location, ST_Point(-73.990, 40.750), 1000) ORDER BY dist; ``` ## Geofencing (Point-in-Polygon) ```sql SELECT name FROM restaurants WHERE ST_Within(location, ST_GeomFromGeoJSON('{ "type": "Polygon", "coordinates": [[[-74.0, 40.7], [-73.9, 40.7], [-73.9, 40.8], [-74.0, 40.8], [-74.0, 40.7]]] }')); ``` ## OGC Predicates | Function | Description | | ----------------- | ----------------------------------- | | `ST_Contains` | Geometry A contains geometry B | | `ST_Intersects` | Geometries share any space | | `ST_Within` | Geometry A is within geometry B | | `ST_DWithin` | Within a given distance | | `ST_Distance` | Distance between two geometries | | `ST_Intersection` | Intersection geometry | | `ST_Buffer` | Buffer around geometry | | `ST_Envelope` | Bounding box | | `ST_Union` | Union of geometries | ## Spatial Join ```sql SELECT r.name, z.zone_name FROM restaurants r, delivery_zones z WHERE ST_Contains(z.boundary, r.location); ``` ## H3 Hexagonal Indexing ```sql SELECT h3_to_string(h3_encode(40.748, -73.985, 9)) AS hex; ``` ## Hybrid Spatial-Vector ```sql SELECT name, vector_distance(embedding, $query_vec) AS similarity FROM restaurants WHERE ST_DWithin(location, ST_Point(-73.990, 40.750), 2000) ORDER BY embedding <-> $query_vec LIMIT 10; ``` --- # CRDT Operations (docs/sql/crdt-operations) --- title: CRDT Operations description: Read CRDT state and apply deltas via SQL. --- # CRDT Operations ## Read State ```sql SELECT crdt_state('collab_docs', 'doc123'); ``` Returns the current merged state of the CRDT document. ## Apply Delta ```sql SELECT crdt_apply('collab_docs', 'doc123', $delta); ``` Applies a Loro CRDT delta to the document. On Origin, the delta goes through Raft consensus. On Lite, it is applied locally and queued for sync. ## Conflict Policies ```sql -- View current policy SHOW CONFLICT POLICY ON notes; -- Set policy ALTER COLLECTION notes SET conflict_policy = 'lww'; ALTER COLLECTION profiles SET conflict_policy = 'field_merge'; ``` See [CRDT Sync](/docs/crdt-sync/overview) for full sync protocol details. --- # Cross-Engine Fusion (RRF) (docs/sql/fusion-rrf) --- title: Cross-Engine Fusion (RRF) description: Reciprocal Rank Fusion for combining results from multiple engines in one query. --- # Cross-Engine Fusion (RRF) Reciprocal Rank Fusion merges ranked results from different engines — vector similarity, BM25 text ranking, graph hop distance — into a single result set. No application-level merging needed. ## How RRF Works For each result, compute: `score = Σ 1 / (k + rank_i)` where `k` is a constant (default 60) and `rank_i` is the result's rank from each source. ## Vector + Full-Text ```sql SELECT title, rrf_score( vector_distance(embedding, $query_vec), bm25_score(body, 'distributed database') ) AS score FROM articles LIMIT 10; ``` ## GraphRAG (Vector + Graph) ```sql GRAPH RAG FUSION ON entities QUERY $embedding VECTOR_FIELD 'embedding' VECTOR_TOP_K 50 EXPANSION_DEPTH 2 EDGE_LABEL 'related_to' FINAL_TOP_K 10 RRF_K (60.0, 35.0); ``` Vector search finds seed nodes, graph BFS expands context, RRF merges both rankings. ## Cross-Model Queries All engines share the same snapshot. A query that combines vector similarity, graph traversal, spatial filtering, and document field access sees a consistent point-in-time view.
```sql -- Spatial filter → vector rank → document fields SELECT name, vector_distance(embedding, $vec) AS sim FROM restaurants WHERE ST_DWithin(location, ST_Point(-73.99, 40.75), 2000) ORDER BY embedding <-> $vec LIMIT 10; ``` --- # EXPLAIN (docs/sql/explain) --- title: EXPLAIN description: View query execution plans and introspect session state. --- # EXPLAIN ## Query Plan ```sql EXPLAIN SELECT * FROM users WHERE age > 30; ``` Shows the logical and physical query plan. Useful for understanding how the query planner routes to engines and which indexes are used. ## Session Variables ```sql SET nodedb.consistency = 'eventual'; SHOW nodedb.consistency; SHOW ALL; RESET nodedb.consistency; ``` ## Change Tracking ```sql SHOW CHANGES FOR users SINCE '2025-01-01' LIMIT 100; ``` ## Introspection Commands ```sql SHOW COLLECTIONS; DESCRIBE users; SHOW INDEXES; SHOW TRIGGERS; SHOW FUNCTIONS; SHOW PROCEDURES; SHOW CHANGE STREAMS; SHOW TOPICS; SHOW SCHEDULES; SHOW CONTINUOUS AGGREGATES; SHOW MATERIALIZED VIEWS; SHOW CONNECTIONS; SHOW USERS; SHOW CLUSTER; SHOW NODES; SHOW RAFT GROUPS; SHOW AUDIT LOG LIMIT 100; ``` ## Limitations `EXPLAIN ANALYZE` is not yet supported. It requires instrumentation across the SPSC bridge to collect per-core execution stats from the Data Plane. --- # LIVE SELECT (docs/real-time/live-select) --- title: LIVE SELECT description: Register a query and receive matching changes in real time. No polling. --- # LIVE SELECT Register a query and receive matching inserts, updates, and deletes as they happen. ```sql LIVE SELECT * FROM orders WHERE total > 100.00; LIVE SELECT id, status FROM orders WHERE status != 'pending'; ``` ## Protocol Support - **pgwire** — Delivered as `NotificationResponse` messages. Works in `psql` and JDBC. - **WebSocket** — JSON frames on `/v1/ws` - **NDB** — MessagePack frames on native connection ## Cancellation ```sql CANCEL LIVE SELECT <subscription_id>; ``` The subscription remains active until cancelled or the session ends. ## vs Change Streams LIVE SELECT is session-scoped and push-based. For durable, multi-consumer event streaming, use [Change Streams](/docs/real-time/change-streams). --- # Change Streams (CDC) (docs/real-time/change-streams) --- title: Change Streams (CDC) description: Durable, cursor-tracked mutation streams with consumer groups and external delivery. --- # Change Streams (CDC) Change streams provide durable access to the mutation log. Unlike LIVE SELECT, they survive reconnects, support consumer groups, and deliver to external systems. ```sql CREATE CHANGE STREAM order_events ON orders; -- With webhook delivery CREATE CHANGE STREAM order_events ON orders WITH (URL = 'https://hooks.example.com/orders'); -- With log compaction (keep only latest per key) CREATE CHANGE STREAM user_state ON users WITH (COMPACTION = 'key', KEY = 'id'); DROP CHANGE STREAM order_events; -- Drop atomically tears down the stream's consumer groups and persisted offset -- rows. Recreating a stream with the same name starts from the head; it does -- not resume at stale offsets. SHOW CHANGE STREAMS; ``` ## Pull-Based Consumption ```sql SHOW CHANGES FOR orders SINCE '2024-01-15T00:00:00Z' LIMIT 1000; ``` ## External Delivery **Webhook** — HTTP POST with retry, idempotency headers, and HMAC signing.
**Kafka bridge** — Transactional exactly-once publishing (feature-gated `--features kafka`): ```sql CREATE CHANGE STREAM order_events ON orders WITH (DELIVERY = 'kafka', BROKERS = 'localhost:9092', TOPIC = 'orders'); ``` **SSE** — `GET /v1/streams/{stream}/events?group={group}` with `Accept: text/event-stream`. **HTTP long-poll** — `GET /v1/streams/{stream}/poll?group={group}&limit=100`. ## Streaming Materialized Views ```sql CREATE MATERIALIZED VIEW order_stats STREAMING AS SELECT time_bucket('5m', event_time) AS bucket, count(*), sum(total) FROM order_changes WHERE event_type = 'INSERT' GROUP BY bucket; ``` O(1) per event. Supports COUNT, SUM, MIN, MAX, AVG. Watermark-driven finalization. --- # Consumer Groups (docs/real-time/consumer-groups) --- title: Consumer Groups description: Independent offset tracking for multiple consumers on the same stream or topic. --- # Consumer Groups Consumer groups track read positions independently, enabling multiple consumers to process the same stream at their own pace. ```sql -- Create consumer groups CREATE CONSUMER GROUP analytics ON order_changes; CREATE CONSUMER GROUP billing ON order_changes; -- Commit offset for a specific partition COMMIT OFFSET PARTITION 0 AT 42 ON order_changes CONSUMER GROUP analytics; -- Batch commit all partitions COMMIT OFFSETS ON order_changes CONSUMER GROUP analytics; DROP CONSUMER GROUP analytics ON order_changes; ``` ## On Topics Consumer groups also work with durable topics: ```sql CREATE CONSUMER GROUP processors ON order_events; SELECT * FROM TOPIC order_events CONSUMER GROUP processors LIMIT 100; COMMIT OFFSETS ON order_events CONSUMER GROUP processors; ``` Consumers resume from their last committed offset after disconnect. --- # Webhooks (docs/real-time/webhooks) --- title: Webhooks description: HTTP POST delivery for change stream events with retry and HMAC signing. --- # Webhooks Change streams can deliver events to external HTTP endpoints: ```sql CREATE CHANGE STREAM order_events ON orders WITH ( DELIVERY = 'webhook', URL = 'https://hooks.example.com/orders', RETRY = 3, TIMEOUT = '5s' ); ``` ## Headers Each POST includes: | Header | Description | | --------------------- | ------------------------------ | | `X-Idempotency-Key` | Unique per delivery attempt | | `X-Event-Sequence` | Monotonic sequence number | | `X-Partition` | Source partition | | `X-LSN` | WAL log sequence number | ## Retry Behavior Exponential backoff. 4xx client errors (except 429) are not retried. 5xx and timeouts retry up to the configured limit. Persistent failures go to the dead-letter queue. --- # Cron Scheduler (docs/real-time/cron-scheduler) --- title: Cron Scheduler description: Distributed cron scheduler for periodic SQL jobs with leader-aware execution. --- # Cron Scheduler The Event Plane includes a distributed cron scheduler. Jobs are evaluated every second and dispatched through the Control Plane → Data Plane path. ```sql CREATE SCHEDULE nightly_cleanup CRON '0 2 * * *' AS BEGIN DELETE FROM sessions WHERE expires_at < now(); INSERT INTO maintenance_log { task: 'nightly_cleanup', ran_at: now() }; END; CREATE SCHEDULE refresh_stats CRON '*/5 * * * *' AS BEGIN REFRESH CONTINUOUS AGGREGATE order_stats; END; DROP SCHEDULE nightly_cleanup; SHOW SCHEDULES; ``` ## Cron Syntax Standard 5-field cron: `minute hour day-of-month month day-of-week`.
| Expression | Meaning | | --------------- | --------------------------- | | `0 2 * * *` | 2:00 AM UTC daily | | `*/5 * * * *` | Every 5 minutes | | `0 0 * * 0` | Midnight Sunday | | `0 */6 * * *` | Every 6 hours | ## Leader-Aware In clustered mode, scheduled jobs run on the collection's shard leader. If a leader changes due to failover, the new leader picks up the schedule. Job history and missed execution policies are tracked in redb. ## LISTEN/NOTIFY PostgreSQL-compatible ephemeral notifications (session-scoped, cluster-wide): ```sql LISTEN order_events; NOTIFY order_events, 'order 123 shipped'; ``` For durable delivery, use [Change Streams](/docs/real-time/change-streams) or [Durable Topics](/docs/sql/ddl). --- # pgwire (PostgreSQL Protocol) (docs/connectivity/pgwire) --- title: pgwire (PostgreSQL Protocol) description: Standard PostgreSQL wire protocol. Any Postgres-compatible tool works with NodeDB. --- # pgwire NodeDB speaks the PostgreSQL wire protocol on port 6432. Any tool that speaks Postgres works. ```bash psql -h localhost -p 6432 ``` ## Supported Features - Simple Query protocol - Extended Query protocol (prepared statements: Parse, Bind, Describe, Execute) - `COPY FROM` (bulk import) - `LISTEN/NOTIFY` (cluster-wide) - SCRAM-SHA-256 authentication - TLS - Session variables - Server-side cursors (SCROLL, BACKWARD, MOVE, WITH HOLD) - Temporary tables (session-scoped) ## Compatible Clients Any PostgreSQL client library: libpq, JDBC, psycopg2, node-postgres, SQLAlchemy, Prisma, Diesel, tokio-postgres, and more. ## GUI Tools DBeaver and pgAdmin connect directly. Configure as a PostgreSQL connection to `localhost:6432`. --- # HTTP API (docs/connectivity/http-api) --- title: HTTP API description: REST API for SQL execution, streaming, health checks, and Prometheus metrics. --- # HTTP API REST API on port 6480 for web clients and services. ## Endpoints ### Execute SQL ```bash curl -X POST http://localhost:6480/v1/query \ -H "Authorization: Bearer ndb_..." \ -H "Content-Type: application/json" \ -H "Accept: application/vnd.nodedb.v1+json" \ -d '{"sql": "SELECT * FROM users LIMIT 10"}' ``` ### Stream Results (NDJSON) ```bash curl -X POST http://localhost:6480/v1/query/stream \ -d '{"sql": "SELECT * FROM large_table"}' ``` ### Health & Readiness ```bash curl http://localhost:6480/healthz # k8s readiness — 503 until startup completes curl http://localhost:6480/health/live # liveness probe curl http://localhost:6480/health/ready # WAL recovered, ready for queries ``` ### Prometheus Metrics ```bash curl http://localhost:6480/metrics ``` 70+ system metrics: per-engine, per-core, connection, query, replication, storage. Latency histogram with 13 buckets. ### PromQL ``` http://localhost:6480/v1/obsv/api ``` Full Prometheus query engine. Point Grafana at this URL as a Prometheus data source. ### CDC Streams ``` GET /v1/streams/{stream}/events?group={group} # SSE GET /v1/streams/{stream}/poll?group={group} # Long-poll ``` ### WebSocket `/v1/ws` endpoint for JSON-RPC: SQL execution, LIVE SELECT delivery, session reconnect. ## Versioning All non-probe routes are under the `/v1/` prefix. JSON responses carry `Content-Type: application/vnd.nodedb.v1+json; charset=utf-8`. Clients may opt into version negotiation by sending `Accept: application/vnd.nodedb.v1+json`; sending only an unsupported `application/vnd.nodedb.vN+json` returns `406 Not Acceptable`. Probe routes (`/healthz`, `/health/*`, `/metrics`) are unversioned and always reachable, including during startup. 
--- # Native Protocol (NDB) (docs/connectivity/native-protocol) --- title: Native Protocol (NDB) description: Binary MessagePack protocol for the ndb CLI, Rust SDK, and FFI/WASM bindings. --- # Native Protocol (NDB) Binary MessagePack protocol on port 6433. Used by the `ndb` CLI, Rust SDK (`nodedb-client`), and FFI/WASM bindings. ## Two Modes **SQL** — SQL text transported as a MessagePack message. Same parser and planner as pgwire. **Native opcodes** — Typed messages that skip SQL parsing. Used by SDKs for hot-path operations: ```rust // Native mode — typed, skip SQL parsing let user = client.get("users", "u1").await?; client.put("users", "u1", &doc).await?; client.vector_search("articles", &query_vec, 10, None).await?; // SQL mode — flexible, any query let rows = client.sql("SELECT * FROM users WHERE age > 30").await?; ``` Both modes produce the same `PhysicalPlan` and execute identically. ## Connection ```bash # ndb CLI ./target/release/ndb ./target/release/ndb --host localhost --port 6433 ``` ## Handshake Every native connection performs a versioned handshake before any opcode frame. SDKs do this automatically on first use; you only need this section if you're implementing a wire-level client. ### `HelloFrame` (client → server) | Field | Width | Value | | -------------- | ----- | --------------------------------------- | | `magic` | 4 B | `NDBH` | | `proto_min` | u16 | Minimum protocol version client accepts | | `proto_max` | u16 | Maximum protocol version client supports | | `capabilities` | u64 | Bitmask of optional features (send `0` for none; unknown bits ignored by server) | ### `HelloAckFrame` (server → client) | Field | Width | Value | | ---------------- | -------------------- | ----------------------------------------------------- | | `magic` | 4 B | `NDBA` | | `proto_version` | u16 | Negotiated version (`max(proto_min_client, proto_min_server) ≤ v ≤ min(proto_max_client, proto_max_server)`) | | `capabilities` | u64 | Server-side capability bitmask | | `server_version` | length-prefixed UTF-8 | Build identifier (e.g. `"nodedb 0.1.0+abc123"`) | | `limits` | `Limits` struct | Per-op caps the server enforces (see below) | ### `HelloErrorFrame` Returned when no protocol version overlaps. Carries a typed `code` (`VersionMismatch`) and a UTF-8 reason. The connection is closed after the frame is sent. ### Server-enforced `Limits` The `HelloAckFrame` carries the server's per-op caps. SDKs surface these via `client.limits()`. Sending a request that exceeds a cap returns a typed `LimitExceeded { limit_name, value, max }` error. | Field | Type | Caps | | ---------------------- | ------------- | ----------------------------------------------- | | `max_vector_dim` | `Option<u32>` | Vector embedding dimensionality | | `max_top_k` | `Option<u32>` | `top_k` for any retrieval op | | `max_scan_limit` | `Option<u32>` | Result set size for scans | | `max_batch_size` | `Option<u32>` | Rows per batch INSERT/UPSERT | | `max_crdt_delta_bytes` | `Option<u64>` | Single CRDT delta payload | | `max_query_text_bytes` | `Option<u64>` | SQL text length | | `max_graph_depth` | `Option<u32>` | `MAX_DEPTH` for graph traversal | `None` means uncapped. Defaults are uncapped — operators set caps via configuration. ## Capabilities `Capabilities` is a typed wrapper around the `u64` bitmask returned in `HelloAckFrame`. SDK consumers query specific features via accessor methods rather than testing raw bits, so feature additions never break clients: ```rust let caps = client.capabilities(); if caps.has_graphrag_fusion() { ...
} if caps.has_continuous_aggregates() { ... } ``` The `Capabilities::has(bit)` escape hatch is available for forward compatibility, but typed accessors are preferred. ## Server Identity ```rust client.proto_version() // u16 — negotiated protocol version client.server_version() // String — server build identifier client.limits() // &Limits — per-op caps client.capabilities() // Capabilities — typed feature flags ``` --- # RESP (Redis Protocol) (docs/connectivity/resp) --- title: RESP (Redis Protocol) description: Redis-compatible wire protocol for KV operations. Disabled by default. --- # RESP (Redis Protocol) NodeDB speaks RESP2 for KV operations. Existing Redis clients work out of the box. ## Enable ```toml # nodedb.toml [server.ports] resp = 6381 ``` Or: `NODEDB_PORT_RESP=6381` ## Usage ```bash redis-cli -p 6381 SELECT sessions # switch to a KV collection SET sess_abc '{"user":"alice"}' EX 3600 GET sess_abc DEL sess_abc INCR page_views ZADD leaderboard 1500 player-123 ZRANK leaderboard player-123 SUBSCRIBE sessions ``` ## Supported Commands `GET`, `SET` (EX/PX/NX/XX), `DEL`, `EXISTS`, `MGET`, `MSET`, `EXPIRE`, `PEXPIRE`, `TTL`, `PTTL`, `PERSIST`, `SCAN`, `KEYS`, `HGET`, `HMGET`, `HSET`, `FLUSHDB`, `DBSIZE`, `SUBSCRIBE`, `PUBLISH`, `PING`, `ECHO`, `SELECT`, `INFO`, `QUIT`, `INCR`, `DECR`, `INCRBY`, `DECRBY`, `INCRBYFLOAT`, `GETSET`, `ZADD`, `ZREM`, `ZRANK`, `ZRANGE`, `ZCARD`, `ZSCORE`. ## Same Data, Two Interfaces RESP accesses the same KV data as SQL. A key set via `redis-cli` is queryable via `SELECT * FROM sessions WHERE key = 'sess_abc'`. --- # ILP (InfluxDB Line Protocol) (docs/connectivity/ilp) --- title: ILP (InfluxDB Line Protocol) description: High-throughput timeseries ingest via InfluxDB Line Protocol. Disabled by default. --- # ILP (InfluxDB Line Protocol) NodeDB accepts metrics via InfluxDB Line Protocol over TCP for high-throughput timeseries ingest. ## Enable ```toml [server.ports] ilp = 8086 ``` Or: `NODEDB_PORT_ILP=8086` ## Usage ```bash echo "cpu,host=web-01,region=us-east usage=72.5,mem=84.2 1609459200000000000" | nc localhost 8086 ``` ## Telegraf Integration ```toml # telegraf.conf [[outputs.socket_writer]] address = "tcp://localhost:8086" data_format = "influx" ``` Any ILP-compatible client (Telegraf, Vector, InfluxDB client libraries) works. ## Adaptive Batching NodeDB auto-tunes batch sizes based on ingest rate. Per-series core routing eliminates cross-core contention. No configuration needed. --- # Sync Protocol (WebSocket) (docs/connectivity/sync-protocol) --- title: Sync Protocol (WebSocket) description: WebSocket-based CRDT sync for NodeDB-Lite clients on phones, browsers, and desktops. --- # Sync Protocol WebSocket-based sync on port 9090 for NodeDB-Lite clients. CRDT deltas flow between edge devices and Origin. ## How It Works 1. Lite client connects via WebSocket to Origin 2. Client sends its current LSN watermark 3. Origin sends any missed deltas since that watermark 4. Client sends locally accumulated CRDT deltas 5. Origin validates constraints (UNIQUE, FK, CHECK) via Raft 6. Committed deltas are broadcast to all connected clients with matching shape subscriptions ## Shape Subscriptions Devices subscribe to a subset of data: ```sql -- Client only receives data matching this filter SUBSCRIBE SHAPE ON users WHERE user_id = $me; ``` Changes within a device's shape are pushed in real time. Changes outside the shape are not sent. 
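A device can hold several shape subscriptions at once, mixing per-user data with shared reference data. A sketch for a field-service app (the collection and column names here are illustrative):

```sql
-- This technician's own work orders
SUBSCRIBE SHAPE ON work_orders WHERE assigned_to = $me;

-- Shared reference data for the region they cover
SUBSCRIBE SHAPE ON tax_rates WHERE jurisdiction = 'US-WA';
```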
## Compensation Hints If a local write violates a constraint on Origin, a typed `CompensationHint` is sent back: - `Rename` — duplicate key detected - `Merge` — concurrent edits to the same document - `CreateParent` — FK target doesn't exist - `Adjust` — CHECK constraint violation The application handles the conflict — no silent data loss. --- # Rust Client (docs/connectivity/rust-client) --- title: Rust Client description: The nodedb-client crate connects over the NDB protocol with both SQL and native modes. --- # Rust Client The `nodedb-client` crate connects over the NDB protocol (port 6433). It supports both SQL and native modes on the same connection. ## Installation ```toml [dependencies] nodedb-client = "0.1" ``` ## Usage ```rust use nodedb_client::NodeDbRemote; let client = NodeDbRemote::connect("localhost:6433").await?; // SQL mode let rows = client.sql("SELECT * FROM users WHERE age > 30").await?; // Native mode (skip SQL parsing) let user = client.get("users", "u1").await?; client.put("users", "u1", &doc).await?; client.vector_search("articles", &query_vec, 10, None).await?; ``` ## The NodeDb Trait Both `NodeDbRemote` (network client) and `NodeDbLite` (embedded) implement the `NodeDb` trait. Application code can be generic over the trait — same logic works in-process or over the network. ```rust async fn search<D: NodeDb>(db: &D, query: &[f32]) -> Result<Vec<Row>> { db.vector_search("articles", query, 10, None).await } ``` --- # Python Driver (docs/connectivity/python-driver) --- title: Python Driver description: Connect to NodeDB from Python using any PostgreSQL driver. --- # Python Driver NodeDB speaks PostgreSQL's wire protocol, so any Python PostgreSQL driver works. ## psycopg2 ```python import psycopg2 conn = psycopg2.connect(host="localhost", port=6432, dbname="nodedb") cur = conn.cursor() cur.execute("SELECT * FROM users WHERE age > %s", (30,)) rows = cur.fetchall() conn.close() ``` ## asyncpg ```python import asyncpg conn = await asyncpg.connect("postgresql://localhost:6432/nodedb") rows = await conn.fetch("SELECT * FROM users WHERE age > $1", 30) await conn.close() ``` ## SQLAlchemy ```python from sqlalchemy import create_engine, text engine = create_engine("postgresql://localhost:6432/nodedb") with engine.connect() as conn: result = conn.execute(text("SELECT * FROM users LIMIT 10")) ``` ## ORMs Django, SQLAlchemy ORM, Prisma, and other PostgreSQL-compatible ORMs work with NodeDB's pgwire protocol. --- # Ecosystem (psql, ORMs, GUI Tools) (docs/connectivity/ecosystem) --- title: Ecosystem (psql, ORMs, GUI Tools) description: Compatible tools, ORMs, connection poolers, and GUI clients. --- # Ecosystem NodeDB's pgwire compatibility means most PostgreSQL tools work out of the box. ## psql ```bash psql -h localhost -p 6432 ``` Full support for `\d`, `\dt`, prepared statements, COPY, and LISTEN/NOTIFY. ## GUI Tools **DBeaver** — Connect as PostgreSQL to `localhost:6432`. Schema browser, query editor, and result export work. **pgAdmin** — Connect as PostgreSQL. Dashboard and query tool work. Some admin features may not apply. ## ORMs | Framework | Connection string | | ------------ | ---------------------------------------------- | | SQLAlchemy | `postgresql://localhost:6432/nodedb` | | Django | `HOST=localhost PORT=6432 ENGINE=django.db.backends.postgresql` | | Prisma | `postgresql://localhost:6432/nodedb` | | Diesel | `postgres://localhost:6432/nodedb` | | GORM (Go) | `host=localhost port=6432 dbname=nodedb` | ## Connection Poolers PgBouncer and pgcat work with NodeDB.
Configure them to connect to `localhost:6432`. Transaction-level pooling gives the best connection reuse, but it does not preserve session state between transactions; if you rely on `SET nodedb.*` session variables, use session-level pooling. --- # Configuration (docs/administration/configuration) --- title: Configuration description: Server configuration via nodedb.toml and environment variables. --- # Configuration NodeDB is configured via `nodedb.toml`. Environment variables take precedence. ```toml [server] host = "127.0.0.1" data_dir = "/var/lib/nodedb" memory_limit = "4GiB" data_plane_cores = 4 max_connections = 1024 log_format = "text" [server.ports] native = 6433 pgwire = 6432 http = 6480 resp = 6381 # optional ilp = 8086 # optional [server.tls] cert_path = "/etc/nodedb/tls/server.crt" key_path = "/etc/nodedb/tls/server.key" ``` ## Environment Variables | Config field | Env var | Default | | ------------------ | ------------------------- | ------------ | | `host` | `NODEDB_HOST` | `127.0.0.1` | | `ports.pgwire` | `NODEDB_PORT_PGWIRE` | `6432` | | `ports.native` | `NODEDB_PORT_NATIVE` | `6433` | | `ports.http` | `NODEDB_PORT_HTTP` | `6480` | | `ports.resp` | `NODEDB_PORT_RESP` | disabled | | `ports.ilp` | `NODEDB_PORT_ILP` | disabled | | `data_dir` | `NODEDB_DATA_DIR` | `~/.local/share/nodedb` | | `memory_limit` | `NODEDB_MEMORY_LIMIT` | `1GiB` | | `data_plane_cores` | `NODEDB_DATA_PLANE_CORES` | CPUs - 1 | | `max_connections` | `NODEDB_MAX_CONNECTIONS` | `4096` | | `log_format` | `NODEDB_LOG_FORMAT` | `text` | ## Audit Configuration ```toml [audit] level = "standard" # minimal, standard, full, forensic ``` ## Auth Configuration ```toml [auth.jwt] providers = [ { issuer = "https://your-domain.auth0.com/", audience = "your-api" }, ] ``` ## TLS per Protocol ```toml [server.tls] native = true pgwire = true http = true resp = true ilp = false ``` --- # Authentication (docs/administration/authentication) --- title: Authentication description: Password auth (SCRAM-SHA-256), API keys, JWKS (JWT), and mTLS. --- # Authentication NodeDB supports multiple authentication methods simultaneously. ## Password Auth (SCRAM-SHA-256) ```sql CREATE USER alice WITH PASSWORD 'strong_password'; CREATE USER bob WITH PASSWORD 'secret' ROLE readonly; ``` ```bash psql -h localhost -p 6432 -U alice ``` ## API Keys ```sql CREATE API KEY 'my-service' ROLE readwrite; DROP API KEY 'my-service'; ``` ```bash curl -H "Authorization: Bearer <api-key>" http://localhost:6480/v1/query ``` ## JWKS (JWT) Multi-provider support (Auth0, Clerk, Supabase, Firebase, Keycloak, Cognito): ```toml [auth.jwt] providers = [{ issuer = "https://your-domain.auth0.com/", audience = "your-api" }] ``` JWT claims map to `$auth.*` session variables for RLS: | Claim | Variable | Usage | | --------- | ---------------- | ------------------------------ | | `sub` | `$auth.id` | `WHERE user_id = $auth.id` | | `role` | `$auth.role` | `WHERE $auth.role = 'admin'` | | `org_id` | `$auth.org_id` | `WHERE org_id = $auth.org_id` | | `scope` | `$auth.scopes` | Scope-based access control | Supported algorithms: RS256, ES256. ## mTLS ```toml [tls] cert = "/path/to/server.crt" key = "/path/to/server.key" client_ca = "/path/to/ca.crt" # enables mTLS ``` ## Auth Priority 1. mTLS → 2. JWT Bearer → 3. API key → 4. SCRAM-SHA-256 --- # Roles & Permissions (RBAC) (docs/administration/rbac) --- title: Roles & Permissions (RBAC) description: Role-based access control with GRANT, REVOKE, and built-in role hierarchy.
--- # Roles & Permissions (RBAC) ## Built-in Roles | Role | Permissions | | -------------- | ------------------------------ | | `readonly` | SELECT on all collections | | `readwrite` | SELECT, INSERT, UPDATE, DELETE | | `admin` | All operations + DDL | | `tenant_admin` | Admin within a tenant | | `superuser` | Unrestricted (cross-tenant) | Higher roles inherit all permissions of lower roles. ## Custom Roles ```sql CREATE ROLE analyst; CREATE ROLE data_engineer; ``` ## Granting Permissions ```sql GRANT SELECT ON orders TO analyst; GRANT INSERT, UPDATE ON orders TO data_engineer; GRANT ALL ON orders TO admin; GRANT EXECUTE ON FUNCTION full_name TO analyst; GRANT BACKUP ON TENANT acme TO ops_user; ``` ## Revoking ```sql REVOKE INSERT ON orders FROM analyst; ``` ## SECURITY DEFINER Functions and triggers can execute with owner's permissions: ```sql CREATE FUNCTION admin_count() RETURNS INT SECURITY DEFINER AS BEGIN RETURN (SELECT COUNT(*) FROM audit_log); END; ``` ## Introspection ```sql SHOW GRANTS FOR analyst; SHOW PERMISSIONS; ``` --- # Row-Level Security (docs/administration/rls) --- title: Row-Level Security description: Per-row filtering based on authenticated user context. Works across all eight engines. --- # Row-Level Security (RLS) RLS policies filter rows transparently on every query. No application code changes needed. ```sql CREATE RLS POLICY user_orders ON orders FOR READ USING (customer_id = $auth.id); CREATE RLS POLICY user_write ON orders FOR WRITE USING (customer_id = $auth.id); CREATE RLS POLICY admin_bypass ON orders FOR READ USING ($auth.role = 'admin' OR customer_id = $auth.id); CREATE RLS POLICY org_access ON projects FOR ALL USING (org_id = $auth.org_id); ``` ## Policy Types | Type | Applies to | | ------- | ---------------------- | | `READ` | SELECT | | `WRITE` | INSERT, UPDATE, DELETE | | `ALL` | Both | ## Permissive vs Restrictive Multiple policies: permissive (default, OR-combined) or restrictive (AND-combined): ```sql CREATE RLS POLICY org_filter ON docs FOR READ USING (org_id = $auth.org_id) RESTRICTIVE; CREATE RLS POLICY not_deleted ON docs FOR READ USING (status != 'deleted') RESTRICTIVE; ``` ## Cross-Engine Behavior RLS filters are injected at plan time, before engine dispatch: - **Vector search** — unauthorized vectors excluded from results - **Graph traversal** — edges to invisible nodes are skipped - **FTS** — only matching documents passing RLS are returned - **KV** — blocked rows return empty ## Managing ```sql SHOW RLS POLICIES; DROP RLS POLICY user_orders ON orders; ``` --- # TLS & mTLS (docs/administration/tls) --- title: TLS & mTLS description: Transport encryption for all protocols with optional mutual TLS. --- # TLS & mTLS ## Server TLS ```toml [server.tls] cert_path = "/etc/nodedb/tls/server.crt" key_path = "/etc/nodedb/tls/server.key" ``` Per-protocol TLS toggle: ```toml [server.tls] native = true pgwire = true http = true resp = true ilp = false # disable TLS for high-throughput ILP ingest ``` ## mTLS (Mutual TLS) ```toml [tls] cert = "/path/to/server.crt" key = "/path/to/server.key" client_ca = "/path/to/ca.crt" crl = "/path/to/revocation.crl" ``` When `client_ca` is set, clients must present a valid certificate signed by the CA. 
## Encryption at Rest - **Data volumes** — AES-256-XTS - **WAL segments** — AES-256-GCM with per-file data encryption keys - **Lite devices** — AES-256-GCM + Argon2id key derivation --- # Audit Logging (docs/administration/audit-logging) --- title: Audit Logging description: Hash-chained audit trail with configurable levels and SIEM export. --- # Audit Logging Tamper-evident, hash-chained audit log. Every entry includes a SHA-256 hash of the previous entry — if any record is modified, the chain breaks. ```sql SHOW AUDIT LOG; SHOW AUDIT LOG LIMIT 50; ``` ## Audit Levels ```toml [audit] level = "standard" ``` | Level | Records | | ---------- | ---------------------------------------------- | | `minimal` | Auth events (login, failure, denial) | | `standard` | + admin actions, DDL, sessions, config changes | | `full` | + every query, RLS denials | | `forensic` | + row-level mutations, CRDT deltas | ## Key Events AuthSuccess, AuthFailure, AuthzDenied, PrivilegeChange, SessionConnect/Disconnect, AdminAction, TenantCreated/Deleted, SnapshotBegin/End, RestoreBegin/End, CertRotation, KeyRotation, NodeJoined/Left, QueryExec, RlsDenied, RowChange. ## SIEM Export ```sql CREATE CHANGE STREAM audit_export ON _system.audit WITH (DELIVERY = 'webhook', URL = 'https://siem.example.com/ingest', FORMAT = 'json', HMAC_SECRET = 'your-secret'); ``` --- # Multi-Tenancy (docs/administration/multi-tenancy) --- title: Multi-Tenancy description: Tenant isolation, quotas, backup, and GDPR purge. --- # Multi-Tenancy Each tenant has fully isolated storage, indexes, and security policies. Cross-tenant data access is impossible by design. ## Creating Tenants ```sql CREATE TENANT acme; CREATE USER alice WITH PASSWORD 'secret' ROLE readwrite TENANT acme; ``` ## Quotas ```sql ALTER TENANT acme SET QUOTA max_qps = 5000; ALTER TENANT acme SET QUOTA max_storage_bytes = 53687091200; ALTER TENANT acme SET QUOTA max_connections = 50; SHOW TENANT USAGE FOR acme; SHOW TENANT QUOTA FOR acme; EXPORT USAGE FOR TENANT acme PERIOD '2026-03' FORMAT 'json'; ``` ## Backup & Restore ```sql BACKUP TENANT acme TO '/backups/acme.bak'; RESTORE TENANT acme FROM '/backups/acme.bak' DRY RUN; RESTORE TENANT acme FROM '/backups/acme.bak'; ``` Backups cover every engine. Encrypted with AES-256-GCM using the tenant WAL key. ## GDPR Purge ```sql DROP TENANT acme; -- catalog metadata only PURGE TENANT acme CONFIRM; -- ALL data across all engines (permanent) ``` ## Isolation Model | Layer | Isolation | | ------- | --------------------------------------------- | | Storage | Separate key prefixes per tenant | | Indexes | Tenant-scoped, no cross-tenant overlap | | WAL | Per-tenant segments with per-tenant encryption | | Queries | Tenant ID injected at plan time | | Audit | Per-tenant audit entries | --- # Monitoring (docs/administration/monitoring) --- title: Monitoring description: Prometheus metrics, OpenTelemetry, health checks, and Grafana integration. --- # Monitoring ## Prometheus Metrics ```bash curl http://localhost:6480/metrics ``` 70+ system metrics: per-engine, per-core, connection, query, replication, and storage. Latency histogram with 13 buckets. ## PromQL Engine Full Prometheus query engine at `/v1/obsv/api`. Point Grafana at this URL as a Prometheus data source. Supports all Tier 1+2+3 functions (rate, irate, delta, histogram_quantile, holt_winters, etc.).
## Prometheus Remote Write/Read Use NodeDB as a long-term Prometheus storage backend: ```yaml remote_write: - url: "http://nodedb:6480/v1/obsv/api/v1/write" remote_read: - url: "http://nodedb:6480/v1/obsv/api/v1/read" ``` ## OpenTelemetry - **OTLP ingest** — Metrics, traces, and logs via HTTP (4318) and gRPC (4317) - **OTLP export** — Push NodeDB's own traces/metrics to any OTLP collector Feature-gated: `--features otel`, `--features promql`, `--features monitoring`. ## Health Checks ```bash curl http://localhost:6480/healthz # k8s readiness — 503 until startup completes curl http://localhost:6480/health/live # liveness probe curl http://localhost:6480/health/ready # WAL recovered, ready for queries ``` ## Key Metrics | Metric Category | Examples | | --------------- | ---------------------------------------------- | | Engine | Per-engine memory, query count, latency | | Core | Per-core CPU, queue depth, io_uring completions | | Connection | Active connections, auth failures | | Replication | Raft log lag, replication latency | | Storage | WAL fsync latency, segment count, compaction debt | --- # Backup & Recovery (docs/administration/backup-recovery) --- title: Backup & Recovery description: Per-tenant backup, point-in-time restore, and WAL replay. --- # Backup & Recovery ## Backup ```sql BACKUP TENANT acme TO '/backups/acme-2026-04.bak'; ``` Backups cover every engine: documents, indexes, vectors, graph edges, KV tables, timeseries, and CRDT state. Encrypted with AES-256-GCM using the tenant WAL key. The backup envelope embeds catalog rows and the source tombstone set alongside segment data, so a restored snapshot reconstructs the catalog deterministically and refuses to resurrect collections tombstoned before the backup was taken. Each `StoredCollection` row carries a `size_bytes_estimate` field surfaced through `_system.dropped_collections` for sizing the L2 cleanup queue before `PURGE`. ## Validate ```sql RESTORE TENANT acme FROM '/backups/acme-2026-04.bak' DRY RUN; ``` Validates the backup without restoring. ## Restore ```sql RESTORE TENANT acme FROM '/backups/acme-2026-04.bak'; ``` ## Crash Recovery On startup, NodeDB: 1. Scans WAL segments from the last checkpoint 2. Validates CRC32C checksums 3. Replays valid records to reconstruct state 4. Discards partially written records The Event Plane resumes from its LSN watermark — no events are lost. ## WAL Archiving WAL segments can be archived for point-in-time recovery. Old segments are eligible for cleanup once all records have been flushed to L1 segments. --- # Cluster Operations (docs/administration/cluster-operations) --- title: Cluster Operations description: Cluster setup, node management, shard rebalancing, and rolling upgrades. --- # Cluster Operations ## Introspection ```sql SHOW CLUSTER; SHOW NODES; SHOW RAFT GROUPS; ``` ## Adding Nodes New nodes join the cluster and receive vShard assignments during rebalancing. ## Shard Rebalancing When nodes join or leave, vShards redistribute automatically: 1. Coordinator selects vShards based on load metrics 2. Target node replicates the vShard's data 3. Ownership transfers atomically 4. Old node drops the vShard data Transparent to clients — requests forward during migration. ## Rolling Upgrades 1. Drain a node (move its vShard leadership to other nodes) 2. Upgrade the binary 3. Restart the node 4. Repeat for each node Raft leader election handles temporary unavailability.
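Before restarting a drained node, it's worth confirming it no longer leads any Raft group. Both introspection commands below are documented; the exact shape of their result sets is not, so treat the comments as assumptions:

```sql
SHOW RAFT GROUPS;  -- the drained node should appear only as a follower
SHOW NODES;        -- confirm membership and health before and after the restart
```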
## Failure Recovery - **Single node failure** — Raft elects new leaders for affected vShards - **Minority failure** — Cluster remains available with quorum - **Majority failure** — Cluster becomes read-only until quorum is restored --- # Memory Management (docs/administration/memory-management) --- title: Memory Management description: Per-engine memory budgets, jemalloc arenas, and memory governance. --- # Memory Management ## Memory Limit ```toml [server] memory_limit = "4GiB" ``` Or: `NODEDB_MEMORY_LIMIT=4GiB` ## Per-Engine Budgets Memory is divided among engines. No single engine can starve others. Default allocation: | Budget | Default | | -------------------- | ------- | | Vector | 30% | | Sparse / Metadata | 15% | | CRDT | 10% | | Timeseries | 10% | | Query Execution | 20% | | Unallocated headroom | 15% | ## Per-Core Arenas Each Data Plane core is pinned to a dedicated jemalloc arena via `nodedb-mem`. This eliminates allocator lock contention in the TPC architecture. ## Backpressure When memory pressure rises: - Engine memtables flush to disk earlier - SPSC bridge reduces read depth at 85% utilization - New reads suspend at 95% utilization --- # Performance Tuning (docs/administration/performance-tuning) --- title: Performance Tuning description: Data Plane cores, io_uring tuning, engine-specific parameters, and query optimization. --- # Performance Tuning ## Data Plane Cores ```toml [server] data_plane_cores = 4 # default: CPUs - 1 ``` More cores = more parallel shards. Each core is an isolated, shared-nothing shard. ## io_uring The Data Plane uses io_uring for NVMe I/O. Ensure: - Linux kernel 5.1+ (5.11+ recommended for full io_uring features) - NVMe storage (io_uring benefits are minimal on spinning disks) - `ulimit -l unlimited` for locked memory ## Vector Engine | Parameter | Default | Effect | | ----------------- | ------- | --------------------- | | M | 16 | Higher = more recall | | EF_CONSTRUCTION | 200 | Higher = better graph quality | | Quantization | none | SQ8/PQ for large datasets | ## FTS Engine | Parameter | Default | Effect | | ----------------- | ------- | --------------------- | | Memtable threshold | 32M postings | Higher = fewer flushes | | Compaction levels | 8x8 | More levels = less write amplification | ## Query Optimization ```sql EXPLAIN SELECT * FROM users WHERE age > 30; ``` Check that: - Vector searches use the HNSW index (not brute force) - Spatial queries use the R*-tree (check for `SpatialOp`) - Columnar queries show predicate pushdown and block skip - FTS queries use the inverted index (not full scan) --- # Production Checklist (docs/administration/production-checklist) --- title: Production Checklist description: Pre-production checklist for deploying NodeDB in production.
--- # Production Checklist ## Infrastructure - [ ] Linux kernel 5.11+ (full io_uring support) - [ ] NVMe storage for Data Plane I/O - [ ] Sufficient locked memory (`ulimit -l unlimited`) - [ ] Memory limit set appropriately (`NODEDB_MEMORY_LIMIT`) - [ ] Data Plane cores configured (`NODEDB_DATA_PLANE_CORES`) ## Security - [ ] TLS enabled for all protocols - [ ] Authentication configured (SCRAM, JWT, API keys, or mTLS) - [ ] RBAC roles defined and assigned - [ ] RLS policies for multi-tenant data - [ ] Audit logging level set (at least `standard`) - [ ] Default passwords changed ## Replication - [ ] Replication factor >= 3 for production data - [ ] Cluster has odd number of nodes (3, 5, 7) for Raft quorum - [ ] Cross-region learner replicas if needed ## Monitoring - [ ] Prometheus scraping `/metrics` - [ ] Grafana dashboards configured - [ ] Health check endpoint monitored (`/health/ready`) - [ ] Alerting on key metrics (WAL fsync latency, replication lag, memory pressure) ## Backup - [ ] Regular tenant backups scheduled - [ ] Backup validation (DRY RUN) tested - [ ] Restore procedure documented and tested - [ ] WAL archiving configured for PITR ## Operations - [ ] Rolling upgrade procedure documented - [ ] Shard rebalancing tested - [ ] Failure recovery tested (single node, minority failure) --- # CRDT Sync Overview (docs/crdt-sync/overview) --- title: CRDT Sync Overview description: AP on the edge, CP in the cloud. Loro-backed offline-first sync with constraint validation. --- # CRDT Sync Overview NodeDB bridges **AP** (available, partition-tolerant) on edge devices with **CP** (consistent, partition-tolerant) in the cloud. ## How It Works 1. **Device writes locally** — Available immediately, no network required 2. **CRDT delta produced** — Each write generates a Loro CRDT delta 3. **Connectivity returns** — Deltas sync to Origin via WebSocket (port 9090) 4. **Origin validates** — SQL constraints (UNIQUE, FK, CHECK) checked at Raft commit 5. **Committed** — Delta replicated via Raft, broadcast to connected clients 6. **Conflict** — Failed constraints produce a typed `CompensationHint` back to device ## Shape Subscriptions Devices subscribe to a subset of data — not the entire database: ```sql SUBSCRIBE SHAPE ON users WHERE user_id = $me; SUBSCRIBE SHAPE ON tax_rates WHERE jurisdiction IN ('US-CA', 'US-NY'); ``` Changes within a device's shape are pushed in real time. ## Multiple Devices Multiple devices converge to the same state regardless of operation order. Loro's operation-based CRDTs guarantee eventual consistency without coordination. ## Local Engine Capabilities All eight engines work locally on NodeDB-Lite: document (schemaless and strict), KV, columnar, timeseries, spatial, vector, and array, plus graph traversal and FTS overlays — with sub-millisecond reads and no network dependency. --- # Conflict Policies (docs/crdt-sync/conflict-policies) --- title: Conflict Policies description: Declarative per-collection conflict resolution strategies. --- # Conflict Policies ## Setting Policies ```sql CREATE COLLECTION notes WITH (conflict_policy = 'lww'); CREATE COLLECTION profiles WITH (conflict_policy = 'field_merge'); ``` ## Available Policies | Policy | Behavior | | ------------- | ------------------------------------------------ | | `lww` | Last-writer-wins based on timestamp (default) | | `field_merge` | Per-field merge — concurrent edits to different fields merge automatically | ## Field-Level Merge With `field_merge`, concurrent edits to different fields on the same document are both preserved.
Concurrent edits to the same field use the LWW tiebreaker. ## Per-Collection Configuration The `PolicyRegistry` stores resolution strategies per collection. Each collection can have its own policy independent of others. ```sql ALTER COLLECTION notes SET conflict_policy = 'field_merge'; SHOW CONFLICT POLICY ON notes; ``` --- # Constraint Validation (docs/crdt-sync/constraint-validation) --- title: Constraint Validation description: SQL constraints enforced at sync time on Origin with typed compensation hints. --- # Constraint Validation SQL constraints are checked on Origin at sync time, not on the device. This keeps local writes fast (no network round-trip) while preserving global consistency. ## Enforced Constraints - **UNIQUE** — Duplicate key → `CompensationHint::Rename` or `CompensationHint::Merge` - **FOREIGN KEY** — Dangling reference → `CompensationHint::CreateParent` or `CompensationHint::Discard` - **CHECK** — Constraint violation → `CompensationHint::Adjust` ## Compensation Hints When a local write violates a constraint on Origin, a typed `CompensationHint` is sent back to the device. The application handles the conflict: - **Rename** — The document ID or key already exists. Offer the user to rename. - **Merge** — Concurrent edits to the same document. Offer to merge. - **CreateParent** — A FK target doesn't exist. Offer to create it or discard the child. - **Adjust** — A CHECK constraint failed. Show the violation and let the user fix it. No silent data loss — the application always decides. ## State Constraints ```sql ALTER COLLECTION invoices ADD CONSTRAINT invoice_flow ON COLUMN status TRANSITIONS ( 'draft' -> 'submitted', 'submitted' -> 'approved' BY ROLE 'manager', 'approved' -> 'issued' BY ROLE 'accountant' ); ``` State transitions are validated at sync time. Invalid transitions (e.g., `draft -> issued`) are rejected with a compensation hint. --- # Dead-Letter Queue (docs/crdt-sync/dead-letter-queue) --- title: Dead-Letter Queue description: Handling persistently failing CRDT deltas with diagnostic metadata. --- # Dead-Letter Queue Deltas that persistently fail constraint validation on Origin are routed to a dead-letter queue (DLQ). Each entry includes diagnostic metadata: the original delta, the constraint that failed, and the compensation hint. ## Inspection Operators can inspect DLQ entries to understand why syncs are failing and take corrective action. ## Resolution Options - **Replay** — Fix the constraint issue and replay the delta - **Discard** — Permanently discard the failed delta - **Manual merge** — Resolve the conflict manually and apply a corrected delta ## Trigger DLQ AFTER triggers that persistently fail (after retry with exponential backoff) also route to the DLQ. This applies to both CRDT sync triggers and regular async triggers. --- # Offline Sync Patterns (docs/crdt-sync/offline-sync) --- title: Offline Sync Patterns description: Patterns for provisional numbering, state machines, and offline-first workflows. --- # Offline Sync Patterns ## Provisional Numbering Official document numbers (invoices, receipts) must be sequential and gap-free. Offline devices can't reserve numbers from Origin. **Pattern:** 1. Create documents with provisional ULID-based identifiers (`DRAFT-01JXQ3...`) 2. Sync to Origin via CRDT 3. Origin assigns official sequential number via `nextval('invoice_seq')` 4. 
Official number syncs back to device ```sql -- Device (offline) INSERT INTO invoices (invoice_number, customer_id, amount, status) VALUES ('DRAFT-01JXQ3...', 'cust-005', 750.00, 'draft'); -- Origin (on sync) — trigger assigns official number -- Invoice becomes INV-26-04-00024 ``` ## State Machine Transitions ```sql ALTER COLLECTION invoices ADD CONSTRAINT invoice_flow ON COLUMN status TRANSITIONS ( 'draft' -> 'submitted', 'submitted' -> 'approved' BY ROLE 'manager', 'approved' -> 'issued' BY ROLE 'accountant', 'issued' -> 'voided' BY ROLE 'controller' ); ``` Devices can create drafts and submit them. Approval requires `manager` role — only available on Origin. ## Tax Rate Sync Sync relevant jurisdiction rates to devices via shape subscriptions: ```sql SUBSCRIBE SHAPE ON tax_rates WHERE jurisdiction IN ('US-CA', 'US-NY'); ``` Lookup at invoice time with temporal query. Reconcile on sync if rates changed. ## NodeDB-Lite Usage ```swift // iOS let db = NodeDbLite.open(path: "invoices.db") db.execute("INSERT INTO invoices ...") db.sync(url: "wss://origin.example.com/sync", token: authToken) ``` ```javascript // Browser (WASM) const db = await NodeDbLite.open("invoices"); await db.execute("INSERT INTO invoices ..."); await db.sync("wss://origin.example.com/sync"); ``` All eight engines work locally with sub-millisecond reads. CRDT sync is transparent. --- # Error Codes (docs/reference/error-codes) --- title: Error Codes description: Numeric error codes returned by NodeDB with descriptions and resolution hints. --- # Error Codes NodeDB uses `NodeDbError` — a struct with numeric `ErrorCode`, human-readable `message`, machine-matchable `ErrorDetails`, and optional chained `cause`. ## Error Categories | Code Range | Category | Examples | | ---------- | ---------- | --------------------------------- | | 1000–1099 | Write path | Constraint violation, deadline | | 1100–1199 | Read path | Collection not found | | 1200–1299 | Query | Parse error, plan error | | 2000–2099 | Auth | Authorization denied | | 3000–3099 | Sync | Delta rejected, sync error | | 4000–4099 | Storage | Segment corrupted, WAL error | | 5000–5099 | Config | Bad request, invalid config | | 6000–6099 | Cluster | No quorum, shard unavailable | | 7000–7099 | Memory | Budget exceeded | | 8000–8099 | Encryption | Key error, decrypt failure | | 9000–9099 | Internal | Unexpected internal error | ## Common Errors | Error | Code | Resolution | | ---------------------- | ---- | --------------------------- | | `COLLECTION_NOT_FOUND` | 1100 | Check collection name | | `TYPE_MISMATCH` | 1020 | Check data types in query | | `CONSTRAINT_VIOLATION` | 1000 | Duplicate key on plain INSERT maps to SQLSTATE `23505` (`unique_violation`). Use `UPSERT` or `INSERT ... ON CONFLICT DO NOTHING / DO UPDATE` for non-error semantics. | | `DEADLINE_EXCEEDED` | 1002 | Query took too long | | `INSUFFICIENT_BALANCE` | 1022 | Not enough for TRANSFER | | `OVERFLOW` | 1021 | i64 overflow on INCR | | `AUTHORIZATION_DENIED` | 2000 | Check RBAC grants | | `SYNC_DELTA_REJECTED` | 3001 | CRDT sync constraint failed | ## Error Construction Errors are created via constructors: `NodeDbError::storage("detail")`, `NodeDbError::collection_not_found("users")`, etc. --- # SQLSTATE Mapping (docs/reference/sqlstate-mapping) --- title: SQLSTATE Mapping description: PostgreSQL-compatible SQLSTATE codes returned over pgwire. --- # SQLSTATE Mapping NodeDB maps internal error codes to PostgreSQL SQLSTATE codes for pgwire compatibility. 
---

# SQLSTATE Mapping (docs/reference/sqlstate-mapping)

---
title: SQLSTATE Mapping
description: PostgreSQL-compatible SQLSTATE codes returned over pgwire.
---

# SQLSTATE Mapping

NodeDB maps internal error codes to PostgreSQL SQLSTATE codes for pgwire compatibility.

| NodeDB Error         | SQLSTATE | PostgreSQL Meaning         |
| -------------------- | -------- | -------------------------- |
| Syntax error         | 42601    | Syntax error               |
| Collection not found | 42P01    | Undefined table            |
| Column not found     | 42703    | Undefined column           |
| Type mismatch        | 42846    | Cannot coerce              |
| Unique violation     | 23505    | Unique violation           |
| FK violation         | 23503    | Foreign key violation      |
| Check violation      | 23514    | Check violation            |
| Not null violation   | 23502    | Not null violation         |
| Permission denied    | 42501    | Insufficient privilege     |
| Numeric overflow     | 22003    | Numeric value out of range |
| Division by zero     | 22012    | Division by zero           |
| Rate limit exceeded  | 54001    | Too many resources         |
| Internal error       | XX000    | Internal error             |

---

# Type OIDs (docs/reference/type-oids)

---
title: Type OIDs
description: PostgreSQL type OID mappings for pgwire protocol compatibility.
---

# Type OIDs

NodeDB maps its types to PostgreSQL OIDs for pgwire compatibility.

| NodeDB Type | PostgreSQL Type | OID  |
| ----------- | --------------- | ---- |
| STRING      | text            | 25   |
| INT         | int8            | 20   |
| FLOAT       | float8          | 701  |
| BOOL        | bool            | 16   |
| TIMESTAMP   | timestamptz     | 1184 |
| UUID        | uuid            | 2950 |
| DECIMAL     | numeric         | 1700 |
| ARRAY       | jsonb           | 3802 |
| OBJECT      | jsonb           | 3802 |
| GEOMETRY    | bytea           | 17   |

Complex types (ARRAY, OBJECT) are serialized as JSONB over pgwire. GEOMETRY is serialized as WKB (bytea).

---

# Wire Protocol Specs (docs/reference/wire-protocol-specs)

---
title: Wire Protocol Specs
description: Protocol specifications for NDB (native), pgwire extensions, and HTTP API.
---

# Wire Protocol Specs

## NDB (Native Protocol)

Binary MessagePack over TCP (port 6433). Two message types:

- **Sql** — SQL text as a MessagePack string. Parsed by sqlparser.
- **Native opcodes** — Typed messages (Get, Put, VectorSearch, etc.) that skip SQL parsing.

Both produce the same `PhysicalPlan`.

## pgwire Extensions

NodeDB extends the PostgreSQL wire protocol with:

- `LIVE SELECT` — async notification delivery via `NotificationResponse`
- `GRAPH` commands — parsed as SQL, executed via EngineRules
- `MATCH` patterns — Cypher-subset parsed inline
- `SEARCH ... USING VECTOR` — vector search syntax
- Session variables: `SET nodedb.consistency = 'eventual'`
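For example, a session might relax consistency and then open a live query. This is a sketch: the `SET` statement is documented above, while the `LIVE SELECT` body is assumed to mirror an ordinary `SELECT`.

```sql
-- Sketch: assumes LIVE SELECT accepts an ordinary SELECT body.
-- Matching row changes arrive asynchronously as NotificationResponse messages.
SET nodedb.consistency = 'eventual';
LIVE SELECT id, status FROM orders WHERE status = 'pending';
```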
## HTTP API

REST endpoints on port 6480. All data routes are under `/v1/`. Probe routes (`/healthz`, `/health/*`, `/metrics`) are unversioned and always reachable.

| Method | Path                                | Description                             |
| ------ | ----------------------------------- | --------------------------------------- |
| POST   | `/v1/query`                         | Execute SQL                             |
| POST   | `/v1/query/stream`                  | Stream NDJSON results                   |
| GET    | `/v1/status`                        | Node status                             |
| GET    | `/v1/cluster/status`                | Cluster status                          |
| POST   | `/v1/auth/exchange-key`             | Exchange API key for session token      |
| POST   | `/v1/auth/session`                  | Create session                          |
| DELETE | `/v1/auth/session`                  | Delete session                          |
| POST   | `/v1/collections/{name}/crdt/apply` | Apply CRDT delta                        |
| GET    | `/v1/streams/{id}/events`           | SSE named-stream events                 |
| GET    | `/v1/streams/{id}/poll`             | Long-poll named-stream events           |
| GET    | `/v1/cdc/{collection}`              | SSE CDC stream                          |
| GET    | `/v1/cdc/{collection}/poll`         | Long-poll CDC                           |
| GET    | `/v1/obsv/api/v1/...`               | PromQL endpoints (Prom-spec suffix)     |
| WS     | `/v1/ws`                            | WebSocket JSON-RPC                      |
| GET    | `/healthz`                          | k8s readiness (503 until startup ready) |
| GET    | `/health/live`                      | Liveness probe                          |
| GET    | `/health/ready`                     | Readiness probe                         |
| POST   | `/health/drain`                     | Cooperative drain                       |
| GET    | `/metrics`                          | Prometheus metrics                      |

JSON responses carry `Content-Type: application/vnd.nodedb.v1+json; charset=utf-8`. Clients sending only `Accept: application/vnd.nodedb.vN+json` for an unsupported `N` get `406 Not Acceptable`.
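A quick smoke test against `/v1/query`: the path, port, and versioned media type are as documented above, while the JSON request body shape is an assumption of this sketch.

```bash
# Request body shape is illustrative; path, port, and Accept media type
# follow the table above.
curl -s http://localhost:6480/v1/query \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/vnd.nodedb.v1+json' \
  -d '{"sql": "SELECT 1"}'
```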
---

# WAL & Segment Formats (docs/reference/wal-segment-formats)

---
title: WAL & Segment Formats
description: On-disk format specifications for WAL records and segment files.
---

# WAL & Segment Formats

## WAL Record Format

```
┌─────────┬─────────────────┬─────────────┬─────────┬───────────┬───────────┬─────────────┬─────────┐
│ magic   │ format_version  │ record_type │ lsn     │ tenant_id │ vshard_id │ payload_len │ crc32c  │
│ 4 bytes │ 2 bytes         │ 2 bytes     │ 8 bytes │ 4 bytes   │ 2 bytes   │ 4 bytes     │ 4 bytes │
└─────────┴─────────────────┴─────────────┴─────────┴───────────┴───────────┴─────────────┴─────────┘
```

- Page size: 4 KiB or 16 KiB (O_DIRECT alignment)
- CRC32C per page for bit-rot detection
- Segmented with auto-rollover
- AES-256-GCM encryption (optional, per-record)

## Segment File Footer

All persistent segment files embed a footer:

```
┌─────────────────┬────────────┬──────────┬─────────┬─────────┐
│ format_version  │ created_by │ checksum │ min_lsn │ max_lsn │
└─────────────────┴────────────┴──────────┴─────────┴─────────┘
```

- Compaction preserves monotonic LSN ordering
- Footer enables crash-safe validation on startup

## Columnar Segment Format (NDBS)

- Magic bytes: `NDBS`
- Versioned footer with CRC32C
- Per-column codec metadata
- 1024-row blocks with min/max/null-count statistics
- Delete bitmaps (Roaring)

---

# Engine Configuration (docs/reference/engine-config)

---
title: Engine Configuration
description: Per-engine configuration options for memory budgets, index parameters, and compression.
---

# Engine Configuration

## Vector Engine

| Parameter       | Default | Description                              |
| --------------- | ------- | ---------------------------------------- |
| M               | 16      | Max HNSW connections per node per layer  |
| EF_CONSTRUCTION | 200     | Search width during index build          |
| Memory budget   | 30%     | Percentage of total memory               |

## Graph Engine

Graph, KV, and FTS engines have no dedicated budget config fields; they draw from the sparse budget.

| Parameter  | Default | Description            |
| ---------- | ------- | ---------------------- |
| Compaction | 10%     | Buffer merge threshold |

## FTS Engine

| Parameter          | Default     | Description                |
| ------------------ | ----------- | -------------------------- |
| Memtable threshold | 32M entries | Flush-to-segment threshold |
| Compaction levels  | 8x8         | Level-based compaction     |

## Columnar / Timeseries / Spatial Engines

The `timeseries` peer engine has its own memory budget knob, `timeseries_budget_fraction` (default 10%); `columnar` and `spatial` draw from the columnar engine pool.

| Parameter    | Default | Description             |
| ------------ | ------- | ----------------------- |
| Block size   | 1024    | Rows per block          |
| partition_by | none    | Time partition interval |
| retention    | none    | Data retention period   |

## KV Engine

KV draws from the sparse budget. There is no dedicated `kv_budget_fraction` config field.

## Memory Budget Fields

These are the actual config fields controlling memory allocation:

| Config field                 | Default | Controls                       |
| ---------------------------- | ------- | ------------------------------ |
| `vector_budget_fraction`     | 0.30    | HNSW index and vector data     |
| `sparse_budget_fraction`     | 0.15    | Graph, KV, FTS, and metadata   |
| `crdt_budget_fraction`       | 0.10    | CRDT state and sync buffers    |
| `timeseries_budget_fraction` | 0.10    | Timeseries peer engine         |
| `query_budget_fraction`      | 0.20    | Query execution working memory |
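A configuration file might pin these budgets explicitly. In this sketch the field names and values come straight from the table above; whether they sit at the top level of `nodedb.toml` or under a section header is an assumption.

```toml
# Sketch: field names and defaults are from the table above. Their exact
# placement within nodedb.toml is an assumption of this example.
vector_budget_fraction     = 0.30
sparse_budget_fraction     = 0.15
crdt_budget_fraction       = 0.10
timeseries_budget_fraction = 0.10
query_budget_fraction      = 0.20
```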
---

# Limits & Defaults (docs/reference/limits-defaults)

---
title: Limits & Defaults
description: System limits, default values, and capacity constraints.
---

# Limits & Defaults

## Per-Op Server Limits

Server-enforced caps announced to every native client in the `HelloAckFrame.limits` struct. Defaults are uncapped (`None`); operators set caps via configuration. Exceeding a cap returns a typed `LimitExceeded { limit_name, value, max }` error rather than truncating silently.

| Limit                  | Applies to                                  |
| ---------------------- | ------------------------------------------- |
| `max_vector_dim`       | Vector embedding dimensionality             |
| `max_top_k`            | `top_k` for vector / FTS / hybrid retrieval |
| `max_scan_limit`       | Result set size on scans                    |
| `max_batch_size`       | Rows per batch INSERT/UPSERT                |
| `max_crdt_delta_bytes` | Single CRDT delta payload                   |
| `max_query_text_bytes` | SQL text length                             |
| `max_graph_depth`      | `MAX_DEPTH` for graph traversal             |

SDKs surface these via `client.limits()`. See [Native Protocol — Handshake](../connectivity/native-protocol#handshake).

## Connection Limits

| Setting          | Default | Max          |
| ---------------- | ------- | ------------ |
| Max connections  | 1024    | Configurable |
| Data Plane cores | CPUs-1  | CPU count    |

## Collection Limits

| Limit                | Value     |
| -------------------- | --------- |
| Collection name      | 255 chars |
| Field name           | 255 chars |
| Document size        | 16 MiB    |
| Vector dimensions    | 65536     |
| Edge properties size | 16 MiB    |

## Query Limits

| Limit                   | Value    |
| ----------------------- | -------- |
| SQL text length         | 1 MiB    |
| Result set (default)    | No limit |
| MATCH MAX_DEPTH default | 10       |
| BFS MAX_VISITED default | 10000    |

## Default Ports

| Protocol | Port |
| -------- | ---- |
| pgwire   | 6432 |
| NDB      | 6433 |
| HTTP     | 6480 |
| Sync     | 9090 |
| RESP     | off  |
| ILP      | off  |

## Default Memory

| Setting      | Default    |
| ------------ | ---------- |
| memory_limit | 1 GiB      |
| Docker       | 75% of RAM |

---

# Feature Flags (docs/reference/feature-flags)

---
title: Feature Flags
description: Compile-time feature gates for optional capabilities.
---

# Feature Flags

NodeDB uses Cargo feature flags for optional capabilities.

| Feature      | What it enables                                     |
| ------------ | --------------------------------------------------- |
| `promql`     | PromQL query engine at `/obsv/api`                  |
| `otel`       | OpenTelemetry OTLP ingest and export                |
| `grafana`    | Grafana data source endpoints                       |
| `monitoring` | Meta-feature: enables `promql` + `otel` + `grafana` |
| `kafka`      | Kafka bridge for change stream delivery             |

The `nodedb-fts` crate defines additional language features (`lang-ja`, `lang-zh`, `lang-ko`, `lang-th`) for dictionary-based CJK/Thai segmentation. These are crate-level features, not workspace-level.

## Building with Features

```bash
cargo build --release --features promql,otel,monitoring
cargo build --release --all-features
```

## Runtime Configuration

Feature-gated capabilities are configurable at runtime via `nodedb.toml` and environment variables once compiled in.
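For example, with `promql` compiled in, the Prometheus-spec endpoints respond under `/v1/obsv/` (see the HTTP API table). The suffix `/api/v1/query` follows the Prometheus HTTP API spec; the metric name `up` is only an illustrative query.

```bash
# Illustrative PromQL call: the /v1/obsv prefix and Prom-spec suffix are
# documented; the queried metric name is an example.
curl -s 'http://localhost:6480/v1/obsv/api/v1/query?query=up'
```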