---

# AI & LLM Applications (use-cases/ai-llm)

---
title: AI & LLM Applications
description: Build RAG pipelines, agent memory stores, and hybrid semantic search on NodeDB — vector, graph, full-text, and document engines in one query.
---

Modern AI applications need at least three databases to function: a vector store for embeddings, a graph database for entity relationships, and a search index for keyword recall. NodeDB collapses all three into one, with a fourth engine — schemaless Document — for storing raw LLM outputs and conversation history.

## Engines used

| Engine | Role |
|---|---|
| **Vector** | HNSW similarity search, filtered ANN, multivec MaxSim |
| **Full-Text Search** | BM25 keyword recall, 27-language analyzers |
| **Graph** | Entity relationships, GraphRAG context expansion |
| **Document (schemaless)** | Conversation history, LLM outputs, agent state |

## Hybrid RAG retrieval

A basic RAG retrieval combines vector similarity with BM25 keyword recall, then fuses the two ranked lists with Reciprocal Rank Fusion. `rrf_score(...)` does the fusion inside the planner — one statement, no application-side merge loop.

```sql
-- Hybrid search: vector ANN + BM25, fused with RRF
SELECT id, content, metadata,
       rrf_score(
         vector_distance(embedding, $query_vec),
         bm25_score(content, $query_text)
       ) AS score
FROM documents
WHERE metadata ->> 'source' = 'internal'
ORDER BY score DESC
LIMIT 10;
```

## GraphRAG: expand context through entity relationships

Entities and their relationships are graph edges layered on the same `documents` collection — graph is an overlay, not a separate store.
After vector search finds seed chunks, the fusion DSL walks those edges and merges the rankings:

```sql
-- Wire entities as edges on the documents collection
GRAPH INSERT EDGE IN 'documents' FROM $chunk_id TO $entity_id TYPE 'mentions';
GRAPH INSERT EDGE IN 'documents' FROM $entity_a TO $entity_b TYPE 'related_to'
  PROPERTIES { weight: 0.8 };

-- Vector search seeds → BFS along 'related_to' → RRF merge of vector rank + hop distance
GRAPH RAG FUSION ON documents
  QUERY $query_vec
  VECTOR_FIELD 'embedding'
  VECTOR_TOP_K 50
  EXPANSION_DEPTH 2
  EDGE_LABEL 'related_to'
  FINAL_TOP_K 10
  RRF_K (60.0, 35.0);
```

Need keyword relevance in the same pass? Add the `BM25` leg and the `RRF_K` tuple becomes a triple — vector, graph expansion, and BM25 all fused in one plan:

```sql
GRAPH RAG FUSION ON documents
  QUERY $query_vec
  VECTOR_FIELD 'embedding'
  VECTOR_TOP_K 50
  BM25 $query_text ON 'content'
  EXPANSION_DEPTH 2
  EDGE_LABEL 'related_to'
  FINAL_TOP_K 10
  RRF_K (60.0, 35.0, 50.0);
```

You can also walk the entity graph directly with a Cypher-subset `MATCH`:

```sql
MATCH (c:Chunk)-[:mentions]->(e:Entity)-[:related_to*1..2]->(related:Entity)
WHERE c.id = $chunk_id
RETURN DISTINCT related.id
LIMIT 20;
```

## Agent memory store

Long-running agents need persistent, queryable memory. A schemaless Document collection stores arbitrary turn payloads; a vector index makes semantic recall fast.
```sql
CREATE COLLECTION agent_memory;
CREATE VECTOR INDEX idx_mem ON agent_memory METRIC cosine DIM 1536;

-- Store a turn (object-literal insert on a schemaless collection)
INSERT INTO agent_memory {
  session_id: $sid,
  turn: $n,
  role: $role,
  content: $text,
  embedding: $vec,
  created_at: now()
};

-- Recall: recent turns from this session + semantically similar turns from any session
(SELECT content, role, created_at
 FROM agent_memory
 WHERE session_id = $sid
 ORDER BY turn DESC
 LIMIT 10)
UNION ALL
(SELECT content, role, created_at
 FROM agent_memory
 WHERE session_id <> $sid
 ORDER BY embedding <=> $query_vec
 LIMIT 5)
ORDER BY created_at;
```

## Filtered vector search

Production vector workloads almost always filter by metadata before ranking. NodeDB's NaviX adaptive-local traversal keeps ANN recall high even with tight filters — the planner builds a roaring bitmap of matching IDs and picks pre-filter, post-filter, or brute-force based on selectivity.

```sql
-- Similar products, restricted to in-stock items in the user's region
SELECT id, name, price, embedding <=> $query_vec AS distance
FROM products
WHERE in_stock = true
  AND region = $region
  AND category_id = ANY($category_ids)
ORDER BY embedding <=> $query_vec
LIMIT 20;
```

Store full-precision embeddings and rank on the first N dimensions for a fast first pass via the `query_dim` tuning argument — then re-rank the top candidates at full precision. No reindexing:

```sql
SELECT id, vector_distance(embedding, $query_vec, query_dim => 256) AS distance
FROM products
WHERE in_stock = true
ORDER BY distance
LIMIT 100;
```

## Why not a dedicated vector database?

Dedicated vector databases force you to maintain a separate store for every non-vector data shape. When your RAG pipeline needs keyword fallback, entity relationships, and conversation history, you end up with four systems, four consistency boundaries, and cross-system joins that travel over the network.
NodeDB keeps everything in one process with a single shared snapshot. A hybrid search + graph expansion + memory recall query is one SQL statement with zero inter-process calls.

---

# Fintech & Trading (use-cases/fintech-trading)

---
title: Fintech & Trading
description: Tick ingestion, OHLCV aggregates, bitemporal audit ledgers, and risk models — unified in NodeDB without Kafka fan-out or separate OLAP clusters.
---

Financial systems typically spread across a time-series database for tick data, a columnar store for analytics, a strict relational database for ledger records, and a key-value cache for positions. NodeDB unifies all four workloads, adds bitemporal support for regulatory audit trails, and handles the ingestion throughput fintech demands.

## Engines used

| Engine | Role |
|---|---|
| **Timeseries** | Tick ingestion, OHLCV, continuous aggregates, retention |
| **Columnar** | Risk models, portfolio analytics, aggregation-heavy OLAP |
| **Key-Value** | Current positions, order-book snapshots, hot-path cache |
| **Document (strict)** | Immutable transaction records, bitemporal audit ledger |

## Tick ingestion and OHLCV aggregates

The Timeseries engine is append-only with per-column compression, a `TIME_KEY` column driving partition-by-time and block skip, and built-in retention.
```sql
CREATE COLLECTION ticks (
  ts TIMESTAMP TIME_KEY,
  symbol VARCHAR,
  price FLOAT,
  volume BIGINT
) WITH (engine='timeseries', partition_by='1d', retention='90d');

-- Ingest ticks (batch insert for throughput)
INSERT INTO ticks (ts, symbol, price, volume) VALUES
  ('2026-01-15 09:30:00.123', 'AAPL', 185.42, 1200),
  ('2026-01-15 09:30:00.456', 'AAPL', 185.45, 800);

-- 1-minute OHLCV bars
SELECT DISTINCT
  time_bucket('1m', ts) AS bucket,
  symbol,
  first_value(price) OVER w_ord AS open,
  max(price) OVER w AS high,
  min(price) OVER w AS low,
  last_value(price) OVER w_ord AS close,
  sum(volume) OVER w AS volume
FROM ticks
WHERE symbol = $symbol AND ts >= now() - INTERVAL '1 hour'
WINDOW
  w AS (PARTITION BY time_bucket('1m', ts), symbol),
  w_ord AS (w ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER BY bucket DESC;
```

## Bitemporal audit ledger

Regulators demand both when a transaction occurred (valid time) and when it was recorded in the system (system time). NodeDB tracks system time *in the key* (derived from the WAL LSN at Raft commit) and valid time *in the value* — no application-side audit table.

```sql
-- Bitemporal transaction ledger on the strict Document engine
CREATE COLLECTION transactions (
  id UUID DEFAULT gen_uuid_v7(),
  account_id UUID,
  symbol VARCHAR,
  quantity DECIMAL,
  price DECIMAL,
  side VARCHAR,
  executed_at TIMESTAMP
) WITH (engine='document_strict', bitemporal=true);

-- Record a trade with explicit valid time
INSERT INTO transactions (account_id, symbol, quantity, price, side, executed_at, valid_time)
VALUES ($account_id, 'MSFT', 100, 374.50, 'BUY', '2026-01-15 10:00:00', '2026-01-15 10:00:00');

-- Regulatory query: account state as the database recorded it on a past day,
-- valid as of noon that day. Times are milliseconds since the Unix epoch.
SELECT symbol, quantity, price, side, executed_at
FROM transactions
AS OF SYSTEM TIME 1768521599000  -- 2026-01-15 23:59:59 UTC
AS OF VALID TIME 1768478400000   -- 2026-01-15 12:00:00 UTC
WHERE account_id = $account_id
ORDER BY executed_at;

-- Full lineage of a single transaction in system-time order
SELECT id, side, quantity, price, system_time
FROM transactions
AS OF SYSTEM TIME NULL  -- special: returns every version
WHERE id = $txn_id
ORDER BY system_time ASC;
```

## Current positions — Key-Value hot path

The KV engine provides O(1) point lookups for current position state, updated on every fill.

```sql
CREATE COLLECTION positions (key TEXT PRIMARY KEY) WITH (engine='kv');

-- Upsert the position record on a fill
UPSERT INTO positions {
  key: $account_symbol,
  symbol: $symbol,
  quantity: $qty,
  cost_basis: $cost_basis,
  updated_at: now()
};

-- Atomic running-quantity adjustment without read-modify-write
SELECT KV_INCR_FLOAT('positions', $account_symbol_qty_key, $delta_qty);

-- Single-millisecond position lookup
SELECT * FROM positions WHERE key = $account_symbol;
```

## Portfolio risk analytics — Columnar engine

Risk calculations scan every position in a portfolio, applying factor models across millions of rows. The Columnar engine with predicate pushdown and per-column compression handles this without a separate OLAP cluster.
```sql
CREATE COLLECTION portfolio_risk (
  account_id UUID,
  as_of_date TIMESTAMP TIME_KEY,
  market_value DECIMAL,
  beta FLOAT,
  daily_pnl DECIMAL,
  position_status VARCHAR
) WITH (engine='columnar');

-- Value-at-Risk across all active positions
SELECT account_id,
       sum(market_value) AS total_exposure,
       sum(market_value * beta) AS beta_adjusted,
       approx_percentile(daily_pnl, 0.05) AS var_95
FROM portfolio_risk
WHERE as_of_date >= now() - INTERVAL '1 day'
  AND position_status = 'active'
GROUP BY account_id
HAVING sum(market_value) > 100000;
```

## Cross-engine P&L attribution

One query joins the timeseries tick data with the KV position record — zero network hops between engines, one shared snapshot.

```sql
-- Intraday unrealised P&L: latest tick price vs. cost basis
SELECT p.value ->> 'symbol' AS symbol,
       (p.value ->> 'quantity')::numeric AS quantity,
       (p.value ->> 'cost_basis')::numeric AS cost_basis,
       t.last_price,
       ((t.last_price - (p.value ->> 'cost_basis')::numeric)
         * (p.value ->> 'quantity')::numeric) AS unrealised_pnl
FROM positions p
CROSS JOIN LATERAL (
  SELECT price AS last_price
  FROM ticks
  WHERE symbol = p.value ->> 'symbol'
  ORDER BY ts DESC
  LIMIT 1
) t;
```

Define a continuous aggregate once and NodeDB refreshes it incrementally as new ticks arrive — no cron job, no separate aggregation pipeline:

```sql
CREATE CONTINUOUS AGGREGATE ohlcv_1m ON ticks AS
SELECT time_bucket('1m', ts) AS bucket,
       symbol,
       min(price) AS low,
       max(price) AS high,
       sum(volume) AS volume
FROM ticks
GROUP BY bucket, symbol
WITH (refresh_interval = '5s');
```

## Why not a dedicated time-series database?

Purpose-built time-series databases force all non-tick data — ledger records, positions, risk factors — into external systems. Cross-system queries (tick price × position quantity × risk factor) travel over the network and require application-side joins. NodeDB keeps every engine co-located: cross-engine queries reduce to in-process roaring-bitmap intersections over a shared surrogate identity space.
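To make the last point concrete, here is a minimal Python sketch of what a cross-engine join degenerates to once each engine has resolved its predicate to a set of global surrogate row IDs. The candidate sets and IDs are hypothetical, and plain Python sets stand in for compressed roaring bitmaps — NodeDB performs this step in-process inside the planner.

```python
# Illustrative only: each engine returns a set of global surrogate row IDs;
# the cross-engine join is then just a set intersection. Plain Python sets
# stand in for the compressed roaring bitmaps NodeDB uses internally.

def intersect_engine_results(*id_sets):
    """Intersect per-engine candidate ID sets, smallest first."""
    ordered = sorted(id_sets, key=len)
    result = set(ordered[0])
    for s in ordered[1:]:
        result &= s          # bitmap AND
        if not result:
            break            # short-circuit on an empty intersection
    return result

# Hypothetical candidate sets from three engines for one query
timeseries_ids = {1, 2, 3, 5, 8, 13}   # ticks in the last hour
kv_ids         = {2, 3, 5, 21}         # accounts with open positions
columnar_ids   = {3, 5, 8, 34}         # rows passing the risk filter

print(sorted(intersect_engine_results(timeseries_ids, kv_ids, columnar_ids)))
# → [3, 5]
```

Intersecting smallest-first keeps the working set small, which is the same reason bitmap engines reorder their AND chains by cardinality.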
---

# Healthcare (use-cases/healthcare)

---
title: Healthcare
description: Schema-enforced patient records, bitemporal clinical timelines, and row-level security for HIPAA-grade data governance — all in NodeDB.
---

Healthcare data has two hard requirements that pull in opposite directions: strict schema enforcement for regulatory compliance, and flexible temporal queries for clinical timelines that capture both when care was delivered and when it was recorded. NodeDB satisfies both with its strict Document engine and native bitemporal support, while RBAC and row-level security enforce access control at the storage layer rather than in application middleware.

## Engines used

| Engine | Role |
|---|---|
| **Document (strict)** | Patient records, encounter notes, medication orders — schema-enforced Binary Tuples |
| **Timeseries** | Vital signs, lab results, continuous monitoring streams |
| **Full-Text Search** | Clinical note search, ICD/SNOMED code lookup |
| **Columnar** | Population health analytics, cohort queries, reporting |

## Schema-enforced patient records

The strict Document engine (`engine='document_strict'`) validates every write against the declared columns and types — bad data is rejected at the storage layer, not caught later in application code. `bitemporal=true` adds system-time and valid-time tracking.
```sql
CREATE COLLECTION patients (
  id UUID DEFAULT gen_uuid_v7(),
  mrn VARCHAR,
  dob DATE,
  given_name VARCHAR,
  family_name VARCHAR,
  allergies JSON,
  org_id UUID,
  created_at TIMESTAMP DEFAULT now()
) WITH (engine='document_strict', bitemporal=true);

-- NodeDB rejects table-level constraints; use a unique index for MRN uniqueness
CREATE UNIQUE INDEX ON patients(mrn);

INSERT INTO patients (mrn, dob, given_name, family_name, allergies, org_id)
VALUES ('MRN-00001234', '1982-07-14', 'Jane', 'Doe', '["penicillin"]', $org_id);
```

## Bitemporal clinical timeline

Clinical decisions depend on what was known at the time of treatment, not just what is currently recorded. System time is stored in the key (derived from the WAL LSN at Raft commit); valid time is client-assigned and stored in the value. Corrections add a new version — old versions remain queryable.

```sql
CREATE COLLECTION diagnoses (
  id UUID DEFAULT gen_uuid_v7(),
  patient_id UUID,
  encounter_id UUID,
  icd10_code VARCHAR,
  description VARCHAR,
  is_primary BOOLEAN
) WITH (engine='document_strict', bitemporal=true);

-- Initial diagnosis, valid as of when the condition was present
INSERT INTO diagnoses (patient_id, encounter_id, icd10_code, description, is_primary, valid_time)
VALUES ($patient_id, $enc_id, 'J18.9', 'Pneumonia, unspecified', true, '2026-01-10');

-- Later correction: it was actually J18.1. New version, same patient.
INSERT INTO diagnoses (patient_id, encounter_id, icd10_code, description, is_primary, valid_time)
VALUES ($patient_id, $enc_id, 'J18.1', 'Lobar pneumonia, unspecified', true, '2026-01-10');

-- Audit query: what the database held on Jan 11, valid as of Jan 12.
-- AS OF times are milliseconds since the Unix epoch.
SELECT icd10_code, description
FROM diagnoses
AS OF SYSTEM TIME 1768175999000  -- 2026-01-11 23:59:59 UTC
AS OF VALID TIME 1768176000000   -- 2026-01-12 00:00:00 UTC
WHERE patient_id = $patient_id;

-- Full version history of a patient's diagnoses, in system-time order
SELECT icd10_code, description, system_time
FROM diagnoses
AS OF SYSTEM TIME NULL  -- special: returns every version
WHERE patient_id = $patient_id
ORDER BY system_time ASC;
```

## Vital signs — Timeseries engine

Continuous monitoring produces millions of readings per patient per day. The Timeseries engine's append-only design and per-column compression handle this efficiently, with per-collection retention.

```sql
CREATE COLLECTION vitals (
  recorded_at TIMESTAMP TIME_KEY,
  patient_id UUID,
  metric VARCHAR,
  value FLOAT,
  unit VARCHAR
) WITH (engine='timeseries', partition_by='1d', retention='3650d');

INSERT INTO vitals (recorded_at, patient_id, metric, value, unit) VALUES
  (now(), $patient_id, 'heart_rate', 72, 'bpm'),
  (now(), $patient_id, 'spo2', 98.5, '%'),
  (now(), $patient_id, 'systolic_bp', 118, 'mmHg');

-- 5-minute rolling averages for a deterioration alert
SELECT time_bucket('5m', recorded_at) AS bucket,
       avg(CASE WHEN metric = 'heart_rate' THEN value END) AS avg_hr,
       avg(CASE WHEN metric = 'spo2' THEN value END) AS avg_spo2
FROM vitals
WHERE patient_id = $patient_id
  AND recorded_at >= now() - INTERVAL '30 minutes'
GROUP BY bucket
ORDER BY bucket;
```

## Row-level security

RLS policies filter rows transparently on every query — no application code can bypass them. Policies are injected at plan time, before engine dispatch, so they apply to vector, graph, FTS, and KV access too.
```sql
-- Records are visible only within the patient's organization
CREATE RLS POLICY patient_org ON patients
  FOR READ USING (org_id = $auth.org_id);

-- Writes are limited to the same organization
CREATE RLS POLICY patient_org_write ON patients
  FOR WRITE USING (org_id = $auth.org_id);

-- Restrictive policy: only clinical roles may read patient records at all
CREATE RLS POLICY clinical_roles_only ON patients
  FOR READ USING ($auth.role = 'physician' OR $auth.role = 'nurse'
               OR $auth.role = 'pharmacist' OR $auth.role = 'admin')
  RESTRICTIVE;
```

Substitution is fail-closed: if the session's auth context lacks a referenced variable, the query rejects the row rather than allowing wide-open access.

## Population health cohort query — Columnar engine

Aggregate queries across millions of patients use the Columnar engine with predicate pushdown and per-column compression.

```sql
CREATE COLLECTION encounters (
  id UUID DEFAULT gen_uuid_v7(),
  patient_id UUID,
  admit_date TIMESTAMP,
  discharge_date TIMESTAMP TIME_KEY
) WITH (engine='columnar');

-- 30-day readmission rate by primary-diagnosis chapter
SELECT SUBSTR(d.icd10_code, 1, 3) AS icd_chapter,
       count(DISTINCT e.patient_id) AS discharges,
       count(DISTINCT r.patient_id) AS readmissions,
       round(
         count(DISTINCT r.patient_id)::numeric
           / nullif(count(DISTINCT e.patient_id), 0) * 100, 1
       ) AS readmit_rate_pct
FROM encounters e
JOIN diagnoses d ON d.encounter_id = e.id AND d.is_primary = true
LEFT JOIN encounters r
  ON r.patient_id = e.patient_id
 AND r.admit_date BETWEEN e.discharge_date AND e.discharge_date + INTERVAL '30 days'
 AND r.id <> e.id
WHERE e.discharge_date >= now() - INTERVAL '90 days'
GROUP BY icd_chapter
ORDER BY readmit_rate_pct DESC;
```

## Clinical note search — Full-Text Search overlay

FTS runs as an overlay on a Document collection, indexing note content without a separate Elasticsearch cluster. Negation uses the `NOT` keyword (at least one positive term required).
```sql
CREATE SEARCH INDEX ON clinical_notes FIELDS note_text ANALYZER 'english';

-- Notes mentioning a symptom cluster, excluding ones flagged resolved
SELECT patient_id, note_date, note_text,
       bm25_score(note_text, 'dyspnea tachycardia NOT resolved') AS score
FROM clinical_notes
WHERE text_match(note_text, 'dyspnea tachycardia NOT resolved')
ORDER BY score DESC
LIMIT 25;
```

NodeDB's RBAC, RLS, and TLS form the access-control perimeter. De-identification, minimum-necessary determinations, and BAA obligations are legal and operational responsibilities — consult your compliance team before exposing data to analytics pipelines.

---

# Geospatial & Logistics (use-cases/geospatial-logistics)

---
title: Geospatial & Logistics
description: Real-time fleet tracking, OGC spatial predicates, graph-based routing, and time-stamped position history — all in NodeDB without PostGIS or a separate graph database.
---

Logistics platforms need at least three data systems: a spatial database for geographic queries, a time-series store for position history, and a graph database for route planning. NodeDB provides all three in one binary — OGC-compliant spatial predicates over an R*-tree index, a `TIME_KEY`-partitioned timeseries engine, and 13 graph algorithms over a CSR adjacency index — and they share a single snapshot.

## Engines used

| Engine | Role |
|---|---|
| **Spatial** | Fleet positions, zone boundaries, proximity queries, OGC predicates |
| **Timeseries** | Position history, speed logs, ETA telemetry |
| **Graph** | Road network, depot-to-destination routing, delivery sequencing |
| **Key-Value** | Live vehicle state, driver assignments, hot-path lookups |

## Real-time fleet tracking

A `GEOMETRY SPATIAL_INDEX` column on a `WITH (engine='spatial')` collection gets an automatic R*-tree. Proximity queries over thousands of vehicles complete in milliseconds.
```sql
CREATE COLLECTION vehicles (
  vehicle_id VARCHAR PRIMARY KEY,
  location GEOMETRY SPATIAL_INDEX,
  heading FLOAT,
  speed_kmh FLOAT,
  updated_at TIMESTAMP
) WITH (engine='spatial');

CREATE COLLECTION zones (
  zone_id VARCHAR PRIMARY KEY,
  name VARCHAR,
  zone_type VARCHAR,
  boundary GEOMETRY SPATIAL_INDEX,
  active BOOLEAN
) WITH (engine='spatial');

-- Update a vehicle's live position on each GPS ping
INSERT INTO vehicles (vehicle_id, location, heading, speed_kmh, updated_at)
VALUES ($vehicle_id, ST_Point($lon, $lat), $heading, $speed, now())
ON CONFLICT (vehicle_id) DO UPDATE SET
  location = EXCLUDED.location,
  heading = EXCLUDED.heading,
  speed_kmh = EXCLUDED.speed_kmh,
  updated_at = EXCLUDED.updated_at;

-- All vehicles within 5 km of a point, recently seen
SELECT vehicle_id,
       ST_Distance(location, ST_Point($lon, $lat)) AS distance_m,
       speed_kmh, heading
FROM vehicles
WHERE ST_DWithin(location, ST_Point($lon, $lat), 5000)
  AND updated_at >= now() - INTERVAL '2 minutes'
ORDER BY distance_m;
```

## Zone containment and geofencing

Delivery zones, restricted areas, and service regions are polygons. Containment is a single OGC predicate.

```sql
-- Which active zone contains this delivery point?
SELECT z.zone_id, z.name, z.zone_type
FROM zones z
WHERE ST_Contains(z.boundary, ST_Point($delivery_lon, $delivery_lat))
  AND z.active = true;

-- Deliveries per zone in the last day (spatial join)
SELECT z.name, count(d.id) AS deliveries
FROM deliveries d
JOIN zones z ON ST_Contains(z.boundary, d.delivery_point)
WHERE d.delivered_at >= now() - INTERVAL '1 day'
GROUP BY z.name
ORDER BY deliveries DESC;
```

## Position history — Timeseries engine

Every GPS ping is appended to a timeseries collection for audit, replay, and ETA recalculation — without bloating the live-position spatial index.
```sql
CREATE COLLECTION position_history (
  ts TIMESTAMP TIME_KEY,
  vehicle_id VARCHAR,
  location GEOMETRY SPATIAL_INDEX,
  speed_kmh FLOAT,
  odometer_km FLOAT
) WITH (engine='timeseries', partition_by='1d', retention='365d');

INSERT INTO position_history (ts, vehicle_id, location, speed_kmh, odometer_km)
VALUES (now(), $vehicle_id, ST_Point($lon, $lat), $speed, $odometer);

-- Reconstruct a vehicle's route for a shift
SELECT ts, location, speed_kmh
FROM position_history
WHERE vehicle_id = $vehicle_id
  AND ts BETWEEN $shift_start AND $shift_end
ORDER BY ts;

-- Average speed per hour to surface traffic patterns
SELECT time_bucket('1h', ts) AS hour,
       vehicle_id,
       avg(speed_kmh) AS avg_speed
FROM position_history
WHERE ts >= now() - INTERVAL '7 days'
GROUP BY hour, vehicle_id;
```

## Graph-based routing

The road network is stored as graph edges overlaid on a document collection. NodeDB's `GRAPH PATH` does hop-shortest paths; `GRAPH ALGO SSSP` (Dijkstra) does weighted single-source shortest paths over an edge cost property — no separate graph database.

```sql
CREATE COLLECTION road_network;  -- holds the nodes; edges are an overlay

-- Each road segment is a weighted edge
GRAPH INSERT EDGE IN 'road_network' FROM $node_a TO $node_b TYPE 'road'
  PROPERTIES { distance_m: 120, travel_time_s: 14 };

-- Shortest hop path between two nodes along 'road' edges
GRAPH PATH FROM $depot_node TO $dest_node MAX_DEPTH 50 LABEL 'road';

-- Weighted shortest paths from a depot to every reachable node (Dijkstra)
GRAPH ALGO SSSP ON road_network FROM $depot_node;
```

## Delivery sequencing — cross-engine query

Combine spatial proximity (which deliveries fall in which zone) with a sequencing pass (visit order within each zone) in one query.
```sql
WITH zone_deliveries AS (
  SELECT d.id AS delivery_id, d.address_point, z.zone_id, z.depot_point
  FROM deliveries d
  JOIN zones z ON ST_Contains(z.boundary, d.address_point)
  WHERE d.scheduled_date = current_date AND d.status = 'pending'
)
SELECT zone_id,
       row_number() OVER (
         PARTITION BY zone_id
         ORDER BY ST_Distance(address_point, depot_point)
       ) AS visit_order,
       delivery_id,
       address_point
FROM zone_deliveries
ORDER BY zone_id, visit_order;
```

Spatial and Graph share the same global surrogate identity space, so you can match a road node by geographic proximity (Spatial R*-tree) and immediately run a traversal or `GRAPH ALGO SSSP` from it — no ID translation, no network round-trip between engines.

## Live ETAs via LIVE SELECT

Push position changes to rider apps the moment a vehicle moves — no polling.

```sql
-- The rider app subscribes to one vehicle's position stream
LIVE SELECT vehicle_id, location, speed_kmh
FROM vehicles
WHERE vehicle_id = $vehicle_id;

-- On each GPS ping the subscriber receives the new row, recomputes the ETA
-- against the routing graph, and pushes it to the app.
```

---

# Gaming & Social (use-cases/gaming-social)

---
title: Gaming & Social
description: Social graphs, real-time leaderboards, offline-first mobile with CRDT sync, and live change streams — without Redis, Neo4j, or a separate pub/sub broker.
---

Gaming and social platforms stress databases in ways most workloads do not: millions of concurrent writes, graph traversals for friend feeds and matchmaking, sub-millisecond leaderboard reads, and mobile clients that must work offline and sync automatically. NodeDB handles all of these as peer engines and overlays in a single deployment.
## Engines used

| Engine | Role |
|---|---|
| **Graph** | Friend networks, guild membership, match-history traversal |
| **Key-Value** | Leaderboards (sorted index), session tokens, inventory hot path |
| **Document (schemaless)** | Player profiles, game state, match records with CRDT sync |
| **Timeseries** | Match telemetry, event logs, achievement timestamps |
| **Full-Text Search** | Player search by username, guild discovery |

## Social graph — friends, guilds, follows

Graph edges are an overlay on a document collection — here `players`. The CSR adjacency index handles millions of nodes and edges; friend-of-friend, mutual connections, and guild traversals run as Cypher-subset `MATCH` patterns.

```sql
CREATE COLLECTION players;

-- Add a friendship (bidirectional → two edges)
GRAPH INSERT EDGE IN 'players' FROM $player_a TO $player_b TYPE 'friend';
GRAPH INSERT EDGE IN 'players' FROM $player_b TO $player_a TYPE 'friend';

-- Friends-of-friends: people you might know (2 hops, excluding direct friends)
MATCH (me:Player)-[:friend]->(friend:Player)-[:friend]->(candidate:Player)
WHERE me.id = $me
  AND candidate.id <> $me
  AND NOT EXISTS { MATCH (me)-[:friend]->(candidate) }
RETURN DISTINCT candidate.id, candidate.username
LIMIT 20;

-- Guild members reachable through a player's guilds
MATCH (me:Player)-[:member_of]->(guild:Guild)<-[:member_of]-(member:Player)
WHERE me.id = $me AND member.id <> $me
RETURN member.id, member.username, guild.name
LIMIT 50;
```

## Real-time leaderboard — sorted index

A `SORTED INDEX` on the score column gives O(log n) rank lookups and top-k scans without a Redis sidecar. The leaderboard functions (`RANK`, `TOPK`, `SORTED_COUNT`) operate directly on it.
```sql
CREATE COLLECTION scores (player_id TEXT PRIMARY KEY) WITH (engine='kv');
CREATE SORTED INDEX lb ON scores (score DESC) KEY player_id;

-- Record a player's score after a match, keeping their best
INSERT INTO scores (player_id, score)
VALUES ($player_id, $new_score)
ON CONFLICT (player_id) DO UPDATE SET
  score = greatest(scores.score, EXCLUDED.score);

-- Top 100 global ranks
SELECT * FROM TOPK('lb', 100);

-- A single player's rank, and the total leaderboard size
SELECT RANK('lb', $player_id) AS rank, SORTED_COUNT('lb') AS total;

-- Everyone in a score band (e.g. a bracket)
SELECT * FROM RANGE('lb', $min_score, $max_score);
```

## Offline-first mobile — CRDT sync

Mobile games must function without a network connection. NodeDB-Lite runs the same engines embedded on iOS and Android; declaring a `conflict_policy` on a collection makes its writes mergeable, and Lite syncs deltas to Origin automatically when connectivity returns.

```sql
-- Mergeable player state: concurrent edits to different fields both survive
CREATE COLLECTION player_state WITH (conflict_policy = 'field_merge');

-- The device subscribes only to its own slice of the data
SUBSCRIBE SHAPE ON player_state WHERE player_id = $me;

-- Offline: the player collects items, finishes quests — ordinary local writes
UPSERT INTO player_state {
  player_id: $me,
  inventory: $inventory,
  quests: $quests,
  updated_at: now()
};

-- When the network returns, NodeDB-Lite ships the CRDT deltas to Origin. The
-- collection's conflict_policy decides the outcome: 'field_merge' merges
-- concurrent edits to different fields; same-field clashes fall back to LWW.
```

`lww` (last-writer-wins) is the default; `field_merge` keeps concurrent edits to different fields. Set it at creation or with `ALTER COLLECTION player_state SET conflict_policy = 'field_merge'`. See the [Conflict Policies docs](/docs/crdt-sync/conflict-policies).
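The `field_merge` semantics can be pictured with a minimal Python sketch. This is illustrative only — NodeDB's real sync ships Loro CRDT deltas, not whole documents — and the `field_merge` function, the `(value, write_ts)` tuples, and the sample documents are all hypothetical stand-ins:

```python
# Illustrative sketch of 'field_merge' semantics: concurrent edits to
# different fields both survive; edits to the same field fall back to
# last-writer-wins on a per-field write timestamp.

def field_merge(base, ours, theirs):
    """Merge two divergent versions of one document, field by field.

    Each version maps field -> (value, write_ts); 'base' is the common
    ancestor both replicas started from.
    """
    merged = dict(base)
    for field in set(ours) | set(theirs):
        a, b = ours.get(field), theirs.get(field)
        if a is None:
            merged[field] = b            # only the other side touched it
        elif b is None:
            merged[field] = a
        else:
            # Same-field clash: last-writer-wins on the write timestamp
            merged[field] = a if a[1] >= b[1] else b
    return merged

base   = {"inventory": (["sword"], 100), "quests": (["intro"], 100)}
device = {"inventory": (["sword", "potion"], 205), "quests": (["intro"], 100)}
server = {"inventory": (["sword"], 100), "quests": (["intro", "cave"], 210)}

merged = field_merge(base, device, server)
print(merged["inventory"][0], merged["quests"][0])
# → ['sword', 'potion'] ['intro', 'cave']
```

The device's inventory edit and the server's quest edit touch different fields, so both survive the merge — exactly the behaviour the `field_merge` policy promises; under `lww` the whole document from the later writer would win instead.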
## Live match updates — LIVE SELECT

Push score updates and match events to spectators and dashboards in real time without a separate pub/sub broker. `LIVE SELECT` registers a filtered query and streams matching rows over pgwire, WebSocket, or the native protocol.

```sql
-- A spectator subscribes to one match's event feed
LIVE SELECT * FROM match_events WHERE match_id = $match_id;

-- The match engine inserts events as they occur:
INSERT INTO match_events (match_id, event_type, player_id, data, ts)
VALUES ($match_id, 'kill', $player_id, $event_data, now());

-- Every subscriber receives the row immediately. No polling, no WebSocket
-- server. Cancel with: CANCEL LIVE SELECT ;
```

## Match history and telemetry — Timeseries engine

Every match generates thousands of telemetry events. The Timeseries engine ingests them at high throughput and makes them queryable for replay, anti-cheat, and balance analytics.

```sql
CREATE COLLECTION match_telemetry (
  ts TIMESTAMP TIME_KEY,
  match_id UUID,
  player_id UUID,
  event VARCHAR,
  data JSON
) WITH (engine='timeseries', partition_by='1d', retention='90d');

INSERT INTO match_telemetry (ts, match_id, player_id, event, data)
VALUES (now(), $match_id, $player_id, 'damage_dealt', '{"amount": 342, "weapon": "rifle"}');

-- Average damage per weapon in a match — for balance analysis
SELECT data ->> 'weapon' AS weapon,
       avg((data ->> 'amount')::int) AS avg_damage,
       count(*) AS events
FROM match_telemetry
WHERE match_id = $match_id AND event = 'damage_dealt'
GROUP BY weapon
ORDER BY avg_damage DESC;
```

## Player search — Full-Text Search overlay

Username and guild discovery uses the FTS overlay on the `players` collection — no Elasticsearch cluster.
```sql
CREATE SEARCH INDEX ON players FIELDS username, bio ANALYZER 'simple' FUZZY true;

SELECT id, username, bio,
       bm25_score(username, $term) AS score
FROM players
WHERE text_match(username, $term, { fuzzy: true, distance: 1 })
ORDER BY score DESC
LIMIT 10;
```

---

# IoT & Edge (use-cases/iot-edge)

---
title: IoT & Edge
description: High-throughput sensor ingestion, ND array tiles for batch telemetry, retention policies, and CRDT edge-to-cloud sync via NodeDB-Lite — one stack from device to cloud.
---

IoT stacks typically span a time-series database for sensor data, an object store or array database for bulk telemetry, and a synchronisation layer for edge devices. NodeDB covers all three: the Timeseries engine handles streaming ingestion, the Array engine stores multi-dimensional batch telemetry (removing the need for separate InfluxDB and TileDB deployments), and NodeDB-Lite runs the same engines embedded on edge hardware, syncing deltas to Origin over CRDT when connectivity allows.

## Engines used

| Engine | Role |
|---|---|
| **Timeseries** | Streaming sensor ingestion, continuous aggregates, retention |
| **Array** | ND telemetry tiles (spectrometry, radar, camera frames), batch analytics |
| **Document (schemaless)** | Device registry, configuration, alert rules with CRDT sync |
| **Key-Value** | Last-known device state, command queues |

## Streaming sensor ingestion — Timeseries engine

The Timeseries engine is append-only, with a `TIME_KEY` column driving partition-by-time and block skip, plus per-collection retention.
```sql CREATE COLLECTION sensor_readings ( ts TIMESTAMP TIME_KEY, device_id VARCHAR, metric VARCHAR, value FLOAT, unit VARCHAR ) WITH (engine='timeseries', partition_by='1d', retention='180d'); -- Bulk import (NDJSON / JSON array / CSV auto-detected) COPY sensor_readings FROM '/var/spool/readings.ndjson'; -- Or INSERT for low-volume devices INSERT INTO sensor_readings (ts, device_id, metric, value, unit) VALUES (now(), 'device-001', 'temperature', 23.4, 'C'), (now(), 'device-001', 'humidity', 61.2, '%'), (now(), 'device-001', 'co2_ppm', 412, 'ppm'); -- 15-minute summaries SELECT time_bucket('15m', ts) AS bucket, device_id, avg(CASE WHEN metric = 'temperature' THEN value END) AS avg_temp, max(CASE WHEN metric = 'co2_ppm' THEN value END) AS max_co2 FROM sensor_readings WHERE ts >= now() - INTERVAL '24 hours' GROUP BY bucket, device_id ORDER BY bucket DESC; ``` For line-protocol producers (Telegraf, Vector), enable the ILP listener (`ports.ilp = 8086`) and push directly: ```bash echo "env,device=device-001 temperature=23.4,humidity=61.2 1735689600000000000" | nc localhost 8086 ``` ## Multi-dimensional telemetry — Array engine Instruments like spectrometers, LiDAR scanners, and thermal cameras produce ND arrays of readings. The Array engine stores them in compressed Z-order tiles with per-tile ND-MBR statistics — replacing a separate TileDB or Zarr deployment. It uses its own DDL family (`CREATE ARRAY`), not `CREATE COLLECTION`. ```sql -- A 3D spectrometry array: (device, wavelength, time). Dimensions are -- integer-typed with half-open domains [lo, hi). 
CREATE ARRAY spectral_readings
DIMS (
  device_id     INT64 DOMAIN [0, 100000),
  wavelength_nm INT32 DOMAIN [300, 1100),
  ts_epoch      INT64 DOMAIN [0, 9223372036854775807)
)
ATTRS (intensity FLOAT32, noise_floor FLOAT32)
TILE_EXTENTS (1, 128, 3600)
WITH (cell_order = 'Z-ORDER', audit_retain_ms = 7776000000); -- 90 days

-- Insert cells from a scan session
INSERT INTO ARRAY spectral_readings (device_id, wavelength_nm, ts_epoch, intensity, noise_floor) VALUES
  ($dev, 500, $t, 0.41, 0.02),
  ($dev, 501, $t, 0.43, 0.02),
  ($dev, 502, $t, 0.40, 0.02);

-- Force in-memory tiles to durable storage
SELECT ARRAY_FLUSH('spectral_readings');

-- Slice: wavelengths 500–600 nm for one device over a time window
SELECT wavelength_nm, avg(intensity) AS avg_intensity
FROM ARRAY_SLICE(
  'spectral_readings',
  { device_id: [$dev, $dev + 1), wavelength_nm: [500, 600), ts_epoch: [$t0, $t1) },
  ['intensity']
)
GROUP BY wavelength_nm
ORDER BY wavelength_nm;

-- Reduce a dimension: total intensity per wavelength, collapsing time
SELECT * FROM ARRAY_AGG('spectral_readings', 'intensity', 'SUM', 'ts_epoch');
```

## Edge deployment — NodeDB-Lite

Edge gateways run NodeDB-Lite: the full engine set in an embeddable library with no network dependencies. Data is written locally and synced to Origin via Loro CRDT deltas when connectivity is available.
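The delta exchange can be pictured with a small last-writer-wins merge. This is an illustrative Python sketch of the `lww` conflict policy, not the Loro implementation, and the clock-plus-replica version stamp is an assumption made for the example:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldVersion:
    """One field write: a value plus a (clock, replica) version stamp."""
    value: object
    clock: int    # logical timestamp of the write
    replica: str  # deterministic tiebreaker when clocks are equal

def lww_merge(local: dict, remote: dict) -> dict:
    """Field-level last-writer-wins: keep the higher-versioned write per
    field. The merge is commutative and idempotent, so edge and Origin
    converge no matter the order in which queued deltas arrive."""
    merged = dict(local)
    for field, rv in remote.items():
        lv = merged.get(field)
        if lv is None or (rv.clock, rv.replica) > (lv.clock, lv.replica):
            merged[field] = rv
    return merged
```

Because the merge is order-independent, an edge device can replay deltas queued during an outage without coordination; conflicting writes resolve the same way on both sides.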
```sql
-- On the edge device (NodeDB-Lite, embedded in a Rust/C/Swift process)
-- Local timeseries writes survive network outages
INSERT INTO local_readings (ts, sensor_id, metric, value)
VALUES (now(), $sensor_id, 'vibration_g', $g);

-- Device configuration arrives via a shape subscription scoped to this device
SUBSCRIBE SHAPE ON device_config WHERE device_id = $me;
SELECT value FROM device_config WHERE key = 'alert_thresholds';
```

```swift
// iOS / embedded Swift
let db = NodeDbLite.open(path: "edge.db")
db.execute("INSERT INTO local_readings ...")
db.sync(url: "wss://origin.example.com/sync", token: authToken)
```

CRDT sync is transparent — the edge process doesn't implement retry logic, conflict resolution, or delta tracking. Declare a `conflict_policy` on the collection (`lww` or `field_merge`) and the engine produces and ships the deltas; Origin validates SQL constraints at Raft commit and replies with a `CompensationHint` if a write loses.

## Last-known state — Key-Value engine

The KV engine stores the current state of every device for sub-millisecond dashboard reads without scanning the timeseries collection.

```sql
CREATE COLLECTION device_state (key TEXT PRIMARY KEY) WITH (engine='kv');

-- Updated on every sensor publish
UPSERT INTO device_state { key: $device_id, last_seen: now(), status: $status, battery_pct: $batt };

-- Dashboard: current state of many devices at once
SELECT * FROM device_state WHERE key = ANY($device_ids);
```

## Alerting via LIVE SELECT

Trigger downstream alert handlers the moment a sensor value crosses a threshold — no polling loop.

```sql
-- Subscribe to readings that breach the CO2 alert threshold
LIVE SELECT device_id, value, ts
FROM sensor_readings
WHERE metric = 'co2_ppm' AND value > 1000;

-- The handler receives each qualifying row immediately and fans out to
-- PagerDuty, SMS, or a dashboard socket. Cancel with: CANCEL LIVE SELECT ;
```

## Retention and downsampling

Raw readings age out via the collection's `retention`.
A continuous aggregate keeps incrementally-maintained hourly summaries — no cron job, no separate pipeline.

```sql
CREATE CONTINUOUS AGGREGATE sensor_hourly ON sensor_readings AS
SELECT time_bucket('1h', ts) AS bucket,
       device_id,
       metric,
       avg(value) AS avg_val,
       min(value) AS min_val,
       max(value) AS max_val
FROM sensor_readings
GROUP BY bucket, device_id, metric
WITH (refresh_interval = '5m');
```

## Why not InfluxDB + TileDB?

A line-protocol time-series database can't store multi-dimensional instrument output, so you bolt on TileDB or Zarr — and now telemetry lives in two systems with two query languages and no shared identity. NodeDB keeps streaming readings and ND array tiles in one process: a query can prefilter cells by the same surrogate-identity bitmaps the rest of the engines use, and the edge runs the identical engine set embedded.

---

# E-commerce & Search (use-cases/ecommerce-search)

---
title: E-commerce & Search
description: Hybrid BM25 + vector product search, RRF fusion ranking, recommendation graphs, and real-time inventory analytics — without Elasticsearch, a vector database, or a separate OLAP cluster.
---

E-commerce search stacks are notoriously complex: Elasticsearch for keyword search, a vector database for semantic similarity, a recommendation engine backed by a graph database, and a data warehouse for sales analytics. NodeDB covers all four with its peer engines and the Graph and FTS overlays, sharing one storage core, one query planner, and one operational surface.
## Engines used

| Engine | Role |
|---|---|
| **Full-Text Search** | BM25 keyword search, faceted filtering, autocomplete |
| **Vector** | Semantic product search, visual similarity, embedding-based recommendations |
| **Graph** | "Customers also bought", category navigation, brand-product relationships |
| **Document (schemaless)** | Product catalog, flexible attributes, variant metadata |
| **Columnar** | Sales analytics, inventory reporting, cohort analysis |

## Hybrid product search

The strongest e-commerce search combines keyword precision with semantic recall, then fuses the ranked lists. `rrf_score(...)` does the fusion in the planner — no application-side merge, no Elasticsearch + Pinecone glue code.

```sql
CREATE COLLECTION products;
CREATE SEARCH INDEX ON products FIELDS title, description ANALYZER 'english' FUZZY true;

-- One vector index per embedding column on the same collection.
CREATE VECTOR INDEX idx_text ON products (text_embedding) METRIC cosine DIM 384;
CREATE VECTOR INDEX idx_image ON products (image_embedding) METRIC cosine DIM 512;

-- Hybrid search: BM25 keyword + vector semantic, fused with RRF
SELECT id, title, price, brand,
       rrf_score(
         vector_distance(text_embedding, $query_vec),
         bm25_score(description, $query_text)
       ) AS relevance
FROM products
WHERE in_stock = true
  AND price BETWEEN $min_price AND $max_price
  AND category_id = ANY($category_ids)
ORDER BY relevance DESC
LIMIT 20;
```

## Autocomplete and faceted filtering

```sql
-- Autocomplete: fuzzy title match, ranked by popularity
SELECT id, title
FROM products
WHERE text_match(title, $prefix, { fuzzy: true, distance: 1 })
ORDER BY popularity_score DESC
LIMIT 8;

-- Facet counts: products per brand in the current keyword result set
SELECT brand, count(*) AS count
FROM products
WHERE text_match(description, $query_text) AND in_stock = true
GROUP BY brand
ORDER BY count DESC
LIMIT 15;
```

## Visual similarity search

For fashion, furniture, and home goods, image-embedding similarity
lets shoppers find products that look alike — searching the `image_embedding` index created above, right alongside the catalog row:

```sql
-- "Shop similar items" — products whose image embedding is closest to the
-- viewed one, restricted to the same in-stock category.
SELECT id, title, price, image_url,
       image_embedding <=> $viewed_vec AS visual_distance
FROM products
WHERE category_id = $category_id
  AND in_stock = true
  AND id <> $viewed_product_id
ORDER BY image_embedding <=> $viewed_vec
LIMIT 12;
```

## Recommendation graph — "Customers also bought"

Co-purchase relationships are graph edges layered on the `products` collection — graph is an overlay, so you don't create a separate "graph collection." Traversal-based recommendations run alongside the catalog.

```sql
-- A batch job recomputes co-purchase counts and (re)writes the edge with the
-- current weight — DELETE then INSERT, since GRAPH INSERT EDGE is insert-only.
GRAPH DELETE EDGE IN 'products' FROM $product_a TO $product_b TYPE 'co_purchased';
GRAPH INSERT EDGE IN 'products' FROM $product_a TO $product_b TYPE 'co_purchased'
  PROPERTIES { weight: $copurchase_count };

-- "Customers also bought": direct co-purchase neighbours of a product,
-- ranked by co-purchase weight. Graph nodes are document rows, so their
-- catalog fields are available in RETURN.
MATCH (seed:Product)-[e:co_purchased]->(rec:Product)
WHERE seed.id = $product_id AND rec.in_stock = true
RETURN rec.id, rec.title, rec.price, e.weight AS affinity
ORDER BY e.weight DESC
LIMIT 8;
```

## Real-time inventory — LIVE SELECT

Push in-stock / out-of-stock transitions to product pages without polling. `LIVE SELECT` registers a query and streams matching changes over pgwire, WebSocket, or the native protocol.
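The mechanics are easy to picture as a register-and-match loop. This is a toy Python model with invented names (`LiveQueryBus` and friends are hypothetical, for illustration only; the real engine evaluates predicates inside the storage layer and streams over its wire protocols):

```python
from typing import Callable

Row = dict
Predicate = Callable[[Row], bool]
Callback = Callable[[Row], None]

class LiveQueryBus:
    """Toy model of LIVE SELECT: each subscription is a predicate plus a
    callback; every committed row change is tested against all registered
    predicates and pushed to matching subscribers, instead of polled for."""

    def __init__(self) -> None:
        self._subs: dict[int, tuple[Predicate, Callback]] = {}
        self._next_id = 0

    def live_select(self, predicate: Predicate, on_row: Callback) -> int:
        self._next_id += 1
        self._subs[self._next_id] = (predicate, on_row)
        return self._next_id  # handle used to cancel the subscription

    def cancel(self, sub_id: int) -> None:
        self._subs.pop(sub_id, None)

    def commit(self, row: Row) -> None:
        for predicate, on_row in list(self._subs.values()):
            if predicate(row):
                on_row(row)

# A wishlist subscription: notify when a watched product is in stock.
bus = LiveQueryBus()
hits: list[Row] = []
wishlist = {"p1", "p9"}
sub = bus.live_select(
    lambda r: r["id"] in wishlist and r["in_stock"],
    hits.append,
)
bus.commit({"id": "p1", "in_stock": True})  # delivered to the subscriber
bus.commit({"id": "p2", "in_stock": True})  # ignored: not on the wishlist
```

The point of the sketch is the inversion of control: the shopper's page holds an open subscription and receives rows, rather than re-running the wishlist query on a timer.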
```sql
-- A shopper viewing their wishlist subscribes to back-in-stock events
LIVE SELECT id, title
FROM products
WHERE id = ANY($wishlist_ids) AND in_stock = true;

-- When a back-ordered item is restocked, every subscriber gets the row
-- delivered immediately. Cancel with: CANCEL LIVE SELECT ;
```

## Sales analytics — Columnar engine

The Columnar engine handles aggregation-heavy analytics across millions of orders without a separate data warehouse — per-column compression, block statistics, and predicate pushdown.

```sql
CREATE COLLECTION orders (
  id TEXT PRIMARY KEY,
  placed_at TIMESTAMP TIME_KEY,
  status TEXT
) WITH (engine='columnar');

-- Revenue by category, last 30 days
SELECT c.name AS category,
       count(DISTINCT o.id) AS orders,
       sum(oi.quantity * oi.price) AS revenue,
       avg(oi.quantity * oi.price) AS avg_order_value
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
JOIN products p ON p.id = oi.product_id
JOIN categories c ON c.id = p.category_id
WHERE o.placed_at >= now() - INTERVAL '30 days'
  AND o.status = 'completed'
GROUP BY c.name
ORDER BY revenue DESC;

-- Conversion funnel: views → add-to-cart → purchase
SELECT time_bucket('1d', e.ts) AS day,
       count(CASE WHEN e.event = 'product_view' THEN 1 END) AS views,
       count(CASE WHEN e.event = 'add_to_cart' THEN 1 END) AS add_to_cart,
       count(CASE WHEN e.event = 'purchase' THEN 1 END) AS purchases,
       round(
         count(CASE WHEN e.event = 'purchase' THEN 1 END)::numeric
         / nullif(count(CASE WHEN e.event = 'product_view' THEN 1 END), 0) * 100,
         2
       ) AS conversion_pct
FROM events e
WHERE e.ts >= now() - INTERVAL '14 days'
GROUP BY day
ORDER BY day;
```

## Coarse-to-fine ranking with `query_dim`

Store full-precision embeddings and rank on the first N dimensions for a fast first pass via the `query_dim` tuning argument, then re-rank the top candidates at full precision — no reindexing:

```sql
SELECT id, title,
       vector_distance(text_embedding, $query_vec, query_dim => 256) AS distance
FROM products
WHERE in_stock = true
ORDER BY distance
LIMIT 100;
```

## Why not Elasticsearch + a vector database?

Elasticsearch handles keyword search but requires a separate vector database for semantic recall, a graph database for recommendations, and a data warehouse for analytics. Synchronising product-catalog changes across four systems is the dominant operational burden. NodeDB keeps every engine in the same process: a hybrid search + recommendation + inventory check is one SQL statement, not four API calls.
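For comparison, the application-side merge loop that `rrf_score(...)` eliminates is standard Reciprocal Rank Fusion: each ranked list contributes `1 / (k + rank)` per document, and documents are reordered by the summed score. A minimal Python sketch (the product ids and the conventional default k = 60 are illustrative):

```python
def rrf_fuse(rankings: list[list[str]], k: float = 60.0) -> list[str]:
    """Reciprocal Rank Fusion: every list contributes 1 / (k + rank) for
    each id it ranks (rank starts at 1); ids come back sorted by the
    summed score, so items strong in either list surface near the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["p7", "p3", "p9", "p1"]  # ANN order
bm25_hits = ["p3", "p1", "p7", "p5"]    # keyword order
fused = rrf_fuse([vector_hits, bm25_hits])
# "p3" and "p7" appear high in both lists, so they lead the fused result.
```

Running this in the planner rather than the application is the whole argument: no second network round trip, no id-reconciliation code, and the fused score is available to `ORDER BY` like any other expression.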