Array Engine
NodeDB's array engine stores multi-dimensional sparse data with bitemporal support: system time (when a cell was written) and valid time (the period the cell's value applies to). Cells are indexed by coordinate tuple, grouped into tiles, compressed per tile, and queried through SQL table-valued functions.
The array engine is a peer of the other NodeDB engines, but it has its own DDL family (CREATE ARRAY) rather than using CREATE COLLECTION ... WITH (engine='array').
When to Use
- Genomics: (chromosome × position × sample × allele) — replaces TileDB-VCF
- Single-cell biology: (gene × cell × condition × replicate) — replaces TileDB-SOMA
- Earth observation: (lat × lon × band × time) raster cubes — replaces Zarr / TileDB-Geo
- Climate models: (lat × lon × level × time × variable) — replaces HDF5 + Dask
- Astronomy: (RA × Dec × wavelength × time) — replaces custom Zarr stacks
- Sparse ML features: (user × item × context) — replaces specialized matrix-factorization systems
Key Features
- ND coordinate-tuple keying — arbitrary number of dimensions; only materialized cells are stored
- Tile-based compression — cells grouped into tiles; each tile independently compressed (ALP, FastLanes, Gorilla, LZ4 via nodedb-codec)
- Z-order indexing — Hilbert/Z-order curve linearization for spatial locality and fast range queries
- Per-tile ND MBR statistics — minimum bounding rectangle skip; queries prune entire tiles before decompressing
- Bitemporal — both system time (audit trail) and valid time (temporal semantics) tracked per tile
- Configurable cell order — cell_order chosen at creation ('Z-ORDER' or 'ROW-MAJOR')
- Cross-engine surrogate identity — array cells participate in cross-engine bitmap intersections alongside vector / graph / document / columnar
- Distributed — tiles vShard-routed; queries scatter-gather across cores and nodes
- WAL-durable + Raft-replicated — same durability guarantees as the rest of NodeDB
- Tile-level retention — audit_retain_ms enables GDPR / data-minimization compliance
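To make the Z-order linearization in the list above concrete, here is a minimal Python sketch of Morton (bit-interleaving) encoding for non-negative integer coordinates. It shows the general technique only: the function name morton_encode, the 21-bit default width, and the axis ordering are assumptions chosen for illustration, not NodeDB's implementation, and a Hilbert curve would use a different mapping.

```python
def morton_encode(coords, bits=21):
    """Interleave the low `bits` bits of each coordinate into one integer key."""
    for value in coords:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"coordinate {value} does not fit in {bits} bits")
    ndim = len(coords)
    code = 0
    for bit in range(bits):
        for axis, value in enumerate(coords):
            # Bit `bit` of axis `axis` lands at position bit * ndim + axis.
            code |= ((value >> bit) & 1) << (bit * ndim + axis)
    return code


# Sorting cells by key gives the order in which a Z-order layout would visit them.
keys = sorted((morton_encode((x, y)), (x, y)) for x in range(2) for y in range(2))
print(keys)
```

Sorting by the resulting key keeps cells that are close in coordinate space close in storage order, which is the property a range query benefits from.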
DDL Syntax
CREATE ARRAY spatial_grid
  DIMS (
    x INT64 DOMAIN [0, 1000),
    y INT64 DOMAIN [0, 1000),
    z INT64 DOMAIN [0, 1000)
  )
  ATTRS (
    temperature FLOAT32,
    pressure FLOAT32,
    humidity FLOAT32
  )
  TILE_EXTENTS (64, 64, 64)
  WITH (
    cell_order = 'Z-ORDER',
    audit_retain_ms = 86400000
  );
| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| DIMS | Yes | — | Dimensions. Each has a name, type (INT32, INT64, FLOAT64), and half-open domain [lo, hi). |
| ATTRS | Yes | — | Attributes (cell values). Each has a name and type (FLOAT32, FLOAT64, INT32, INT64, STRING). |
| TILE_EXTENTS | Yes | — | Tile extent per dimension; all > 0. Determines cell locality and compression block granularity. |
| cell_order | No | 'Z-ORDER' | 'Z-ORDER' (space-filling-curve linearization) or 'ROW-MAJOR'. Affects spatial cache locality. |
| audit_retain_ms | No | NULL | Tiles older than now - audit_retain_ms (system time) become eligible for purge. NULL = keep all. |
ALTER NDARRAY <name> SET (audit_retain_ms = ...) updates retention; DROP ARRAY <name> is two-phase like DROP COLLECTION.
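As a rough illustration of how TILE_EXTENTS partitions a domain, the Python sketch below maps a cell coordinate to a per-dimension tile index and counts the tiles along each dimension. It assumes integer dimensions and tiles aligned to the lower bound of each domain; both the alignment rule and the helper names (tile_index, tiles_per_dimension) are assumptions for illustration, not documented NodeDB behavior.

```python
def tile_index(coord, domain_lo, tile_extents):
    """Return the per-dimension index of the tile that contains `coord`."""
    return tuple((c - lo) // ext for c, lo, ext in zip(coord, domain_lo, tile_extents))


def tiles_per_dimension(domain, tile_extents):
    """Count tiles along each dimension for half-open domains [lo, hi)."""
    # Ceiling division: a partial tile at the upper edge still counts as a tile.
    return tuple(-(-(hi - lo) // ext) for (lo, hi), ext in zip(domain, tile_extents))


# Values taken from the spatial_grid example above.
domain = [(0, 1000), (0, 1000), (0, 1000)]
extents = (64, 64, 64)
print(tile_index((130, 5, 999), (0, 0, 0), extents))  # (2, 0, 15)
print(tiles_per_dimension(domain, extents))           # (16, 16, 16)
```

Smaller extents give finer pruning but more tiles to track; larger extents compress in bigger blocks but decompress more cells per hit.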
Insert
CREATE ARRAY elevation_map
  DIMS (
    lon FLOAT64 DOMAIN [-180, 180),
    lat FLOAT64 DOMAIN [-90, 90)
  )
  ATTRS (height FLOAT32)
  TILE_EXTENTS (256, 256);
INSERT INTO ARRAY elevation_map (lon, lat, height) VALUES
  (-73.5, 40.7, 10.5),
  (-73.6, 40.8, 12.3),
  (-73.7, 40.6, 8.9);
-- Force the in-memory tiles to durable storage
SELECT NDARRAY_FLUSH('elevation_map');
Query Functions
Array queries are expressed as table-valued functions in FROM. System time and valid time apply via AS OF clauses.
NDARRAY_SLICE — multi-dimensional range
SELECT * FROM NDARRAY_SLICE(
  'elevation_map',
  {lon: [-74.0, -73.0), lat: [40.0, 41.0)},
  ['height'],  -- attribute projection (optional)
  1000         -- max cells (optional)
);
| Parameter | Required | Type | Description |
| --- | --- | --- | --- |
| array | Yes | STRING | Array name |
| bounds | Yes | OBJECT | { dim: [lo, hi) }. Omitted dims = full range. |
| attrs | No | ARRAY[STRING] | Attributes to project. NULL = all attributes. |
| limit | No | INT64 | Max cells returned. NULL = no limit. |
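The parameter semantics above (half-open bounds, omitted dimensions meaning full range, optional projection and limit) can be modelled in a few lines of Python. This is a sketch of the described behavior over an in-memory dict of sparse cells, not the engine's code path; the function slice_cells and the third sample cell are hypothetical and exist only to show the exclusive upper bound.

```python
def slice_cells(cells, dim_names, bounds, attrs=None, limit=None):
    """Model of a half-open range slice over sparse cells.

    cells: dict mapping coordinate tuples to {attribute: value} dicts.
    bounds: {dim: (lo, hi)}; a dimension left out is unconstrained.
    """
    out = []
    for coord, values in cells.items():
        if limit is not None and len(out) >= limit:
            break
        inside = all(
            bounds[dim][0] <= c < bounds[dim][1]
            for dim, c in zip(dim_names, coord)
            if dim in bounds
        )
        if inside:
            picked = values if attrs is None else {a: values[a] for a in attrs}
            out.append((coord, picked))
    return out


cells = {
    (-73.5, 40.7): {"height": 10.5},
    (-73.6, 40.8): {"height": 12.3},
    (-73.0, 40.5): {"height": 7.0},  # illustrative cell: lon == hi, so [lo, hi) excludes it
}
hits = slice_cells(
    cells, ["lon", "lat"], {"lon": (-74.0, -73.0), "lat": (40.0, 41.0)}, ["height"]
)
print(hits)
```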
NDARRAY_PROJECT — attribute projection
SELECT * FROM NDARRAY_PROJECT('spatial_grid', ['temperature', 'pressure']);
NDARRAY_AGG — reduce a dimension
Aggregates an attribute over a dimension, reducing dimensionality:
-- Sum temperature over x; result keeps y and z
SELECT * FROM NDARRAY_AGG('spatial_grid', 'temperature', 'SUM', 'x');
Reducers: 'SUM', 'AVG', 'MIN', 'MAX', 'COUNT'.
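To illustrate what reducing a dimension means for sparse data, the Python sketch below groups cells by their remaining coordinates and applies one of the reducers listed above. It aggregates only materialized cells; how the engine treats missing cells or NULLs is not stated in this document, so that choice, along with the helper name reduce_dimension and the sample values, is an assumption.

```python
from collections import defaultdict

REDUCERS = {
    "SUM": sum,
    "AVG": lambda vals: sum(vals) / len(vals),
    "MIN": min,
    "MAX": max,
    "COUNT": len,
}


def reduce_dimension(cells, dim_names, attr, reducer, dim):
    """Collapse `dim`, aggregating `attr` over cells that share the other coordinates."""
    axis = dim_names.index(dim)
    groups = defaultdict(list)
    for coord, values in cells.items():
        remaining = coord[:axis] + coord[axis + 1:]  # drop the reduced dimension
        groups[remaining].append(values[attr])
    fn = REDUCERS[reducer]
    return {key: fn(vals) for key, vals in groups.items()}


cells = {
    (0, 0, 0): {"temperature": 1.0},
    (1, 0, 0): {"temperature": 2.0},
    (0, 1, 0): {"temperature": 5.0},
}
# Sum over x; the result is keyed by (y, z).
print(reduce_dimension(cells, ["x", "y", "z"], "temperature", "SUM", "x"))
```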
NDARRAY_ELEMENTWISE — between two arrays of the same shape
SELECT * FROM NDARRAY_ELEMENTWISE('current_grid', 'baseline_grid', 'SUBTRACT', 'temperature');
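A small Python model of an element-wise operation over two sparse arrays is shown below. It combines only coordinates that are materialized in both inputs; what the engine does with a cell present in just one array is not specified here, so the intersection rule is an assumption, as are the helper name elementwise and the sample values.

```python
import operator


def elementwise(left, right, op, attr):
    """Apply `op` to `attr` for coordinates present in both sparse arrays."""
    shared = left.keys() & right.keys()
    return {coord: op(left[coord][attr], right[coord][attr]) for coord in sorted(shared)}


current = {(0, 0): {"temperature": 21.5}, (0, 1): {"temperature": 19.0}}
baseline = {(0, 0): {"temperature": 20.0}, (1, 1): {"temperature": 18.0}}

# Subtraction, mirroring the 'SUBTRACT' call above.
delta = elementwise(current, baseline, operator.sub, "temperature")
print(delta)
```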
Maintenance
SELECT NDARRAY_FLUSH('spatial_grid'); -- force memtable flush
SELECT NDARRAY_COMPACT('spatial_grid'); -- merge tile versions, reclaim space
NDARRAY_FLUSH returns {result: true} on success and raises an error on failure. Compaction also runs automatically in the background.
Bitemporal Queries
Every array cell carries two times:
- System time — when the value was written (audit trail, compliance, point-in-time recovery)
- Valid time — the period the value applies to (forecasts, backdated corrections, scientific replays)
-- Read cells as the array existed in the past
SELECT * FROM NDARRAY_SLICE('data', {x: [0, 100), y: [0, 100)}, ['value'])
AS OF SYSTEM TIME 1700000000000;
-- Read cells whose valid-time interval includes a given moment
SELECT * FROM NDARRAY_SLICE('forecast', {x: [0, 100), y: [0, 100)}, ['temp'])
AS OF VALID TIME 1700000000000;
-- Both clauses combined
SELECT * FROM NDARRAY_SLICE('forecast', {x: [0, 100), y: [0, 100)}, ['temp'])
AS OF SYSTEM TIME 1700000000000 AS OF VALID TIME 1700000001000;
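The visibility rule behind the two AS OF clauses can be sketched in Python as follows. The sketch assumes each stored version carries a system timestamp plus a half-open valid-time interval, and that the newest qualifying version wins; the CellVersion type, its field names, the interval convention, and the sample timestamps are all assumptions for illustration rather than NodeDB's storage format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CellVersion:
    system_ts: int   # when this version was written
    valid_from: int  # start of the valid-time interval, inclusive (assumed)
    valid_to: int    # end of the valid-time interval, exclusive (assumed)
    value: float


def read_as_of(versions, system_time=None, valid_time=None) -> Optional[CellVersion]:
    """Return the newest version visible under the given AS OF constraints."""
    visible = [
        v for v in versions
        if (system_time is None or v.system_ts <= system_time)
        and (valid_time is None or v.valid_from <= valid_time < v.valid_to)
    ]
    return max(visible, key=lambda v: v.system_ts, default=None)


versions = [
    CellVersion(system_ts=100, valid_from=0, valid_to=1000, value=1.0),
    CellVersion(system_ts=200, valid_from=0, valid_to=1000, value=2.0),  # later correction
    CellVersion(system_ts=300, valid_from=1000, valid_to=2000, value=3.0),
]
print(read_as_of(versions, system_time=250, valid_time=500))
```

Filtering on system time answers "what did the database believe then", while filtering on valid time answers "what was true at that moment"; combining them gives both at once.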
System-time–based retention is the path to GDPR and data-minimization compliance: audit_retain_ms makes tiles older than the window eligible for irreversible purge during compaction.
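The retention rule can be written as a one-line predicate. The Python sketch below follows the wording of the DDL table (older than now - audit_retain_ms, NULL keeps everything); the strict comparison at the window boundary and the function name purge_eligible are assumptions, and the timestamps are illustrative epoch milliseconds.

```python
def purge_eligible(tile_system_ts_ms, now_ms, audit_retain_ms):
    """True when a tile's system time falls outside the retention window."""
    if audit_retain_ms is None:  # NULL means keep everything
        return False
    return tile_system_ts_ms < now_ms - audit_retain_ms


DAY_MS = 86_400_000           # the value used in the spatial_grid example
now = 1_700_000_000_000       # illustrative "current" time
print(purge_eligible(now - 2 * DAY_MS, now, DAY_MS))   # written two days ago
print(purge_eligible(now - 3_600_000, now, DAY_MS))    # written one hour ago
```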
Cross-Engine Queries
Array cells participate in surrogate-identity bitmaps with the rest of the engines, so a single query can prefilter by vector neighborhood and slice an array:
SELECT *
FROM NDARRAY_SLICE('spatial_data', {x: [0, 1000), y: [0, 1000)}, ['attr1', 'attr2'])
WHERE id IN (
  SEARCH vectors USING VECTOR(embedding, $query, 100)
);
See Architecture Overview for the cross-engine identity model.
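As a conceptual sketch of a surrogate-identity bitmap intersection, the Python below packs two sets of integer IDs into bitmaps and ANDs them. It uses plain Python integers as bitmaps purely for illustration; the real bitmap format, the ID assignment scheme, and the helper names are assumptions not taken from this document.

```python
def to_bitmap(ids):
    """Pack non-negative surrogate IDs into a Python int used as a bitmap."""
    bitmap = 0
    for i in ids:
        bitmap |= 1 << i
    return bitmap


def from_bitmap(bitmap):
    """List the IDs whose bits are set, in ascending order."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


vector_hits = to_bitmap([3, 7, 12, 40])  # e.g. IDs returned by a vector search
array_cells = to_bitmap([1, 7, 12, 99])  # e.g. IDs of cells inside an array slice
both = from_bitmap(vector_hits & array_cells)
print(both)
```

Because both engines speak in the same ID space, the prefilter reduces to a bitwise AND before any array tile is decompressed.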
Performance
- Tile-level parallelism — tiles are read and processed independently, so a query's work can spread across cores
- Compression — typically 5–20×, depending on data homogeneity
- Range queries — Z-order layout gives cache-friendly access; ND-MBR per-tile stats prune irrelevant tiles before decompression
- Sparse-friendly — only materialized cells are stored; implicit zeros and empty regions cost nothing
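The tile pruning described in the list above comes down to a box-overlap test per tile. The Python sketch below assumes each tile stores an inclusive (min, max) pair per dimension for its materialized cells and that queries use half-open [lo, hi) ranges; the MBR representation, the tile IDs, and the helper names are assumptions for illustration.

```python
def mbr_overlaps(mbr, query):
    """mbr: inclusive (min, max) per dimension; query: half-open (lo, hi) per dimension."""
    return all(mn < hi and mx >= lo for (mn, mx), (lo, hi) in zip(mbr, query))


def tiles_to_read(tile_mbrs, query):
    """Keep only the tiles whose bounding rectangle intersects the query box."""
    return [tile_id for tile_id, mbr in tile_mbrs.items() if mbr_overlaps(mbr, query)]


tile_mbrs = {
    "t0": [(0, 63), (0, 63)],
    "t1": [(64, 127), (0, 63)],
    "t2": [(200, 220), (10, 20)],
}
survivors = tiles_to_read(tile_mbrs, [(60, 70), (0, 10)])
print(survivors)
```

Tiles that fail the test are skipped without being decompressed, which is where most of the savings on selective range queries come from.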