Array Engine

NodeDB's array engine stores multi-dimensional sparse data with bitemporal support: system time (when the cell was written) and valid time (the real-world time the cell's value describes). Cells are indexed by coordinate tuple, grouped into tiles, compressed per tile, and queried through SQL table-valued functions.

The array engine is a peer of the other NodeDB engines, but it uses its own DDL family (CREATE ARRAY) rather than CREATE COLLECTION ... WITH (engine='array').

When to Use

  • Genomics: (chromosome × position × sample × allele) — replaces TileDB-VCF
  • Single-cell biology: (gene × cell × condition × replicate) — replaces TileDB-SOMA
  • Earth observation: (lat × lon × band × time) raster cubes — replaces Zarr / TileDB-Geo
  • Climate models: (lat × lon × level × time × variable) — replaces HDF5 + Dask
  • Astronomy: (RA × Dec × wavelength × time) — replaces custom Zarr stacks
  • Sparse ML features: (user × item × context) — replaces specialized matrix-factorization systems

Key Features

  • ND coordinate-tuple keying — arbitrary number of dimensions; only materialized cells are stored
  • Tile-based compression — cells grouped into tiles; each tile independently compressed (ALP, FastLanes, Gorilla, LZ4 via nodedb-codec)
  • Z-order indexing — Hilbert/Z-order curve linearization for spatial locality and fast range queries
  • Per-tile ND MBR statistics — minimum bounding rectangle skip; queries prune entire tiles before decompressing
  • Bitemporal — both system time (audit trail) and valid time (temporal semantics) tracked per tile
  • Row-major or column-major layout — cell_order chosen at creation
  • Cross-engine surrogate identity — array cells participate in cross-engine bitmap intersections alongside vector / graph / document / columnar
  • Distributed — tiles vShard-routed; queries scatter-gather across cores and nodes
  • WAL-durable + Raft-replicated — same durability guarantees as the rest of NodeDB
  • Tile-level retention — audit_retain_ms enables GDPR / data-minimization compliance
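
To make the Z-order indexing feature above concrete, here is a minimal Python sketch of Morton (Z-order) bit interleaving, which maps an N-dimensional integer coordinate to a single sortable key. It is an illustration only: the function name `morton_encode`, the 21-bit default, and the bit layout are assumptions, not NodeDB's actual encoding, and a Hilbert curve would need a different algorithm.

```python
def morton_encode(coords, bits=21):
    """Interleave the low `bits` bits of each non-negative integer coordinate.

    Dimension 0 occupies the least significant bit of every group.
    """
    ndim = len(coords)
    code = 0
    for bit in range(bits):
        for d, c in enumerate(coords):
            code |= ((c >> bit) & 1) << (bit * ndim + d)
    return code


# Sorting a 4x4 grid by Morton code visits each 2x2 block before moving on,
# which is why neighbouring cells tend to land close together on disk.
ordered = sorted(((x, y) for x in range(4) for y in range(4)), key=morton_encode)
print(ordered[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Because nearby coordinates share high-order key bits, a range query usually touches a small number of contiguous key ranges rather than the whole key space.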

DDL Syntax

CREATE ARRAY spatial_grid
  DIMS (
    x INT64 DOMAIN [0, 1000),
    y INT64 DOMAIN [0, 1000),
    z INT64 DOMAIN [0, 1000)
  )
  ATTRS (
    temperature FLOAT32,
    pressure    FLOAT32,
    humidity    FLOAT32
  )
  TILE_EXTENTS (64, 64, 64)
  WITH (
    cell_order = 'Z-ORDER',
    audit_retain_ms = 86400000
  );
Parameter | Required | Default | Description
----------|----------|---------|------------
DIMS | Yes | | Dimensions. Each has a name, type (INT32, INT64, FLOAT64), and half-open domain [lo, hi).
ATTRS | Yes | | Attributes (cell values). Each has a name and type (FLOAT32, FLOAT64, INT32, INT64, STRING).
TILE_EXTENTS | Yes | | Tile extent per dimension; all > 0. Determines cell locality and compression block granularity.
cell_order | No | 'Z-ORDER' | 'Z-ORDER' (Hilbert curve) or 'ROW-MAJOR'. Affects spatial cache locality.
audit_retain_ms | No | NULL | Tiles older than now - audit_retain_ms (system time) become eligible for purge. NULL = keep all.
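
As an illustration of how DIMS domains and TILE_EXTENTS interact, the Python sketch below maps a cell coordinate to the tile that would hold it. It assumes tiles are aligned to each dimension's lower domain bound, which the text above does not state; the helper names (`in_domain`, `tile_coord`, `tiles_per_dim`) are hypothetical.

```python
import math


def in_domain(cell, domain):
    """True if every coordinate lies in its half-open [lo, hi) domain."""
    return all(lo <= c < hi for c, (lo, hi) in zip(cell, domain))


def tile_coord(cell, domain, tile_extents):
    """Tile index per dimension, assuming tiles start at the domain's lower bound."""
    if not in_domain(cell, domain):
        raise ValueError(f"cell {cell} is outside the declared domain")
    return tuple(
        math.floor((c - lo) / extent)
        for c, (lo, _hi), extent in zip(cell, domain, tile_extents)
    )


def tiles_per_dim(domain, tile_extents):
    """Number of tiles needed to cover each dimension (the last may be partial)."""
    return tuple(
        math.ceil((hi - lo) / extent) for (lo, hi), extent in zip(domain, tile_extents)
    )


# Values taken from the spatial_grid example above.
domain = [(0, 1000), (0, 1000), (0, 1000)]
extents = (64, 64, 64)
print(tile_coord((130, 63, 64), domain, extents))  # (2, 0, 1)
print(tiles_per_dim(domain, extents))              # (16, 16, 16)
```

Under this assumption a 1000-wide dimension with extent 64 needs 16 tiles, the last of which is only partly inside the domain.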

ALTER NDARRAY <name> SET (audit_retain_ms = ...) updates retention; DROP ARRAY <name> is two-phase like DROP COLLECTION.

Insert

CREATE ARRAY elevation_map
  DIMS (
    lon FLOAT64 DOMAIN [-180, 180),
    lat FLOAT64 DOMAIN [-90, 90)
  )
  ATTRS (height FLOAT32)
  TILE_EXTENTS (256, 256);

INSERT INTO ARRAY elevation_map (lon, lat, height) VALUES
  (-73.5, 40.7, 10.5),
  (-73.6, 40.8, 12.3),
  (-73.7, 40.6, 8.9);

-- Force the in-memory tiles to durable storage
SELECT NDARRAY_FLUSH('elevation_map');

Query Functions

Array queries are expressed as table-valued functions in FROM. System time and valid time apply via AS OF clauses.

NDARRAY_SLICE — multi-dimensional range

SELECT * FROM NDARRAY_SLICE(
  'elevation_map',
  {lon: [-74.0, -73.0), lat: [40.0, 41.0)},
  ['height'],   -- attribute projection (optional)
  1000          -- max cells (optional)
);
Parameter | Required | Type | Description
----------|----------|------|------------
array | Yes | STRING | Array name
bounds | Yes | OBJECT | { dim: [lo, hi) }. Omitted dims = full range.
attrs | No | ARRAY[STRING] | Attributes to project. NULL = all attributes.
limit | No | INT64 | Max cells returned. NULL = no limit.
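
The parameter semantics above (half-open bounds, omitted dimensions meaning full range, optional projection and limit) can be modelled in a few lines of Python. This is a reference sketch over an in-memory list, not how NodeDB executes a slice; `slice_cells` and the cell representation are hypothetical.

```python
def slice_cells(cells, bounds, attrs=None, limit=None):
    """Filter (coords, values) pairs by half-open per-dimension bounds.

    bounds maps a dimension name to (lo, hi); dimensions not listed are unbounded.
    attrs=None keeps every attribute; limit=None returns every match.
    """
    out = []
    for coords, values in cells:
        if limit is not None and len(out) >= limit:
            break
        if all(lo <= coords[dim] < hi for dim, (lo, hi) in bounds.items()):
            picked = values if attrs is None else {a: values[a] for a in attrs}
            out.append({**coords, **picked})
    return out


# The three cells inserted into elevation_map above.
cells = [
    ({"lon": -73.5, "lat": 40.7}, {"height": 10.5}),
    ({"lon": -73.6, "lat": 40.8}, {"height": 12.3}),
    ({"lon": -73.7, "lat": 40.6}, {"height": 8.9}),
]
rows = slice_cells(cells, {"lon": (-74.0, -73.0), "lat": (40.0, 41.0)}, ["height"])
print(len(rows))  # 3
```

Note that an upper bound of 40.8 on lat would exclude the second cell, because the interval is half-open.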

NDARRAY_PROJECT — attribute projection

SELECT * FROM NDARRAY_PROJECT('spatial_grid', ['temperature', 'pressure']);

NDARRAY_AGG — reduce a dimension

Aggregates an attribute over a dimension, reducing dimensionality:

-- Sum temperature over x; result keeps y and z
SELECT * FROM NDARRAY_AGG('spatial_grid', 'temperature', 'SUM', 'x');

Reducers: 'SUM', 'AVG', 'MIN', 'MAX', 'COUNT'.
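
To show what reducing a dimension means for sparse data, the Python sketch below groups cells by their remaining coordinates and applies one of the reducers listed above. It assumes that only materialized cells contribute to the result; the function `reduce_dim` and the dictionary layout are illustrative and not part of NodeDB.

```python
from collections import defaultdict

REDUCERS = {
    "SUM": sum,
    "AVG": lambda vals: sum(vals) / len(vals),
    "MIN": min,
    "MAX": max,
    "COUNT": len,
}


def reduce_dim(cells, dims, attr, reducer, over):
    """Collapse dimension `over`; keys of the result are the remaining coordinates.

    cells maps a coordinate tuple (ordered like `dims`) to {attribute: value}.
    """
    axis = dims.index(over)
    groups = defaultdict(list)
    for coord, values in cells.items():
        key = coord[:axis] + coord[axis + 1:]
        groups[key].append(values[attr])
    fn = REDUCERS[reducer]
    return {key: fn(vals) for key, vals in groups.items()}


cells = {
    (0, 0, 0): {"temperature": 1.0},
    (1, 0, 0): {"temperature": 2.0},
    (0, 1, 0): {"temperature": 5.0},
}
# Sum over x: the two cells at (y=0, z=0) are combined.
print(reduce_dim(cells, ("x", "y", "z"), "temperature", "SUM", "x"))
# {(0, 0): 3.0, (1, 0): 5.0}
```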

NDARRAY_ELEMENTWISE — between two arrays of the same shape

SELECT * FROM NDARRAY_ELEMENTWISE('current_grid', 'baseline_grid', 'SUBTRACT', 'temperature');
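
A rough Python model of an element-wise subtraction between two sparse arrays follows. How NodeDB treats a coordinate that is materialized in only one of the arrays is not stated here; this sketch assumes such cells are skipped, and `elementwise_subtract` is a hypothetical helper.

```python
def elementwise_subtract(left, right, attr):
    """Return left[attr] - right[attr] for coordinates present in both arrays.

    Assumption: cells missing from either side produce no output cell.
    """
    out = {}
    for coord, values in left.items():
        other = right.get(coord)
        if other is not None:
            out[coord] = values[attr] - other[attr]
    return out


current = {(0, 0): {"temperature": 21.5}, (0, 1): {"temperature": 19.0}}
baseline = {(0, 0): {"temperature": 20.0}, (5, 5): {"temperature": 18.0}}
print(elementwise_subtract(current, baseline, "temperature"))  # {(0, 0): 1.5}
```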

Maintenance

SELECT NDARRAY_FLUSH('spatial_grid');    -- force memtable flush
SELECT NDARRAY_COMPACT('spatial_grid');  -- merge tile versions, reclaim space

NDARRAY_FLUSH returns {result: true} on success and raises an error on failure. Compaction also runs automatically in the background, so NDARRAY_COMPACT is only needed to force it.

Bitemporal Queries

Every array cell carries two times:

  • System time — when the value was written (audit trail, compliance, point-in-time recovery)
  • Valid time — the real-world time the value describes (forecasts, backdated corrections, scientific replays)

-- Read cells as the array existed in the past
SELECT * FROM NDARRAY_SLICE('data', {x: [0, 100), y: [0, 100)}, ['value'])
AS OF SYSTEM TIME 1700000000000;

-- Read cells whose valid-time interval includes a given moment
SELECT * FROM NDARRAY_SLICE('forecast', {x: [0, 100), y: [0, 100)}, ['temp'])
AS OF VALID TIME 1700000000000;

-- Both clauses combined
SELECT * FROM NDARRAY_SLICE('forecast', {x: [0, 100), y: [0, 100)}, ['temp'])
AS OF SYSTEM TIME 1700000000000 AS OF VALID TIME 1700000001000;

System-time–based retention is the path to GDPR and data-minimization compliance: audit_retain_ms makes tiles older than the window eligible for irreversible purge during compaction.
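
The two AS OF clauses and the retention rule can be summarised with a small Python model. It assumes each stored version carries one system timestamp and a half-open valid-time interval, and that the newest visible version wins; these details, along with the names `Version`, `read_as_of`, and `purge_eligible`, are illustrative assumptions rather than NodeDB internals.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Version:
    system_ms: int      # when this version was written
    valid_from_ms: int  # start of the interval the value describes (inclusive)
    valid_to_ms: int    # end of that interval (exclusive)
    value: float


def read_as_of(versions: Sequence[Version], system_ms: int, valid_ms: int) -> Optional[Version]:
    """Newest version written at or before system_ms whose valid interval covers valid_ms."""
    visible = [
        v for v in versions
        if v.system_ms <= system_ms and v.valid_from_ms <= valid_ms < v.valid_to_ms
    ]
    return max(visible, key=lambda v: v.system_ms, default=None)


def purge_eligible(tile_system_ms: int, now_ms: int, audit_retain_ms: Optional[int]) -> bool:
    """A tile is eligible once its system time is older than now - audit_retain_ms."""
    if audit_retain_ms is None:  # NULL means keep everything
        return False
    return tile_system_ms < now_ms - audit_retain_ms


history = [
    Version(system_ms=1000, valid_from_ms=0, valid_to_ms=100, value=20.0),
    Version(system_ms=2000, valid_from_ms=0, valid_to_ms=100, value=21.0),  # later correction
]
print(read_as_of(history, system_ms=1500, valid_ms=50).value)  # 20.0
print(read_as_of(history, system_ms=2500, valid_ms=50).value)  # 21.0
```

Reading with an earlier system time returns the value as it was known then, even though a correction for the same valid-time interval was written later.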

Cross-Engine Queries

Array cells participate in surrogate-identity bitmaps with the rest of the engines, so a single query can prefilter by vector neighborhood and slice an array:

SELECT *
FROM NDARRAY_SLICE('spatial_data', {x: [0, 1000), y: [0, 1000)}, ['attr1', 'attr2'])
WHERE id IN (
  SEARCH vectors USING VECTOR(embedding, $query, 100)
);

See Architecture Overview for the cross-engine identity model.
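
Conceptually, the prefilter in the query above is a bitmap intersection over shared surrogate IDs. The Python sketch below uses plain integers as bitmaps to show the idea; the real bitmap representation and ID assignment are not described on this page, so `to_bitmap` and `from_bitmap` are purely illustrative.

```python
def to_bitmap(ids):
    """Set bit i for every surrogate ID i."""
    bitmap = 0
    for i in ids:
        bitmap |= 1 << i
    return bitmap


def from_bitmap(bitmap):
    """Return the sorted list of IDs whose bits are set."""
    ids = []
    position = 0
    while bitmap:
        if bitmap & 1:
            ids.append(position)
        bitmap >>= 1
        position += 1
    return ids


vector_hits = to_bitmap([3, 7, 42])            # IDs returned by a vector search
cells_in_slice = to_bitmap([7, 8, 42, 100])    # IDs of cells inside the slice bounds
print(from_bitmap(vector_hits & cells_in_slice))  # [7, 42]
```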

Performance

  • Tile-level parallelism — each tile is read and processed on its own core
  • Compression — typical 5–20× depending on data homogeneity
  • Range queries — Z-order layout gives cache-friendly access; ND-MBR per-tile stats prune irrelevant tiles before decompression
  • Sparse-friendly — only materialized cells are stored; implicit zeros and empty regions cost nothing
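
The tile pruning mentioned in the list above comes down to a box-overlap test per tile. The Python sketch below assumes each tile records a closed [min, max] bounding range per dimension over its materialized cells and that queries use half-open [lo, hi) bounds; the function names and tile IDs are hypothetical.

```python
def mbr_overlaps(mbr, query):
    """True if the tile's bounding box can contain a cell inside the query box.

    mbr:   per-dimension (min, max), both inclusive
    query: per-dimension (lo, hi), hi exclusive
    """
    return all(
        q_lo <= t_max and t_min < q_hi
        for (t_min, t_max), (q_lo, q_hi) in zip(mbr, query)
    )


def prune(tiles, query):
    """Keep only the tile IDs that must be decompressed for this query."""
    return [tile_id for tile_id, mbr in tiles.items() if mbr_overlaps(mbr, query)]


tiles = {
    "t0": [(0, 63), (0, 63)],
    "t1": [(64, 127), (0, 63)],
    "t2": [(200, 210), (300, 310)],
}
print(prune(tiles, [(60, 70), (0, 10)]))  # ['t0', 't1']
```

Tiles that fail the test are skipped without being read or decompressed, which is where most of the savings on selective range queries come from.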

Last updated on May 2, 2026 by Farhan Syah