Array Engine

NodeDB's array engine stores multi-dimensional sparse data with bitemporal support: system time (when a cell was written) and valid time (the period a cell's value describes). Cells are indexed by coordinate tuple, grouped into tiles, compressed per tile, and queried through SQL table-valued functions.

The array engine is a peer of the other NodeDB engines, but it has its own DDL family (CREATE ARRAY); arrays are not created with CREATE COLLECTION ... WITH (engine='array').

When to Use

  • Genomics: (chromosome × position × sample × allele) — replaces TileDB-VCF
  • Single-cell biology: (gene × cell × condition × replicate) — replaces TileDB-SOMA
  • Earth observation: (lat × lon × band × time) raster cubes — replaces Zarr / TileDB-Geo
  • Climate models: (lat × lon × level × time × variable) — replaces HDF5 + Dask
  • Astronomy: (RA × Dec × wavelength × time) — replaces custom Zarr stacks
  • Sparse ML features: (user × item × context) — replaces specialized matrix-factorization systems

Key Features

  • ND coordinate-tuple keying — arbitrary number of dimensions; only materialized cells are stored
  • Tile-based compression — cells grouped into tiles; each tile independently compressed (ALP, FastLanes, Gorilla, LZ4 via nodedb-codec)
  • Z-order indexing — Hilbert/Z-order curve linearization for spatial locality and fast range queries
  • Per-tile ND MBR statistics — minimum bounding rectangle skip; queries prune entire tiles before decompressing
  • Bitemporal — both system time (audit trail) and valid time (temporal semantics) tracked per tile
  • Row-major or column-major layout: cell_order chosen at creation
  • Cross-engine surrogate identity — array cells participate in cross-engine bitmap intersections alongside vector / graph / document / columnar
  • Distributed — tiles vShard-routed; queries scatter-gather across cores and nodes
  • WAL-durable + Raft-replicated — same durability guarantees as the rest of NodeDB
  • Tile-level retention: audit_retain_ms enables GDPR / data-minimization compliance
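
To make the curve-linearization idea concrete, here is a minimal Python sketch of Z-order (Morton) encoding: the bits of each coordinate are interleaved so that cells close together in space tend to receive nearby keys. It illustrates the general technique only. The function name `interleave_bits` and the bit layout are assumptions for this example, not NodeDB's internal implementation, and the Hilbert variant is not shown.

```python
def interleave_bits(coords, bits_per_dim):
    """Build a Morton (Z-order) key by interleaving coordinate bits.

    Assumption for this sketch: coordinates are non-negative integers and
    dimension 0 contributes the lowest bit of each interleaved group.
    """
    key = 0
    ndim = len(coords)
    for bit in range(bits_per_dim):
        for d, c in enumerate(coords):
            key |= ((c >> bit) & 1) << (bit * ndim + d)
    return key


# Sort a 4x4 grid of (x, y) cells by Morton key: each 2x2 block is
# visited completely before the curve moves on to the next block.
cells = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(cells, key=lambda c: interleave_bits(c, 2))
```

Because neighbouring cells share key prefixes, a range query over a small region tends to touch a small number of contiguous key ranges.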

DDL Syntax

CREATE ARRAY spatial_grid
  DIMS (
    x INT64 [0..1000],
    y INT64 [0..1000],
    z INT64 [0..1000]
  )
  ATTRS (
    temperature FLOAT64 NOT NULL,
    pressure    FLOAT64,
    humidity    FLOAT64
  )
  TILE_EXTENTS (64, 64, 64)
  CELL_ORDER ZORDER
  TILE_ORDER ROW_MAJOR
  WITH (
    prefix_bits = 8,
    audit_retain_ms = 86400000,
    minimum_audit_retain_ms = 3600000
  );

| Parameter | Required | Default | Description |
|---|---|---|---|
| DIMS | Yes | | Dimensions. Each has a name, type (INT64, FLOAT64, TIMESTAMP_MS, STRING), and optional domain [lo..hi]. |
| ATTRS | Yes | | Attributes (cell values). Each has a name, type (INT64, FLOAT64, STRING, BYTES), and optional NOT NULL. |
| TILE_EXTENTS | Yes | | Tile extent per dimension; all > 0. Determines cell locality and compression block granularity. |
| CELL_ORDER | No | HILBERT | Unquoted: ROW_MAJOR, COL_MAJOR, HILBERT, ZORDER (Z-order curve). Affects cell layout. |
| TILE_ORDER | No | HILBERT | Unquoted: ROW_MAJOR, COL_MAJOR, HILBERT, ZORDER. Affects tile layout. |
| prefix_bits | No | 8 | Range 1–16. Bits used in prefix codec for compression. Higher = finer granularity, lower = better compression. |
| audit_retain_ms | No | NULL | Tiles older than now - audit_retain_ms (system time) become eligible for purge. NULL = keep all. |
| minimum_audit_retain_ms | No | NULL | Minimum retention even if audit_retain_ms is lower. Used to enforce compliance minimums. |

ALTER ARRAY <name> SET (audit_retain_ms = ...) updates retention; DROP ARRAY <name> is two-phase like DROP COLLECTION.
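
As an illustration of how TILE_EXTENTS groups cells, the following Python sketch maps an integer cell coordinate to a tile index and an offset inside that tile. It assumes tiles are aligned to the lower bound of each dimension's domain, which the page does not state explicitly; `tile_of` and `offset_in_tile` are hypothetical helper names, and only integer dimensions are covered.

```python
def tile_of(coord, domain_lo, tile_extents):
    """Tile index per dimension (assumes tiles start at the domain lower bound)."""
    return tuple((c - lo) // ext for c, lo, ext in zip(coord, domain_lo, tile_extents))


def offset_in_tile(coord, domain_lo, tile_extents):
    """Position of the cell inside its tile, per dimension."""
    return tuple((c - lo) % ext for c, lo, ext in zip(coord, domain_lo, tile_extents))


# spatial_grid from the DDL above: domains start at 0, TILE_EXTENTS (64, 64, 64)
domain_lo = (0, 0, 0)
extents = (64, 64, 64)
tile = tile_of((130, 5, 64), domain_lo, extents)
offset = offset_in_tile((130, 5, 64), domain_lo, extents)
```

Under these assumptions the cell (130, 5, 64) lands in tile (2, 0, 1) at offset (2, 5, 0), so larger extents put more neighbouring cells into the same compression block.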

Insert

CREATE ARRAY elevation_map
  DIMS (
    lon FLOAT64 [-180..180],
    lat FLOAT64 [-90..90]
  )
  ATTRS (height FLOAT64)
  TILE_EXTENTS (256, 256);

INSERT INTO ARRAY elevation_map (lon, lat, height) VALUES
  (-73.5, 40.7, 10.5),
  (-73.6, 40.8, 12.3),
  (-73.7, 40.6, 8.9);

-- Force the in-memory tiles to durable storage
SELECT ARRAY_FLUSH('elevation_map');

Query Functions

Array queries are expressed as table-valued functions in FROM. System time and valid time apply via AS OF clauses.

ARRAY_SLICE — multi-dimensional range

SELECT * FROM ARRAY_SLICE(
  'elevation_map',
  {lon: [-74.0, -73.0), lat: [40.0, 41.0)},
  ['height'],   -- attribute projection (optional)
  1000          -- max cells (optional)
);

| Parameter | Required | Type | Description |
|---|---|---|---|
| array | Yes | STRING | Array name |
| bounds | Yes | OBJECT | { dim: [lo, hi) }. Omitted dims = full range. |
| attrs | No | ARRAY[STRING] | Attributes to project. NULL = all attributes. |
| limit | No | INT64 | Max cells returned. NULL = no limit. |
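
The half-open bounds, optional projection, and optional limit can be modelled in a few lines of Python. This is a sketch of the documented semantics over an in-memory list, not the engine's execution path; the function `array_slice` and the list-of-pairs cell representation are assumptions made for the example.

```python
def array_slice(cells, bounds, attrs=None, limit=None):
    """Filter cells by half-open [lo, hi) bounds; dims missing from bounds are unrestricted."""
    out = []
    for coords, values in cells:
        if limit is not None and len(out) >= limit:
            break
        inside = all(lo <= coords[dim] < hi for dim, (lo, hi) in bounds.items())
        if inside:
            projected = values if attrs is None else {a: values[a] for a in attrs}
            out.append((coords, projected))
    return out


cells = [
    ({"lon": -73.5, "lat": 40.7}, {"height": 10.5}),
    ({"lon": -73.6, "lat": 40.8}, {"height": 12.3}),
    ({"lon": -73.7, "lat": 40.6}, {"height": 8.9}),
    # Hypothetical extra cell sitting exactly on the excluded upper lon bound
    ({"lon": -73.0, "lat": 40.5}, {"height": 1.0}),
]
hits = array_slice(cells, {"lon": (-74.0, -73.0), "lat": (40.0, 41.0)}, ["height"])
```

The cell at lon = -73.0 is excluded because the upper bound is exclusive, which lets adjacent slices tile a domain without returning any cell twice.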

ARRAY_PROJECT — attribute projection

SELECT * FROM ARRAY_PROJECT('spatial_grid', ['temperature', 'pressure']);

ARRAY_AGG — reduce a dimension

Aggregates an attribute over a dimension, reducing dimensionality:

-- Sum temperature over x; result keeps y and z
SELECT * FROM ARRAY_AGG('spatial_grid', 'temperature', 'SUM', 'x');

Reducers: 'SUM', 'AVG', 'MIN', 'MAX', 'COUNT'.
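
The dimension-reducing behaviour can be sketched in Python over a sparse dictionary of cells. The reducer names come from the list above; everything else (the function `array_agg`, the dict representation, and the choice to aggregate only over materialized cells) is an assumption made for illustration.

```python
from collections import defaultdict

REDUCERS = {
    "SUM": sum,
    "AVG": lambda vals: sum(vals) / len(vals),
    "MIN": min,
    "MAX": max,
    "COUNT": len,
}


def array_agg(cells, dims, reducer, over):
    """Collapse dimension `over`; the result is keyed by the remaining dimensions."""
    keep = [i for i, name in enumerate(dims) if name != over]
    groups = defaultdict(list)
    for coord, value in cells.items():
        groups[tuple(coord[i] for i in keep)].append(value)
    return {key: REDUCERS[reducer](vals) for key, vals in groups.items()}


# Sparse (x, y, z) -> temperature cells; only materialized cells contribute
temperature = {(0, 0, 0): 1.0, (1, 0, 0): 2.0, (0, 1, 0): 5.0}
summed = array_agg(temperature, ("x", "y", "z"), "SUM", "x")
```

Here `summed` is keyed by (y, z): the two cells that differ only in x are folded into a single output cell.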

ARRAY_ELEMENTWISE — between two arrays of the same shape

SELECT * FROM ARRAY_ELEMENTWISE('current_grid', 'baseline_grid', 'SUBTRACT', 'temperature');
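
For intuition, an element-wise subtraction over two sparse arrays might look like the Python sketch below. The page does not say how cells present in only one array are treated; this example assumes they are skipped, and the function `elementwise` is a hypothetical name.

```python
import operator


def elementwise(a, b, op):
    """Apply `op` to cells present in both sparse arrays.

    Assumption: a cell missing from either input produces no output cell.
    """
    return {coord: op(a[coord], b[coord]) for coord in a.keys() & b.keys()}


current = {(0, 0): 10.0, (0, 1): 4.0}
baseline = {(0, 0): 3.0, (1, 1): 1.0}
delta = elementwise(current, baseline, operator.sub)
```

Only coordinate (0, 0) exists in both inputs, so `delta` contains a single cell.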

Maintenance

SELECT ARRAY_FLUSH('spatial_grid');    -- force memtable flush
SELECT ARRAY_COMPACT('spatial_grid');  -- merge tile versions, reclaim space

ARRAY_FLUSH returns {result: true} on success; a failure raises an error. Compaction also runs automatically in the background.

Bitemporal Queries

Every array cell carries two times:

  • System time: when the value was written (audit trail, compliance, point-in-time recovery)
  • Valid time: the time the value describes (forecasts, backdated corrections, scientific replays)

-- Read cells as the array existed in the past
SELECT * FROM ARRAY_SLICE('data', {x: [0, 100), y: [0, 100)}, ['value'])
AS OF SYSTEM TIME 1700000000000;

-- Read cells whose valid-time interval includes a given moment
SELECT * FROM ARRAY_SLICE('forecast', {x: [0, 100), y: [0, 100)}, ['temp'])
AS OF VALID TIME 1700000000000;

-- Both clauses combined
SELECT * FROM ARRAY_SLICE('forecast', {x: [0, 100), y: [0, 100)}, ['temp'])
AS OF SYSTEM TIME 1700000000000 AS OF VALID TIME 1700000001000;
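
One way to picture how the two AS OF clauses interact is the Python sketch below, which picks the visible version of a single cell. It assumes each version carries a system timestamp and a half-open valid interval, and that the latest qualifying system time wins; these are modelling assumptions, and `Version` and `read_as_of` are hypothetical names rather than NodeDB types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    system_ms: int      # when this version was written
    valid_from_ms: int  # start of the period the value describes
    valid_to_ms: int    # end of that period (exclusive, by assumption)
    value: float


def read_as_of(versions, system_ms=None, valid_ms=None):
    """Return the newest version visible at the given system and valid times."""
    candidates = [
        v for v in versions
        if (system_ms is None or v.system_ms <= system_ms)
        and (valid_ms is None or v.valid_from_ms <= valid_ms < v.valid_to_ms)
    ]
    return max(candidates, key=lambda v: v.system_ms, default=None)


history = [
    Version(system_ms=100, valid_from_ms=0, valid_to_ms=1000, value=1.0),
    Version(system_ms=200, valid_from_ms=0, valid_to_ms=1000, value=2.0),  # backdated correction
    Version(system_ms=150, valid_from_ms=1000, valid_to_ms=2000, value=9.0),
]
before_fix = read_as_of(history, system_ms=150, valid_ms=500)
after_fix = read_as_of(history, system_ms=250, valid_ms=500)
```

Reading at system time 150 still returns the original value, while reading at 250 sees the correction, even though both queries ask about the same valid-time moment.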

System-time–based retention is the path to GDPR and data-minimization compliance: audit_retain_ms makes tiles older than the window eligible for irreversible purge during compaction.
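
The eligibility rule can be written out as a short Python sketch. It reads minimum_audit_retain_ms as a floor on the effective window, which is one interpretation of the parameter table rather than a confirmed implementation detail; `effective_retain_ms` and `purge_eligible` are hypothetical names.

```python
def effective_retain_ms(audit_retain_ms, minimum_audit_retain_ms=None):
    """Retention window after applying the minimum (assumed to act as a floor)."""
    if audit_retain_ms is None:
        return None  # NULL = keep all
    if minimum_audit_retain_ms is None:
        return audit_retain_ms
    return max(audit_retain_ms, minimum_audit_retain_ms)


def purge_eligible(tile_system_ms, now_ms, audit_retain_ms, minimum_audit_retain_ms=None):
    """True if the tile's system time is older than now minus the retention window."""
    retain = effective_retain_ms(audit_retain_ms, minimum_audit_retain_ms)
    if retain is None:
        return False
    return tile_system_ms < now_ms - retain


# Values from the CREATE ARRAY example: 24 h retention, 1 h minimum
now_ms = 100_000_000
old_tile = purge_eligible(10_000_000, now_ms, 86_400_000, 3_600_000)
recent_tile = purge_eligible(20_000_000, now_ms, 86_400_000, 3_600_000)
```

With these numbers the cutoff is 13,600,000 ms, so the first tile is eligible for purge and the second is not.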

Cross-Engine Queries

Array cells participate in surrogate-identity bitmaps with the rest of the engines, so a single query can prefilter by vector neighborhood and slice an array:

SELECT *
FROM ARRAY_SLICE('spatial_data', {x: [0, 1000), y: [0, 1000)}, ['attr1', 'attr2'])
WHERE id IN (
  SEARCH vectors USING VECTOR(embedding, $query, 100)
);

See Architecture Overview for the cross-engine identity model.
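
Conceptually, a cross-engine prefilter is a bitmap intersection over shared surrogate IDs. The Python sketch below uses plain integers as bitmaps to show the idea; the ID values and the helper names `to_bitmap` and `from_bitmap` are invented for illustration, and the real engine presumably uses a compressed bitmap structure instead.

```python
def to_bitmap(ids):
    """Set bit i for every surrogate ID i (IDs are small non-negative ints here)."""
    bitmap = 0
    for i in ids:
        bitmap |= 1 << i
    return bitmap


def from_bitmap(bitmap):
    """Return the set bits of the bitmap as a sorted list of IDs."""
    ids = []
    position = 0
    while bitmap:
        if bitmap & 1:
            ids.append(position)
        bitmap >>= 1
        position += 1
    return ids


vector_hits = to_bitmap([3, 7, 42])       # hypothetical IDs from a vector search
slice_cells = to_bitmap([7, 8, 42, 100])  # hypothetical IDs of cells inside the slice
matching = from_bitmap(vector_hits & slice_cells)
```

The intersection keeps only IDs 7 and 42, the cells that satisfy both the vector prefilter and the slice bounds.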

Performance

  • Tile-level parallelism — each tile is read and processed on its own core
  • Compression — typical 5–20× depending on data homogeneity
  • Range queries — Z-order layout gives cache-friendly access; ND-MBR per-tile stats prune irrelevant tiles before decompression
  • Sparse-friendly — only materialized cells are stored; implicit zeros and empty regions cost nothing
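
The tile-pruning step mentioned above can be illustrated with a small Python sketch: a tile is skipped when its minimum bounding rectangle does not intersect the query box in at least one dimension. It assumes inclusive per-dimension (min, max) statistics and half-open query bounds; `mbr_overlaps` and the tile names are hypothetical.

```python
def mbr_overlaps(mbr, query):
    """mbr: per-dim (min, max), inclusive. query: per-dim (lo, hi), half-open."""
    return all(mn < hi and mx >= lo for (mn, mx), (lo, hi) in zip(mbr, query))


# Hypothetical per-tile statistics for a 2-D array
tile_mbrs = {
    "t0": [(0, 63), (0, 63)],
    "t1": [(64, 127), (0, 63)],
    "t2": [(10, 20), (200, 250)],
}
query = [(0, 64), (0, 100)]
surviving = [name for name, mbr in tile_mbrs.items() if mbr_overlaps(mbr, query)]
```

Only `t0` survives, so the other two tiles would never need to be decompressed for this query.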

Last updated on Jun 10, 2026 by Farhan Syah