Full-Text Search Engine

BM25 ranking accelerated by Block-Max WAND (BMW), with 16 Snowball stemmers, stop-word lists for 27 languages, CJK bigram tokenization, posting-list compression, fuzzy matching, and native hybrid fusion with vector search.

When to Use

Text search across documents, articles, products, logs
Search-as-you-type with fuzzy matching
Multilingual content search (including CJK, Arabic, Hindi)
Hybrid retrieval: keyword matching + semantic similarity

SQL Usage

CREATE COLLECTION articles;
CREATE SEARCH INDEX ON articles FIELDS title, body ANALYZER 'english' FUZZY true;

-- Basic search
SELECT title, bm25_score(body, 'distributed database') AS score
FROM articles
WHERE text_match(body, 'distributed database')
ORDER BY score DESC LIMIT 20;

-- Fuzzy search
SELECT title FROM articles WHERE text_match(title, 'databse', { fuzzy: true, distance: 2 });

-- Hybrid BM25 + vector (RRF)
SELECT title, rrf_score(
    vector_distance(embedding, $query_vec),
    bm25_score(body, 'distributed systems')
) AS score
FROM articles
ORDER BY score DESC
LIMIT 10;

-- Synonyms
CREATE SYNONYM GROUP db_terms AS ('database', 'db', 'datastore');
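
The hybrid query above fuses a vector ranking and a BM25 ranking with `rrf_score`. As a rough illustration of what Reciprocal Rank Fusion computes, here is a minimal Python sketch. The function name `rrf_fuse`, the constant `k=60` (a commonly used default), and 1-based ranks are assumptions for illustration; the engine's actual constant and tie-breaking are not documented here.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    k=60 is an assumed default, not a value taken from the engine.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists: best match first in each.
vector_hits = ["a", "b", "c"]
bm25_hits = ["b", "d", "a"]
fused = rrf_fuse([vector_hits, bm25_hits])
print([doc_id for doc_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

RRF only looks at ranks, so the raw vector distances and BM25 scores never need to be normalized against each other.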

Analyzers

Analyzer	Behavior
`standard`	NFD normalization, lowercasing, English stop-word removal and stemming
`simple`	Lowercase + whitespace split
`keyword`	Entire input as a single token
`cjk_bigram`	CJK bigram tokenization
`ngram:2:4`	Character n-grams (min:max)
`edge_ngram`	Prefix-anchored n-grams for autocomplete
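
To make the difference between `ngram:2:4` and `edge_ngram` concrete, the sketch below generates both kinds of tokens for a single word. The function names and the edge n-gram length defaults are hypothetical; the table above does not state the engine's defaults, emission order, or whether input is split on whitespace first.

```python
def ngrams(text, min_n=2, max_n=4):
    """All character n-grams with lengths min_n..max_n, in position order."""
    return [
        text[i:i + n]
        for i in range(len(text))
        for n in range(min_n, max_n + 1)
        if i + n <= len(text)
    ]


def edge_ngrams(text, min_n=1, max_n=4):
    """Prefix-anchored n-grams: every prefix of length min_n..max_n."""
    return [text[:n] for n in range(min_n, min(max_n, len(text)) + 1)]


print(ngrams("data", 2, 4))       # ['da', 'dat', 'data', 'at', 'ata', 'ta']
print(edge_ngrams("data", 1, 3))  # ['d', 'da', 'dat']
```

Edge n-grams index only prefixes, which is why they suit autocomplete: a partially typed query such as `da` matches the indexed token directly.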

16 language-specific analyzers: ar, da, nl, en, fi, fr, de, hu, it, no, pt, ro, ru, es, sv, tr.

CJK text is automatically routed to the bigram tokenizer regardless of the configured analyzer. Optional dictionary-based segmentation is available behind feature gates: lang-ja, lang-zh, lang-ko, lang-th.
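
A minimal sketch of bigram tokenization for CJK runs follows. The Unicode ranges checked by `is_cjk` are a simplified assumption (the main ideograph, kana, and Hangul syllable blocks only), and the handling of single-character runs is a guess; the engine's actual script detection may differ. Non-CJK characters are simply skipped here, whereas a real pipeline would hand them to the configured analyzer.

```python
def is_cjk(ch):
    """Simplified script check; real detection covers more Unicode blocks."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF     # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF  # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF  # Hangul Syllables
    )


def cjk_bigrams(text):
    """Emit overlapping two-character tokens for each run of CJK characters."""
    tokens = []
    run = []

    def flush():
        if len(run) == 1:
            tokens.append(run[0])  # a lone character becomes its own token
        else:
            tokens.extend(run[i] + run[i + 1] for i in range(len(run) - 1))
        run.clear()

    for ch in text:
        if is_cjk(ch):
            run.append(ch)
        elif run:
            flush()
    if run:
        flush()
    return tokens


print(cjk_bigrams("全文検索"))  # ['全文', '文検', '検索']
```

Overlapping bigrams need no dictionary, which is why they work as a default; dictionary segmentation produces fewer, more meaningful tokens.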

Internals

BMW scoring: WAND pivot selection plus pruning of 128-document blocks using precomputed score upper bounds
Posting compression: delta-encoded doc IDs, bit-packed at variable width, with SIMD unpacking (SSE2/NEON)
SmallFloat fieldnorms: field lengths quantized to 1 byte (4x smaller than a 4-byte value)
LSM storage: in-memory memtable → immutable segments → level-based compaction (8x8 tiering)
AND-first with OR fallback: queries are first evaluated as AND; if that returns zero results, they fall back to OR with a coverage penalty
Phrase proximity boost: consecutive query tokens found at consecutive positions receive up to a 3x score boost
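
The posting compression item above can be illustrated with a scalar sketch of delta encoding followed by bit-packing. The block layout here (first doc ID stored raw, one bit width per block, little-endian payload) and the function names are assumptions for illustration; the engine's on-disk format and its SIMD unpacking path are not shown in this document.

```python
def encode_block(doc_ids):
    """Delta-encode strictly increasing doc IDs and bit-pack the gaps.

    Returns (first_id, bit_width, gap_count, payload_bytes).
    """
    first = doc_ids[0]
    gaps = [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    # One width per block: just enough bits for the largest gap.
    width = max((g.bit_length() for g in gaps), default=0)
    packed = 0
    for i, gap in enumerate(gaps):
        packed |= gap << (i * width)
    nbytes = (len(gaps) * width + 7) // 8
    return first, width, len(gaps), packed.to_bytes(nbytes, "little")


def decode_block(first, width, count, payload):
    """Unpack the gaps and rebuild doc IDs with a running sum."""
    packed = int.from_bytes(payload, "little")
    mask = (1 << width) - 1
    doc_ids = [first]
    for i in range(count):
        gap = (packed >> (i * width)) & mask
        doc_ids.append(doc_ids[-1] + gap)
    return doc_ids


ids = [100, 103, 104, 110, 125]
first, width, count, payload = encode_block(ids)
print(width, len(payload))  # 4 2
assert decode_block(first, width, count, payload) == ids
```

In this example the four gaps (3, 1, 6, 15) each fit in 4 bits, so they occupy 2 bytes in total. Small gaps are what make the scheme effective, and dense posting lists produce them naturally.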