Asteroid Database (LSM-Vec)

Asteroid Database is the open-source engine, LSM-Vec, that powers Asteroid Cloud. You can embed it directly (as a C++ library or through the Python bindings) or self-host it. This section documents the engine's interfaces; examples are available in C++ and Python (pybind).

LSM-Vec combines an HNSW graph index with Aster (a RocksDB fork providing graph-oriented LSM-tree storage). Layer-0 graph edges are persisted on disk; upper layers stay in memory. To run it as an HTTP service with the same API as Asteroid Cloud, see Run the server.

Quickstart

Build the libraries (see Build from source), then open a database, insert, and search.

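#include <memory>
#include <vector>

#include "lsm_vec_db.h"
using namespace lsm_vec;

int main() {
    LSMVecDBOptions opts;
    opts.dim = 128;
    opts.vector_file_path = "./db/vectors.bin";
    opts.reinit = true;                       // start fresh: wipes existing data on open

    std::unique_ptr<LSMVecDB> db;
    Status s = LSMVecDB::Open("./db", opts, &db);
    if (!s.ok()) return 1;                    // assumes a RocksDB-style Status::ok()

    std::vector<float> v(128, 0.1f);          // length must equal opts.dim
    db->Insert(1, v);                         // insert under id 1

    SearchOptions so; so.k = 10; so.ef_search = 128;
    std::vector<SearchResult> results;
    db->SearchKnn(v, so, &results);           // up to k nearest neighbors of v

    db->Close();
    return 0;
}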

The engine's Python module is imported as lsm_vec (pybind bindings). It is not the same as the Cloud client, lsmvec-client (import lsmvec_client).

Insert / update / delete

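Insert takes an id and a vector, plus an optional JSON metadata string. Update replaces the vector stored under an id; Delete marks the entry with a tombstone. The metadata payload can also be read and written on its own through dedicated getters and setters, such as SetPayload.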

db->Insert(1, vec);
db->Insert(2, vec, R"({"category":"docs"})");   // with metadata
db->SetPayload(1, R"({"category":"docs"})");
db->Update(1, new_vec);
db->Delete(2);

Search

SearchKnn returns results ordered by ascending distance. An overload accepts a metadata filter (same predicate syntax as Filter by metadata).

SearchOptions so; so.k = 10; so.ef_search = 128;
std::vector<SearchResult> results;

// plain k-NN
db->SearchKnn(query, so, &results);

// with a metadata filter
db->SearchKnn(query, so, R"({"category":{"$eq":"docs"}})", &results);
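
When checking that approximate results look right, it can help to compare them with an exact scan over a small sample. The sketch below is a standalone brute-force baseline that returns neighbors in ascending squared-L2 order. It is not part of LSM-Vec: the names SquaredL2 and BruteForceKnn and the 64-bit id type are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Squared L2 distance between two vectors of equal length.
inline float SquaredL2(const std::vector<float>& a, const std::vector<float>& b) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Exact k-NN by linear scan. Returns (id, squared distance) pairs,
// nearest first. Hypothetical helper, not an LSM-Vec API.
inline std::vector<std::pair<std::uint64_t, float>> BruteForceKnn(
    const std::vector<std::pair<std::uint64_t, std::vector<float>>>& data,
    const std::vector<float>& query, std::size_t k) {
    std::vector<std::pair<std::uint64_t, float>> scored;
    scored.reserve(data.size());
    for (const auto& item : data) {
        scored.emplace_back(item.first, SquaredL2(item.second, query));
    }
    const std::size_t top = std::min(k, scored.size());
    // Only the first `top` entries need to be sorted.
    std::partial_sort(
        scored.begin(), scored.begin() + static_cast<std::ptrdiff_t>(top), scored.end(),
        [](const std::pair<std::uint64_t, float>& x,
           const std::pair<std::uint64_t, float>& y) { return x.second < y.second; });
    scored.resize(top);
    return scored;
}
```

A linear scan costs O(n * dim) per query, so it only suits small validation sets; its value is that the ordering is exact by construction.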

Bulk build

There are two ways to populate an index, with different memory profiles: incremental insertion with Insert, and an in-memory bulk build with BulkBuild.

Memory note. The in-memory build holds all vectors and the full graph in RAM during the build, so peak memory is high. Once it finishes, memory drops back to the normal small, disk-oriented footprint, and later inserts and searches use the same low memory as the incremental path.

// flat: n * dim contiguous float32 values
BulkBuildOptions bopts;        // num_threads + RNN-Descent params
db->BulkBuild(Span<const float>(flat.data(), flat.size()), n, bopts);
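
If the vectors start out as separate rows, they have to be packed into the n * dim contiguous float32 layout before the call. The following is a minimal sketch of that packing step plus a lower bound on the RAM the raw vectors occupy during an in-memory build. FlattenRows and RawVectorBytes are hypothetical helper names, not LSM-Vec functions, and the estimate ignores the graph itself.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Packs row vectors into one contiguous row-major float buffer holding
// rows.size() * dim values. Throws if any row's length differs from dim.
// Hypothetical helper, not an LSM-Vec API.
inline std::vector<float> FlattenRows(const std::vector<std::vector<float>>& rows,
                                      std::size_t dim) {
    std::vector<float> flat;
    flat.reserve(rows.size() * dim);
    for (const auto& row : rows) {
        if (row.size() != dim) {
            throw std::invalid_argument("row length does not match dim");
        }
        flat.insert(flat.end(), row.begin(), row.end());
    }
    return flat;
}

// Bytes needed for the raw float32 vectors alone. The graph held in RAM
// during the build comes on top of this figure.
inline std::size_t RawVectorBytes(std::size_t n, std::size_t dim) {
    return n * dim * sizeof(float);
}
```

The resulting buffer can then be passed as flat, with n equal to rows.size(). For one million 128-dimensional vectors, the raw data alone is about 512 MB.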

Configuration & defaults

Pass LSMVecDBOptions to Open. The service defaults are m=8, m_max=24, ef_construction=32.

Field                    Default  Description
dim                      0        Required. Vector dimensionality.
metric                   kL2      kL2 or kCosine.
m                        8        HNSW links per node at layer 0.
m_max                    24       Max neighbors at upper layers.
ef_construction          32       Candidate pool during construction.
vector_storage_type      1        0 = flat file, 1 = paged + cached.
paged_max_cached_pages   4096     4 KB pages in the page cache.
reinit                   false    true = wipe on open; false = reopen.
vector_file_path         ""       Path for the vector storage file.
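
To get a feel for what paged_max_cached_pages means in bytes, the arithmetic can be sketched as below. The 4 KB page size comes from the table; the vectors-per-page figure assumes float32 vectors packed back to back with no per-page header, which is an assumption rather than a documented layout. The function names are illustrative only.

```cpp
#include <cstddef>

constexpr std::size_t kPageBytes = 4096;  // 4 KB page, per the table

// Total bytes held by a page cache of `pages` pages.
constexpr std::size_t PageCacheBytes(std::size_t pages) {
    return pages * kPageBytes;
}

// Whole float32 vectors that fit in one page, ASSUMING tight packing and
// no page header (not confirmed by the documentation). Returns 0 for dim 0.
constexpr std::size_t VectorsPerPage(std::size_t dim) {
    return dim == 0 ? 0 : kPageBytes / (dim * sizeof(float));
}

// Default setting: 4096 pages * 4 KB = 16 MiB of cached pages.
static_assert(PageCacheBytes(4096) == 16u * 1024u * 1024u, "default cache is 16 MiB");
```

Under that packing assumption, dim = 128 gives 8 vectors per page, so the default cache would hold about 32,768 vectors at a time.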

Tuning: higher m / ef_construction → better recall, slower indexing; higher ef_search → better recall, slower queries.
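
One way to make the recall side of this trade-off measurable is recall@k: the fraction of the true k nearest ids that the index actually returned. The helper below is a standalone sketch, not an LSM-Vec function; RecallAtK and the 64-bit id type are assumptions, and the true ids are expected to come from an exhaustive scan you run yourself.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// recall@k: share of the exact neighbor ids that also appear in the
// approximate result. Assumes ids within approx_ids are unique.
// Hypothetical helper, not an LSM-Vec API.
inline double RecallAtK(const std::vector<std::uint64_t>& approx_ids,
                        const std::vector<std::uint64_t>& exact_ids) {
    if (exact_ids.empty()) return 1.0;  // nothing to find
    const std::unordered_set<std::uint64_t> truth(exact_ids.begin(), exact_ids.end());
    std::size_t hits = 0;
    for (std::uint64_t id : approx_ids) {
        hits += truth.count(id);  // count() is 0 or 1 for a set
    }
    return static_cast<double>(hits) / static_cast<double>(truth.size());
}
```

Averaging this value over a batch of queries while sweeping ef_search gives a recall-versus-latency curve from which to pick a setting.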

See also: Reference & Deployment (HTTP API, client reference, build from source, run the server) · Asteroid Cloud.