Organized by learning path and use case.
Start here. These work immediately with synthetic data.
| Example | Lines | What It Teaches |
|---|---|---|
| `01_basic_search` | 63 | Minimal HNSW: add vectors, search |
| `02_measure_recall` | 91 | How to validate an ANN index |
| `03_quick_benchmark` | 190 | Benchmark with bundled data (no downloads) |
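
Validating an ANN index usually means comparing its results against exact neighbours. The sketch below shows one way to do that with the standard library only: a brute-force scan produces the ground truth, and `recall_at_k` scores an approximate result against it. The function names are illustrative and are not taken from this crate; the approximate ids would come from whatever index is being tested.

```rust
/// Squared Euclidean distance between two equal-length vectors.
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Exact k nearest neighbours by linear scan.
/// Returns indices into `data`, closest first (ties keep index order).
fn brute_force_knn(data: &[Vec<f32>], query: &[f32], k: usize) -> Vec<usize> {
    let mut order: Vec<usize> = (0..data.len()).collect();
    order.sort_by(|&i, &j| {
        squared_l2(&data[i], query).total_cmp(&squared_l2(&data[j], query))
    });
    order.truncate(k);
    order
}

/// recall@k: fraction of the exact top-k ids that the approximate
/// result also contains. An empty ground truth counts as perfect recall.
fn recall_at_k(exact: &[usize], approx: &[usize]) -> f64 {
    if exact.is_empty() {
        return 1.0;
    }
    let hits = exact.iter().filter(|&&id| approx.contains(&id)).count();
    hits as f64 / exact.len() as f64
}
```

Averaging `recall_at_k` over a set of held-out queries gives a single number to track while tuning search parameters.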

    cargo run --example 01_basic_search --release
    cargo run --example 02_measure_recall --release
    cargo run --example 03_quick_benchmark --release # bench: 10K x 384
    JIN_DATASET=quick cargo run --example 03_quick_benchmark --release # CI: 2K x 128

Realistic scenarios with synthetic data. Demonstrate when/why to use each algorithm.
| Example | Lines | Algorithm | Teaches |
|---|---|---|---|
| `semantic_search_demo` | 334 | HNSW | Document search with categories |
| `ivf_pq_demo` | 321 | IVF-PQ | Billion-scale with compression |
| `lid_demo` | 342 | LID | Intrinsic dimensionality estimation |
| `lid_outlier_detection` | 186 | LID | Anomaly detection via LID |
| `rabitq_demo` | 294 | RaBitQ | Randomized binary quantization |
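
For readers new to LID (local intrinsic dimensionality), here is a minimal sketch of the common maximum-likelihood estimator, which works from the distances between a point and its k nearest neighbours. It is a generic textbook form and may differ from what `lid_demo` actually implements; `lid_mle` is a hypothetical name.

```rust
/// Maximum-likelihood LID estimate from one point's k-nearest-neighbour
/// distances: LID = -1 / mean(ln(r_i / r_k)), where r_k is the largest.
/// Returns None if fewer than two usable distances remain or if all
/// distances are equal (the estimate is undefined in that case).
fn lid_mle(neighbor_dists: &[f64]) -> Option<f64> {
    // Keep only finite, strictly positive distances (drops duplicates of
    // the query point itself and any NaN values).
    let mut d: Vec<f64> = neighbor_dists
        .iter()
        .copied()
        .filter(|x| x.is_finite() && *x > 0.0)
        .collect();
    if d.len() < 2 {
        return None;
    }
    d.sort_by(|a, b| a.total_cmp(b));
    let r_max = d[d.len() - 1];
    let mean_log = d.iter().map(|r| (r / r_max).ln()).sum::<f64>() / d.len() as f64;
    if mean_log == 0.0 {
        None
    } else {
        Some(-1.0 / mean_log)
    }
}
```

One plausible use for outlier detection is to compute this estimate for every point and flag those whose value is far above the dataset's typical value; the threshold is a choice left to the caller.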

    cargo run --example semantic_search_demo --release
    cargo run --example ivf_pq_demo --release --features ivf_pq

Compare against standard ANN benchmark datasets from ann-benchmarks.com.
| Dataset | Vectors | Dims | Size | Difficulty |
|---|---|---|---|---|
| `quick` | 2K | 128 | ~1MB | Easy (CI) |
| `bench` | 10K | 384 | ~16MB | Medium |
| `hard` | 10K | 768 | ~31MB | Hard (realistic: topics + duplicates + hard-tail queries) |
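
The Size column is roughly what uncompressed 32-bit floats would occupy. The sketch below reproduces that arithmetic, assuming `f32` storage and decimal megabytes; both are assumptions, and the bundled files may carry extra data such as queries or ground truth.

```rust
/// Bytes needed to store `n` vectors of `dims` components as raw f32.
fn raw_bytes(n: u64, dims: u64) -> u64 {
    n * dims * std::mem::size_of::<f32>() as u64
}

/// Convert a byte count to decimal megabytes (1 MB = 1,000,000 bytes).
fn as_megabytes(bytes: u64) -> f64 {
    bytes as f64 / 1_000_000.0
}
```

Under these assumptions, 2K x 128 comes to about 1.0 MB, 10K x 384 to about 15.4 MB, and 10K x 768 to about 30.7 MB.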
Difficulty progression based on He et al. "On the Difficulty of Nearest Neighbor Search" (ICML 2012):
- quick: Well-separated clusters, standard queries. Reaches 99%+ recall.
- bench: Moderate overlap, adversarial queries. Reaches ~93% at ef=200.
- hard: Anisotropic topic mixture + near-duplicates + a small hard query tail. Expect lower recall at the same ef.

    cargo run --example 03_quick_benchmark --release # bench (default)
    JIN_DATASET=quick cargo run --example 03_quick_benchmark --release # CI
    JIN_DATASET=hard cargo run --example 03_quick_benchmark --release # stress test

| Example | Dataset | Vectors | Dims | Distance | Size |
|---|---|---|---|---|---|
| `glove_benchmark` | GloVe-25 | 1.18M | 25 | Angular | 121MB |
| `sift_benchmark` | SIFT-128 | 1M | 128 | Euclidean | 501MB |
| `hnsw_benchmark` | Synthetic | config | config | config | - |
`glove_benchmark` and `sift_benchmark` both fall back to synthetic data if the real dataset isn't available.
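
The Distance column names the two metrics used by these datasets. As a point of reference, here is a standard-library sketch of both, taking "angular" to mean cosine distance (1 minus cosine similarity), which is how ann-benchmarks uses the term. The function names are illustrative and not this crate's API.

```rust
/// Dot product of two equal-length vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Euclidean (L2) distance.
fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Angular (cosine) distance: 1 - cos(a, b).
/// Returns None if either vector has zero norm, where the angle is undefined.
fn angular(a: &[f32], b: &[f32]) -> Option<f32> {
    let norm = (dot(a, a) * dot(b, b)).sqrt();
    if norm == 0.0 {
        None
    } else {
        Some(1.0 - dot(a, b) / norm)
    }
}
```

When all vectors are normalized to unit length, ranking by Euclidean distance and ranking by angular distance give the same neighbour order, so one index type can serve both.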

    # Real datasets (requires download)
    cargo run --example glove_benchmark --release -- --full
    cargo run --example sift_benchmark --release --features hdf5

For serious benchmarking, download from ann-benchmarks.com:
| Dataset | Dims | Best For | Why |
|---|---|---|---|
| GloVe-25 | 25 | Quick iteration | Smallest, fast downloads |
| GloVe-100 | 100 | Realistic text | Common word embedding dim |
| SIFT-128 | 128 | Euclidean baseline | Standard image features |
| NYTimes-256 | 256 | Text embeddings | Closer to modern dims |
| Fashion-MNIST | 784 | High-dim | Tests curse of dimensionality |
| GIST-960 | 960 | Stress test | Near modern embedding dims |
Modern embedding models (OpenAI, Cohere) use 768-3072 dims. The ann-benchmarks datasets are smaller but still useful for algorithm comparison.
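
At these dimensions, compression starts to matter. As a rough illustration of why product quantization helps, the sketch below estimates storage for PQ codes, assuming `m` subquantizers with 8-bit codes and `f32` codebooks. These parameters are assumptions for the example, not settings taken from `ivf_pq_demo`, and the estimate ignores the IVF coarse centroids and per-vector ids.

```rust
/// Approximate bytes for product-quantized storage of `n` vectors.
/// Assumes `m` subquantizers, 256 centroids each (one byte per code),
/// and f32 codebooks. Returns None if `dims` is not divisible by `m`.
fn pq_bytes(n: u64, dims: u64, m: u64) -> Option<u64> {
    if m == 0 || dims % m != 0 {
        return None;
    }
    // One byte per subquantizer per vector.
    let codes = n * m;
    // m codebooks x 256 centroids x (dims / m) floats x 4 bytes = 256 * dims * 4.
    let codebooks = 256 * dims * 4;
    Some(codes + codebooks)
}
```

For example, 1M vectors at 768 dims take about 3.07 GB as raw `f32`, while `pq_bytes(1_000_000, 768, 96)` comes to about 97 MB, roughly a 32x reduction at the cost of approximate distances.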
Recent research algorithms. Useful for understanding the state of the art.
| Example | Algorithm | Paper |
|---|---|---|
| `dual_branch_demo` | Dual-Branch HNSW | LID-based insertion |
| `dual_branch_hnsw_demo` | Dual-Branch variant | Skip bridges |
| `evoc_demo` | EVōC | Hierarchical clustering |
These are more complex and require reading the accompanying paper.

    Do you have < 10K vectors?
    └─> Brute force (no index needed)

    Do you need streaming inserts with theoretical guarantees?
    └─> Hash/LSH-style approaches (see `sketchir`)

    Are you memory-constrained (> 1M vectors)?
    └─> IVF-PQ (see ivf_pq_demo)

    Default choice:
    └─> HNSW (see 01_basic_search, semantic_search_demo)
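
The same decision tree can be written as a small function, which is handy when the choice has to be made in code. This is one reading of the questions above, checked top to bottom with the first match winning; the memory question is interpreted as "memory-constrained and more than 1M vectors". The enum and function names are hypothetical.

```rust
/// Index families named in the decision tree.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum IndexChoice {
    BruteForce,
    Lsh,
    IvfPq,
    Hnsw,
}

/// Walk the decision tree in order; the first matching question wins.
fn choose_index(
    n_vectors: usize,
    streaming_with_guarantees: bool,
    memory_constrained: bool,
) -> IndexChoice {
    if n_vectors < 10_000 {
        IndexChoice::BruteForce
    } else if streaming_with_guarantees {
        IndexChoice::Lsh
    } else if memory_constrained && n_vectors > 1_000_000 {
        IndexChoice::IvfPq
    } else {
        // Default choice.
        IndexChoice::Hnsw
    }
}
```

The thresholds are copied from the tree and are rules of thumb, so treat the result as a starting point and confirm with a benchmark on your own data.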

    # Quick smoke test of the core examples
    for ex in 01_basic_search 02_measure_recall semantic_search_demo; do
      cargo run --example "$ex" --release
    done