Approximate nearest-neighbor (ANN) search in Rust.
This repo is a working surface for:
- graph-based ANN (HNSW + variants)
- coarse-to-fine indexes (IVF-PQ, ScaNN-style ideas)
- quantization / compression experiments

Docs:

- Guide: GUIDE.md
- Datasets: doc/datasets.md
- Testing: TESTING.md
- Research notes: docs/ANN_RESEARCH_2024_2026.md
Add the crate:

```toml
[dependencies]
jin = "0.1.0"
```

Build an HNSW index and query it:
```rust
use jin::hnsw::HNSWIndex;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // NOTE: `HNSWIndex` currently uses cosine distance.
    // That implies L2-normalized vectors.
    let vectors: Vec<Vec<f32>> = vec![
        vec![1.0, 0.0, 0.0, 0.0],
        vec![0.0, 1.0, 0.0, 0.0],
        vec![0.0, 0.0, 1.0, 0.0],
        vec![0.0, 0.0, 0.0, 1.0],
    ];
    let query = vec![1.0, 0.0, 0.0, 0.0];

    let mut index = HNSWIndex::new(4, 16, 32)?; // dim, m, m_max
    for (id, v) in vectors.iter().enumerate() {
        index.add_slice(id as u32, v)?;
    }
    index.build()?;

    let results = index.search(&query, 2, 50)?; // k, ef_search
    println!("{results:?}");
    Ok(())
}
```

Given a query vector, find the top-k most similar vectors from a collection. Brute force computes all N distances (O(N) per query). For 1,000,000 vectors, that's 1,000,000 distance computations per query.
ANN systems trade exactness for speed: they aim for high recall at much lower latency.
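For scale, the exact baseline is just a linear scan. A minimal sketch (not part of jin; it assumes L2-normalized vectors, so cosine similarity reduces to a dot product):

```rust
/// Exact top-k by brute force: score all N vectors, keep the k best.
/// This is the O(N)-per-query baseline that ANN indexes try to beat.
fn brute_force_top_k(vectors: &[Vec<f32>], query: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = vectors
        .iter()
        .enumerate()
        .map(|(id, v)| {
            // on L2-normalized vectors, cosine similarity is a plain dot product
            let dot: f32 = v.iter().zip(query).map(|(a, b)| a * b).sum();
            (id, dot)
        })
        .collect();
    // highest similarity first, then keep the k best
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(k);
    scored
}
```

This also gives you the ground truth you need later to measure recall.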
HNSW builds a multi-layer graph where each point has:
- a few long edges (good for jumping across the space)
- and more local edges near the bottom (good for refinement)
A query does a greedy, coarse-to-fine walk:
- start from an entry point at the top layer
- greedily descend toward the query through progressively denser layers
- maintain a candidate set (size ef_search) at the bottom to avoid getting stuck
A more accurate mental model than “shortcuts” is: HNSW is a cheap way to keep multiple plausible local minima alive until you can locally refine.
```
Layer 2 (coarse):   o---------o
                     \       /
                      \  o  /
                       \ | /
Layer 1:        o---o---o-o---o---o
                     \   |   /
                      \  |  /
Layer 0 (dense): o--o--o--o--o--o--o--o
                          ^
            keep ~ef_search candidates here,
            return the best k
```
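To make the bottom-layer step concrete, here is a self-contained sketch of a beam search over a single graph layer. This is not jin's implementation; the adjacency-list shape and the `dist` closure are stand-ins, and real implementations use heaps instead of re-sorting Vecs, running this once per layer and seeding each layer with the previous layer's result:

```rust
use std::collections::HashSet;

/// Simplified single-layer beam search in the spirit of HNSW's bottom layer.
/// Keeps at most `ef` candidates and always expands the closest open node.
fn beam_search(
    graph: &[Vec<usize>],        // adjacency list: graph[node] = neighbor ids
    dist: impl Fn(usize) -> f32, // distance from a node's vector to the query
    entry: usize,
    ef: usize,
) -> Vec<usize> {
    let mut visited = HashSet::from([entry]);
    let mut best = vec![(dist(entry), entry)];     // beam, sorted closest-first
    let mut frontier = vec![(dist(entry), entry)]; // open nodes, closest last
    while let Some((d, node)) = frontier.pop() {
        // stop once the closest open node is worse than the whole beam
        if best.len() == ef && d > best.last().unwrap().0 {
            break;
        }
        for &n in &graph[node] {
            if visited.insert(n) {
                let dn = dist(n);
                // admit n only if the beam has room or n beats its worst member
                if best.len() < ef || dn < best.last().unwrap().0 {
                    best.push((dn, n));
                    best.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
                    best.truncate(ef);
                    frontier.push((dn, n));
                    frontier.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
                }
            }
        }
    }
    best.into_iter().map(|(_, n)| n).collect()
}
```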
In HNSW, ef_search controls how many candidates you keep during the bottom-layer search.
Larger values usually increase recall, at the cost of query time.
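When tuning ef_search, measure recall against an exact baseline (e.g. the brute-force scan above) rather than trusting defaults. A minimal helper, assuming both searches return neighbor ids as u32 slices:

```rust
/// recall@k: the fraction of the exact top-k neighbors that the
/// approximate search also returned.
fn recall_at_k(approx: &[u32], exact: &[u32]) -> f64 {
    let hits = approx.iter().filter(|&id| exact.contains(id)).count();
    hits as f64 / exact.len() as f64
}
```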
Notes:

- This plot is from jin's bundled "quick" profile (it's meant to show the shape of the curve).
- It does not justify a universal claim like "ef_search=50–100 gives >95% recall" for all datasets.
Higher M generally improves recall, but increases build time and memory.
Notes:
- These plots are for the labeled settings (e.g. 1K vectors for build-time; dim=128, M=16 for memory).
- Treat them as sanity checks, not as a stable performance contract.
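To sanity-check the M trade-off on your own data, a rough sweep using the constructor from the quick-start (dim, m, m_max). The `sweep_m` helper and the m_max = 2·M choice are illustrative assumptions, not jin conventions:

```rust
use jin::hnsw::HNSWIndex;

/// Build the same data at several M settings and compare build time.
/// Assumes `vectors` holds 128-dimensional rows.
fn sweep_m(vectors: &[Vec<f32>]) -> Result<(), Box<dyn std::error::Error>> {
    for m in [8, 16, 32] {
        let start = std::time::Instant::now();
        let mut index = HNSWIndex::new(128, m, m * 2)?; // dim, m, m_max
        for (id, v) in vectors.iter().enumerate() {
            index.add_slice(id as u32, v)?;
        }
        index.build()?;
        println!("M={m}: built in {:?}", start.elapsed());
    }
    Ok(())
}
```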
Different components currently assume different distance semantics. This is intentionally surfaced here because it's an easy place to make silent mistakes (e.g. forgetting to normalize vectors).
| Component | Metric | Notes |
|---|---|---|
| `hnsw::HNSWIndex` | cosine distance | Fast path assumes L2-normalized vectors |
| `ivf_pq::IVFPQIndex` | cosine distance | Uses dot-based cosine distance for IVF + PQ |
| `scann::SCANNIndex` | inner product / cosine | Uses dot products; reranking uses cosine distance |
| `hnsw::dual_branch::DualBranchHNSW` | L2 distance | Experimental implementation |
| `quantization` | Hamming-like / binary distances | See `quantization::simd_ops::hamming_distance` and ternary helpers |
Planned direction: make distance semantics explicit via a shared metric/normalization contract so that “same input vectors” means “same meaning” across indexes.
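Until that contract exists, the safest habit is to normalize up front. A minimal helper (a sketch, not a jin API):

```rust
/// L2-normalize a vector in place so that cosine distance and dot product
/// agree; do this before inserting into the cosine-based indexes above.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}
```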
| Type | Implementations |
|---|---|
| Graph | HNSW, NSW, Vamana (DiskANN), SNG |
| Hash | LSH, MinHash, SimHash |
| Partition | IVF-PQ, ScaNN |
| Quantization | PQ, RaBitQ |
Feature flags:

```toml
[dependencies]
jin = { version = "0.1.0", features = ["hnsw"] }
```

- `hnsw`: HNSW graph index (default)
- `lsh`: locality-sensitive hashing
- `ivf_pq`: inverted file with product quantization
- `persistence`: WAL-based durability
Benchmarks:

```sh
cargo bench
```

Example benchmark driver:

```sh
cargo run --example 03_quick_benchmark --release
JIN_DATASET=hard cargo run --example 03_quick_benchmark --release
```

See examples/ for more: semantic search, IVF-PQ, LSH, LID, and real dataset benchmarks.
- Malkov & Yashunin (2018). Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs.
- Subramanya et al. (2019). DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node.



