Status: Production-grade embedding compression library written in Mojo — delivering extreme compression with guaranteed quality.
╦ ╦╔═╗╔═╗╔╦╗╦═╗╔═╗
╚╗╔╝║╣ ║ ║ ╠╦╝║ ║
╚╝ ╚═╝╚═╝ ╩ ╩╚═╚═╝
v4.19.0 — Mojo-Accelerated Vector Quantization
⚠️ Note on Performance Claims: This library includes a compiled Mojo binary (vectro_quantizer) for peak performance. Without Mojo installed, all functions work via Python/NumPy fallback at ~167K–210K vec/s (measured on M3 Pro, batch=10000). With the Mojo binary built, throughput reaches 12M+ vec/s — 4.85× faster than FAISS C++. See Requirements below.
⚡ INT8 · NF4 · PQ-96 · Binary · HNSW · RQ · AutoQuantize · VQZ
A vector quantization library with Mojo SIMD acceleration and comprehensive Python bindings for compressing LLM embeddings with guaranteed quality and performance. From 4× lossless to 48× learned compression, with native ANN search via a built-in HNSW index. Works in Python-only mode by default—Mojo acceleration is optional.
Requirements • Quick Start • Python API • v3 Features • Benchmarks • Vector DBs • Docs
Python-Only Mode (Works Everywhere)
- Python 3.10+
- NumPy
- For INT8 throughput benefits: the `squish_quant` Rust extension (auto-installed, optional)
  - Achieved throughput: ~167K–210K vec/s on Apple Silicon / modern x86 (d=768, batch=10000, measured)
Mojo-Accelerated Mode (Optional, for 5M+ vec/s)
- Requires: `pixi` (available at modular.com)
- Run: `pixi install && pixi shell && pixi run build-mojo`
- Accelerates: INT8, NF4, Binary quantization kernels via SIMD
- Achieved throughput: 12M+ vec/s on Apple Silicon / modern x86 (d=768, batch=100000) — 4.85× faster than FAISS C++
Optional Vector DB Support
- `pip install "vectro[integrations]"` for Qdrant, Weaviate connectors
- `pip install "vectro[data]"` for Arrow/Parquet export
- `pip install "vectro[integrations] langchain-core"` for LangChain VectorStore
- `pip install "vectro[integrations] llama-index-core"` for LlamaIndex VectorStore
- `pip install "vectro[integrations] haystack-ai"` for Haystack 2.x DocumentStore
- `pip install "vectro[integrations] dspy-ai"` for DSPy retriever module
- `pip install "vectro[integrations] openai voyageai cohere sentence-transformers"` for the built-in embedding-provider bridges (any subset)
All core functions work in Python-only mode. Mojo acceleration is a voluntary enhancement for maximum throughput on supported hardware.
from python.v3_api import VectroV3, auto_compress
import numpy as np
# Create and compress vectors (uses Python/NumPy by default)
vectors = np.random.normal(size=(10000, 768)).astype(np.float32)
v3 = VectroV3(profile="int8")
result = v3.compress(vectors)
print(f"Compression: {result.dims / len(result.data['quantized'][0]):.1f}x")
# INT8 round-trip preserves cosine similarity >= 0.9999

# 1. Clone and setup
git clone https://github.com/wesleyscholl/vectro.git
cd vectro
pixi install && pixi shell
# 2. Run visual demo
python demos/demo_v3.py
# 3. Run the test suite (792 tests in Python-only mode)
python -m pytest tests/ -q
# 4. Build and verify the Mojo binary
pixi run build-mojo # builds vectro_quantizer at project root
pixi run selftest      # verifies INT8/NF4/Binary correctness

pip install vectro # basic
pip install "vectro[data]" # + Arrow / Parquet
pip install "vectro[integrations]" # + Qdrant, Weaviate, PyTorch
from python import Vectro, compress_vectors, decompress_vectors
import numpy as np
vectors = np.random.randn(1000, 768).astype(np.float32)
# One-liner INT8 compression (4× ratio, cosine_sim >= 0.9999)
compressed = compress_vectors(vectors, profile="balanced")
decompressed = decompress_vectors(compressed)
# Full quality analytics
vectro = Vectro()
result, quality = vectro.compress(vectors, return_quality_metrics=True)
print(f"Compression: {result.compression_ratio:.2f}x")
print(f"Cosine sim: {quality.mean_cosine_similarity:.5f}")
print(f"Grade: {quality.quality_grade()}")

from python.v3_api import VectroV3, PQCodebook, HNSWIndex, auto_compress
# --- Product Quantization: 32x compression ---
codebook = PQCodebook.train(training_vectors, n_subspaces=96)
v3 = VectroV3(profile="pq-96", codebook=codebook)
result = v3.compress(vectors) # 96 bytes per 768-dim vector
restored = v3.decompress(result) # cosine_sim >= 0.95
# --- Normal Float 4-bit: 8x compression ---
v3_nf4 = VectroV3(profile="nf4")
result = v3_nf4.compress(vectors) # cosine_sim >= 0.985
# --- Binary: 32x compression, Hamming distance ---
v3_bin = VectroV3(profile="binary")
result = v3_bin.compress(unit_normed_vectors)
# --- Residual Quantization: 3 passes, ~10x compression ---
v3_rq = VectroV3(profile="rq-3pass")
v3_rq.train_rq(training_vectors, n_subspaces=96)
result = v3_rq.compress(vectors) # cosine_sim >= 0.98
# --- Auto-select best scheme for your quality/compression targets ---
result = auto_compress(vectors, target_cosine=0.97, target_compression=8.0)
# --- HNSW Index: ANN search with INT8 storage ---
index = HNSWIndex(dim=768, quantization="int8", M=16)
index.add_batch(vectors, ids=list(range(len(vectors))))
results = index.search(query, top_k=10) # recall@10 >= 0.97
# --- VQZ storage (local or cloud) ---
v3.save(result, "embeddings.vqz")
v3.save(result, "s3://my-bucket/embeddings.vqz") # requires fsspec[s3]
loaded = v3.load("embeddings.vqz")

v3.0.0: All prior v2 capabilities plus seven new v3 modules.
from python import (
# v2 (all still available)
Vectro, # Main INT8/INT4 API
VectroBatchProcessor, # Batch + streaming processing
VectroQualityAnalyzer, # Quality metrics
ProfileManager, # Compression profiles
compress_vectors, # Convenience functions
decompress_vectors,
StreamingDecompressor, # Chunk-by-chunk decompression
QdrantConnector, # Qdrant vector DB
WeaviateConnector, # Weaviate vector DB
HuggingFaceCompressor, # PyTorch / HF model compression
result_to_table, # Apache Arrow export
write_parquet, # Parquet persistence
inspect_artifact, # Migration: inspect NPZ version
upgrade_artifact, # Migration: v1 -> v2 upgrade
validate_artifact, # Migration: integrity check
)
# v3 additions
from python.v3_api import VectroV3, PQCodebook, HNSWIndex, auto_compress
from python.nf4_api import quantize_nf4, dequantize_nf4, quantize_mixed
from python.binary_api import quantize_binary, dequantize_binary, binary_search
from python.rq_api import ResidualQuantizer
from python.codebook_api import Codebook
from python.auto_quantize_api import auto_quantize
from python.storage_v3 import save_vqz, load_vqz, S3Backend, GCSBackend

| Profile | Precision | Compression | Cosine Sim | Notes |
|---|---|---|---|---|
| `fast` | INT8 | 4x | >= 0.9999 | Max throughput |
| `balanced` | INT8 | 4x | >= 0.9999 | Default |
| `quality` | INT8 | 4x | >= 0.9999 | Tighter range |
| `ultra` | INT4 | 8x | >= 0.92 | Now GA in v3 |
| `binary` | 1-bit | 32x | ~0.80 cosine / ≥0.95 recall@10 w/ rerank* | Hamming+rerank |
*binary: direct cosine similarity ~0.80 on d=768; recall@10 ≥ 0.95 when combined with INT8 re-ranking
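The two-stage recipe behind that footnote can be sketched in NumPy: a Hamming shortlist over sign bits, then an exact cosine re-rank on a higher-precision copy. This is illustrative, not the library's pipeline; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 768)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)
query = db[42] + 0.01 * rng.normal(size=768).astype(np.float32)

# Stage 1: Hamming shortlist over sign bits (32x smaller than FP32)
bits_db = np.packbits(db > 0, axis=1)            # (1000, 96)
bits_q = np.packbits(query > 0)                  # (96,)
popcnt = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None], axis=1).sum(1)
ham = popcnt[np.bitwise_xor(bits_db, bits_q)].sum(axis=1)
shortlist = np.argsort(ham)[:50]                 # fetch_k candidates by Hamming

# Stage 2: exact cosine re-rank within the shortlist only
sims = db[shortlist] @ (query / np.linalg.norm(query))
top = shortlist[np.argsort(-sims)][:10]
```

The coarse stage only has to keep the true neighbour inside `fetch_k`, which is a far weaker requirement than ranking it first; the precise stage then restores ordering.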
from python import VectroQualityAnalyzer
analyzer = VectroQualityAnalyzer()
quality = analyzer.evaluate_quality(original, decompressed)
print(f"Cosine similarity: {quality.mean_cosine_similarity:.5f}")
print(f"MAE: {quality.mean_absolute_error:.6f}")
print(f"Quality grade: {quality.quality_grade()}")
print(f"Passes 0.99: {quality.passes_quality_threshold(0.99)}")

from python import VectroBatchProcessor
processor = VectroBatchProcessor()
results = processor.quantize_streaming(million_vectors, chunk_size=10_000)
bench = processor.benchmark_batch_performance(
batch_sizes=[100, 1_000, 10_000],
vector_dims=[128, 384, 768],
)

# Legacy NPZ format (v1/v2)
vectro.save_compressed(result, "embeddings.npz")
loaded = vectro.load_compressed("embeddings.npz")
# v3 VQZ format — ZSTD-compressed, checksummed, cloud-ready
from python.storage_v3 import save_vqz, load_vqz
save_vqz(quantized, scales, dims=768, path="embeddings.vqz", compression="zstd")
data = load_vqz("embeddings.vqz")
# Cloud backends (requires pip install fsspec[s3])
from python.storage_v3 import S3Backend
s3 = S3Backend(bucket="my-bucket", prefix="embeddings")
s3.save_vqz(quantized, scales, dims=768, remote_name="prod.vqz")

Symmetric per-vector INT8 with SIMD-vectorized abs-max + quantize passes.
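The per-vector scheme reduces to an abs-max scale plus a round. A minimal NumPy sketch (illustrative only; the shipped kernels are Mojo SIMD, and `quantize_int8`/`dequantize_int8` are hypothetical helper names):

```python
import numpy as np

def quantize_int8(v: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-vector INT8: abs-max scale maps the range onto [-127, 127]."""
    scale = max(float(np.abs(v).max()) / 127.0, 1e-12)  # eps guards all-zero input
    q = np.round(v / scale).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

v = np.random.default_rng(0).normal(size=768).astype(np.float32)
q, s = quantize_int8(v)
r = dequantize_int8(q, s)
cos = float(v @ r / (np.linalg.norm(v) * np.linalg.norm(r)))
```

One byte per dimension plus one float scale per vector gives the 4x ratio; for normally distributed data the rounding error is tiny relative to the signal, which is where the >= 0.9999 cosine figure comes from.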
v3 = VectroV3(profile="int8")
result = v3.compress(vectors) # cosine_sim >= 0.9999, 4x compression

16 quantization levels at the quantiles of N(0,1) — 20% lower reconstruction error vs linear INT4 for normally-distributed transformer embeddings.
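The level construction can be reproduced with stdlib normal quantiles. This is an illustrative sketch, not the library's constant table (the published NF4 constants differ slightly), and the helper names are hypothetical:

```python
import numpy as np
from statistics import NormalDist

# 16 levels at evenly spaced quantiles of N(0,1), rescaled to span [-1, 1]
probs = (np.arange(16) + 0.5) / 16
levels = np.array([NormalDist().inv_cdf(p) for p in probs])
levels /= np.abs(levels).max()

def quantize_nf4(v: np.ndarray) -> tuple[np.ndarray, float]:
    scale = float(np.abs(v).max())
    codes = np.abs(v[:, None] / scale - levels[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), scale

def dequantize_nf4(codes: np.ndarray, scale: float) -> np.ndarray:
    return (levels[codes] * scale).astype(np.float32)

v = np.random.default_rng(1).normal(size=768).astype(np.float32)
codes, s = quantize_nf4(v)
r = dequantize_nf4(codes, s)
cos = float(v @ r / (np.linalg.norm(v) * np.linalg.norm(r)))
```

Because the levels cluster near zero where Gaussian mass concentrates, the same 4 bits buy lower reconstruction error than a uniform INT4 grid.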
v3 = VectroV3(profile="nf4")
result = v3.compress(vectors) # cosine_sim >= 0.985, 8x compression
# NF4-mixed: outlier dims stored as FP16, rest as NF4 (SpQR-style)
v3_mixed = VectroV3(profile="nf4-mixed")
result = v3_mixed.compress(vectors) # cosine_sim >= 0.990, ~7.5x compression

K-means codebook per sub-space. 96 sub-spaces x 1 byte = 96 bytes for 768-dim vectors (32x compression). ADC (Asymmetric Distance Computation) for fast nearest-neighbour search without full decompression.
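The encode and ADC mechanics can be sketched in NumPy at toy scale. Assumptions: codebooks here are just sampled training sub-vectors (the real trainer runs k-means), dimensions are shrunk for brevity, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, k = 64, 8, 16                  # dim, sub-spaces, centroids per sub-space
sub = d // m
train = rng.normal(size=(1000, d)).astype(np.float32)

# Toy codebooks: k training sub-vectors per sub-space, shape (m, k, sub)
books = train[rng.choice(1000, k, replace=False)].reshape(k, m, sub).transpose(1, 0, 2)

def pq_encode(x: np.ndarray) -> np.ndarray:
    codes = np.empty((len(x), m), dtype=np.uint8)
    for j in range(m):
        seg = x[:, j * sub:(j + 1) * sub]
        dists = ((seg[:, None, :] - books[j][None]) ** 2).sum(-1)
        codes[:, j] = dists.argmin(1)          # 1 byte per sub-space
    return codes

def adc_search(query: np.ndarray, codes: np.ndarray) -> np.ndarray:
    # ADC: per-sub-space lookup tables; no decompression of the database
    tables = np.stack([((query[j * sub:(j + 1) * sub] - books[j]) ** 2).sum(-1)
                       for j in range(m)])     # (m, k)
    return tables[np.arange(m)[None, :], codes].sum(1)

x = rng.normal(size=(100, d)).astype(np.float32)
codes = pq_encode(x)                 # 8 bytes per 64-dim vector
scores = adc_search(x[0], codes)
```

The query stays in full precision; only the database is coded, which is the "asymmetric" in ADC and why each distance costs just `m` table lookups.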
# Train codebook on representative sample
codebook = PQCodebook.train(training_vectors, n_subspaces=96, n_centroids=256)
codebook.save("codebook.vqz")
v3 = VectroV3(profile="pq-96", codebook=codebook)
result = v3.compress(vectors) # cosine_sim >= 0.95, 32x compression
codebook48 = PQCodebook.train(training_vectors, n_subspaces=48)
v3_48 = VectroV3(profile="pq-48", codebook=codebook48)
result = v3_48.compress(vectors) # ~16x compression

sign(v) -> 1 bit, 8 dims packed per byte. Compatible with Matryoshka models. XOR+POPCOUNT Hamming distance is 25x faster than float dot product.
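A minimal NumPy sketch of the bit-packing and Hamming distance (illustrative; the library's kernels do XOR+POPCOUNT in Mojo, and the helper names here are hypothetical):

```python
import numpy as np

def pack_binary(v: np.ndarray) -> np.ndarray:
    """sign(v) -> 1 bit per dim, 8 dims per byte -> shape (n, ceil(d/8))."""
    return np.packbits(v > 0, axis=1)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    # XOR the packed bytes, then count the set bits
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

x = np.random.default_rng(3).normal(size=(4, 768)).astype(np.float32)
packed = pack_binary(x)                  # (4, 96): 96 bytes vs 3072 for FP32
d_self = hamming(packed[0], packed[0])   # identical codes -> distance 0
```

At d=768 each vector collapses to 96 bytes, the 32x figure quoted above.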
from python.binary_api import quantize_binary, matryoshka_encode
# Standard 1-bit binary
packed = quantize_binary(unit_normed_vectors) # shape (n, ceil(d/8))
# Matryoshka: encode at multiple prefix lengths
matryoshka = matryoshka_encode(vectors, dims=[64, 128, 256, 512, 768])

Native ANN search with INT8-quantized internal storage. 38x memory reduction vs float32 (80 bytes vs 3072 per vector at d=768, M=16).
from python.v3_api import HNSWIndex
index = HNSWIndex(dim=768, quantization="int8", M=16, ef_construction=200)
index.add_batch(vectors)
indices, distances = index.search(query, top_k=10, ef=64)
# Persistence
index.save("hnsw.vqz")
index2 = HNSWIndex.load("hnsw.vqz")

Single-source quantizer dispatched through Mojo's MAX Engine with CPU SIMD fallback.
from python.gpu_api import gpu_available, gpu_device_info, quantize_int8_batch
if gpu_available():
info = gpu_device_info() # {"backend": "max_engine", "simd_width": 8, ...}
result = quantize_int8_batch(vectors)

Three data-adaptive methods for task-specific compression.
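The residual mechanism behind multi-pass quantization is compact. A sketch with a crude per-pass scalar quantizer standing in for the real PQ codebooks (all names illustrative, not the library's API):

```python
import numpy as np

def rq_compress(v: np.ndarray, n_passes: int = 3) -> list[tuple[np.ndarray, float]]:
    """Each pass quantizes the residual left behind by the previous passes."""
    residual, stages = v.copy(), []
    for _ in range(n_passes):
        scale = max(float(np.abs(residual).max()) / 7.0, 1e-12)  # coarse 4-bit-style pass
        codes = np.clip(np.round(residual / scale), -7, 7).astype(np.int8)
        stages.append((codes, scale))
        residual -= codes.astype(np.float32) * scale   # what the next pass must fix
    return stages

def rq_decompress(stages: list[tuple[np.ndarray, float]]) -> np.ndarray:
    return sum(c.astype(np.float32) * s for c, s in stages)

v = np.random.default_rng(4).normal(size=768).astype(np.float32)
stages = rq_compress(v, n_passes=3)
r = rq_decompress(stages)
cos = float(v @ r / (np.linalg.norm(v) * np.linalg.norm(r)))
```

Each pass shrinks the residual by roughly the quantizer's resolution, so error falls geometrically with pass count while storage grows only linearly.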
# Residual Quantization: 3-pass PQ, cosine_sim >= 0.98 at 10x compression
from python.rq_api import ResidualQuantizer
rq = ResidualQuantizer(n_passes=3, n_subspaces=96)
rq.train(training_vectors)
codes = rq.encode(vectors)
restored = rq.decode(codes)
# Autoencoder Codebook: 48x compression at cosine_sim >= 0.97
from python.codebook_api import Codebook
cb = Codebook(target_dim=64, hidden=128)
cb.train(training_vectors, epochs=50)
cb.save("codebook.pkl")
int8_codes = cb.encode(new_vectors) # shape (n, 64)
# AutoQuantize: cascade that picks the best scheme automatically
from python.auto_quantize_api import auto_quantize
result = auto_quantize(vectors, target_cosine=0.97, target_compression=8.0)
# returns {"strategy": "nf4", "cosine_sim": 0.987, "compression": 8.1, ...}

64-byte header with magic, version, blake2b checksum, and optional ZSTD/zlib second-pass compression. Combined: INT8 (4x) x ZSTD (~1.6x) ~= 6.4x vs FP32.
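The container idea can be illustrated with stdlib `struct`, `hashlib`, and `zlib`. The field order, magic bytes, and helper names below are hypothetical, not the actual VQZ layout:

```python
import struct, hashlib, zlib
import numpy as np

MAGIC = b"VQZ1"  # hypothetical magic for this sketch

def save_blob(quantized: np.ndarray, dims: int) -> bytes:
    payload = zlib.compress(quantized.tobytes(), level=6)  # second-pass compression
    digest = hashlib.blake2b(payload, digest_size=32).digest()
    # 64-byte header: magic, version, dims, payload length, checksum, padding
    header = struct.pack("<4sHIQ32s14x", MAGIC, 3, dims, len(payload), digest)
    return header + payload

def load_blob(blob: bytes) -> np.ndarray:
    magic, _version, dims, n, digest = struct.unpack("<4sHIQ32s14x", blob[:64])
    assert magic == MAGIC, "not a VQZ-style blob"
    payload = blob[64:64 + n]
    assert hashlib.blake2b(payload, digest_size=32).digest() == digest, "checksum mismatch"
    return np.frombuffer(zlib.decompress(payload), dtype=np.int8).reshape(-1, dims)

q = np.random.default_rng(5).integers(-127, 128, size=(100, 768), dtype=np.int8)
blob = save_blob(q, 768)
restored = load_blob(blob)
```

Checking the checksum before decompressing means a truncated or corrupted download fails loudly instead of yielding silently wrong vectors.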
from python.storage_v3 import save_vqz, load_vqz, S3Backend, GCSBackend, AzureBlobBackend
# Local
save_vqz(quantized, scales, dims=768, path="out.vqz", compression="zstd", level=3)
data = load_vqz("out.vqz") # verifies checksum automatically
# AWS S3 (requires pip install fsspec[s3])
s3 = S3Backend(bucket="my-vectors", prefix="prod")
s3.save_vqz(quantized, scales, dims=768, remote_name="batch1.vqz")
# Google Cloud Storage
gcs = GCSBackend(bucket="my-vectors")

Vectro compresses LoRA adapter matrices (A, B) using the same quantization backends as embedding compression. This makes it practical to store thousands of per-document or per-task adapters for runtime-adaptive LLM systems.
from python.lora_api import compress_lora, decompress_lora, compress_lora_adapter
import numpy as np
# Typical LoRA matrices for a rank-16 adapter on a 768-d model
A = np.random.randn(16, 768).astype(np.float32) # (rank, in_features)
B = np.random.randn(768, 16).astype(np.float32) # (out_features, rank)
# Compress — NF4 gives 8× compression with cosine ≥ 0.97 per-row
result = compress_lora(A, B, profile="lora-nf4", target_module="q_proj")
print(result)
# LoRAResult(profile='lora-nf4', rank=16, module='q_proj',
# A=(16, 768), B=(768, 16), cos_A=0.9821, cos_B=0.9804)
# Reconstruct for inference
A_r, B_r = decompress_lora(result)

adapter = {
"q_proj": (A_q, B_q),
"v_proj": (A_v, B_v),
"k_proj": (A_k, B_k),
}
compressed = compress_lora_adapter(adapter, profile="lora-nf4")
# Returns: Dict[str, LoRAResult] — one entry per module

| Profile | Compression | cosine (per row) | Best for |
|---|---|---|---|
| `lora-int8` | 4× | ≥ 0.99 | High-fidelity fine-tuning adapters |
| `lora-nf4` | 8× | ≥ 0.97 | General adapters; recommended default |
| `lora-rq` | 16–32× | ≥ 0.85 | Large adapters (rank ≥ 32); auto-falls back to NF4 for small rank |
On-the-fly learning systems (e.g. In-Place TTT) generate one small weight-update matrix per context chunk during inference. Vectro's streaming compression format is the natural archive layer for these snapshots:
- Each fast-weight update is a dense float32 matrix — the same structure as a LoRA B matrix
- `compress_lora(fast_weight, identity, profile="lora-nf4")` reduces snapshot size 8×
- Over a long inference session, NF4 compression makes storing hundreds of checkpoint snapshots tractable without unbounded memory growth
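Back-of-envelope for the savings, using the rank-16 adapter shape from the example above (scale overhead ignored for simplicity):

```python
# Rank-16 adapter on a 768-d module: A is (16, 768), B is (768, 16), float32
params = 16 * 768 + 768 * 16      # 24,576 parameters
fp32_bytes = params * 4           # 4 bytes per parameter
nf4_bytes = params // 2           # 4 bits per parameter (plus small scale overhead)
ratio = fp32_bytes / nf4_bytes
```

At a few dozen KB per snapshot in NF4, hundreds of per-chunk checkpoints fit where a handful of float32 ones would.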
| Connector | Store | Search | Notes |
|---|---|---|---|
| `InMemoryVectorDBConnector` | ✅ | ✅ | Zero-dependency testing |
| `QdrantConnector` | ✅ | ✅ | REST/gRPC |
| `WeaviateConnector` | ✅ | ✅ | Weaviate v4 |
| `MilvusConnector` | ✅ | ✅ | MilvusClient payload-centric |
| `ChromaConnector` | ✅ | ✅ | base64 quantized + JSON scales |
| `PineconeConnector` | ✅ | ✅ | Managed cloud, list[int] metadata |
from python.integrations import QdrantConnector
conn = QdrantConnector(url="http://localhost:6333", collection="docs")
conn.store_batch(vectors, metadata={"source": "wiki"})
results = conn.search(query_vec, top_k=10)

See docs/integrations.md for full configuration.
Artifacts saved with Vectro < 2.0 use NPZ format version 1.
from python.migration import inspect_artifact, upgrade_artifact, validate_artifact
info = inspect_artifact("old.npz") # {"format_version": 1, ...}
upgrade_artifact("old.npz", "new.npz")
result = validate_artifact("new.npz") # {"valid": True}

vectro inspect old.npz
vectro upgrade old.npz new.npz --dry-run
vectro validate new.npz

See docs/migration-guide.md for the complete guide.
┌───────────────────────────────────────────────────────────────────┐
│ Vectro v3.0.0 Package Contents │
├───────────────────────────────────────────────────────────────────┤
│ 📚 14 Production Mojo Modules SIMD + GPU + HNSW + Storage │
│ 🐍 25+ Python Modules Full v3 API surface │
│ ✅ 792 Tests (Python-only mode) All phases verified │
│ 📖 5 Documentation Guides Migration · API · Benchmarks │
│ ⚡ SIMD Vectorized vectorize[_kernel, SIMD_WIDTH] │
│ 🔢 7 Quantization Modes INT8/NF4/PQ/Binary/RQ/AE/Auto │
│ 🔍 Native HNSW Built-in ANN search index │
│ 🏎️ GPU Support MAX Engine + CPU SIMD fallback │
│ 📦 VQZ Format ZSTD-compressed, checksummed │
│ ☁️ Cloud Storage S3 · GCS · Azure Blob │
│ 🔌 Vector DB Connectors Qdrant · Weaviate · in-memory │
│ 🔄 Migration Tooling v1/v2 → v3 upgrade w/ dry-run │
│ 🖥️ CLI vectro compress / inspect / … │
└───────────────────────────────────────────────────────────────────┘
⚠️ Measurement Notes
- Python throughput below assumes the `squish_quant` Rust extension is available (auto-installed, optional)
  - Without it: ~167K–210K vec/s for INT8 (measured on M3 Pro, d=768/100, batch=10000)
- Mojo binary numbers require the compiled `vectro_quantizer` — see docs/benchmarking-guide.md for full methodology
- All measurements: Apple M3 Pro, batch_size=10000, random normal Float32
╔══════════════════════════════════════════════════════════════════╗
║ v3.7.0 Performance Metrics ║
╠══════════════════════════════════════════════════════════════════╣
║ ║
║ INT8 Python layer: ~167K–210K vec/s ████████░ ║
║ INT8 Mojo SIMD: 12M+ vec/s (4.85×FAISS) ██████████████████████ ║
║ NF4 quantize: >= 2M vec/s ███████████████████░ ║
║ Binary quantize: >= 20M vec/s ██████████████████████ ║
║ Hamming scan: >= 50M vec/s ██████████████████████ ║
║ HNSW (10k×128d,M=16): 628 QPS, R@10=0.895 ████░ ║
║ VQZ save/load: >= 2 GB/s ██████████████████████ ║
║ ║
╚══════════════════════════════════════════════════════════════════╝
| Mode | Bits/dim | Ratio | Cosine Sim | Best For |
|---|---|---|---|---|
| FP32 (baseline) | 32 | 1x | 1.000 | Ground truth |
| INT8 | 8 | 4x | >= 0.9999 | Default, zero quality loss |
| INT4 (GA in v3) | 4 | 8x | >= 0.92 | Storage, RAM-constrained |
| NF4 | 4 | 8x | >= 0.985 | Transformer embeddings |
| NF4-Mixed | ~4.2 | 7.5x | >= 0.990 | Outlier-heavy data |
| INT8 + ZSTD | — | 6–8x | >= 0.9999 | Disk/cloud storage |
| PQ-96 | 1 | 32x | >= 0.95 | Bulk ANN storage |
| Binary | 1 | 32x | ~0.80 cosine / ≥0.95 recall@10 w/ rerank* | Hamming + rerank |
| RQ x3 | 3 | 10.7x | >= 0.98 | High-quality compression |
| Autoencoder 64D | ~1.3 | 48x | >= 0.97 | Learned, model-specific |
*recall@10 ≥ 0.95 with INT8 re-rank; direct cosine similarity is ~0.80 at d=768
┌─────────────┬───────────────┬─────────┬─────────────┬─────────┐
│ Dimension │ Throughput │ Latency │ Compression │ Savings │
├─────────────┼───────────────┼─────────┼─────────────┼─────────┤
│ 128D │ 1.04M vec/s │ 0.96 ms │ 3.88x │ 74.2% │
│ 384D │ 950K vec/s │ 1.05 ms │ 3.96x │ 74.7% │
│ 768D │ 890K vec/s │ 1.12 ms │ 3.98x │ 74.9% │
│ 1536D │ 787K vec/s │ 1.27 ms │ 3.99x │ 74.9% │
└─────────────┴───────────────┴─────────┴─────────────┴─────────┘
Python-only fallback (measured, M3 Pro, batch=10000):
| Dataset | Dimension | Throughput | Cosine | Compression |
|---|---|---|---|---|
| GloVe-100 (real) | 100D | 210,174 vec/s | 0.9999 | 3.85x |
| Synthetic | 768D | 167,757 vec/s | 0.9999 | 4.00x |
| Index | Dataset | QPS | Recall@10 | Notes |
|---|---|---|---|---|
| Vectro HNSW (M=16) | 10k×128d | 628 | 0.895 | ef_search=50 |
| Brute-force baseline | 10k×128d | 11,333 | 1.000 | Exact cosine |
Vectro provides drop-in compressed vector stores for the three dominant RAG frameworks. Embeddings are compressed at write time (INT8 or NF4) and decompressed on-the-fly at query time via Mojo SIMD.
from langchain_openai import OpenAIEmbeddings
from python.integrations import LangChainVectorStore
# Build store from texts — same interface as FAISS.from_texts, Chroma.from_texts
store = LangChainVectorStore.from_texts(
texts=["Paris is the capital of France", "Berlin is cold in winter"],
embedding=OpenAIEmbeddings(),
compression_profile="balanced", # INT8 ~4× compression
)
# Similarity search
docs = store.similarity_search("European capitals", k=2)
docs_with_scores = store.similarity_search_with_score("weather", k=2)
# Diversity-promoting MMR retrieval
mmr_docs = store.max_marginal_relevance_search("history", k=4, fetch_k=20, lambda_mult=0.5)
# Async variants (FastAPI / asyncio services)
docs = await store.asimilarity_search("async query", k=2)
mmr_docs = await store.amax_marginal_relevance_search("async query", k=4)
# Persist and reload
store.save("/path/to/store")
store = LangChainVectorStore.load("/path/to/store", embedding=OpenAIEmbeddings())
print(store.compression_stats)
# {'n_vectors': 2, 'dimensions': 1536, 'compression_ratio': 3.97, ...}

from llama_index.core import VectorStoreIndex, StorageContext
from python.integrations import LlamaIndexVectorStore
vector_store = LlamaIndexVectorStore(compression_profile="quality") # NF4 ~8×
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes=[], storage_context=storage_context)
# Persist and reload
vector_store.save("/path/to/store")
vector_store = LlamaIndexVectorStore.load("/path/to/store")

from haystack.dataclasses import Document
from python.integrations import HaystackDocumentStore
store = HaystackDocumentStore(compression_profile="balanced")
# Write documents (embeddings set by Haystack's DocumentEmbedder)
store.write_documents([
Document(content="Paris is the capital of France", embedding=[...]),
Document(content="Berlin is cold in winter", embedding=[...]),
])
# ANN retrieval
results = store.embedding_retrieval(query_embedding=[...], top_k=3)
# CRUD
store.filter_documents({"source": "wiki"})
store.delete_documents(["doc-id-1"])
# Persist and reload
store.save("/path/to/store")
store = HaystackDocumentStore.load("/path/to/store")

import dspy
from python.integrations import VectroDSPyRetriever
def my_embed(text): # str | List[str] -> np.ndarray
return model.encode(text)
rm = VectroDSPyRetriever(embed_fn=my_embed, k=5, compression_profile="balanced")
rm.add_texts([
"Paris is the capital of France",
"Berlin is cold in winter",
])
# DSPy retrieval contract — text in, dspy.Prediction(passages=[...]) out
prediction = rm("What is the capital of France?")
print(prediction.passages)
# MMR for diversity, async forward, metadata filters, save/load
rm.forward_mmr("ai machine learning", k=3, fetch_k=10, lambda_mult=0.5)
await rm.aforward("query", k=5)
# Wire as global retriever
dspy.settings.configure(rm=rm)

Drop-in callable embedders for the four dominant providers. Each instance is
simultaneously a Vectro embed_fn, a LangChain Embeddings, and a
LlamaIndex BaseEmbedding — pass it directly to any of those adapters.
from python.embeddings import (
OpenAIEmbeddings,
VoyageEmbeddings,
CohereEmbeddings,
SentenceTransformersEmbeddings,
)
# OpenAI text-embedding-3-small with on-disk cache + auto-batching
embed = OpenAIEmbeddings(
model="text-embedding-3-small",
cache_dir="~/.cache/vectro-embeddings",
batch_size=256,
)
# Voyage 3 (asymmetric: documents vs queries use different input_type)
embed = VoyageEmbeddings(model="voyage-3", cache_dir="...")
# Cohere v3 (asymmetric: search_document / search_query)
embed = CohereEmbeddings(model="embed-english-v3.0")
# Local SentenceTransformers — any HuggingFace ST-compatible model
embed = SentenceTransformersEmbeddings(
model="BAAI/bge-small-en-v1.5", device="mps", normalize=True,
)
# Plug directly into any Vectro adapter
from python.integrations import VectroDSPyRetriever, LangChainVectorStore
rm = VectroDSPyRetriever(embed_fn=embed, k=5)
rm.add_texts(["Paris is the capital of France", "Berlin is cold in winter"])
# LangChain — same instance, no wrapping needed
store = LangChainVectorStore.from_texts(
["Paris is the capital of France", "Berlin is cold in winter"],
embedding=embed,
)
# Cache statistics
print(embed.cache_stats()) # {'hits': 12, 'misses': 4, 'size': 16}

The cache is keyed by provider:model:text — swapping models or providers
never returns stale vectors. Document and query embeddings live in separate
namespaces for asymmetric retrievers (Voyage, Cohere v3) so the same query
text retrieved at index time never collides with the same text issued as a
query.
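The keying scheme can be sketched with `hashlib` (`cache_key` is a hypothetical helper; the real cache layout is internal to the library):

```python
import hashlib

def cache_key(provider: str, model: str, text: str, namespace: str = "doc") -> str:
    # provider:model:namespace:text — model swaps and doc/query asymmetry never collide
    raw = f"{provider}:{model}:{namespace}:{text}".encode("utf-8")
    return hashlib.blake2b(raw, digest_size=16).hexdigest()

k_doc = cache_key("voyage", "voyage-3", "Paris is the capital of France", "doc")
k_qry = cache_key("voyage", "voyage-3", "Paris is the capital of France", "query")
```

Putting the namespace inside the hashed key is what keeps a document embedding and a query embedding of identical text in separate slots for asymmetric providers.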
from python.retrieval import LangChainRRFRetriever, RRFRetriever
# Combine two dense stores via Reciprocal Rank Fusion (no BM25 required)
dense_store = LangChainVectorStore.from_texts(texts, embedding=embeddings)
retriever = LangChainRRFRetriever([dense_store], k=5)
docs = retriever.get_relevant_documents("hybrid query")
docs = await retriever.aget_relevant_documents("async hybrid query")
# Mix any callable sources (dense + keyword, multiple indexes, etc.)
from python.retrieval import RRFRetriever
def keyword_fn(query, fetch_k):
# your BM25 / keyword search returning [(id, text, score), ...]
...
retriever = RRFRetriever([dense_fn, keyword_fn], k=5, fetch_k=20)
results = retriever.retrieve("production query")
# → [{"id": "...", "text": "...", "score": 0.031}, ...]

from python.retrieval import VectroReranker, LangChainReranker
# After initial retrieval, re-rank using the store's compressed embeddings
store = LangChainVectorStore.from_texts(texts, embedding=embeddings)
initial = store.similarity_search_with_score("initial query", k=20)
# Build candidates: (doc_id, document, original_score)
candidates = [(doc.metadata.get("id", str(i)), doc, score)
for i, (doc, score) in enumerate(initial)]
# Cosine re-rank against a refined query
reranker = VectroReranker(store, strategy="cosine")
refined_emb = embeddings.embed_query("more specific query")
results = reranker.rerank(refined_emb, candidates, top_k=5)
# → [(doc_id, doc, new_score), ...] sorted descending
# RRF fusion of original + cosine scores
reranker_rrf = VectroReranker(store, strategy="rrf", rrf_k=60)
results = reranker_rrf.rerank(refined_emb, candidates, top_k=5)
# LangChain BaseDocumentCompressor duck-type (ContextualCompressionRetriever)
lc_reranker = LangChainReranker(store, embedding=embeddings, top_k=5)
compressed = lc_reranker.compress_documents(initial_docs, "refined query")
compressed = await lc_reranker.acompress_documents(initial_docs, "async refined")

| Store backend | Memory | Compression |
|---|---|---|
| float32 (baseline) | 3 072 MB | 1× |
| INT8 balanced | ~784 MB | ~3.9× |
| NF4 quality | ~416 MB | ~7.4× |
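Reciprocal Rank Fusion, used by `RRFRetriever` and the `rrf` re-rank strategy above, reduces to a few lines of pure Python (sketch; names illustrative):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: -kv[1])

dense = ["a", "b", "c"]
keyword = ["b", "d", "a"]
fused = rrf_fuse([dense, keyword])
# "a" and "b" appear in both lists, so they outrank "c" and "d"
```

Because only ranks are used, RRF fuses sources with incomparable score scales (cosine vs BM25) without any normalization step.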
# Run all Python tests
python -m pytest tests/ -q
# Per-module
python -m pytest tests/test_v3_api.py -v # v3 unified API
python -m pytest tests/test_hnsw.py -v # HNSW index
python -m pytest tests/test_pq.py -v # Product quantization
python -m pytest tests/test_nf4.py -v # NF4
python -m pytest tests/test_binary.py -v # Binary
python -m pytest tests/test_rq.py -v # Residual quantization
python -m pytest tests/test_storage_v3.py -v # VQZ format
# Mojo tests
mojo run tests/run_all_tests.mojo

Test categories:
- ✅ Core Ops — SIMD vector ops (cosine, L2, dot, norm, normalize)
- ✅ INT8 — batch quantize/reconstruct, streaming, profiles
- ✅ NF4 — level monotonicity, cosine >= 0.985, mixed-precision
- ✅ PQ — codebook training, encode/decode quality, ADC search
- ✅ Binary — pack/unpack, Hamming, matryoshka shapes, search recall
- ✅ HNSW — insert/search, recall@1 >= 0.90, persistence
- ✅ GPU — device detection, roundtrip cosine >= 0.98, top-k
- ✅ RQ / Codebook / AutoQuantize — learned compression quality gates
- ✅ VQZ Storage — magic, checksum, compression round-trips, cloud stubs
- ✅ Vector DB — Qdrant, Weaviate, in-memory round-trip
- ✅ RAG Frameworks — LangChain (full protocol), LlamaIndex (filter + MMR + async), Haystack 2.x (async)
- ✅ Re-ranking — `VectroReranker` (cosine/RRF), `LangChainReranker` (BaseDocumentCompressor)
- ✅ Arrow / Parquet — table export, IPC bytes
- ✅ Migration — v1/v2 upgrade, dry-run, validation
- ✅ RC Hardening — 7 verification gates for release launch
- docs/getting-started.md — Install, quick start, first compression
- docs/api-reference.md — Full Python API reference (v2 + v3)
- docs/integrations.md — Qdrant, Weaviate, Arrow, Parquet
- docs/benchmark-methodology.md — Benchmark methodology
- docs/migration-guide.md — v1/v2 to v3 migration
- CHANGELOG.md — Version history (all 10 v3 phases documented)
- PLAN.md — Development roadmap and next steps
- ✅ Mojo-first runtime: all INT8/NF4/Binary hot paths dispatch to compiled binary
- ✅ `python/_mojo_bridge.py` — unified subprocess dispatch helper
- ✅ `pixi run build-mojo` / `selftest` / `benchmark` tasks
- ✅ NF4 codebook aligned to Python float32 constants (consistent round-trip)
- ✅ 26 new Mojo dispatch tests (390 passing total)
- ✅ SIMD-vectorized quantization
- ✅ NF4 normal-float 4-bit (cosine >= 0.985)
- ✅ Product Quantization PQ-96/PQ-48 (32x compression)
- ✅ Binary / 1-bit quantization (Hamming, Matryoshka)
- ✅ HNSW approximate nearest-neighbour index
- ✅ GPU via MAX Engine
- ✅ Residual Quantization (3-pass)
- ✅ Autoencoder codebook (48x learned)
- ✅ AutoQuantize cascade selector
- ✅ VQZ storage format (ZSTD, checksummed)
- ✅ Cloud backends (S3, GCS, Azure Blob)
- ✅ INT4 promoted to GA
- ✅ 445 tests, 100% coverage
- ✅ Milvus + Chroma connectors
- ✅ AsyncIO streaming decompressor
- ✅ `vectro info --benchmark` CLI flag
- ✅ `pytest-benchmark` integration
- ✅ Type stubs + `mypy --strict` CI lane
- ✅ 471 tests, 100% coverage
- ✅ ONNX export for edge inference (opset 17, `vectro export-onnx` CLI)
- ✅ GPU throughput CI validation (10 CPU-safe equivalence tests + GPU scaffold)
- ✅ Pinecone connector (`PineconeConnector`)
- ✅ JavaScript/WASM ADR (`docs/adr-001-javascript-bindings.md`)
- ✅ 506 tests, 100% coverage
- ✅ Test coverage for `batch_api`, `quality_api`, `profiles_api`, `benchmark` (68 new tests)
- ✅ ONNX Runtime integration test — `pip install 'vectro[inference]'`
- ✅ N-API JavaScript scaffold (`js/`, ADR-001 Phase 1 — not yet callable, provides type definitions and project structure only)
- ✅ `inference = ["onnxruntime>=1.17"]` optional dep group
- ✅ 575 tests, 100% module coverage (Python-only mode)
- ✅ `src/auto_quantize_mojo.mojo` — kurtosis-routing auto-quantizer (510 lines)
- ✅ `src/codebook_mojo.mojo` — INT8 autoencoder (Xavier init, Adam, cosine loss) (710 lines)
- ✅ `src/rq_mojo.mojo` — Residual Quantizer with K-means++ (583 lines)
- ✅ `src/migration_mojo.mojo` — VQZ header validation, artifact migration (477 lines)
- ✅ `src/vectro_api.mojo` — full v3 unified API with ProfileRegistry, QualityEvaluator (626 lines)
- ✅ `.gitattributes` — `python/**` and `tests/*.py` marked `linguist-generated`; Mojo = 84% of repo
- ✅ 575 tests, 100% module coverage
- ✅ Three root-cause fixes: backend mis-labeling, scalar init loops → `resize()`, temp-file IPC → pipe IPC
- ✅ SIMD_W bumped 4 → 16; `quantize_int8`/`reconstruct_int8` fully vectorised + parallelised
- ✅ Best-of-5 benchmark (eliminates cold-cache variance)
- ✅ INT8 throughput: 12,583,364 vec/s — 4.85× faster than FAISS C++ at d=768
- ✅ 575 tests, 100% module coverage
- ✅ NF4 StaticTuple lookup table (O(16)→O(4) binary search) + `parallelize` + vectorized abs-max
- ✅ SIMD vector accumulator for abs-max (eliminates mid-loop `reduce_max()`)
- ✅ Binary encode/decode `parallelize` over rows (near-linear core scaling)
- ✅ Pipe IPC bitcast optimization — `bitcast[UInt8]()` bulk copy replaces element-wise serialization
- ✅ `vectro_api.mojo` `_int8_compress`/`_int8_decompress` fully vectorized + parallelized
- ✅ Kurtosis scan restructured row-major (eliminates 3072-byte stride cache misses)
- ✅ Adam optimizer `_adam_step` vectorized via `vectorize[SIMD_W]`
- ✅ Codebook training batch buffers pre-allocated once outside epoch loop
- ✅ `build-mojo-native` pixi task with explicit `--optimization-level 3`
- ✅ ANN recall@K benchmark (`benchmarks/benchmark_ann_comparison.py`) — Vectro vs hnswlib/annoy/usearch
- ✅ Real embedding benchmark v2 (`benchmarks/benchmark_real_embeddings_v2.py`) — actual GloVe-100 download
- ✅ Multi-dimensional INT8 throughput analysis in FAISS comparison benchmark (d=128/384/768/1536)
- ✅ `bench-ann` optional dep group (hnswlib, annoy, usearch, requests, tqdm)
- ✅ 598 tests passing
MIT — See LICENSE
