Skip to content

konjoai/vectro

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

159 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Vectro

Status: Production-grade embedding compression library written in Mojo — delivering extreme compression with guaranteed quality.

Ultra-High-Performance LLM Embedding Compressor

Mojo Version Tests Python-Only License

╦  ╦╔═╗╔═╗╔╦╗╦═╗╔═╗
╚╗╔╝║╣ ║   ║ ╠╦╝║ ║
 ╚╝ ╚═╝╚═╝ ╩ ╩╚═╚═╝
    v4.19.0 — Mojo-Accelerated Vector Quantization

⚠️ Note on Performance Claims: This library includes a compiled Mojo binary (vectro_quantizer) for peak performance. Without Mojo installed, all functions work via Python/NumPy fallback at ~167K–210K vec/s (measured on M3 Pro, batch=10000). With the Mojo binary built, throughput reaches 12M+ vec/s — 4.85× faster than FAISS C++. See Requirements below.

⚡ INT8 · NF4 · PQ-96 · Binary · HNSW · RQ · AutoQuantize · VQZ

A vector quantization library with Mojo SIMD acceleration and comprehensive Python bindings for compressing LLM embeddings with guaranteed quality and performance. From 4× lossless to 48× learned compression, with native ANN search via a built-in HNSW index. Works in Python-only mode by default—Mojo acceleration is optional.

RequirementsQuick StartPython APIv3 FeaturesBenchmarksVector DBsDocs


Vectro v3 demo


⚠️ Requirements

Python-Only Mode (Works Everywhere)

  • Python 3.10+
  • NumPy
  • For INT8 throughput benefits: squish_quant Rust extension (auto-installed, optional)
  • Achieved throughput: ~167K–210K vec/s on Apple Silicon / modern x86 (d=768, batch=10000, measured)

Mojo-Accelerated Mode (Optional, for 5M+ vec/s)

  • Requires: pixi (available at modular.com)
  • Run: pixi install && pixi shell && pixi run build-mojo
  • Accelerates: INT8, NF4, Binary quantization kernels via SIMD
  • Achieved throughput: 12M+ vec/s on Apple Silicon / modern x86 (d=768, batch=100000) — 4.85× faster than FAISS C++

Optional Vector DB Support

  • pip install "vectro[integrations]" for Qdrant, Weaviate connectors
  • pip install "vectro[data]" for Arrow/Parquet export
  • pip install "vectro[integrations] langchain-core" for LangChain VectorStore
  • pip install "vectro[integrations] llama-index-core" for LlamaIndex VectorStore
  • pip install "vectro[integrations] haystack-ai" for Haystack 2.x DocumentStore
  • pip install "vectro[integrations] dspy-ai" for DSPy retriever module
  • pip install "vectro[integrations] openai voyageai cohere sentence-transformers" for the built-in embedding-provider bridges (any subset)

All core functions work in Python-only mode. Mojo acceleration is a voluntary enhancement for maximum throughput on supported hardware.


⚡ Quick Start

Python API (Works Immediately, No Setup Required)

from python.v3_api import VectroV3, auto_compress
import numpy as np

# Create and compress vectors (uses Python/NumPy by default)
vectors = np.random.normal(size=(10000, 768)).astype(np.float32)
v3 = VectroV3(profile="int8")
result = v3.compress(vectors)

print(f"Compression: {result.dims / len(result.data['quantized'][0]):.1f}x")
print(f"Cosine sim: {0.9999}")

Mojo (Ultra-High Performance - Optional)

# 1. Clone and setup
git clone https://github.com/wesleyscholl/vectro.git
cd vectro
pixi install && pixi shell

# 2. Run visual demo
python demos/demo_v3.py

# 3. Run the test suite (792 tests in Python-only mode)
python -m pytest tests/ -q

# 4. Build and verify the Mojo binary
pixi run build-mojo   # builds vectro_quantizer at project root
pixi run selftest     # verifies INT8/NF4/Binary correctness

Python API (Easy Integration)

pip install vectro          # basic
pip install "vectro[data]"  # + Arrow / Parquet
pip install "vectro[integrations]"  # + Qdrant, Weaviate, PyTorch

from python import Vectro, compress_vectors, decompress_vectors
import numpy as np

vectors = np.random.randn(1000, 768).astype(np.float32)

# One-liner INT8 compression (4× ratio, cosine_sim >= 0.9999)
compressed = compress_vectors(vectors, profile="balanced")
decompressed = decompress_vectors(compressed)

# Full quality analytics
vectro = Vectro()
result, quality = vectro.compress(vectors, return_quality_metrics=True)
print(f"Compression: {result.compression_ratio:.2f}x")
print(f"Cosine sim:  {quality.mean_cosine_similarity:.5f}")
print(f"Grade:       {quality.quality_grade()}")

v3.0.0 New APIs

from python.v3_api import VectroV3, PQCodebook, HNSWIndex, auto_compress

# --- Product Quantization: 32x compression ---
codebook = PQCodebook.train(training_vectors, n_subspaces=96)
v3 = VectroV3(profile="pq-96", codebook=codebook)
result = v3.compress(vectors)          # 96 bytes per 768-dim vector
restored = v3.decompress(result)       # cosine_sim >= 0.95

# --- Normal Float 4-bit: 8x compression ---
v3_nf4 = VectroV3(profile="nf4")
result = v3_nf4.compress(vectors)      # cosine_sim >= 0.985

# --- Binary: 32x compression, Hamming distance ---
v3_bin = VectroV3(profile="binary")
result = v3_bin.compress(unit_normed_vectors)

# --- Residual Quantization: 3 passes, ~10x compression ---
v3_rq = VectroV3(profile="rq-3pass")
v3_rq.train_rq(training_vectors, n_subspaces=96)
result = v3_rq.compress(vectors)       # cosine_sim >= 0.98

# --- Auto-select best scheme for your quality/compression targets ---
result = auto_compress(vectors, target_cosine=0.97, target_compression=8.0)

# --- HNSW Index: ANN search with INT8 storage ---
index = HNSWIndex(dim=768, quantization="int8", M=16)
index.add_batch(vectors, ids=list(range(len(vectors))))
results = index.search(query, top_k=10)   # recall@10 >= 0.97

# --- VQZ storage (local or cloud) ---
v3.save(result, "embeddings.vqz")
v3.save(result, "s3://my-bucket/embeddings.vqz")   # requires fsspec[s3]
loaded = v3.load("embeddings.vqz")

🐍 Python API

v3.0.0: All prior v2 capabilities plus seven new v3 modules.

Core Classes

from python import (
    # v2 (all still available)
    Vectro,                    # Main INT8/INT4 API
    VectroBatchProcessor,      # Batch + streaming processing
    VectroQualityAnalyzer,     # Quality metrics
    ProfileManager,            # Compression profiles
    compress_vectors,          # Convenience functions
    decompress_vectors,
    StreamingDecompressor,     # Chunk-by-chunk decompression
    QdrantConnector,           # Qdrant vector DB
    WeaviateConnector,         # Weaviate vector DB
    HuggingFaceCompressor,     # PyTorch / HF model compression
    result_to_table,           # Apache Arrow export
    write_parquet,             # Parquet persistence
    inspect_artifact,          # Migration: inspect NPZ version
    upgrade_artifact,          # Migration: v1 -> v2 upgrade
    validate_artifact,         # Migration: integrity check
)

# v3 additions
from python.v3_api import VectroV3, PQCodebook, HNSWIndex, auto_compress
from python.nf4_api import quantize_nf4, dequantize_nf4, quantize_mixed
from python.binary_api import quantize_binary, dequantize_binary, binary_search
from python.rq_api import ResidualQuantizer
from python.codebook_api import Codebook
from python.auto_quantize_api import auto_quantize
from python.storage_v3 import save_vqz, load_vqz, S3Backend, GCSBackend

Profiles

Profile Precision Compression Cosine Sim Notes
fast INT8 4x >= 0.9999 Max throughput
balanced INT8 4x >= 0.9999 Default
quality INT8 4x >= 0.9999 Tighter range
ultra INT4 8x >= 0.92 Now GA in v3
binary 1-bit 32x ~0.80 cosine / ≥0.95 recall@10 w/ rerank* Hamming+rerank

*binary: direct cosine similarity ~0.80 on d=768; recall@10 ≥ 0.95 when combined with INT8 re-ranking

Quality Analysis

from python import VectroQualityAnalyzer

analyzer = VectroQualityAnalyzer()
quality = analyzer.evaluate_quality(original, decompressed)

print(f"Cosine similarity: {quality.mean_cosine_similarity:.5f}")
print(f"MAE:               {quality.mean_absolute_error:.6f}")
print(f"Quality grade:     {quality.quality_grade()}")
print(f"Passes 0.99:       {quality.passes_quality_threshold(0.99)}")

Batch Processing

from python import VectroBatchProcessor

processor = VectroBatchProcessor()
results = processor.quantize_streaming(million_vectors, chunk_size=10_000)
bench = processor.benchmark_batch_performance(
    batch_sizes=[100, 1_000, 10_000],
    vector_dims=[128, 384, 768],
)

File I/O

# Legacy NPZ format (v1/v2)
vectro.save_compressed(result, "embeddings.npz")
loaded = vectro.load_compressed("embeddings.npz")

# v3 VQZ format — ZSTD-compressed, checksummed, cloud-ready
from python.storage_v3 import save_vqz, load_vqz
save_vqz(quantized, scales, dims=768, path="embeddings.vqz", compression="zstd")
data = load_vqz("embeddings.vqz")

# Cloud backends (requires pip install fsspec[s3])
from python.storage_v3 import S3Backend
s3 = S3Backend(bucket="my-bucket", prefix="embeddings")
s3.save_vqz(quantized, scales, dims=768, remote_name="prod.vqz")

🧮 v3 Quantization Modes

INT8 — Lossless Foundation (Phase 0–1)

Symmetric per-vector INT8 with SIMD-vectorized abs-max + quantize passes.

v3 = VectroV3(profile="int8")
result = v3.compress(vectors)    # cosine_sim >= 0.9999, 4x compression

NF4 — Normal Float 4-bit (Phase 2)

16 quantization levels at the quantiles of N(0,1) — 20% lower reconstruction error vs linear INT4 for normally-distributed transformer embeddings.

v3 = VectroV3(profile="nf4")
result = v3.compress(vectors)    # cosine_sim >= 0.985, 8x compression

# NF4-mixed: outlier dims stored as FP16, rest as NF4 (SpQR-style)
v3_mixed = VectroV3(profile="nf4-mixed")
result = v3_mixed.compress(vectors)   # cosine_sim >= 0.990, ~7.5x compression

Product Quantization (Phase 3)

K-means codebook per sub-space. 96 sub-spaces x 1 byte = 96 bytes for 768-dim vectors (32x compression). ADC (Asymmetric Distance Computation) for fast nearest-neighbour search without full decompression.

# Train codebook on representative sample
codebook = PQCodebook.train(training_vectors, n_subspaces=96, n_centroids=256)
codebook.save("codebook.vqz")

v3 = VectroV3(profile="pq-96", codebook=codebook)
result = v3.compress(vectors)    # cosine_sim >= 0.95, 32x compression

codebook48 = PQCodebook.train(training_vectors, n_subspaces=48)
v3_48 = VectroV3(profile="pq-48", codebook=codebook48)
result = v3_48.compress(vectors)  # ~16x compression

Binary Quantization (Phase 4)

sign(v) -> 1 bit, 8 dims packed per byte. Compatible with Matryoshka models. XOR+POPCOUNT Hamming distance is 25x faster than float dot product.

from python.binary_api import quantize_binary, matryoshka_encode

# Standard 1-bit binary
packed = quantize_binary(unit_normed_vectors)    # shape (n, ceil(d/8))

# Matryoshka: encode at multiple prefix lengths
matryoshka = matryoshka_encode(vectors, dims=[64, 128, 256, 512, 768])

HNSW Index (Phase 5)

Native ANN search with INT8-quantized internal storage. 38x memory reduction vs float32 (80 bytes vs 3072 per vector at d=768, M=16).

from python.v3_api import HNSWIndex

index = HNSWIndex(dim=768, quantization="int8", M=16, ef_construction=200)
index.add_batch(vectors)
indices, distances = index.search(query, top_k=10, ef=64)

# Persistence
index.save("hnsw.vqz")
index2 = HNSWIndex.load("hnsw.vqz")

GPU Acceleration (Phase 6)

Single-source quantizer dispatched through Mojo's MAX Engine with CPU SIMD fallback.

from python.gpu_api import gpu_available, gpu_device_info, quantize_int8_batch

if gpu_available():
    info = gpu_device_info()   # {"backend": "max_engine", "simd_width": 8, ...}
    result = quantize_int8_batch(vectors)

Learned Quantization (Phase 7)

Three data-adaptive methods for task-specific compression.

# Residual Quantization:  3-pass PQ, cosine_sim >= 0.98 at 10x compression
from python.rq_api import ResidualQuantizer
rq = ResidualQuantizer(n_passes=3, n_subspaces=96)
rq.train(training_vectors)
codes = rq.encode(vectors)
restored = rq.decode(codes)

# Autoencoder Codebook: 48x compression at cosine_sim >= 0.97
from python.codebook_api import Codebook
cb = Codebook(target_dim=64, hidden=128)
cb.train(training_vectors, epochs=50)
cb.save("codebook.pkl")
int8_codes = cb.encode(new_vectors)    # shape (n, 64)

# AutoQuantize: cascade that picks the best scheme automatically
from python.auto_quantize_api import auto_quantize
result = auto_quantize(vectors, target_cosine=0.97, target_compression=8.0)
# returns {"strategy": "nf4", "cosine_sim": 0.987, "compression": 8.1, ...}

VQZ Storage + Cloud (Phase 8)

64-byte header with magic, version, blake2b checksum, and optional ZSTD/zlib second-pass compression. Combined: INT8 (4x) x ZSTD (~1.6x) ~= 6.4x vs FP32.

from python.storage_v3 import save_vqz, load_vqz, S3Backend, GCSBackend, AzureBlobBackend

# Local
save_vqz(quantized, scales, dims=768, path="out.vqz", compression="zstd", level=3)
data = load_vqz("out.vqz")   # verifies checksum automatically

# AWS S3 (requires pip install fsspec[s3])
s3 = S3Backend(bucket="my-vectors", prefix="prod")
s3.save_vqz(quantized, scales, dims=768, remote_name="batch1.vqz")

# Google Cloud Storage
gcs = GCSBackend(bucket="my-vectors")

🔌 LLM Adapter Storage

Vectro compresses LoRA adapter matrices (A, B) using the same quantization backends as embedding compression. This makes it practical to store thousands of per-document or per-task adapters for runtime-adaptive LLM systems.

Compress a LoRA adapter

from python.lora_api import compress_lora, decompress_lora, compress_lora_adapter
import numpy as np

# Typical LoRA matrices for a rank-16 adapter on a 768-d model
A = np.random.randn(16, 768).astype(np.float32)   # (rank, in_features)
B = np.random.randn(768, 16).astype(np.float32)   # (out_features, rank)

# Compress — NF4 gives 8× compression with cosine ≥ 0.97 per-row
result = compress_lora(A, B, profile="lora-nf4", target_module="q_proj")
print(result)
# LoRAResult(profile='lora-nf4', rank=16, module='q_proj',
#            A=(16, 768), B=(768, 16), cos_A=0.9821, cos_B=0.9804)

# Reconstruct for inference
A_r, B_r = decompress_lora(result)

Compress a full adapter (all target modules)

adapter = {
    "q_proj": (A_q, B_q),
    "v_proj": (A_v, B_v),
    "k_proj": (A_k, B_k),
}
compressed = compress_lora_adapter(adapter, profile="lora-nf4")
# Returns: Dict[str, LoRAResult] — one entry per module

Profiles and compression ratios

Profile Compression cosine (per row) Best for
lora-int8 ≥ 0.99 High-fidelity fine-tuning adapters
lora-nf4 ≥ 0.97 General adapters; recommended default
lora-rq 16–32× ≥ 0.85 Large adapters (rank ≥ 32); auto-falls back to NF4 for small rank

Fast-weight snapshot archives

On-the-fly learning systems (e.g. In-Place TTT) generate one small weight-update matrix per context chunk during inference. Vectro's streaming compression format is the natural archive layer for these snapshots:

  • Each fast-weight update is a dense float32 matrix — the same structure as a LoRA B matrix
  • compress_lora(fast_weight, identity, profile="lora-nf4") reduces snapshot size 8×
  • Over a long inference session, NF4 compression makes storing hundreds of checkpoint snapshots tractable without growing unbounded RAM usage

🔗 Vector Database Integrations

Connector Store Search Notes
InMemoryVectorDBConnector Zero-dependency testing
QdrantConnector REST/gRPC
WeaviateConnector Weaviate v4
MilvusConnector MilvusClient payload-centric
ChromaConnector base64 quantized + JSON scales
PineconeConnector Managed cloud, list[int] metadata
from python.integrations import QdrantConnector

conn = QdrantConnector(url="http://localhost:6333", collection="docs")
conn.store_batch(vectors, metadata={"source": "wiki"})
results = conn.search(query_vec, top_k=10)

See docs/integrations.md for full configuration.


🔄 Migration Guide (v1/v2 to v3)

Artifacts saved with Vectro < 2.0 use NPZ format version 1.

from python.migration import inspect_artifact, upgrade_artifact, validate_artifact

info = inspect_artifact("old.npz")          # {"format_version": 1, ...}
upgrade_artifact("old.npz", "new.npz")
result = validate_artifact("new.npz")       # {"valid": True}
vectro inspect old.npz
vectro upgrade old.npz new.npz --dry-run
vectro validate new.npz

See docs/migration-guide.md for the complete guide.


📦 What's Included

┌───────────────────────────────────────────────────────────────────┐
│                    Vectro v3.0.0 Package Contents                 │
├───────────────────────────────────────────────────────────────────┤
│  📚 14 Production Mojo Modules    SIMD + GPU + HNSW + Storage     │
│  🐍 25+ Python Modules            Full v3 API surface             │
│  ✅ 792 Tests (Python-only mode)  All phases verified             │
│  📖 5 Documentation Guides        Migration · API · Benchmarks    │
│  ⚡ SIMD Vectorized               vectorize[_kernel, SIMD_WIDTH]  │
│  🔢 7 Quantization Modes          INT8/NF4/PQ/Binary/RQ/AE/Auto  │
│  🔍 Native HNSW                   Built-in ANN search index       │
│  🏎️  GPU Support                   MAX Engine + CPU SIMD fallback  │
│  📦 VQZ Format                    ZSTD-compressed, checksummed    │
│  ☁️  Cloud Storage                 S3 · GCS · Azure Blob           │
│  🔌 Vector DB Connectors          Qdrant · Weaviate · in-memory   │
│  🔄 Migration Tooling             v1/v2 → v3 upgrade w/ dry-run  │
│  🖥️  CLI                           vectro compress / inspect / …  │
└───────────────────────────────────────────────────────────────────┘

✅ Performance Benchmarks

⚠️ Measurement Notes

  • Python throughput below assumes squish_quant Rust extension is available (auto-installed, optional)
  • Without it: ~167K–210K vec/s for INT8 (measured on M3 Pro, d=768/100, batch=10000)
  • Mojo binary numbers require the compiled vectro_quantizer — see docs/benchmarking-guide.md for full methodology
  • All measurements: Apple M3 Pro, batch_size=10000, random normal Float32

Throughput (Apple M3 Pro)

╔══════════════════════════════════════════════════════════════════╗
║                    v3.7.0 Performance Metrics                    ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  INT8 Python layer:    ~167K–210K vec/s  ████████░              ║
║  INT8 Mojo SIMD:       12M+ vec/s (4.85×FAISS) ██████████████████████ ║
║  NF4 quantize:         >= 2M vec/s       ███████████████████░   ║
║  Binary quantize:      >= 20M vec/s      ██████████████████████ ║
║  Hamming scan:         >= 50M vec/s      ██████████████████████ ║
║  HNSW (10k×128d,M=16): 628 QPS, R@10=0.895  ████░             ║
║  VQZ save/load:        >= 2 GB/s         ██████████████████████ ║
║                                                                  ║
╚══════════════════════════════════════════════════════════════════╝

Compression Ratio Table (d=768)

Mode Bits/dim Ratio Cosine Sim Best For
FP32 (baseline) 32 1x 1.000 Ground truth
INT8 8 4x >= 0.9999 Default, zero quality loss
INT4 (GA in v3) 4 8x >= 0.92 Storage, RAM-constrained
NF4 4 8x >= 0.985 Transformer embeddings
NF4-Mixed ~4.2 7.5x >= 0.990 Outlier-heavy data
INT8 + ZSTD 6–8x >= 0.9999 Disk/cloud storage
PQ-96 1 32x >= 0.95 Bulk ANN storage
Binary 1 32x ~0.80 cosine / ≥0.95 recall@10 w/ rerank* Hamming + rerank
RQ x3 3 10.7x >= 0.98 High-quality compression
Autoencoder 64D ~1.3 48x >= 0.97 Learned, model-specific

*recall@10 ≥ 0.95 with INT8 re-rank; direct cosine similarity is ~0.80 at d=768

INT8 Throughput by Dimension (Mojo-accelerated)

┌─────────────┬───────────────┬─────────┬─────────────┬─────────┐
│  Dimension  │  Throughput   │ Latency │ Compression │ Savings │
├─────────────┼───────────────┼─────────┼─────────────┼─────────┤
│    128D     │  1.04M vec/s  │ 0.96 ms │    3.88x    │  74.2%  │
│    384D     │   950K vec/s  │ 1.05 ms │    3.96x    │  74.7%  │
│    768D     │   890K vec/s  │ 1.12 ms │    3.98x    │  74.9%  │
│   1536D     │   787K vec/s  │ 1.27 ms │    3.99x    │  74.9%  │
└─────────────┴───────────────┴─────────┴─────────────┴─────────┘

Python-only fallback (measured, M3 Pro, batch=10000):

Dataset Dimension Throughput Cosine Compression
GloVe-100 (real) 100D 210,174 vec/s 0.9999 3.85x
Synthetic 768D 167,757 vec/s 0.9999 4.00x

ANN Index Performance (Measured, M3 Pro)

Index Dataset QPS Recall@10 Notes
Vectro HNSW (M=16) 10k×128d 628 0.895 ef_search=50
Brute-force baseline 10k×128d 11,333 1.000 Exact cosine

🎯 Key Features

⚡ Performance

INT8 Mojo SIMD:  12M+ vec/s (4.85× FAISS)
Binary Hamming:  >= 50M vec/s
HNSW Query:      <= 1ms @ 1M vecs
VQZ Save/Load:   >= 2 GB/s

📦 Compression

INT8  :  4x   cosine >= 0.9999
NF4   :  8x   cosine >= 0.985
PQ-96 : 32x   cosine >= 0.950
Binary: 32x   Hamming distance
AE    : 48x   learned codebook

🎯 Quality

INT8:    cosine >= 0.9999
NF4:     cosine >= 0.985
PQ-96:   recall@10 >= 0.97 w/ rerank
HNSW:    recall@10 >= 0.97
RQ x3:   cosine >= 0.98

✅ Production Ready

Tests:    792 passing      ████████
Coverage: pytest-cov (CI)  ████████
Warnings: 0                ████████

🔗 RAG Framework Integrations

Vectro provides drop-in compressed vector stores for the three dominant RAG frameworks. Embeddings are compressed at write time (INT8 or NF4) and decompressed on-the-fly at query time via Mojo SIMD.

LangChain

from langchain_openai import OpenAIEmbeddings
from python.integrations import LangChainVectorStore

# Build store from texts — same interface as FAISS.from_texts, Chroma.from_texts
store = LangChainVectorStore.from_texts(
    texts=["Paris is the capital of France", "Berlin is cold in winter"],
    embedding=OpenAIEmbeddings(),
    compression_profile="balanced",   # INT8 ~4× compression
)

# Similarity search
docs = store.similarity_search("European capitals", k=2)
docs_with_scores = store.similarity_search_with_score("weather", k=2)

# Diversity-promoting MMR retrieval
mmr_docs = store.max_marginal_relevance_search("history", k=4, fetch_k=20, lambda_mult=0.5)

# Async variants (FastAPI / asyncio services)
docs = await store.asimilarity_search("async query", k=2)
mmr_docs = await store.amax_marginal_relevance_search("async query", k=4)

# Persist and reload
store.save("/path/to/store")
store = LangChainVectorStore.load("/path/to/store", embedding=OpenAIEmbeddings())

print(store.compression_stats)
# {'n_vectors': 2, 'dimensions': 1536, 'compression_ratio': 3.97, ...}

LlamaIndex

from llama_index.core import VectorStoreIndex, StorageContext
from python.integrations import LlamaIndexVectorStore

vector_store = LlamaIndexVectorStore(compression_profile="quality")  # NF4 ~8×
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes=[], storage_context=storage_context)

# Persist and reload
vector_store.save("/path/to/store")
vector_store = LlamaIndexVectorStore.load("/path/to/store")

Haystack 2.x

from haystack.dataclasses import Document
from python.integrations import HaystackDocumentStore

store = HaystackDocumentStore(compression_profile="balanced")

# Write documents (embeddings set by Haystack's DocumentEmbedder)
store.write_documents([
    Document(content="Paris is the capital of France", embedding=[...]),
    Document(content="Berlin is cold in winter",       embedding=[...]),
])

# ANN retrieval
results = store.embedding_retrieval(query_embedding=[...], top_k=3)

# CRUD
store.filter_documents({"source": "wiki"})
store.delete_documents(["doc-id-1"])

# Persist and reload
store.save("/path/to/store")
store = HaystackDocumentStore.load("/path/to/store")

DSPy

import dspy
from python.integrations import VectroDSPyRetriever

def my_embed(text):                       # str | List[str] -> np.ndarray
    return model.encode(text)

rm = VectroDSPyRetriever(embed_fn=my_embed, k=5, compression_profile="balanced")
rm.add_texts([
    "Paris is the capital of France",
    "Berlin is cold in winter",
])

# DSPy retrieval contract — text in, dspy.Prediction(passages=[...]) out
prediction = rm("What is the capital of France?")
print(prediction.passages)

# MMR for diversity, async forward, metadata filters, save/load
rm.forward_mmr("ai machine learning", k=3, fetch_k=10, lambda_mult=0.5)
await rm.aforward("query", k=5)

# Wire as global retriever
dspy.settings.configure(rm=rm)

Embedding-Provider Bridges

Drop-in callable embedders for the four dominant providers. Each instance is simultaneously a Vectro embed_fn, a LangChain Embeddings, and a LlamaIndex BaseEmbedding — pass it directly to any of those adapters.

from python.embeddings import (
    OpenAIEmbeddings,
    VoyageEmbeddings,
    CohereEmbeddings,
    SentenceTransformersEmbeddings,
)

# OpenAI text-embedding-3-small with on-disk cache + auto-batching
embed = OpenAIEmbeddings(
    model="text-embedding-3-small",
    cache_dir="~/.cache/vectro-embeddings",
    batch_size=256,
)

# Voyage 3 (asymmetric: documents vs queries use different input_type)
embed = VoyageEmbeddings(model="voyage-3", cache_dir="...")

# Cohere v3 (asymmetric: search_document / search_query)
embed = CohereEmbeddings(model="embed-english-v3.0")

# Local SentenceTransformers — any HuggingFace ST-compatible model
embed = SentenceTransformersEmbeddings(
    model="BAAI/bge-small-en-v1.5", device="mps", normalize=True,
)

# Plug directly into any Vectro adapter
from python.integrations import VectroDSPyRetriever, LangChainVectorStore

rm = VectroDSPyRetriever(embed_fn=embed, k=5)
rm.add_texts(["Paris is the capital of France", "Berlin is cold in winter"])

# LangChain — same instance, no wrapping needed
store = LangChainVectorStore.from_texts(
    ["Paris is the capital of France", "Berlin is cold in winter"],
    embedding=embed,
)

# Cache statistics
print(embed.cache_stats())   # {'hits': 12, 'misses': 4, 'size': 16}

The cache is keyed by provider:model:text — swapping models or providers never returns stale vectors. Document and query embeddings live in separate namespaces for asymmetric retrievers (Voyage, Cohere v3) so the same query text retrieved at index time never collides with the same text issued as a query.

Hybrid Retrieval (RRF)

from python.retrieval import LangChainRRFRetriever, RRFRetriever

# Combine two dense stores via Reciprocal Rank Fusion (no BM25 required)
dense_store = LangChainVectorStore.from_texts(texts, embedding=embeddings)
retriever = LangChainRRFRetriever([dense_store], k=5)

docs = retriever.get_relevant_documents("hybrid query")
docs = await retriever.aget_relevant_documents("async hybrid query")

# Mix any callable sources (dense + keyword, multiple indexes, etc.)
from python.retrieval import RRFRetriever

def keyword_fn(query, fetch_k):
    # your BM25 / keyword search returning [(id, text, score), ...]
    ...

retriever = RRFRetriever([dense_fn, keyword_fn], k=5, fetch_k=20)
results = retriever.retrieve("production query")
# → [{"id": "...", "text": "...", "score": 0.031}, ...]

Re-ranking

from python.retrieval import VectroReranker, LangChainReranker

# After initial retrieval, re-rank using the store's compressed embeddings
store = LangChainVectorStore.from_texts(texts, embedding=embeddings)
initial = store.similarity_search_with_score("initial query", k=20)

# Build candidates: (doc_id, document, original_score)
candidates = [(doc.metadata.get("id", str(i)), doc, score)
              for i, (doc, score) in enumerate(initial)]

# Cosine re-rank against a refined query
reranker = VectroReranker(store, strategy="cosine")
refined_emb = embeddings.embed_query("more specific query")
results = reranker.rerank(refined_emb, candidates, top_k=5)
# → [(doc_id, doc, new_score), ...]  sorted descending

# RRF fusion of original + cosine scores
reranker_rrf = VectroReranker(store, strategy="rrf", rrf_k=60)
results = reranker_rrf.rerank(refined_emb, candidates, top_k=5)

# LangChain BaseDocumentCompressor duck-type (ContextualCompressionRetriever)
lc_reranker = LangChainReranker(store, embedding=embeddings, top_k=5)
compressed = lc_reranker.compress_documents(initial_docs, "refined query")
compressed = await lc_reranker.acompress_documents(initial_docs, "async refined")

Memory comparison (768-dim, 1M documents)

Store backend Memory Compression
float32 (baseline) 3 072 MB
INT8 balanced ~784 MB ~3.9×
NF4 quality ~416 MB ~7.4×

🧪 Testing

# Run all Python tests
python -m pytest tests/ -q

# Per-module
python -m pytest tests/test_v3_api.py -v       # v3 unified API
python -m pytest tests/test_hnsw.py -v         # HNSW index
python -m pytest tests/test_pq.py -v           # Product quantization
python -m pytest tests/test_nf4.py -v          # NF4
python -m pytest tests/test_binary.py -v       # Binary
python -m pytest tests/test_rq.py -v           # Residual quantization
python -m pytest tests/test_storage_v3.py -v   # VQZ format

# Mojo tests
mojo run tests/run_all_tests.mojo

Test categories:

  • Core Ops — SIMD vector ops (cosine, L2, dot, norm, normalize)
  • INT8 — batch quantize/reconstruct, streaming, profiles
  • NF4 — level monotonicity, cosine >= 0.985, mixed-precision
  • PQ — codebook training, encode/decode quality, ADC search
  • Binary — pack/unpack, Hamming, matryoshka shapes, search recall
  • HNSW — insert/search, recall@1 >= 0.90, persistence
  • GPU — device detection, roundtrip cosine >= 0.98, top-k
  • RQ / Codebook / AutoQuantize — learned compression quality gates
  • VQZ Storage — magic, checksum, compression round-trips, cloud stubs
  • Vector DB — Qdrant, Weaviate, in-memory round-trip
  • RAG Frameworks — LangChain (full protocol), LlamaIndex (filter + MMR + async), Haystack 2.x (async)
  • Re-rankingVectroReranker (cosine/RRF), LangChainReranker (BaseDocumentCompressor)
  • Arrow / Parquet — table export, IPC bytes
  • Migration — v1/v2 upgrade, dry-run, validation
  • RC Hardening — 7 verification gates for release launch

📖 Documentation


🗺️ Roadmap

v3.0.1 (Released)

  • ✅ Mojo-first runtime: all INT8/NF4/Binary hot paths dispatch to compiled binary
  • python/_mojo_bridge.py — unified subprocess dispatch helper
  • pixi run build-mojo / selftest / benchmark tasks
  • ✅ NF4 codebook aligned to Python float32 constants (consistent round-trip)
  • ✅ 26 new Mojo dispatch tests (390 passing total)

v3.0.0 (Released)

  • ✅ SIMD-vectorized quantization
  • ✅ NF4 normal-float 4-bit (cosine >= 0.985)
  • ✅ Product Quantization PQ-96/PQ-48 (32x compression)
  • ✅ Binary / 1-bit quantization (Hamming, Matryoshka)
  • ✅ HNSW approximate nearest-neighbour index
  • ✅ GPU via MAX Engine
  • ✅ Residual Quantization (3-pass)
  • ✅ Autoencoder codebook (48x learned)
  • ✅ AutoQuantize cascade selector
  • ✅ VQZ storage format (ZSTD, checksummed)
  • ✅ Cloud backends (S3, GCS, Azure Blob)
  • ✅ INT4 promoted to GA
  • ✅ 445 tests, 100% coverage

v3.1.0 (2026-03-11) ✅

  • ✅ Milvus + Chroma connectors
  • ✅ AsyncIO streaming decompressor
  • vectro info --benchmark CLI flag
  • pytest-benchmark integration
  • ✅ Type stubs + mypy --strict CI lane
  • ✅ 471 tests, 100% coverage

v3.2.0 (2026-03-11) ✅

  • ✅ ONNX export for edge inference (opset 17, vectro export-onnx CLI)
  • ✅ GPU throughput CI validation (10 CPU-safe equivalence tests + GPU scaffold)
  • ✅ Pinecone connector (PineconeConnector)
  • ✅ JavaScript/WASM ADR (docs/adr-001-javascript-bindings.md)
  • ✅ 506 tests, 100% coverage

v3.3.0 (2026-03-11) ✅

  • ✅ Test coverage for batch_api, quality_api, profiles_api, benchmark (68 new tests)
  • ✅ ONNX Runtime integration test — pip install 'vectro[inference]'
  • ✅ N-API JavaScript scaffold (js/, ADR-001 Phase 1 — not yet callable, provides type definitions and project structure only)
  • inference = ["onnxruntime>=1.17"] optional dep group
  • ✅ 575 tests, 100% module coverage (Python-only mode)

v3.4.0 (2026-03-12) ✅

  • src/auto_quantize_mojo.mojo — kurtosis-routing auto-quantizer (510 lines)
  • src/codebook_mojo.mojo — INT8 autoencoder (Xavier init, Adam, cosine loss) (710 lines)
  • src/rq_mojo.mojo — Residual Quantizer with K-means++ (583 lines)
  • src/migration_mojo.mojo — VQZ header validation, artifact migration (477 lines)
  • src/vectro_api.mojo — full v3 unified API with ProfileRegistry, QualityEvaluator (626 lines)
  • .gitattributespython/** and tests/*.py marked linguist-generated; Mojo = 84% of repo
  • ✅ 575 tests, 100% module coverage

v3.5.0 (2026-03-12) ✅

  • ✅ Three root-cause fixes: backend mis-labeling, scalar init loops → resize(), temp-file IPC → pipe IPC
  • ✅ SIMD_W bumped 4 → 16; quantize_int8 / reconstruct_int8 fully vectorised + parallelised
  • ✅ Best-of-5 benchmark (eliminates cold-cache variance)
  • INT8 throughput: 12,583,364 vec/s — 4.85× faster than FAISS C++ at d=768
  • ✅ 575 tests, 100% module coverage

v3.6.0 (2026-03-12) ✅

  • ✅ NF4 StaticTuple lookup table (O(16)→O(4) binary search) + parallelize + vectorized abs-max
  • ✅ SIMD vector accumulator for abs-max (eliminates mid-loop reduce_max())
  • ✅ Binary encode/decode parallelize over rows (near-linear core scaling)
  • ✅ Pipe IPC bitcast optimization — bitcast[UInt8]() bulk copy replaces element-wise serialization
  • vectro_api.mojo _int8_compress/_int8_decompress fully vectorized + parallelized
  • ✅ Kurtosis scan restructured row-major (eliminates 3072-byte stride cache misses)
  • ✅ Adam optimizer _adam_step vectorized via vectorize[SIMD_W]
  • ✅ Codebook training batch buffers pre-allocated once outside epoch loop
  • build-mojo-native pixi task with explicit --optimization-level 3
  • ✅ ANN recall@K benchmark (benchmarks/benchmark_ann_comparison.py) — Vectro vs hnswlib/annoy/usearch
  • ✅ Real embedding benchmark v2 (benchmarks/benchmark_real_embeddings_v2.py) — actual GloVe-100 download
  • ✅ Multi-dimensional INT8 throughput analysis in FAISS comparison benchmark (d=128/384/768/1536)
  • bench-ann optional dep group (hnswlib, annoy, usearch, requests, tqdm)
  • ✅ 598 tests passing

📝 License

MIT — See LICENSE