Feature request: Add local qwen3-embedding:0.6b & qwen3-reranker:0.6b and run MTEB / Reranker leaderboard benchmarks on-device
Summary
Please consider adding first-class local support for the official Qwen3 embedding & reranker models, and publishing a reproducible on-device benchmark against the MTEB Embeddings Leaderboard and MTEB Reranking Leaderboard.
- qwen3-embedding:0.6b (official Qwen model, now on Ollama's official library)
- qwen3-reranker:0.6b (official, HF)
Hindsight currently defaults to BAAI/bge-* (configuration docs) and already supports local / tei / litellm-sdk / Google providers per .env.example. Adding Qwen3 would give users a fully-local, SOTA-quality option that fits Hindsight's privacy-first positioning.
Why Qwen3-Embedding / Qwen3-Reranker
- … bge-m3 / bge-reranker-v2-m3 at ~1/10 the size.
- qwen3-embedding:0.6b is ~639 MB (Q4) / ~1.2 GB (bf16), runnable on CPU-only laptops.
- ollama pull qwen3-embedding:0.6b (link)
- Qwen/Qwen3-Embedding-0.6B, Qwen/Qwen3-Embedding-0.6B-GGUF
- Qwen/Qwen3-Reranker-0.6B
⚠️ Important caveat about the reranker (needs verification before merging)
Qwen3-Embedding-0.6B works out-of-the-box via Ollama / llama.cpp / TEI / transformers — no special handling needed, it's a standard bi-encoder.
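To make the "standard bi-encoder" point concrete, here is a minimal sketch of the embedding path. It assumes a local Ollama server at its default address exposing the /api/embed endpoint; the helper names (embed, cosine, rank_by_similarity) are mine for illustration and are not Hindsight's actual provider interface.

```python
import json
import math
from typing import List, Sequence
from urllib import request

# Assumption: Ollama is running locally on its default port.
OLLAMA_EMBED_URL = "http://localhost:11434/api/embed"


def embed(texts: Sequence[str], model: str = "qwen3-embedding:0.6b",
          url: str = OLLAMA_EMBED_URL) -> List[List[float]]:
    """Request one embedding per input text from a local Ollama server."""
    payload = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    req = request.Request(url, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["embeddings"]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec: Sequence[float],
                       doc_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Return document indices ordered from most to least similar."""
    sims = [cosine(query_vec, vec) for vec in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: sims[i], reverse=True)
```

Query and documents are embedded independently and compared afterwards, which is why this side needs no model-specific scoring logic.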
Qwen3-Reranker-0.6B is NOT a drop-in reranker in the bge-reranker-v2-m3 sense. It is a causal-LM–style reranker that scores relevance by computing the probability of a "yes" token given a specific chat template, and several community toolchains have open bugs today:
| Runtime | Status | Reference |
|---|---|---|
| vLLM | ✅ Works | upstream supported |
| HF transformers (official snippet) | ✅ Works | model card |
| llama.cpp --pooling rerank / llama-server --rerank | ❌ Broken / wrong scores | llama.cpp#16407, llama.cpp#17743, mradermacher GGUF discussion (missing sep_token), Mungert GGUF discussion (missing cls.output.weight) |
| Ollama rerank endpoint | ⚠️ Inherits llama.cpp issue | same as above |
| node-llama-cpp via custom prompt (e.g. QMD) | ✅ Works with custom chat template | community workaround |
Implication for Hindsight integration:
- The embedding side can reuse the existing local / tei / ollama code paths unchanged.
- The reranker side likely needs a dedicated qwen3-reranker provider that either:
  - calls Qwen3-Reranker through transformers / vLLM using the official yes-token-probability template, or
  - wraps Ollama/llama.cpp behind a prompt-based scorer (not the native /rerank endpoint) until llama.cpp ships a fix.
- Please document the chosen approach and pin a known-good runtime version, because ranking accuracy silently degrades otherwise.
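A rough sketch of the scoring step such a provider would need, whichever runtime sits underneath. It assumes the runtime can return the next-token logits for "yes" and "no" after the model's chat template has been applied, and that the score is normalized over just those two logits; the exact template should come from the model card. logits_fn is a hypothetical callback, not an existing Hindsight or runtime API.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical callback: (query, document) -> (yes_logit, no_logit), produced by
# whichever runtime is chosen after applying the model's chat template.
LogitsFn = Callable[[str, str], Tuple[float, float]]


def yes_probability(yes_logit: float, no_logit: float) -> float:
    """Two-way softmax over the 'yes' and 'no' logits, returning P(yes)."""
    # Equivalent to sigmoid(yes_logit - no_logit), split by sign so that
    # math.exp never overflows for large logit gaps.
    diff = yes_logit - no_logit
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def rerank(query: str, docs: Sequence[str],
           logits_fn: LogitsFn) -> List[Tuple[str, float]]:
    """Score every document against the query and sort by P(yes), best first."""
    scored = [(doc, yes_probability(*logits_fn(query, doc))) for doc in docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Keeping the scorer separate from the runtime call would let the same provider sit on top of transformers, vLLM, or a prompt-based Ollama wrapper.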
Requested changes
1. Model support
- Add qwen3-embedding:0.6b as a supported value for the local / ollama / tei / litellm-sdk embedding providers.
- Add qwen3-reranker:0.6b as a reranker option with a correct scoring implementation (per caveat above).
- Document both in .env.example.
2. On-device leaderboard benchmark (the main ask)
I'd like the maintainers to run the mteb benchmark on the actual hardware Hindsight users are expected to deploy on (CPU-only laptop + single consumer GPU), not just cite the HF leaderboard, and publish the numbers in-repo (e.g. benchmarks/qwen3-local.md).
Embeddings leaderboard tasks to run (English + Chinese subsets at minimum):
Reranker leaderboard:
- qwen3-reranker:0.6b vs current default bge-reranker-v2-m3.
Baselines to include:
- BAAI/bge-m3, BAAI/bge-large-en-v1.5, BAAI/bge-large-zh-v1.5
- BAAI/bge-reranker-v2-m3
- nomic-embed-text
- (optional) Qwen3-Embedding-4B for scaling reference
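Since the point of running this on-device is the latency side of the tradeoff, here is a small sketch of how per-model latency could be recorded next to the quality scores. It is independent of the mteb package; measure_latency and its parameters are illustrative names of mine, and fn stands in for whatever embed or rerank call is being measured.

```python
import math
import statistics
import time
from typing import Callable, Dict, Sequence


def measure_latency(fn: Callable[[Sequence[str]], object],
                    batches: Sequence[Sequence[str]],
                    warmup: int = 1) -> Dict[str, float]:
    """Time fn over each batch and report median and p95 latency in milliseconds."""
    # Warm-up calls are executed but not timed (model load, cache fill).
    for batch in batches[:warmup]:
        fn(batch)

    samples_ms = []
    for batch in batches:
        start = time.perf_counter()
        fn(batch)
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    ordered = sorted(samples_ms)
    # Nearest-rank p95: the smallest sample with at least 95% of samples at or below it.
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "runs": float(len(ordered)),
        "p50_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
    }
```

Running the same batches through each baseline on the same machine keeps the numbers comparable across models.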
Deliverables:
- scripts/bench_mteb.py
3. Nice-to-have
Why this matters to Hindsight specifically
- Privacy-first story stays intact — users get SOTA retrieval quality without sending any memory to an external API.
- Cheap to run — 0.6B variants fit on any dev laptop; no 4B/8B GPU requirement.
- Real numbers > leaderboard numbers — public MTEB scores are measured on beefy GPUs; Hindsight users care about the quality↔latency tradeoff on their boxes. Only the maintainers can authoritatively publish that.
- Concrete default to point people to: today users pick among the bge-* models by habit; a measured recommendation reduces decision fatigue.
Additional context
Happy to help with:
- PR for the embedding provider wiring (straightforward).
- PR for a qwen3-reranker provider using the official transformers yes-token-probability template.
- Running the benchmark on my own hardware and contributing results, but I think the official numbers should come from a maintainer to be authoritative.
Two questions for the maintainers before I open a PR:
- Do you want the Qwen3 reranker integrated via transformers/vLLM, via litellm-sdk, or do you want to wait until llama.cpp's --rerank path is fixed so Ollama can handle it natively?
- Should benchmark results live in README.md, a dedicated benchmarks/ directory, or on the docs site?
Thanks for Hindsight — the learning-over-time design is genuinely novel. 🙌