[Feature Request] Add local qwen3-embedding:0.6b & qwen3-reranker:0.6b support and run MTEB Embeddings/Reranker Leaderboard benchmarks locally #11

@yunyng

Feature request: Add local qwen3-embedding:0.6b & qwen3-reranker:0.6b and run MTEB / Reranker leaderboard benchmarks on-device

Summary

Please consider adding first-class local support for the official Qwen3 embedding & reranker models, and publishing a reproducible on-device benchmark against the MTEB Embeddings Leaderboard and MTEB Reranking Leaderboard.

Hindsight currently defaults to BAAI/bge-* (configuration docs) and already supports local / tei / litellm-sdk / Google providers per .env.example. Adding Qwen3 would give users a fully-local, SOTA-quality option that fits Hindsight's privacy-first positioning.


Why Qwen3-Embedding / Qwen3-Reranker


⚠️ Important caveat about the reranker (needs verification before merging)

Qwen3-Embedding-0.6B works out-of-the-box via Ollama / llama.cpp / TEI / transformers — no special handling needed, it's a standard bi-encoder.

Qwen3-Reranker-0.6B is NOT a drop-in reranker in the bge-reranker-v2-m3 sense. It is a causal-LM–style reranker that scores relevance by computing the probability of a "yes" token given a specific chat template, and several community toolchains have open bugs today:

| Runtime | Status | Reference |
| --- | --- | --- |
| vLLM | ✅ Works | upstream supported |
| HF transformers (official snippet) | ✅ Works | model card |
| llama.cpp --pooling rerank / llama-server --rerank | ❌ Broken / wrong scores | llama.cpp#16407, llama.cpp#17743, mradermacher GGUF discussion (missing sep_token), Mungert GGUF discussion (missing cls.output.weight) |
| Ollama rerank endpoint | ⚠️ Inherits llama.cpp issue | same as above |
| node-llama-cpp via custom prompt (e.g. QMD) | ✅ Works with custom chat template | community workaround |

Implication for Hindsight integration:

  1. The embedding side can reuse the existing local / tei / ollama code paths unchanged.
  2. The reranker side likely needs a dedicated qwen3-reranker provider that either:
    • calls Qwen3-Reranker through transformers / vLLM using the official yes-token-probability template, or
    • wraps Ollama/llama.cpp behind a prompt-based scorer (not the native /rerank endpoint) until llama.cpp ships a fix.
  3. Please document the chosen approach and pin a known-good runtime version, because ranking accuracy silently degrades otherwise.
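To make the "yes-token probability" idea concrete, here is a minimal sketch of just the scoring step. It assumes the runtime has already produced, for each (query, document) prompt, the final-position logits of the "yes" and "no" tokens; how those logits are obtained (transformers, vLLM, or a prompt-based wrapper) is not shown. The function names `yes_probability` and `rerank` are hypothetical and not part of Hindsight or any Qwen tooling.

```python
import math


def yes_probability(yes_logit: float, no_logit: float) -> float:
    """Softmax over the two candidate-token logits, returning P("yes")."""
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


def rerank(docs, logit_pairs):
    """Sort docs by descending P("yes").

    logit_pairs[i] is the (yes_logit, no_logit) pair for docs[i];
    these are assumed inputs coming from the model runtime.
    """
    scored = [(yes_probability(y, n), d) for d, (y, n) in zip(docs, logit_pairs)]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(d, s) for s, d in scored]
```

A native /rerank endpoint that ignores the chat template never sees these two logits in the right context, which is one plausible reason scores come out wrong there.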

Requested changes

1. Model support

  • Add qwen3-embedding:0.6b as a supported value for the local / ollama / tei / litellm-sdk embedding providers.
  • Add qwen3-reranker:0.6b as a reranker option with a correct scoring implementation (per caveat above).
  • Ship sample config snippets in .env.example, e.g.:
    # Embedding
    EMBEDDING_PROVIDER=ollama
    EMBEDDING_MODEL=qwen3-embedding:0.6b
    OLLAMA_BASE_URL=http://localhost:11434
    
    # Reranker (use transformers/vLLM path until llama.cpp bug is fixed)
    RERANKER_PROVIDER=qwen3
    RERANKER_MODEL=Qwen/Qwen3-Reranker-0.6B
  • Update Configuration docs with a "Fully local with Qwen3" recipe.

2. On-device leaderboard benchmark (the main ask)

I'd like the maintainers to run the mteb benchmark on the actual hardware Hindsight users are expected to deploy on (CPU-only laptop + single consumer GPU), not just cite the HF leaderboard, and publish the numbers in-repo (e.g. benchmarks/qwen3-local.md).

Embeddings leaderboard tasks to run (English + Chinese subsets at minimum):

  • Retrieval (highest priority — most relevant to Hindsight's memory recall)
  • Reranking
  • STS
  • Classification
  • Clustering

Reranker leaderboard:

  • MTEB Reranking task, comparing qwen3-reranker:0.6b vs current default bge-reranker-v2-m3.
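For the reranker comparison, a MAP-style score (which, as far as I know, is what the MTEB reranking tasks report) could be sanity-checked with a few lines like the following. This is an illustrative sketch, not the mteb implementation; `average_precision` and `mean_average_precision` are hypothetical helper names, and it assumes every relevant candidate appears somewhere in the reranked list.

```python
def average_precision(ranked_relevance):
    """AP for one query; ranked_relevance holds 0/1 labels in ranked order."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_query_relevance):
    """Mean of per-query AP; returns 0.0 for an empty query set."""
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps) if aps else 0.0
```

Running both rerankers over the same candidate lists and comparing the two means gives a like-for-like quality number independent of the runtime.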

Baselines to include:

  • BAAI/bge-m3, BAAI/bge-large-en-v1.5, BAAI/bge-large-zh-v1.5
  • BAAI/bge-reranker-v2-m3
  • nomic-embed-text
  • (optional) Qwen3-Embedding-4B for scaling reference

Deliverables:

  • Repro script: scripts/bench_mteb.py
  • Hardware profile: CPU model, RAM, GPU (if any), OS
  • Per-model: quality score × p50/p95 latency × peak RAM/VRAM
  • A recommended default-config section in the README
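As a starting point for the latency column, the p50/p95 numbers could be collected with a small stdlib-only helper along these lines. This is a sketch under assumptions: `fn` stands in for whatever embed or rerank call is being benchmarked, the names `percentile` and `measure_latency` are hypothetical, and peak RAM/VRAM measurement is not covered here.

```python
import math
import time


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latency(fn, inputs, warmup=3):
    """Time fn(x) for each input in milliseconds and return p50/p95."""
    for x in inputs[:warmup]:
        fn(x)  # warm-up calls are not timed
    timings_ms = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "p50_ms": percentile(timings_ms, 50),
        "p95_ms": percentile(timings_ms, 95),
    }
```

Nearest-rank is used so the reported values are always real observed timings rather than interpolated ones; any consistent percentile definition would do as long as it is stated next to the results.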

3. Nice-to-have

  • CI smoke test that boots Hindsight with Qwen3 local models and runs a minimal recall test.

Why this matters to Hindsight specifically

  1. Privacy-first story stays intact — users get SOTA retrieval quality without sending any memory to an external API.
  2. Cheap to run — 0.6B variants fit on any dev laptop; no 4B/8B GPU requirement.
  3. Real numbers > leaderboard numbers — public MTEB scores are measured on beefy GPUs; Hindsight users care about the quality↔latency tradeoff on their boxes. Only the maintainers can authoritatively publish that.
  4. Concrete default to point people to: today users pick bge-* mostly by habit; a measured recommendation reduces decision fatigue.

Additional context

Happy to help with:

  • PR for the embedding provider wiring (straightforward).
  • PR for a qwen3-reranker provider using the official transformers yes-token-probability template.
  • Running the benchmark on my own hardware and contributing results, though I think the official numbers should come from a maintainer to be authoritative.

Two questions for the maintainers before I open a PR:

  1. Do you want the Qwen3 reranker integrated via transformers/vLLM, via litellm-sdk, or do you want to wait until llama.cpp's --rerank path is fixed so Ollama can handle it natively?
  2. Should benchmark results live in README.md, a dedicated benchmarks/ directory, or on the docs site?

Thanks for Hindsight — the learning-over-time design is genuinely novel. 🙌
