[Feature Request] Add local qwen3-embedding:0.6b & qwen3-reranker:0.6b support and run MTEB Embeddings/Reranker Leaderboard benchmarks locally

### Feature request: Add local `qwen3-embedding:0.6b` & `qwen3-reranker:0.6b` and run MTEB / Reranker leaderboard benchmarks on-device

## Summary

Please consider adding first-class local support for the **official** Qwen3 embedding & reranker models, and publishing a reproducible **on-device** benchmark against the MTEB Embeddings Leaderboard and MTEB Reranking Leaderboard.

- Embedding: [`qwen3-embedding:0.6b`](https://ollama.com/library/qwen3-embedding) (official Qwen model, now on Ollama's official library)
- Reranker: [`qwen3-reranker:0.6b`](https://huggingface.co/Qwen/Qwen3-Reranker-0.6B) (official, HF)

Hindsight currently defaults to `BAAI/bge-*` ([configuration docs](https://hindsight.vectorize.io/developer/configuration)) and already supports `local` / `tei` / `litellm-sdk` / Google providers per [`.env.example`](https://github.com/vectorize-io/hindsight/blob/main/.env.example). Adding Qwen3 would give users a fully-local, SOTA-quality option that fits Hindsight's privacy-first positioning.

---

## Why Qwen3-Embedding / Qwen3-Reranker

- **SOTA on MTEB multilingual**: the 8B variant ranked #1 on the MTEB multilingual leaderboard when released (June 2025); the 0.6B variant is reported to be competitive with `bge-m3` / `bge-reranker-v2-m3`, which are of similar size (roughly 0.6B parameters each).
- **Small footprint**: `qwen3-embedding:0.6b` is ~639 MB (Q4) / ~1.2 GB (bf16), runnable on CPU-only laptops.
- **Official releases now exist**:
  - Ollama official library: `ollama pull qwen3-embedding:0.6b`  ([link](https://ollama.com/library/qwen3-embedding))
  - HF official: [`Qwen/Qwen3-Embedding-0.6B`](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B), [`Qwen/Qwen3-Embedding-0.6B-GGUF`](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B-GGUF)
  - HF official: [`Qwen/Qwen3-Reranker-0.6B`](https://huggingface.co/Qwen/Qwen3-Reranker-0.6B)
- **Source**: [QwenLM/Qwen3-Embedding](https://github.com/QwenLM/Qwen3-Embedding)

---

## ⚠️ Important caveat about the reranker (needs verification before merging)

**`Qwen3-Embedding-0.6B` works out-of-the-box** via Ollama / llama.cpp / TEI / transformers — no special handling needed, it's a standard bi-encoder.

**`Qwen3-Reranker-0.6B` is NOT a drop-in reranker** in the `bge-reranker-v2-m3` sense. It is a *causal-LM–style* reranker that scores relevance by computing the probability of a "yes" token given a specific chat template, and several community toolchains have open bugs today:

| Runtime | Status | Reference |
|---|---|---|
| vLLM | ✅ Works | upstream supported |
| HF `transformers` (official snippet) | ✅ Works | [model card](https://huggingface.co/Qwen/Qwen3-Reranker-0.6B) |
| `llama.cpp` `--pooling rerank` / `llama-server --rerank` | ❌ Broken / wrong scores | [llama.cpp#16407](https://github.com/ggml-org/llama.cpp/issues/16407), [llama.cpp#17743](https://github.com/ggml-org/llama.cpp/issues/17743), [mradermacher GGUF discussion](https://huggingface.co/mradermacher/Qwen3-Reranker-0.6B-GGUF/discussions/1) (missing `sep_token`), [Mungert GGUF discussion](https://huggingface.co/Mungert/Qwen3-Reranker-4B-GGUF/discussions/1) (missing `cls.output.weight`) |
| Ollama rerank endpoint | ⚠️ Inherits llama.cpp issue | same as above |
| `node-llama-cpp` via custom prompt (e.g. [QMD](https://medium.com/coding-nexus/qmd-local-hybrid-search-engine-for-markdown-that-cuts-token-usage-by-95-e0f9d21f89af)) | ✅ Works with custom chat template | community workaround |

**Implication for Hindsight integration:**
1. The embedding side can reuse the existing `local` / `tei` / `ollama` code paths unchanged.
2. The reranker side likely needs a **dedicated `qwen3-reranker` provider** that either:
   - calls Qwen3-Reranker through `transformers` / vLLM using the official yes-token-probability template, **or**
   - wraps Ollama/llama.cpp behind a prompt-based scorer (not the native `/rerank` endpoint) until llama.cpp ships a fix.
3. Please document the chosen approach and pin a known-good runtime version, because ranking accuracy silently degrades otherwise.
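
For reference, the yes-token scoring step described above can be sketched without any model runtime. This is a minimal illustration, assuming the provider has already obtained the next-token logits for the "yes" and "no" tokens at the final position of the official prompt; the function names and the example logits are hypothetical, and the chat template itself is not reproduced here.

```python
import math


def yes_probability(yes_logit: float, no_logit: float) -> float:
    """Softmax over the two candidate tokens, returning P("yes")."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(yes_logit, no_logit)
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


def rank_documents(logit_pairs: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """Sort documents by P("yes"), highest first."""
    scored = [(doc_id, yes_probability(y, n)) for doc_id, (y, n) in logit_pairs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical (yes_logit, no_logit) pairs for three candidate documents.
example = {"doc_a": (2.0, 0.5), "doc_b": (-1.0, 1.0), "doc_c": (0.0, 0.0)}
ranking = rank_documents(example)
```

A score produced this way lies in (0, 1), so it is not directly comparable to the raw logit scores a cross-encoder such as `bge-reranker-v2-m3` emits; any score threshold in the pipeline would probably need re-tuning.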

---

## Requested changes

### 1. Model support
- [ ] Add `qwen3-embedding:0.6b` as a supported value for the `local` / `ollama` / `tei` / `litellm-sdk` embedding providers.
- [ ] Add `qwen3-reranker:0.6b` as a reranker option with a **correct** scoring implementation (per caveat above).
- [ ] Ship sample config snippets in `.env.example`, e.g.:
  ```bash
  # Embedding
  EMBEDDING_PROVIDER=ollama
  EMBEDDING_MODEL=qwen3-embedding:0.6b
  OLLAMA_BASE_URL=http://localhost:11434

  # Reranker (use transformers/vLLM path until llama.cpp bug is fixed)
  RERANKER_PROVIDER=qwen3
  RERANKER_MODEL=Qwen/Qwen3-Reranker-0.6B
  ```
- [ ] Update [Configuration docs](https://hindsight.vectorize.io/developer/configuration) with a "Fully local with Qwen3" recipe.
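
As a rough sketch of what the Ollama embedding path could look like, the snippet below builds a request for Ollama's `/api/embed` endpoint from the environment variables shown in the sample config. The variable names come from the illustrative snippet above and are not confirmed Hindsight settings; `build_embed_request` and `embed` are hypothetical helper names.

```python
import json
import os
import urllib.request


def build_embed_request(texts, env=None):
    """Build (but do not send) a POST request for Ollama's /api/embed endpoint."""
    env = os.environ if env is None else env
    # Variable names follow the sample .env snippet; they are assumptions.
    base_url = env.get("OLLAMA_BASE_URL", "http://localhost:11434").rstrip("/")
    model = env.get("EMBEDDING_MODEL", "qwen3-embedding:0.6b")
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, env=None):
    """Send the request and return one embedding vector per input text."""
    with urllib.request.urlopen(build_embed_request(texts, env), timeout=60) as resp:
        return json.load(resp)["embeddings"]
```

Calling `embed(["hello world"])` requires a running Ollama instance with the model already pulled; `build_embed_request` can be checked on its own without one.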

### 2. On-device leaderboard benchmark (the main ask)

I'd like the maintainers to run the [`mteb`](https://github.com/embeddings-benchmark/mteb) benchmark **on the actual hardware Hindsight users are expected to deploy on** (CPU-only laptop + single consumer GPU), not just cite the HF leaderboard, and publish the numbers in-repo (e.g. `benchmarks/qwen3-local.md`).

**Embeddings leaderboard tasks to run** (English + Chinese subsets at minimum):
- [ ] Retrieval (highest priority — most relevant to Hindsight's memory recall)
- [ ] Reranking
- [ ] STS
- [ ] Classification
- [ ] Clustering

**Reranker leaderboard:**
- [ ] MTEB Reranking task, comparing `qwen3-reranker:0.6b` vs current default `bge-reranker-v2-m3`.

**Baselines to include:**
- `BAAI/bge-m3`, `BAAI/bge-large-en-v1.5`, `BAAI/bge-large-zh-v1.5`
- `BAAI/bge-reranker-v2-m3`
- `nomic-embed-text`
- (optional) `Qwen3-Embedding-4B` for scaling reference

**Deliverables:**
- [ ] Repro script: `scripts/bench_mteb.py`
- [ ] Hardware profile: CPU model, RAM, GPU (if any), OS
- [ ] Per-model: quality score × p50/p95 latency × peak RAM/VRAM
- [ ] A recommended default-config section in the README
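
To make the latency deliverable concrete, here is one possible shape for the p50/p95 measurement, assuming a nearest-rank percentile and a callable that encodes one input at a time. The helper names are hypothetical, and this is not the proposed `scripts/bench_mteb.py`; quality scoring and peak RAM/VRAM tracking are left out.

```python
import math
import time


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (pct in 0..100)."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def measure_latency(fn, inputs, warmup=1):
    """Time fn(item) for every item and report p50/p95 in seconds."""
    # Warm-up calls are excluded so model loading does not skew the numbers.
    for item in inputs[:warmup]:
        fn(item)
    samples = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "n": len(samples),
    }
```

With a small sample count the p95 collapses to the slowest call, so a few hundred queries per model is probably the minimum for the tail figure to mean much.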

### 3. Nice-to-have
- [ ] CI smoke test that boots Hindsight with Qwen3 local models and runs a minimal recall test.
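
A minimal recall check could look roughly like the sketch below. It does not use Hindsight's API: `recall_at_k` takes any embedding callable, and `toy_embed` is a hypothetical bag-of-words stand-in so the example runs without a model; in CI the callable would wrap the real Qwen3 embedding provider.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(embed, query, documents, expected_index, k=1):
    """Return True if the expected document is among the k most similar."""
    q = embed(query)
    scores = [cosine(q, embed(d)) for d in documents]
    top = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)[:k]
    return expected_index in top


# Toy stand-in for a real embedding model: word counts over a fixed vocabulary.
VOCAB = ["cat", "dog", "rust", "python"]


def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]
```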

---

## Why this matters to Hindsight specifically

1. **Privacy-first story stays intact**: users get SOTA retrieval quality without sending any memory to an external API.
2. **Cheap to run**: 0.6B variants fit on any dev laptop; no 4B/8B GPU requirement.
3. **Real numbers > leaderboard numbers**: public MTEB scores are measured on beefy GPUs; Hindsight users care about the quality↔latency tradeoff on *their* boxes. Only the maintainers can authoritatively publish that.
4. **Concrete default to point people to**: today users pick a `bge-*` model by habit; a measured recommendation reduces decision fatigue.

---

## Additional context

Happy to help with:
- PR for the embedding provider wiring (straightforward).
- PR for a `qwen3-reranker` provider using the official `transformers` yes-token-probability template.
- Running the benchmark on my own hardware and contributing results, but I think the **official** numbers should come from a maintainer to be authoritative.

Two questions for the maintainers before I open a PR:
1. Do you want the Qwen3 reranker integrated via `transformers`/vLLM, via `litellm-sdk`, or do you want to wait until `llama.cpp`'s `--rerank` path is fixed so Ollama can handle it natively?
2. Should benchmark results live in `README.md`, a dedicated `benchmarks/` directory, or on the docs site?

Thanks for Hindsight — the learning-over-time design is genuinely novel. 🙌
