
feat: Add OpenAI embedding and query expansion support#116

Open
jonesj38 wants to merge 9 commits into tobi:main from jonesj38:feat/openai-embeddings

Conversation

@jonesj38

@jonesj38 jonesj38 commented Feb 5, 2026

Summary

Optional OpenAI integration for embeddings and query expansion. Dramatically faster for users who prefer API-based inference over local models.

Performance

Operation                    Local (llama-cpp)   OpenAI
Query expansion              30-40s              200ms
Full re-embed (30k chunks)   ~2 hours            ~10 min
Tokenizer load               30s                 0s
Search latency               30-60s              3-5s
Reranking (30 docs)          10-15s              1-2s

Features

• OpenAI Embeddings — text-embedding-3-small (1536 dims), native batch API, ~$0.02/1M tokens
• OpenAI Query Expansion — gpt-4o-mini for lex/vec/hyde variants
• OpenAI Reranking — API-based reranking replaces local qwen3-reranker, eliminating model download and GGUF inference overhead
• Tiktoken chunking — eliminates model load time for tokenization
• Robust retry logic — exponential backoff with jitter for rate limits
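
The retry bullet above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation: the retryable-status predicate and the full-jitter delay schedule are assumptions.

```typescript
// Retry an async call with exponential backoff plus full jitter.
// The status-code predicate and delay schedule are illustrative
// assumptions, not necessarily what openai-llm.ts does.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 5,
  baseMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      // Retry on rate limits (429) and server errors (5xx) only.
      const retryable = err?.status === 429 || err?.status >= 500;
      if (!retryable || attempt >= maxRetries) throw err;
      // Full jitter: random delay in [0, baseMs * 2^attempt).
      const delay = Math.random() * baseMs * 2 ** attempt;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
```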
Usage

export OPENAI_API_KEY="sk-..."
export QMD_OPENAI=1
qmd embed -f          # Re-embed with OpenAI
qmd search "query"

Design

• Opt-in — local models remain the default
• Graceful fallback — errors don't crash, just skip
• Replace local reranking with OpenAI — no GGUF model download or local inference needed
• No breaking changes — existing workflows unchanged
Files Changed

• src/openai-llm.ts — new OpenAI LLM implementation
• src/llm.ts — embedding config, provider switching
• src/store.ts — tiktoken chunking integration
• src/qmd.ts — QMD_OPENAI env var support
Dependencies

• openai — API client
• tiktoken — fast BPE tokenization

@lyrl

lyrl commented Feb 8, 2026

Great, I was looking for this. But the reranker in there doesn't support API calls?

@darkhanakh

This is great! :) Good job man 🔥

oscar1byte added a commit to runtimecorp/qmd that referenced this pull request Feb 13, 2026
Port PR tobi#116 (tobi/qmd) to current main, adapting to the refactored
codebase. Adds OpenAI as an alternative to local GGUF models, fixing
the ARM64 segfault during hybrid search (issue tobi#68).

Changes:
- New src/openai-llm.ts: OpenAI API client (embed, embedBatch, rerank,
  expandQuery) with exponential backoff and rate limiting
- llm.ts: setEmbeddingConfig(), getDefaultEmbeddingLLM(), isUsingOpenAI()
- collections.ts: EmbeddingProviderConfig type, getEmbeddingConfig()
- store.ts: Provider-aware embedding, chunking (tiktoken), expand, rerank
- qmd.ts: Startup config loading, provider-aware embed command
- package.json: openai + tiktoken dependencies

Config via ~/.config/qmd/index.yml:
  embedding:
    provider: openai
    openai:
      model: text-embedding-3-small

Or env: QMD_OPENAI=1 + OPENAI_API_KEY
@vincentkoc
Contributor

Love this!

@alexleach

Can one change config.baseUrl easily? I would like to connect to my own hosted OpenAI-compatible server. It is actually local, but as qmd is running in a container, I need to host the models in Docker Model Runtime to gain GPU acceleration. That is OpenAI-compatible, and like other hosted implementations, it just needs a way to configure the baseUrl...

@jonesj38
Author

Thanks for the patience on this. I've refreshed it:

Update (2026-03-28)
Rebased feat/openai-embeddings onto current main
Resolved conflicts and cleaned commit history
Force-pushed updated branch (--force-with-lease)
Verified local build passes (bun run build)
Current PR status is now mergeable.

I also got feedback from @alexleach, who runs OpenAI-compatible remote endpoints in minimal Docker environments. His work:

• adds a configurable OPENAI_BASE_URL, and
• avoids initializing/building node-llama-cpp when OpenAI mode is selected.

Waiting for him to re-submit his PRs.

Cheers

@jonesj38
Author

PR rebased to main

@paralizeer

+1 for merging this. Just closed our PR #490 (remote Ollama embeddings) in favor of this one — it's the more complete solution.

We've been running QMD on an ARM64 VPS (Oracle Cloud Ampere, no Vulkan/GPU) and remote Ollama embeddings via HTTP is the only viable path there. This PR solves that cleanly while also adding query expansion and reranking through OpenAI-compatible endpoints.

Tested the remote embedding approach on our infra and it works great — the baseUrl override that @alexleach asked about would also cover self-hosted OpenAI-compatible servers (Ollama, vLLM, etc.).

@tobi this would unblock a lot of headless/ARM/Docker deployments. Ready to help test if needed.

@viniciushsantana

Tested this fork with Ollama. The YAML config wiring needs two additions, plus generateEmbeddings needs to be patched to actually use the OpenAI provider. Here's what I did:

1. YAML config mapping for expansion_model and base_url

collections.ts EmbeddingProviderConfig is missing both fields, and qmd.ts doesn't pass them to setEmbeddingConfig:

// collections.ts
  openai?: {
    api_key?: string;
    model?: string;
+   expansion_model?: string;
+   base_url?: string;
  };

// qmd.ts
  openai: {
    apiKey: embeddingYamlConfig.openai?.api_key,
    embedModel: embeddingYamlConfig.openai?.model,
+   expansionModel: embeddingYamlConfig.openai?.expansion_model,
+   baseURL: embeddingYamlConfig.openai?.base_url,
  },

2. generateEmbeddings ignores the OpenAI provider

getLlm() always returns LlamaCpp, and the embedding loop always wraps in withLLMSessionForLlm which expects LlamaCpp. When isUsingOpenAI() is true, it needs to bypass the session wrapper and call the OpenAI LLM directly.
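
The fix being described could look roughly like the branch below. This is a sketch: the interface and function names (`EmbeddingClient`, `embedChunks`, `withLlamaSession`) are stand-ins for the PR's actual types, not its real code.

```typescript
// Hypothetical shape standing in for the PR's embedding LLM types.
interface EmbeddingClient {
  embedBatch(texts: string[]): Promise<number[][]>;
}

// Provider-aware dispatch: the OpenAI path calls the HTTP client
// directly, while the local path still wraps the call in a
// LlamaCpp session (the behavior the comment says is missing).
async function embedChunks(
  chunks: string[],
  isOpenAI: boolean,
  openaiClient: EmbeddingClient,
  withLlamaSession: (
    fn: (llm: EmbeddingClient) => Promise<number[][]>,
  ) => Promise<number[][]>,
): Promise<number[][]> {
  if (isOpenAI) {
    // Bypass the LlamaCpp session wrapper entirely.
    return openaiClient.embedBatch(chunks);
  }
  return withLlamaSession((llm) => llm.embedBatch(chunks));
}
```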


My yaml config file:

collections:
# [...]
embedding:
  provider: openai
  openai:
    api_key: ollama
    base_url: "http://ollama:11434/v1"
    model: qwen3-embedding:8b
    expansion_model: qwen3.5:9b

@alexleach

Just catching up on progress, as I received several notifications yesterday. Looks like this will need to be rebased again, since https://github.com/tobi/qmd/releases/tag/v2.1.0 was released 8 hours ago (~2026-04-06T00:00:00Z).

One note of interest on the release, which may conflict:-

GPU: catch initialization failures and fall back to CPU instead of crashing.


@jonesj38 Thanks again for your work on this and rebasing it to main last week. I've one comment on this PR as is:

adds configurable OPENAI_BASE_URL

I feel this would be better if it was renamed to QMD_OPENAI_BASE_URL, as OPENAI_BASE_URL is used by the official openai-node SDK and may conflict. For QMD, I want to use my host's GPU (so I would set OPENAI_BASE_URL to http://model-runner.docker.internal:12434), but then I want openclaw to use Azure Foundry for gpt- models, for which I would set OPENAI_BASE_URL to my Azure Foundry's resource URL. Of course this can also be set in the yaml config file, and actually, having just checked "Files Changed", there is no reference to OPENAI_BASE_URL, so this would have to be set in the yaml config anyway:-

  constructor(config: OpenAIConfig = {}) {
    this.client = new OpenAI({ 
      apiKey: config.apiKey || process.env.OPENAI_API_KEY,
      baseURL: config.baseURL,
    });

Sorry for not re-submitting my PRs, my code base became a mess after trying to rebase and I ended up deleting a lot of branches...


@tobi please can you comment on one of the many open PRs and Issues? A lot of people are pretty desperate to use remote models and it can be a lot of work having to rebase to main each time there's a new release. If you have any feedback, we would love to hear it! 🙂

@jonesj38
Author

jonesj38 commented Apr 6, 2026

> I feel this would be better if it was renamed to QMD_OPENAI_BASE_URL, as OPENAI_BASE_URL is used by the official openai-node SDK and may conflict.

Thanks Alex, agreed on renaming OPENAI_API_KEY to QMD_OPENAI_API_KEY. I'll get this branch rebased and have a look at your pull request today/tomorrow.

Cheers

@jonesj38 jonesj38 force-pushed the feat/openai-embeddings branch from 36754da to 3ad7006 Compare April 7, 2026 17:35
jonesj38 added a commit to jonesj38/qmd that referenced this pull request Apr 7, 2026
…ename, embed fix

Changes based on PR comments:

1. Configurable base_url for OpenAI-compatible APIs (Ollama, vLLM, Azure)
   - collections.ts: EmbeddingProviderConfig already has base_url field
   - qmd.ts: now passes base_url and expansion_model from YAML to setEmbeddingConfig
   - openai-llm.ts: constructor accepts baseURL config

2. Env var rename: QMD_OPENAI_API_KEY takes priority over OPENAI_API_KEY
   - Avoids conflict with official openai-node SDK (per @alexleach)
   - Falls back to OPENAI_API_KEY for backwards compatibility

3. generateEmbeddings bypasses LlamaCpp when using OpenAI (per @viniciushsantana)
   - OpenAI path calls API directly, no local model session needed
   - Refactored to shared runEmbedding() with pluggable embed/embedBatch fns

4. expandQuery now actually calls OpenAI for query expansion
   - Was previously returning lex-only fallback when isUsingOpenAI()
   - Now uses gpt-4o-mini via openaiLLM.expandQuery()

5. README updated with base_url, expansion_model docs

Addresses: @alexleach (env naming, base_url), @viniciushsantana (embed fix,
expansion_model, base_url YAML wiring)
jonesj38 added a commit to jonesj38/qmd that referenced this pull request Apr 9, 2026
…ename, embed fix

@jonesj38 jonesj38 force-pushed the feat/openai-embeddings branch from be46ac9 to a1835e0 Compare April 9, 2026 17:54
@jonesj38
Author

jonesj38 commented Apr 9, 2026

@tobi Rebased onto main

Cheers

@Kaspre

Kaspre commented Apr 10, 2026

Use case: split embedding + chat backends (local GPU + cloud)

We're running QMD 2.1.0 on WSL2 (Ubuntu 24.04) with no GPU access in WSL — the GPU (AMD RX 580) is only reachable via a llama-server instance on the Windows host serving nomic-embed-text over HTTP (/v1/embeddings). For query expansion and reranking, we use Ollama Cloud (gemma3:4b via https://ollama.com/v1/chat/completions).

This PR got us from "completely broken hybrid search" (node-llama-cpp triggering a 2+ minute cmake build on every query) to 4s end-to-end hybrid queries — expansion via cloud, embeddings via local GPU, reranking via cloud.

One small patch we needed on top of this PR: split base URLs.

The current base_url config is shared between embeddings and chat (expansion/reranking). When your embedding server and chat server are at different URLs (common with llama-server which only serves one model per instance), you need separate endpoints.

We added:

  • chat_base_url / QMD_OPENAI_CHAT_BASE_URL — separate base URL for chat completions (falls back to base_url)
  • chat_api_key / QMD_OPENAI_CHAT_API_KEY — separate API key for the chat endpoint (falls back to api_key)
  • A second OpenAI client instance in OpenAIEmbedding that routes expandQuery() and rerank() to the chat URL while embed() / embedBatch() use the embedding URL

Config example:

embedding:
  provider: openai
  openai:
    api_key: "dummy"
    base_url: "http://172.30.192.1:8081/v1"       # llama-server on Windows (nomic-embed-text, GPU)
    model: "nomic-embed-text"
    chat_base_url: "https://ollama.com/v1"          # Ollama Cloud for expansion/reranking
    expansion_model: "gemma3:4b"

The patch is ~20 lines across openai-llm.ts, collections.ts, and qmd.ts. Happy to open a PR against this branch if useful.
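
The fallback order described above (chat settings falling back to the embedding settings) can be sketched as a small config resolver. The shapes here are assumptions standing in for the real `openai` SDK client options; only the fallback logic is the point.

```typescript
// Minimal stand-in for the fields the openai SDK client takes.
interface ClientConfig {
  baseURL: string;
  apiKey: string;
}

// Resolve separate embedding and chat client configs from one YAML
// block: chat_base_url / chat_api_key fall back to base_url / api_key
// when absent, per the pattern described in the comment.
function resolveClients(cfg: {
  base_url: string;
  api_key: string;
  chat_base_url?: string;
  chat_api_key?: string;
}): { embed: ClientConfig; chat: ClientConfig } {
  const embed = { baseURL: cfg.base_url, apiKey: cfg.api_key };
  const chat = {
    baseURL: cfg.chat_base_url ?? cfg.base_url,
    apiKey: cfg.chat_api_key ?? cfg.api_key,
  };
  return { embed, chat };
}
```

With two resolved configs, `expandQuery()`/`rerank()` construct their client from `chat` while `embed()`/`embedBatch()` use `embed`.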

Also: we had to disable node-llama-cpp entirely (mv node_modules/node-llama-cpp ...disabled) to prevent it from triggering cmake compilation at import time, even when OpenAI mode is active. The lazy await import("node-llama-cpp") in llm.ts appears to fire during the embed command path despite isUsingOpenAI() returning true. Might be worth guarding that import with an explicit OpenAI check, or making the embed CLI command skip LLM session initialization entirely when in OpenAI mode.

Adds support for using OpenAI's text-embedding-3-small model as an
alternative to local llama-cpp embeddings.

Changes:
- New openai-llm.ts: OpenAI API client implementing LLM interface
- llm.ts: Embedding config management, getDefaultEmbeddingLLM()
- collections.ts: EmbeddingProviderConfig for YAML config schema
- store.ts: Use configurable embedding LLM, skip local model for
  query expansion/rerank when using OpenAI
- qmd.ts: Load embedding config on startup
- package.json: Add openai dependency
- README.md: Documentation for OpenAI embeddings

Configuration (in ~/.config/qmd/index.yml):
  embedding:
    provider: openai
    openai:
      api_key: sk-...  # Optional, falls back to OPENAI_API_KEY env
      model: text-embedding-3-small  # Optional, this is the default

Benefits:
- Much faster embedding (~10x vs local models on CPU)
- No GPU/VRAM requirements
- More reliable (no local model loading issues)
- Cost: ~$0.02 per 1M tokens
- OpenAI embeddings (text-embedding-3-small, 1536d) via QMD_OPENAI=1
- Query expansion with gpt-4o-mini (~200ms vs 30s local)
- Tiktoken for fast tokenization (no model loading)
- Exponential backoff with jitter for rate limits (429)
- Inter-batch delay (150ms) to avoid hitting RPM limits
- Performance: search 3-5s (was 30-60s), embed ~10min (was 2hrs)

Files: openai-llm.ts, llm.ts, store.ts, qmd.ts
Deps: openai, tiktoken
jonesj38 and others added 5 commits April 11, 2026 19:24
Replace the rerank() stub with a real listwise reranker using gpt-4o-mini.

- Sends top candidates with query to gpt-4o-mini as a ranking task
- Parses comma-separated index output, handles missing/duplicate indices
- Skips API call for ≤2 documents (not worth the latency)
- Falls back to original order on API failure
- Cost: ~$0.001 per rerank call
- Updated qmd.ts to route through OpenAI reranker instead of skipping

The full qmd query pipeline with OpenAI now:
1. Query expansion (gpt-4o-mini)
2. BM25 + vector search (parallel)
3. RRF fusion
4. Cross-encoder reranking (gpt-4o-mini) ← NEW
5. Position-aware blending
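
The "parses comma-separated index output, handles missing/duplicate indices" step in the reranker commit could be implemented roughly like this. A sketch under stated assumptions: the function name and the exact recovery behavior (appending omitted indices in original order) are illustrative, not taken from the PR's code.

```typescript
// Parse a model reply like "2, 0, 2, 5" into a full permutation of
// [0, n): drop out-of-range and duplicate indices, then append any
// indices the model omitted, in their original order. A garbage
// reply therefore degrades to the original order, matching the
// "falls back to original order" behavior described above.
function parseRanking(reply: string, n: number): number[] {
  const seen = new Set<number>();
  const order: number[] = [];
  for (const tok of reply.split(",")) {
    const i = Number.parseInt(tok.trim(), 10);
    if (Number.isInteger(i) && i >= 0 && i < n && !seen.has(i)) {
      seen.add(i);
      order.push(i);
    }
  }
  // Recover indices the model dropped.
  for (let i = 0; i < n; i++) if (!seen.has(i)) order.push(i);
  return order;
}
```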
Accept comma-separated collection names in -c flag for cross-collection
search. All three search modes (search, vsearch, query) now support
querying multiple collections simultaneously.

Changes:
- resolveCollectionFilter() helper parses and validates comma-separated names
- searchFTS() accepts string | string[] for collection filtering
- searchVec() accepts string | string[] for collection filtering
- SQL uses IN clause for multi-collection filtering
- Updated interface types and test for new parameter types

Usage:
  qmd search 'auth' -c repo-a,repo-b
  qmd vsearch 'auth patterns' -c docs,examples
  qmd query 'OAuth implementation' -c project,patterns,docs

This enables Shad's multi-vault search to pass all vault collections
in a single qmd call instead of running separate searches per collection.
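
The comma-separated `-c` parsing plus the SQL `IN` clause described in that commit can be sketched as below. The validation against a known-collections set is an assumption about what `resolveCollectionFilter()` checks; the placeholder style assumes `?`-parameterized SQL.

```typescript
// Split a -c value like "repo-a,repo-b" into trimmed, non-empty
// collection names, rejecting any name not in the known set.
function resolveCollectionFilter(
  flag: string,
  known: Set<string>,
): string[] {
  const names = flag.split(",").map((s) => s.trim()).filter(Boolean);
  for (const name of names) {
    if (!known.has(name)) throw new Error(`Unknown collection: ${name}`);
  }
  return names;
}

// Multi-collection SQL filtering then becomes an IN clause with one
// bound placeholder per name.
function inClause(names: string[]): string {
  return `collection IN (${names.map(() => "?").join(", ")})`;
}
```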
Add support for separate OpenAI-compatible servers for embeddings vs
chat (expansion/reranking). Common in setups where local GPU serves
embeddings and cloud handles chat. Implements Kaspre's split-URL
pattern from PR tobi#116 discussion.

- Add chat_base_url and chat_api_key to YAML config and OpenAIConfig
- Add QMD_OPENAI_* env var prefix (QMD_OPENAI_BASE_URL,
  QMD_OPENAI_API_KEY, QMD_OPENAI_CHAT_BASE_URL,
  QMD_OPENAI_CHAT_API_KEY) per alexleach's suggestion
- Wire expansion_model and base_url through YAML config
  per viniciushsantana's feedback
- Route expandQuery() and rerank() through chatClient,
  embed()/embedBatch() through embedding client
- Fix upstream rebase issues (Database.transaction type, collectionName
  rename)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ifferent models can be used to each. Thanks to @Kaspre for their comment

embedding:
  provider: openai
  openai:
    api_key: "sk-..."
    base_url: "http://localhost:8081/v1"          # embeddings
    model: "nomic-embed-text"

    chat_base_url: "https://ollama.com/v1"        # expansion (falls back to base_url)
    chat_api_key: "..."                           # (falls back to api_key)
    expansion_model: "gemma3:4b"

    rerank_base_url: "https://api.cohere.com/v1"  # reranking (falls back to chat_base_url)
    rerank_api_key: "..."                         # (falls back to chat_api_key)
    rerank_model: "rerank-v3"                     # (falls back to expansion_model)

also rebased onto main
@jonesj38 jonesj38 force-pushed the feat/openai-embeddings branch from a1835e0 to fc30ecd Compare April 12, 2026 03:11
jonesj38 and others added 2 commits April 11, 2026 21:49
…ename, embed fix

…ments

- Lazy-load node-llama-cpp to skip native compilation in OpenAI mode
- Add tiktoken-based input truncation (QMD_OPENAI_MAX_INPUT_TOKENS)
- QMD_OPENAI_BASE_URL auto-activates OpenAI mode (no QMD_OPENAI=1 needed)
- Skip LlamaCpp init in qmd status when using OpenAI
- Restore terminal cursor on embed error (try/finally)
- Bypass withLLMSession in vectorSearch/querySearch for OpenAI mode

Co-authored-by: ALB.Leach <alexleach@users.noreply.github.com>
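
The tiktoken-based input truncation mentioned in that commit (capped by QMD_OPENAI_MAX_INPUT_TOKENS) reduces to the logic below. To keep the snippet self-contained, the encoder is injected rather than imported; the real code presumably wires in tiktoken's BPE encode/decode, and the function name here is hypothetical.

```typescript
// Truncate text to at most maxTokens tokens, given encode/decode
// functions. The encoder is injected so this sketch runs without
// the tiktoken dependency; the PR presumably passes tiktoken's
// encoder here.
function truncateToTokens(
  text: string,
  maxTokens: number,
  encode: (s: string) => any[],
  decode: (toks: any[]) => string,
): string {
  const toks = encode(text);
  if (toks.length <= maxTokens) return text; // nothing to trim
  return decode(toks.slice(0, maxTokens));
}
```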
@jonesj38 jonesj38 marked this pull request as draft April 12, 2026 05:00
@jonesj38 jonesj38 marked this pull request as ready for review April 15, 2026 22:00