feat: Add OpenAI embedding and query expansion support #116
Conversation
Great. I was looking for this. But the rerank in there doesn't support API calls?

This is great! :) Good job man 🔥
Port PR tobi#116 (tobi/qmd) to current main, adapting to the refactored codebase. Adds OpenAI as an alternative to local GGUF models, fixing the ARM64 segfault during hybrid search (issue tobi#68).

Changes:
- New src/openai-llm.ts: OpenAI API client (embed, embedBatch, rerank, expandQuery) with exponential backoff and rate limiting
- llm.ts: setEmbeddingConfig(), getDefaultEmbeddingLLM(), isUsingOpenAI()
- collections.ts: EmbeddingProviderConfig type, getEmbeddingConfig()
- store.ts: Provider-aware embedding, chunking (tiktoken), expand, rerank
- qmd.ts: Startup config loading, provider-aware embed command
- package.json: openai + tiktoken dependencies

Config via ~/.config/qmd/index.yml:

embedding:
  provider: openai
  openai:
    model: text-embedding-3-small

Or env: QMD_OPENAI=1 + OPENAI_API_KEY
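A rough sketch of the provider switch this port adds. `EmbeddingProviderConfig`, `isUsingOpenAI()`, and the YAML fields come from the summary above, but the function body and the exact field shapes are assumptions:

```typescript
// Sketch of the config type loaded from ~/.config/qmd/index.yml.
// Only `provider` and `openai.model` appear in the example above;
// the rest of the shape is assumed.
type EmbeddingProviderConfig = {
  provider?: "local" | "openai";
  openai?: {
    api_key?: string;
    model?: string;
  };
};

// Provider selection: YAML config first, then the QMD_OPENAI=1 +
// OPENAI_API_KEY env escape hatch mentioned at the end of the summary.
function isUsingOpenAI(
  config: EmbeddingProviderConfig | undefined,
  env: Record<string, string | undefined> = process.env,
): boolean {
  if (config?.provider === "openai") return true;
  return env.QMD_OPENAI === "1" && Boolean(env.OPENAI_API_KEY);
}
```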
Love this!
Force-pushed from 4869849 to a4f8415
Can one change
Force-pushed from 7a718f6 to fc2f137
Thanks for the patience on this. I've refreshed it:

Update (2026-03-28): I also got feedback from @alexleach, who is running OpenAI-compatible remote endpoints in minimal Docker environments. This adds a configurable OPENAI_BASE_URL. Waiting for him to re-submit his PRs.

Cheers
PR rebased to main
+1 for merging this. Just closed our PR #490 (remote Ollama embeddings) in favor of this one; it's the more complete solution. We've been running QMD on an ARM64 VPS (Oracle Cloud Ampere, no Vulkan/GPU), and remote Ollama embeddings over HTTP are the only viable path there. This PR solves that cleanly while also adding query expansion and reranking through OpenAI-compatible endpoints. Tested the remote embedding approach on our infra and it works great. @tobi this would unblock a lot of headless/ARM/Docker deployments. Ready to help test if needed.
Tested this fork with Ollama. The YAML config wiring needs two additions:

1. YAML config map for the new fields:
// collections.ts
openai?: {
  api_key?: string;
  model?: string;
+ expansion_model?: string;
+ base_url?: string;
};

// qmd.ts
openai: {
  apiKey: embeddingYamlConfig.openai?.api_key,
  embedModel: embeddingYamlConfig.openai?.model,
+ expansionModel: embeddingYamlConfig.openai?.expansion_model,
+ baseURL: embeddingYamlConfig.openai?.base_url,
},

2. My YAML config file:

collections:
  # [...]
embedding:
  provider: openai
  openai:
    api_key: ollama
    base_url: "http://ollama:11434/v1"
    model: qwen3-embedding:8b
    expansion_model: qwen3.5:9b
Just catching up on progress, as I received several notifications yesterday. Looks like this will need to be rebased again, since https://github.com/tobi/qmd/releases/tag/v2.1.0 was released 8 hours ago (~2026-04-06T00:00:00Z). One note of interest on the release, which may conflict:
@jonesj38 Thanks again for your work on this and for rebasing it to main last week. I have one comment on this PR as is:

I feel this would be better if OPENAI_API_KEY was renamed to QMD_OPENAI_API_KEY:

constructor(config: OpenAIConfig = {}) {
  this.client = new OpenAI({
    apiKey: config.apiKey || process.env.OPENAI_API_KEY,
    baseURL: config.baseURL,
  });
}

Sorry for not re-submitting my PRs; my code base became a mess after trying to rebase and I ended up deleting a lot of branches...

@tobi please can you comment on one of the many open PRs and Issues? A lot of people are pretty desperate to use remote models, and it can be a lot of work having to rebase to main each time there's a new release. If you have any feedback, we would love to hear it! 🙂
Thanks Alex, agree with renaming OPENAI_API_KEY to QMD_OPENAI_API_KEY. I'll get this branch rebased and have a look at your pull request today/tomorrow. Cheers
Force-pushed from 36754da to 3ad7006
…ename, embed fix

Changes based on PR comments:
1. Configurable base_url for OpenAI-compatible APIs (Ollama, vLLM, Azure)
   - collections.ts: EmbeddingProviderConfig already has base_url field
   - qmd.ts: now passes base_url and expansion_model from YAML to setEmbeddingConfig
   - openai-llm.ts: constructor accepts baseURL config
2. Env var rename: QMD_OPENAI_API_KEY takes priority over OPENAI_API_KEY
   - Avoids conflict with the official openai-node SDK (per @alexleach)
   - Falls back to OPENAI_API_KEY for backwards compatibility
3. generateEmbeddings bypasses LlamaCpp when using OpenAI (per @viniciushsantana)
   - OpenAI path calls the API directly; no local model session needed
   - Refactored to shared runEmbedding() with pluggable embed/embedBatch fns
4. expandQuery now actually calls OpenAI for query expansion
   - Was previously returning the lex-only fallback when isUsingOpenAI()
   - Now uses gpt-4o-mini via openaiLLM.expandQuery()
5. README updated with base_url and expansion_model docs

Addresses: @alexleach (env naming, base_url), @viniciushsantana (embed fix, expansion_model, base_url YAML wiring)
Force-pushed from be46ac9 to a1835e0
@tobi Rebased onto main. Cheers
Use case: split embedding + chat backends (local GPU + cloud)

We're running QMD 2.1.0 on WSL2 (Ubuntu 24.04) with no GPU access in WSL; the GPU (AMD RX 580) is only reachable via a llama-server running on the Windows host.

This PR got us from "completely broken hybrid search" (node-llama-cpp triggering a 2+ minute cmake build on every query) to 4s end-to-end hybrid queries: expansion via cloud, embeddings via local GPU, reranking via cloud.

One small patch we needed on top of this PR: split base URLs. The current config has a single base_url for embeddings, expansion, and reranking alike. We added:
Config example:

embedding:
  provider: openai
  openai:
    api_key: "dummy"
    base_url: "http://172.30.192.1:8081/v1"  # llama-server on Windows (nomic-embed-text, GPU)
    model: "nomic-embed-text"
    chat_base_url: "https://ollama.com/v1"   # Ollama Cloud for expansion/reranking
    expansion_model: "gemma3:4b"

The patch is ~20 lines across
Also: we had to disable
Adds support for using OpenAI's text-embedding-3-small model as an
alternative to local llama-cpp embeddings.
Changes:
- New openai-llm.ts: OpenAI API client implementing LLM interface
- llm.ts: Embedding config management, getDefaultEmbeddingLLM()
- collections.ts: EmbeddingProviderConfig for YAML config schema
- store.ts: Use configurable embedding LLM, skip local model for
query expansion/rerank when using OpenAI
- qmd.ts: Load embedding config on startup
- package.json: Add openai dependency
- README.md: Documentation for OpenAI embeddings
Configuration (in ~/.config/qmd/index.yml):

embedding:
  provider: openai
  openai:
    api_key: sk-...  # Optional, falls back to OPENAI_API_KEY env
    model: text-embedding-3-small  # Optional, this is the default
Benefits:
- Much faster embedding (~10x vs local models on CPU)
- No GPU/VRAM requirements
- More reliable (no local model loading issues)
- Cost: ~$0.02 per 1M tokens
- OpenAI embeddings (text-embedding-3-small, 1536d) via QMD_OPENAI=1
- Query expansion with gpt-4o-mini (~200ms vs 30s local)
- Tiktoken for fast tokenization (no model loading)
- Exponential backoff with jitter for rate limits (429)
- Inter-batch delay (150ms) to avoid hitting RPM limits
- Performance: search 3-5s (was 30-60s), embed ~10min (was 2hrs)

Files: openai-llm.ts, llm.ts, store.ts, qmd.ts
Deps: openai, tiktoken
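The retry strategy named above (exponential backoff with jitter for 429s) can be sketched as below; the base delay, cap, and full-jitter shape are assumptions, since the commit message only names the technique:

```typescript
// Full-jitter exponential backoff: delay grows as base * 2^attempt,
// is capped, then scaled by a random factor in [0, 1).
function backoffDelayMs(
  attempt: number,                     // 0-based retry attempt
  baseMs = 500,
  capMs = 30_000,
  random: () => number = Math.random,  // injectable for testing
): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return exp * random();
}

// Retry wrapper: only HTTP 429 responses are retried; other errors propagate.
async function withRetries<T>(fn: () => Promise<T>, maxRetries = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      if (err?.status !== 429 || attempt >= maxRetries) throw err;
      await new Promise((r) => setTimeout(r, backoffDelayMs(attempt)));
    }
  }
}
```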
Replace the rerank() stub with a real listwise reranker using gpt-4o-mini.

- Sends top candidates with the query to gpt-4o-mini as a ranking task
- Parses comma-separated index output, handles missing/duplicate indices
- Skips the API call for ≤2 documents (not worth the latency)
- Falls back to original order on API failure
- Cost: ~$0.001 per rerank call
- Updated qmd.ts to route through the OpenAI reranker instead of skipping

The full qmd query pipeline with OpenAI now:
1. Query expansion (gpt-4o-mini)
2. BM25 + vector search (parallel)
3. RRF fusion
4. Cross-encoder reranking (gpt-4o-mini) ← NEW
5. Position-aware blending
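The tolerant index parsing described in that commit might look like this sketch (the function name is hypothetical; the behavior follows the bullets: drop out-of-range and duplicate indices, append any the model forgot):

```typescript
// Turn a model reply like "2, 0, 2, 7" into a valid permutation of
// [0, numDocs): ignore junk, out-of-range, and duplicate indices,
// then append missing indices in their original order.
function parseRanking(reply: string, numDocs: number): number[] {
  const seen = new Set<number>();
  const order: number[] = [];
  for (const token of reply.split(/[,\s]+/)) {
    const idx = Number.parseInt(token, 10);
    if (Number.isInteger(idx) && idx >= 0 && idx < numDocs && !seen.has(idx)) {
      seen.add(idx);
      order.push(idx);
    }
  }
  for (let i = 0; i < numDocs; i++) {
    if (!seen.has(i)) order.push(i);  // missing indices keep original order
  }
  return order;
}
```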
Accept comma-separated collection names in the -c flag for cross-collection search. All three search modes (search, vsearch, query) now support querying multiple collections simultaneously.

Changes:
- resolveCollectionFilter() helper parses and validates comma-separated names
- searchFTS() accepts string | string[] for collection filtering
- searchVec() accepts string | string[] for collection filtering
- SQL uses an IN clause for multi-collection filtering
- Updated interface types and test for the new parameter types

Usage:
qmd search 'auth' -c repo-a,repo-b
qmd vsearch 'auth patterns' -c docs,examples
qmd query 'OAuth implementation' -c project,patterns,docs

This enables Shad's multi-vault search to pass all vault collections in a single qmd call instead of running separate searches per collection.
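The flag parsing and IN-clause filtering described in that commit could be sketched like this (only resolveCollectionFilter() is named in the commit; the clause builder is a hypothetical helper):

```typescript
// "-c repo-a,repo-b"  →  ["repo-a", "repo-b"]; undefined means "no filter".
function resolveCollectionFilter(flag?: string): string[] | undefined {
  if (!flag) return undefined;
  const names = flag.split(",").map((n) => n.trim()).filter(Boolean);
  return names.length > 0 ? names : undefined;
}

// Build a parameterized IN clause so collection names are bound,
// never interpolated into the SQL string.
function collectionClause(
  filter: string | string[] | undefined,
): { sql: string; params: string[] } {
  const names = typeof filter === "string" ? [filter] : filter ?? [];
  if (names.length === 0) return { sql: "", params: [] };
  const placeholders = names.map(() => "?").join(", ");
  return { sql: `AND collection IN (${placeholders})`, params: names };
}
```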
Add support for separate OpenAI-compatible servers for embeddings vs chat (expansion/reranking). Common in setups where a local GPU serves embeddings and the cloud handles chat. Implements Kaspre's split-URL pattern from the PR tobi#116 discussion.

- Add chat_base_url and chat_api_key to the YAML config and OpenAIConfig
- Add the QMD_OPENAI_* env var prefix (QMD_OPENAI_BASE_URL, QMD_OPENAI_API_KEY, QMD_OPENAI_CHAT_BASE_URL, QMD_OPENAI_CHAT_API_KEY) per alexleach's suggestion
- Wire expansion_model and base_url through the YAML config per viniciushsantana's feedback
- Route expandQuery() and rerank() through chatClient, embed()/embedBatch() through the embedding client
- Fix upstream rebase issues (Database.transaction type, collectionName rename)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ifferent models can be used for each. Thanks to @Kaspre for their comment.

embedding:
  provider: openai
  openai:
    api_key: "sk-..."
    base_url: "http://localhost:8081/v1"  # embeddings
    model: "nomic-embed-text"
    chat_base_url: "https://ollama.com/v1"  # expansion (falls back to base_url)
    chat_api_key: "..."  # (falls back to api_key)
    expansion_model: "gemma3:4b"
    rerank_base_url: "https://api.cohere.com/v1"  # reranking (falls back to chat_base_url)
    rerank_api_key: "..."  # (falls back to chat_api_key)
    rerank_model: "rerank-v3"  # (falls back to expansion_model)

Also rebased onto main.
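The fallback chain in that config (chat settings fall back to the embedding settings, rerank settings fall back to chat) resolves to something like this sketch; the type and function names are assumptions:

```typescript
type OpenAIYaml = {
  api_key?: string;
  base_url?: string;
  model?: string;
  chat_api_key?: string;
  chat_base_url?: string;
  expansion_model?: string;
  rerank_api_key?: string;
  rerank_base_url?: string;
  rerank_model?: string;
};

type Endpoint = { apiKey?: string; baseURL?: string; model?: string };

// Resolve the three roles with the fallback order documented above:
// chat → embedding settings, rerank → chat settings.
function resolveEndpoints(c: OpenAIYaml): { embed: Endpoint; chat: Endpoint; rerank: Endpoint } {
  const embed = { apiKey: c.api_key, baseURL: c.base_url, model: c.model };
  const chat = {
    apiKey: c.chat_api_key ?? embed.apiKey,
    baseURL: c.chat_base_url ?? embed.baseURL,
    model: c.expansion_model,
  };
  const rerank = {
    apiKey: c.rerank_api_key ?? chat.apiKey,
    baseURL: c.rerank_base_url ?? chat.baseURL,
    model: c.rerank_model ?? chat.model,
  };
  return { embed, chat, rerank };
}
```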
Force-pushed from a1835e0 to fc30ecd
…ments

- Lazy-load node-llama-cpp to skip native compilation in OpenAI mode
- Add tiktoken-based input truncation (QMD_OPENAI_MAX_INPUT_TOKENS)
- QMD_OPENAI_BASE_URL auto-activates OpenAI mode (no QMD_OPENAI=1 needed)
- Skip LlamaCpp init in qmd status when using OpenAI
- Restore terminal cursor on embed error (try/finally)
- Bypass withLLMSession in vectorSearch/querySearch for OpenAI mode

Co-authored-by: ALB.Leach <alexleach@users.noreply.github.com>
Summary
Optional OpenAI integration for embeddings and query expansion. Dramatically faster for users who prefer API-based inference over local models.
Performance
• Search: 3-5s (was 30-60s with local models)
• Embedding: ~10min (was ~2hrs)
• Query expansion: ~200ms (was ~30s local)
Features
• OpenAI Embeddings — text-embedding-3-small (1536 dims), native batch API, ~$0.02/1M tokens
• OpenAI Query Expansion — gpt-4o-mini for lex/vec/hyde variants
• OpenAI Reranking — API-based reranking replaces local qwen3-reranker, eliminating model download and GGUF inference overhead
• Tiktoken chunking — eliminates model load time for tokenization
• Robust retry logic — exponential backoff with jitter for rate limits
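The tiktoken chunking bullet above could work roughly as below; a pluggable tokenizer interface stands in for tiktoken's encoder so the sketch stays self-contained, and the chunk size and overlap values are assumptions:

```typescript
interface Tokenizer {
  encode(text: string): number[];
  decode(tokens: number[]): string;
}

// Split text into token-bounded chunks with a small overlap, using only a
// tokenizer, so no local model has to be loaded just to count tokens.
function chunkByTokens(
  text: string,
  tok: Tokenizer,
  maxTokens = 512,
  overlap = 64,
): string[] {
  const tokens = tok.encode(text);
  if (tokens.length <= maxTokens) return [text];
  const chunks: string[] = [];
  const step = maxTokens - overlap;
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tok.decode(tokens.slice(start, start + maxTokens)));
    if (start + maxTokens >= tokens.length) break;  // last window reached the end
  }
  return chunks;
}
```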
Usage
export OPENAI_API_KEY="sk-..."
export QMD_OPENAI=1
qmd embed -f        # Re-embed with OpenAI
qmd search "query"

Design
• Opt-in — local models remain the default
• Graceful fallback — API errors don't crash the pipeline; the affected stage is skipped
• Replace local reranking with OpenAI — no GGUF model download or local inference needed
• No breaking changes — existing workflows unchanged
Files Changed
• src/openai-llm.ts — new OpenAI LLM implementation
• src/llm.ts — embedding config, provider switching
• src/store.ts — tiktoken chunking integration
• src/qmd.ts — QMD_OPENAI env var support
Dependencies
• openai — API client
• tiktoken — fast BPE tokenization