## Problem

QMD 2.0 uses `node-llama-cpp` for all embedding and tokenization operations. This requires local compilation via CMake on every run, which fails on platforms without GPU/Vulkan support (e.g., ARM64 VPS, Docker containers, CI runners) and is unusable when the Ollama instance runs on a separate machine (common with Tailscale, Docker networks, or dedicated GPU boxes).

The `OLLAMA_EMBED_URL` env var exists but only partially applies to `qmd vsearch` — `expandQuery()`, `chunkDocumentByTokens()`, `generateEmbeddings()`, and the `vsearch` CLI command all still require `node-llama-cpp` and trigger CMake compilation.
## Use Case

Running QMD on an ARM64 Oracle Cloud VPS with Ollama on a separate machine (connected via Tailscale). The Ollama instance serves `qwen3-embedding:0.6b` (1024 dims, #1 multilingual on MTEB). There is no local GPU and no way to compile `node-llama-cpp` cleanly.
This is a common setup for anyone using:
- Remote Ollama (Docker, Tailscale, LAN)
- ARM64 servers (Oracle, Ampere, Graviton)
- Headless VPS without GPU drivers
## Proposed Solution

When `OLLAMA_EMBED_URL` is set, bypass all `node-llama-cpp` / `getDefaultLlamaCpp()` calls:

### 1. `ollamaEmbed()` helper function
```typescript
async function ollamaEmbed(text: string): Promise<EmbeddingResult> {
  const url = process.env.OLLAMA_EMBED_URL;
  const model = process.env.OLLAMA_EMBED_MODEL || "nomic-embed-text";
  const res = await fetch(`${url}/api/embed`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, input: text }),
  });
  if (!res.ok) {
    throw new Error(`Ollama embed failed: ${res.status} ${res.statusText}`);
  }
  const data = await res.json();
  return { embedding: data.embeddings[0], model };
}
```
### 2. Patch points (6 locations)

| Function | File | What to bypass |
| --- | --- | --- |
| `getEmbedding()` | `store.ts` | `getDefaultLlamaCpp()` → `ollamaEmbed()` |
| `generateEmbeddings()` | `store.ts` | `withLLMSessionForLlm` → direct Ollama HTTP |
| `expandQuery()` | `store.ts` | LLM query expansion → pass-through `[{type: "vec", query}]` |
| `chunkDocumentByTokens()` | `store.ts` | `llm.tokenize()` → char-based estimation (`text.length / 3`) |
| `embedBatch` (2 sites) | `store.ts` | `llm.embedBatch()` → `ollamaEmbedBatch()` |
| `vectorSearch` CLI | `cli/qmd.ts` | `withLLMSession()` → direct call |
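The `ollamaEmbedBatch()` helper referenced above could mirror `ollamaEmbed()`, relying on the fact that Ollama's `/api/embed` accepts an array `input` and returns one embedding per item. A minimal sketch (the function name and shape are a proposal, not existing QMD code):

```typescript
// Hypothetical batch helper: /api/embed with an array input returns
// { embeddings: number[][] }, one embedding per input string.
async function ollamaEmbedBatch(texts: string[]): Promise<number[][]> {
  const url = process.env.OLLAMA_EMBED_URL;
  const model = process.env.OLLAMA_EMBED_MODEL || "nomic-embed-text";
  const res = await fetch(`${url}/api/embed`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, input: texts }),
  });
  if (!res.ok) {
    throw new Error(`Ollama embed failed: ${res.status} ${res.statusText}`);
  }
  const data = await res.json();
  return data.embeddings;
}
```

Batching this way keeps the two `embedBatch` call sites on a single HTTP round-trip per batch instead of one request per chunk.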
### 3. Environment variables

```sh
export OLLAMA_EMBED_URL=http://your-ollama:11434
export OLLAMA_EMBED_MODEL=qwen3-embedding:0.6b  # optional, defaults to nomic-embed-text
```
## Results (tested on ARM64 Oracle VPS)

- Before: every `vsearch`/`embed`/query triggers CMake compilation (fails on ARM64 without Vulkan)
- After: zero compilation, instant results via HTTP to remote Ollama
- `qmd embed --force` successfully re-indexes 7,100+ documents
- `qmd vsearch` returns results in <2s vs hanging indefinitely on CMake
## Notes

- `qmd search` (BM25) is unaffected — it works without any of this
- `expandQuery` uses an LLM for HyDE-style query expansion. On the Ollama path we skip this and pass the raw query straight to vector search. A future improvement could call Ollama's `/api/generate` for query expansion.
- Char-based chunking (`text.length / avgCharsPerToken`) is a reasonable approximation that avoids requiring a local tokenizer
- The `OLLAMA_EMBED_MODEL` env var lets users pick any embedding model available on their Ollama instance
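The char-based chunking mentioned above could look like this (a sketch only; `chunkByEstimatedTokens`, `estimateTokens`, and the 3-chars-per-token default are illustrative, not QMD's actual `chunkDocumentByTokens` implementation):

```typescript
// Estimate token count without a local tokenizer. avgCharsPerToken = 3
// matches the text.length / 3 heuristic; real tokenizers vary by language.
function estimateTokens(text: string, avgCharsPerToken = 3): number {
  return Math.ceil(text.length / avgCharsPerToken);
}

// Split a document into chunks of at most ~maxTokens estimated tokens each.
function chunkByEstimatedTokens(
  text: string,
  maxTokens = 512,
  avgCharsPerToken = 3,
): string[] {
  const maxChars = maxTokens * avgCharsPerToken;
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}
```

Since the estimate errs low (3 chars/token vs ~4 for English prose), chunks stay safely under the embedding model's context window at the cost of slightly smaller chunks.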
Happy to submit a PR if there is interest.