
Feature request: Support remote Ollama embeddings via HTTP (OLLAMA_EMBED_URL) #489

@paralizeer

Description


Problem

QMD 2.0 uses node-llama-cpp for all embedding and tokenization operations. This requires local compilation via CMake on every run, which fails on platforms without GPU/Vulkan support (e.g., ARM64 VPS, Docker containers, CI runners) and is unusable when the Ollama instance runs on a separate machine (common with Tailscale, Docker networks, or dedicated GPU boxes).

The OLLAMA_EMBED_URL env var exists but is only partially honored by qmd vsearch — expandQuery(), chunkDocumentByTokens(), generateEmbeddings(), and the vsearch CLI command all still require node-llama-cpp and trigger CMake compilation.

Use Case

Running QMD on an ARM64 Oracle Cloud VPS with Ollama on a separate machine (connected via Tailscale). The Ollama instance serves qwen3-embedding:0.6b (1024 dims, MTEB #1 multilingual). There is no local GPU and no way to compile node-llama-cpp cleanly.

This is a common setup for anyone using:

  • Remote Ollama (Docker, Tailscale, LAN)
  • ARM64 servers (Oracle, Ampere, Graviton)
  • Headless VPS without GPU drivers

Proposed Solution

When OLLAMA_EMBED_URL is set, bypass all node-llama-cpp / getDefaultLlamaCpp() calls:

1. ollamaEmbed() helper function

async function ollamaEmbed(text: string): Promise<EmbeddingResult> {
  const url = process.env.OLLAMA_EMBED_URL;
  const model = process.env.OLLAMA_EMBED_MODEL || "nomic-embed-text";
  const res = await fetch(`${url}/api/embed`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, input: text }),
  });
  if (!res.ok) {
    throw new Error(`Ollama embed failed: ${res.status} ${res.statusText}`);
  }
  const data = await res.json();
  // /api/embed returns { embeddings: number[][] }; a single input yields one row
  return { embedding: data.embeddings[0], model };
}
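For the two embedBatch call sites, a batch helper avoids one HTTP round-trip per chunk. This is a sketch only — ollamaEmbedBatch() does not exist in QMD today, and it relies on Ollama's /api/embed accepting an array input and returning one embedding per item:

```typescript
// Hypothetical batch helper (not in QMD yet). Ollama's /api/embed accepts
// either a string or an array as "input"; with an array it returns
// { embeddings: number[][] } with one row per input text.
function buildEmbedRequest(model: string, inputs: string[]): string {
  return JSON.stringify({ model, input: inputs });
}

async function ollamaEmbedBatch(texts: string[]): Promise<number[][]> {
  const url = process.env.OLLAMA_EMBED_URL;
  const model = process.env.OLLAMA_EMBED_MODEL || "nomic-embed-text";
  const res = await fetch(`${url}/api/embed`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: buildEmbedRequest(model, texts),
  });
  if (!res.ok) {
    throw new Error(`Ollama embed failed: ${res.status} ${res.statusText}`);
  }
  const data = (await res.json()) as { embeddings: number[][] };
  return data.embeddings;
}
```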

2. Patch points (6 locations)

| Function | File | What to bypass |
| --- | --- | --- |
| getEmbedding() | store.ts | getDefaultLlamaCpp() → ollamaEmbed() |
| generateEmbeddings() | store.ts | withLLMSessionForLlm → direct Ollama HTTP |
| expandQuery() | store.ts | LLM query expansion → pass-through [{type:"vec", query}] |
| chunkDocumentByTokens() | store.ts | llm.tokenize() → char-based estimation (text.length / 3) |
| embedBatch (2 sites) | store.ts | llm.embedBatch() → ollamaEmbedBatch() |
| vectorSearch CLI | cli/qmd.ts | withLLMSession() → direct call |
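The chunkDocumentByTokens() bypass can be sketched as follows. Names and signatures here are illustrative, not QMD's actual internals; the divisor 3 matches the text.length / 3 estimate above and can be tuned per model and language:

```typescript
// Char-based token estimation: replaces llm.tokenize() on the Ollama path,
// so no local tokenizer (and no node-llama-cpp compilation) is needed.
function estimateTokens(text: string, avgCharsPerToken = 3): number {
  return Math.ceil(text.length / avgCharsPerToken);
}

// Split a document into chunks of at most maxTokens estimated tokens.
function chunkByEstimatedTokens(
  text: string,
  maxTokens: number,
  avgCharsPerToken = 3,
): string[] {
  const maxChars = maxTokens * avgCharsPerToken;
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}
```

Chunk boundaries may fall mid-word with this naive slicing; splitting on whitespace near the boundary would be a natural refinement.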

3. Environment variables

export OLLAMA_EMBED_URL=http://your-ollama:11434
export OLLAMA_EMBED_MODEL=qwen3-embedding:0.6b  # optional, defaults to nomic-embed-text
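To sanity-check the remote endpoint before pointing QMD at it (hostname and model are placeholders — substitute your own; this is a connectivity check, not part of the patch):

```shell
# Should return a JSON body with an "embeddings" array of one 1024-dim vector
curl -s "$OLLAMA_EMBED_URL/api/embed" \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen3-embedding:0.6b","input":"hello"}'
```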

Results (tested on ARM64 Oracle VPS)

  • Before: Every vsearch/embed/query triggers CMake compilation (fails on ARM64 without Vulkan)
  • After: Zero compilation, instant results via HTTP to remote Ollama
  • qmd embed --force successfully re-indexes 7,100+ documents
  • qmd vsearch returns results in <2s vs hanging on CMake indefinitely

Notes

  • qmd search (BM25) is unaffected — works perfectly without any of this
  • expandQuery() uses an LLM for HyDE-style query expansion. On the Ollama path we skip this and pass the raw query straight to vector search. A future improvement could call Ollama's /api/generate for query expansion.
  • Char-based chunking (text.length / avgCharsPerToken) is a reasonable approximation that avoids requiring a local tokenizer
  • The OLLAMA_EMBED_MODEL env var allows users to pick any embedding model available on their Ollama instance
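The expandQuery() pass-through mentioned above reduces to a one-liner. The candidate type is an assumption inferred from the [{type:"vec", query}] shape in the patch table, not QMD's actual type:

```typescript
// Assumed shape of a vector-search candidate, per the pass-through in the
// patch table; QMD's real type may differ.
type QueryCandidate = { type: "vec"; query: string };

// Ollama-path expandQuery(): skip LLM expansion entirely and emit the raw
// query as the single vector-search candidate.
function expandQueryPassthrough(query: string): QueryCandidate[] {
  return [{ type: "vec", query }];
}
```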

Happy to submit a PR if there is interest.
