A Claude Code plugin for shareable, linkable LLM knowledge bases. If you have been following Andrej Karpathy's LLM wiki idea or building Obsidian-based PKM setups, hiivmind-corpus is the productised version of that pattern: raw sources compiled into a curated, interlinked index. The difference is that any corpus can be published as a plain git repo and registered by anyone else with a single command. Registered corpora are queryable remotely via gh api, with sparse-cloned embeddings and cross-corpus bridges that link concepts across independently maintained knowledge bases. Every other tool in this space (ObsidianRAG, Neural Composer, obsidian-notes-rag, Karpathy's LLM wiki) is single-user, single-vault, and local-only; this one lets your knowledge base become a public library, not just a personal notebook.
Quick links: Using a Corpus | Building a Corpus | Semantic Search | Published Corpora
Without structured indexing, Claude investigates libraries by relying on training data (outdated), web searching (hit-or-miss), or fetching URLs one at a time (no context). Every session rediscovers the same things.
A corpus solves this. You build a curated index once — collaboratively, around your actual use case — and Claude searches it across sessions. The index tracks where everything came from, how fresh it is, and uses semantic search to find relevant entries even when queries don't match exact keywords.
This follows the "just in time" context pattern from Anthropic: maintain lightweight identifiers, dynamically load content at runtime.
/hiivmind-corpus
One command, natural language:
| You say... | What happens |
|---|---|
| "Create a corpus for Polars" | Scaffolds a new corpus, clones docs |
| "What corpora do I have?" | Discovers all installed corpora |
| "How do lazy frames work?" | Searches your corpora with semantic + keyword matching |
| "Refresh my React corpus" | Checks for upstream changes, updates stale entries |
| "Add the TanStack Query docs" | Extends an existing corpus with new sources |
From the command line:
claude plugin marketplace add hiivmind/hiivmind-corpus
claude plugin add hiivmind-corpus@hiivmind
From within a Claude Code session:
/plugin marketplace add hiivmind/hiivmind-corpus
/plugin install hiivmind-corpus@hiivmind
Or use /plugin to browse and install interactively.
Most users start here — someone else built the corpus, you just want to query it.
/hiivmind-corpus register github:hiivmind/hiivmind-corpus-data/hiivmind-corpus-polars
This adds the corpus to your project's registry (.hiivmind/corpus/registry.yaml). Register as many as you need — they're lightweight references, not copies.
Just ask naturally. Claude routes your question to the right corpus:
"How do I filter rows in Polars?"
"What's the difference between select and with_columns?"
"Show me lazy frame optimization techniques"
Or be explicit: /hiivmind-corpus navigate polars "group by aggregations"
When you ask a question, the navigate skill:
- Routes to the right corpus — matches your query against registered corpora using semantic similarity (if embeddings are cached) or keyword matching
- Finds relevant entries — searches the corpus index using vector search with optional SQL filtering, boosted by concept graph relationships
- Fetches documentation — retrieves the actual content from the source repo via gh api
- Presents the answer — with source citations and related doc suggestions
For remote GitHub corpora, embeddings are automatically cached locally on first query (~5 seconds, then instant on subsequent queries).
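The routing step can be pictured with its keyword-matching fallback. This is a minimal sketch under stated assumptions, not the plugin's implementation: `route_query` and the per-corpus keyword lists are hypothetical, and a real router would prefer embedding similarity whenever embeddings are cached.

```python
import re


def route_query(query: str, corpus_keywords: dict[str, list[str]]) -> list[str]:
    """Rank corpus ids by how many of their keywords appear in the query."""
    words = set(re.findall(r"[a-z0-9_]+", query.lower()))
    scores = {
        corpus_id: sum(1 for kw in keywords if kw.lower() in words)
        for corpus_id, keywords in corpus_keywords.items()
    }
    # Keep only corpora with at least one hit, best match first.
    hits = [cid for cid, score in scores.items() if score > 0]
    return sorted(hits, key=lambda cid: (-scores[cid], cid))


# Hypothetical keyword lists; the real ones would come from each corpus's config.
corpora = {
    "polars": ["polars", "dataframe", "lazy"],
    "flyio": ["fly", "deploy", "machines"],
}
print(route_query("How do lazy frames work in Polars?", corpora))
```

A query that matches no keywords returns an empty list, which is the point where a caller could fall back to asking the user which corpus to search.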
/hiivmind-corpus discover # List all available corpora
/hiivmind-corpus status # Check freshness and health
With 2+ corpora registered, you can create bridges — links between related concepts across corpora:
/hiivmind-corpus bridge # Detect and create cross-corpus links
Navigate then uses bridges and aliases to route queries that span multiple documentation sets. Search for "lazy evaluation" and it finds relevant entries in both Polars and Ibis.
Follow this path if you want to create a new corpus from scratch for a library, framework, or internal project.
/hiivmind-corpus init # Scaffold from a GitHub repo
/hiivmind-corpus add-source # Add git repos, local files, web pages, PDFs
/hiivmind-corpus build # Scan sources, build index collaboratively
The build process is a conversation — Claude scans the docs and you guide what matters: "I care about data modeling and ETL, skip the deployment stuff." That curation persists across sessions.
/hiivmind-corpus enhance # Deepen coverage on specific topics
/hiivmind-corpus refresh # Sync with upstream changes
/hiivmind-corpus graph add-concept # Add zettelkasten concept clusters
Each corpus is a data-only repository — just files, no server, no database:
hiivmind-corpus-{project}/
├── config.yaml # Source definitions + keywords
├── index.yaml # Structured index (entries with summaries, tags, concepts)
├── index.md # Human-readable index (rendered from index.yaml)
├── graph.yaml # Concept graph — zettelkasten relationships (optional)
├── index-embeddings.lance/ # Semantic search embeddings (committed, not gitignored)
├── render-index.sh # Deterministic index.yaml → index.md renderer
├── .source/ # Git clones (gitignored)
├── .cache/ # Web/llms-txt cached content (gitignored)
└── uploads/ # Local document sources
The index is the product. Everything else supports building and maintaining it.
The corpus architecture is a value-add pipeline — each layer builds on the previous:
Layer 1: INDEX (foundation)
  source files → index.yaml (entries with title, summary, tags, keywords, concepts)
               → index.md (rendered for humans)

Layer 2a: GRAPH (zettelkasten)
  Concept definitions + relationships
  References entries by concept ID
  Pure relationship store

Layer 2b: RAG (semantic search)
  Vector embeddings of entry metadata
  Enriched by concept membership
  Entries embed: title|summary|tags|concepts

↕ mutual enrichment between 2a and 2b ↕
  Graph candidate detection uses RAG similarity
  RAG graph-boost uses graph relationships
  Bridge detection queries per-corpus RAG
Layer 1 (index.yaml) is always built. Layers 2a and 2b are optional — they add value for larger corpora and cross-corpus scenarios.
| Skill | Purpose |
|---|---|
| init | Scaffold a new corpus from a GitHub repo URL or local project |
| add-source | Add git repos, local files, web pages, PDFs, Obsidian vaults, llms.txt |
| build | Scan sources, build index.yaml collaboratively with user, generate embeddings |
| enhance | Deepen coverage on specific topics within an existing index |
| refresh | Compare against upstream commits, flag stale entries, re-embed |
| Skill | Purpose |
|---|---|
| navigate | Search across corpora — semantic pre-filter, graph-boost, reranking |
| discover | Find all installed/registered corpora and their status |
| register | Connect a corpus to the current project via registry.yaml |
| status | Check corpus health — freshness, embedding status, upstream changes |
| Skill | Purpose |
|---|---|
| graph | View, validate, and edit concept graphs (graph.yaml) |
| bridge | Create cross-corpus concept bridges and query-routing aliases |
init → add-source → build → refresh/enhance (as needed)
↓
graph (concepts) ←→ embeddings (RAG)
↓
register → navigate (query) → bridge (cross-corpus)
| Type | Storage | Example |
|---|---|---|
| git | .source/{source_id}/ | Library docs, framework APIs |
| local | uploads/{source_id}/ | Team standards, internal docs |
| web | .cache/web/{source_id}/ | Blog posts, articles |
| llms-txt | .cache/llms-txt/{source_id}/ | Sites with llms.txt manifests |
| generated-docs | .source/{source_id}/ + web | Hybrid git+web (e.g., docs built from source) |
| pdf | uploads/{source_id}/ | PDF books, split into chapters |
| obsidian | .source/{source_id}/ | Obsidian vaults with wikilinks and tags |
| self | (current repo) | Embedded corpus — index the repo's own docs |
Corpora can include entry-level semantic embeddings for retrieval that goes beyond keyword matching.
How it works:
- During build, entries are embedded using fastembed with BAAI/bge-small-en-v1.5 (ONNX, no PyTorch, ~120MB)
- Embeddings are stored in Lance format — flat files, no server, committed to the repo
- At query time, navigate searches by cosine similarity with optional SQL predicates for hybrid search (vector + keyword)
- An FTS index on metadata enables full-text keyword matching alongside semantic search
- Optional reranking with CrossEncoder for better precision on ambiguous queries
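The query-time step above (cosine similarity plus an optional filter) can be illustrated without any dependencies. The sketch below is a stand-in, not the plugin's code: the toy three-dimensional vectors, the entry ids, and the `search` helper are all hypothetical, and a Python predicate takes the place of the SQL predicate that LanceDB would evaluate.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, entries, predicate=None, limit=3):
    """Filter entries with the predicate, then rank the rest by similarity."""
    candidates = [e for e in entries if predicate is None or predicate(e)]
    ranked = sorted(
        candidates, key=lambda e: cosine(query_vec, e["vector"]), reverse=True
    )
    return [e["id"] for e in ranked[:limit]]


# Hypothetical entries with toy vectors standing in for real embeddings.
entries = [
    {"id": "lazy-api", "tags": ["lazy"], "vector": [0.9, 0.1, 0.0]},
    {"id": "io-csv", "tags": ["io"], "vector": [0.1, 0.9, 0.0]},
    {"id": "lazy-optimizations", "tags": ["lazy", "performance"], "vector": [0.8, 0.2, 0.1]},
]

print(search([1.0, 0.0, 0.0], entries))
print(search([1.0, 0.0, 0.0], entries, predicate=lambda e: "performance" in e["tags"]))
```

Filtering before ranking keeps the similarity computation limited to entries that already satisfy the keyword constraint, which is the essence of the hybrid search described in the list.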
What gets embedded:
"passage: {title} | {summary} | {tags} | {concepts}"
Concept labels in the embedding text mean searching for "lazy evaluation" finds entries assigned to that concept even if those words don't appear in the title or summary. The zettelkasten structure directly improves RAG recall.
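As a rough illustration of the template above, the embedded text for one entry could be assembled as follows. The `embedding_text` helper is hypothetical, and joining list fields with `", "` is an assumption; the source only specifies the overall `passage: ... | ... | ... | ...` shape.

```python
def embedding_text(entry: dict) -> str:
    """Build the text to embed for one index entry.

    Assumes list-valued fields are joined with ", " (not confirmed by the source).
    """
    tags = ", ".join(entry.get("tags", []))
    concepts = ", ".join(entry.get("concepts", []))
    return f"passage: {entry['title']} | {entry['summary']} | {tags} | {concepts}"


# Hypothetical entry for illustration.
entry = {
    "title": "LazyFrame",
    "summary": "Deferred query API",
    "tags": ["lazy"],
    "concepts": ["Lazy Evaluation"],
}
print(embedding_text(entry))
```

Because the concept label ends up inside the embedded string, a query for "lazy evaluation" can land near this entry even though neither the title nor the summary contains those words.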
Remote corpora: For GitHub-hosted corpora, navigate automatically sparse-clones the Lance directory to a local cache (.hiivmind/corpus/cache/) on first query, with TTL-based freshness tracking.
Opt-in: Embeddings are suggested during build when entry_count > 150 or the corpus has tiered indexes. Below that threshold, the LLM scanning the full index directly is effective enough.
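The opt-in rule reduces to a single condition. In this sketch the function and constant names are hypothetical; only the threshold of 150 entries and the tiered-index condition come from the text.

```python
EMBEDDING_SUGGESTION_THRESHOLD = 150  # entry count stated in the text above


def should_suggest_embeddings(entry_count: int, has_tiered_indexes: bool) -> bool:
    """Suggest embeddings for large corpora or corpora with tiered indexes."""
    return entry_count > EMBEDDING_SUGGESTION_THRESHOLD or has_tiered_indexes
```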
Dependencies: pip install fastembed lancedb pyyaml (~260MB). If not installed, navigate falls back to keyword/yq pre-filtering — embeddings are an enhancement, not a requirement.
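One way to implement this kind of graceful fallback is to probe for the optional packages before choosing a search mode. The sketch below is an assumption about how such a check could look, not the plugin's actual logic; `pick_search_mode` is a hypothetical name, and `importlib.util.find_spec` is the standard-library call used for the probe.

```python
import importlib.util


def pick_search_mode(required=("fastembed", "lancedb"), find_spec=importlib.util.find_spec):
    """Return ("semantic", []) if all packages are importable, else ("keyword", missing)."""
    missing = [name for name in required if find_spec(name) is None]
    return ("semantic", []) if not missing else ("keyword", missing)


mode, missing = pick_search_mode()
print(mode, missing)
```

Passing `find_spec` as a parameter keeps the decision testable without installing or uninstalling anything.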
Each corpus can have a concept graph (graph.yaml) — a zettelkasten-style knowledge structure:
schema_version: 2
concepts:
  lazy-evaluation:
    label: "Lazy Evaluation"
    description: "Deferred query execution for optimization"
    tags: [performance, lazy]
  query-optimization:
    label: "Query Optimization"
    description: "Techniques for faster query execution"
    tags: [performance, indexing]
relationships:
  - from: lazy-evaluation
    to: query-optimization
    type: depends-on
    origin: manual

Concept membership is bidirectional: entries in index.yaml declare their concepts via a concepts[] field, and graph.yaml defines concept definitions and relationships. The graph skill lets you add concepts, add relationships (with embedding-powered candidate detection), and validate the graph.
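Because membership lives in two files, a validation pass has to check both directions of reference. The following is a minimal sketch of such a check on already-parsed data. The function names and the `id` field on index entries are assumptions; the plugin's own validation rules are not shown in this document.

```python
def undefined_concepts(index_entries: list[dict], graph: dict) -> dict[str, list[str]]:
    """Map entry id -> concepts the entry declares that graph.yaml does not define."""
    defined = set(graph.get("concepts", {}))
    problems = {}
    for entry in index_entries:
        unknown = [c for c in entry.get("concepts", []) if c not in defined]
        if unknown:
            problems[entry["id"]] = unknown
    return problems


def dangling_relationships(graph: dict) -> list[dict]:
    """Return relationships whose endpoints are not defined concepts."""
    defined = set(graph.get("concepts", {}))
    return [
        rel for rel in graph.get("relationships", [])
        if rel["from"] not in defined or rel["to"] not in defined
    ]


graph = {
    "concepts": {"lazy-evaluation": {}, "query-optimization": {}},
    "relationships": [
        {"from": "lazy-evaluation", "to": "query-optimization", "type": "depends-on"},
    ],
}
# Hypothetical index entries; "streaming" is deliberately missing from the graph.
entries = [
    {"id": "lazyframe-guide", "concepts": ["lazy-evaluation"]},
    {"id": "streaming-guide", "concepts": ["streaming"]},
]
print(undefined_concepts(entries, graph))
print(dangling_relationships(graph))
```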
Projects with 2+ registered corpora can create bridges — links between concepts in different corpora:
# .hiivmind/corpus/registry-graph.yaml
bridges:
  - concept_a: "polars:lazy-evaluation"
    concept_b: "ibis:deferred-execution"
    type: see-also
    note: "Both implement deferred query execution"
aliases:
  "lazy evaluation":
    - corpus: polars
      concept: lazy-evaluation
    - corpus: ibis
      concept: deferred-execution

Bridge candidate detection queries each corpus's embeddings to find semantically similar concepts across corpora — even when they use different terminology.
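To make the roles of aliases and bridges concrete, here is a sketch of how a query could be expanded into corpus and concept targets. It operates on the parsed registry-graph structure shown above; `resolve_targets`, the substring alias match, and the one-hop bridge expansion are illustrative assumptions rather than the navigate skill's actual algorithm.

```python
def resolve_targets(query: str, registry_graph: dict) -> list[tuple[str, str]]:
    """Return (corpus, concept) pairs reached via aliases, then one hop of bridges."""
    q = query.lower()
    targets: list[tuple[str, str]] = []
    for alias, refs in registry_graph.get("aliases", {}).items():
        if alias.lower() in q:
            for ref in refs:
                pair = (ref["corpus"], ref["concept"])
                if pair not in targets:
                    targets.append(pair)
    # Follow each bridge once so a hit in one corpus also surfaces its counterpart.
    for bridge in registry_graph.get("bridges", []):
        a = tuple(bridge["concept_a"].split(":", 1))
        b = tuple(bridge["concept_b"].split(":", 1))
        if a in targets and b not in targets:
            targets.append(b)
        elif b in targets and a not in targets:
            targets.append(a)
    return targets


registry_graph = {
    "bridges": [
        {"concept_a": "polars:lazy-evaluation", "concept_b": "ibis:deferred-execution", "type": "see-also"},
    ],
    "aliases": {
        "lazy evaluation": [{"corpus": "polars", "concept": "lazy-evaluation"}],
    },
}
print(resolve_targets("How does lazy evaluation work?", registry_graph))
```

In this example the alias only names the Polars concept, and the bridge is what pulls in the Ibis counterpart.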
Projects register which corpora they use:
# .hiivmind/corpus/registry.yaml
corpora:
  - id: polars
    source:
      type: github
      repo: hiivmind/hiivmind-corpus-data
      path: hiivmind-corpus-polars
      ref: main
  - id: flyio
    source:
      type: github
      repo: hiivmind/hiivmind-corpus-flyio
      ref: main

Navigate uses the registry to search across all registered corpora and route queries to the right one.
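The two registry entries above line up with the `github:` specs passed to the register command elsewhere in this document: the first two path segments form the repo, and anything after them becomes the path. A sketch of that mapping follows. `parse_corpus_spec` is a hypothetical helper, the default ref of `main` is an assumption, and how the register skill derives the corpus `id` is not covered here.

```python
def parse_corpus_spec(spec: str, ref: str = "main") -> dict:
    """Turn 'github:owner/repo[/sub/path]' into a registry 'source' mapping."""
    scheme, _, rest = spec.partition(":")
    parts = rest.split("/")
    if scheme != "github" or len(parts) < 2 or not all(parts):
        raise ValueError(f"unsupported corpus spec: {spec!r}")
    source = {"type": "github", "repo": "/".join(parts[:2]), "ref": ref}
    if len(parts) > 2:
        # Corpus lives in a subdirectory of a shared data repo.
        source["path"] = "/".join(parts[2:])
    return source


print(parse_corpus_spec("github:hiivmind/hiivmind-corpus-data/hiivmind-corpus-polars"))
print(parse_corpus_spec("github:hiivmind/hiivmind-corpus-flyio"))
```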
| Tool | Required | Purpose |
|---|---|---|
| git | Yes | Clone source repos, track commits |
| gh (GitHub CLI) | Recommended | Fetch content from GitHub repos (preferred over raw URLs) |
| yq 4.0+ | Recommended | Parse YAML config files (grep fallback available) |
| fastembed + lancedb | Optional | Semantic search embeddings (~260MB pip install) |
| pyyaml | Optional | YAML parsing in embedding scripts |
- Human-curated indexes — You decide what matters, not an algorithm
- Collaborative building — The build process is a conversation, not a batch job
- Layered value-add — Index first, then optionally concepts and embeddings
- Graceful degradation — Works without embeddings, without graphs, without yq
- Portable — Corpora are just files. Commit, diff, review, share with your team
- Known freshness — Commit SHA tracking tells you exactly how old your sources are
- Works without local clone — Falls back to gh api for remote content fetching
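For the last point, a remote fetch could be expressed as a call to GitHub's repository contents endpoint through the GitHub CLI. The sketch below only builds the argument list and does not run it. The exact invocation the plugin uses is not shown in this document, so `gh_api_args` and the choice of the raw media type are assumptions.

```python
from urllib.parse import quote


def gh_api_args(repo: str, path: str, ref: str = "main") -> list[str]:
    """Build a `gh api` command that requests a file's raw contents at a ref."""
    endpoint = f"repos/{repo}/contents/{quote(path)}?ref={quote(ref, safe='')}"
    return ["gh", "api", "-H", "Accept: application/vnd.github.raw+json", endpoint]


args = gh_api_args("hiivmind/hiivmind-corpus-data", "hiivmind-corpus-polars/index.yaml")
print(" ".join(args))
```

The resulting list can be handed to `subprocess.run(args, capture_output=True, text=True)` on a machine where `gh` is installed and authenticated.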
Already use Obsidian? Register the Obsidian help corpus to get started: /hiivmind-corpus register github:hiivmind/hiivmind-corpus-obsidian
| Corpus | Source |
|---|---|
| hiivmind-corpus-obsidian | Obsidian help docs |
| hiivmind-corpus-polars | Polars documentation |
| hiivmind-corpus-ibis | Ibis documentation |
| hiivmind-corpus-narwhals | Narwhals documentation |
| hiivmind-corpus-substrait | Substrait specification |
| hiivmind-corpus-flyio | Fly.io platform docs |
| hiivmind-corpus-lancedb | LanceDB documentation |
| hiivmind-corpus-claude-agent-sdk | Claude Agent SDK |
Register any of these with:
/hiivmind-corpus register github:hiivmind/hiivmind-corpus-flyio
MIT