hiivmind-corpus

A Claude Code plugin for shareable, linkable LLM knowledge bases. If you have been following Andrej Karpathy's LLM wiki idea or building Obsidian-based PKM setups, hiivmind-corpus is the productised version of that pattern: raw sources compiled into a curated, interlinked index. The difference is that any corpus can be published as a plain git repo and registered by anyone else with a single command. A registered corpus is queryable remotely via gh api, with sparse-cloned embeddings and cross-corpus bridges that link concepts across independently maintained knowledge bases. Other tools in this space (ObsidianRAG, Neural Composer, obsidian-notes-rag, Karpathy's LLM wiki) are single-user, single-vault, and local-only; this one lets your knowledge base become a public library, not just a personal notebook.

Quick links: Using a Corpus | Building a Corpus | Semantic Search | Published Corpora

The Idea

Without structured indexing, Claude investigates libraries by relying on training data (outdated), web searching (hit-or-miss), or fetching URLs one at a time (no context). Every session rediscovers the same things.

A corpus solves this. You build a curated index once — collaboratively, around your actual use case — and Claude searches it across sessions. The index tracks where everything came from, how fresh it is, and uses semantic search to find relevant entries even when queries don't match exact keywords.

This follows the "just in time" context pattern from Anthropic: maintain lightweight identifiers, dynamically load content at runtime.

Getting Started

/hiivmind-corpus

One command, natural language:

| You say... | What happens |
| --- | --- |
| "Create a corpus for Polars" | Scaffolds a new corpus, clones docs |
| "What corpora do I have?" | Discovers all installed corpora |
| "How do lazy frames work?" | Searches your corpora with semantic + keyword matching |
| "Refresh my React corpus" | Checks for upstream changes, updates stale entries |
| "Add the TanStack Query docs" | Extends an existing corpus with new sources |

Installation

From the command line:

claude plugin marketplace add hiivmind/hiivmind-corpus
claude plugin add hiivmind-corpus@hiivmind

From within a Claude Code session:

/plugin marketplace add hiivmind/hiivmind-corpus
/plugin install hiivmind-corpus@hiivmind

Or use /plugin to browse and install interactively.

Using a Corpus

Most users start here: someone else built the corpus, and you just want to query it.

1. Register a corpus with your project

/hiivmind-corpus register github:hiivmind/hiivmind-corpus-data/hiivmind-corpus-polars

This adds the corpus to your project's registry (.hiivmind/corpus/registry.yaml). Register as many as you need — they're lightweight references, not copies.
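As a rough sketch of how a `github:` reference could map onto a registry entry, the hypothetical helper below splits the reference into the `repo` and optional `path` fields that appear in registry.yaml. The function name, the default `ref` of `main`, and the way the `id` is derived are assumptions made for illustration, not the plugin's actual implementation.

```python
def parse_corpus_ref(spec: str, ref: str = "main") -> dict:
    """Split 'github:owner/repo[/path]' into a registry-style entry (hypothetical helper)."""
    scheme, sep, rest = spec.partition(":")
    if scheme != "github" or not sep:
        raise ValueError(f"unsupported corpus reference: {spec!r}")

    parts = [p for p in rest.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected owner/repo in {spec!r}")

    owner, repo, *subpath = parts
    source = {"type": "github", "repo": f"{owner}/{repo}", "ref": ref}
    if subpath:
        # Corpora that live in a subdirectory of a shared data repo carry a path.
        source["path"] = "/".join(subpath)

    # Assumption: the short id is the corpus name minus the 'hiivmind-corpus-' prefix.
    name = subpath[-1] if subpath else repo
    return {"id": name.removeprefix("hiivmind-corpus-"), "source": source}


entry = parse_corpus_ref("github:hiivmind/hiivmind-corpus-data/hiivmind-corpus-polars")
print(entry["id"], entry["source"]["repo"], entry["source"]["path"])
```

Because the result is only a reference (repo, path, ref), registering stays cheap regardless of how large the corpus is.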

2. Ask questions

Just ask naturally. Claude routes your question to the right corpus:

"How do I filter rows in Polars?"
"What's the difference between select and with_columns?"
"Show me lazy frame optimization techniques"

Or be explicit: /hiivmind-corpus navigate polars "group by aggregations"

How navigation works

When you ask a question, the navigate skill:

1. Routes to the right corpus: matches your query against registered corpora using semantic similarity (if embeddings are cached) or keyword matching
2. Finds relevant entries: searches the corpus index using vector search with optional SQL filtering, boosted by concept graph relationships
3. Fetches documentation: retrieves the actual content from the source repo via gh api
4. Presents the answer: with source citations and related doc suggestions

For remote GitHub corpora, embeddings are automatically cached locally on first query (~5 seconds, then instant on subsequent queries).
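The keyword fallback in the routing step can be pictured with a small sketch. It assumes each registered corpus exposes a list of keywords (config.yaml is described as holding keywords); the scoring rule and the function name are hypothetical and only illustrate the idea of routing without embeddings.

```python
import re


def route_by_keywords(query: str, corpus_keywords: dict[str, list[str]]) -> list[str]:
    """Rank corpus ids by how many of their keywords appear as words in the query."""
    tokens = set(re.findall(r"[a-z0-9_]+", query.lower()))
    scores = {
        corpus_id: sum(1 for kw in keywords if kw.lower() in tokens)
        for corpus_id, keywords in corpus_keywords.items()
    }
    # Highest score first; ties broken alphabetically so the result is deterministic.
    ranked = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return [cid for cid in ranked if scores[cid] > 0]


# Illustrative keyword lists, not taken from any published corpus.
corpora = {
    "polars": ["polars", "dataframe", "lazy"],
    "flyio": ["fly", "deploy", "machines"],
}
print(route_by_keywords("How do I filter rows in Polars?", corpora))
```

An empty result would mean no corpus matched, at which point a real router would need some other strategy, such as asking the user.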

Check what you have

/hiivmind-corpus discover        # List all available corpora
/hiivmind-corpus status           # Check freshness and health

Cross-corpus queries

With 2+ corpora registered, you can create bridges — links between related concepts across corpora:

/hiivmind-corpus bridge           # Detect and create cross-corpus links

Navigate then uses bridges and aliases to route queries that span multiple documentation sets. Search for "lazy evaluation" and it finds relevant entries in both Polars and Ibis.

Building a Corpus

Start here if you want to create a new corpus from scratch, whether for a library, a framework, or an internal project.

Quick start

/hiivmind-corpus init             # Scaffold from a GitHub repo
/hiivmind-corpus add-source       # Add git repos, local files, web pages, PDFs
/hiivmind-corpus build            # Scan sources, build index collaboratively

The build process is a conversation — Claude scans the docs and you guide what matters: "I care about data modeling and ETL, skip the deployment stuff." That curation persists across sessions.

After building

/hiivmind-corpus enhance          # Deepen coverage on specific topics
/hiivmind-corpus refresh          # Sync with upstream changes
/hiivmind-corpus graph add-concept  # Add zettelkasten concept clusters

What a Corpus Looks Like

Each corpus is a data-only repository — just files, no server, no database:

hiivmind-corpus-{project}/
├── config.yaml                # Source definitions + keywords
├── index.yaml                 # Structured index (entries with summaries, tags, concepts)
├── index.md                   # Human-readable index (rendered from index.yaml)
├── graph.yaml                 # Concept graph — zettelkasten relationships (optional)
├── index-embeddings.lance/    # Semantic search embeddings (committed, not gitignored)
├── render-index.sh            # Deterministic index.yaml → index.md renderer
├── .source/                   # Git clones (gitignored)
├── .cache/                    # Web/llms-txt cached content (gitignored)
└── uploads/                   # Local document sources

The index is the product. Everything else supports building and maintaining it.

Three Layers

The corpus architecture is a value-add pipeline — each layer builds on the previous:

Layer 1: INDEX (foundation)
  source files → index.yaml (entries with title, summary, tags, keywords, concepts)
  → index.md (rendered for humans)

Layer 2a: GRAPH (zettelkasten)          Layer 2b: RAG (semantic search)
  Concept definitions + relationships    Vector embeddings of entry metadata
  References entries by concept ID       Enriched by concept membership
  Pure relationship store                Entries embed: title|summary|tags|concepts

         ↕ mutual enrichment ↕
  Graph candidate detection uses RAG similarity
  RAG graph-boost uses graph relationships
  Bridge detection queries per-corpus RAG

Layer 1 (index.yaml) is always built. Layers 2a and 2b are optional — they add value for larger corpora and cross-corpus scenarios.
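To make the "RAG graph-boost uses graph relationships" line concrete, here is a minimal sketch of one way a boost could work: entries whose concepts match the query's concepts, or sit one relationship hop away, get a small bonus on top of their similarity score. The additive formula, the `boost` value, and all names are assumptions; the plugin's actual scoring is not described in this README.

```python
def graph_boost(scores, entry_concepts, related, query_concepts, boost=0.1):
    """Re-rank entries, adding `boost` per concept shared with the expanded query concepts."""
    # Expand the query concepts by one hop through the relationship map.
    expanded = set(query_concepts)
    for concept in query_concepts:
        expanded |= related.get(concept, set())

    boosted = {}
    for entry_id, score in scores.items():
        overlap = expanded & set(entry_concepts.get(entry_id, []))
        boosted[entry_id] = score + boost * len(overlap)

    # Highest boosted score first; ties broken by entry id.
    return sorted(boosted, key=lambda e: (-boosted[e], e))


# Hypothetical similarity scores and concept assignments.
scores = {"a": 0.50, "b": 0.55}
entry_concepts = {"a": ["query-optimization"], "b": []}
related = {"lazy-evaluation": {"query-optimization"}}
print(graph_boost(scores, entry_concepts, related, ["lazy-evaluation"]))
```

In this toy data, entry "a" overtakes "b" only because its concept is related to the query's concept, which is the kind of enrichment the diagram describes.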

Skills

Build & Maintain

| Skill | Purpose |
| --- | --- |
| init | Scaffold a new corpus from a GitHub repo URL or local project |
| add-source | Add git repos, local files, web pages, PDFs, Obsidian vaults, llms.txt |
| build | Scan sources, build index.yaml collaboratively with user, generate embeddings |
| enhance | Deepen coverage on specific topics within an existing index |
| refresh | Compare against upstream commits, flag stale entries, re-embed |

Query & Discover

| Skill | Purpose |
| --- | --- |
| navigate | Search across corpora: semantic pre-filter, graph-boost, reranking |
| discover | Find all installed/registered corpora and their status |
| register | Connect a corpus to the current project via registry.yaml |
| status | Check corpus health: freshness, embedding status, upstream changes |

Concepts & Relationships

| Skill | Purpose |
| --- | --- |
| graph | View, validate, and edit concept graphs (graph.yaml) |
| bridge | Create cross-corpus concept bridges and query-routing aliases |

Lifecycle

init → add-source → build → refresh/enhance (as needed)
                       ↓
              graph (concepts) ←→ embeddings (RAG)
                       ↓
         register → navigate (query) → bridge (cross-corpus)

Source Types

| Type | Storage | Example |
| --- | --- | --- |
| git | `.source/{source_id}/` | Library docs, framework APIs |
| local | `uploads/{source_id}/` | Team standards, internal docs |
| web | `.cache/web/{source_id}/` | Blog posts, articles |
| llms-txt | `.cache/llms-txt/{source_id}/` | Sites with llms.txt manifests |
| generated-docs | `.source/{source_id}/` + web | Hybrid git+web (e.g., docs built from source) |
| pdf | `uploads/{source_id}/` | PDF books, split into chapters |
| obsidian | `.source/{source_id}/` | Obsidian vaults with wikilinks and tags |
| self | (current repo) | Embedded corpus: index the repo's own docs |
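The storage conventions listed for each source type can be written down as a lookup, which is handy when scripting around a corpus checkout. The helper name is hypothetical, and for `generated-docs` only the git half of the hybrid is modelled here.

```python
# Storage templates copied from the source-type list; 'self' has no separate
# storage because it indexes the current repo in place.
STORAGE_TEMPLATES = {
    "git": ".source/{source_id}/",
    "local": "uploads/{source_id}/",
    "web": ".cache/web/{source_id}/",
    "llms-txt": ".cache/llms-txt/{source_id}/",
    "generated-docs": ".source/{source_id}/",  # git half only; the web half is not modelled
    "pdf": "uploads/{source_id}/",
    "obsidian": ".source/{source_id}/",
}


def storage_path(source_type: str, source_id: str):
    """Return the corpus-relative storage directory, or None for 'self' sources."""
    if source_type == "self":
        return None
    if source_type not in STORAGE_TEMPLATES:
        raise ValueError(f"unknown source type: {source_type!r}")
    return STORAGE_TEMPLATES[source_type].format(source_id=source_id)


print(storage_path("web", "blog"))
```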

Semantic Search (RAG)

Corpora can include entry-level semantic embeddings for retrieval that goes beyond keyword matching.

How it works:

During build, entries are embedded using fastembed with BAAI/bge-small-en-v1.5 (ONNX, no PyTorch, ~120MB)
Embeddings are stored in Lance format — flat files, no server, committed to the repo
At query time, navigate searches by cosine similarity with optional SQL predicates for hybrid search (vector + keyword)
An FTS index on metadata enables full-text keyword matching alongside semantic search
Optional reranking with CrossEncoder for better precision on ambiguous queries
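The query-time behaviour described above (filter first, then rank by cosine similarity) can be sketched in plain Python. This is a stand-in for illustration only: it does not use Lance or LanceDB, the predicate is an ordinary Python callable rather than a SQL expression, and the two-dimensional vectors are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query_vec, rows, predicate=None, k=3):
    """Apply an optional metadata filter, then return the top-k row ids by similarity."""
    candidates = [r for r in rows if predicate is None or predicate(r)]
    ranked = sorted(candidates, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return [r["id"] for r in ranked[:k]]


# Toy index rows; real embeddings would have hundreds of dimensions.
rows = [
    {"id": "lazy-api", "tags": ["lazy"], "vector": [0.9, 0.1]},
    {"id": "io-csv", "tags": ["io"], "vector": [0.1, 0.9]},
    {"id": "lazy-opt", "tags": ["lazy", "performance"], "vector": [0.7, 0.3]},
]
print(hybrid_search([1.0, 0.0], rows))
print(hybrid_search([1.0, 0.0], rows, predicate=lambda r: "performance" in r["tags"]))
```

Filtering before ranking keeps the similarity computation limited to rows that already satisfy the keyword or metadata constraint.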

What gets embedded:

"passage: {title} | {summary} | {tags} | {concepts}"

Concept labels in the embedding text mean searching for "lazy evaluation" finds entries assigned to that concept even if those words don't appear in the title or summary. The zettelkasten structure directly improves RAG recall.

Remote corpora: For GitHub-hosted corpora, navigate automatically sparse-clones the Lance directory to a local cache (.hiivmind/corpus/cache/) on first query, with TTL-based freshness tracking.
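TTL-based freshness comes down to comparing a cache entry's age against a time-to-live. The sketch below shows that check in isolation; the function name, the use of epoch seconds, and the one-hour TTL in the example are assumptions, not values taken from the plugin.

```python
import time
from typing import Optional


def cache_is_fresh(fetched_at: float, ttl_seconds: float, now: Optional[float] = None) -> bool:
    """Return True while the cached copy is younger than its TTL (times in epoch seconds)."""
    current = time.time() if now is None else now
    return (current - fetched_at) < ttl_seconds


# Hypothetical one-hour TTL: a copy fetched 1000 seconds ago is still fresh.
print(cache_is_fresh(fetched_at=1000.0, ttl_seconds=3600, now=2000.0))
```

When the check fails, the cached Lance directory would be fetched again before the next query runs.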

Opt-in: Embeddings are suggested during build when entry_count > 150 or the corpus has tiered indexes. Below that threshold, having the LLM scan the full index directly is effective enough.
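The suggestion rule reads directly as a predicate. The names below are hypothetical; only the threshold of 150 and the tiered-index condition come from the text above.

```python
EMBEDDING_SUGGESTION_THRESHOLD = 150  # from the rule: entry_count > 150


def should_suggest_embeddings(entry_count: int, has_tiered_indexes: bool = False) -> bool:
    """Suggest embeddings for large corpora or any corpus with tiered indexes."""
    return entry_count > EMBEDDING_SUGGESTION_THRESHOLD or has_tiered_indexes


print(should_suggest_embeddings(151), should_suggest_embeddings(150))
```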

Dependencies: pip install fastembed lancedb pyyaml (~260MB). If not installed, navigate falls back to keyword/yq pre-filtering — embeddings are an enhancement, not a requirement.

Concept Graphs

Each corpus can have a concept graph (graph.yaml) — a zettelkasten-style knowledge structure:

schema_version: 2
concepts:
  lazy-evaluation:
    label: "Lazy Evaluation"
    description: "Deferred query execution for optimization"
    tags: [performance, lazy]
  query-optimization:
    label: "Query Optimization"
    description: "Techniques for faster query execution"
    tags: [performance, indexing]
relationships:
  - from: lazy-evaluation
    to: query-optimization
    type: depends-on
    origin: manual

Concept membership is bidirectional: entries in index.yaml declare their concepts via a concepts[] field, and graph.yaml holds the concept definitions and the relationships between them. The graph skill lets you add concepts, add relationships (with embedding-powered candidate detection), and validate the graph.
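One check that graph validation plausibly includes is referential integrity: every relationship endpoint and every concept named by an index entry should be defined in graph.yaml. The sketch below works on already-parsed data (plain dicts); the function name, the entry `id` field, and the message wording are assumptions rather than the graph skill's real behaviour.

```python
def validate_graph(graph: dict, entries: list[dict]) -> list[str]:
    """Return a list of problems: references to concepts that graph.yaml does not define."""
    defined = set(graph.get("concepts", {}))
    problems = []

    # Every relationship endpoint must be a defined concept.
    for rel in graph.get("relationships", []):
        for end in ("from", "to"):
            if rel.get(end) not in defined:
                problems.append(f"relationship references unknown concept: {rel.get(end)}")

    # Every concept an entry declares must be defined too.
    for entry in entries:
        for concept in entry.get("concepts", []):
            if concept not in defined:
                problems.append(f"entry {entry['id']} uses unknown concept: {concept}")
    return problems


graph = {
    "concepts": {"lazy-evaluation": {}, "query-optimization": {}},
    "relationships": [{"from": "lazy-evaluation", "to": "query-optimization", "type": "depends-on"}],
}
entries = [{"id": "e1", "concepts": ["lazy-evaluation"]}, {"id": "e2", "concepts": ["streaming"]}]
print(validate_graph(graph, entries))
```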

Cross-Corpus Bridges

Projects with 2+ registered corpora can create bridges — links between concepts in different corpora:

# .hiivmind/corpus/registry-graph.yaml
bridges:
  - concept_a: "polars:lazy-evaluation"
    concept_b: "ibis:deferred-execution"
    type: see-also
    note: "Both implement deferred query execution"
aliases:
  "lazy evaluation":
    - corpus: polars
      concept: lazy-evaluation
    - corpus: ibis
      concept: deferred-execution

Bridge candidate detection queries each corpus's embeddings to find semantically similar concepts across corpora — even when they use different terminology.
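How aliases and bridges might combine at query time can be sketched as follows: an alias phrase found in the query yields `corpus:concept` targets, and any bridge touching one of those targets pulls in the concept on its other side. The substring matching and the function name are assumptions for illustration; the data shapes follow the registry-graph.yaml example.

```python
def resolve_targets(query: str, aliases: dict, bridges: list[dict]) -> list[str]:
    """Map a query to 'corpus:concept' targets via aliases, then follow bridges one hop."""
    targets = []
    lowered = query.lower()
    for phrase, refs in aliases.items():
        if phrase in lowered:  # assumption: plain substring match on the alias phrase
            for ref in refs:
                target = f"{ref['corpus']}:{ref['concept']}"
                if target not in targets:
                    targets.append(target)

    for bridge in bridges:
        a, b = bridge["concept_a"], bridge["concept_b"]
        if a in targets and b not in targets:
            targets.append(b)
        elif b in targets and a not in targets:
            targets.append(a)
    return targets


bridges = [{"concept_a": "polars:lazy-evaluation", "concept_b": "ibis:deferred-execution", "type": "see-also"}]
aliases = {"lazyframe": [{"corpus": "polars", "concept": "lazy-evaluation"}]}
print(resolve_targets("LazyFrame tips", aliases, bridges))
```

Here the alias only names the Polars concept, and the bridge is what brings the Ibis concept into scope.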

Per-Project Registry

Projects register which corpora they use:

# .hiivmind/corpus/registry.yaml
corpora:
  - id: polars
    source:
      type: github
      repo: hiivmind/hiivmind-corpus-data
      path: hiivmind-corpus-polars
      ref: main
  - id: flyio
    source:
      type: github
      repo: hiivmind/hiivmind-corpus-flyio
      ref: main

Navigate uses the registry to search across all registered corpora and route queries to the right one.
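Since content is fetched via gh api, a registry entry has to be turned into an API endpoint at some point. The sketch below builds a path for GitHub's REST "repository contents" endpoint (`repos/{owner}/{repo}/contents/{path}` with a `ref` query parameter) from a registry `source` block. Whether the plugin uses this exact endpoint is an assumption, and the helper names are hypothetical.

```python
def contents_endpoint(source: dict, file_path: str) -> str:
    """Build a GitHub REST contents endpoint for a file inside a registered corpus."""
    # Prefix the file with the corpus subdirectory when the registry entry has one.
    parts = [source.get("path", ""), file_path]
    full_path = "/".join(p.strip("/") for p in parts if p)
    ref = source.get("ref", "main")
    return f"repos/{source['repo']}/contents/{full_path}?ref={ref}"


def gh_api_command(source: dict, file_path: str) -> list[str]:
    """Argument list suitable for subprocess.run; nothing is executed here."""
    return ["gh", "api", contents_endpoint(source, file_path)]


polars = {"type": "github", "repo": "hiivmind/hiivmind-corpus-data", "path": "hiivmind-corpus-polars", "ref": "main"}
print(" ".join(gh_api_command(polars, "index.yaml")))
```

Building the argument list separately from running it keeps the sketch free of network access and makes the endpoint easy to inspect.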

Dependencies

| Tool | Required | Purpose |
| --- | --- | --- |
| git | Yes | Clone source repos, track commits |
| gh (GitHub CLI) | Recommended | Fetch content from GitHub repos (preferred over raw URLs) |
| yq 4.0+ | Recommended | Parse YAML config files (grep fallback available) |
| fastembed + lancedb | Optional | Semantic search embeddings (~260MB pip install) |
| pyyaml | Optional | YAML parsing in embedding scripts |

Design Principles

Human-curated indexes — You decide what matters, not an algorithm
Collaborative building — The build process is a conversation, not a batch job
Layered value-add — Index first, then optionally concepts and embeddings
Graceful degradation — Works without embeddings, without graphs, without yq
Portable — Corpora are just files. Commit, diff, review, share with your team
Known freshness — Commit SHA tracking tells you exactly how old your sources are
Works without local clone — Falls back to gh api for remote content fetching

Published Corpora

Already use Obsidian? Register the Obsidian help corpus to get started: /hiivmind-corpus register github:hiivmind/hiivmind-corpus-obsidian

| Corpus | Source |
| --- | --- |
| hiivmind-corpus-obsidian | Obsidian help docs |
| hiivmind-corpus-polars | Polars documentation |
| hiivmind-corpus-ibis | Ibis documentation |
| hiivmind-corpus-narwhals | Narwhals documentation |
| hiivmind-corpus-substrait | Substrait specification |
| hiivmind-corpus-flyio | Fly.io platform docs |
| hiivmind-corpus-lancedb | LanceDB documentation |
| hiivmind-corpus-claude-agent-sdk | Claude Agent SDK |

Register any of these with:

/hiivmind-corpus register github:hiivmind/hiivmind-corpus-flyio

License

MIT
