The shared foundation that all skills, recipes, and integrations build on.
INPUT (markdown files, git repo)
↓
FILE RESOLUTION (local → .redirect → .supabase → error)
↓
MARKDOWN PARSER (gray-matter frontmatter + body)
    → compiled_truth + timeline separation
↓
CONTENT HASH (SHA-256 idempotency check — skip if unchanged)
↓
CHUNKING (3 strategies, configurable)
├── Recursive: 300-word chunks, 50-word overlap, 5-level delimiter hierarchy
├── Semantic: embed sentences, cosine similarity, Savitzky-Golay smoothing
└── LLM-guided: Claude Haiku identifies topic shifts in 128-word candidates
↓
EMBEDDING (OpenAI text-embedding-3-large, 1536 dimensions)
    → batch 100, exponential backoff, non-fatal if fails
↓
DATABASE TRANSACTION (atomic: page + chunks + tags + version)
↓
SEARCH (hybrid, available immediately)
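
The sizing of the recursive strategy can be illustrated with a minimal sketch. It assumes plain whitespace tokenization and leaves out the 5-level delimiter hierarchy, so it only shows how 300-word windows with a 50-word overlap advance through a document. `chunkByWords` is a hypothetical name used for illustration, not a function from `src/core/chunkers/recursive.ts`.

```typescript
// Hypothetical helper: fixed-size word windows with overlap.
// The real recursive chunker also splits on a delimiter hierarchy,
// which this sketch deliberately omits.
function chunkByWords(text: string, chunkSize = 300, overlap = 50): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  // Each window starts (chunkSize - overlap) words after the previous one,
  // so consecutive chunks share `overlap` words.
  const step = chunkSize - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= words.length) break;
  }
  return chunks;
}
```

With the defaults, a 700-word document yields three chunks that start at words 0, 250, and 500; the last 50 words of each chunk repeat as the first 50 words of the next.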
GBrain uses Reciprocal Rank Fusion (RRF) to merge vector and keyword search:
User Query
↓
EXPANSION (optional: Claude Haiku generates 2 alternative phrasings)
↓
├── VECTOR SEARCH (pgvector HNSW, cosine distance)
│     → 2x limit results per query variant
│
└── KEYWORD SEARCH (PostgreSQL tsvector, ts_rank)
      → 2x limit results
↓
RRF MERGE (score = Σ(1/(60 + rank)), balances both fairly)
↓
4-LAYER DEDUP
├── Best 3 chunks per page (source dedup)
├── Jaccard similarity > 0.85 (text dedup)
├── No type exceeds 60% (diversity)
└── Max 2 chunks per page (page cap)
↓
TOP N RESULTS (default 20)
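
The RRF merge and the text-dedup layer in the flow above can be sketched as follows. This is an illustration under stated assumptions, not the code in `src/core/search/hybrid.ts` or `src/core/search/dedup.ts`: ranks are taken to start at 1, each input list is assumed to contain a chunk at most once, Jaccard similarity is computed over lowercase whitespace-separated words, and the names `RankedHit`, `rrfMerge`, `jaccard`, and `dropNearDuplicates` are hypothetical.

```typescript
// Hypothetical hit shape; the real engine's result type may differ.
interface RankedHit {
  chunkId: string;
}

// Reciprocal Rank Fusion: every list contributes 1 / (k + rank) per hit.
// k = 60 matches the constant in the diagram; rank is assumed 1-based.
function rrfMerge(lists: RankedHit[][], k = 60): { chunkId: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((hit, index) => {
      const rank = index + 1;
      scores.set(hit.chunkId, (scores.get(hit.chunkId) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .map(([chunkId, score]) => ({ chunkId, score }))
    .sort((a, b) => b.score - a.score);
}

// Jaccard similarity over word sets: |A ∩ B| / |A ∪ B|.
function jaccard(a: string, b: string): number {
  const setA = new Set(a.toLowerCase().split(/\s+/).filter((w) => w.length > 0));
  const setB = new Set(b.toLowerCase().split(/\s+/).filter((w) => w.length > 0));
  if (setA.size === 0 && setB.size === 0) return 1;
  let shared = 0;
  setA.forEach((w) => {
    if (setB.has(w)) shared += 1;
  });
  return shared / (setA.size + setB.size - shared);
}

// Keep texts in ranked order, dropping any that is a near-duplicate
// (similarity above the threshold) of a text already kept.
function dropNearDuplicates(texts: string[], threshold = 0.85): string[] {
  const kept: string[] = [];
  for (const text of texts) {
    if (kept.every((k) => jaccard(k, text) <= threshold)) kept.push(text);
  }
  return kept;
}
```

A chunk that appears in both the vector and keyword lists accumulates two terms, so it tends to outrank a chunk that tops only one list. Because RRF uses ranks rather than raw scores, cosine distances and `ts_rank` values never need to be normalized against each other.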
| File | Purpose |
|---|---|
| src/core/engine.ts | Pluggable engine interface (BrainEngine) |
| src/core/postgres-engine.ts | Postgres + pgvector implementation |
| src/core/import-file.ts | importFromFile + importFromContent pipeline |
| src/core/sync.ts | Git-based incremental change detection |
| src/core/markdown.ts | YAML frontmatter + compiled_truth/timeline parsing |
| src/core/embedding.ts | OpenAI embedding with batch, retry, backoff |
| src/core/chunkers/recursive.ts | Base chunker (300w, 5-level delimiters) |
| src/core/chunkers/semantic.ts | Embedding-based topic boundary detection |
| src/core/chunkers/llm.ts | Claude Haiku guided chunking |
| src/core/search/hybrid.ts | RRF merge of vector + keyword |
| src/core/search/dedup.ts | 4-layer result deduplication |
| src/core/search/expansion.ts | Multi-query expansion via Claude Haiku |
| src/core/storage.ts | Pluggable storage (S3, Supabase, local) |
| src/core/operations.ts | Contract-first operation definitions (31 ops) |
| src/schema.sql | Full DDL (10 tables, RLS, tsvector, HNSW) |
10 tables in Postgres:
- pages — slug (unique), type, title, compiled_truth, timeline, frontmatter (JSONB)
- content_chunks — pgvector 1536-dim embedding, chunk_source (compiled_truth|timeline)
- links — typed edges (knows, works_at, invested_in, founded, etc.)
- tags — many-to-many page tagging
- timeline_entries — structured events (date, source, summary, detail)
- page_versions — snapshot history for diff/revert
- raw_data — sidecar JSON from external APIs (preserves provenance)
- files — binary attachments in storage backend
- ingest_log — audit trail of import operations
- config — brain-level settings (version, embedding model, chunk strategy)
Full-text search uses a weighted tsvector: title (A), compiled_truth (B), timeline (C). Vector search uses an HNSW index with cosine distance on content_chunks.embedding.
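
As a rough illustration of the two index styles just described, the SQL below is held in TypeScript string constants. It is a sketch, not an excerpt from `src/schema.sql`: the table and column names follow the schema list, while the index name, the `'english'` text-search configuration, the `id` column on `content_chunks`, and the constant names are assumptions.

```typescript
// Weighted document vector: title (A) outranks compiled_truth (B),
// which outranks timeline (C). The 'english' configuration is an assumption.
const weightedTsvectorSql = `
  setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
  setweight(to_tsvector('english', coalesce(compiled_truth, '')), 'B') ||
  setweight(to_tsvector('english', coalesce(timeline, '')), 'C')
`;

// Keyword search ranked with ts_rank; $1 is the query text, $2 the row limit.
const keywordSearchSql = `
  SELECT slug,
         ts_rank(${weightedTsvectorSql}, websearch_to_tsquery('english', $1)) AS rank
  FROM pages
  WHERE (${weightedTsvectorSql}) @@ websearch_to_tsquery('english', $1)
  ORDER BY rank DESC
  LIMIT $2
`;

// pgvector HNSW index using the cosine operator class (index name is hypothetical).
const hnswIndexSql = `
  CREATE INDEX IF NOT EXISTS content_chunks_embedding_hnsw
  ON content_chunks USING hnsw (embedding vector_cosine_ops)
`;

// Nearest-neighbour query; <=> is pgvector's cosine distance operator.
// $1 is the query embedding, $2 the row limit.
const vectorSearchSql = `
  SELECT id, embedding <=> $1::vector AS distance
  FROM content_chunks
  ORDER BY embedding <=> $1::vector
  LIMIT $2
`;
```

Recomputing the weighted vector in every query is for readability only; a stored or generated tsvector column with a GIN index is the usual way to avoid that cost.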
GBrain is the deterministic layer. Skills and recipes are the latent space layer.
See Thin Harness, Fat Skills for the full architecture philosophy.
- GBrain CLI = thin harness (same input → same output)
- Skills (ingest, query, maintain, enrich, briefing, migrate, setup) = fat skills
- Recipes (voice-to-brain, email-to-brain) = fat skills that install infrastructure
The agent reads the skill/recipe and uses GBrain's deterministic tools to do the work.