GBrain Infrastructure Layer

The shared foundation that all skills, recipes, and integrations build on.

Data Pipeline

INPUT (markdown files, git repo)
  ↓
FILE RESOLUTION (local → .redirect → .supabase → error)
  ↓
MARKDOWN PARSER (gray-matter frontmatter + body)
  → compiled_truth + timeline separation
  ↓
CONTENT HASH (SHA-256 idempotency check — skip if unchanged)
  ↓
CHUNKING (3 strategies, configurable)
  ├── Recursive: 300-word chunks, 50-word overlap, 5-level delimiter hierarchy
  ├── Semantic: embed sentences, cosine similarity, Savitzky-Golay smoothing
  └── LLM-guided: Claude Haiku identifies topic shifts in 128-word candidates
  ↓
EMBEDDING (OpenAI text-embedding-3-large, 1536 dimensions, shortened from the model's default 3072)
  → batch 100, exponential backoff, non-fatal if fails
  ↓
DATABASE TRANSACTION (atomic: page + chunks + tags + version)
  ↓
SEARCH (hybrid, available immediately)
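The 300-word / 50-word-overlap numbers in the recursive strategy can be illustrated with a simplified sketch. This is not the code in src/core/chunkers/recursive.ts: the function name `chunkWords` is hypothetical, and the 5-level delimiter hierarchy is deliberately left out, so the sketch only shows how overlapping word windows advance.

```typescript
// Simplified sketch: fixed-size word windows with overlap.
// Assumption: words are split on whitespace; the real recursive chunker
// also respects a 5-level delimiter hierarchy, which is NOT modeled here.
function chunkWords(text: string, chunkSize = 300, overlap = 50): string[] {
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunkSize");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = chunkSize - overlap; // each window starts 250 words after the last
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    // Stop once this window reaches the end, so no trailing chunk is pure overlap.
    if (start + chunkSize >= words.length) break;
  }
  return chunks;
}
```

With the defaults, a 700-word document yields three chunks starting at words 0, 250, and 500, and the last 50 words of each chunk reappear at the start of the next.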

Search Architecture

GBrain uses Reciprocal Rank Fusion (RRF) to merge vector and keyword search:

User Query
  ↓
EXPANSION (optional: Claude Haiku generates 2 alternative phrasings)
  ↓
  ├── VECTOR SEARCH (pgvector HNSW, cosine distance)
  │     → 2x limit results per query variant
  │
  └── KEYWORD SEARCH (PostgreSQL tsvector, ts_rank)
        → 2x limit results
  ↓
RRF MERGE (score = Σ 1/(60 + rank) across result lists; rank-based, so neither search type dominates)
  ↓
4-LAYER DEDUP
  ├── Best 3 chunks per page (source dedup)
  ├── Jaccard similarity > 0.85 (text dedup)
  ├── No type exceeds 60% (diversity)
  └── Max 2 chunks per page (page cap)
  ↓
TOP N RESULTS (default 20)
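The RRF merge and the Jaccard text-dedup layer above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in src/core/search/hybrid.ts or dedup.ts: the names `rrfMerge` and `jaccard` are hypothetical, ranks are assumed to be 1-based, and Jaccard similarity is assumed to be computed over lowercase word sets.

```typescript
// Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per result.
// Assumption: rank is 1-based; k = 60 matches the constant in the diagram.
function rrfMerge(
  rankedLists: string[][],
  k = 60
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  rankedLists.forEach((list) => {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  });
  const merged: Array<{ id: string; score: number }> = [];
  scores.forEach((score, id) => merged.push({ id, score }));
  // Highest fused score first.
  return merged.sort((a, b) => b.score - a.score);
}

// Jaccard similarity over word sets: |A ∩ B| / |A ∪ B|.
// Assumption: tokens are lowercase, whitespace-separated words.
function jaccard(a: string, b: string): number {
  const setA = new Set(a.toLowerCase().split(/\s+/).filter((w) => w.length > 0));
  const setB = new Set(b.toLowerCase().split(/\s+/).filter((w) => w.length > 0));
  if (setA.size === 0 && setB.size === 0) return 1; // two empty texts count as identical
  let intersection = 0;
  setA.forEach((w) => {
    if (setB.has(w)) intersection++;
  });
  return intersection / (setA.size + setB.size - intersection);
}
```

A chunk that appears in both the vector and keyword lists accumulates two terms, so it outranks a chunk that tops only one list. Under the 0.85 threshold from the diagram, two chunks would be treated as duplicates when `jaccard(a, b) > 0.85`.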

Key Components

File                            Purpose
src/core/engine.ts              Pluggable engine interface (BrainEngine)
src/core/postgres-engine.ts     Postgres + pgvector implementation
src/core/import-file.ts         importFromFile + importFromContent pipeline
src/core/sync.ts                Git-based incremental change detection
src/core/markdown.ts            YAML frontmatter + compiled_truth/timeline parsing
src/core/embedding.ts           OpenAI embedding with batch, retry, backoff
src/core/chunkers/recursive.ts  Base chunker (300w, 5-level delimiters)
src/core/chunkers/semantic.ts   Embedding-based topic boundary detection
src/core/chunkers/llm.ts        Claude Haiku guided chunking
src/core/search/hybrid.ts       RRF merge of vector + keyword
src/core/search/dedup.ts        4-layer result deduplication
src/core/search/expansion.ts    Multi-query expansion via Claude Haiku
src/core/storage.ts             Pluggable storage (S3, Supabase, local)
src/core/operations.ts          Contract-first operation definitions (31 ops)
src/schema.sql                  Full DDL (10 tables, RLS, tsvector, HNSW)
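The "batch, retry, backoff" behaviour listed for src/core/embedding.ts might look roughly like the sketch below. Everything here is an assumption for illustration: the names `toBatches`, `backoffDelayMs`, `embedAll`, and `EmbedFn` are hypothetical, as are the 1-second base delay, 30-second cap, and retry count. Only the batch size of 100 and the non-fatal failure mode come from the pipeline description.

```typescript
// Split inputs into batches (100 per request, per the pipeline diagram).
function toBatches<T>(items: T[], batchSize = 100): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Exponential backoff: base * 2^attempt, capped.
// Assumption: base and cap values are illustrative, not taken from GBrain.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Hypothetical embedding function; a real one would call the embeddings API.
type EmbedFn = (texts: string[]) => Promise<number[][]>;

async function embedAll(
  texts: string[],
  embed: EmbedFn,
  maxRetries = 3
): Promise<Array<number[] | null>> {
  const out: Array<number[] | null> = [];
  for (const batch of toBatches(texts)) {
    let vectors: number[][] | null = null;
    for (let attempt = 0; attempt <= maxRetries && vectors === null; attempt++) {
      try {
        vectors = await embed(batch);
      } catch (_err) {
        // Wait before the next attempt; skip the wait after the final failure.
        if (attempt < maxRetries) {
          const delay = backoffDelayMs(attempt);
          await new Promise<void>((resolve) => setTimeout(resolve, delay));
        }
      }
    }
    // Non-fatal: a batch that never succeeds yields null embeddings
    // instead of aborting the import.
    for (let i = 0; i < batch.length; i++) {
      out.push(vectors ? vectors[i] : null);
    }
  }
  return out;
}
```

Returning `null` for failed batches keeps the page importable and keyword-searchable, and leaves the embeddings to be backfilled later.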

Schema Overview

10 tables in Postgres:

  • pages — slug (unique), type, title, compiled_truth, timeline, frontmatter (JSONB)
  • content_chunks — pgvector 1536-dim embedding, chunk_source (compiled_truth|timeline)
  • links — typed edges (knows, works_at, invested_in, founded, etc.)
  • tags — many-to-many page tagging
  • timeline_entries — structured events (date, source, summary, detail)
  • page_versions — snapshot history for diff/revert
  • raw_data — sidecar JSON from external APIs (preserves provenance)
  • files — binary attachments in storage backend
  • ingest_log — audit trail of import operations
  • config — brain-level settings (version, embedding model, chunk strategy)

Full-text search uses weighted tsvector: title (A), compiled_truth (B), timeline (C). Vector search uses HNSW index with cosine distance on content_chunks.embedding.
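For reference, cosine distance (the quantity the HNSW index orders by) is 1 minus cosine similarity. The sketch below computes it in plain TypeScript to make the ranking criterion concrete. In practice the database evaluates this, and the function name `cosineDistance` is hypothetical.

```typescript
// Cosine distance = 1 - (a · b) / (|a| * |b|).
// 0 means same direction, 1 means orthogonal, 2 means opposite.
// Note: a zero-length vector produces NaN here (division by zero).
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Because the measure ignores vector magnitude, two chunks whose embeddings point the same way count as equally close whatever their lengths.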

The Thin Harness Principle

GBrain is the deterministic layer. Skills and recipes are the latent space layer.

See "Thin Harness, Fat Skills" for the full architecture philosophy.

  • GBrain CLI = thin harness (same input → same output)
  • Skills (ingest, query, maintain, enrich, briefing, migrate, setup) = fat skills
  • Recipes (voice-to-brain, email-to-brain) = fat skills that install infrastructure

The agent reads the skill/recipe and uses GBrain's deterministic tools to do the work.