Code chunks land in DB with NULL language / symbol_name / symbol_type across all languages #769

@rayers

Symptoms

After gbrain sync --strategy code walks a multi-language repo (1882 source files: Java + Python + C + TypeScript + JavaScript + bash + CSS + HTML + YAML + JSON, etc.), the chunks land in content_chunks with the chunk text correctly tagged in the body (e.g., [Java] dview/.../DstreamView.java:5-5 import java.awt.C), but the language, symbol_name, symbol_type, and symbol_name_qualified columns are NULL on essentially every chunk.

code-def <symbol> returns 0 hits as a result — the query filters on symbol_name which is unpopulated. code-refs still works because it's a text-scan over chunk_text.
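One way to quantify the mismatch between the body tag and the columns is to parse the leading [Language] tag out of chunk_text and count rows where the language column is still NULL. The sketch below is illustrative only: the ChunkRow shape and both function names are hypothetical, and it assumes every code chunk body starts with a bracketed tag like the [Java] example above.

```typescript
// Hypothetical row shape; only the column names chunk_text and language come from this issue.
interface ChunkRow {
  chunk_text: string;
  language: string | null;
}

// Returns the leading "[Tag]" of a chunk body, lowercased, or null if there is none.
function bodyLanguageTag(chunkText: string): string | null {
  const match = /^\[([^\]]+)\]\s/.exec(chunkText);
  const tag = match?.[1];
  return tag ? tag.toLowerCase() : null;
}

// Counts, per body tag, the rows whose body is tagged but whose language column is NULL.
function countUntaggedColumns(rows: ChunkRow[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const row of rows) {
    const tag = bodyLanguageTag(row.chunk_text);
    if (tag !== null && row.language === null) {
      counts.set(tag, (counts.get(tag) ?? 0) + 1);
    }
  }
  return counts;
}
```

Feeding it the rows of content_chunks for one source would give a per-language count of chunks that are tagged in the body but not in the column.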

Concrete numbers from a freshly-synced nvr repo (~31k chunks)

Measured after gbrain sync --source <id> --strategy code finished successfully, the autopilot daemon had run a few maintenance cycles, and I had run a manual gbrain extract all:

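| Source language (inferred from slug) | chunks | with language | with symbol_name | with symbol_type |
|---|---|---|---|---|
| java | 16,663 | 0 | 0 | 0 |
| python | 10,541 | 0 | 0 | 0 |
| c | 3,024 | 0 | 0 | 0 |
| javascript | 692 | 0 | 0 | 0 |
| c-header | 461 | 0 | 0 | 0 |
| bash | 234 | 0 | 0 | 0 |
| json | 117 | 104 | 0 | 104 |
| html | 59 | 0 | 0 | 0 |
| yaml | 52 | 0 | 0 | 0 |
| css | 50 | 0 | 0 | 0 |
| cpp | 14 | 0 | 0 | 0 |
| typescript | 1 | 0 | 0 | 0 |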

JSON is the only language with any consistent metadata: 104 of its 117 chunks have language and symbol_type set, though none have symbol_name. Everything else is empty.

The chunker code path looks plausible:

  • src/core/chunkers/code.ts:542-579 calls extractSymbolName(node) and normalizeSymbolType(node.type) to populate metadata.
  • src/core/import-file.ts:127, 470 reads c.metadata.symbolName and writes to the symbol_name column.
  • The chunkable-node Sets at src/core/chunkers/code.ts:296-300 for Java look right (method_declaration, class_declaration, interface_declaration, etc.).

So either the chunker is computing metadata but not persisting (some pipeline disconnect), or the chunker isn't actually invoking the language grammar for code files (silent fallback to a non-AST chunker), or a downstream process (extraction, autopilot maintenance) rewrites chunks and strips metadata.
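A small instrumentation helper could separate the first two hypotheses from the third. The sketch below summarizes how many chunks carry metadata and is meant to be called on the chunker output just before the DB write. It is a sketch under assumptions: only c.metadata.symbolName is named in the code paths above; the CodeChunk shape, the symbolType and language field names, and the function name are hypothetical.

```typescript
// Hypothetical chunk shape. Only metadata.symbolName is confirmed by the code paths
// cited above; symbolType and language are assumed field names.
interface CodeChunk {
  text: string;
  metadata?: {
    symbolName?: string | null;
    symbolType?: string | null;
    language?: string | null;
  };
}

interface MetadataSummary {
  total: number;
  withSymbolName: number;
  withSymbolType: number;
  withLanguage: number;
}

// Counts how many chunks have each metadata field populated (non-empty).
function summarizeChunkMetadata(chunks: CodeChunk[]): MetadataSummary {
  const summary: MetadataSummary = {
    total: chunks.length,
    withSymbolName: 0,
    withSymbolType: 0,
    withLanguage: 0,
  };
  for (const c of chunks) {
    if (c.metadata?.symbolName) summary.withSymbolName++;
    if (c.metadata?.symbolType) summary.withSymbolType++;
    if (c.metadata?.language) summary.withLanguage++;
  }
  return summary;
}
```

If the counts are nonzero at that point but the columns are NULL afterwards, the problem is in persistence or a later rewrite; if they are already zero, it points at the chunker or a silent fallback.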

Reproduction

  1. Register a code source pointing at a real multi-language repo:
    gbrain sources add my-code --path /path/to/repo --federated
  2. Set strategy=code in the source's stored config. Until the PR for #767 ("sync --strategy code dropped on first sync via performFullSync") lands, this is needed to make sync walk code files:
    UPDATE sources SET config = '{"federated":true,"strategy":"code"}' WHERE id='my-code';
  3. First sync:
    cd /path/to/repo
    gbrain sync --source my-code --strategy code --no-embed
  4. Verify chunks landed:
    SELECT COUNT(*) FROM content_chunks WHERE chunk_text LIKE '[Java] %';
    -- many
    SELECT COUNT(*) FROM content_chunks WHERE language = 'java' OR symbol_name IS NOT NULL;
    -- zero or near-zero
  5. Confirm code-def is empty for known symbols:
    gbrain code-def <a-class-name-from-the-repo>
    # {"symbol": "...", "count": 0, "results": []}

Suspected interaction with gbrain extract all / autopilot

A pre-extract snapshot showed __init__: 325, os: 133, stdMsg: 102 populated in Python symbol_name (from gbrain stats with a strategy filter). After running gbrain extract all plus a few autopilot maintenance cycles, those counts dropped to single digits or zero. I suspect the extract / autopilot path is rewriting content_chunks rows (e.g., for backlink reconciliation or timeline extraction) without preserving the original symbol metadata.
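To make the before/after comparison repeatable, two symbol_name count snapshots can be diffed to list which symbols lost rows. This is a minimal sketch: the SymbolCounts type and droppedSymbols function are hypothetical names, and it assumes each snapshot has already been collected as a symbol-to-count map (for example from a GROUP BY symbol_name query).

```typescript
// Hypothetical snapshot shape: symbol_name -> number of chunks carrying it.
type SymbolCounts = Record<string, number>;

interface SymbolDrop {
  symbol: string;
  before: number;
  after: number;
}

// Lists symbols whose count decreased between snapshots, largest loss first.
// Symbols missing from the "after" snapshot are treated as count 0.
function droppedSymbols(before: SymbolCounts, after: SymbolCounts): SymbolDrop[] {
  return Object.entries(before)
    .map(([symbol, count]) => ({ symbol, before: count, after: after[symbol] ?? 0 }))
    .filter((entry) => entry.after < entry.before)
    .sort((a, b) => (b.before - b.after) - (a.before - a.after));
}
```

Running it once after sync and again after gbrain extract all, and then after each autopilot cycle, would show which step is responsible for the drop.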

Environment

Why I'm filing as an observation rather than a fix

Root-causing this requires instrumenting the chunker output before DB write to see whether c.metadata.symbolName is populated on the way in (chunker bug vs persistence bug) and tracing the extract / autopilot paths to see if either rewrites chunks without metadata. The chunker pipeline is yours; faster for you to triage than for me to debug-explore. Happy to run any diagnostic queries or share more data.
