Symptoms
After gbrain sync --strategy code walks a multi-language repo (1882 source files: Java + Python + C + TypeScript + JavaScript + bash + CSS + HTML + YAML + JSON, etc.), the chunks land in content_chunks with the chunk text correctly tagged in the body (e.g., [Java] dview/.../DstreamView.java:5-5 import java.awt.C), but the language, symbol_name, symbol_type, and symbol_name_qualified columns are NULL on essentially every chunk.
As a result, code-def <symbol> returns 0 hits: the query filters on symbol_name, which is unpopulated. code-refs still works because it is a text scan over chunk_text.
Concrete numbers from a freshly-synced nvr repo (~31k chunks)
The counts below were taken after gbrain sync --source <id> --strategy code finished successfully, the autopilot daemon ran a few maintenance cycles, and a manual gbrain extract all completed:
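| Source language (inferred from slug) | chunks | with language | with symbol_name | with symbol_type |
|---|---|---|---|---|
| java | 16,663 | 0 | 0 | 0 |
| python | 10,541 | 0 | 0 | 0 |
| c | 3,024 | 0 | 0 | 0 |
| javascript | 692 | 0 | 0 | 0 |
| c-header | 461 | 0 | 0 | 0 |
| bash | 234 | 0 | 0 | 0 |
| json | 117 | 104 | 0 | 104 |
| html | 59 | 0 | 0 | 0 |
| yaml | 52 | 0 | 0 | 0 |
| css | 50 | 0 | 0 | 0 |
| cpp | 14 | 0 | 0 | 0 |
| typescript | 1 | 0 | 0 | 0 |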
JSON is the only language with any metadata: 104 of its 117 chunks have language and symbol_type set, though symbol_name is still empty. Every other language has nothing populated.
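A coverage breakdown like the one above could be recomputed from exported rows with something like the following. This is only a sketch under assumptions: the ChunkRow shape covers just the columns named in this report, the function names are hypothetical, and the language is inferred from the "[Java]"-style tag at the start of the chunk body rather than from the slug.

```typescript
// Hypothetical row shape: only the content_chunks columns named in this report.
interface ChunkRow {
  chunk_text: string;
  language: string | null;
  symbol_name: string | null;
  symbol_type: string | null;
}

interface Coverage {
  chunks: number;
  withLanguage: number;
  withSymbolName: number;
  withSymbolType: number;
}

// Extract the leading "[Java] "-style tag from the chunk body, lowercased.
// Returns null when the body does not start with a bracketed tag.
function bodyTag(chunkText: string): string | null {
  const match = /^\[([^\]]+)\]\s/.exec(chunkText);
  return match?.[1]?.toLowerCase() ?? null;
}

// Group rows by body tag and count how many have each metadata column set.
function coverageByBodyTag(rows: ChunkRow[]): Map<string, Coverage> {
  const out = new Map<string, Coverage>();
  for (const row of rows) {
    const tag = bodyTag(row.chunk_text) ?? "(untagged)";
    const cov =
      out.get(tag) ??
      { chunks: 0, withLanguage: 0, withSymbolName: 0, withSymbolType: 0 };
    cov.chunks += 1;
    if (row.language !== null) cov.withLanguage += 1;
    if (row.symbol_name !== null) cov.withSymbolName += 1;
    if (row.symbol_type !== null) cov.withSymbolType += 1;
    out.set(tag, cov);
  }
  return out;
}
```

A tag whose chunks count is large while the other three counters stay at 0 matches the symptom described here: the body was tagged, but the columns were never filled.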
The chunker code path looks plausible:
src/core/chunkers/code.ts:542-579 calls extractSymbolName(node) and normalizeSymbolType(node.type) to populate metadata.
src/core/import-file.ts:127, 470 reads c.metadata.symbolName and writes to the symbol_name column.
The chunkable-node Sets at src/core/chunkers/code.ts:296-300 for Java look right (method_declaration, class_declaration, interface_declaration, etc.).
So either the chunker is computing metadata but not persisting (some pipeline disconnect), or the chunker isn't actually invoking the language grammar for code files (silent fallback to a non-AST chunker), or a downstream process (extraction, autopilot maintenance) rewrites chunks and strips metadata.
Reproduction
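Register a code source pointing at a real multi-language repo.
Set strategy=code in the source's stored config (until the PR for #767, "sync --strategy code dropped on first sync via performFullSync", lands, this is needed to make sync walk code files):
UPDATE sources SET config = '{"federated":true,"strategy":"code"}' WHERE id = 'my-code';
First sync:
cd /path/to/repo
gbrain sync --source my-code --strategy code --no-embed
Verify chunks landed:
SELECT COUNT(*) FROM content_chunks WHERE chunk_text LIKE '[Java] %';
-- many
SELECT COUNT(*) FROM content_chunks WHERE language = 'java' OR symbol_name IS NOT NULL;
-- zero or near-zero
code-def <symbol> is then empty for known symbols.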
Suspected interaction with gbrain extract all / autopilot
A pre-extract snapshot showed __init__: 325, os: 133, stdMsg: 102 populated in Python symbol_name (from gbrain stats with strategy filter). After running gbrain extract all plus a few autopilot maintenance cycles, those counts dropped to single digits or zero. Suspect the extract / autopilot path may be rewriting content_chunks rows (e.g., for backlink reconciliation or timeline extraction) without preserving the original symbol metadata.
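One way to make the before/after comparison repeatable is to diff two symbol_name count snapshots. The sketch below assumes the snapshots have already been collected as plain name-to-count maps (for example from gbrain stats output); the function and type names are hypothetical and not part of gbrain.

```typescript
// Snapshot of symbol_name -> number of chunks carrying that name.
type SymbolCounts = Record<string, number>;

interface SymbolDrop {
  symbol: string;
  before: number;
  after: number;
}

// Return every symbol whose count decreased, largest loss first.
// A symbol missing from the "after" snapshot is treated as a count of 0.
function findDroppedSymbols(before: SymbolCounts, after: SymbolCounts): SymbolDrop[] {
  const drops: SymbolDrop[] = [];
  for (const [symbol, beforeCount] of Object.entries(before)) {
    const afterCount = after[symbol] ?? 0;
    if (afterCount < beforeCount) {
      drops.push({ symbol, before: beforeCount, after: afterCount });
    }
  }
  return drops.sort((a, b) => (b.before - b.after) - (a.before - a.after));
}

// "before" uses the pre-extract counts quoted in this report.
// "after" holds placeholder values for illustration only, not measured data.
const preExtract: SymbolCounts = { __init__: 325, os: 133, stdMsg: 102 };
const postExtractPlaceholder: SymbolCounts = { __init__: 3, os: 0 };
console.log(findDroppedSymbols(preExtract, postExtractPlaceholder));
```

Taking one snapshot after sync, one after gbrain extract all, and one after each autopilot cycle would show which step, if any, is responsible for the loss.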
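Environment
gbrain: dffb607, with PR #768 ("fix(sync+embed+extract): code-symbol ingest + graph extraction (#767 + #769 + extract follow-ups)") applied locally (the strategy fix; doesn't touch chunkers or import-file)
Engine: Postgres 16.13 (pgvector + pg_trgm) via Docker
bun 1.3.10
macOS 26.3.1 (build 25D771280a)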
Why I'm filing as an observation rather than a fix
Root-causing this requires instrumenting the chunker output before DB write to see whether c.metadata.symbolName is populated on the way in (chunker bug vs persistence bug) and tracing the extract / autopilot paths to see if either rewrites chunks without metadata. The chunker pipeline is yours; faster for you to triage than for me to debug-explore. Happy to run any diagnostic queries or share more data.
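The instrumentation described above might look roughly like this. It is a minimal sketch, not the project's code: only c.metadata.symbolName is taken from this report, while the ChunkLike shape, the other metadata field names, and both function names are assumptions made for illustration.

```typescript
// Hypothetical chunk shape. Only metadata.symbolName is named in the report;
// symbolType and language are assumed field names.
interface ChunkLike {
  text: string;
  metadata?: {
    symbolName?: string | null;
    symbolType?: string | null;
    language?: string | null;
  };
}

interface MetadataSummary {
  total: number;
  withSymbolName: number;
  withSymbolType: number;
  withLanguage: number;
}

// True only for non-empty strings, so null, undefined, and "" all count as unset.
function isSet(value: string | null | undefined): boolean {
  return typeof value === "string" && value.length > 0;
}

// Count how many chunks carry each metadata field right before the DB write.
function summarizeChunkMetadata(chunks: ChunkLike[]): MetadataSummary {
  const summary: MetadataSummary = {
    total: chunks.length,
    withSymbolName: 0,
    withSymbolType: 0,
    withLanguage: 0,
  };
  for (const c of chunks) {
    if (isSet(c.metadata?.symbolName)) summary.withSymbolName += 1;
    if (isSet(c.metadata?.symbolType)) summary.withSymbolType += 1;
    if (isSet(c.metadata?.language)) summary.withLanguage += 1;
  }
  return summary;
}

// Emit one line per file to stderr so the sync's normal output is untouched.
function logChunkMetadata(filePath: string, chunks: ChunkLike[]): void {
  console.error(`[chunk-metadata] ${filePath} ${JSON.stringify(summarizeChunkMetadata(chunks))}`);
}
```

If withSymbolName is already 0 at this point, the chunker (or a silent fallback) is the likely culprit. If it is non-zero here but the column is NULL afterwards, the problem is in persistence or in a later rewrite.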