fix(sync): honor --strategy on first sync via performFullSync (#767) #768

Open
rayers wants to merge 2 commits into garrytan:master from rayers:local/sync-strategy-fix

@rayers rayers commented May 9, 2026

Summary

Closes #767. gbrain sync --strategy code --source <id> silently ran a markdown-only import on first sync (no anchor commit yet) because performFullSync calls runImport which is hardcoded to collectMarkdownFiles. Result: registering a fresh code source produced 0 code pages — no error, just silent under-coverage. code-def / code-refs returned 0 hits for every symbol.

Changes

src/commands/import.ts

  • runImport now parses --strategy markdown|code|auto (defaults to markdown for backward compat).
  • Both space-separated (--strategy code) and equals-separated (--strategy=code) forms are accepted. The original code only matched the space form via args.indexOf('--strategy'); equals-form fell through to default.
  • Validates the value with a clear error before falling into the file walk.
  • auto strategy is computed as a UNION of collectMarkdownFiles(dir) ∪ collectFilesByStrategy(dir, 'code') — strict superset of legacy markdown coverage. The raw collectFilesByStrategy(dir, 'auto') helper alone uses isSyncable's auto rules (consistent with walkSyncableFiles), which strips brain-convention files (README.md, index.md, ops/**); callers wanting the union compose the two.
  • New collectFilesByStrategy(dir, strategy) helper. Mirrors collectMarkdownFiles symlink + hidden-dir + node_modules safety (the L002 invariants from PR #26, "fix: skip node_modules and handle broken symlinks in import walker", and PR #38, the v0.6.1 community fix wave) and routes the include filter through isSyncable so this and the incremental walkSyncableFiles share inclusion logic. Per-file 5MB cap via a documented module-level MAX_IMPORT_FILE_SIZE constant.
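
For reviewers who want the two-form flag handling spelled out, here is a minimal sketch of a parser that accepts both `--strategy code` and `--strategy=code`, defaults to markdown, and validates the value before any file walk. The names `parseStrategy`, `Strategy`, and `STRATEGIES` are hypothetical and the error wording is an assumption; this is not the code in src/commands/import.ts.

```typescript
// Illustrative sketch only: parseStrategy, Strategy, and STRATEGIES are
// hypothetical names, not identifiers taken from the patch.
type Strategy = "markdown" | "code" | "auto";

const STRATEGIES: readonly string[] = ["markdown", "code", "auto"];

function parseStrategy(args: string[]): Strategy {
  let raw: string | undefined;

  for (let i = 0; i < args.length; i++) {
    const arg = args[i] ?? "";
    if (arg === "--strategy") {
      // Space-separated form: the value is the next argv entry.
      raw = args[i + 1];
      if (raw === undefined) {
        throw new Error("--strategy requires a value");
      }
      break;
    }
    if (arg.startsWith("--strategy=")) {
      // Equals-separated form: the value follows the '=' in the same entry.
      raw = arg.slice("--strategy=".length);
      break;
    }
  }

  // No flag at all: keep the legacy markdown-only behavior.
  if (raw === undefined) return "markdown";

  // Reject unknown values up front instead of silently falling back.
  if (!STRATEGIES.includes(raw)) {
    throw new Error(
      `Invalid --strategy "${raw}"; expected one of: ${STRATEGIES.join(", ")}`
    );
  }
  return raw as Strategy;
}
```

With this shape, `parseStrategy(["/repo", "--strategy=code"])` and `parseStrategy(["/repo", "--strategy", "code"])` both return `"code"`, and an argv without the flag returns `"markdown"`.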

src/commands/sync.ts

  • performFullSync threads opts.strategy through to runImport's importArgs.
  • Same fix applied to the dryRun branch (parallel bug — silent misleading dry-run counts when strategy=code).
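
The threading described above can be pictured roughly as follows. This is a sketch under assumptions: `SyncOptsSketch` and `buildImportArgs` are hypothetical names, and the real performFullSync may assemble importArgs differently. The point is only that the strategy is forwarded when set, so a first sync no longer falls back to the markdown default.

```typescript
// Illustrative sketch only: SyncOptsSketch and buildImportArgs are
// hypothetical; they are not the shapes used in src/commands/sync.ts.
interface SyncOptsSketch {
  strategy?: "markdown" | "code" | "auto";
}

function buildImportArgs(repoPath: string, opts: SyncOptsSketch): string[] {
  const importArgs: string[] = [repoPath];
  // Forward the strategy only when the caller set one, so the importer's
  // own default (markdown) still applies when the flag is absent.
  if (opts.strategy !== undefined) {
    importArgs.push("--strategy", opts.strategy);
  }
  return importArgs;
}
```

Using the same builder for both the real run and the dry-run branch is one way to keep the two from drifting apart again.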

test/collect-files-by-strategy.test.ts (new, 13 tests)

  • strategy=code returns code-only, recurses, skips node_modules + hidden dirs, 5MB cap.
  • Empty dir returns []; non-existent dir returns []; unreadable subdir doesn't throw.
  • Symlink containment (file + dir + dangling) — full L002 parity.
  • strategy=auto returns code + non-stripped markdown; documents the brain-convention strip behavior so callers know to compose with collectMarkdownFiles for the union.
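
The walker behaviors these tests cover can be sketched as below. This is a simplified stand-in, not the patch's collectFilesByStrategy: the include filter here is a hypothetical extension set rather than isSyncable, symlinks are skipped outright (the real containment rule may be more permissive), and 5MB is assumed to mean 5 * 1024 * 1024 bytes.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Constant name comes from the PR; the exact byte value is an assumption.
const MAX_IMPORT_FILE_SIZE = 5 * 1024 * 1024;

// Hypothetical stand-in for the isSyncable include filter.
const CODE_EXTENSIONS = new Set([".ts", ".py", ".java", ".c", ".sh"]);

function collectCodeFilesSketch(dir: string): string[] {
  const out: string[] = [];

  let entries: string[];
  try {
    entries = fs.readdirSync(dir);
  } catch {
    // Missing or unreadable directory: contribute nothing, do not throw.
    return out;
  }

  for (const name of entries) {
    const full = path.join(dir, name);

    let stat: fs.Stats;
    try {
      // lstat does not follow symlinks, so dangling links cannot throw here
      // and link targets outside the tree are never visited.
      stat = fs.lstatSync(full);
    } catch {
      continue;
    }

    // Assumption: skip every symlink (file, directory, or dangling).
    if (stat.isSymbolicLink()) continue;

    if (stat.isDirectory()) {
      // Skip hidden directories and node_modules, recurse into the rest.
      if (name.startsWith(".") || name === "node_modules") continue;
      out.push(...collectCodeFilesSketch(full));
    } else if (
      stat.isFile() &&
      stat.size <= MAX_IMPORT_FILE_SIZE &&
      CODE_EXTENSIONS.has(path.extname(name).toLowerCase())
    ) {
      out.push(full);
    }
  }
  return out;
}
```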

Verification

End-to-end against a 1.6GB Dividia NVR repo (~7000 files, mixed Java + Python + C + TypeScript + bash):

$ gbrain sources add gstack-code-nvr --path /repo --federated
$ gbrain sync --source gstack-code-nvr --strategy code --no-embed
Found 1882 code files
Using 4 parallel workers
[import.files] 1882/1882 (100%) imported=1882 skipped=0 errors=0
Import complete (17.3s):
  1882 pages imported
  0 pages skipped (0 unchanged, 0 errors)
  31933 chunks created

Was 0 before this patch.

gbrain code-refs <symbol> returns hits via chunk_text scan (verified with 50 hits for a known nvr Python symbol). gbrain code-def works for any language whose tree-sitter chunker fully populates symbol_name / symbol_type columns (Python: yes; Java: chunks are created with [Java] path:line chunk_text but symbol metadata is unpopulated — separate downstream chunker behavior, not affected by this patch).

Equals-form spot-checks:

--strategy code     → Found N code files
--strategy=code     → Found N code files
--strategy auto     → 4 files (3 md + 1 ts) on a 4-file fixture (3 md from collectMarkdownFiles + 1 ts from collectFilesByStrategy code)
--strategy markdown → 3 files (md only)
(no flag)           → Found N markdown files (backward compat)
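
The `auto` row above (3 md + 1 ts) follows from composing the two collectors. A minimal sketch of that union, assuming both collectors return plain path arrays; `unionPaths` is a hypothetical helper name, and the de-duplication is a defensive assumption in case the two lists ever overlap:

```typescript
// Illustrative sketch only: unionPaths is a hypothetical name. The inputs
// stand in for the results of collectMarkdownFiles(dir) and
// collectFilesByStrategy(dir, 'code').
function unionPaths(markdownFiles: string[], codeFiles: string[]): string[] {
  // Markdown results come first and are never filtered, so the union is a
  // superset of legacy markdown coverage; a Set drops any duplicate path
  // while preserving first-seen order.
  return Array.from(new Set(markdownFiles.concat(codeFiles)));
}
```

Because the markdown list passes through untouched, files such as README.md and index.md stay in the result even though the raw `auto` rules would strip them.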

Tests

  • 13 new tests in test/collect-files-by-strategy.test.ts — all pass.
  • Pre-existing import + sync test surface (import-walker, import-file, import-resume, sync-classifier-widening, sources-resync-recovery, skillpack-sync-guard): 64 → 64, all pass.
  • Combined: 77 pass, 0 fail.

Adversarial review

Two-round review process before submission. Round-1 patch was reviewed by an independent Claude subagent (read-only, hostile-reviewer mode) which surfaced three HIGH issues:

  1. auto was not a strict superset of collectMarkdownFiles (silently dropped README.md / index.md / ops/**).
  2. --strategy=code (equals form) silently dropped (same class of bug being fixed).
  3. require('../core/sync.ts') was the only require in the file (vs import everywhere else).

Plus MEDIUM (duplicated 5MB constant) and LOW (test gaps). All addressed in commit d8e79f7. Two rounds are visible as separate commits on the branch for review history; happy to squash on merge.

Codex CLI review was also attempted (codex review --base master) but stalled in repo-exploration mode — known issue with codex 0.117 on patches against larger repos. Claude subagent review was the more productive of the two.

Out of scope (separate from this PR)

  • gbrain sync --source <id> without --strategy does not consult cfg.strategy from the source's stored config (sync.ts:1071-1101). The --all path does. Pre-existing inconsistency; can address separately.
  • Java tree-sitter chunker creates chunks with chunk_text like [Java] path:line but doesn't populate symbol_name / language columns the way Python's chunker does. Separate downstream concern; affects code-def quality but is unrelated to file walking.

Environment used to develop this patch

gbrain v0.30.1 (dffb607), engine = Postgres 16.13 (pgvector + pg_trgm), bun 1.3.10, macOS 26.3.1 (build 25D771280a). Use case: enabling mcp__gbrain__code_def for cross-codebase symbol lookup on a single-developer NVR codebase.

🤖 Generated with Claude Code

rayers added 2 commits May 8, 2026 21:55
…an#767)

`gbrain sync --strategy code --source <id>` silently ran a markdown-only
import on first sync (no anchor commit yet). The flag was parsed but
dropped when performFullSync called runImport, which is hardcoded to
collectMarkdownFiles. Result: registering a fresh code source produced
0 code pages, breaking code-def / code-refs / code-callers / code-callees
on freshly-registered code repos.

Changes:

- src/commands/import.ts
  - runImport now parses `--strategy markdown|code|auto` (defaults to
    markdown, preserves backward compat).
  - When strategy != markdown, route file collection through a new
    `collectFilesByStrategy` helper that mirrors collectMarkdownFiles
    safety guards (lstatSync symlink containment, hidden + node_modules
    skip, 5MB size cap) and uses isSyncable as the include filter so
    inclusion logic stays in sync with the incremental walker.
  - Validates --strategy value with a clear error before falling into
    the file walk.

- src/commands/sync.ts
  - performFullSync threads opts.strategy through to runImport's argv.
  - Dry-run branch also honors strategy (was a parallel bug — silent
    misleading dry-run counts when strategy=code).

- test/collect-files-by-strategy.test.ts (new)
  - 10 tests, all pass: strategy=code returns code-only, strategy=auto
    returns code+md, recursion, node_modules + hidden-dir skips, 5MB cap,
    symlink containment (file + dir + dangling), parity with markdown.
  - Mirrors import-walker.test.ts patterns + L002 security invariants.

Verified end-to-end: against a 1.6GB nvr repo (~7000 files, mixed Java
+ Python + C + TypeScript + bash), `gbrain sync --strategy code`
produced 1882 code pages and 31933 tree-sitter chunks (was 0 before
this patch). `gbrain code-def DstreamView` returns count: 1.

Existing test surface (import-walker, import-file, import-resume,
sync-classifier-widening, sources-resync-recovery, skillpack-sync-guard)
runs green: 64 pass, 0 fail.

Closes garrytan#767
Adversarial review (Claude subagent) flagged five issues in the
initial garrytan#767 patch, with severities as labeled below. This
commit addresses each:

1. HIGH — `auto` was not a strict superset of `collectMarkdownFiles`.
   The previous implementation routed `auto` through isSyncable, which
   strips brain-convention files (README.md, index.md, log.md, schema.md,
   ops/**, .raw/**) and multimodal images. A user migrating from
   strategy=markdown to strategy=auto would silently lose pages — the
   exact silent-drop class the patch is meant to fix.

   Fixed by computing `auto` in runImport as a UNION of
   `collectMarkdownFiles(dir) ∪ collectFilesByStrategy(dir, 'code')`. The
   raw `collectFilesByStrategy(dir, 'auto')` helper still uses isSyncable's
   auto rules (consistent with the incremental sync path) and is documented
   accordingly; callers wanting the union compose the two.

2. HIGH — `--strategy=code` (equals form) silently dropped. The original
   `args.indexOf('--strategy')` only matched the space-separated form;
   `--strategy=code` fell through to default 'markdown'. Verified by
   manual smoke test before the fix: `gbrain import <dir> --strategy=code`
   reported "Found N markdown files".

   Fixed with a two-form parser that accepts both `--strategy code` and
   `--strategy=code`. Verified: both forms now print "Found N code files".

3. MEDIUM — `require('../core/sync.ts')` in collectFilesByStrategy was the
   only `require` in the file (vs `import` everywhere else). Stylistic
   inconsistency + brittle if the project moves to ESM-only.

   Fixed by promoting to a top-level `import { isSyncable } from '...'`.

4. MEDIUM — duplicated 5MB constant. Extracted to a documented module-
   level `MAX_IMPORT_FILE_SIZE` to make the sync between this and
   `MAX_FILE_SIZE` in `import-file.ts` discoverable.

5. LOW — test coverage gaps: empty dir, non-existent dir, unreadable
   subdir, brain-convention strip behavior on `auto`. All added.

Test count for collect-files-by-strategy.test.ts: 10 → 13 pass.
Full import+sync suite (7 files): 77 pass, 0 fail (was 64 before, +13
new tests).

Equals-form spot-check confirms parity:
  --strategy code   → Found N code files
  --strategy=code   → Found N code files
  --strategy auto   → 4 files (3 md + 1 ts) on a 4-file fixture
  --strategy markdown → 3 files (md only)
  (no flag)        → Found N markdown files (backward compat)

The single-source `gbrain sync --source <id>` path that doesn't read
cfg.strategy from sources.config (`sync.ts:1071-1101`) is pre-existing
behavior and out of scope for garrytan#767. Documented in the related issue
discussion; potential follow-up.
Development

Successfully merging this pull request may close these issues.

sync --strategy code dropped on first sync via performFullSync