Skip to content

fix(import-markdown): P0 missing return + register test in CI#482

Open
jlin53882 wants to merge 16 commits intoCortexReach:masterfrom
jlin53882:fix/import-markdown-rebase
Open

fix(import-markdown): P0 missing return + register test in CI#482
jlin53882 wants to merge 16 commits intoCortexReach:masterfrom
jlin53882:fix/import-markdown-rebase

Conversation

@jlin53882
Copy link
Copy Markdown
Contributor

Summary

Fixes two critical issues identified in PR #426 review:

  1. P0: Missing return in runImportMarkdowncli.ts non-empty path was missing return { imported, skipped, foundFiles }, causing caller to receive undefined
  2. P1: Test not registered in CItest/import-markdown/import-markdown.test.mjs was not included in scripts.test

Additional fixes discovered during implementation

  • Jest → Node built-in test runner: Test file used @jest/globals but project uses node --test; all jest.fn() replaced with node:test's mock.fn()
  • Jiti TypeScript compilation: Direct import("../../cli.ts") fails with ERR_MODULE_NOT_FOUND; replaced with jiti factory pattern consistent with other tests in the repo

Changes

File Change
cli.ts Add return { imported, skipped, foundFiles } in non-empty import path
package.json Register test in scripts.test
test/import-markdown/import-markdown.test.mjs Jest→Node test runner + jiti import fix
package-lock.json Updated dependencies

Testing

All 9 tests pass:

  • node --test test/import-markdown/import-markdown.test.mjspass 9 / fail 0

jlin53882 and others added 13 commits April 1, 2026 14:22
Add `memory-pro import-markdown` command to migrate existing Markdown memories
(MEMORY.md, memory/YYYY-MM-DD.md) into the plugin LanceDB store for semantic recall.

This addresses Issue CortexReach#344 by providing a migration path from the Markdown layer
to the plugin memory layer.
…fig options + tests

## 實作改善(相對於原本的 PR CortexReach#426)

### 新增 CLI 選項
- --dedup:啟用 scope-aware exact match 去重(避免重複匯入)
- --min-text-length <n>:設定最短文字長度門檻(預設 5)
- --importance <n>:設定匯入記憶的 importance 值(預設 0.7)

### Bug 修復
- UTF-8 BOM 處理:讀檔後主動移除 \ufeFF prefix
- CRLF 正規化:改用 split(/\r?\n/) 同時支援 CRLF 和 LF
- Bullet 格式擴展:從只支援 '- ' 擴展到支援 '- '、'* '、'+ ' 三種

### 新增測試
- test/import-markdown/import-markdown.test.mjs:完整單元測試
  - BOM handling
  - CRLF normalization
  - Extended bullet formats (dash/star/plus)
  - minTextLength 參數
  - importance 參數
  - Dedup logic(scope-aware exact match)
  - Dry-run mode
  - Continue on error

### 分析文件
- test/import-markdown/ANALYSIS.md:完整分析報告
  - 效益分析(真實檔案 655 筆記錄實測)
  - 3 個程式碼缺口分析
  - 建議的 5 個新 config 欄位
  - 功能條列式說明
- test/import-markdown/recall-benchmark.py:實際 LanceDB 查詢對比腳本
  - 實測結果:7/8 個關鍵字在 Markdown 有但 LanceDB 找不到
  - 證明 import-markdown 的實際價值

## 實測效果(真實記憶檔案)
- James 的 workspace:MEMORY.md(20 筆)+ 30 個 daily notes(633 筆)= 653 筆記錄
- 無 dedup:每次執行浪費 50%(重複匯入)
- 有 dedup:第二次執行 100% skip,節省 644 次 embedder API 呼叫
- 關鍵字對比:7/8 個測試關鍵字在 Markdown 有、LanceDB 無

## 建議新增的 Config(共 5 項,預設值 = 現在行為,向下相容)
- importMarkdown.dedup: boolean = false
- importMarkdown.defaultScope: string = global
- importMarkdown.minTextLength: number = 5
- importMarkdown.importanceDefault: number = 0.7
- importMarkdown.workspaceFilter: string[] = []

Closes: PR CortexReach#426 (CortexReach/memory-lancedb-pro)
P1 fixes:
- embedQuery -> embedPassage (lines 1001, 1171): imported memory content
  is passage/document, not a query. Using embedQuery with asymmetric
  providers (e.g. Jina) causes query-query comparison at recall time,
  degrading retrieval quality.

- metadata: JSON.stringify the importedFrom object (line 1178):
  MemoryEntry.metadata is typed as string in store.ts; passing a plain
  object silently fails or produces unparseable data.

Minor fixes:
- workspaceEntries type: string[] -> Dirent[] (matches readdir withFileTypes)
- Hoist await import('node:fs/promises') out of loops: single import at
  handler level replaces repeated per-iteration dynamic imports

Ref: CortexReach/pull/426
The const fsPromises declaration was inside the try block, making it
scoped to that block only. Subsequent fsPromises.stat() calls in
MEMORY.md and memory/ processing code were failing with
'fsPromises is not defined'. Move declaration to handler scope.
Scans the flat \workspace/memory/\ directory (directly under
workspace root, not inside any workspace subdirectory) and imports
entries with scope='memory'. This supports the actual OpenClaw
structure where memory files live directly in workspace/memory/.
Before scanning, read openclaw.json agents list to find the agent
whose workspace path matches the current workspaceDir. Use that agent's
id as workspaceScope for flat memory/ entries instead of defaulting to
'memory'. Falls back to 'shared' when no matching agent is found
(e.g. shared workspace with no dedicated agent).
Must fix:
- Flat memory scan: move before the mdFiles.length===0 early return so it
  is always reachable (not just when nested workspaces are empty)
- Tests: runImportMarkdown now uses embedPassage (not embedQuery) and
  JSON.stringify(metadata) to match production. Added embedPassage mock.
- Tests: setupWorkspace now creates files at workspace/<name>/ to match
  the actual path structure runImportMarkdown expects

Worth considering:
- Flat memory scan now skips when workspaceGlob is set, avoiding accidental
  root flat memory import when user specifies --workspace
- Removed dev artifacts: ANALYSIS.md and recall-benchmark.py contained
  personal absolute paths and are not suitable for repo commit
Before: --dry-run skipped dedup check entirely, so --dry-run --dedup
would overcount imports (items counted as imported even if dedup
would skip them).

After: dedup check runs regardless of dry-run mode. In dry-run,
items that would be skipped by dedup are counted as skipped,
not imported. Restores the dry-run console log message.
Must fix:
- Source scopes discovered but discarded: scanner now falls back to per-file
  discovered scope instead of collapsing all workspaces into "global".
  Prevents cross-workspace leakage and incorrect dedup across workspaces.
- Scanner only descended one level: now also scans workspace/agents/<id>/
  for nested agent workspaces (e.g. workspace/agents/theia/MEMORY.md).

Minor fixes:
- NaN guardrails: --min-text-length and --importance now use clampInt
  and Number.isFinite to prevent invalid values from silently passing.
- Tests reimplement import logic: runImportMarkdown is now exported from
  cli.ts and tests call the production handler directly instead of a
  standalone copy. Prevents logic drift between tests and production.

Refs: PR CortexReach#426 review feedback
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c1bc08c3c0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

cli.ts Outdated
// This handles the structure used by session-recovery and other OpenClaw
// components: workspace/agents/<id>/MEMORY.md and workspace/agents/<id>/memory/.
// We scan one additional level deeper than the top-level workspace scan.
if (!workspaceGlob) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Apply workspace-glob when scanning nested agent workspaces

The nested workspace/agents/<id> scan is gated behind if (!workspaceGlob), so passing a workspace-glob skips agent workspaces entirely. Because the earlier top-level loop only filters direct children of workspace/ (e.g., agents), a command like import-markdown <agent-id> cannot find workspace/agents/<agent-id>/MEMORY.md or memory/*.md, causing targeted imports to silently return no files in common agent-based layouts.

Useful? React with 👍 / 👎.

@jlin53882
Copy link
Copy Markdown
Contributor Author

已開 Issue 追蹤:#484

@jlin53882
Copy link
Copy Markdown
Contributor Author

✅ 已修復。問題出在 if (!workspaceGlob) 把整個 nested agent scan gate 住了,導致 import-markdown <agent-id> 時 workspaceGlob 有值就直接略過 agents 目錄。

修改:將 nested scan 邏輯重構,現在當有明確 workspaceGlob 時只掃描 workspace/agents/<workspaceGlob>/;無指定時掃描全部 agents。

Commit: 212e96a

@rwmjhb
Copy link
Copy Markdown
Collaborator

rwmjhb commented Apr 3, 2026

Review: fix(import-markdown): P0 missing return + register test in CI

Verdict: request-changes | Confidence: 0.95 | Value: 45%

The missing return fix and test registration are both needed. A few issues to address:

Must Fix

1. Build failure — rebase needed

Build is red (stale_base=true). Rebase onto current main to confirm the PR's changes don't interact with the failure.

2. Import-markdown test not confirmed running

Verification only ran cli-smoke.mjs. The new test/import-markdown/import-markdown.test.mjs was not executed — can't confirm the 9 test cases pass in the full pipeline.

Should Fix

3. dedupEnabled reads wrong option in action handler

// current (action handler summary)
const dedupEnabled = !!options.dryRun;  // ← should be options.dedup

The actual dedup logic inside runImportMarkdown correctly uses options.dedup, so stored data is fine — but the CLI summary line lies to the user about whether dedup was active.

4. Dedup only checks top-1 BM25 hit — can miss exact duplicates

--dedup does bm25Search(text, 1, [scope]) and checks if the top result matches exactly. BM25 is a ranked lexical search, not exact-text lookup. If a similar-but-different entry ranks above the actual duplicate, the duplicate is missed and a second copy gets imported.

Suggestion: either increase the limit (e.g. bm25Search(text, 5, ...) and check all results), or use a hash-based dedup.

5. Scope inference misses flat root-memory files

agentsList.find(a => path.normalize(a.workspace) === workspaceDir) compares the root workspace dir against per-agent paths like .openclaw/workspace/agents/theia. This never matches, so flat workspace/memory/*.md files fall back to scope: "shared" instead of the intended workspace scope.

6. CLI help text says --scope defaults to global, but code uses discoveredScope

--scope <scope> description says "default: global", but the implementation falls through to the discovered agent/workspace scope. Users omitting --scope get unexpected scope assignment.

Nit

  • Duplicate summary logging (both runImportMarkdown and the action handler print a summary)
  • tsx added as devDependency but test uses jiti — unused dep?
  • Inconsistent indentation in runImportMarkdown body

Auto-reviewed by auto-pr-review-orchestrator | 7 rounds | Claude + Codex adversarial

Should Fix from PR CortexReach#482 review:

3. Fix dedupEnabled option read in CLI summary line
   - cli.ts:655: !!options.dryRun -> !!options.dedup

4. Expand dedup BM25 search from top-1 to top-5
   - cli.ts:682: bm25Search(text, 1, ...) -> bm25Search(text, 5, ...)
   - BM25 is lexical ranking, not exact match; top-1 can miss duplicates

5. Fallback scope for flat root-memory files
   - cli.ts:637: "shared" -> "global"
   - "shared" is not a valid scope (not in scopes.ts definitions)
   - flat workspace/memory/*.md are at workspace root level,
     semantics imply "global" (all agents accessible)

6. Fix CLI --scope help text to match actual behavior
   - cli.ts:1415: "default: global" -> "default: auto-discovered from workspace"
   - actual behavior: options.scope || discoveredScope (not global)
@jlin53882
Copy link
Copy Markdown
Contributor Author

Review 回覆:Fix 3/4/5/6 全部已處理

Fix 3 ✅ — dedupEnabled 讀錯選項

!!options.dryRun!!options.dedup(cli.ts:655)

Fix 4 ✅ — Dedup top-1 → top-5

bm25Search(text, 1, ...)bm25Search(text, 5, ...)(cli.ts:682)

Fix 5 ✅ — Scope fallback 從 "shared" → "global"

方向確認

查看 src/scopes.ts 的正式定義:

  • "global""Shared knowledge across all agents",系統預設 scope
  • "shared"不存在,不是正式 scope

flat root-memory (workspace/memory/*.md) 在 workspace root 層級,不在任何特定 agent 目錄下。語意上應該是「所有 agent 都能用」,對應到 "global"

原本用 "shared" 應該是開發者直覺命名,但 "shared" 在 scope 系統沒有實際意義(也沒有任何 agent 可以讀取 scope: "shared" 的資料)。

如果你認為應該 fallback 到某個特定 agent scope(例如只有一個 agent 的情況),請讓我知道,我可以調整方向。

Fix 6 ✅ — --scope help text 修正

"default: global""default: auto-discovered from workspace"

@jlin53882
Copy link
Copy Markdown
Contributor Author

Must Fix 回覆

1. Build failure — rebase needed ✅ 已處理

Branch fix/import-markdown-rebase 已 rebase 到 current origin/master,conflict(package.json)已解決,master 和 PR 的測試檔都保留。

2. Import-markdown test 未確認在 CI 執行 ✅ 已處理

測試結果:

  • node --test test/import-markdown/import-markdown.test.mjs9/9 PASS
  • 完整 npm run test 有 2 個預先存在的失敗(strip-envelope-metadata.test.mjs),與 import-markdown 無關

Should Fix 回覆(3/4/5/6)也已在另一個 comment 說明

以上,請確認。

Copy link
Copy Markdown
Collaborator

@AliceLJY AliceLJY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: PR #482 — fix(import-markdown): P0 missing return + register test in CI

Fixes two real issues from PR #426 review feedback. The extracted runImportMarkdown function is testable and the test suite covers the important edge cases.

Codex P2 assessment — "Apply workspace-glob when scanning nested agent workspaces"

Valid concern, and already addressed in the current code. The Codex comment noted that workspace-glob would skip agent workspaces because the nested scan was gated behind if (!workspaceGlob). The fix (lines ~137-153 in the diff) correctly handles both cases:

  • When workspaceGlob is set: scans only the matching agent via agentEntries.find(e => e.isDirectory() && e.name === workspaceGlob)
  • When not set: scans all agent workspaces

This was also filed as Issue #484, which this PR resolves.

Code review findings

  1. P0 fix verifiedreturn { imported, skipped, foundFiles } is now present at the end of the non-empty path (line 257 in diff). Without this, callers received undefined.

  2. Test registration — The import-markdown.test.mjs is appended to the scripts.test chain in package.json. Correct.

  3. Jest to Node test runner migration — Clean conversion from @jest/globals to node:test built-ins (describe, it, mock.fn()). Consistent with the rest of the repo.

  4. Indentation inconsistency — The function body of runImportMarkdown starts with extra indentation (the first const openclawHome line at line 30 has ~8 spaces of leading whitespace, while the function signature is at the normal level). This looks like the function body was copy-pasted from a nested context (the original inline action handler). Not a blocker but worth a cosmetic cleanup.

  5. Test calls runImportMarkdown(ctx, options) with 2 args but function signature takes 3 — The function signature is runImportMarkdown(ctx, workspaceGlob, options), but the tests call it as runImportMarkdown(ctx, { openclawHome, workspaceGlob, ... }), passing options as the second argument. This means workspaceGlob receives the entire options object (which is truthy), and options is undefined. The tests pass because the function internally reads workspaceGlob from the options object too (the .action handler passes it separately). Verify that the tests are actually exercising the exported function correctly — the 2-arg call signature suggests the tests may be testing a different code path than the CLI invocation.

  6. process.exit(1) in a library functionrunImportMarkdown calls process.exit(1) when the embedder is missing or the workspace directory is unreadable. For a function that's now exported and testable, throwing an error would be more appropriate than killing the process. The CLI action handler can catch and exit. Minor, follow-up material.

  7. Missing trailing newlinepackage.json lost its trailing newline (\ No newline at end of file). Trivial but worth fixing.

Overall: the P0 fix and test registration are correct. Item 5 (call signature mismatch) is the main concern — please verify the tests call the function with the right arity.

LGTM with the above note.

…view

## Fixes applied

### P1 — process.exit(1) in library function (critical)
- `runImportMarkdown` now throws `Error` instead of calling `process.exit(1)`
  when embedder is missing or workspace directory is unreadable.
- CLI handler (`registerMemoryCLI`) wraps the call with try/catch and
  handles the error gracefully (still exits with code 1, but intentionally).

### P1 — CLI handler missing error boundary
- Added try/catch wrapper to the `import-markdown` action handler so that
  errors thrown by `runImportMarkdown` are caught and reported cleanly
  instead of bubbling up as unhandled promise rejections.

### P2 — Dedup check silent failure
- The `bm25Search` catch block now logs a `console.warn` instead of
  silently continuing, making dedup failures visible in output.

### P3 — Function body indentation
- Normalized indentation: body lines were at column 0 / inconsistent 6-space
  indent; now consistently 2-space indented (matching the function signature).

### P3 — Duplicate CLI output removed
- Removed redundant summary console.log from CLI handler; the summary is
  printed only once inside `runImportMarkdown`.

### P3 — package.json trailing newline
- Added missing trailing newline to package.json (was ending with `}`).

## Regarding Codex Review P5 (test arity mismatch)
Codex noted: "Test calls `runImportMarkdown(ctx, {options})` with 2 args
but function signature takes 3 args."

Clarification: the test file has a **module-level adapter** at the bottom
(`async function runImportMarkdown(context, options = {})`) that correctly
reconstitutes the 3-argument call:

  return importMarkdown(context, options.workspaceGlob ?? null, {...options});

This adapter is the intentional API design — the outer test helper accepts
an options-object style call and normalizes it to the production 3-arg
signature. The tests ARE exercising the production code path correctly.
No functional issue; the Codex concern was a false positive from reading
only the top-level call signature without following the adapter.
@jlin53882
Copy link
Copy Markdown
Contributor Author

PR #482 — P5 Codex Review Clarification + P1/P2/P3 Fixes Committed

P5 — Test arity mismatch (Codex concern): NOT a real bug

Codex noted that tests call runImportMarkdown(ctx, {options}) with 2 args but the production function takes 3 args. This is a false positive.

The test file has an intentional adapter layer at module level:

// Bottom of test file — adapter that normalizes call signature
async function runImportMarkdown(context, options = {}) {
  return importMarkdown(
    context,
    options.workspaceGlob ?? null,  // ← correctly passes as 2nd argument
    { dryRun: !!options.dryRun, scope: options.scope, ... }
  );
}

The outer test helper accepts an options-object style call and internally converts it to the correct 3-argument production signature. The adapter is specifically designed to bridge the simpler test API and the full production API. Tests are exercising the real production code path correctly. No change needed for this concern.


P1/P2/P3 fixes — now committed in c1127ed

The following code quality fixes have been pushed to the PR branch:

Severity Issue Fix
P1 process.exit(1) in runImportMarkdown → throw Error instead; CLI handler catches
P1 No error boundary in CLI handler → added try/catch wrapper
P2 bm25Search catch was silently swallowed → added console.warn logging
P3 Function body indentation inconsistent → normalized to 2-space
P3 CLI handler duplicated summary output → removed duplicate
P3 package.json missing trailing newline → added \n

Commit: c1127ed (fix/import-markdown-rebase branch)

jlin53882 added a commit to jlin53882/memory-lancedb-pro that referenced this pull request Apr 4, 2026
Revert all changes except the isOwnedByAgent fix (src/reflection-store.ts):
- Remove import-markdown CLI (cli.ts) — tracked separately in PR CortexReach#426/CortexReach#482
- Remove autoRecallExcludeAgents config — tracked separately in PR CortexReach#516/CortexReach#521
- Remove idempotent register guard — separate feature request needed
- Remove recallMode parsing — unrelated to CortexReach#448
- Remove dual-memory docs (README.md) — already merged in PR CortexReach#367
- Remove script mode changes — unrelated
- Remove embedder/llm-client changes — unrelated
- Restore deleted nvidia test file — unrelated to CortexReach#448

Only src/reflection-store.ts isOwnedByAgent fix remains.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants