feat: add import-markdown CLI command#426
feat: add import-markdown CLI command#426jlin53882 wants to merge 9 commits intoCortexReach:masterfrom
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ab501f5c18
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
cli.ts
Outdated
| } | ||
|
|
||
| try { | ||
| const vector = await context.embedder!.embedQuery(text); |
There was a problem hiding this comment.
Store passage vectors for imported markdown entries
The importer persists markdown bullets using embedQuery, but retrieval also embeds incoming searches with embedQuery (src/retriever.ts), so migrated rows are stored in the wrong embedding role. For task-aware models (for example providers that distinguish query vs document embeddings), this causes substantial recall degradation after migration because comparisons become query-query instead of query-document. Use embedPassage when writing memory content.
Useful? React with 👍 / 👎.
cli.ts
Outdated
| importance: 0.7, | ||
| category: "other", | ||
| scope: targetScope, | ||
| metadata: { importedFrom: filePath, sourceScope: scope }, |
There was a problem hiding this comment.
Serialize metadata before storing imported entries
MemoryStore.store expects metadata to be a JSON string (MemoryEntry.metadata in src/store.ts), but this command passes a plain object. With a typed LanceDB schema this can cause table.add to fail for each imported line (and the command will silently count them as skipped), and even if coerced, downstream metadata parsing assumes string JSON and will drop these fields. Serialize this value before calling store.
Useful? React with 👍 / 👎.
PR UpdateThis PR was split from #367 — What this PR doesAdds Review checklistThe following items were flagged during the original PR #367 review and should be verified here:
Related |
…fig options + tests ## 實作改善(相對於原本的 PR CortexReach#426) ### 新增 CLI 選項 - --dedup:啟用 scope-aware exact match 去重(避免重複匯入) - --min-text-length <n>:設定最短文字長度門檻(預設 5) - --importance <n>:設定匯入記憶的 importance 值(預設 0.7) ### Bug 修復 - UTF-8 BOM 處理:讀檔後主動移除 \ufeFF prefix - CRLF 正規化:改用 split(/\r?\n/) 同時支援 CRLF 和 LF - Bullet 格式擴展:從只支援 '- ' 擴展到支援 '- '、'* '、'+ ' 三種 ### 新增測試 - test/import-markdown/import-markdown.test.mjs:完整單元測試 - BOM handling - CRLF normalization - Extended bullet formats (dash/star/plus) - minTextLength 參數 - importance 參數 - Dedup logic(scope-aware exact match) - Dry-run mode - Continue on error ### 分析文件 - test/import-markdown/ANALYSIS.md:完整分析報告 - 效益分析(真實檔案 655 筆記錄實測) - 3 個程式碼缺口分析 - 建議的 5 個新 config 欄位 - 功能條列式說明 - test/import-markdown/recall-benchmark.py:實際 LanceDB 查詢對比腳本 - 實測結果:7/8 個關鍵字在 Markdown 有但 LanceDB 找不到 - 證明 import-markdown 的實際價值 ## 實測效果(真實記憶檔案) - James 的 workspace:MEMORY.md(20 筆)+ 30 個 daily notes(633 筆)= 653 筆記錄 - 無 dedup:每次執行浪費 50%(重複匯入) - 有 dedup:第二次執行 100% skip,節省 644 次 embedder API 呼叫 - 關鍵字對比:7/8 個測試關鍵字在 Markdown 有、LanceDB 無 ## 建議新增的 Config(共 5 項,預設值 = 現在行為,向下相容) - importMarkdown.dedup: boolean = false - importMarkdown.defaultScope: string = global - importMarkdown.minTextLength: number = 5 - importMarkdown.importanceDefault: number = 0.7 - importMarkdown.workspaceFilter: string[] = [] Closes: PR CortexReach#426 (CortexReach/memory-lancedb-pro)
|
Hey @jlin53882, thanks for the thorough write-up and the review checklist — really helpful context, especially the split from #367 and the real-world data showing the dual-memory gap. The file path resolution, error handling with continue-on-error, scope handling, dry-run mode, and dedup logic all look good. Nice work on the BOM/CRLF/multi-bullet fixes too. Two things need fixing before this can merge:
A couple of smaller things while you're in there:
Happy to re-review once those are addressed. The feature itself is valuable and the test coverage is solid! 🙏 |
…own reference, restore removed README sections - Restore cli.ts (was accidentally deleted, all CLI commands preserved) - Remove import-markdown command reference from dual-memory section (lives in PR CortexReach#426) - Restore beta.10 version banner and OpenClaw 2026.3+ badge - Restore Auto-recall timeout tuning FAQ section Ref: CortexReach#367
…own reference, restore removed README sections - Restore cli.ts (was accidentally deleted, all CLI commands preserved) - Remove import-markdown command reference from dual-memory section (lives in PR CortexReach#426) - Restore beta.10 version banner and OpenClaw 2026.3+ badge - Restore Auto-recall timeout tuning FAQ section Ref: CortexReach#367
P1 fixes:
- embedQuery -> embedPassage (lines 1001, 1171): imported memory content
is passage/document, not a query. Using embedQuery with asymmetric
providers (e.g. Jina) causes query-query comparison at recall time,
degrading retrieval quality.
- metadata: JSON.stringify the importedFrom object (line 1178):
MemoryEntry.metadata is typed as string in store.ts; passing a plain
object silently fails or produces unparseable data.
Minor fixes:
- workspaceEntries type: string[] -> Dirent[] (matches readdir withFileTypes)
- Hoist await import('node:fs/promises') out of loops: single import at
handler level replaces repeated per-iteration dynamic imports
Ref: CortexReach/pull/426
|
Hi @AliceLJY — all review items addressed in the latest push: P1 fixes:
Minor fixes:
Ready for re-review 🙏 |
|
Additionally, during local testing I found and fixed two extra issues beyond your original review: Extra fix 1 — fsPromises scope bug Extra fix 2 — workspace scope inference for flat memory/ Both fixes are included in the latest push. Please re-review 🙏 |
Review: REQUEST-CHANGESThe feature addresses a real gap — Markdown memories aren't in the LanceDB store so they're invisible to semantic recall. A few issues need fixing before this is mergeable. Must fix:
Worth considering (not blocking):
|
|
Hi @AliceLJY — addressed all must-fix items and worth-considering items in the latest push: Must fix:
Worth considering (all addressed): Please re-review 🙏 |
Add `memory-pro import-markdown` command to migrate existing Markdown memories (MEMORY.md, memory/YYYY-MM-DD.md) into the plugin LanceDB store for semantic recall. This addresses Issue CortexReach#344 by providing a migration path from the Markdown layer to the plugin memory layer.
…fig options + tests ## 實作改善(相對於原本的 PR CortexReach#426) ### 新增 CLI 選項 - --dedup:啟用 scope-aware exact match 去重(避免重複匯入) - --min-text-length <n>:設定最短文字長度門檻(預設 5) - --importance <n>:設定匯入記憶的 importance 值(預設 0.7) ### Bug 修復 - UTF-8 BOM 處理:讀檔後主動移除 \ufeFF prefix - CRLF 正規化:改用 split(/\r?\n/) 同時支援 CRLF 和 LF - Bullet 格式擴展:從只支援 '- ' 擴展到支援 '- '、'* '、'+ ' 三種 ### 新增測試 - test/import-markdown/import-markdown.test.mjs:完整單元測試 - BOM handling - CRLF normalization - Extended bullet formats (dash/star/plus) - minTextLength 參數 - importance 參數 - Dedup logic(scope-aware exact match) - Dry-run mode - Continue on error ### 分析文件 - test/import-markdown/ANALYSIS.md:完整分析報告 - 效益分析(真實檔案 655 筆記錄實測) - 3 個程式碼缺口分析 - 建議的 5 個新 config 欄位 - 功能條列式說明 - test/import-markdown/recall-benchmark.py:實際 LanceDB 查詢對比腳本 - 實測結果:7/8 個關鍵字在 Markdown 有但 LanceDB 找不到 - 證明 import-markdown 的實際價值 ## 實測效果(真實記憶檔案) - James 的 workspace:MEMORY.md(20 筆)+ 30 個 daily notes(633 筆)= 653 筆記錄 - 無 dedup:每次執行浪費 50%(重複匯入) - 有 dedup:第二次執行 100% skip,節省 644 次 embedder API 呼叫 - 關鍵字對比:7/8 個測試關鍵字在 Markdown 有、LanceDB 無 ## 建議新增的 Config(共 5 項,預設值 = 現在行為,向下相容) - importMarkdown.dedup: boolean = false - importMarkdown.defaultScope: string = global - importMarkdown.minTextLength: number = 5 - importMarkdown.importanceDefault: number = 0.7 - importMarkdown.workspaceFilter: string[] = [] Closes: PR CortexReach#426 (CortexReach/memory-lancedb-pro)
P1 fixes:
- embedQuery -> embedPassage (lines 1001, 1171): imported memory content
is passage/document, not a query. Using embedQuery with asymmetric
providers (e.g. Jina) causes query-query comparison at recall time,
degrading retrieval quality.
- metadata: JSON.stringify the importedFrom object (line 1178):
MemoryEntry.metadata is typed as string in store.ts; passing a plain
object silently fails or produces unparseable data.
Minor fixes:
- workspaceEntries type: string[] -> Dirent[] (matches readdir withFileTypes)
- Hoist await import('node:fs/promises') out of loops: single import at
handler level replaces repeated per-iteration dynamic imports
Ref: CortexReach/pull/426
The const fsPromises declaration was inside the try block, making it scoped to that block only. Subsequent fsPromises.stat() calls in MEMORY.md and memory/ processing code were failing with 'fsPromises is not defined'. Move declaration to handler scope.
Scans the flat \workspace/memory/\ directory (directly under workspace root, not inside any workspace subdirectory) and imports entries with scope='memory'. This supports the actual OpenClaw structure where memory files live directly in workspace/memory/.
Before scanning, read openclaw.json agents list to find the agent whose workspace path matches the current workspaceDir. Use that agent's id as workspaceScope for flat memory/ entries instead of defaulting to 'memory'. Falls back to 'shared' when no matching agent is found (e.g. shared workspace with no dedicated agent).
Must fix: - Flat memory scan: move before the mdFiles.length===0 early return so it is always reachable (not just when nested workspaces are empty) - Tests: runImportMarkdown now uses embedPassage (not embedQuery) and JSON.stringify(metadata) to match production. Added embedPassage mock. - Tests: setupWorkspace now creates files at workspace/<name>/ to match the actual path structure runImportMarkdown expects Worth considering: - Flat memory scan now skips when workspaceGlob is set, avoiding accidental root flat memory import when user specifies --workspace - Removed dev artifacts: ANALYSIS.md and recall-benchmark.py contained personal absolute paths and are not suitable for repo commit
Before: --dry-run skipped dedup check entirely, so --dry-run --dedup would overcount imports (items counted as imported even if dedup would skip them). After: dedup check runs regardless of dry-run mode. In dry-run, items that would be skipped by dedup are counted as skipped, not imported. Restores the dry-run console log message.
054ae85 to
9c39480
Compare
|
Branch rebased onto latest upstream/master (8 commits replayed cleanly, no conflicts). Ready for re-review 🙏 |
CI Failure AnalysisThe What failedRoot causeIn if (config.selfImprovement?.beforeResetNote !== false) {
api.registerHook("command:new", appendSelfImprovementNote, {...});
}When
This causes PR #426 is not responsiblePR #426 only modifies FixAuthor jlin53882's branch See also: #405 |
AliceLJY
left a comment
There was a problem hiding this comment.
Re-reviewed after 8-commit update. All feedback addressed:
- ✅ embedQuery → embedPassage (production code + test mock split correctly)
- ✅ metadata serialized via JSON.stringify
- ✅ Flat workspace/memory/ scan moved before early return
- ✅ fsPromises hoisted out of loop
- ✅ --dry-run now runs dedup check first
- ✅ Rebased onto latest main
@rwmjhb your REQUEST_CHANGES items also look addressed — please verify and merge when ready.
|
Thanks for the quick turnaround — flat scan, embedPassage, metadata stringify, dry-run dedup, filter, and dev artifact cleanup all confirmed fixed. A few new issues surfaced in the updated code: Must fix:
Minor:
|
Must fix: - Source scopes discovered but discarded: scanner now falls back to per-file discovered scope instead of collapsing all workspaces into "global". Prevents cross-workspace leakage and incorrect dedup across workspaces. - Scanner only descended one level: now also scans workspace/agents/<id>/ for nested agent workspaces (e.g. workspace/agents/theia/MEMORY.md). Minor fixes: - NaN guardrails: --min-text-length and --importance now use clampInt and Number.isFinite to prevent invalid values from silently passing. - Tests reimplement import logic: runImportMarkdown is now exported from cli.ts and tests call the production handler directly instead of a standalone copy. Prevents logic drift between tests and production. Refs: PR CortexReach#426 review feedback
|
All 4 issues from your review have been addressed in commit 9658f53: Must fix:
Minor fixes:
Please take another look when you have time. |
* docs: clarify dual-memory architecture (fixes #344) Add two improvements addressing Issue #344: 1. README: add Dual-Memory Architecture section explaining Plugin Memory (LanceDB) vs Markdown Memory distinction 2. index.ts: log dual-memory warning on plugin startup Refs: #344 * Address AliceLJY review comments: restore cli.ts, remove import-markdown reference, restore removed README sections - Restore cli.ts (was accidentally deleted, all CLI commands preserved) - Remove import-markdown command reference from dual-memory section (lives in PR #426) - Restore beta.10 version banner and OpenClaw 2026.3+ badge - Restore Auto-recall timeout tuning FAQ section Ref: #367
PR #426 分析與改善:feat: add import-markdown CLI command
📌 摘要
本 PR 為
import-markdownCLI 子命令的完整分析與改善,包含單元測試、實際效益驗證、以及 3 個程式碼缺口的修復。✅ 實作改善(相對於原本的 PR #426)
新增 CLI 選項
--dedupfalse--min-text-length <n>5--importance <n>0.7Bug 修復
\uFEFFprefix(Windows 記事本產生的檔案)split(/\r?\n/)同時支援 CRLF 和 LF-擴展到支援-、*、+三種標準 Markdown bullet🧪 測試項目(共 12 項,全部通過)
*、+)📊 實際效益驗證(真實資料)
測試資料:
~/.openclaw/workspace-dc-channel--1476866394556465252/Scenario A:無 dedup(現在的行為)
Scenario B:有 dedup(加功能後的行為)
關鍵字對比(LanceDB vs Markdown)
🔧 程式碼缺口修復(3 個)
缺口 1:其他 Markdown bullet 格式不支援
根因: 只檢查
line.startsWith("- ")修法:
/^[-*+]\s/.test(line)缺口 2:UTF-8 BOM 破壞第一行解析
根因: Windows 編輯器產生的檔案帶 BOM (
\uFEFF)修法:
content.replace(/^\uFEFF/, "")缺口 3:CRLF 行結尾
\r殘留根因: Windows 行結尾是
\r\n修法:
content.split(/\r?\n/)📋 建議新增的 Config 欄位(共 5 項)
importMarkdown.dedupfalseimportMarkdown.defaultScope"global"importMarkdown.minTextLength5importMarkdown.importanceDefault0.7importMarkdown.workspaceFilter[](全部掃)📁 新增檔案
test/import-markdown/import-markdown.test.mjs— 完整單元測試test/import-markdown/ANALYSIS.md— 完整分析報告test/import-markdown/recall-benchmark.py— 實際 LanceDB 查詢對比腳本🔗 相關連結