
feat: add import-memories command for markdown chat exports #149

Open
danielaustralia1 wants to merge 1 commit into Martian-Engineering:main from danielaustralia1:feat/import-memories

Conversation

@danielaustralia1

Summary

Adds a new lcm-tui import-memories <path> CLI command that bulk-imports markdown conversation files from external AI chat platforms (ChatGPT, Claude, OpenClaw, etc.) into the LCM database. This enables users to bring their historical AI conversations into LCM's memory system, making them searchable via lcm_grep, inspectable via lcm_describe, and expandable via lcm_expand_query.

Motivation

Users accumulate valuable conversation history across multiple AI platforms. Currently, LCM only ingests conversations that happen within OpenClaw sessions. This feature bridges that gap by allowing users to import their existing chat exports — organized as markdown files in folders — directly into LCM's database where they become part of the agent's long-term memory.

What it does

New CLI command

# Dry-run: see what would be imported (no database needed)
lcm-tui import-memories ~/memories

# Import for real
lcm-tui import-memories ~/memories --apply

# Force re-import + optional compaction
lcm-tui import-memories ~/memories --apply --force --compact --model claude-opus-4

Directory structure mapping

memories/
├── chatgpt/           → agent="chatgpt"
│   ├── debugging.md   → title="debugging", 1 conversation
│   └── refactoring.md → title="refactoring", 1 conversation
├── claude/            → agent="claude"
│   ├── 2026-02-18 - API access.md → title="2026-02-18 - API access"
│   └── 2026-02-17 - Setup.md      → title="2026-02-17 - Setup"
└── openclaw/          → agent="openclaw"
    └── ...
  • Each .md file → one conversation in the database
  • Parent folder name → agent identifier (overridable with --agent)
  • Filename (without extension) → conversation title
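The path-to-metadata mapping above can be sketched in Go. Helper names here are illustrative, not the PR's actual identifiers:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// deriveAgent returns the name of the folder directly under root,
// or "" for files sitting at the root itself (no agent folder).
func deriveAgent(path, root string) string {
	rel, err := filepath.Rel(root, path)
	if err != nil {
		return ""
	}
	parts := strings.Split(filepath.ToSlash(rel), "/")
	if len(parts) < 2 {
		return "" // root-level file
	}
	return parts[0]
}

// deriveTitle strips the extension from the filename.
func deriveTitle(path string) string {
	base := filepath.Base(path)
	return strings.TrimSuffix(base, filepath.Ext(base))
}

func main() {
	fmt.Println(deriveAgent("memories/chatgpt/debugging.md", "memories")) // chatgpt
	fmt.Println(deriveTitle("memories/claude/2026-02-18 - Setup.md"))     // 2026-02-18 - Setup
}
```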

Markdown turn parser

Supports 5 marker styles detected automatically per file:

| Style | Example |
| --- | --- |
| h4 heading | `#### You:` / `#### ChatGPT` |
| h2 heading | `## User` / `## Assistant` |
| h3 heading | `### Human` / `### Assistant` |
| Bold | `**User:**` / `**Assistant:**` |
| Inline | `Human:` / `Assistant:` |

Role mapping: User/Human/You → user, Assistant/AI/ChatGPT/Claude → assistant, System → system

Fallback: Files without recognized markers are imported as a single assistant message (useful for knowledge documents).
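A minimal sketch of marker matching and role normalization, assuming regex-based per-line detection (two of the five styles shown; names and the unknown-label fallback are assumptions, not the PR's actual code):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Two of the five marker styles as per-line regexes. A real parser
// would also skip lines inside fenced code blocks.
var (
	h4Marker   = regexp.MustCompile(`^####\s+(.+?):?\s*$`)
	boldMarker = regexp.MustCompile(`^\*\*(.+?):\*\*\s*$`)
)

// mapRole normalizes speaker labels to the three stored roles.
func mapRole(label string) string {
	switch strings.ToLower(strings.TrimSpace(label)) {
	case "user", "human", "you":
		return "user"
	case "assistant", "ai", "chatgpt", "claude":
		return "assistant"
	case "system":
		return "system"
	}
	return "assistant" // assumed fallback for unknown labels
}

func main() {
	if m := h4Marker.FindStringSubmatch("#### You:"); m != nil {
		fmt.Println(mapRole(m[1])) // user
	}
	if m := boldMarker.FindStringSubmatch("**Assistant:**"); m != nil {
		fmt.Println(mapRole(m[1])) // assistant
	}
}
```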

Timestamp extraction (3-tier fallback)

  1. <time> HTML tags — ChatGPT exports include <time datetime="2025-11-19T05:05:22.247Z"> per message. These are parsed for per-message precision and stripped from content.
  2. Filename date — Claude exports use 2026-02-18 - Title.md naming. The YYYY-MM-DD prefix is extracted as the conversation date, with messages spaced 1 minute apart.
  3. Import time — Falls back to current UTC time if no date source is available.

Additional features

  • YAML frontmatter stripping — ChatGPT exports include --- delimited frontmatter (title, source URL). This is stripped before parsing so it doesn't pollute message content.
  • UTF-8 BOM handling — Windows-exported files with BOM prefix are handled correctly.
  • Code block awareness — Turn markers inside fenced code blocks (```) are ignored, preventing false splits.
  • Deduplication — Deterministic session IDs from file paths. Re-running the command skips already-imported files.
  • Force re-import — the --force flag deletes existing conversation data and re-imports from the source file.
  • Optional compaction — the --compact flag runs the existing backfill compaction engine on each imported conversation to build the summary DAG immediately.
  • Dry-run by default — Shows what would be imported without touching the database. Works even without a database file.
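The deterministic session IDs behind the deduplication feature can be sketched as a SHA-256 over the file's path (what exactly gets hashed is an assumption; the PR's helper is memorySessionID):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// memorySessionID hashes the file path into a stable "import:<sha256>"
// session_id, so re-running the import maps each file to the same
// conversation and already-imported files can be skipped.
func memorySessionID(relPath string) string {
	sum := sha256.Sum256([]byte(relPath))
	return "import:" + hex.EncodeToString(sum[:])
}

func main() {
	a := memorySessionID("chatgpt/debugging.md")
	b := memorySessionID("chatgpt/debugging.md")
	fmt.Println(a == b) // true: stable across runs
	fmt.Println(len(a)) // 71: "import:" plus 64 hex chars
}
```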

CLI flags

Flags:
  --apply              actually import (default: dry-run)
  --force              re-import files even if already imported
  --agent <name>       override agent name (default: parent folder name)
  --db <path>          database path (default: ~/.openclaw/lcm.db)
  --compact            run compaction after importing each conversation
  --provider <id>      API provider for compaction
  --model <id>         API model for compaction
  --leaf-chunk-tokens  max input tokens per leaf chunk (default 20000)
  --leaf-target-tokens target output tokens for leaf summaries (default 1200)
  --condensed-target-tokens target for condensed summaries (default 2000)
  --leaf-fanout        minimum leaf summaries before condensation (default 8)
  --condensed-fanout   minimum summaries before d2+ condensation (default 4)
  --hard-fanout        minimum summaries in forced root fold (default 2)
  --fresh-tail         freshest raw messages to preserve (default 32)
  --prompt-dir         custom prompt template directory

Implementation details

Files changed

| File | Change | Lines |
| --- | --- | --- |
| tui/import-memories.go | New | 638 |
| tui/import_memories_test.go | New | 819 |
| tui/main.go | Modified | +7 |

Architecture

The implementation reuses the existing backfill infrastructure:

  • applyBackfillImport() — conversation creation, message insertion, FTS indexing, message part creation
  • inspectBackfillImportPlan() — deduplication checks
  • runBackfillCompaction() — optional post-import summary DAG construction
  • backfillMessage struct — message representation
  • backfillSessionInput struct — import input packaging

New code is limited to:

  • Directory walking and .md file discovery
  • Markdown turn marker detection and parsing
  • Timestamp extraction from <time> tags and filenames
  • YAML frontmatter stripping
  • Conversation data deletion for --force re-import
  • Agent name derivation from folder structure

Database integration

Imported conversations use the same schema as native backfill:

  • conversations table with session_id = "import:<sha256-hash>" for deterministic dedup
  • messages + context_items + message_parts tables for full message storage
  • messages_fts for full-text search indexing

Test plan

44 tests covering:

Parser tests (20)

  • All 5 marker styles: h4, h2, h3, bold, inline
  • User's exact formats: #### You: / #### ChatGPT and **Human:** / **Assistant:**
  • Role mappings: User/Human/You/Assistant/AI/ChatGPT/Claude/System + unknown fallback
  • Markers inside fenced code blocks are ignored
  • Multi-line content spanning multiple paragraphs
  • Code blocks within messages preserved
  • Headings with optional trailing colon
  • Pre-marker content (preamble/title) discarded
  • Empty file → 0 messages
  • Whitespace-only file → 0 messages
  • No markers → single assistant fallback message
  • Empty turns between markers skipped
  • Single turn with no response
  • Sequential seq numbering

Timestamp tests (7)

  • <time datetime="..."> tag extraction with nanosecond precision
  • <time> tag stripped from message content
  • Filename date extraction (YYYY-MM-DD - Title.md)
  • Fallback: no <time> tags → uses filename date
  • Fallback: no date anywhere → uses current time
  • extractTimeTag() with valid, missing, and malformed tags
  • extractFilenameDate() with dated and undated filenames

Frontmatter & encoding tests (5)

  • YAML frontmatter stripped from content
  • Frontmatter with BOM prefix
  • Unclosed frontmatter left intact
  • UTF-8 BOM stripped — first marker matches correctly
  • ChatGPT full format: frontmatter + h1 title + #### You: markers + <time> tags

Infrastructure tests (12)

  • memorySessionID determinism and prefix
  • deriveAgentName: nested folders, root-level files, deep nesting
  • discoverMarkdownFiles: finds .md, ignores .txt, case-insensitive extension
  • parseImportMemoriesArgs: valid path, --apply mode, missing path, non-directory, non-existent
  • deleteConversationData: clears all related tables (messages, context_items, message_parts, FTS)
  • Force re-import: delete + re-import end-to-end
  • End-to-end: discover → parse → import → verify DB → idempotency check
  • FTS indexing: rows created, searchable, cleaned up on delete
  • Dry-run makes no database writes

Tested against real data

Successfully dry-run tested against 824 real markdown files (7,120 messages) across multiple folders:

  • ChatGPT exports with <time> tags and YAML frontmatter
  • Claude exports with YYYY-MM-DD filename dates
  • Multiple agent folders with varied conversation lengths (1–80 messages per file)
  • Zero parsing warnings, zero skipped files

🤖 Generated with Claude Code

Add a new `lcm-tui import-memories` CLI command that imports markdown
conversation files from external AI chat exports (ChatGPT, Claude, etc.)
into the LCM database as searchable conversations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
