feat(importers): read full Claude Code transcripts from ~/.claude/projects #40

Open
seilk wants to merge 1 commit into NousResearch:main from seilk:feat/claude-code-projects-importer

seilk commented Apr 26, 2026

Summary

ClaudeCodeImporter previously only read ~/.claude/history.jsonl, which logs user prompts only — no assistant responses. That made Claude Code the one sessiondb source producing unpaired examples while Copilot and Hermes both yielded (task_input, assistant_response) pairs.

Modern Claude Code (≥ 2.x) writes full session transcripts to ~/.claude/projects/<encoded-cwd>/<session-id>.jsonl, interleaving user, assistant, attachment, permission-mode, and other record types. Downstream code (RelevanceFilter, build_dataset_from_external) already plumbs assistant_response through (see external_importers.py lines 504–505), so the data-shape gap was the only blocker.
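
To make the directory layout concrete, here is a minimal sketch of walking it. `iter_session_files` is a hypothetical helper written for illustration, not the PR's `_extract_from_projects`; it assumes only the layout stated above (one subdirectory per encoded cwd, one `.jsonl` file per session).

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_session_files(projects_dir: Path) -> Iterator[Tuple[str, Path]]:
    """Yield (project_name, session_path) for every transcript under projects_dir.

    Hypothetical helper: assumes one subdirectory per encoded cwd, each
    holding one .jsonl file per session.
    """
    if not projects_dir.is_dir():
        return  # missing directory: yield nothing instead of raising
    # Sort both levels so the walk order is deterministic across runs.
    for project in sorted(p for p in projects_dir.iterdir() if p.is_dir()):
        for session in sorted(project.glob("*.jsonl")):
            yield project.name, session
```

Calling `list(iter_session_files(Path.home() / ".claude" / "projects"))` would list every session on the local machine, or an empty list if the directory does not exist.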

This PR closes that gap.

What changed

evolution/core/external_importers.py

  • New ClaudeCodeImporter.PROJECTS_DIR = ~/.claude/projects.
  • extract_messages(limit, source=...) gains a source arg:
    • "auto" (default) — prefer projects/, fall back to history.jsonl
    • "projects" — transcripts only
    • "history" — flat log only (legacy semantics)
  • New _parse_claude_code_session(path, project) helper, mirroring _parse_copilot_events. It walks records in order and tracks the current user prompt. Text blocks are accumulated across consecutive assistant turns, so tool-call interleavings still produce one clean response. It skips tool_result user records and short prompts, and rejects pairs containing detected secrets on either side.
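
The pairing logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the record shape (`type` plus `message.content` as a string or a list of typed blocks), the `MIN_PROMPT_CHARS` threshold, and the `SECRET_RE` pattern are all hypothetical stand-ins.

```python
import json
import re
from typing import Dict, Iterable, List, Optional

MIN_PROMPT_CHARS = 10  # hypothetical threshold for a "short prompt"
# Hypothetical secret detector; the importer's real patterns are not shown in this PR.
SECRET_RE = re.compile(r"(sk-[A-Za-z0-9]{16,}|AKIA[0-9A-Z]{16})")


def _text_of(content) -> Optional[str]:
    """Join the text blocks of a message; return None for tool_result content."""
    if isinstance(content, str):
        return content.strip()
    if not isinstance(content, list):
        return ""
    blocks = [b for b in content if isinstance(b, dict)]
    if any(b.get("type") == "tool_result" for b in blocks):
        return None
    texts = [b.get("text", "") for b in blocks if b.get("type") == "text"]
    return "\n".join(t.strip() for t in texts if t.strip())


def _emit(pairs: List[Dict[str, str]], prompt: Optional[str], parts: List[str]) -> None:
    """Append a pair only if both sides are non-empty and neither contains a secret."""
    reply = "\n".join(parts)
    if prompt and reply and not SECRET_RE.search(prompt + "\n" + reply):
        pairs.append({"task_input": prompt, "assistant_response": reply})


def parse_session_sketch(lines: Iterable[str]) -> List[Dict[str, str]]:
    pairs: List[Dict[str, str]] = []
    prompt: Optional[str] = None
    parts: List[str] = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed JSONL line: skip it
        if not isinstance(record, dict):
            continue
        message = record.get("message")
        content = message.get("content") if isinstance(message, dict) else None
        text = _text_of(content)
        if record.get("type") == "user":
            if text is None:
                continue  # tool_result record: belongs to the current turn
            _emit(pairs, prompt, parts)  # a new prompt closes the previous pair
            parts = []
            prompt = text if len(text) >= MIN_PROMPT_CHARS else None
        elif record.get("type") == "assistant" and prompt is not None and text:
            parts.append(text)  # accumulate text across consecutive assistant turns
        # any other record type (attachment, permission-mode, ...) is ignored
    _emit(pairs, prompt, parts)  # close the final pair at end of file
    return pairs
```

In this sketch a pair is closed only when the next real user prompt arrives or the file ends, which is what lets a text, tool_use, tool_result, text sequence collapse into a single response.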

tests/core/test_external_importers.py

  • Existing TestClaudeCodeImporter tests updated to pass source="history" so they remain isolated from the new auto behavior.
  • New TestClaudeCodeProjectsImporter class with 12 tests covering:
    • paired turn extraction
    • multi-block assistant concatenation (text → tool_use → text)
    • skipping tool_result user records
    • secret redaction on either side of the pair
    • short prompt filtering
    • prompts with no assistant text (dropped rather than emitted as empty pairs)
    • malformed JSONL lines
    • multi-project, multi-session walking
    • limit parameter
    • missing projects/ directory
    • auto mode prefers projects, and falls back to history when projects is empty
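
Tests of this kind need fake transcripts on disk. One way to build them is sketched below; `write_session` is a hypothetical fixture helper for illustration and is not taken from the PR's test file.

```python
import json
from pathlib import Path
from typing import Iterable, Union

# A str record is written verbatim, which allows deliberately malformed lines.
Record = Union[dict, str]


def write_session(root: Path, project: str, session_id: str,
                  records: Iterable[Record]) -> Path:
    """Write one fake transcript to root/<project>/<session_id>.jsonl."""
    session_path = root / project / f"{session_id}.jsonl"
    session_path.parent.mkdir(parents=True, exist_ok=True)
    with session_path.open("w", encoding="utf-8") as fh:
        for record in records:
            line = record if isinstance(record, str) else json.dumps(record)
            fh.write(line + "\n")
    return session_path
```

With pytest, `root` would typically be the built-in `tmp_path` fixture, and the importer's projects directory would be pointed at it for the duration of the test.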

Verification

$ pytest tests/ -q
152 passed, 11 warnings in 53.05s

Smoke-tested on a real ~/.claude/projects/ dataset (32k+ session messages):

ClaudeCodeImporter.extract_messages(limit=5, source='projects')
# => 5 pairs with both task_input and assistant_response populated

Compatibility

  • Backwards-compatible: extract_messages() with no args returns paired data when projects/ exists, identical legacy data otherwise. Existing downstream consumers (build_claude_code_examples, RelevanceFilter.score) already use msg.get("assistant_response", ""), so they handle both shapes.
  • source="history" preserves the exact prior behavior for any caller that wants it.
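
The source selection and its fallback can be illustrated with a small sketch. `extract_with_fallback` and its two reader callables are hypothetical stand-ins for the importer's internals, and raising `ValueError` on an unknown `source` is an assumption rather than documented behavior.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def extract_with_fallback(
    source: str,
    read_projects: Callable[[], List[Message]],
    read_history: Callable[[], List[Message]],
) -> List[Message]:
    """Pick a reader according to `source`; "auto" prefers projects/ transcripts."""
    if source == "projects":
        return read_projects()
    if source == "history":
        return read_history()
    if source == "auto":
        # An empty result from projects/ triggers the legacy history.jsonl path.
        return read_projects() or read_history()
    raise ValueError(f"unknown source: {source!r}")  # assumption, see lead-in
```

Because consumers read the response with `msg.get("assistant_response", "")`, legacy messages that lack the key behave like paired messages with an empty response.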

Why this matters for Claude Code users

The repo's pitch is sessiondb-based skill evolution. With this fix, Claude Code becomes a first-class data source on par with Hermes and Copilot, instead of a degraded user-only fallback. The richer data also unlocks better LLM relevance scoring downstream — the scorer already accepts assistant_response as input but was always getting an empty string from Claude Code.

Related context: the original sessiondb proposal (#3) and the recent sessiondb-quality fixes in #26, neither of which addressed the projects/ transcript source.

…jects

Claude Code stores rich session transcripts (user prompts + assistant
responses + tool calls) at ~/.claude/projects/<encoded-cwd>/<id>.jsonl.
The previous ClaudeCodeImporter only read ~/.claude/history.jsonl, which
is a flat log of *user prompts only* — no assistant responses.

That meant Claude Code was the only sessiondb source that produced
unpaired examples, while Copilot and Hermes both yielded
(task_input, assistant_response) pairs. Downstream consumers
(RelevanceFilter, build_dataset_from_external) already plumb
assistant_response through, so the data shape gap was the only blocker.

Changes:
- Extend ClaudeCodeImporter with PROJECTS_DIR + _extract_from_projects.
- Add _parse_claude_code_session helper, mirroring _parse_copilot_events.
  Handles user/assistant interleaving, tool_use/tool_result skipping,
  multi-block assistant turns, malformed JSON, and secret redaction.
- New `source` arg on extract_messages: "auto" (default, prefers
  projects/, falls back to history.jsonl), "projects", or "history".
- Existing tests updated to pass `source="history"` (now explicit), plus
  12 new tests covering pair extraction, tool-result skipping, secret
  filtering, multi-session walking, limits, and auto fallback.

Verified on real ~/.claude/projects/ data: yields paired examples with
both task_input and assistant_response fields.

Closes the data-quality gap noted in NousResearch#3 for Claude Code users.