feat(importers): read full Claude Code transcripts from ~/.claude/projects #40
Open
seilk wants to merge 1 commit into
…jects

Claude Code stores rich session transcripts (user prompts + assistant responses + tool calls) at ~/.claude/projects/<encoded-cwd>/<id>.jsonl. The previous ClaudeCodeImporter only read ~/.claude/history.jsonl, which is a flat log of *user prompts only*, with no assistant responses. That meant Claude Code was the only sessiondb source that produced unpaired examples, while Copilot and Hermes both yielded (task_input, assistant_response) pairs. Downstream consumers (RelevanceFilter, build_dataset_from_external) already plumb assistant_response through, so the data shape gap was the only blocker.

Changes:
- Extend ClaudeCodeImporter with PROJECTS_DIR + _extract_from_projects.
- Add _parse_claude_code_session helper, mirroring _parse_copilot_events. Handles user/assistant interleaving, tool_use/tool_result skipping, multi-block assistant turns, malformed JSON, and secret redaction.
- New `source` arg on extract_messages: "auto" (default, prefers projects/, falls back to history.jsonl), "projects", or "history".
- Existing tests updated to pass `source="history"` (now explicit), plus 12 new tests covering pair extraction, tool-result skipping, secret filtering, multi-session walking, limits, and auto fallback.

Verified on real ~/.claude/projects/ data: yields paired examples with both task_input and assistant_response fields. Closes the data-quality gap noted in NousResearch#3 for Claude Code users.
Summary
`ClaudeCodeImporter` previously only read `~/.claude/history.jsonl`, which logs user prompts only, with no assistant responses. That made Claude Code the one sessiondb source producing unpaired examples, while Copilot and Hermes both yielded `(task_input, assistant_response)` pairs.

Modern Claude Code (≥ 2.x) writes full session transcripts to `~/.claude/projects/<encoded-cwd>/<session-id>.jsonl`, with interleaved `user`, `assistant`, `attachment`, `permission-mode`, etc. records. Downstream code (`RelevanceFilter`, `build_dataset_from_external`) already plumbs `assistant_response` through (see `external_importers.py` lines 504–505), so the data-shape gap was the only blocker.

This PR closes that gap.
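
To make the pairing idea concrete, here is a minimal sketch of turning one transcript file into `(task_input, assistant_response)` pairs. It is not the PR's `_parse_claude_code_session`: `pair_transcript`, its helpers, and the `min_prompt_chars` threshold are hypothetical names chosen for illustration. The record layout (a `type` field plus a `message.content` that is either a string or a list of typed blocks) is an assumption about the transcript format. Secret detection is left out entirely.

```python
import json
from pathlib import Path
from typing import Iterator


def _text_of(content) -> str:
    """Join the text blocks of a message; tool_use/tool_result blocks are ignored."""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        parts = [b.get("text", "") for b in content
                 if isinstance(b, dict) and b.get("type") == "text"]
        return "\n".join(p for p in parts if p)
    return ""


def _is_tool_result(content) -> bool:
    """True when a 'user' record only carries tool output back to the model."""
    return isinstance(content, list) and any(
        isinstance(b, dict) and b.get("type") == "tool_result" for b in content
    )


def pair_transcript(path: Path, min_prompt_chars: int = 10) -> Iterator[dict]:
    """Yield paired examples from one session file (hypothetical helper).

    Assumed record shape: {"type": "user" | "assistant" | ..., "message": {"content": ...}}.
    """
    prompt, reply = None, []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # malformed or blank line: skip it
            if not isinstance(record, dict):
                continue
            kind = record.get("type")
            message = record.get("message")
            content = message.get("content") if isinstance(message, dict) else None
            if kind == "user":
                if _is_tool_result(content):
                    continue  # tool output, not a new prompt
                if prompt is not None and reply:
                    yield {"task_input": prompt,
                           "assistant_response": "\n".join(reply)}
                text = _text_of(content).strip()
                # Short prompts are dropped; their replies are ignored too.
                prompt = text if len(text) >= min_prompt_chars else None
                reply = []
            elif kind == "assistant" and prompt is not None:
                text = _text_of(content).strip()
                if text:
                    reply.append(text)  # accumulate across consecutive turns
    if prompt is not None and reply:
        yield {"task_input": prompt, "assistant_response": "\n".join(reply)}
```

Because a `tool_result` record does not reset the current prompt, assistant text on either side of a tool call ends up in one response, which is the behavior the commit message describes for multi-block assistant turns.
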
What changed
evolution/core/external_importers.py

- `ClaudeCodeImporter.PROJECTS_DIR = ~/.claude/projects`.
- `extract_messages(limit, source=...)` gains a `source` arg:
  - `"auto"` (default): prefer `projects/`, fall back to `history.jsonl`
  - `"projects"`: transcripts only
  - `"history"`: flat log only (legacy semantics)
- `_parse_claude_code_session(path, project)` helper, mirroring `_parse_copilot_events`. It walks records in order, tracks the current user prompt, accumulates `text` blocks across consecutive assistant turns (so tool-call interleavings still produce one clean response), skips `tool_result` user records and short prompts, and rejects pairs containing detected secrets in either side.

tests/core/test_external_importers.py

- `TestClaudeCodeImporter` tests updated to pass `source="history"` so they remain isolated from the new auto behavior.
- `TestClaudeCodeProjectsImporter` class with 12 tests covering:
  - `tool_result` user records
  - `limit` parameter
  - `projects/` directory
  - `auto` mode prefers projects, and falls back to history when projects is empty

Verification
Smoke-tested on a real `~/.claude/projects/` dataset (32k+ session messages). The run yields paired examples with both `task_input` and `assistant_response` fields.

Compatibility
- `extract_messages()` with no args returns paired data when `projects/` exists, identical legacy data otherwise. Existing downstream consumers (`build_claude_code_examples`, `RelevanceFilter.score`) already use `msg.get("assistant_response", "")`, so they handle both shapes.
- `source="history"` preserves the exact prior behavior for any caller that wants it.

Why this matters for Claude Code users
The repo's pitch is sessiondb-based skill evolution. With this fix, Claude Code becomes a first-class data source on par with Hermes and Copilot, instead of a degraded user-only fallback. The richer data also unlocks better LLM relevance scoring downstream: the scorer already accepts `assistant_response` as input but was always getting an empty string from Claude Code.

Related context: the original sessiondb proposal (#3) and the recent sessiondb-quality fixes in #26, neither of which addressed the `projects/` transcript source.
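
For reviewers who want the `source` semantics spelled out, the following is a minimal sketch of how `"auto"` could resolve to one of the two concrete sources. `resolve_source` and `VALID_SOURCES` are hypothetical names, not part of the PR. Raising `ValueError` on an unknown value is an assumption, and so is deciding the fallback by checking for `<encoded-cwd>/*.jsonl` files; the actual implementation may instead fall back when extraction returns nothing.

```python
from pathlib import Path

VALID_SOURCES = ("auto", "projects", "history")  # the three values named in this PR


def resolve_source(source: str, projects_dir: Path) -> str:
    """Map the requested source to the concrete one to read (hypothetical helper).

    "projects" and "history" are returned unchanged. "auto" prefers projects/
    when it holds at least one transcript and otherwise falls back to history.
    """
    if source not in VALID_SOURCES:
        # Assumption: the PR does not say how unknown values are handled.
        raise ValueError(f"unknown source {source!r}; expected one of {VALID_SOURCES}")
    if source != "auto":
        return source
    # Transcripts live one level down: <projects_dir>/<encoded-cwd>/<id>.jsonl
    has_transcripts = projects_dir.is_dir() and any(projects_dir.glob("*/*.jsonl"))
    return "projects" if has_transcripts else "history"
```

Under these assumptions, a caller on a machine without `~/.claude/projects` gets `"history"` from `resolve_source("auto", Path.home() / ".claude" / "projects")`, which matches the compatibility promise that legacy data is returned when `projects/` is absent.
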