feat(normalize): Gemini CLI session JSONL adapter #1234

Merged
igorls merged 3 commits into develop from feat/normalize-gemini-cli
Apr 27, 2026

@milla-jovovich
Collaborator

Adds a Gemini CLI format adapter to mempalace.normalize alongside the existing Claude Code, Codex, Claude.ai, ChatGPT, and Slack parsers. After this lands, mempalace mine --mode convos ingests Gemini CLI session history without manual export.

Why now: Claude Code and Codex CLI are already supported by convo_miner; adding Gemini closes the major-CLI-tool coverage gap. After this lands, the README's "verbatim conversation history" promise is honestly delivered for all three top-tier API-keyed coding CLIs (Claude Code, Codex CLI, Gemini CLI), not just two of them. This is the third leg of the trio Aya pushed for so the public claim matches the actual ingest pipeline.

Gemini CLI stores sessions at ~/.gemini/tmp/<project_hash>/chats/ as JSONL. The on-disk schema (per google-gemini/gemini-cli#15292):

{"type":"session_metadata","sessionId":"...","projectHash":"...",...}
{"type":"user","id":"msg1","content":[{"text":"Hello"}]}
{"type":"gemini","id":"msg2","content":[{"text":"Hi"}]}
{"type":"message_update","id":"msg2","tokens":{"input":10,"output":5}}

The new _try_gemini_jsonl parser:

  • requires a session_metadata record so it does not false-positive against Claude Code or Codex JSONL passing through the dispatch chain in _try_normalize_json
  • extracts user/gemini message text from each entry's content array of {"text": "..."} blocks, joining multiple blocks per message in order
  • skips message_update entries (token-count deltas with no message text) and any other unknown record types
  • returns None when fewer than two conversational messages are present, mirroring the codex parser's >=2-message guard
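
The four rules above can be sketched as follows. This is a minimal illustration and not the code merged in this PR: the function name `try_gemini_jsonl_sketch` and the return shape (a list of `(role, text)` tuples) are assumptions, since the description does not show the transcript format. Joining blocks with newlines follows the docstring quoted later in the review thread.

```python
import json
from typing import List, Optional, Tuple


def try_gemini_jsonl_sketch(content: str) -> Optional[List[Tuple[str, str]]]:
    """Illustrative stand-in for _try_gemini_jsonl (return shape is assumed)."""
    has_session_metadata = False
    messages: List[Tuple[str, str]] = []

    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate malformed lines instead of failing the file
        if not isinstance(entry, dict):
            continue

        kind = entry.get("type")
        if kind == "session_metadata":
            has_session_metadata = True  # sentinel that marks a Gemini session
            continue
        if kind not in ("user", "gemini"):
            continue  # message_update and unknown record types carry no text

        blocks = entry.get("content")
        if not isinstance(blocks, list):
            continue
        texts = [
            block["text"]
            for block in blocks
            if isinstance(block, dict)
            and isinstance(block.get("text"), str)
            and block["text"].strip()
        ]
        if texts:
            # Multiple text blocks are joined in order, one per line.
            messages.append((kind, "\n".join(texts)))

    # Without the sentinel, or with fewer than two turns, report "no match".
    if not has_session_metadata or len(messages) < 2:
        return None
    return messages


sample = "\n".join([
    '{"type":"session_metadata","sessionId":"s1","projectHash":"p1"}',
    '{"type":"user","id":"msg1","content":[{"text":"Hello"}]}',
    '{"type":"gemini","id":"msg2","content":[{"text":"Hi"}]}',
    '{"type":"message_update","id":"msg2","tokens":{"input":10,"output":5}}',
])
print(try_gemini_jsonl_sketch(sample))  # [('user', 'Hello'), ('gemini', 'Hi')]
```

Returning None rather than an empty result lets a caller fall through to the next parser in the dispatch chain.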

Test coverage: 9 new unit tests in tests/test_normalize.py mirroring the Codex test pattern - happy path, multi-turn, missing session metadata, message_update skip, single-message rejection, multi-block content concatenation, empty content skip, malformed-line resilience, and explicit no-match against Codex JSONL fixtures. These tests are schema-level only; fixtures from real Gemini CLI sessions are a follow-up once a real user file is available.

Closes part of #59 (the Gemini CLI portion of the umbrella request).

Checklist

  • Tests pass (python -m pytest tests/ -v)
  • No hardcoded paths
  • Linter passes (ruff check .)

igorls requested a review from Copilot April 27, 2026 18:54
@igorls
Member

igorls commented Apr 27, 2026

@copilot review this

Contributor

Copilot AI commented Apr 27, 2026

@copilot review this

Reviewed. Here's a summary of what I found:

Implementation (normalize.py)

  • _try_gemini_jsonl correctly requires a session_metadata sentinel before accepting input, which is the right guard against false positives on Claude Code and Codex JSONL files, which also use "type": "user" entries.
  • Dispatch order in _try_normalize_json is correct: Claude Code → Codex → Gemini → full JSON parsers.
  • Cross-format isolation verified programmatically — no parser steals another's input.
  • message_update entries (token-count only, no text) are correctly skipped.
  • Multi-block content arrays are concatenated in order.
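
A first-match dispatch chain of the kind described above might look like the sketch below. Every name here is hypothetical (`dispatch_first_match`, the `*_stub` parsers, and their toy detection rules); the actual body of _try_normalize_json is not shown in this thread. The point is only the ordering: each parser returns None when the input is not its format, and the first non-None result wins.

```python
from typing import Callable, List, Optional

# A parser returns a normalized transcript, or None for "not my format".
Parser = Callable[[str], Optional[str]]


def claude_code_stub(content: str) -> Optional[str]:
    # Toy detection rule for illustration, not the real heuristic.
    return "claude-code" if content.startswith("CLAUDE:") else None


def codex_stub(content: str) -> Optional[str]:
    # Toy detection rule for illustration, not the real heuristic.
    return "codex" if content.startswith("CODEX:") else None


def gemini_stub(content: str) -> Optional[str]:
    # Stands in for the session_metadata sentinel check.
    return "gemini" if '"type":"session_metadata"' in content else None


def dispatch_first_match(content: str, parsers: List[Parser]) -> Optional[str]:
    """Try each parser in order and return the first non-None result."""
    for parser in parsers:
        result = parser(content)
        if result is not None:
            return result
    return None


# Order mirrors the review summary: Claude Code, then Codex, then Gemini.
CHAIN: List[Parser] = [claude_code_stub, codex_stub, gemini_stub]

print(dispatch_first_match('{"type":"session_metadata"}', CHAIN))  # gemini
print(dispatch_first_match("unrecognized input", CHAIN))  # None
```

Because every parser in such a chain sees the same raw input, a strict sentinel check in each one is what keeps an earlier or later parser from claiming another format's file.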

Tests (test_normalize.py)

  • 9 new test cases covering: happy path, multi-turn, missing sentinel (returns None), skipped message_update, too-few-messages, multi-block content, empty content skipped, invalid JSON lines skipped, and Codex non-match guard.
  • All 116 tests pass.

Fix applied (a3e3691): The module-level docstring in normalize.py was missing "Gemini CLI JSONL" from the supported formats list — added it in line with the other entries.

Contributor

Copilot AI left a comment

Pull request overview

Adds a Gemini CLI JSONL format adapter to mempalace.normalize so mempalace mine --mode convos can ingest Gemini CLI session logs stored as JSONL without manual export, aligning normalization support with other major coding CLIs already supported.

Changes:

  • Add _try_gemini_jsonl() parser to detect/parse Gemini CLI JSONL sessions into transcript format.
  • Wire the new parser into _try_normalize_json() dispatch chain.
  • Add unit tests covering Gemini JSONL detection, parsing behaviors, and false-positive guards.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| mempalace/normalize.py | Adds Gemini CLI JSONL normalization and dispatch-chain integration. |
| tests/test_normalize.py | Adds unit tests validating Gemini JSONL parsing and non-matching behavior. |

Comment thread mempalace/normalize.py Outdated
Comment on lines +297 to +301
Detection requires a ``session_metadata`` record so this parser does
not false-positive against Claude Code or Codex JSONL passed through
the dispatch chain. ``message_update`` entries are skipped — they
have no message text. Multiple text blocks within a single message's
content array are concatenated in order, separated by newlines.

Copilot AI Apr 27, 2026

The docstring states session_metadata is on the first line, but the implementation accepts session_metadata anywhere and will also collect user/gemini messages that appear before it. To match the stated schema and strengthen the false-positive guard, require the first non-empty parsed entry to be type=="session_metadata" (or ignore messages until session_metadata has been seen).

@igorls
Member

igorls commented Apr 27, 2026

@copilot apply changes based on the comments in this thread

Contributor

Copilot AI commented Apr 27, 2026

@copilot apply changes based on the comments in this thread

Done in e7fe6ca. Added if not has_session_metadata: continue before the user/gemini collection block so any turns appearing before the sentinel are discarded as preamble. Updated the docstring to document this behavior, and added test_gemini_jsonl_messages_before_session_metadata_discarded to cover it explicitly. All 117 tests pass.
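
The effect of that guard can be illustrated with a small sketch that works on already-parsed entries. The helper name `collect_turns_after_sentinel` and the `(role, text)` tuple shape are assumptions for illustration; only the `if not has_session_metadata: continue` line is taken from the comment above.

```python
from typing import Any, Dict, List, Tuple


def collect_turns_after_sentinel(
    entries: List[Dict[str, Any]],
) -> List[Tuple[str, str]]:
    """Keep only user/gemini turns that appear after session_metadata."""
    has_session_metadata = False
    turns: List[Tuple[str, str]] = []

    for entry in entries:
        kind = entry.get("type")
        if kind == "session_metadata":
            has_session_metadata = True
            continue
        if not has_session_metadata:
            continue  # anything before the sentinel is treated as preamble
        if kind in ("user", "gemini"):
            blocks = entry.get("content") or []
            text = "\n".join(
                block["text"]
                for block in blocks
                if isinstance(block, dict) and isinstance(block.get("text"), str)
            )
            if text.strip():
                turns.append((kind, text))
    return turns


entries = [
    {"type": "user", "content": [{"text": "stray turn before metadata"}]},
    {"type": "session_metadata", "sessionId": "s1"},
    {"type": "user", "content": [{"text": "Hello"}]},
    {"type": "gemini", "content": [{"text": "Hi"}]},
]
print(collect_turns_after_sentinel(entries))  # [('user', 'Hello'), ('gemini', 'Hi')]
```

Discarding early turns, rather than rejecting the whole file, keeps the parser tolerant of stray leading records while still requiring the sentinel before any text is collected.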

igorls merged commit fdfaf01 into develop Apr 27, 2026
6 checks passed