Conversation JSONL bloats ~10× when chat uploads include binary files #54

@mgoldsborough

Description

Symptom

Conversations that include chat-uploaded files persist to JSONL at roughly 10× the size they should. A single 100 KB image balloons its conversation file by ~1 MB.

Evidence: conv_ce7995f740af44d4.jsonl is 8.8 MB and mostly binary-as-dict serializations of uploaded images/PDFs.

Root cause

Uploaded file bytes reach runtime.chat() as Buffer instances on the contentParts array of the user-message event. The conversation event store calls JSON.stringify on the event as-is, and the binary payload serializes to an object with one numeric key per byte:

```json
{"type":"image","image":{"0":137,"1":80,"2":78,"3":71, ...}, "mimeType":"image/png"}
```

That is roughly 10× overhead vs. the raw bytes, because each byte produces `"NN":NNN,` (6 or more characters, growing with the index digits) instead of 1 byte.

Cost framing

  • Disk/IOPS today — conversation files grow faster than they should.
  • LLM tokens on reload — if the full conversation is ever rehydrated into a prompt (e.g., resume, export, analytics), the bloated payload burns tokens. Even if today's reload path doesn't feed these parts back to the model, any future path that does will pay the inflated cost.
  • Bandwidth on reads — frontends streaming conversation events pay the bloat.

Fix

Strip binary-bearing content parts before persist; rehydrate from the workspace file store on reload via the `fileRefs` that are already on the user-message metadata.

  • The bytes are already stored durably at `/workspaces//files/` (after the file-store unification in PR #52, "Unify chat-upload and files__* tool file stores (bug 4)").
  • The conversation event only needs a reference by file id — `fileRefs` is the existing field.
  • On reload, only rehydrate if a caller actually needs the bytes. Most replay paths (history to the LLM) can work off the extracted-text content part, which is already persisted separately and is cheap.
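The strip-before-persist step above can be sketched roughly as follows; the `ContentPart`/`PersistedPart` shapes and the `fileId` field are hypothetical stand-ins for the real types in `src/runtime/runtime.ts`:

```typescript
// Hypothetical shapes -- the real runtime types will differ.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image"; image: Uint8Array; mimeType: string; fileId?: string };

type PersistedPart =
  | { type: "text"; text: string }
  | { type: "image-ref"; fileId: string; mimeType: string };

// Replace binary-bearing parts with a reference into the workspace file
// store before the event is appended; the bytes never reach JSONL.
function stripBinaryParts(parts: ContentPart[]): PersistedPart[] {
  return parts.map((part): PersistedPart => {
    if (part.type === "image") {
      if (!part.fileId) {
        throw new Error("image part has no file id to rehydrate from");
      }
      return { type: "image-ref", fileId: part.fileId, mimeType: part.mimeType };
    }
    return part;
  });
}
```

Rehydration would be the inverse: a caller that genuinely needs bytes looks up `fileId` in the workspace file store; everyone else works off the extracted-text part.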

Files likely to touch

  • `src/runtime/runtime.ts` — around where the user message is assembled before the store append (see ~line 519 for the `userContent` construction from `request.contentParts`).
  • `src/conversation/event-sourced-store.ts` (and memory/jsonl siblings) — assert that the event being appended has no `Buffer`/`Uint8Array` payloads. Throw loudly if it does, so this regression can't silently come back.
  • New test in `test/unit/conversation/` — round-trip a user message with a `contentParts` image; assert the on-disk JSONL does not contain `"0":`, `"1":`, etc.
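The event-store assertion can be a simple deep scan; a sketch, with the function name and wiring assumed rather than taken from the codebase:

```typescript
// Walk an event before append and throw loudly if any binary payload
// survived stripping. Buffer is a Uint8Array subclass in Node, so the
// single instanceof check catches both.
function assertNoBinaryPayload(value: unknown, path = "event"): void {
  if (value instanceof Uint8Array) {
    throw new Error(`binary payload at ${path}; strip it before persisting`);
  }
  if (Array.isArray(value)) {
    value.forEach((item, i) => assertNoBinaryPayload(item, `${path}[${i}]`));
  } else if (value !== null && typeof value === "object") {
    for (const [key, child] of Object.entries(value)) {
      assertNoBinaryPayload(child, `${path}.${key}`);
    }
  }
}
```

The unit test can then round-trip a user message through the store and grep the resulting JSONL for `"0":`-style keys, per the bullet above.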

Out of scope

  • Replay UX changes (e.g., lazy-loading file bytes in the web client). Track separately if we need it.
  • Re-writing existing bloated JSONL files. If desired, a one-shot cleanup can come later — needs decisions about whether to mutate historical events.

Labels

enhancement