Add features: type-safe models, streaming, plugins, semantic optimization & batch processing #25

@dino65-dev

Description

Problem Statement

As a user working with LLMs and large datasets, I need TOON Format to support:

  1. Type-safe serialization - Currently no native support for Pydantic models, dataclasses, or attrs classes
  2. Memory-efficient processing - Cannot handle datasets larger than available RAM
  3. Custom type handlers - No way to serialize UUIDs, datetime objects, NumPy arrays, or Pandas DataFrames
  4. Token optimization - Missing AI-aware optimizations like field abbreviation and semantic ordering
  5. Format conversion - No built-in way to convert between JSON/YAML/XML/CSV and TOON

These limitations prevent TOON from competing with modern serialization libraries (msgspec, orjson, Pydantic v2) and limit its adoption in production LLM applications.

Proposed Solution

Implement 5 core feature modules:

1. Type-Safe Model Integration (integrations.py)

from pydantic import BaseModel
from toon_format import encode_model, decode_model

class User(BaseModel):
    name: str
    age: int

user = User(name="Alice", age=30)
toon_str = encode_model(user)  # Full validation
decoded = decode_model(toon_str, User)  # Type preserved
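
For context, a minimal sketch of how these two helpers could wrap the existing encoder/decoder using standard Pydantic v2 methods (assuming toon_format exposes encode/decode at the top level, as the plugin example below suggests):

from pydantic import BaseModel
from toon_format import encode, decode  # existing spec-compliant encoder/decoder

def encode_model(model: BaseModel) -> str:
    # model_dump() yields plain dicts/lists that the existing encoder accepts
    return encode(model.model_dump())

def decode_model(toon_str: str, model_type: type[BaseModel]) -> BaseModel:
    # model_validate() re-runs validation, so type errors surface at decode time
    return model_type.model_validate(decode(toon_str))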

2. Streaming Processing (streaming.py)

from toon_format.streaming import StreamEncoder

with StreamEncoder("large_data.toon") as encoder:
    encoder.start_array(fields=["id", "name"])
    for item in database.query():  # Millions of records
        encoder.encode_item(item)  # O(1) memory
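
Roughly, the encoder stays O(1) in memory by writing each row as it arrives instead of buffering the dataset. A simplified sketch (the header line here is illustrative only; a real implementation would emit the spec's tabular array header):

class StreamEncoder:
    def __init__(self, path: str):
        self._file = open(path, "w", encoding="utf-8")
        self._fields: list[str] = []

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self._file.close()

    def start_array(self, fields):
        self._fields = list(fields)
        # illustrative header only; the real encoder would follow the TOON spec
        self._file.write("{" + ",".join(self._fields) + "}:\n")

    def encode_item(self, item: dict):
        # each record is written and released immediately, so memory use is constant
        self._file.write(",".join(str(item[f]) for f in self._fields) + "\n")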

3. Plugin System (plugins.py)

import uuid
from datetime import datetime

from toon_format import encode
from toon_format.plugins import register_encoder, register_decoder

# Auto-support for UUID, datetime, NumPy, Pandas, Decimal, Path
data = {"id": uuid.uuid4(), "created": datetime.now()}
encode(data)  # Just works
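
The same module's registration hooks would cover types beyond the built-ins. An assumed usage (the Point class and both handlers are hypothetical, and the register_* signatures shown are this proposal's, not an existing API):

from toon_format.plugins import register_encoder, register_decoder

class Point:
    def __init__(self, x: float, y: float):
        self.x, self.y = x, y

# teach the encoder to flatten Point into a plain dict...
register_encoder(Point, lambda p: {"x": p.x, "y": p.y})
# ...and the decoder to rebuild it from that dict
register_decoder(Point, lambda d: Point(d["x"], d["y"]))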

4. Semantic Optimization (semantic.py)

from toon_format.semantic import optimize_for_llm

optimized = optimize_for_llm(
    data,
    abbreviate_keys=True,  # emp_id vs employee_identifier
    order_fields=True,     # Important fields first
    remove_nulls=True,     # Strip unnecessary data
)
# Achieves 10-30% additional token reduction
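
The abbreviation pass could be as simple as a key-mapping table applied per record. A sketch (the table below is illustrative; a real implementation needs a deterministic, collision-free scheme so decoding stays unambiguous):

# illustrative mapping; savings compound because keys repeat on every record
ABBREVIATIONS = {
    "employee_identifier": "emp_id",
    "created_timestamp": "created",
    "description": "desc",
}

def abbreviate_keys(record: dict) -> dict:
    return {ABBREVIATIONS.get(key, key): value for key, value in record.items()}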

5. Batch Processing (batch.py)

from toon_format.batch import convert_file, batch_convert

# Auto-detect format and convert
convert_file("data.json", "data.toon")  # JSON → TOON
convert_file("data.yaml", "data.toon")  # YAML → TOON

# Batch process directories
batch_convert("./data/*.json", output_dir="./toon/", to_format="toon")
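
convert_file could dispatch on the input extension and reuse the existing encoder. A sketch of the JSON path only (YAML/XML/CSV would follow the same pattern; the toon_format.encode import is the same assumption as above):

import json
from pathlib import Path

from toon_format import encode

def convert_file(src: str, dst: str) -> None:
    path = Path(src)
    if path.suffix == ".json":
        data = json.loads(path.read_text(encoding="utf-8"))
    else:
        raise ValueError(f"unsupported input format: {path.suffix}")
    Path(dst).write_text(encode(data), encoding="utf-8")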

Alternatives Considered

  1. Use existing libraries - But they don't optimize for LLM token efficiency (TOON's core value)
  2. Manual implementations - Requires users to write boilerplate for each use case
  3. External tools - Adds dependencies and complexity to workflows

SPEC Compliance

This does not modify the TOON specification. All features are:

  • Built on top of existing spec-compliant encoder/decoder
  • Additive (no changes to core syntax or parsing)
  • Optional (existing code continues to work unchanged)

Additional Context

Benefits

  • Backward compatible - Zero breaking changes
  • Production-ready - Comprehensive tests, documentation, examples
  • Performance - Streaming = O(1) memory, semantic optimization = 10-30% extra savings
  • Modern - Matches 2025 best practices (Pydantic v2, msgspec patterns)

Impact

  • Makes TOON competitive with modern serialization libraries
  • Enables enterprise adoption for production LLM applications
  • Maintains core strength: 30-60% token reduction vs JSON

Implementation Plan

  1. Create 5 new modules
  2. Add comprehensive documentation
  3. Create working examples and demos
  4. Ensure 100% test coverage
  5. Maintain Python 3.10+ compatibility
