Labels: enhancement (New feature or request)
Description
Problem Statement
As a user working with LLMs and large datasets, I need TOON Format to support:
- Type-safe serialization - Currently no native support for Pydantic models, dataclasses, or attrs classes
- Memory-efficient processing - Cannot handle datasets larger than available RAM
- Custom type handlers - No way to serialize UUIDs, datetime objects, NumPy arrays, or Pandas DataFrames
- Token optimization - Missing AI-aware optimizations like field abbreviation and semantic ordering
- Format conversion - No built-in way to convert between JSON/YAML/XML/CSV and TOON
These limitations prevent TOON from competing with modern serialization libraries (msgspec, orjson, Pydantic v2) and limit its adoption for production LLM applications.
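The third gap mirrors one the standard library already has. A minimal stdlib illustration (JSON here, standing in for TOON) of why per-call-site workarounds amount to boilerplate:

```python
import json
import uuid
from datetime import datetime

payload = {"id": uuid.uuid4(), "created": datetime.now()}

# Without a registered type handler, the stock encoder rejects rich types.
try:
    json.dumps(payload)
except TypeError:
    pass  # e.g. "Object of type UUID is not JSON serializable"

# Today's workaround: a hand-written `default` hook at every call site --
# exactly the boilerplate a plugin registry would centralize.
text = json.dumps(payload, default=str)
```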
Proposed Solution
Implement 5 core feature modules:
1. Type-Safe Model Integration (integrations.py)
```python
from pydantic import BaseModel
from toon_format import encode_model, decode_model

class User(BaseModel):
    name: str
    age: int

user = User(name="Alice", age=30)
toon_str = encode_model(user)           # Full validation
decoded = decode_model(toon_str, User)  # Type preserved
```

2. Streaming Processing (streaming.py)
```python
from toon_format.streaming import StreamEncoder

with StreamEncoder("large_data.toon") as encoder:
    encoder.start_array(fields=["id", "name"])
    for item in database.query():  # Millions of records
        encoder.encode_item(item)  # O(1) memory
```

3. Plugin System (plugins.py)
```python
import uuid
from datetime import datetime

from toon_format import encode
from toon_format.plugins import register_encoder, register_decoder

# Auto-support for UUID, datetime, NumPy, Pandas, Decimal, Path
data = {"id": uuid.uuid4(), "created": datetime.now()}
encode(data)  # Just works
```

4. Semantic Optimization (semantic.py)
```python
from toon_format.semantic import optimize_for_llm

optimized = optimize_for_llm(
    data,
    abbreviate_keys=True,  # emp_id vs employee_identifier
    order_fields=True,     # Important fields first
    remove_nulls=True,     # Strip unnecessary data
)
# Achieves 10-30% additional token reduction
```

5. Batch Processing (batch.py)
```python
from toon_format.batch import convert_file, batch_convert

# Auto-detect format and convert
convert_file("data.json", "data.toon")  # JSON → TOON
convert_file("data.yaml", "data.toon")  # YAML → TOON

# Batch process directories
batch_convert("./data/*.json", output_dir="./toon/", to_format="toon")
```

Alternatives Considered
- Use existing libraries - But they don't optimize for LLM token efficiency (TOON's core value)
- Manual implementations - Requires users to write boilerplate for each use case
- External tools - Adds dependencies and complexity to workflows
SPEC Compliance
This does not modify the TOON specification. All features are:
- Built on top of existing spec-compliant encoder/decoder
- Additive (no changes to core syntax or parsing)
- Optional (existing code continues to work unchanged)
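The "additive, built on top" point can be made concrete. The sketch below is hypothetical throughout: `core_encode` is a stand-in for the existing spec-compliant encoder, and the handler table is invented. Rich types are normalized to primitives first; the core encoder is never modified.

```python
import uuid
from datetime import datetime

# Stand-in for the existing spec-compliant encoder; the proposal leaves it untouched.
def core_encode(value) -> str:
    return repr(value)

# Hypothetical plugin table: type -> function producing a plain primitive.
_HANDLERS = {uuid.UUID: str, datetime: lambda d: d.isoformat()}

def normalize(value):
    """Recursively replace registered rich types with primitives."""
    for tp, fn in _HANDLERS.items():
        if isinstance(value, tp):
            return fn(value)
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize(v) for v in value]
    return value

def encode(value) -> str:
    # Additive layering: pre-process, then delegate to the unchanged core.
    return core_encode(normalize(value))
```

Because the layer only rewrites values before encoding, existing callers that never register a handler see identical output.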
Additional Context
Benefits
- Backward compatible - Zero breaking changes
- Production-ready - Comprehensive tests, documentation, examples
- Performance - Streaming = O(1) memory, semantic optimization = 10-30% extra savings
- Modern - Matches 2025 best practices (Pydantic v2, msgspec patterns)
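The O(1)-memory claim reduces to never materializing the dataset: each record is formatted and written as it arrives. A rough sketch of what a StreamEncoder along the proposed lines could look like (the header line is a simplified stand-in, not exact TOON tabular syntax):

```python
class StreamEncoder:
    """Hypothetical streaming writer: holds one record in memory at a time."""

    def __init__(self, path: str):
        self.path = path

    def __enter__(self):
        self._fh = open(self.path, "w", encoding="utf-8")
        return self

    def __exit__(self, *exc):
        self._fh.close()

    def start_array(self, fields):
        self._fields = list(fields)
        # Simplified header standing in for TOON's tabular header syntax.
        self._fh.write("{" + ",".join(self._fields) + "}:\n")

    def encode_item(self, item: dict):
        # Only the current record is in memory; each row flushes to disk.
        self._fh.write(",".join(str(item[f]) for f in self._fields) + "\n")
```

The context-manager shape matches the proposed API and guarantees the file is closed even if the producing query fails mid-stream.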
Impact
- Makes TOON competitive with modern serialization libraries
- Enables enterprise adoption for production LLM applications
- Maintains core strength: 30-60% token reduction vs JSON
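The extra 10-30% quoted under Benefits comes from the semantic transforms. A hypothetical sketch of `optimize_for_llm` — note the explicit abbreviation map and priority list are invented stand-ins for the proposal's `abbreviate_keys`/`order_fields` booleans, which would need a built-in dictionary and an importance heuristic:

```python
def optimize_for_llm(data: dict, *, abbreviations=None, priority=(),
                     remove_nulls=True) -> dict:
    """Sketch: shorten keys, drop nulls, surface important fields first."""
    abbreviations = abbreviations or {}
    # Drop nulls and rename keys in one pass.
    items = {abbreviations.get(k, k): v for k, v in data.items()
             if not (remove_nulls and v is None)}
    # Priority fields first, everything else in original order.
    ordered = {k: items[k] for k in priority if k in items}
    ordered.update(items)
    return ordered
```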
Implementation Plan
- Create 5 new modules
- Add comprehensive documentation
- Create working examples and demos
- Ensure 100% test coverage
- Maintain Python 3.10+ compatibility
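As a sense of scope for item 1 of the plan, a first cut over stdlib dataclasses suggests the shape of the work. Names follow the proposal, but the line format is a placeholder rather than real TOON output, and Pydantic/attrs support would layer on the same pattern:

```python
from dataclasses import dataclass, asdict, fields
from typing import Type, TypeVar

T = TypeVar("T")

def encode_model(obj) -> str:
    # Placeholder rendering: one "key: value" line per field stands in
    # for the real TOON encoder.
    return "\n".join(f"{k}: {v}" for k, v in asdict(obj).items())

def decode_model(text: str, cls: Type[T]) -> T:
    # Re-parse the lines and coerce each value back to its annotated type.
    raw = dict(line.split(": ", 1) for line in text.splitlines())
    return cls(**{f.name: f.type(raw[f.name]) for f in fields(cls)})

@dataclass
class User:
    name: str
    age: int
```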
carlenmy