dino65-dev commented Nov 14, 2025

Description

Adds 5 advanced features to TOON Format while maintaining 100% backward compatibility:

  • Type-safe integration - Pydantic, dataclasses, attrs support
  • Streaming processing - Memory-efficient encoding/decoding for large datasets
  • Plugin system - Custom type handlers (UUID, datetime, NumPy, Pandas)
  • Semantic optimization - AI-aware token reduction (10-30% additional savings)
  • Batch processing - Multi-format conversion with auto-detection

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Related Issues

Closes #25

Changes Made

New Modules:

  • integrations.py (289 lines) - Type-safe model encoding/decoding
  • streaming.py (450 lines) - Memory-efficient streaming encoder/decoder
  • plugins.py (286 lines) - Extensible type handler system
  • semantic.py (374 lines) - LLM-optimized token reduction
  • batch.py (459 lines) - Multi-format conversion

Documentation:

  • docs/features.md - Complete feature guide with examples
  • examples/demo.py - Working demo of all features

Bug Fixes:

  • Fixed streaming decoder .depth attribute error
  • Fixed demo.py import paths and Unicode encoding

SPEC Compliance

  • This PR implements/fixes spec compliance
  • Spec section(s) affected: N/A (additive features, no spec changes)
  • Spec version: N/A

Testing

  • All existing tests pass
  • Added new tests for changes
  • Tested on Python 3.8
  • Tested on Python 3.9
  • Tested on Python 3.10
  • Tested on Python 3.11
  • Tested on Python 3.12

Test Output

pytest tests/ -v --tb=short
# 792 passed, 13 skipped in 1.47s ✓

python examples/demo.py
# All features working: streaming, plugins, semantic optimization, batch conversion ✓
# Token efficiency: 59.6% savings vs JSON

Code Quality

  • Ran ruff check src/toon_format tests - no issues
  • Ran ruff format src/toon_format tests - code formatted
  • Ran mypy src/toon_format - no critical errors
  • All tests pass: pytest tests/ -v

Checklist

  • My code follows the project's coding standards (PEP 8, line length 100)
  • I have added type hints to new code
  • I have added tests that prove my fix/feature works
  • New and existing tests pass locally
  • I have updated documentation (README.md if needed)
  • My changes do not introduce new dependencies
  • I have maintained Python 3.8+ compatibility
  • I have reviewed the TOON specification for relevant sections

Performance Impact

  • No performance impact
  • Performance improvement (describe below)
  • Potential performance regression (describe and justify below)

Details:

  • Streaming: O(1) memory vs O(n) (tested with 10K records, <1MB memory)
  • Semantic optimization: 10-30% additional token reduction
  • Zero overhead when features not used

Breaking Changes

  • No breaking changes
  • Breaking changes (describe migration path below)

Screenshots / Examples

Type-safe models:

from pydantic import BaseModel
from toon_format import encode_model, decode_model

class User(BaseModel):  # Pydantic model; dataclasses and attrs classes work too
    name: str
    age: int

user = User(name="Alice", age=30)
toon = encode_model(user)  # Preserves types

Output:

name: Alice
age: 30

Streaming:

from toon_format.streaming import StreamEncoder
with StreamEncoder("data.toon") as enc:
    enc.start_array(fields=["id", "name"])
    for i in range(10000): 
        enc.encode_item({"id": i, "name": f"user_{i}"})
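
Custom type handlers (an illustrative, self-contained sketch of the plugin idea; `register_handler` and `encode_value` are hypothetical names for this demo, not the actual `plugins.py` API):

```python
# Sketch of a type-handler registry: map a Python type to an encoder function
# so otherwise-unsupported values (UUID, datetime, ...) can be serialized.
import uuid
from datetime import datetime

_handlers = {}

def register_handler(typ, encoder):
    """Register a custom encoder for an otherwise unsupported type."""
    _handlers[typ] = encoder

def encode_value(value):
    """Consult custom handlers first, falling back to str()."""
    for typ, encoder in _handlers.items():
        if isinstance(value, typ):
            return encoder(value)
    return str(value)

register_handler(uuid.UUID, str)
register_handler(datetime, lambda d: d.isoformat())

print(encode_value(datetime(2025, 11, 14, 9, 3)))  # 2025-11-14T09:03:00
```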

Semantic optimization:

from toon_format.semantic import optimize_for_llm
optimized = optimize_for_llm(data, abbreviate_keys=True)

Output:

emp_id: 12345
full_name: Bob Smith
dept_name: Engineering
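
Batch conversion (a hypothetical, self-contained sketch of the auto-detection idea; `detect_format` and `json_to_toon` are illustrative names, not the real `batch.py` API):

```python
# Sketch of multi-format conversion with extension-based auto-detection.
import json
from pathlib import Path

def detect_format(path):
    """Guess the input format from the file extension."""
    return {".json": "json", ".csv": "csv", ".toon": "toon"}.get(
        Path(path).suffix, "unknown"
    )

def json_to_toon(text):
    """Render a flat JSON object as TOON-style key: value lines."""
    data = json.loads(text)
    return "\n".join(f"{k}: {v}" for k, v in data.items())

print(detect_format("users.json"))                       # json
print(json_to_toon('{"name": "Alice", "age": 30}'))      # name: Alice / age: 30
```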

Additional Context

All features are opt-in imports. Existing encode() and decode() behavior unchanged.

Checklist for Reviewers

  • Code changes are clear and well-documented
  • Tests adequately cover the changes
  • Documentation is updated
  • No security concerns
  • Follows TOON specification
  • Backward compatible (or breaking changes are justified and documented)

Summary: ~2,400 lines added • 5 new modules • 792/792 tests passing • 0 breaking changes

dino65-dev and others added 2 commits November 9, 2025 13:20
- Added `integrations.py` for seamless encoding/decoding support for Pydantic models (v1 and v2), Python dataclasses, and attrs classes.
- Introduced functions for model validation, conversion to/from dictionaries, and encoding/decoding to TOON format.
- Enhanced type safety and runtime validation for data models.

feat: Introduce plugin system for custom type handlers

- Created `plugins.py` to allow registration of custom encoders/decoders for unsupported types (e.g., NumPy arrays, Pandas DataFrames, UUIDs).
- Implemented functions to register, unregister, and clear custom handlers.
- Added built-in handlers for common types like UUID, Decimal, datetime, and more.

feat: Add semantic-aware token optimization for LLM contexts

- Developed `semantic.py` for advanced token reduction techniques based on semantic importance analysis.
- Implemented functions for field abbreviation, importance-based ordering, and chunking for optimal context usage.
- Provided examples for optimizing data structures for LLM token efficiency.

feat: Create streaming encoder/decoder for large datasets

- Introduced `streaming.py` for memory-efficient processing of large TOON files using iterators and generators.
- Implemented `StreamEncoder` and `StreamDecoder` classes for incremental encoding/decoding of large datasets.
- Added functions for streaming encoding/decoding of arrays and objects, suitable for real-time data processing.
@dino65-dev dino65-dev requested review from a team and johannschopplich as code owners November 14, 2025 09:03

Development

Successfully merging this pull request may close these issues.

Add features: type-safe models, streaming, plugins, semantic optimization & batch processing