Hi @johannschopplich, I am building a TOON benchmark for generation tasks. The goal is to test cases of varying complexity across multiple runs using 21 models.
I'm comparing JSON, JSON+SO (constrained decoding), and TOON across three metrics:
- One-shot accuracy
- Final accuracy after repair cycles
- Overall token budget
Pipeline:
- The LLM generates output in JSON (J), JSON+SO (JSO), or TOON (T) format (21 models via the Nebius API)
- For TOON: the output is decoded to JSON with the CLI. CLI errors trigger a repair cycle
- All formats are validated via Pydantic
- Data canonicalization
- Comparison with the gold-standard JSON
- If the data doesn't match, a repair cycle starts (up to 3 attempts) with the previous output and the error text inserted into the prompt
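To make the loop above concrete, here is a minimal sketch of how the decode, canonicalize, compare, and repair steps could fit together. It is not the actual benchmark code: `run_case`, `canonicalize`, the repair prompt wording, and the float-rounding rule are all assumptions for illustration, and `generate` and `decode` are hypothetical callables standing in for the model call and for the TOON CLI plus Pydantic validation.

```python
import json
from typing import Any, Callable

MAX_REPAIRS = 3  # "up to 3 attempts" from the pipeline description


def canonicalize(value: Any) -> Any:
    """Normalize decoded data so equivalent records compare equal (assumed rules)."""
    if isinstance(value, dict):
        return {str(k): canonicalize(v) for k, v in value.items()}
    if isinstance(value, list):
        return [canonicalize(v) for v in value]
    if isinstance(value, bool):  # check bool before int: True is an int in Python
        return value
    if isinstance(value, (int, float)):
        return round(float(value), 6)  # assumption: 8 and 8.0 count as equal
    if isinstance(value, str):
        return value.strip()
    return value


def run_case(
    generate: Callable[[str], str],  # hypothetical: prompt -> raw model output
    decode: Callable[[str], Any],    # hypothetical: raw output -> validated data
    gold: Any,
    task_prompt: str,
) -> dict:
    """Run one case: a one-shot attempt followed by up to MAX_REPAIRS repairs."""
    gold_canon = canonicalize(gold)
    prompt = task_prompt
    for attempt in range(MAX_REPAIRS + 1):  # attempt 0 is the one-shot try
        output = generate(prompt)
        try:
            data = decode(output)
            matches = canonicalize(data) == gold_canon
            error = None if matches else "Data does not match the expected record."
        except Exception as exc:  # decoder or validation failure
            error = f"Decode/validation error: {exc}"
        if error is None:
            return {"one_shot": attempt == 0, "final": True, "repairs": attempt}
        # Assumed repair prompt layout: task, previous output, then error text.
        prompt = (
            f"{task_prompt}\n\nPrevious output:\n{output}\n\n"
            f"Error:\n{error}\n\nReturn a corrected output."
        )
    return {"one_shot": False, "final": False, "repairs": MAX_REPAIRS}


# Usage with stand-ins: a canned "model" and json.loads as the decoder.
canned = iter(["not valid", '{"id": 101, "qty": 2.0}'])
result = run_case(lambda _prompt: next(canned), json.loads, {"id": 101, "qty": 2}, "task")
print(result)
```

Passing the decoder in as a callable keeps the loop identical for all three formats, so any accuracy difference comes from the format rather than from the harness.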
Test cases:
users - tabular array
order - nested object + array
invoice - nested objects + array
company - nested arrays within arrays
I run each case 10 times per model, then take the mean of each metric per model and per case.
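The averaging step could look roughly like the sketch below. The record layout (`model`, `case`, `fmt`, `one_shot`, `final`, `tokens`) and the `aggregate` helper are assumptions for illustration, not the actual benchmark schema. Grouping by format in addition to model and case is also an assumption, made so that the three formats stay comparable.

```python
from collections import defaultdict
from statistics import mean


def aggregate(runs):
    """Average each metric over repeated runs, per (model, case, format)."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[(run["model"], run["case"], run["fmt"])].append(run)

    summary = {}
    for key, group in grouped.items():
        summary[key] = {
            # Accuracy is the fraction of runs that succeeded.
            "one_shot_acc": mean(1.0 if r["one_shot"] else 0.0 for r in group),
            "final_acc": mean(1.0 if r["final"] else 0.0 for r in group),
            # Token budget summed over the first attempt and all repairs.
            "mean_tokens": mean(r["tokens"] for r in group),
            "n_runs": len(group),
        }
    return summary


# Usage with two made-up runs (placeholder values, not benchmark results).
example_runs = [
    {"model": "m1", "case": "order", "fmt": "toon", "one_shot": True, "final": True, "tokens": 100},
    {"model": "m1", "case": "order", "fmt": "toon", "one_shot": False, "final": True, "tokens": 200},
]
for key, stats in aggregate(example_runs).items():
    print(key, stats)
```

Keeping `n_runs` in the summary makes it easy to spot groups where some of the 10 runs were lost to API errors.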
TOON example (I use it across all cases):
"Return data in TOON format (2-space indent; arrays show length and fields).\n"
"Example:\n"
"```toon\n"
"id: 100\n"
"title: Sample\n"
"author:\n"
"  id: 5\n"
"  name: Alex\n"
"items[2]{code,value,price}:\n"
"  A1,10,5.5\n"
"  B2,20,8.0\n"
"summary:\n"
"  count: 2\n"
"  total: 13.5\n"
"notes: Example data.\n"
"```\n\n"
"Output ONLY a TOON code block. Use correct headers and [N] values.\n\n"
Task prompt example (order case):
"Create an order record with fields for id, customer (with id and name), and items array with sku, qty, price fields:\n"
"- Order ID: 101\n"
"- Customer: Ada (ID: 9)\n"
"- Items:\n"
" * Product A1: quantity 2, price $9.99 each\n"
" * Product B2: quantity 1, price $14.50 each\n"
I've run preliminary tests and have initial results, but I'm still refining prompt quality and case balance. Could you review this setup and provide suggestions, particularly around:
- Whether the universal TOON example is a good one.
- Case selection, especially the company case with nested arrays, which seems to fall outside TOON's proven accuracy for uniform arrays.
- Any prompt improvements that might better demonstrate TOON's strengths.
- Whether to split the metrics between the areas where TOON is strong and the areas where it is weak.
Thanks!
order.toon.txt
company.toon.txt
invoice.toon.txt
users.toon.txt