TOON benchmark for generation tasks

Hi @johannschopplich, I'm working on a TOON benchmark for generation tasks. The goal is to test cases of varying complexity across multiple runs using 21 models.

I'm comparing JSON (J), JSON+SO (JSO, constrained decoding), and TOON (T) across three metrics:
- One-shot accuracy
- Final accuracy after repair cycles
- Overall token budget

Pipeline:
1. The LLM generates output in J, JSO, or T format (21 models via the Nebius API).
2. For TOON: the output is decoded to JSON with the CLI. CLI errors trigger a repair cycle.
3. All formats are validated with Pydantic.
4. The data is canonicalized.
5. The result is compared with the gold-standard JSON.
6. If the data doesn't match, a repair cycle starts (up to 3 attempts), with the previous output and the error text inserted into the prompt.
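
A minimal sketch of how steps 2-6 could fit together, to make the control flow concrete. This is not the actual harness: `run_case`, `canonicalize`, `generate`, and `decode` are hypothetical names, the canonicalization rules (numbers compared as floats, strings stripped) are assumptions, and "up to 3 attempts" is read here as up to 3 repairs after the first generation. In a real run `decode` would wrap the TOON CLI call plus Pydantic validation; token counting is left out.

```python
import json
from typing import Any, Callable


def canonicalize(value: Any) -> Any:
    """Normalize a decoded value so equivalent records compare equal (assumed rules)."""
    if isinstance(value, dict):
        return {key: canonicalize(item) for key, item in value.items()}
    if isinstance(value, list):
        return [canonicalize(item) for item in value]
    if isinstance(value, bool):
        return value  # keep booleans distinct from numbers
    if isinstance(value, (int, float)):
        return float(value)  # 8 and 8.0 count as the same number
    if isinstance(value, str):
        return value.strip()
    return value


def run_case(
    generate: Callable[[str], str],  # hypothetical: prompt -> raw model output
    decode: Callable[[str], Any],    # hypothetical: raw output -> data, raises ValueError
    gold: Any,
    prompt: str,
    max_repairs: int = 3,
) -> dict:
    """Run one generation plus up to max_repairs repair attempts."""
    target = canonicalize(gold)
    current_prompt = prompt
    for attempt in range(max_repairs + 1):
        output = generate(current_prompt)
        try:
            ok = canonicalize(decode(output)) == target
            error = "" if ok else "decoded data does not match the expected record"
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            ok, error = False, f"decode error: {exc}"
        if ok:
            return {"one_shot": attempt == 0, "final": True, "attempts": attempt + 1}
        # Repair prompt: original task plus previous output and error text.
        current_prompt = (
            f"{prompt}\n\nPrevious output:\n{output}\n\n"
            f"Error:\n{error}\n\nReturn a corrected version."
        )
    return {"one_shot": False, "final": False, "attempts": max_repairs + 1}
```

With a stub generator that returns broken JSON first and valid JSON second, `run_case` reports a failed one-shot, a successful final result, and two attempts.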


Test cases:
- users: tabular array
- order: nested object + array
- invoice: nested objects + array
- company: nested arrays within arrays

I run each case 10 times per model, then take the mean of each metric per model and per case.
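
A minimal sketch of that aggregation, assuming each run is stored as a dict with `model`, `case`, `format`, `one_shot`, `final`, and `tokens` keys. The record layout and the `aggregate` name are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate(runs: list[dict]) -> dict:
    """Average the three metrics per (model, case, format) group."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[(run["model"], run["case"], run["format"])].append(run)

    summary = {}
    for key, group in grouped.items():
        summary[key] = {
            # Share of runs that matched the gold record on the first attempt.
            "one_shot_accuracy": mean([1.0 if r["one_shot"] else 0.0 for r in group]),
            # Share of runs that matched after the repair cycles.
            "final_accuracy": mean([1.0 if r["final"] else 0.0 for r in group]),
            # Mean total tokens per run, including repair attempts.
            "mean_tokens": mean([r["tokens"] for r in group]),
        }
    return summary
```

Keeping the format in the grouping key makes it straightforward to put J, JSO, and T side by side for the same model and case.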

TOON example (used across all cases):

```python
"Return data in TOON format (2-space indent; arrays show length and fields).\n"
"Example:\n"
"```toon\n"
"id: 100\n"
"title: Sample\n"
"author:\n"
"  id: 5\n"
"  name: Alex\n"
"items[2]{code,value,price}:\n"
"  A1,10,5.5\n"
"  B2,20,8.0\n"
"summary:\n"
"  count: 2\n"
"  total: 13.5\n"
"notes: Example data.\n"
"```\n\n"
"Output ONLY a TOON code block. Use correct headers and [N] values.\n\n"
```
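
Since the prompt asks for correct headers and `[N]` values, a cheap pre-check on tabular blocks can help classify failures before the CLI runs. The sketch below is an assumption-laden illustration, not a TOON parser: `check_tabular_blocks` is a hypothetical helper, it only looks at headers of the form `key[N]{a,b,c}:`, it assumes the 2-space indent from the prompt, and it splits rows naively on commas (quoted values containing commas are not handled).

```python
import re

# Matches a tabular array header such as "items[2]{code,value,price}:".
HEADER = re.compile(
    r"^(?P<indent> *)(?P<key>[^\[\s]+)\[(?P<n>\d+)\]\{(?P<fields>[^}]*)\}:\s*$"
)


def check_tabular_blocks(toon_text: str) -> list[str]:
    """Return human-readable problems found in tabular array blocks."""
    problems = []
    lines = toon_text.splitlines()
    i = 0
    while i < len(lines):
        match = HEADER.match(lines[i])
        i += 1
        if not match:
            continue
        declared = int(match["n"])
        fields = match["fields"].split(",")
        row_prefix = " " * (len(match["indent"]) + 2)  # assumes 2-space indent
        rows = []
        # Collect the indented lines that follow the header as rows.
        while i < len(lines) and lines[i].startswith(row_prefix) and lines[i].strip():
            rows.append(lines[i].strip())
            i += 1
        if len(rows) != declared:
            problems.append(
                f"{match['key']}: header declares {declared} rows, found {len(rows)}"
            )
        for row in rows:
            # Naive split: ignores quoted values that contain commas.
            if len(row.split(",")) != len(fields):
                problems.append(
                    f"{match['key']}: row {row!r} does not have {len(fields)} values"
                )
    return problems
```

Run on the example above it returns an empty list; changing the header to `items[3]{code,value,price}:` yields a single row-count problem.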

Task prompt example (order case):
```python
"Create an order record with fields for id, customer (with id and name), and items array with sku, qty, price fields:\n"
"- Order ID: 101\n"
"- Customer: Ada (ID: 9)\n"
"- Items:\n"
"  * Product A1: quantity 2, price $9.99 each\n"
"  * Product B2: quantity 1, price $14.50 each\n"
```

I've run preliminary tests and have initial results, but I'm still refining prompt quality and case balance. Could you review this setup and share suggestions, particularly on:
1. Whether the universal TOON example is a good one.
2. Case selection, especially the company case with nested arrays, which seems to fall outside the uniform-array cases where TOON's accuracy has been demonstrated.
3. Any prompt improvements that might better demonstrate TOON's strengths.
4. Whether to split the metrics between the areas where TOON is strong and those where it is weak.

Thanks!


[order.toon.txt](https://github.com/user-attachments/files/23561814/order.toon.txt)
[company.toon.txt](https://github.com/user-attachments/files/23561815/company.toon.txt)
[invoice.toon.txt](https://github.com/user-attachments/files/23561813/invoice.toon.txt)
[users.toon.txt](https://github.com/user-attachments/files/23561812/users.toon.txt)

TOON benchmark for generation tasks #207
