TOON benchmark for generation tasks #207

@vetertann

Hi @johannschopplich, I am building a TOON benchmark for generation tasks. The goal is to test cases of varying complexity across multiple runs using 21 models.

I'm comparing JSON (J) vs JSON+SO (JSO, structured output with constrained decoding) vs TOON (T) across three metrics:

  • One-shot accuracy
  • Final accuracy after repair cycles
  • Overall token budget

Pipeline:

  1. LLM generates output in J, JSO, or T format (21 models via Nebius API)
  2. For TOON: the output is decoded to JSON with the CLI. CLI errors trigger a repair cycle
  3. Validation of all formats via Pydantic
  4. Data canonicalization
  5. Comparison with gold standard JSON
  6. If the data doesn't match, a repair cycle starts (up to 3 attempts) with the previous output and the error text inserted into the prompt
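
The repair cycle in steps 2 and 6 can be sketched roughly as follows. This is a minimal illustration, not the actual benchmark code: `generate` (the model call) and `check` (decode + validate + compare, returning an error string or `None`) are hypothetical callables, and the wording of the repair prompt is an assumption.

```python
from typing import Callable, Optional, Tuple

MAX_REPAIRS = 3  # "up to 3 attempts" from the pipeline description


def run_with_repairs(
    generate: Callable[[str], str],         # hypothetical: prompt -> raw model output
    check: Callable[[str], Optional[str]],  # hypothetical: output -> error text, or None if it matches gold
    task_prompt: str,
    max_repairs: int = MAX_REPAIRS,
) -> Tuple[bool, bool, int]:
    """Return (one_shot_ok, final_ok, attempts_used)."""
    prompt = task_prompt
    for attempt in range(max_repairs + 1):  # one initial try plus the repairs
        output = generate(prompt)
        error = check(output)
        if error is None:
            # attempt == 0 means the first generation was already correct
            return (attempt == 0, True, attempt + 1)
        # Feed the previous output and the error text back into the prompt
        prompt = (
            f"{task_prompt}\n\nPrevious output:\n{output}\n\n"
            f"Error:\n{error}\n\nFix the output."
        )
    return (False, False, max_repairs + 1)
```

Returning both flags from one call keeps one-shot accuracy and final accuracy tied to the same run, so the two metrics can be averaged from the same records.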

Test cases:

  • users: tabular array
  • order: nested object + array
  • invoice: nested objects + array
  • company: nested arrays within arrays

I run this 10 times for each model, then take the mean of each metric per model and per case.
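
For illustration, the averaging step could look something like this. The record layout (`model`, `case`, `format`, `one_shot_ok`, `final_ok`, `tokens`) is a hypothetical shape chosen for the sketch, not the benchmark's real schema.

```python
from collections import defaultdict
from statistics import mean


def aggregate(runs):
    """Average each metric per (model, case, format) group.

    `runs` is a list of dicts with hypothetical keys:
    model, case, format, one_shot_ok, final_ok, tokens.
    """
    grouped = defaultdict(list)
    for run in runs:
        grouped[(run["model"], run["case"], run["format"])].append(run)

    summary = {}
    for key, group in grouped.items():
        summary[key] = {
            # Booleans are averaged as 0/1 to get an accuracy rate
            "one_shot_acc": mean([1.0 if r["one_shot_ok"] else 0.0 for r in group]),
            "final_acc": mean([1.0 if r["final_ok"] else 0.0 for r in group]),
            "mean_tokens": mean([r["tokens"] for r in group]),
            "runs": len(group),
        }
    return summary
```

Keeping the format in the grouping key lets J, JSO, and T be compared side by side for the same model and case.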

TOON example (I use it across all cases):

"Return data in TOON format (2-space indent; arrays show length and fields).\n"
"Example:\n"
"```toon\n"
"id: 100\n"
"title: Sample\n"
"author:\n"
"  id: 5\n"
"  name: Alex\n"
"items[2]{code,value,price}:\n"
"  A1,10,5.5\n"
"  B2,20,8.0\n"
"summary:\n"
"  count: 2\n"
"  total: 13.5\n"
"notes: Example data.\n"
"```\n\n"
"Output ONLY a TOON code block. Use correct headers and [N] values.\n\n"

Task prompt example (order case):

"Create an order record with fields for id, customer (with id and name), and items array with sku, qty, price fields:\n"
"- Order ID: 101\n"
"- Customer: Ada (ID: 9)\n"
"- Items:\n"
"  * Product A1: quantity 2, price $9.99 each\n"
"  * Product B2: quantity 1, price $14.50 each\n"
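
As a rough sketch of the canonicalization and gold comparison for this case: the gold record below is inferred from the prompt text (the attached files are not reproduced here), and the normalization rules (sorted keys, numbers coerced to float, trimmed strings) are assumptions about what canonicalization might involve.

```python
import json

# Gold standard inferred from the order prompt above (assumed, not taken from the attachments)
GOLD_ORDER = {
    "id": 101,
    "customer": {"id": 9, "name": "Ada"},
    "items": [
        {"sku": "A1", "qty": 2, "price": 9.99},
        {"sku": "B2", "qty": 1, "price": 14.50},
    ],
}


def canonicalize(value):
    """Normalize a decoded structure so equivalent outputs compare equal."""
    if isinstance(value, dict):
        return {key: canonicalize(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return [canonicalize(item) for item in value]  # array order is kept
    if isinstance(value, bool):
        return value  # checked before int, since bool is a subclass of int
    if isinstance(value, (int, float)):
        return float(value)  # 2 and 2.0 serialize identically
    if isinstance(value, str):
        return value.strip()
    return value


def matches_gold(candidate, gold=GOLD_ORDER):
    """Compare canonical JSON serializations of candidate and gold."""
    dump = lambda obj: json.dumps(canonicalize(obj), sort_keys=True)
    return dump(candidate) == dump(gold)
```

With these rules, differences in key order or `14.5` vs `14.50` do not count as mismatches, while a wrong value or a reordered items array does.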

I've run preliminary tests and have initial results, but I'm still refining prompt quality and case balance. Could you review this setup and provide suggestions, particularly around:

  1. Whether the universal TOON example is a good one.
  2. Case selection - especially the company case with nested arrays, which seems to fall outside the uniform arrays where TOON's accuracy is proven.
  3. Any prompt improvements that might better demonstrate TOON's strengths.
  4. Whether to split the metrics between the areas where TOON is strong and those where it is weak.

Thanks!

order.toon.txt
company.toon.txt
invoice.toon.txt
users.toon.txt
