
fix(hf): route all uploads through bench.load_data() #10

Merged
mohitgargai merged 3 commits into main from fix/hf-schema-drift
Apr 24, 2026

Conversation

@mohitgargai
Contributor

Summary

Fixes schema drift between the uploaded HF parquet files (lica-world/GDB) and what each benchmark's build_model_input() expects at runtime. The first-user drill surfaced a crash on gdb eval --benchmarks svg-1 (KeyError: 'options'); an in-process sweep of all 39 benchmarks against HF showed 8 broken, all traceable to upload-time schema drift.

Root causes

The upload script had three alternative code paths (load_csv_benchmark, load_json_benchmark, load_manifest_benchmark) that bypassed each benchmark's own load_data() to save time on dataset enumeration. They drifted from the runtime contract — e.g. svg-1's local load_data() expands nested questions into ~1200 flat samples per question, but load_json_benchmark produced 300 rows with the nested structure dumped verbatim into metadata, so build_model_input() crashed.
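The shape mismatch can be illustrated with a minimal sketch. The row layout, field names, and `flatten` helper below are hypothetical stand-ins, not the project's actual code; they only show why a nested upload breaks a loader that expects one flat sample per question with `options` at the top level:

```python
# Hypothetical illustration of the svg-1 shape mismatch. The runtime
# load_data() flattens nested questions into one sample per question;
# the deleted load_json_benchmark shortcut uploaded nested rows as-is,
# so build_model_input() hit KeyError: 'options'.
raw_rows = [
    {"id": "svg-1/0", "svg": "<svg/>",
     "questions": [{"q": "Which shape?", "options": ["circle", "square"]},
                   {"q": "What color?", "options": ["red", "blue"]}]},
]

def flatten(rows):
    # One flat sample per nested question, with `options` promoted to the
    # top level, which is where the runtime loader looks for it.
    for row in rows:
        for i, q in enumerate(row["questions"]):
            yield {"sample_id": f"{row['id']}-{i}",
                   "prompt": q["q"],
                   "options": q["options"]}

flat = list(flatten(raw_rows))
assert all("options" in s for s in flat)  # no KeyError downstream
```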

A second bug in load_via_registry was over-eagerly excluding keys like video_path, input_image, source_image, input_composite from metadata — assuming they'd be packed into the image column. That column holds at most one PIL blob, so path-valued keys were silently lost. This is why temporal-2/3 (video_path) and layout-8 (input_image) also broke.
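A minimal sketch of the exclusion bug, using hypothetical set names and a hypothetical `build_metadata` helper (the real logic lives in load_via_registry):

```python
# Hypothetical sketch of the metadata-exclusion bug and its fix.
# Before: path-valued keys were dropped on the (wrong) assumption
# they'd be packed into the single-PIL `image` column.
EXCLUDED_BEFORE = {"sample_id", "ground_truth", "prompt",
                   "video_path", "input_image", "source_image",
                   "input_composite"}

# After: only keys that are faithfully preserved in their own parquet
# columns are excluded; everything else survives in metadata.
EXCLUDED_AFTER = {"sample_id", "ground_truth", "prompt"}

def build_metadata(sample: dict, excluded: set) -> dict:
    return {k: v for k, v in sample.items() if k not in excluded}

sample = {"sample_id": "temporal-2/0001", "prompt": "...",
          "ground_truth": "B", "video_path": "videos/clip_0001.mp4"}
assert "video_path" not in build_metadata(sample, EXCLUDED_BEFORE)  # lost
assert "video_path" in build_metadata(sample, EXCLUDED_AFTER)       # kept
```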

Changes

  • Delete load_csv_benchmark, load_json_benchmark, load_manifest_benchmark and the MANIFEST_BENCHMARKS config table. Upload now routes exclusively through load_via_registry, which calls bench.load_data() — one source of truth.
  • In load_via_registry, only exclude sample_id, ground_truth, prompt from metadata (those are faithfully preserved in their own parquet columns). Everything else survives in metadata, including path-valued keys that the image column doesn't cover.
  • Strip absolute-path prefixes from metadata so parquets are portable across machines.
  • Add per-500-sample progress logging so slow load_data() invocations aren't a silent black box.
  • src/gdb/hf.py: guard against non-PIL image column values (Value(\"string\") for generation-only configs) to avoid 'str' object has no attribute 'save'.
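The non-PIL guard can be sketched roughly as follows. The function name and PNG round-trip are assumptions for illustration (duck-typed here to keep the sketch dependency-free); the point is simply that string-valued image columns must be detected before calling .save():

```python
import io

# Hypothetical sketch of the guard in src/gdb/hf.py. Generation-only
# configs store the image column as Value("string"), so blindly calling
# .save() raised: 'str' object has no attribute 'save'.
def image_to_png_bytes(value):
    """Return PNG bytes for a PIL-like image; None for strings/missing."""
    if value is None or isinstance(value, str):
        return None  # Value("string") configs carry paths, not pixels
    buf = io.BytesIO()
    value.save(buf, format="PNG")  # only reached for real image objects
    return buf.getvalue()
```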

Verification

In-process sweep (load_from_hub → bench.build_model_input):

Before: 31 / 39 passing
After:  38 / 39 passing

Fixed: svg-1, svg-2, svg-5, layout-8, temporal-1, temporal-2, temporal-3.

layout-2 upload is in progress — its load_data() takes ~10 min on macOS (realpath-bound on a ~2k-row manifest with nested asset resolution); the code fix is verified via local round-trip but the re-upload is still running and will land in a follow-up commit once complete.

Test plan

  • Local round-trip test for all 8 broken benchmarks (passes)
  • Regression test for 11 representative currently-passing benchmarks (no regressions)
  • End-to-end HF sweep showing 38/39 benchmarks now pass build_model_input
  • layout-2 re-upload completes and brings count to 39/39

Made with Cursor

…ndling

The upload script had three alternative code paths (load_csv_benchmark,
load_json_benchmark, load_manifest_benchmark) that bypassed each
benchmark's own load_data() to save time on dataset enumeration. Those
shortcuts drifted from the contract build_model_input() expects at
runtime, producing nested / half-formed sample rows on HF that crashed
downstream. The sweep showed 8 of 39 benchmarks broken on HF:

    svg-1, svg-2, svg-5        KeyError: 'options' / 'original_svg'
    layout-2, layout-8         TypeError / KeyError: 'input_image'
    temporal-1, temporal-2, temporal-3
                               KeyError: 'shuffled_keyframe_paths' / 'video_path'

Two bugs were at play:

1. The shortcut paths produced the wrong sample shape. Fix: delete
   them; always go through load_via_registry so HF parquet is
   round-trip equivalent to local load_data() output.

2. load_via_registry was over-eagerly excluding keys like video_path,
   input_image, source_image, input_composite from metadata, assuming
   they'd be packed into the `image` column. That column holds at most
   one PIL blob, so path-valued keys were silently lost. Fix: only
   exclude sample_id / ground_truth / prompt (the columns that
   faithfully preserve those values); everything else survives in
   metadata.

Also:
- Strip absolute-path prefixes from metadata (portability — HF
  consumers don't share the uploader's filesystem layout).
- Add per-500-sample progress logging so slow load_data() invocations
  aren't a silent black box.
- hf.py: guard against non-PIL image column values (Value("string")
  for generation-task configs) to avoid `'str' object has no attribute
  'save'`.
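The path-stripping step could look roughly like the sketch below; the function name, the `data_root` parameter, and the rewrite-to-relative strategy are assumptions for illustration, not the actual implementation:

```python
import os

# Hypothetical sketch of absolute-path stripping for portability:
# paths under the uploader's data root are rewritten relative to it,
# so HF consumers don't see another machine's filesystem layout.
def strip_abs_prefix(meta: dict, data_root: str) -> dict:
    out = {}
    for k, v in meta.items():
        if isinstance(v, str) and os.path.isabs(v) and v.startswith(data_root):
            out[k] = os.path.relpath(v, data_root)
        else:
            out[k] = v  # non-path values pass through untouched
    return out
```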

Verification: in-process sweep (load_from_hub → build_model_input) now
reports 38/39 benchmarks passing, up from 31/39. layout-2 upload
pending completion (load_data() takes ~10 min on macOS — one-time
upload cost, not a runtime concern).

Made-with: Cursor
@mohitgargai mohitgargai requested a review from purvanshi as a code owner April 24, 2026 13:37
- drop SKIP_BENCHMARKS (never populated) and its guard in load_benchmark
- trim load_benchmark / _normalize_paths / _merge_card_configs docstrings
  (historical refactor narrative belongs in the commit log, not the code)
- collapse multi-line comments in load_via_registry and hf.load_from_hub
  to one-liners that describe invariants rather than restate the code
- drop unused api param from _merge_card_configs
- shorten hf.py module/function docstrings

No behavior change.

Made-with: Cursor
@mohitgargai mohitgargai merged commit cbb11c0 into main Apr 24, 2026
12 checks passed
@mohitgargai mohitgargai deleted the fix/hf-schema-drift branch April 24, 2026 14:00