
fix(hf): route all uploads through bench.load_data() #10

Merged
mohitgargai merged 3 commits into main from fix/hf-schema-drift
Apr 24, 2026

Conversation

@mohitgargai
Contributor

Summary

Fixes schema drift between the uploaded HF parquet files (lica-world/GDB) and what each benchmark's build_model_input() expects at runtime. The first-user drill surfaced a crash on gdb eval --benchmarks svg-1 (KeyError: 'options'); an in-process sweep of all 39 benchmarks against HF showed 8 broken, all traceable to upload-time schema drift.

Root causes

The upload script had three alternative code paths (load_csv_benchmark, load_json_benchmark, load_manifest_benchmark) that bypassed each benchmark's own load_data() to save time on dataset enumeration. They drifted from the runtime contract — e.g. svg-1's local load_data() expands nested questions into ~1200 flat samples per question, but load_json_benchmark produced 300 rows with the nested structure dumped verbatim into metadata, so build_model_input() crashed.
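The shape mismatch can be illustrated with a minimal sketch. The row layout, field names, and `flatten` helper below are hypothetical stand-ins, not the project's actual code; they only show why a nested upload breaks a loader that expects one flat sample per question with `options` at the top level:

```python
# Hypothetical illustration of the svg-1 shape mismatch. The runtime
# load_data() flattens nested questions into one sample per question;
# the deleted load_json_benchmark shortcut uploaded nested rows as-is,
# so build_model_input() hit KeyError: 'options'.
raw_rows = [
    {"id": "svg-1/0", "svg": "<svg/>",
     "questions": [{"q": "Which shape?", "options": ["circle", "square"]},
                   {"q": "What color?", "options": ["red", "blue"]}]},
]

def flatten(rows):
    # One flat sample per nested question, with `options` promoted to the
    # top level, which is where the runtime loader looks for it.
    for row in rows:
        for i, q in enumerate(row["questions"]):
            yield {"sample_id": f"{row['id']}-{i}",
                   "prompt": q["q"],
                   "options": q["options"]}

flat = list(flatten(raw_rows))
assert all("options" in s for s in flat)  # no KeyError downstream
```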

A second bug in load_via_registry was over-eagerly excluding keys like video_path, input_image, source_image, input_composite from metadata — assuming they'd be packed into the image column. That column holds at most one PIL blob, so path-valued keys were silently lost. This is why temporal-2/3 (video_path) and layout-8 (input_image) also broke.
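A minimal sketch of the exclusion bug, using hypothetical set names and a hypothetical `build_metadata` helper (the real logic lives in load_via_registry):

```python
# Hypothetical sketch of the metadata-exclusion bug and its fix.
# Before: path-valued keys were dropped on the (wrong) assumption
# they'd be packed into the single-PIL `image` column.
EXCLUDED_BEFORE = {"sample_id", "ground_truth", "prompt",
                   "video_path", "input_image", "source_image",
                   "input_composite"}

# After: only keys that are faithfully preserved in their own parquet
# columns are excluded; everything else survives in metadata.
EXCLUDED_AFTER = {"sample_id", "ground_truth", "prompt"}

def build_metadata(sample: dict, excluded: set) -> dict:
    return {k: v for k, v in sample.items() if k not in excluded}

sample = {"sample_id": "temporal-2/0001", "prompt": "...",
          "ground_truth": "B", "video_path": "videos/clip_0001.mp4"}
assert "video_path" not in build_metadata(sample, EXCLUDED_BEFORE)  # lost
assert "video_path" in build_metadata(sample, EXCLUDED_AFTER)       # kept
```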

Changes

  • Delete load_csv_benchmark, load_json_benchmark, load_manifest_benchmark and the MANIFEST_BENCHMARKS config table. Upload now routes exclusively through load_via_registry, which calls bench.load_data() — one source of truth.
  • In load_via_registry, only exclude sample_id, ground_truth, prompt from metadata (those are faithfully preserved in their own parquet columns). Everything else survives in metadata, including path-valued keys that the image column doesn't cover.
  • Strip absolute-path prefixes from metadata so parquets are portable across machines.
  • Add per-500-sample progress logging so slow load_data() invocations aren't a silent black box.
  • src/gdb/hf.py: guard against non-PIL image column values (Value(\"string\") for generation-only configs) to avoid 'str' object has no attribute 'save'.
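The non-PIL guard can be sketched roughly as follows. The function name and PNG round-trip are assumptions for illustration (duck-typed here to keep the sketch dependency-free); the point is simply that string-valued image columns must be detected before calling .save():

```python
import io

# Hypothetical sketch of the guard in src/gdb/hf.py. Generation-only
# configs store the image column as Value("string"), so blindly calling
# .save() raised: 'str' object has no attribute 'save'.
def image_to_png_bytes(value):
    """Return PNG bytes for a PIL-like image; None for strings/missing."""
    if value is None or isinstance(value, str):
        return None  # Value("string") configs carry paths, not pixels
    buf = io.BytesIO()
    value.save(buf, format="PNG")  # only reached for real image objects
    return buf.getvalue()
```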

Verification

In-process sweep (load_from_hub → bench.build_model_input):

Before: 31 / 39 passing
After:  38 / 39 passing

Fixed: svg-1, svg-2, svg-5, layout-8, temporal-1, temporal-2, temporal-3.

layout-2 upload is in progress — its load_data() takes ~10 min on macOS (realpath-bound on a ~2k-row manifest with nested asset resolution); the code fix is verified via local round-trip but the re-upload is still running and will land in a follow-up commit once complete.

Test plan

  • Local round-trip test for all 8 broken benchmarks (passes)
  • Regression test for 11 representative currently-passing benchmarks (no regressions)
  • End-to-end HF sweep showing 38/39 benchmarks now pass build_model_input
  • layout-2 re-upload completes and brings count to 39/39

Made with Cursor

…ndling

The upload script had three alternative code paths (load_csv_benchmark,
load_json_benchmark, load_manifest_benchmark) that bypassed each
benchmark's own load_data() to save time on dataset enumeration. Those
shortcuts drifted from the contract build_model_input() expects at
runtime, producing nested / half-formed sample rows on HF that crashed
downstream. The sweep showed 8 of 39 benchmarks broken on HF:

    svg-1, svg-2, svg-5        KeyError: 'options' / 'original_svg'
    layout-2, layout-8         TypeError / KeyError: 'input_image'
    temporal-1, temporal-2, temporal-3
                               KeyError: 'shuffled_keyframe_paths' / 'video_path'

Two bugs were at play:

1. The shortcut paths produced the wrong sample shape. Fix: delete
   them; always go through load_via_registry so HF parquet is
   round-trip equivalent to local load_data() output.

2. load_via_registry was over-eagerly excluding keys like video_path,
   input_image, source_image, input_composite from metadata, assuming
   they'd be packed into the `image` column. That column holds at most
   one PIL blob, so path-valued keys were silently lost. Fix: only
   exclude sample_id / ground_truth / prompt (the columns that
   faithfully preserve those values); everything else survives in
   metadata.

Also:
- Strip absolute-path prefixes from metadata (portability — HF
  consumers don't share the uploader's filesystem layout).
- Add per-500-sample progress logging so slow load_data() invocations
  aren't a silent black box.
- hf.py: guard against non-PIL image column values (Value("string")
  for generation-task configs) to avoid `'str' object has no attribute
  'save'`.
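The path-stripping step could look roughly like the sketch below; the function name, the `data_root` parameter, and the rewrite-to-relative strategy are assumptions for illustration, not the actual implementation:

```python
import os

# Hypothetical sketch of absolute-path stripping for portability:
# paths under the uploader's data root are rewritten relative to it,
# so HF consumers don't see another machine's filesystem layout.
def strip_abs_prefix(meta: dict, data_root: str) -> dict:
    out = {}
    for k, v in meta.items():
        if isinstance(v, str) and os.path.isabs(v) and v.startswith(data_root):
            out[k] = os.path.relpath(v, data_root)
        else:
            out[k] = v  # non-path values pass through untouched
    return out
```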

Verification: in-process sweep (load_from_hub → build_model_input) now
reports 38/39 benchmarks passing, up from 31/39. layout-2 upload
pending completion (load_data() takes ~10 min on macOS — one-time
upload cost, not a runtime concern).

Made-with: Cursor
@mohitgargai mohitgargai requested a review from purvanshi as a code owner April 24, 2026 13:37
- drop SKIP_BENCHMARKS (never populated) and its guard in load_benchmark
- trim load_benchmark / _normalize_paths / _merge_card_configs docstrings
  (historical refactor narrative belongs in the commit log, not the code)
- collapse multi-line comments in load_via_registry and hf.load_from_hub
  to one-liners that describe invariants rather than restate the code
- drop unused api param from _merge_card_configs
- shorten hf.py module/function docstrings

No behavior change.

Made-with: Cursor
@mohitgargai mohitgargai merged commit cbb11c0 into main Apr 24, 2026
12 checks passed
@mohitgargai mohitgargai deleted the fix/hf-schema-drift branch April 24, 2026 14:00