fix(hf): route all uploads through bench.load_data() #10
Merged · mohitgargai merged 3 commits into main · Apr 24, 2026
Conversation
The upload script had three alternative code paths (load_csv_benchmark,
load_json_benchmark, load_manifest_benchmark) that bypassed each
benchmark's own load_data() to save time on dataset enumeration. Those
shortcuts drifted from the contract build_model_input() expects at
runtime, producing nested / half-formed sample rows on HF that crashed
downstream. The sweep showed 8 of 39 benchmarks broken on HF:
  svg-1, svg-2, svg-5                 KeyError: 'options' / 'original_svg'
  layout-2, layout-8                  TypeError / KeyError: 'input_image'
  temporal-1, temporal-2, temporal-3  KeyError: 'shuffled_keyframe_paths' / 'video_path'
Two bugs were at play:
1. The shortcut paths produced the wrong sample shape. Fix: delete
them; always go through load_via_registry so HF parquet is
round-trip equivalent to local load_data() output.
2. load_via_registry was over-eagerly excluding keys like video_path,
input_image, source_image, input_composite from metadata, assuming
they'd be packed into the `image` column. That column holds at most
one PIL blob, so path-valued keys were silently lost. Fix: only
exclude sample_id / ground_truth / prompt (the columns that
faithfully preserve those values); everything else survives in
metadata (a sketch of the corrected row construction follows this list).
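For concreteness, here is a minimal sketch of the corrected row
construction. The exclusion set and the per-500-sample logging come from
this PR; the hypothetical `upload_rows` helper, the column layout, and
the JSON packing of metadata are illustrative assumptions, not the
actual upload script.

```python
import json

# Columns that get their own parquet columns (per this PR); everything
# else must survive in metadata, including path-valued keys.
EXCLUDED = {"sample_id", "ground_truth", "prompt"}

def upload_rows(bench):
    """Build HF upload rows from the benchmark's own load_data() output.

    Sketch only: assumes sample dicts with JSON-serializable values
    (paths, strings, numbers); image-column handling is omitted.
    """
    rows = []
    for i, sample in enumerate(bench.load_data()):
        # Keep path-valued keys (video_path, input_image, ...) in
        # metadata instead of assuming the image column carries them.
        metadata = {k: v for k, v in sample.items() if k not in EXCLUDED}
        rows.append({
            "sample_id": sample["sample_id"],
            "ground_truth": sample.get("ground_truth"),
            "prompt": sample.get("prompt"),
            "metadata": json.dumps(metadata),
        })
        if (i + 1) % 500 == 0:
            print(f"  loaded {i + 1} samples...")
    return rows
```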
Also:
- Strip absolute-path prefixes from metadata (portability — HF
consumers don't share the uploader's filesystem layout).
- Add per-500-sample progress logging so slow load_data() invocations
aren't a silent black box.
- hf.py: guard against non-PIL image column values (Value("string")
for generation-task configs) to avoid `'str' object has no attribute
'save'` (see the guard sketch after this list).
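A minimal sketch of that guard as it could sit in hf.py's loading path;
`materialize_image` is a hypothetical helper name, and only the
`isinstance` check reflects this PR.

```python
from PIL import Image

def materialize_image(value, dest_path):
    """Save the image column to disk only when it holds a PIL blob.

    Generation-task configs declare the column as Value("string"), so
    calling .save() unconditionally raised
    "'str' object has no attribute 'save'".
    """
    if isinstance(value, Image.Image):
        value.save(dest_path)
        return dest_path
    return value  # non-PIL values (e.g. a plain string) pass through
```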
Verification: an in-process sweep (load_from_hub → build_model_input,
sketched below) now reports 38/39 benchmarks passing, up from 31/39.
layout-2 upload
pending completion (load_data() takes ~10 min on macOS — one-time
upload cost, not a runtime concern).
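The sweep is roughly the following; `load_from_hub` and
`build_model_input` are named in this PR, while the per-benchmark
iteration, the `name` attribute, and the reporting are assumptions for
illustration.

```python
def sweep(benchmarks):
    """Round-trip each benchmark: HF parquet -> build_model_input."""
    passed, failed = [], []
    for bench in benchmarks:
        try:
            for sample in bench.load_from_hub():
                bench.build_model_input(sample)  # crashes on schema drift
            passed.append(bench.name)
        except Exception as exc:
            failed.append((bench.name, f"{type(exc).__name__}: {exc}"))
    print(f"{len(passed)}/{len(passed) + len(failed)} benchmarks passing")
    for name, err in failed:
        print(f"  FAIL {name}: {err}")
```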
Made-with: Cursor
- drop SKIP_BENCHMARKS (never populated) and its guard in load_benchmark
- trim load_benchmark / _normalize_paths / _merge_card_configs docstrings (historical refactor narrative belongs in the commit log, not the code)
- collapse multi-line comments in load_via_registry and hf.load_from_hub to one-liners that describe invariants rather than restate the code
- drop unused api param from _merge_card_configs
- shorten hf.py module/function docstrings

No behavior change.

Made-with: Cursor
Summary
Fixes schema drift between the uploaded HF parquet files (`lica-world/GDB`) and what each benchmark's `build_model_input()` expects at runtime. The first-user drill surfaced a crash on `gdb eval --benchmarks svg-1` (`KeyError: 'options'`); an in-process sweep of all 39 benchmarks against HF showed 8 broken — all for the same root cause.

Root causes
The upload script had three alternative code paths (`load_csv_benchmark`, `load_json_benchmark`, `load_manifest_benchmark`) that bypassed each benchmark's own `load_data()` to save time on dataset enumeration. They drifted from the runtime contract — e.g. `svg-1`'s local `load_data()` expands nested `questions` into ~1200 flat samples (one per question), but `load_json_benchmark` produced 300 rows with the nested structure dumped verbatim into metadata, so `build_model_input()` crashed.

A second bug: `load_via_registry` was over-eagerly excluding keys like `video_path`, `input_image`, `source_image`, `input_composite` from `metadata` — assuming they'd be packed into the `image` column. That column holds at most one PIL blob, so path-valued keys were silently lost. This is why `temporal-2`/`temporal-3` (`video_path`) and `layout-8` (`input_image`) also broke.

Changes
- Delete `load_csv_benchmark`, `load_json_benchmark`, `load_manifest_benchmark` and the `MANIFEST_BENCHMARKS` config table. Upload now routes exclusively through `load_via_registry`, which calls `bench.load_data()` — one source of truth.
- In `load_via_registry`, only exclude `sample_id`, `ground_truth`, `prompt` from `metadata` (those are faithfully preserved in their own parquet columns). Everything else survives in metadata, including path-valued keys that the `image` column doesn't cover.
- Add per-500-sample progress logging so slow `load_data()` invocations aren't a silent black box.
- `src/gdb/hf.py`: guard against non-PIL `image` column values (`Value("string")` for generation-only configs) to avoid `'str' object has no attribute 'save'`.

Verification
In-process sweep (`load_from_hub` → `bench.build_model_input`):

Fixed: `svg-1`, `svg-2`, `svg-5`, `layout-8`, `temporal-1`, `temporal-2`, `temporal-3`. `layout-2` upload is in progress — its `load_data()` takes ~10 min on macOS (realpath-bound on a ~2k-row manifest with nested asset resolution); the code fix is verified via local round-trip, but the re-upload is still running and will land in a follow-up commit once complete.

Test plan
- In-process sweep against HF: `load_from_hub` → `build_model_input`
- `layout-2` re-upload completes and brings count to 39/39

Made with Cursor