
feat(cli): gdb console script, bundled verify fixture, and v0-* suites (0.2.0)#7

Merged
mohitgargai merged 6 commits into main from feat/gdb-cli-and-bundled-verify
Apr 24, 2026
Conversation

@mohitgargai
Contributor

Summary

Makes lica-gdb installable and runnable with one command on a fresh
machine. After pip install lica-gdb, researchers can now run
gdb verify to confirm the install is working end-to-end (zero API keys,
no dataset download), and gdb eval --provider ... --suite v0-all to
actually benchmark a model. This is the reproducibility foundation the
repo needs before we attempt any external-facing leaderboard/reference
site work.

Also clarifies the long-standing "39 vs 49 tasks" confusion in the README,
and demotes aspirational v1.0-* suite names to v0-* until the
evaluation definitions are formally frozen.

Changes

  • New gdb console script (src/gdb/cli.py) exposing list,
    info, suites, eval, verify, submit, collect, score,
    report. Wired up via [project.scripts] in pyproject.toml.
    python -m gdb now delegates to the same entry point.
  • Named suites (src/gdb/suites.py): v0-all, v0-smoke,
    v0-understanding, v0-generation. The v0-* prefix is intentional
    — the module docstring spells out that evaluation definitions are not
    yet frozen and what needs to happen before a v1.0-* family ships.
  • Bundled verify fixture (src/gdb/_verify_data/, ~216 KB, 27
    files): a downsampled slice of the public Lica dataset covering the
    v0-smoke suite. gdb verify defaults to this fixture when no data
    source is supplied, so the smoke test runs offline with zero API keys.
    Reproducible via scripts/build_verify_dataset.py.
  • scripts/run_benchmarks.py is now a deprecation shim. Behavior
    unchanged; docstring and a stderr warning point users at the new CLI.
  • README updates: (1) quickstart examples migrated from
    python scripts/run_benchmarks.py ... to gdb ...; (2) Benchmarks
    section now separates the 39 repo pipelines from the 45 paper tasks
    they cover and explicitly names the 4 paper tasks (layout-u-5..8)
    that don't yet have a runnable pipeline; (3) a new gdb submit /
    gdb collect example.
  • Version bump to 0.2.0 (no breaking API changes).
  • __version__ fix: importlib.metadata.version now looks up the
    correct distribution name (lica-gdb, not gdb), so gdb --version
    prints the installed version instead of a stale 0.1.0 fallback.

Testing

  • ruff check src/ scripts/ passes locally
  • python scripts/run_benchmarks.py --list still works (shim emits
    deprecation notice on stderr, exits cleanly)
  • gdb list shows 39 benchmarks
  • gdb suites shows the four v0-* suites with correct counts
  • gdb verify completes end-to-end against the bundled fixture, no
    --dataset-root required
  • gdb eval --stub-model --suite v0-smoke --dataset-root src/gdb/_verify_data
    completes for all 6 smoke tasks
  • gdb eval --stub-model --benchmarks layout-4 typography-1 --output <file>.json
    produces a valid single-file report
  • Verified bundled _verify_data/ contains no absolute paths,
    credentials, emails, or URLs; only relative paths and ordinary
    public-Lica design annotations

Real-provider paths (gdb eval --provider openai|gemini|anthropic|hf|vllm|...,
gdb submit, gdb collect, gdb score, gdb report) were not smoke-tested in this
PR. They share the same underlying runner code as the deprecated script, but
it's worth confirming at least one provider end-to-end before tagging 0.2.0
as a PyPI release.

Known follow-ups (not in this PR)

  • scripts/README.md still describes the old run_benchmarks.py
    invocations verbatim; worth a separate migration pass.
  • Formal benchmark versioning: write BENCHMARK.md pinning tasks,
    prompts, sample selection, and metric wiring, and surface a fingerprint
    via gdb suites --show-fingerprint. Prerequisite for introducing a
    v1.0-* suite family that papers can cite.
  • The 4 unimplemented paper tasks (layout-u-5..8) either need
    pipelines or an explicit note in the paper's next revision.

Made with Cursor

Introduce `gdb.cli` as the single CLI entry point and wire it up as a
`[project.scripts]` console script, so `pip install lica-gdb` gives users a
`gdb` command directly. Subcommands: `list`, `info`, `suites`, `eval`,
`verify`, `submit`, `collect`, `score`, `report`.
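
The console-script wiring described above boils down to a few lines of `pyproject.toml`; a sketch of the relevant fragments (the entry-point target `gdb.cli:main` matches the `python -m gdb` delegation noted in this PR, but treat the exact layout as illustrative):

```toml
[project]
name = "lica-gdb"
version = "0.2.0"

[project.scripts]
# `pip install lica-gdb` generates a `gdb` executable that calls gdb.cli:main.
gdb = "gdb.cli:main"
```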

Also add `gdb.suites`, which defines named benchmark suites. The module
docstring is explicit that evaluation definitions are not yet frozen, so
suites live under the `v0-*` prefix (`v0-all`, `v0-smoke`,
`v0-understanding`, `v0-generation`). A `v1.0-*` family will be introduced
once tasks, prompts, sample selection, and metric wiring are pinned with a
documented fingerprint; until then, cite suite name + `lica-gdb` version.
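
A minimal sketch of what such a named-suite registry can look like. The benchmark ids below are placeholders, not the repo's real ids, and `expand_suite` is an illustrative helper rather than the actual `gdb.suites` API:

```python
# Hypothetical sketch of a named-suite registry in the spirit of gdb.suites.
SUITES: dict[str, list[str]] = {
    "v0-smoke": ["category-1", "layout-4", "typography-1", "svg-1", "template-1", "temporal-1"],
    "v0-understanding": ["category-1", "layout-1", "layout-2"],
    "v0-generation": ["svg-1", "template-1"],
}
# v0-all is derived from the others so it can never drift out of sync.
SUITES["v0-all"] = sorted({b for ids in SUITES.values() for b in ids})


def expand_suite(name: str) -> list[str]:
    """Resolve a suite name to its benchmark ids, with a helpful error."""
    try:
        return SUITES[name]
    except KeyError:
        known = ", ".join(sorted(SUITES))
        raise SystemExit(f"unknown suite {name!r}; known suites: {known}")
```

Deriving `v0-all` rather than listing it by hand is one way to keep suite counts (as reported by `gdb suites`) consistent by construction.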

Other changes:
- `python -m gdb` now delegates to `gdb.cli.main` (one-line shim).
- `__version__` looks up the correct distribution name (`lica-gdb`, not
  `gdb`), so `gdb --version` prints the installed version instead of the
  hard-coded `0.1.0` fallback.

Made-with: Cursor
Add `src/gdb/_verify_data/` (27 files, ~216 KB), a downsampled slice of the
public Lica dataset covering the `v0-smoke` suite only: one benchmark per
domain (category, layout, typography, svg, template), ~2 samples each.

`gdb verify` now defaults to this bundled fixture when neither
`--dataset-root` nor `--data` is supplied, so a fresh
`pip install lica-gdb && gdb verify` completes end-to-end with zero API
keys and no dataset downloads. This is the smoke path only; real
evaluations still point at the full dataset (local bundle or HuggingFace
Hub).

Also add `scripts/build_verify_dataset.py`, the reproducible builder for
the fixture, so the bundled data can be regenerated from any future
revision of the Lica dataset. The `[tool.setuptools.package-data]` block
is extended to ship the fixture inside the wheel.
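
Shipping the fixture as package data means `gdb verify` can locate it via `importlib.resources` instead of a hard-coded path. A sketch of the pattern (`bundled_data_root` is an illustrative helper, not the repo's API):

```python
from importlib import resources
from pathlib import Path


def bundled_data_root(package: str = "gdb", subdir: str = "_verify_data") -> Path:
    # Resolve a data directory shipped inside an installed package, so the
    # fixture is found whether the package came from a wheel or a checkout.
    return Path(str(resources.files(package))) / subdir
```

This works as long as the package is installed on the filesystem (the normal wheel case); zip-imported packages would need the `resources.as_file` context manager instead.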

Made-with: Cursor
Rewrite the module docstring as a deprecation notice with explicit
command equivalents (`--list` → `gdb list`, `--stub-model` →
`gdb eval --stub-model` or `gdb verify`, `--batch-submit` → `gdb submit`,
`--collect` → `gdb collect`), and emit a one-line stderr warning from
`main()` so CI and ad-hoc callers see the notice at runtime.

Behavior is unchanged; existing invocations and the CI `--list` smoke
test continue to work.
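
The shim pattern can be sketched as follows; the notice wording and the delegation import are illustrative, not the repo's exact code:

```python
import sys

# Illustrative sketch of the run_benchmarks.py deprecation shim.
DEPRECATION_NOTICE = (
    "run_benchmarks.py is deprecated; use the `gdb` CLI instead "
    "(--list -> gdb list, --stub-model -> gdb eval --stub-model, "
    "--batch-submit -> gdb submit, --collect -> gdb collect)"
)


def warn_deprecated(stream=sys.stderr) -> None:
    # One line on stderr keeps stdout clean for callers parsing output.
    print(DEPRECATION_NOTICE, file=stream)


def main(argv=None) -> int:
    warn_deprecated()
    from gdb.cli import main as cli_main  # delegate; behavior unchanged
    return cli_main(argv)
```

Warning on stderr rather than stdout is what lets the CI `--list` smoke test keep passing: anything scraping stdout sees exactly the old output.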

Made-with: Cursor
The headline count mismatch between the paper (49 tasks) and the repo (39
runnable benchmarks) has been a repeated source of confusion. Rewrite the
Benchmarks section to spell out the relationship:

- The paper defines 49 evaluation tasks.
- This repo ships 39 benchmark pipelines covering 45 of those tasks, via
  an N:1 mapping where two pipelines (`typography-3`, `temporal-3`) score
  multiple paper tasks from a single model call.
- The remaining four paper tasks are layout-understanding only and do not
  have runnable pipelines in the repo yet (`layout-u-5..8`); they are
  listed by name so readers can cross-reference the paper.

The Totals row in the benchmarks table now sums to 39 / 45, matching what
`gdb list` reports.

Made-with: Cursor
Replace every `python scripts/run_benchmarks.py ...` example in the
Verify, Run benchmarks, and Vertex AI sections with the equivalent
`gdb` invocation, and add a zero-config `gdb verify` line showcasing the
bundled fixture. Also:

- Add a `gdb eval --suite v0-all` example for whole-suite runs.
- Add a `gdb submit` / `gdb collect` pair for the batch path.
- Call out the `v0-*` suite naming in line with `src/gdb/suites.py`.
- Mark `scripts/run_benchmarks.py` as a deprecated shim in the project
  tree and add `scripts/build_verify_dataset.py` alongside it.

Made-with: Cursor
This release contains:

- New `gdb` console script with `list`, `info`, `suites`, `eval`,
  `verify`, `submit`, `collect`, `score`, `report` subcommands.
- `gdb verify` now runs a zero-config smoke test against a bundled
  fixture — no API keys, no dataset download required.
- Named benchmark suites (`v0-all`, `v0-smoke`, `v0-understanding`,
  `v0-generation`). The `v0-*` prefix is deliberate: evaluation
  definitions are not yet frozen. A `v1.0-*` family will be introduced
  once tasks, prompts, sample selection, and metric wiring are pinned.
- `scripts/run_benchmarks.py` is now a deprecation shim; prefer `gdb`.
- README quickstart migrated to the `gdb` CLI.
- Benchmarks table now clearly separates repo pipelines (39) from
  paper tasks (49, of which 45 are covered).

No breaking API changes: `python -m gdb` and the legacy
`scripts/run_benchmarks.py` invocations continue to work.

Made-with: Cursor
@mohitgargai mohitgargai requested a review from purvanshi as a code owner April 24, 2026 10:05
@mohitgargai mohitgargai changed the title to feat(cli): gdb console script, bundled verify fixture, and v0-* suites (0.2.0) on Apr 24, 2026. The previous title had the `gdb --help` output pasted into it:

    usage: gdb [-h] [--version] [-v] COMMAND ...

    GDB — GraphicDesignBench CLI

    positional arguments:
      COMMAND
        list      List registered benchmarks.
        info      Show details for a single benchmark.
        suites    List named suites (or expand one).
        eval      Run online inference against one or more benchmarks.
        verify    Smoke-test the install with the stub model (no API keys).
        submit    Submit a batch-API job for a single benchmark.
        collect   Collect results from a previous gdb submit manifest.
        score     Re-score a precomputed CSV of model outputs.
        report    Render a run-report JSON as markdown.

    options:
      -h, --help     show this help message and exit
      --version      Print the installed lica-gdb version and exit.
      -v, --verbose  Enable debug logging.
@mohitgargai mohitgargai merged commit ff346cf into main Apr 24, 2026
4 checks passed
@mohitgargai mohitgargai deleted the feat/gdb-cli-and-bundled-verify branch April 24, 2026 10:07