feat(cli): gdb console script, bundled verify fixture, and v0-* suites (0.2.0) #7
Merged
mohitgargai merged 6 commits into main · Apr 24, 2026
Conversation
Introduce `gdb.cli` as the single CLI entry point and wire it up as a `[project.scripts]` console script, so `pip install lica-gdb` gives users a `gdb` command directly. Subcommands: `list`, `info`, `suites`, `eval`, `verify`, `submit`, `collect`, `score`, `report`.

Also add `gdb.suites`, which defines named benchmark suites. The module docstring is explicit that evaluation definitions are not yet frozen, so suites live under the `v0-*` prefix (`v0-all`, `v0-smoke`, `v0-understanding`, `v0-generation`). A `v1.0-*` family will be introduced once tasks, prompts, sample selection, and metric wiring are pinned with a documented fingerprint; until then, cite suite name + `lica-gdb` version.

Other changes:
- `python -m gdb` now delegates to `gdb.cli.main` (one-line shim).
- `__version__` looks up the correct distribution name (`lica-gdb`, not `gdb`), so `gdb --version` prints the installed version instead of the hard-coded `0.1.0` fallback.

Made-with: Cursor
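The version-lookup fix can be sketched as follows. The helper name `installed_version` is illustrative (the real code lives in the package `__init__`), but the `importlib.metadata` calls are standard library:

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(dist_name: str = "lica-gdb", fallback: str = "0.0.0") -> str:
    """Return the installed distribution's version, or a clearly-fake fallback.

    The key point of the fix: query the *distribution* name (lica-gdb),
    not the import package name (gdb), or the lookup always misses and
    the hard-coded fallback leaks into `gdb --version`.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return fallback
```

Querying `version("gdb")` in an environment where only `lica-gdb` is installed raises `PackageNotFoundError`, which is exactly how the stale `0.1.0` fallback kept surfacing.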
Add `src/gdb/_verify_data/` (27 files, ~216 KB), a downsampled slice of the public Lica dataset covering the `v0-smoke` suite only: one benchmark per domain (category, layout, typography, svg, template), ~2 samples each. `gdb verify` now defaults to this bundled fixture when neither `--dataset-root` nor `--data` is supplied, so a fresh `pip install lica-gdb && gdb verify` completes end-to-end with zero API keys and no dataset downloads. This is the smoke path only; real evaluations still point at the full dataset (local bundle or HuggingFace Hub).

Also add `scripts/build_verify_dataset.py`, the reproducible builder for the fixture, so the bundled data can be regenerated from any future revision of the Lica dataset. The `[tool.setuptools.package-data]` block is extended to ship the fixture inside the wheel.

Made-with: Cursor
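The default-resolution behavior might look like the sketch below; `resolve_data_root` is a hypothetical helper (the real flag handling lives in `gdb.cli`), but the precedence it encodes matches the description above:

```python
from pathlib import Path

# Packaged fixture location, relative to the (hypothetical) module file.
_BUNDLED_FIXTURE = Path(__file__).resolve().parent / "_verify_data"

def resolve_data_root(dataset_root=None, data=None):
    """Explicit --dataset-root wins, then --data, then the bundled v0-smoke slice."""
    if dataset_root is not None:
        return Path(dataset_root)
    if data is not None:
        return Path(data)
    # Offline smoke path: no API keys, no dataset download.
    return _BUNDLED_FIXTURE
```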
Rewrite the module docstring as a deprecation notice with explicit command equivalents (`--list` → `gdb list`, `--stub-model` → `gdb eval --stub-model` or `gdb verify`, `--batch-submit` → `gdb submit`, `--collect` → `gdb collect`), and emit a one-line stderr warning from `main()` so CI and ad-hoc callers see the notice at runtime. Behavior is unchanged; existing invocations and the CI `--list` smoke test continue to work. Made-with: Cursor
The headline count mismatch between the paper (49 tasks) and the repo (39 runnable benchmarks) has been a repeated source of confusion. Rewrite the Benchmarks section to spell out the relationship:
- The paper defines 49 evaluation tasks.
- This repo ships 39 benchmark pipelines covering 45 of those tasks, via an N:1 mapping where two pipelines (`typography-3`, `temporal-3`) score multiple paper tasks from a single model call.
- The remaining four paper tasks are layout-understanding only and do not have runnable pipelines in the repo yet (`layout-u-5..8`); they are listed by name so readers can cross-reference the paper.

The Totals row in the benchmarks table now sums to 39 / 45, matching what `gdb list` reports.

Made-with: Cursor
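The counts reconcile with a little arithmetic. The 4 + 4 split between the two multi-task pipelines below is an assumption for illustration; only their combined coverage of 8 extra tasks follows from the stated totals:

```python
# Numbers from this PR; the per-pipeline split (4 + 4) is assumed for illustration.
single_task_pipelines = 37
multi_task_pipelines = {"typography-3": 4, "temporal-3": 4}  # paper tasks scored per pipeline

repo_pipelines = single_task_pipelines + len(multi_task_pipelines)          # 39 pipelines
covered_tasks = single_task_pipelines + sum(multi_task_pipelines.values())  # 45 tasks covered
unrunnable = ["layout-u-5", "layout-u-6", "layout-u-7", "layout-u-8"]       # no pipelines yet
paper_tasks = covered_tasks + len(unrunnable)                               # 49 tasks in the paper

assert (repo_pipelines, covered_tasks, paper_tasks) == (39, 45, 49)
```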
Replace every `python scripts/run_benchmarks.py ...` example in the Verify, Run benchmarks, and Vertex AI sections with the equivalent `gdb` invocation, and add a zero-config `gdb verify` line showcasing the bundled fixture. Also:
- Add a `gdb eval --suite v0-all` example for whole-suite runs.
- Add a `gdb submit` / `gdb collect` pair for the batch path.
- Call out the `v0-*` suite naming in line with `src/gdb/suites.py`.
- Mark `scripts/run_benchmarks.py` as a deprecated shim in the project tree and add `scripts/build_verify_dataset.py` alongside it.

Made-with: Cursor
This release contains:
- New `gdb` console script with `list`, `info`, `suites`, `eval`, `verify`, `submit`, `collect`, `score`, `report` subcommands.
- `gdb verify` now runs a zero-config smoke test against a bundled fixture: no API keys, no dataset download required.
- Named benchmark suites (`v0-all`, `v0-smoke`, `v0-understanding`, `v0-generation`). The `v0-*` prefix is deliberate: evaluation definitions are not yet frozen. A `v1.0-*` family will be introduced once tasks, prompts, sample selection, and metric wiring are pinned.
- `scripts/run_benchmarks.py` is now a deprecation shim; prefer `gdb`.
- README quickstart migrated to the `gdb` CLI.
- Benchmarks table now clearly separates repo pipelines (39) from paper tasks (49, of which 45 are covered).

No breaking API changes: `python -m gdb` and the legacy `scripts/run_benchmarks.py` invocations continue to work.

Made-with: Cursor
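A hedged sketch of how `gdb.suites` might define the names; the benchmark ids listed here are placeholders, not the real suite contents, and `resolve_suite` is a hypothetical helper:

```python
# Placeholder benchmark ids; only the suite *names* come from the release notes.
SUITES: dict[str, list[str]] = {
    "v0-smoke": ["category-1", "layout-1", "svg-1"],
    "v0-understanding": ["category-1", "layout-1", "layout-2"],
    "v0-generation": ["svg-1", "template-1"],
}
# v0-all is the union of every named suite above.
SUITES["v0-all"] = sorted({b for ids in SUITES.values() for b in ids})

def resolve_suite(name: str) -> list[str]:
    """Map a suite name to its benchmark ids, failing loudly on unknown names."""
    if name not in SUITES:
        raise SystemExit(f"unknown suite {name!r}; run `gdb suites` to list them")
    return SUITES[name]
```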
`gdb --help` (excerpt):

```
  submit         Submit manifest.
  score          Re-score a precomputed CSV of model outputs.
  report         Render a run-report JSON as markdown.

options:
  -h, --help     show this help message and exit
  --version      Print the installed lica-gdb version and exit.
  -v, --verbose  Enable debug logging.
```
Summary
Makes `lica-gdb` installable and runnable with one command on a fresh machine. After `pip install lica-gdb`, researchers can now run `gdb verify` to confirm the install is working end-to-end (zero API keys, no dataset download), and `gdb eval --provider ... --suite v0-all` to actually benchmark a model. This is the reproducibility foundation the repo needs before we attempt any external-facing leaderboard/reference site work.
Also clarifies the long-standing "39 vs 49 tasks" confusion in the README, and demotes aspirational `v1.0-*` suite names to `v0-*` until the evaluation definitions are formally frozen.
Changes
- `gdb` console script (`src/gdb/cli.py`) exposing `list`, `info`, `suites`, `eval`, `verify`, `submit`, `collect`, `score`, `report`. Wired up via `[project.scripts]` in `pyproject.toml`. `python -m gdb` now delegates to the same entry point.
- Named suites (`src/gdb/suites.py`): `v0-all`, `v0-smoke`, `v0-understanding`, `v0-generation`. The `v0-*` prefix is intentional: the module docstring spells out that evaluation definitions are not yet frozen and what needs to happen before a `v1.0-*` family ships.
- Bundled verify fixture (`src/gdb/_verify_data/`, ~216 KB, 27 files): a downsampled slice of the public Lica dataset covering the `v0-smoke` suite. `gdb verify` defaults to this fixture when no data source is supplied, so the smoke test runs offline with zero API keys. Reproducible via `scripts/build_verify_dataset.py`.
- `scripts/run_benchmarks.py` is now a deprecation shim. Behavior unchanged; its docstring and a stderr warning point users at the new CLI.
- README: (1) every `python scripts/run_benchmarks.py ...` example migrated to `gdb ...`; (2) the Benchmarks section now separates the 39 repo pipelines from the 45 paper tasks they cover and explicitly names the 4 paper tasks (`layout-u-5..8`) that don't yet have a runnable pipeline; (3) a new `gdb submit`/`gdb collect` example.
- `__version__` fix: `importlib.metadata.version` now looks up the correct distribution name (`lica-gdb`, not `gdb`), so `gdb --version` prints the installed version instead of a stale `0.1.0` fallback.

Testing
- `ruff check src/ scripts/` passes locally
- `python scripts/run_benchmarks.py --list` still works (shim emits the deprecation notice on stderr, exits cleanly)
- `gdb list` shows 39 benchmarks
- `gdb suites` shows the four `v0-*` suites with correct counts
- `gdb verify` completes end-to-end against the bundled fixture, no `--dataset-root` required
- `gdb eval --stub-model --suite v0-smoke --dataset-root src/gdb/_verify_data` completes for all 6 smoke tasks
- `gdb eval --stub-model --benchmarks layout-4 typography-1 --output <file>.json` produces a valid single-file report
- `_verify_data/` contains no absolute paths, credentials, emails, or URLs; only relative paths and ordinary public-Lica design annotations
Real-provider paths (`gdb eval --provider openai|gemini|anthropic|hf|vllm|...`, `gdb submit`, `gdb collect`, `gdb score`, `gdb report`) were not smoke-tested in this PR. They share the same underlying runner code as the deprecated script, but are worth confirming on at least one provider before tagging 0.2.0 as a PyPI release.
Known follow-ups (not in this PR)
- `scripts/README.md` still describes the old `run_benchmarks.py` invocations verbatim; worth a separate migration pass.
- Pin the evaluation definitions in `BENCHMARK.md` (tasks, prompts, sample selection, and metric wiring) and surface a fingerprint via `gdb suites --show-fingerprint`. Prerequisite for introducing a `v1.0-*` suite family that papers can cite.
- The four uncovered layout-understanding paper tasks (`layout-u-5..8`) either need pipelines or an explicit note in the paper's next revision.
Made with Cursor