feat(cli): gdb console script, bundled verify fixture, and v0-* suites (0.2.0) #7
Merged
mohitgargai merged 6 commits into main · Apr 24, 2026
Conversation
Introduce `gdb.cli` as the single CLI entry point and wire it up as a `[project.scripts]` console script, so `pip install lica-gdb` gives users a `gdb` command directly. Subcommands: `list`, `info`, `suites`, `eval`, `verify`, `submit`, `collect`, `score`, `report`.

Also add `gdb.suites`, which defines named benchmark suites. The module docstring is explicit that evaluation definitions are not yet frozen, so suites live under the `v0-*` prefix (`v0-all`, `v0-smoke`, `v0-understanding`, `v0-generation`). A `v1.0-*` family will be introduced once tasks, prompts, sample selection, and metric wiring are pinned with a documented fingerprint; until then, cite suite name + `lica-gdb` version.

Other changes:
- `python -m gdb` now delegates to `gdb.cli.main` (one-line shim).
- `__version__` looks up the correct distribution name (`lica-gdb`, not `gdb`), so `gdb --version` prints the installed version instead of the hard-coded `0.1.0` fallback.

Made-with: Cursor
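The version-lookup fix can be sketched as follows. The helper name `installed_version` is illustrative (the real code lives in the package `__init__`), but the `importlib.metadata` calls are standard library:

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(dist_name: str = "lica-gdb", fallback: str = "0.0.0") -> str:
    """Return the installed distribution's version, or a clearly-fake fallback.

    The key point of the fix: query the *distribution* name (lica-gdb),
    not the import package name (gdb), or the lookup always misses and
    the hard-coded fallback leaks into `gdb --version`.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return fallback
```

Querying `version("gdb")` in an environment where only `lica-gdb` is installed raises `PackageNotFoundError`, which is exactly how the stale `0.1.0` fallback kept surfacing.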
Add `src/gdb/_verify_data/` (27 files, ~216 KB), a downsampled slice of the public Lica dataset covering the `v0-smoke` suite only: one benchmark per domain (category, layout, typography, svg, template), ~2 samples each. `gdb verify` now defaults to this bundled fixture when neither `--dataset-root` nor `--data` is supplied, so a fresh `pip install lica-gdb && gdb verify` completes end-to-end with zero API keys and no dataset downloads. This is the smoke path only; real evaluations still point at the full dataset (local bundle or HuggingFace Hub).

Also add `scripts/build_verify_dataset.py`, the reproducible builder for the fixture, so the bundled data can be regenerated from any future revision of the Lica dataset. The `[tool.setuptools.package-data]` block is extended to ship the fixture inside the wheel.

Made-with: Cursor
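The default-resolution behavior might look like the sketch below; `resolve_data_root` is a hypothetical helper (the real flag handling lives in `gdb.cli`), but the precedence it encodes matches the description above:

```python
from pathlib import Path

# Packaged fixture location, relative to the (hypothetical) module file.
_BUNDLED_FIXTURE = Path(__file__).resolve().parent / "_verify_data"

def resolve_data_root(dataset_root=None, data=None):
    """Explicit --dataset-root wins, then --data, then the bundled v0-smoke slice."""
    if dataset_root is not None:
        return Path(dataset_root)
    if data is not None:
        return Path(data)
    # Offline smoke path: no API keys, no dataset download.
    return _BUNDLED_FIXTURE
```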
Rewrite the module docstring as a deprecation notice with explicit command equivalents (`--list` → `gdb list`, `--stub-model` → `gdb eval --stub-model` or `gdb verify`, `--batch-submit` → `gdb submit`, `--collect` → `gdb collect`), and emit a one-line stderr warning from `main()` so CI and ad-hoc callers see the notice at runtime. Behavior is unchanged; existing invocations and the CI `--list` smoke test continue to work. Made-with: Cursor
The headline count mismatch between the paper (49 tasks) and the repo (39 runnable benchmarks) has been a repeated source of confusion. Rewrite the Benchmarks section to spell out the relationship:
- The paper defines 49 evaluation tasks.
- This repo ships 39 benchmark pipelines covering 45 of those tasks, via an N:1 mapping where two pipelines (`typography-3`, `temporal-3`) score multiple paper tasks from a single model call.
- The remaining four paper tasks are layout-understanding only and do not have runnable pipelines in the repo yet (`layout-u-5..8`); they are listed by name so readers can cross-reference the paper.

The Totals row in the benchmarks table now sums to 39 / 45, matching what `gdb list` reports.

Made-with: Cursor
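The counts reconcile with a little arithmetic. The 4 + 4 split between the two multi-task pipelines below is an assumption for illustration; only their combined coverage of 8 extra tasks follows from the stated totals:

```python
# Numbers from this PR; the per-pipeline split (4 + 4) is assumed for illustration.
single_task_pipelines = 37
multi_task_pipelines = {"typography-3": 4, "temporal-3": 4}  # paper tasks scored per pipeline

repo_pipelines = single_task_pipelines + len(multi_task_pipelines)          # 39 pipelines
covered_tasks = single_task_pipelines + sum(multi_task_pipelines.values())  # 45 tasks covered
unrunnable = ["layout-u-5", "layout-u-6", "layout-u-7", "layout-u-8"]       # no pipelines yet
paper_tasks = covered_tasks + len(unrunnable)                               # 49 tasks in the paper

assert (repo_pipelines, covered_tasks, paper_tasks) == (39, 45, 49)
```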
Replace every `python scripts/run_benchmarks.py ...` example in the Verify, Run benchmarks, and Vertex AI sections with the equivalent `gdb` invocation, and add a zero-config `gdb verify` line showcasing the bundled fixture. Also:
- Add a `gdb eval --suite v0-all` example for whole-suite runs.
- Add a `gdb submit` / `gdb collect` pair for the batch path.
- Call out the `v0-*` suite naming in line with `src/gdb/suites.py`.
- Mark `scripts/run_benchmarks.py` as a deprecated shim in the project tree and add `scripts/build_verify_dataset.py` alongside it.

Made-with: Cursor
This release contains:
- New `gdb` console script with `list`, `info`, `suites`, `eval`, `verify`, `submit`, `collect`, `score`, `report` subcommands.
- `gdb verify` now runs a zero-config smoke test against a bundled fixture: no API keys, no dataset download required.
- Named benchmark suites (`v0-all`, `v0-smoke`, `v0-understanding`, `v0-generation`). The `v0-*` prefix is deliberate: evaluation definitions are not yet frozen. A `v1.0-*` family will be introduced once tasks, prompts, sample selection, and metric wiring are pinned.
- `scripts/run_benchmarks.py` is now a deprecation shim; prefer `gdb`.
- README quickstart migrated to the `gdb` CLI.
- Benchmarks table now clearly separates repo pipelines (39) from paper tasks (49, of which 45 are covered).

No breaking API changes: `python -m gdb` and the legacy `scripts/run_benchmarks.py` invocations continue to work.

Made-with: Cursor
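A hedged sketch of how `gdb.suites` might define the names; the benchmark ids listed here are placeholders, not the real suite contents, and `resolve_suite` is a hypothetical helper:

```python
# Placeholder benchmark ids; only the suite *names* come from the release notes.
SUITES: dict[str, list[str]] = {
    "v0-smoke": ["category-1", "layout-1", "svg-1"],
    "v0-understanding": ["category-1", "layout-1", "layout-2"],
    "v0-generation": ["svg-1", "template-1"],
}
# v0-all is the union of every named suite above.
SUITES["v0-all"] = sorted({b for ids in SUITES.values() for b in ids})

def resolve_suite(name: str) -> list[str]:
    """Map a suite name to its benchmark ids, failing loudly on unknown names."""
    if name not in SUITES:
        raise SystemExit(f"unknown suite {name!r}; run `gdb suites` to list them")
    return SUITES[name]
```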
`gdb --help` (excerpt):

```
  submit         Submit manifest.
  score          Re-score a precomputed CSV of model outputs.
  report         Render a run-report JSON as markdown.

options:
  -h, --help     show this help message and exit
  --version      Print the installed lica-gdb version and exit.
  -v, --verbose  Enable debug logging.
```
Summary
Makes `lica-gdb` installable and runnable with one command on a fresh machine. After `pip install lica-gdb`, researchers can now run `gdb verify` to confirm the install is working end-to-end (zero API keys, no dataset download), and `gdb eval --provider ... --suite v0-all` to actually benchmark a model. This is the reproducibility foundation the repo needs before we attempt any external-facing leaderboard/reference site work.
Also clarifies the long-standing "39 vs 49 tasks" confusion in the README, and demotes aspirational `v1.0-*` suite names to `v0-*` until the evaluation definitions are formally frozen.
Changes
- `gdb` console script (`src/gdb/cli.py`) exposing `list`, `info`, `suites`, `eval`, `verify`, `submit`, `collect`, `score`, `report`. Wired up via `[project.scripts]` in `pyproject.toml`. `python -m gdb` now delegates to the same entry point.
- Named suites (`src/gdb/suites.py`): `v0-all`, `v0-smoke`, `v0-understanding`, `v0-generation`. The `v0-*` prefix is intentional: the module docstring spells out that evaluation definitions are not yet frozen and what needs to happen before a `v1.0-*` family ships.
- Bundled verify fixture (`src/gdb/_verify_data/`, ~216 KB, 27 files): a downsampled slice of the public Lica dataset covering the `v0-smoke` suite. `gdb verify` defaults to this fixture when no data source is supplied, so the smoke test runs offline with zero API keys. Reproducible via `scripts/build_verify_dataset.py`.
- `scripts/run_benchmarks.py` is now a deprecation shim. Behavior unchanged; its docstring and a stderr warning point users at the new CLI.
- README: (1) every `python scripts/run_benchmarks.py ...` example migrated to `gdb ...`; (2) the Benchmarks section now separates the 39 repo pipelines from the 45 paper tasks they cover and explicitly names the 4 paper tasks (`layout-u-5..8`) that don't yet have a runnable pipeline; (3) a new `gdb submit`/`gdb collect` example.
- `__version__` fix: `importlib.metadata.version` now looks up the correct distribution name (`lica-gdb`, not `gdb`), so `gdb --version` prints the installed version instead of a stale `0.1.0` fallback.

Testing
- `ruff check src/ scripts/` passes locally
- `python scripts/run_benchmarks.py --list` still works (shim emits the deprecation notice on stderr, exits cleanly)
- `gdb list` shows 39 benchmarks
- `gdb suites` shows the four `v0-*` suites with correct counts
- `gdb verify` completes end-to-end against the bundled fixture, no `--dataset-root` required
- `gdb eval --stub-model --suite v0-smoke --dataset-root src/gdb/_verify_data` completes for all 6 smoke tasks
- `gdb eval --stub-model --benchmarks layout-4 typography-1 --output <file>.json` produces a valid single-file report
- `_verify_data/` contains no absolute paths, credentials, emails, or URLs; only relative paths and ordinary public-Lica design annotations
Real-provider paths (`gdb eval --provider openai|gemini|anthropic|hf|vllm|...`, `gdb submit`, `gdb collect`, `gdb score`, `gdb report`) were not smoke-tested in this PR. They share the same underlying runner code as the deprecated script, but are worth confirming on at least one provider before tagging 0.2.0 as a PyPI release.
Known follow-ups (not in this PR)
- `scripts/README.md` still describes the old `run_benchmarks.py` invocations verbatim; worth a separate migration pass.
- Pin the evaluation definitions in `BENCHMARK.md` (tasks, prompts, sample selection, and metric wiring) and surface a fingerprint via `gdb suites --show-fingerprint`. Prerequisite for introducing a `v1.0-*` suite family that papers can cite.
- The four uncovered layout-understanding paper tasks (`layout-u-5..8`) either need pipelines or an explicit note in the paper's next revision.
Made with Cursor