Proposal: make ere benchmark semantics explicit before publishing more cross-zkVM comparisons #306

@quangvdao

Disclosure: this issue text and the linked implementation plan were drafted by GPT-5.4-xhigh after auditing this repository. I reviewed both carefully, agree with the analysis and proposal, and am explicitly approving/posting this issue to get maintainer feedback before any implementation starts.

After that audit, the main conclusion is:

ere currently standardizes the host API better than it standardizes benchmark semantics.

That is reasonable for a library, but it creates a real risk for benchmarking: two rows can look comparable while actually measuring different things.

I wrote up a concrete implementation plan here:
https://gist.github.com/quangvdao/1c01a57cfbecd0c2868f7c3774008b99

I am opening this issue first because I would rather get maintainers aligned on the direction before anyone spends time implementing a large cross-cutting refactor.

What seems misleading today

Several benchmark-relevant semantics are currently implicit:

  • Setup/preprocessing costs are included in different places depending on the backend (see the sketch after this list).
  • ProofKind is too coarse for benchmark interpretation: Compressed does not mean the same proof artifact everywhere.
  • Public output semantics differ across backends: raw bytes, hashed output, field-encoded output, or proof-bound digests.
  • Wrapper/orchestration layers are not first-class benchmark dimensions: in-process SDKs, CLI wrappers, Dockerized RPC, local servers, and cluster proving all have different measurement surfaces.
  • Benchmark helpers still encode implicit choices such as randomized fixtures and default proof kinds.
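
To make the first point concrete, here is a minimal illustration of how two backends can both report a single "prove time" while only one of them includes preprocessing. This is hypothetical code, not taken from any ere backend:

```rust
// Hypothetical illustration: Backend A pays key generation at construction
// time, Backend B pays it inside the timed prove. Both report one number.
use std::time::{Duration, Instant};

struct BackendA; // preprocessing already done in `new()`

impl BackendA {
    fn new() -> Self {
        // Minutes of key generation happen here, outside any timed region.
        BackendA
    }
    fn timed_prove(&self) -> Duration {
        let start = Instant::now();
        // ... core proving only ...
        start.elapsed()
    }
}

struct BackendB {
    keys: Option<()>, // preprocessing deferred until first prove
}

impl BackendB {
    fn new() -> Self {
        BackendB { keys: None }
    }
    fn timed_prove(&mut self) -> Duration {
        let start = Instant::now();
        if self.keys.is_none() {
            // The same key generation lands inside the timed region here,
            // so this backend's "prove time" silently includes setup.
            self.keys = Some(());
        }
        // ... core proving ...
        start.elapsed()
    }
}
```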

Concrete examples

  • Randomized benchmark-like fixtures:
    crates/test-utils/src/program/basic.rs:62-72
  • Implicit proof target via ProofKind::default():
    crates/test-utils/src/host.rs:32-35
  • Single-field prove reporting:
    crates/zkvm-interface/src/zkvm/report.rs:31-38
  • Coarse proof-kind surface:
    crates/zkvm-interface/src/zkvm/proof.rs:11-17
  • Constructor-time preprocessing in OpenVM:
    crates/zkvm/openvm/src/zkvm.rs:75-100
  • Timed prove path plus post-proof verification in OpenVM:
    crates/zkvm/openvm/src/zkvm.rs:169-202
  • Constructor-time preprocessing in Jolt:
    crates/zkvm/jolt/src/zkvm/sdk.rs:93-101
  • Proof/public-value digest binding in Pico:
    crates/zkvm/pico/src/zkvm.rs:108-146
  • Dockerized wrapper still reporting wrapped backend identity:
    crates/dockerized/src/zkvm.rs:436-441
  • ZisK mixing local timing and remote-reported timing:
    crates/zkvm/zisk/src/zkvm/sdk.rs:172-184

None of these are necessarily bugs in isolation. The issue is that benchmark consumers currently have to infer these differences from backend internals.

Proposed direction

I think ere should adopt an explicit benchmark contract.

At a high level, that would mean:

  1. Define benchmark vocabulary and canonical workload families.
  2. Make each backend declare its benchmark semantics explicitly (a rough sketch follows this list).
  3. Replace the single proving-time number with phase-based reporting.
  4. Make wrapper/orchestration overhead visible instead of implicit.
  5. Add a reference harness that exports self-describing rows and rejects invalid comparisons.
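
As a rough sketch of points 2-4, here is one shape the declared semantics and phase-based report could take. None of these types exist in ere today; every name below is hypothetical:

```rust
// Hypothetical sketch of an explicit benchmark contract.

/// Where setup/preprocessing cost lands for a backend.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PreprocessingPolicy {
    AtConstruction,
    InsideProve,
    External,
}

/// How a backend exposes public outputs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OutputEncoding {
    RawBytes,
    Hashed,
    FieldEncoded,
    ProofBoundDigest,
}

/// How the prover is actually invoked when measured.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MeasurementSurface {
    InProcess,
    CliWrapper,
    DockerizedRpc,
    LocalServer,
    Cluster,
}

/// Declared once per backend, instead of being inferred from internals.
struct BenchmarkSemantics {
    preprocessing: PreprocessingPolicy,
    output_encoding: OutputEncoding,
    surface: MeasurementSurface,
    /// Backend-specific description of the artifact behind a coarse
    /// label like `Compressed`.
    proof_artifact: &'static str,
}

/// Phase-based report replacing a single proving-time number.
struct PhasedProveReport {
    setup: Option<std::time::Duration>,
    witness_generation: Option<std::time::Duration>,
    core_prove: std::time::Duration,
    wrapping: Option<std::time::Duration>,
    verification: Option<std::time::Duration>,
    /// Overhead attributable to the wrapper/orchestration layer.
    orchestration_overhead: Option<std::time::Duration>,
}
```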

The goal is not to force all backends into identical semantics. The goal is to make differences explicit, so published comparisons are either truthful or rejected.

Why this seems aligned with existing work

This seems like a continuation of the repo’s direction, not a departure from it:

  • #28 / #29 already move toward better timing separation.
  • #67 and #68 intentionally move preprocessing into instantiation for some backends; this proposal would expose that policy rather than undo it.
  • #161 already recognizes multiple proof kinds, but benchmarking still needs richer proof-artifact metadata.
  • #205 makes output hashing a real portability mechanism; benchmark families should name that explicitly.
  • #303 asks for prove-side metadata in ere-server, which seems closely related.

Proposed principle

If two benchmark rows are comparable, the code should be able to explain why they are comparable.

If it cannot, the difference should be surfaced explicitly or the comparison should be rejected.
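
A minimal sketch of how a harness could enforce that principle, reusing the hypothetical `BenchmarkSemantics` type from the sketch above (`BenchRow` and `ComparabilityError` are likewise made up):

```rust
// Hypothetical harness-side check: either explain why two rows are
// comparable, or reject the comparison with the differences named.

struct BenchRow {
    backend: String,
    workload: String,
    semantics: BenchmarkSemantics,
}

#[derive(Debug)]
enum ComparabilityError {
    DifferentWorkloads,
    SemanticsMismatch(Vec<&'static str>),
}

fn check_comparable(a: &BenchRow, b: &BenchRow) -> Result<(), ComparabilityError> {
    if a.workload != b.workload {
        return Err(ComparabilityError::DifferentWorkloads);
    }
    let mut mismatches = Vec::new();
    if a.semantics.preprocessing != b.semantics.preprocessing {
        mismatches.push("preprocessing policy");
    }
    if a.semantics.output_encoding != b.semantics.output_encoding {
        mismatches.push("public output encoding");
    }
    if a.semantics.surface != b.semantics.surface {
        mismatches.push("measurement surface");
    }
    if mismatches.is_empty() {
        Ok(())
    } else {
        // Differences are surfaced explicitly instead of averaged away.
        Err(ComparabilityError::SemanticsMismatch(mismatches))
    }
}
```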

Questions for maintainers

  • Does this direction make sense for ere?
  • Should comparability rules live in ere itself, or only in downstream benchmark harnesses?
  • Would you prefer work to start with workload vocabulary, backend metadata, or reporting?
  • Are there any current benchmark ambiguities that are intentional and should remain implicit?
