Proposal: make ere benchmark semantics explicit before publishing more cross-zkVM comparisons #306

@quangvdao

Disclosure: this issue text and the linked implementation plan were drafted by GPT-5.4-xhigh after auditing this repository. I reviewed both carefully, agree with the analysis and proposal, and am explicitly approving/posting this issue to get maintainer feedback before any implementation starts.

After that audit, the main conclusion is:

ere currently standardizes the host API better than it standardizes benchmark semantics.

That is reasonable for a library, but it creates a real risk for benchmarking: two rows can look comparable while actually measuring different things.

I wrote up a concrete implementation plan here:
https://gist.github.com/quangvdao/1c01a57cfbecd0c2868f7c3774008b99

I am opening this issue first because I would rather get maintainers aligned on the direction before anyone spends time implementing a large cross-cutting refactor.

What seems misleading today

Several benchmark-relevant semantics are currently implicit:

  • Setup/preprocessing costs are included in different places depending on the backend (see the sketch after this list).
  • ProofKind is too coarse for benchmark interpretation: Compressed does not mean the same proof artifact everywhere.
  • Public output semantics differ across backends: raw bytes, hashed output, field-encoded output, or proof-bound digests.
  • Wrapper/orchestration layers are not first-class benchmark dimensions: in-process SDKs, CLI wrappers, Dockerized RPC, local servers, and cluster proving all have different measurement surfaces.
  • Benchmark helpers still encode implicit choices such as randomized fixtures and default proof kinds.
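
To make the first point concrete, here is a minimal illustration of how two backends can both report a single "prove time" while only one of them includes preprocessing. This is hypothetical code, not taken from any ere backend:

```rust
// Hypothetical illustration: Backend A pays key generation at construction
// time, Backend B pays it inside the timed prove. Both report one number.
use std::time::{Duration, Instant};

struct BackendA; // preprocessing already done in `new()`

impl BackendA {
    fn new() -> Self {
        // Minutes of key generation happen here, outside any timed region.
        BackendA
    }
    fn timed_prove(&self) -> Duration {
        let start = Instant::now();
        // ... core proving only ...
        start.elapsed()
    }
}

struct BackendB {
    keys: Option<()>, // preprocessing deferred until first prove
}

impl BackendB {
    fn new() -> Self {
        BackendB { keys: None }
    }
    fn timed_prove(&mut self) -> Duration {
        let start = Instant::now();
        if self.keys.is_none() {
            // The same key generation lands inside the timed region here,
            // so this backend's "prove time" silently includes setup.
            self.keys = Some(());
        }
        // ... core proving ...
        start.elapsed()
    }
}
```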

Concrete examples

  • Randomized benchmark-like fixtures:
    crates/test-utils/src/program/basic.rs:62-72
  • Implicit proof target via ProofKind::default():
    crates/test-utils/src/host.rs:32-35
  • Single-field prove reporting:
    crates/zkvm-interface/src/zkvm/report.rs:31-38
  • Coarse proof-kind surface:
    crates/zkvm-interface/src/zkvm/proof.rs:11-17
  • Constructor-time preprocessing in OpenVM:
    crates/zkvm/openvm/src/zkvm.rs:75-100
  • Timed prove path plus post-proof verification in OpenVM:
    crates/zkvm/openvm/src/zkvm.rs:169-202
  • Constructor-time preprocessing in Jolt:
    crates/zkvm/jolt/src/zkvm/sdk.rs:93-101
  • Proof/public-value digest binding in Pico:
    crates/zkvm/pico/src/zkvm.rs:108-146
  • Dockerized wrapper still reporting wrapped backend identity:
    crates/dockerized/src/zkvm.rs:436-441
  • ZisK mixing local timing and remote-reported timing:
    crates/zkvm/zisk/src/zkvm/sdk.rs:172-184

None of these are necessarily bugs in isolation. The issue is that benchmark consumers currently have to infer these differences from backend internals.

Proposed direction

I think ere should adopt an explicit benchmark contract.

At a high level, that would mean:

  1. Define benchmark vocabulary and canonical workload families.
  2. Make each backend declare its benchmark semantics explicitly (a rough sketch follows this list).
  3. Replace the single proving-time number with phase-based reporting.
  4. Make wrapper/orchestration overhead visible instead of implicit.
  5. Add a reference harness that exports self-describing rows and rejects invalid comparisons.
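
As a rough sketch of points 2-4, here is one shape the declared semantics and phase-based report could take. None of these types exist in ere today; every name below is hypothetical:

```rust
// Hypothetical sketch of an explicit benchmark contract.

/// Where setup/preprocessing cost lands for a backend.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PreprocessingPolicy {
    AtConstruction,
    InsideProve,
    External,
}

/// How a backend exposes public outputs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OutputEncoding {
    RawBytes,
    Hashed,
    FieldEncoded,
    ProofBoundDigest,
}

/// How the prover is actually invoked when measured.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MeasurementSurface {
    InProcess,
    CliWrapper,
    DockerizedRpc,
    LocalServer,
    Cluster,
}

/// Declared once per backend, instead of being inferred from internals.
struct BenchmarkSemantics {
    preprocessing: PreprocessingPolicy,
    output_encoding: OutputEncoding,
    surface: MeasurementSurface,
    /// Backend-specific description of the artifact behind a coarse
    /// label like `Compressed`.
    proof_artifact: &'static str,
}

/// Phase-based report replacing a single proving-time number.
struct PhasedProveReport {
    setup: Option<std::time::Duration>,
    witness_generation: Option<std::time::Duration>,
    core_prove: std::time::Duration,
    wrapping: Option<std::time::Duration>,
    verification: Option<std::time::Duration>,
    /// Overhead attributable to the wrapper/orchestration layer.
    orchestration_overhead: Option<std::time::Duration>,
}
```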

The goal is not to force all backends into identical semantics. The goal is to make differences explicit, so published comparisons are either truthful or rejected.

Why this seems aligned with existing work

This seems like a continuation of the repo’s direction, not a departure from it:

  • #28 / #29 already move toward better timing separation.
  • #67 and #68 intentionally move preprocessing into instantiation for some backends; this proposal would expose that policy rather than undo it.
  • #161 already recognizes multiple proof kinds, but benchmarking still needs richer proof-artifact metadata.
  • #205 makes output hashing a real portability mechanism; benchmark families should name that explicitly.
  • #303 asks for prove-side metadata in ere-server, which seems closely related.

Proposed principle

If two benchmark rows are comparable, the code should be able to explain why they are comparable.

If it cannot, the difference should be surfaced explicitly or the comparison should be rejected.
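
A minimal sketch of how a harness could enforce that principle, reusing the hypothetical `BenchmarkSemantics` type from the sketch above (`BenchRow` and `ComparabilityError` are likewise made up):

```rust
// Hypothetical harness-side check: either explain why two rows are
// comparable, or reject the comparison with the differences named.

struct BenchRow {
    backend: String,
    workload: String,
    semantics: BenchmarkSemantics,
}

#[derive(Debug)]
enum ComparabilityError {
    DifferentWorkloads,
    SemanticsMismatch(Vec<&'static str>),
}

fn check_comparable(a: &BenchRow, b: &BenchRow) -> Result<(), ComparabilityError> {
    if a.workload != b.workload {
        return Err(ComparabilityError::DifferentWorkloads);
    }
    let mut mismatches = Vec::new();
    if a.semantics.preprocessing != b.semantics.preprocessing {
        mismatches.push("preprocessing policy");
    }
    if a.semantics.output_encoding != b.semantics.output_encoding {
        mismatches.push("public output encoding");
    }
    if a.semantics.surface != b.semantics.surface {
        mismatches.push("measurement surface");
    }
    if mismatches.is_empty() {
        Ok(())
    } else {
        // Differences are surfaced explicitly instead of averaged away.
        Err(ComparabilityError::SemanticsMismatch(mismatches))
    }
}
```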

Questions for maintainers

  • Does this direction make sense for ere?
  • Should comparability rules live in ere itself, or only in downstream benchmark harnesses?
  • Would you prefer work to start with workload vocabulary, backend metadata, or reporting?
  • Are there any current benchmark ambiguities that are intentional and should remain implicit?
