Disclosure: this issue text and the linked implementation plan were drafted by GPT-5.4-xhigh after auditing this repository. I reviewed both carefully, agree with the analysis and proposal, and am explicitly approving/posting this issue to get maintainer feedback before any implementation starts.
After that audit, the main conclusion is:
`ere` currently standardizes the host API better than it standardizes benchmark semantics.
That is reasonable for a library, but it creates a real risk for benchmarking: two rows can look comparable while actually measuring different things.
I wrote up a concrete implementation plan here:
https://gist.github.com/quangvdao/1c01a57cfbecd0c2868f7c3774008b99
I am opening this issue first because I would rather get maintainers aligned on the direction before anyone spends time implementing a large cross-cutting refactor.
What seems misleading today
Several benchmark-relevant semantics are currently implicit:
- Setup/preprocessing costs are included in different places depending on the backend.
- `ProofKind` is too coarse for benchmark interpretation: `Compressed` does not mean the same proof artifact everywhere.
- Public output semantics differ across backends: raw bytes, hashed output, field-encoded output, or proof-bound digests.
- Wrapper/orchestration layers are not first-class benchmark dimensions: in-process SDKs, CLI wrappers, Dockerized RPC, local servers, and cluster proving all have different measurement surfaces.
- Benchmark helpers still encode implicit choices such as randomized fixtures and default proof kinds.
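To make the gap concrete, here is a minimal Rust sketch of what "declaring benchmark semantics explicitly" could look like. All names (`PublicOutputKind`, `PreprocessingPolicy`, `DeclaredBenchmarkSemantics`, `ExampleBackend`) are hypothetical and do not exist in `ere`; the point is only that each implicit choice above becomes a declared, inspectable value:

```rust
// Hypothetical sketch only: none of these types exist in `ere` today.
#[derive(Debug, PartialEq, Eq)]
enum PublicOutputKind {
    RawBytes,         // outputs reported as raw bytes
    Hashed,           // a hash of the outputs
    FieldEncoded,     // outputs encoded as field elements
    ProofBoundDigest, // a digest bound into the proof itself
}

#[derive(Debug, PartialEq, Eq)]
enum PreprocessingPolicy {
    InConstructor, // setup cost paid at instantiation time
    InProve,       // setup cost counted inside the timed prove path
    External,      // setup done out-of-band and never measured
}

// What a backend would declare explicitly, instead of benchmark
// consumers inferring it from backend internals.
trait DeclaredBenchmarkSemantics {
    fn preprocessing(&self) -> PreprocessingPolicy;
    fn public_output(&self) -> PublicOutputKind;
    fn deterministic_fixtures(&self) -> bool;
}

struct ExampleBackend;

impl DeclaredBenchmarkSemantics for ExampleBackend {
    fn preprocessing(&self) -> PreprocessingPolicy {
        PreprocessingPolicy::InConstructor
    }
    fn public_output(&self) -> PublicOutputKind {
        PublicOutputKind::Hashed
    }
    fn deterministic_fixtures(&self) -> bool {
        true
    }
}

fn main() {
    let backend = ExampleBackend;
    println!(
        "preprocessing={:?} output={:?} deterministic={}",
        backend.preprocessing(),
        backend.public_output(),
        backend.deterministic_fixtures()
    );
}
```

A harness could then group or reject rows based on these declarations rather than on reading each backend's source.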
Concrete examples
- Randomized benchmark-like fixtures: crates/test-utils/src/program/basic.rs:62-72
- Implicit proof target via `ProofKind::default()`: crates/test-utils/src/host.rs:32-35
- Single-field prove reporting: crates/zkvm-interface/src/zkvm/report.rs:31-38
- Coarse proof-kind surface: crates/zkvm-interface/src/zkvm/proof.rs:11-17
- Constructor-time preprocessing in OpenVM: crates/zkvm/openvm/src/zkvm.rs:75-100
- Timed prove path plus post-proof verification in OpenVM: crates/zkvm/openvm/src/zkvm.rs:169-202
- Constructor-time preprocessing in Jolt: crates/zkvm/jolt/src/zkvm/sdk.rs:93-101
- Proof/public-value digest binding in Pico: crates/zkvm/pico/src/zkvm.rs:108-146
- Dockerized wrapper still reporting wrapped backend identity: crates/dockerized/src/zkvm.rs:436-441
- ZisK mixing local timing and remote-reported timing: crates/zkvm/zisk/src/zkvm/sdk.rs:172-184
None of these are necessarily bugs in isolation. The issue is that benchmark consumers currently have to infer these differences from backend internals.
Proposed direction
I think `ere` should adopt an explicit benchmark contract.
At a high level, that would mean:
- Define benchmark vocabulary and canonical workload families.
- Make each backend declare its benchmark semantics explicitly.
- Replace the single proving-time number with phase-based reporting.
- Make wrapper/orchestration overhead visible instead of implicit.
- Add a reference harness that exports self-describing rows and rejects invalid comparisons.
The goal is not to force all backends into identical semantics. The goal is to make differences explicit, so published comparisons are either truthful or rejected.
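As a concrete illustration of "phase-based reporting instead of a single proving-time number", here is a small Rust sketch. `Phase`, `PhaseTiming`, and `BenchmarkRow` are illustrative names, not existing `ere` types, and the phase list is an assumption about what the contract might distinguish:

```rust
use std::time::Duration;

// Hypothetical sketch only: these are not existing `ere` types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Preprocessing,   // setup/key generation, wherever the backend runs it
    Execution,       // guest execution / trace generation
    CoreProve,       // the core proving step
    WrapperOverhead, // CLI/Docker/RPC orchestration cost
}

#[derive(Debug)]
struct PhaseTiming {
    phase: Phase,
    wall_time: Duration,
    // Distinguishes local clock measurements from remote-reported times,
    // so mixes like the ZisK example above become visible in the row.
    locally_measured: bool,
}

#[derive(Debug)]
struct BenchmarkRow {
    backend: &'static str,
    phases: Vec<PhaseTiming>,
}

impl BenchmarkRow {
    /// End-to-end time over all declared phases.
    fn total(&self) -> Duration {
        self.phases.iter().map(|p| p.wall_time).sum()
    }

    /// Proving time excluding preprocessing and orchestration overhead.
    fn proving_only(&self) -> Duration {
        self.phases
            .iter()
            .filter(|p| !matches!(p.phase, Phase::Preprocessing | Phase::WrapperOverhead))
            .map(|p| p.wall_time)
            .sum()
    }
}

fn example_row() -> BenchmarkRow {
    BenchmarkRow {
        backend: "example-backend",
        phases: vec![
            PhaseTiming { phase: Phase::Preprocessing, wall_time: Duration::from_secs(5), locally_measured: true },
            PhaseTiming { phase: Phase::CoreProve, wall_time: Duration::from_secs(30), locally_measured: true },
            PhaseTiming { phase: Phase::WrapperOverhead, wall_time: Duration::from_secs(2), locally_measured: true },
        ],
    }
}

fn main() {
    let row = example_row();
    println!("{}: total={:?}, proving={:?}", row.backend, row.total(), row.proving_only());
}
```

With rows shaped like this, a harness can compute either an end-to-end number or a proving-only number, and the choice is explicit in the output rather than baked into each backend.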
Why this seems aligned with existing work
This seems like a continuation of the repo’s direction, not a departure from it:
- #28/#29 already move toward better timing separation.
- #67 and #68 intentionally move preprocessing into instantiation for some backends; this proposal would expose that policy rather than undo it.
- #161 already recognizes multiple proof kinds, but benchmarking still needs richer proof-artifact metadata.
- #205 makes output hashing a real portability mechanism; benchmark families should name that explicitly.
- #303 asks for prove-side metadata in `ere-server`, which seems closely related.
Proposed principle
If two benchmark rows are comparable, the code should be able to explain why they are comparable.
If it cannot, the difference should be surfaced explicitly or the comparison should be rejected.
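The "explain or reject" rule can be sketched in a few lines of Rust. `RowSemantics` and its fields are placeholders for whatever metadata the contract ends up defining, not an existing `ere` API:

```rust
// Hypothetical sketch only: illustrates "explain or reject" comparability.
#[derive(Debug, Clone, PartialEq, Eq)]
struct RowSemantics {
    includes_preprocessing: bool,
    proof_artifact: &'static str,  // e.g. "core", "compressed", "groth16"
    output_encoding: &'static str, // e.g. "raw-bytes", "hashed"
}

/// Either explain why two rows are comparable, or reject the comparison.
fn check_comparable(a: &RowSemantics, b: &RowSemantics) -> Result<String, String> {
    if a == b {
        Ok(format!("comparable: both rows declare {:?}", a))
    } else {
        Err(format!("rejected: {:?} vs {:?}", a, b))
    }
}

fn main() {
    let core = RowSemantics {
        includes_preprocessing: false,
        proof_artifact: "core",
        output_encoding: "hashed",
    };
    let compressed = RowSemantics { proof_artifact: "compressed", ..core.clone() };

    println!("{:?}", check_comparable(&core, &core.clone()));
    println!("{:?}", check_comparable(&core, &compressed));
}
```

A real implementation would likely allow some declared differences (and annotate them) rather than require strict equality, but the key property is that the decision is made by code over declared metadata, not by a reader's guess.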
Questions for maintainers
- Does this direction make sense for `ere`?
- Should comparability rules live in `ere` itself, or only in downstream benchmark harnesses?
- Would you prefer work to start with workload vocabulary, backend metadata, or reporting?
- Are there any current benchmark ambiguities that are intentional and should remain implicit?