Fix: auto-strip ONNX post-training quantization on model load #1025
Open
philippelaporteconcordia wants to merge 2 commits into
…zkonduit#942)

ONNX Runtime's PTQ inserts QuantizeLinear, DequantizeLinear, DynamicQuantizeLinear, MatMulInteger, ConvInteger, and QLinear* operators that tract cannot analyse, producing an opaque "Failed analyse for node ... ConvHir" panic. EZKL already quantizes internally via the `scale` run argument, so a pre-quantized model is both redundant and unsupported.

Scan the parsed InferenceModel right after `model_for_read` and surface a `GraphError::UnsupportedQuantizationOps` listing the offending nodes, with guidance to export the float model or strip the Q/DQ pairs. Cover both the static `QuantizeLinear`/`DequantizeLinear` pattern and the dynamic `DynamicQuantizeLinear`/`MatMulInteger` pattern from the issue.

Add a new tests/quantization_detection.rs integration suite with two checked-in fixtures under tests/assets/:

- quantized_qdq.onnx: minimal QuantizeLinear -> DequantizeLinear -> Conv graph exercising the static-PTQ path.
- quantized_dynamic.onnx: DynamicQuantizeLinear -> MatMulInteger graph mirroring the issue's onnxruntime quantize_dynamic output.

Each fixture is loaded via Model::new and the test asserts the new UnsupportedQuantizationOps variant is returned with a non-empty report.

Verified end-to-end on the issue's face_landmark_quantized.onnx: the detector reports 44 offending nodes with the actionable message instead of the original tract panic. The existing `quantize_dequantize` example is unaffected (its exported ONNX is a folded float Gemm with no Q/DQ ops).
…onduit#942)

ONNX Runtime's PTQ inserts QuantizeLinear, DequantizeLinear, DynamicQuantizeLinear, MatMulInteger, ConvInteger, and QLinear* operators that tract cannot analyse, producing an opaque "Failed analyse for node ... ConvHir" panic. EZKL already quantizes internally via the `scale` run argument, so a pre-quantized model is both redundant and unsupported.

Add `src/graph/dequantize.rs`, a protobuf-level rewriter exposed as `dequantize::apply(&mut ModelProto) -> Result<DequantizationReport, DequantizationError>`. It collapses three patterns:

* `QuantizeLinear -> DequantizeLinear` activation identity pairs are folded into a direct edge.
* Standalone `DequantizeLinear(W_int, scale, zp)` on weight initializers is folded into a single float initializer (`(W_int - zp) * scale`).
* The `DynamicQuantizeLinear -> ConvInteger/MatMulInteger -> Cast -> Mul` fusion that `quantize_dynamic` emits is collapsed to a plain `Conv`/`MatMul` over (x, dequantized_W). Spatial attributes on `ConvInteger` are preserved on the replacement `Conv`.

The pass runs automatically inside `Model::new`: read bytes, decode via `prost::Message::decode`, rewrite, re-encode, hand cleaned bytes to tract through a `Cursor`. The existing `reject_onnx_quantization_ops` detector survives as a safety net for unsupported patterns (e.g. QLinearConv) and for the new `--disable-quantization-fixup` opt-out flag, which is the only way to surface the safety-net error today.

Add an `ezkl dequantize -M <input.onnx> -O <output.onnx>` subcommand that exposes the same rewrite as a one-shot tool: useful for inspecting what the auto-pass did, sharing cleaned models, or feeding non-EZKL toolchains.

Tests:

* 8 unit tests in `src/graph/dequantize.rs` cover each pattern, idempotence, a float-only no-op case, an unsupported `QLinearConv` case, and the shared-`DynamicQuantizeLinear`-feeding-multiple-integer-ops scenario that previously broke producer lookup.
* `tests/quantization_detection.rs` (4 tests): default `Model::new` accepts each Q/DQ fixture; with `disable_quantization_fixup=true` the safety net fires with `UnsupportedQuantizationOps`.
* `tests/dequantize_pipeline.rs` (2 tests): shells out to `ezkl dequantize`, then loads the cleaned model with the auto-pass *disabled* to prove the persisted bytes alone are accepted.
* `tests/dequantize_e2e.rs` (3 tests, new) drives the full pipeline end-to-end on a pre-quantized fixture:
  - `gen-settings → calibrate-settings → compile-circuit → gen-witness → mock` succeeds with no manual dequantize step; the witness output (after dequantising via the calibrated output scales) agrees with a tract inference of the equivalent float model within ~0.5 per element.
  - `--disable-quantization-fixup` halts at `gen-settings` with `UnsupportedQuantizationOps`.
  - an `#[ignore]`-gated companion runs the full SNARK `setup → prove → verify` on top (~4 s on the fixture, opt in with `cargo test -- --ignored`).
* `tests/python/binding_tests.py::test_py_run_args` round-trips the new `PyRunArgs.disable_quantization_fixup` attribute.
* Two checked-in fixtures under `tests/assets/`:
  - `quantized_qdq.onnx`: minimal QuantizeLinear → DequantizeLinear → Conv graph (with explicit Conv attrs so the cleaned model is also valid for `gen-settings`).
  - `quantized_dynamic.onnx`: DynamicQuantizeLinear → MatMulInteger graph mirroring the issue's `quantize_dynamic` output.
* Mirror fixture under `examples/onnx/quantized_qdq/` plus `quantized_qdq` added to `tests/integration_tests.rs::TESTS[]` (bumped from `[&str; 100]` to `[&str; 101]`); two `seq!` macro ranges bumped from `0..99`/`0..=99` to `0..=100` so the new fixture (and the previously-uncovered `large_mlp` at idx 99) get picked up by 36 existing mock/prove/accuracy wrappers.

Verified end-to-end on the issue's `face_landmark_quantized.onnx`:

* `ezkl gen-settings -M face_landmark_quantized.onnx` → succeeds (45 dynamic-quantize fusions auto-rewritten transparently).
* `ezkl dequantize -M face_landmark_quantized.onnx -O face_clean.onnx` → reports the per-pattern rewrite counts and writes the cleaned model.
* `ezkl gen-settings -M face_landmark_quantized.onnx --disable-quantization-fixup` → safety-net error listing 44 unrecognised quantization operators with an actionable message pointing at `ezkl dequantize`.

The dequantize pass adds essentially zero overhead to `gen-settings` wall time (the ~90 s on this model is intrinsic to tract analysis + ezkl circuit construction; quantized-vs-cleaned timings are within noise).

Adds `prost = "0.11"` as a direct optional dep (gated on the existing `onnx` feature) so we can `Message::decode`/`encode` `tract_onnx::pb` types directly. Tract already pulled prost in transitively; the direct dep just pins the major version we compile against.
Summary
Closes #942.
EZKL crashes with an opaque tract panic ("Failed analyse for node ... ConvHir") when given an ONNX model that was post-training-quantized by `onnxruntime.quantization` (e.g. `quantize_dynamic(weight_type=QInt8)`). The PTQ ops (`QuantizeLinear`, `DequantizeLinear`, `DynamicQuantizeLinear`, `MatMulInteger`, `ConvInteger`, the `QLinear*` family) are not analysable by tract, and they're conceptually redundant with EZKL's own internal `scale`-driven quantization. This PR makes pre-quantized models load and run end-to-end without any manual preprocessing.
What changed
**Auto-dequantize on model load.** A new in-process pass at `src/graph/dequantize.rs` (`dequantize::apply(&mut ModelProto)`) runs inside `Model::new` and canonicalises three patterns back to float equivalents:

- `QuantizeLinear → DequantizeLinear` with shared scale/zp is folded into a direct edge.
- Standalone `DequantizeLinear`: `DequantizeLinear(W_int, scale, zp)` on a weight initializer is folded into a single float initializer (`(W_int - zp) * scale`).
- `DynamicQuantizeLinear` + integer-op fusion: the `DynamicQuantizeLinear → ConvInteger/MatMulInteger → Cast → Mul` subgraph that `quantize_dynamic` emits is collapsed to a plain `Conv`/`MatMul` over `(x, dequantized_W)`. Spatial attributes on `ConvInteger` are preserved on the replacement `Conv`. Trailing bias `Add` nodes are left untouched.

The pass is purely protobuf-level: read bytes → `prost::Message::decode` → rewrite → re-encode → hand cleaned bytes to tract via a `Cursor` (sketched below). It is idempotent on already-clean models, and a no-op on float-only graphs.

**Safety net + opt-out.** A pre-existing detector (`Model::reject_onnx_quantization_ops`) survives as a fallback for patterns we don't rewrite (`QLinearConv`, `QLinearMatMul`, `QLinearAdd`, …) and for users who pass the new `--disable-quantization-fixup` flag. In both cases the loader returns `GraphError::UnsupportedQuantizationOps` with an actionable message that names the offending nodes and points at the `dequantize` subcommand.

**`ezkl dequantize` subcommand.** Same rewrite, exposed as a one-shot tool that writes a cleaned `.onnx` to disk, useful for inspection, audit, sharing, or feeding non-EZKL toolchains:

```
ezkl dequantize -M input.onnx -O cleaned.onnx
```

The command prints a per-pattern report (`collapsed N Q/DQ pairs, folded N weight DQ, replaced N dynamic-quantize fusions`).
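For orientation, a minimal sketch of that load-time flow together with the per-tensor weight fold. `dequantize::apply` and `DequantizationReport` are this PR's API; the wrapper function, imports, and error plumbing below are illustrative assumptions, not the actual `Model::new` code:

```rust
use std::io::Cursor;

use prost::Message;
use tract_onnx::pb::ModelProto;
use tract_onnx::prelude::*;

// `dequantize` is the module this PR adds at src/graph/dequantize.rs;
// the import path here is illustrative.
use crate::graph::dequantize;

/// Illustrative wrapper, not the real `Model::new` (which threads this
/// through EZKL's own error types).
fn load_dequantized(bytes: &[u8]) -> TractResult<InferenceModel> {
    // 1. Decode the raw ONNX protobuf with prost.
    let mut proto = ModelProto::decode(bytes)?;
    // 2. Collapse Q/DQ pairs, fold weight DequantizeLinear, and replace
    //    dynamic-quantize fusions; the report carries the per-pattern
    //    counts that `ezkl dequantize` prints.
    let _report = dequantize::apply(&mut proto)?;
    // 3. Re-encode and hand the cleaned bytes to tract through a Cursor.
    let cleaned = proto.encode_to_vec();
    tract_onnx::onnx().model_for_read(&mut Cursor::new(cleaned))
}

/// The weight fold, per-tensor: w_f32[i] = (w_int[i] - zp) * scale.
fn dequantize_weight(w_int: &[i8], scale: f32, zp: i8) -> Vec<f32> {
    w_int
        .iter()
        .map(|&w| (i32::from(w) - i32::from(zp)) as f32 * scale)
        .collect()
}
```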
**Bindings + CLI surface.** `RunArgs.disable_quantization_fixup: bool` (default `false`), CLI: `--disable-quantization-fixup`. `PyRunArgs.disable_quantization_fixup` mirrors the field for the Python bindings.

**Dependencies.** Adds `prost = "0.11"` as a direct optional dep (gated on the existing `onnx` feature), version-pinned to match tract's transitive use so the `Message` trait impls resolve consistently.

Demo on the issue's model
```
$ ezkl gen-settings -M face_landmark_quantized.onnx
[*] succeeded    ← auto-dequantize transparently rewrote 45 fusions

$ ezkl dequantize -M face_landmark_quantized.onnx -O face_clean.onnx
[*] wrote cleaned ONNX to face_clean.onnx
    (collapsed 0 Q/DQ pairs, folded 0 weight DQ, replaced 45 dynamic-quantize fusions)

$ ezkl gen-settings -M face_landmark_quantized.onnx --disable-quantization-fixup
[E] [graph] model contains ONNX quantization operators EZKL cannot rewrite
    (conv2d_1__52:0_QuantizeLinear (DynamicQuantizeLinear), … (+39 more)).
    EZKL handles quantization internally via the scale run argument and
    transparently strips post-training-quantization patterns it recognises.
    The operators above were not recognised — please export the original
    floating-point model, or run ezkl dequantize -M <input.onnx> -O <output.onnx>
    to inspect the partial rewrite.
```
The dequantize pass adds essentially zero overhead to `gen-settings`: timed at ~0.02 s on the issue's 86-node face_landmark model; `gen-settings` end-to-end timings on the quantized vs. pre-cleaned variant are within noise (~90 s either way, dominated by tract analysis + circuit construction).

Tests
- `cargo test --lib graph::dequantize`: 8 unit tests covering each pattern, idempotence, a float-only no-op, `QLinearConv` reporting, and the shared-`DynamicQuantizeLinear`-feeding-multiple-integer-ops scenario.
- `cargo test --test quantization_detection`: default `Model::new` accepts each Q/DQ fixture; with `disable_quantization_fixup=true` the safety net fires.
- `cargo test --test dequantize_pipeline`: the `ezkl dequantize` subcommand round-trips; the cleaned model loads with the auto-pass disabled.
- `cargo test --test dequantize_e2e`: `gen-settings → calibrate → compile-circuit → gen-witness → mock` on the pre-quantized fixture; the witness output (after dequantising via calibrated scales, sketched below) matches a tract inference of the equivalent float model within 0.5 per element. Negative test: `--disable-quantization-fixup` halts at `gen-settings`. An `#[ignore]`d companion drives the full SNARK `setup → prove → verify` (~4 s on the fixture; opt in with `cargo test -- --ignored`).
- `tests/python/binding_tests.py::test_py_run_args`: `PyRunArgs.disable_quantization_fixup` round-trip.
- `tests/integration_tests.rs::TESTS[]`: the `quantized_qdq` fixture is picked up by 36 existing wrappers (`mock_`, `kzg_prove_and_verify_`, `accuracy_measurement_*`) for free coverage. Smoke-tested: `mock_public_outputs_::tests_100_expects` passes with a witness max-abs-error of 2.4e-4.

Two small fixtures are checked in:

- `tests/assets/quantized_qdq.onnx`: minimal `QuantizeLinear → DequantizeLinear → Conv` graph (440 B).
- `tests/assets/quantized_dynamic.onnx`: `DynamicQuantizeLinear → MatMulInteger` graph mirroring the issue's `quantize_dynamic` output (697 B).
- `examples/onnx/quantized_qdq/{network.onnx,input.json}`: same Q/DQ fixture plus a deterministic input, picked up by the existing `TESTS[]` harness.

Clippy clean on all changed files.
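For reference, the shape of the e2e agreement check mentioned above, assuming EZKL's power-of-two fixed-point convention (a value at scale `s` encodes `x` as `round(x * 2^s)`); the function is a standalone illustration, not the test's actual code:

```rust
/// Element-wise max absolute error between a fixed-point witness output
/// at `scale` and a float reference from tract inference.
fn max_abs_error(witness: &[i64], scale: u32, reference: &[f64]) -> f64 {
    let multiplier = (1u64 << scale) as f64; // 2^scale
    witness
        .iter()
        .zip(reference)
        .map(|(&w, &r)| (w as f64 / multiplier - r).abs())
        .fold(0.0_f64, f64::max)
}

// dequantize_e2e asserts this stays within ~0.5 per element on the fixture.
```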
`cargo fmt --check` clean on all changed files (the existing pre-merge diffs in `eth.rs`/`pfsys/srs.rs` are untouched).

Test plan
- `cargo test --lib graph::dequantize`: 8/8 pass
- `cargo test --test quantization_detection`: 4/4 pass
- `cargo test --test dequantize_pipeline`: 2/2 pass
- `cargo test --test dequantize_e2e`: 2/2 pass; 1 ignored
- `cargo test --test dequantize_e2e -- --ignored`: full SNARK 1/1 pass (~3.9 s)
- `cargo test --test integration_tests mock_public_outputs_::tests_100_expects`: 1/1 pass (~76 s)
- `cargo clippy --features ezkl --tests --no-deps`: no new warnings on changed files
- `cargo fmt --check`: no diffs on changed files
- `face_landmark_quantized.onnx`: default load succeeds, `--disable-quantization-fixup` produces the actionable error, and `ezkl dequantize` produces a cleaned model that `gen-settings` then accepts

Notes for reviewers
- The `--disable-quantization-fixup` flag exists primarily for debugging the rewriter or for users who deliberately want EZKL to see the original pre-quantized graph.
- Please double-check (1) whether the `prost = "0.11"` pin is the right way to consume `tract_onnx::pb::*` directly, and (2) whether the `seq!` range bumps in `tests/integration_tests.rs` (from `0..99` to `0..=100`, which incidentally adds coverage for the previously-uncovered `large_mlp` at idx 99) match your testing intent.
This PR closes the immediate bug, but leaves room for follow-ups in two distinct directions.
Native (circuit-level) support for quantization ops
The current PR rewrites PTQ patterns back to their float equivalents before the constraint compiler ever sees them. An alternative direction would be to support the quantization ops natively under `src/circuit/ops/`, mapping them onto field-element arithmetic. That would unlock things this PR does not: most notably, exact integer semantics instead of the float fold `(W_int - W_zp) * W_scale`, which today re-introduces fixed-point noise on top of the int8 noise already baked into the model.

The work would touch `src/circuit/ops/poly/` (linear arithmetic), `src/circuit/ops/lookup/` (round/clamp at quantization boundaries), and `src/graph/utilities.rs` (op dispatch). It's a substantial piece of work, probably 2–3× the size of this PR, and would benefit from maintainer guidance on the field-arithmetic representation of `(x - zp) * scale` and zero-point handling.
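To make that ask concrete, the arithmetic a native op would have to express, written with $E_\sigma(x) = \lfloor x \cdot 2^\sigma \rceil$ as shorthand for EZKL's fixed-point encoding at scale $\sigma$ (the decomposition into poly/lookup pieces below is a reading of this section, not settled design):

```latex
\hat{x} = s\,(q - z)
\qquad\Longrightarrow\qquad
E_{\sigma}(\hat{x}) \;=\; \big\lfloor s \cdot 2^{\sigma}\,(q - z) \big\rceil
```

The affine part, `(q - z)` times a constant, lands naturally in `src/circuit/ops/poly/`; the rounding (and any clamp at the int8 boundaries) is where `src/circuit/ops/lookup/` would come in.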
Broader ONNX quantization-op coverage

Today the dequantize pass handles the patterns ONNX Runtime's `quantize_dynamic` emits and the textbook activation Q/DQ identity pair. The safety-net detector catches everything else and reports it. Concrete extensions, ranked by likelihood-of-being-needed:

1. `QLinearConv`/`QLinearMatMul`/`QLinearAdd`/`QLinearMul`/`QLinearGlobalAveragePool`/`QLinearLeakyRelu`/`QLinearSigmoid`/`QLinearConcat`: the QOperator family that ORT's static `quantize_static` emits (as opposed to `quantize_dynamic`, which we already handle). Same conceptual rewrite, folding the inline scale/zp tensors and emitting a plain float equivalent, but each op has its own argument layout and there are many of them. Probably the highest-value extension since static PTQ is the more common ORT path in production.
2. Per-channel quantization parameters. The dequantize pass currently rejects per-channel scale/zp (see `DequantizationError::UnsupportedQuantParamShape`). Adding per-axis broadcasting in `dequantize_weight` would cover most CNN weight quantization in the wild (`weight_axis = 0` for `Conv`, `weight_axis = 1` for `MatMul`). Mechanical change, ~50 lines; a broadcast sketch follows this list.

3. `MatMulIntegerToFloat`: an ORT-internal fused op that combines `MatMulInteger + Cast + Mul` into one node. Adding it to the pattern table next to `MatMulInteger` would be a small extension.
torch.ao.quantization.convert(...)-exported models. Pattern-wise these often look like the static QOperator family, but with per-channel scales — so this lands naturally if (1) and (2) ship.tf2onnx-converted TFLite quantized models. TFLite uses asymmetric uint8 quantization with sometimes-different graph topology after conversion. Worth a separate fixture and pattern-by-pattern triage.fp16/bfloat16weight tensors. Thetensor_to_f32helper currently errors onFloat16/Bfloat16. Straightforward to add (usehalf::f16for the reinterpret), but probably belongs as part of a broader half-precision support story rather than this dequantize pass.Symmetric vs. asymmetric quantization. The current code already handles both (zp can be zero or non-zero, signed or unsigned), but the test fixtures are all int8 symmetric. Adding an asymmetric uint8 fixture would harden coverage.
Low-risk follow-ups
- `gen-settings` writes the rewrite counts to a `debug!` log. Promoting that to `info!` (or surfacing it in `--verbose` mode) would let users see at a glance that their model was auto-rewritten, useful to dispel the "what just happened to my graph?" question.
- `ezkl dequantize --check`: a flag that runs the rewrite, reports what would be done, but does not write the cleaned file. Useful for CI integration where teams want to assert their models don't need rewriting.
- A `debug_assert!` after the rewrite would catch any future regression where a pattern accidentally re-introduces a Q/DQ op it just removed (sketched below).
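A sketch of that last invariant against tract's prost-generated `pb` types; the helper name and op list are illustrative:

```rust
use tract_onnx::pb::ModelProto;

/// Ops the rewriter claims to eliminate; after a successful rewrite the
/// graph should never contain them again.
const REWRITTEN_OPS: &[&str] = &[
    "QuantizeLinear",
    "DequantizeLinear",
    "DynamicQuantizeLinear",
    "MatMulInteger",
    "ConvInteger",
];

fn assert_rewrite_clean(proto: &ModelProto) {
    debug_assert!(
        proto.graph.as_ref().map_or(true, |g| g
            .node
            .iter()
            .all(|n| !REWRITTEN_OPS.contains(&n.op_type.as_str()))),
        "dequantize pass re-introduced a quantization op it had removed"
    );
}
```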