
Fix: auto-strip ONNX post-training quantization on model load#1025

Open
philippelaporteconcordia wants to merge 2 commits into zkonduit:main from philippelaporteconcordia:fix/942-quantized-model-detection

Conversation

philippelaporteconcordia commented Apr 23, 2026

Summary

Closes #942.

EZKL crashes with an opaque tract panic ("Failed analyse for node ... ConvHir") when given an ONNX model that was post-training-quantized by onnxruntime.quantization (e.g. quantize_dynamic(weight_type=QInt8)). The PTQ ops — QuantizeLinear, DequantizeLinear, DynamicQuantizeLinear, MatMulInteger, ConvInteger, the QLinear* family — are not analysable by tract, and they're conceptually redundant with EZKL's own internal scale-driven quantization.

This PR makes pre-quantized models load and run end-to-end without any manual preprocessing.

What changed

Auto-dequantize on model load. A new in-process pass at src/graph/dequantize.rs (dequantize::apply(&mut ModelProto)) runs inside Model::new and canonicalises three patterns back to float equivalents:

  1. Activation Q/DQ identity pairs — QuantizeLinear → DequantizeLinear with shared scale/zp is folded into a direct edge.
  2. Weight DequantizeLinear — DequantizeLinear(W_int, scale, zp) on a weight initializer is folded into a single float initializer ((W_int - zp) * scale).
  3. DynamicQuantizeLinear + integer-op fusion — the DynamicQuantizeLinear → ConvInteger/MatMulInteger → Cast → Mul subgraph that quantize_dynamic emits is collapsed to a plain Conv/MatMul over (x, dequantized_W). Spatial attributes on ConvInteger are preserved on the replacement Conv. Trailing bias Add nodes are left untouched.

The pass is purely protobuf-level: read bytes → prost::Message::decode → rewrite → re-encode → hand cleaned bytes to tract via a Cursor. Idempotent on already-clean models, and a no-op on float-only graphs.
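
A minimal sketch of that flow, assuming tract_onnx::pb::ModelProto exposes the prost-generated types; the dequantize::apply call is this PR's and is left commented so the sketch compiles standalone:

```rust
use std::io::Cursor;

use prost::Message;
use tract_onnx::pb::ModelProto;

/// Sketch only: decode → rewrite → re-encode, as described above.
fn clean_onnx_bytes(raw: &[u8]) -> Result<Cursor<Vec<u8>>, Box<dyn std::error::Error>> {
    // Decode the raw ONNX protobuf with prost.
    let mut model = ModelProto::decode(raw)?;
    // Rewrite PTQ patterns in place (this PR's pass):
    // let report = dequantize::apply(&mut model)?;
    // Re-encode and hand the cleaned bytes to tract via a Cursor.
    let mut cleaned = Vec::with_capacity(raw.len());
    model.encode(&mut cleaned)?;
    Ok(Cursor::new(cleaned))
}
```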

Safety net + opt-out. A pre-existing detector (Model::reject_onnx_quantization_ops) survives as a fallback for patterns we don't rewrite (QLinearConv, QLinearMatMul, QLinearAdd, …) and for users who pass the new --disable-quantization-fixup flag. In both cases the loader returns GraphError::UnsupportedQuantizationOps with an actionable message that names the offending nodes and points at the dequantize subcommand.
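
For orientation, the fallback error could be shaped roughly like this (the variant name is from this PR; the fields and message text are illustrative assumptions):

```rust
/// Illustrative only — the real GraphError has many more variants and may
/// carry different data; only UnsupportedQuantizationOps is from this PR.
#[derive(Debug)]
pub enum GraphError {
    UnsupportedQuantizationOps { offending_nodes: Vec<String> },
}

impl std::fmt::Display for GraphError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            GraphError::UnsupportedQuantizationOps { offending_nodes } => write!(
                f,
                "model contains ONNX quantization operators EZKL cannot rewrite ({}); \
                 export the original float model or run `ezkl dequantize`",
                offending_nodes.join(", ")
            ),
        }
    }
}
```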

ezkl dequantize subcommand. Same rewrite, exposed as a one-shot tool that writes a cleaned .onnx to disk — useful for inspection, audit, sharing, or feeding non-EZKL toolchains:

ezkl dequantize -M input.onnx -O cleaned.onnx

The command prints a per-pattern report (collapsed N Q/DQ pairs, folded N weight DQ, replaced N dynamic-quantize fusions).
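
The report behind those counts might be a plain counter struct along these lines (the DequantizationReport name appears in the commit message below; the field names are assumptions):

```rust
/// Hypothetical field layout; only the three per-pattern counts are
/// grounded in the command output shown above.
#[derive(Debug, Default, Clone, Copy)]
pub struct DequantizationReport {
    pub qdq_pairs_collapsed: usize,
    pub weight_dq_folded: usize,
    pub dynamic_fusions_replaced: usize,
}

impl DequantizationReport {
    /// True when the pass changed nothing (already-clean or float-only model).
    pub fn is_noop(&self) -> bool {
        self.qdq_pairs_collapsed == 0
            && self.weight_dq_folded == 0
            && self.dynamic_fusions_replaced == 0
    }
}
```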

Bindings + CLI surface.

  • RunArgs.disable_quantization_fixup: bool (default false), CLI: --disable-quantization-fixup.
  • PyRunArgs.disable_quantization_fixup mirrors the field for the Python bindings.
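
From Rust, opting out might look like this (the field name is this PR's; the import path and the Default impl are assumptions):

```rust
use ezkl::RunArgs; // import path is a guess

fn main() {
    // disable_quantization_fixup mirrors the --disable-quantization-fixup flag.
    let run_args = RunArgs {
        disable_quantization_fixup: true,
        ..Default::default()
    };
    assert!(run_args.disable_quantization_fixup);
}
```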

Dependencies. Adds prost = "0.11" as a direct optional dep (gated on the existing onnx feature), version-pinned to match tract's transitive use so the Message trait impls resolve consistently.

Demo on the issue's model

$ ezkl gen-settings -M face_landmark_quantized.onnx
[*] succeeded
← auto-dequantize transparently rewrote 45 fusions

$ ezkl dequantize -M face_landmark_quantized.onnx -O face_clean.onnx
[*] wrote cleaned ONNX to face_clean.onnx
(collapsed 0 Q/DQ pairs, folded 0 weight DQ, replaced 45 dynamic-quantize fusions)

$ ezkl gen-settings -M face_landmark_quantized.onnx --disable-quantization-fixup
[E] [graph] model contains ONNX quantization operators EZKL cannot rewrite
(conv2d_1__52:0_QuantizeLinear (DynamicQuantizeLinear), … (+39 more)).
EZKL handles quantization internally via the scale run argument and
transparently strips post-training-quantization patterns it recognises.
The operators above were not recognised — please export the original
floating-point model, or run ezkl dequantize -M <input.onnx> -O <output.onnx>
to inspect the partial rewrite.

The dequantize pass adds essentially zero overhead to gen-settings: timed at ~0.02 s on the issue's 86-node face_landmark model; gen-settings end-to-end timings on the quantized vs. pre-cleaned variant are within noise (~90 s either way, dominated by tract analysis + circuit construction).

Tests

| Suite | Count | What it covers |
| --- | --- | --- |
| cargo test --lib graph::dequantize | 8 | Each rewrite pattern, idempotence, float-only no-op, unsupported QLinearConv reporting, shared-DynamicQuantizeLinear-feeding-multiple-integer-ops |
| cargo test --test quantization_detection | 4 | Default Model::new accepts each Q/DQ fixture; with disable_quantization_fixup=true the safety net fires |
| cargo test --test dequantize_pipeline | 2 | ezkl dequantize subcommand round-trips; cleaned model loads with auto-pass disabled |
| cargo test --test dequantize_e2e | 2 + 1 ignored | Full pipeline gen-settings → calibrate → compile-circuit → gen-witness → mock on the pre-quantized fixture; witness output (after dequantising via calibrated scales) matches a tract inference of the equivalent float model within 0.5 per element. Negative test: --disable-quantization-fixup halts at gen-settings. The #[ignore]d companion drives the full SNARK setup → prove → verify (~4 s on the fixture; opt in with cargo test -- --ignored) |
| tests/python/binding_tests.py::test_py_run_args | extended | PyRunArgs.disable_quantization_fixup round-trip |
| tests/integration_tests.rs::TESTS[] | +1 | New quantized_qdq fixture picked up by 36 existing wrappers (mock_*, kzg_prove_and_verify_*, accuracy_measurement_*) for free coverage. Smoke-tested: mock_public_outputs_::tests_100_expects passes with witness max-abs-error of 2.4e-4 |

Three fixtures are checked in:

  • tests/assets/quantized_qdq.onnx — minimal QuantizeLinear → DequantizeLinear → Conv graph (440 B).
  • tests/assets/quantized_dynamic.onnx — DynamicQuantizeLinear → MatMulInteger graph mirroring the issue's quantize_dynamic output (697 B).
  • examples/onnx/quantized_qdq/{network.onnx,input.json} — same Q/DQ fixture plus a deterministic input, picked up by the existing TESTS[] harness.

Clippy and cargo fmt --check are clean on all changed files (the existing pre-merge diffs in eth.rs / pfsys/srs.rs are untouched).

Test plan

  • cargo test --lib graph::dequantize — 8/8 pass
  • cargo test --test quantization_detection — 4/4 pass
  • cargo test --test dequantize_pipeline — 2/2 pass
  • cargo test --test dequantize_e2e — 2/2 pass; 1 ignored
  • cargo test --test dequantize_e2e -- --ignored — full SNARK 1/1 pass (~3.9 s)
  • cargo test --test integration_tests mock_public_outputs_::tests_100_expects — 1/1 pass (~76 s)
  • cargo clippy --features ezkl --tests --no-deps — no new warnings on changed files
  • cargo fmt --check — no diffs on changed files
  • End-to-end on the issue's face_landmark_quantized.onnx: default load succeeds, --disable-quantization-fixup produces the actionable error, ezkl dequantize produces a cleaned model that gen-settings then accepts.

Notes for reviewers

  • The branch is two commits — fix: detect ONNX post-training quantization ops before tract analysis (the original detector, which now serves as the safety net) and the dequantize pass + subcommand that lives on top. Happy to squash on request.
  • The auto-dequantize default is on; the --disable-quantization-fixup flag exists primarily for debugging the rewriter or for users who deliberately want EZKL to see the original pre-quantized graph.
  • I'd value a sanity check on (1) whether the prost = "0.11" pin is the right way to consume tract_onnx::pb::* directly, and (2) whether the seq! range bumps in tests/integration_tests.rs (from 0..99 to 0..=100, which incidentally adds coverage for the previously-uncovered large_mlp at idx 99) match your testing intent.

Future work

This PR closes the immediate bug, but leaves room for follow-ups in two distinct directions, plus some low-risk cleanups.

Native (circuit-level) support for quantization ops

The current PR rewrites PTQ patterns back to their float equivalents before the constraint compiler ever sees them. An alternative direction would be to implement these ops natively in src/circuit/ops/, mapping them onto field-element arithmetic. That would unlock things this PR does not:

  • Letting EZKL prove a quantized model as quantized, preserving whatever benefits the original quantization brought (smaller integer-domain values can mean tighter lookup ranges and lower logrows).
  • Avoiding the float round-trip through (W_int - W_zp) * W_scale, which today re-introduces fixed-point noise on top of the int8 noise already baked into the model.
  • Exposing per-channel quantization parameters (which the current pass cannot fold — see below) directly to the prover.

The work would touch src/circuit/ops/poly/ (linear arithmetic), src/circuit/ops/lookup/ (round/clamp at quantization boundaries), and src/graph/utilities.rs (op dispatch). It's a substantial piece of work — probably 2–3× the size of this PR — and would benefit from maintainer guidance on the field-arithmetic representation of (x - zp) * scale and zero-point handling.
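
For reference, the affine relation all of these ops implement is standard ONNX quantization semantics:

$$q = \mathrm{clamp}\big(\mathrm{round}(x/s) + z,\; q_{\min},\; q_{\max}\big), \qquad \hat{x} = s\,(q - z)$$

A native circuit op would need field-friendly encodings of the round/clamp step (lookups) and the affine rescale (linear arithmetic), which is exactly the split across poly/ and lookup/ sketched above.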

Broader ONNX quantization-op coverage

Today the dequantize pass handles the patterns ONNX Runtime's quantize_dynamic emits and the textbook activation Q/DQ identity pair. The safety-net detector catches everything else and reports it. Concrete extensions, ranked by likelihood-of-being-needed:

  1. QLinearConv / QLinearMatMul / QLinearAdd / QLinearMul / QLinearGlobalAveragePool / QLinearLeakyRelu / QLinearSigmoid / QLinearConcat — the QOperator family that ORT's quantize_static emits (as opposed to quantize_dynamic, which we already handle). Same conceptual rewrite — fold the inline scale/zp tensors and emit a plain float equivalent — but each op has its own argument layout and there are many of them. Probably the highest-value extension, since static PTQ is the more common ORT path in production.

  2. Per-channel quantization parameters. The dequantize pass currently rejects per-channel scale/zp (see DequantizationError::UnsupportedQuantParamShape). Adding per-axis broadcasting in dequantize_weight would cover most CNN weight quantization in the wild (weight_axis = 0 for Conv, weight_axis = 1 for MatMul). Mechanical change, ~50 lines (see the sketch after this list).

  3. MatMulIntegerToFloat — an ORT-internal fused op that combines MatMulInteger + Cast + Mul into one node. Adding it to the pattern table next to MatMulInteger would be a small extension.

  4. Quantization Aware Training (QAT) graphs — torch.ao.quantization.convert(...)-exported models. Pattern-wise these often look like the static QOperator family, but with per-channel scales — so this lands naturally if (1) and (2) ship.

  5. tf2onnx-converted TFLite quantized models. TFLite uses asymmetric uint8 quantization with sometimes-different graph topology after conversion. Worth a separate fixture and pattern-by-pattern triage.

  6. fp16 / bfloat16 weight tensors. The tensor_to_f32 helper currently errors on Float16/Bfloat16. Straightforward to add (use half::f16 for the reinterpret), but probably belongs as part of a broader half-precision support story rather than this dequantize pass.

  7. Symmetric vs. asymmetric quantization. The current code already handles both (zp can be zero or non-zero, signed or unsigned), but the test fixtures are all int8 symmetric. Adding an asymmetric uint8 fixture would harden coverage.
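
As referenced in item 2, a minimal sketch of the per-axis broadcasting extension (names and layout are illustrative assumptions; the real dequantize_weight operates on ONNX TensorProto data):

```rust
/// Per-channel weight dequantization sketch. Assumes row-major (ONNX)
/// layout; `axis` = 0 for Conv weights, 1 for MatMul weights.
fn dequantize_per_channel(
    w_int: &[i8],
    dims: &[usize],     // e.g. [out_ch, in_ch, kh, kw]
    axis: usize,
    scale: &[f32],      // len == dims[axis]
    zero_point: &[i8],  // len == dims[axis]
) -> Vec<f32> {
    // Number of consecutive elements sharing one (scale, zp) pair:
    // the product of all dimensions after `axis`.
    let inner: usize = dims[axis + 1..].iter().product();
    w_int
        .iter()
        .enumerate()
        .map(|(i, &w)| {
            let c = (i / inner) % dims[axis];
            (w as i32 - zero_point[c] as i32) as f32 * scale[c]
        })
        .collect()
}
```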

Low-risk follow-ups

  • Telemetry / logging surface for the dequantize report. Today gen-settings writes the rewrite counts to a debug! log. Promoting that to info! (or surfacing it in --verbose mode) would let users see at-a-glance that their model was auto-rewritten — useful to dispel the "what just happened to my graph?" question.
  • ezkl dequantize --check — a flag that runs the rewrite, reports what would be done, but does not write the cleaned file. Useful for CI integration where teams want to assert their models don't need rewriting.
  • Idempotence assertion in the loader. The pass is idempotent today (a second run is a no-op), and there's a unit test for that. A debug_assert! after the rewrite would catch any future regression where a pattern accidentally re-introduces a Q/DQ op it just removed.
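
A compile-checkable sketch of that assertion, written generically so it stands alone (in-tree it would close over dequantize::apply and the report's no-op check):

```rust
/// Debug-only idempotence guard: run the pass a second time and assert it
/// reports no changes. Generic so the sketch compiles without the PR's
/// dequantize module.
fn assert_idempotent<M, R: std::fmt::Debug>(
    model: &mut M,
    pass: impl Fn(&mut M) -> R,
    is_noop: impl Fn(&R) -> bool,
) {
    if cfg!(debug_assertions) {
        let second = pass(model);
        debug_assert!(is_noop(&second), "pass not idempotent: {second:?}");
    }
}
```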

…zkonduit#942)

ONNX Runtime's PTQ inserts QuantizeLinear, DequantizeLinear,
DynamicQuantizeLinear, MatMulInteger, ConvInteger, and QLinear* operators
that tract cannot analyse, producing an opaque "Failed analyse for node
... ConvHir" panic. EZKL already quantizes internally via the `scale`
run argument, so a pre-quantized model is both redundant and unsupported.

Scan the parsed InferenceModel right after `model_for_read` and surface
a `GraphError::UnsupportedQuantizationOps` listing the offending nodes,
with guidance to export the float model or strip the Q/DQ pairs. Cover
both the static `QuantizeLinear`/`DequantizeLinear` pattern and the
dynamic `DynamicQuantizeLinear`/`MatMulInteger` pattern from the issue.

Add a new tests/quantization_detection.rs integration suite with two checked-in fixtures
under tests/assets/:

  - quantized_qdq.onnx — minimal QuantizeLinear -> DequantizeLinear ->
    Conv graph exercising the static-PTQ path.
  - quantized_dynamic.onnx — DynamicQuantizeLinear -> MatMulInteger
    graph mirroring the issue's onnxruntime quantize_dynamic output.

Each fixture is loaded via Model::new and the test asserts the new
UnsupportedQuantizationOps variant is returned with a non-empty report.

Verified end-to-end on the issue's face_landmark_quantized.onnx: the
detector reports 44 offending nodes with the actionable message instead
of the original tract panic. Existing `quantize_dequantize` example is
unaffected (its exported ONNX is a folded float Gemm with no Q/DQ ops).
…onduit#942)

ONNX Runtime's PTQ inserts QuantizeLinear, DequantizeLinear,
  DynamicQuantizeLinear, MatMulInteger, ConvInteger, and QLinear* operators
  that tract cannot analyse, producing an opaque "Failed analyse for node
  ... ConvHir" panic. EZKL already quantizes internally via the `scale` run
  argument, so a pre-quantized model is both redundant and unsupported.

  Add `src/graph/dequantize.rs`, a protobuf-level rewriter exposed as
  `dequantize::apply(&mut ModelProto) -> Result<DequantizationReport,
  DequantizationError>`. It collapses three patterns:

    * `QuantizeLinear -> DequantizeLinear` activation identity pairs are
      folded into a direct edge.
    * Standalone `DequantizeLinear(W_int, scale, zp)` on weight initializers
      is folded into a single float initializer (`(W_int - zp) * scale`).
    * The `DynamicQuantizeLinear -> ConvInteger/MatMulInteger -> Cast -> Mul`
      fusion that `quantize_dynamic` emits is collapsed to a plain
      `Conv`/`MatMul` over (x, dequantized_W). Spatial attributes on
      `ConvInteger` are preserved on the replacement `Conv`.

  The pass runs automatically inside `Model::new`: read bytes, decode via
  `prost::Message::decode`, rewrite, re-encode, hand cleaned bytes to tract
  through a `Cursor`. The existing `reject_onnx_quantization_ops` detector
  survives as a safety net for unsupported patterns (e.g. QLinearConv) and
  for the new `--disable-quantization-fixup` opt-out flag, which is the
  only way to surface the safety-net error today.

  Add an `ezkl dequantize -M <input.onnx> -O <output.onnx>` subcommand that
  exposes the same rewrite as a one-shot tool — useful for inspecting what
  the auto-pass did, sharing cleaned models, or feeding non-EZKL toolchains.

  Tests:

    * 8 unit tests in `src/graph/dequantize.rs` cover each pattern,
      idempotence, a float-only no-op case, an unsupported `QLinearConv`
      case, and the shared-`DynamicQuantizeLinear`-feeding-multiple-
      integer-ops scenario that previously broke producer lookup.
    * `tests/quantization_detection.rs` (4 tests): default `Model::new`
      accepts each Q/DQ fixture; with `disable_quantization_fixup=true` the
      safety net fires with `UnsupportedQuantizationOps`.
    * `tests/dequantize_pipeline.rs` (2 tests): shells out to
      `ezkl dequantize`, then loads the cleaned model with the auto-pass
      *disabled* to prove the persisted bytes alone are accepted.
    * `tests/dequantize_e2e.rs` (3 tests, new) drives the full pipeline
      end-to-end on a pre-quantized fixture:
        - `gen-settings → calibrate-settings → compile-circuit →
          gen-witness → mock` succeeds with no manual dequantize step;
          the witness output (after dequantising via the calibrated output
          scales) agrees with a tract inference of the equivalent float
          model within ~0.5 per element.
        - `--disable-quantization-fixup` halts at `gen-settings` with
          `UnsupportedQuantizationOps`.
        - `#[ignore]`-gated companion runs the full SNARK
          `setup → prove → verify` on top (~4 s on the fixture, opt in
          with `cargo test -- --ignored`).
    * `tests/python/binding_tests.py::test_py_run_args` round-trips the
      new `PyRunArgs.disable_quantization_fixup` attribute.
    * Two checked-in fixtures under `tests/assets/`:
        - `quantized_qdq.onnx` — minimal QuantizeLinear → DequantizeLinear
          → Conv graph (with explicit Conv attrs so the cleaned model is
          also valid for `gen-settings`).
        - `quantized_dynamic.onnx` — DynamicQuantizeLinear → MatMulInteger
          graph mirroring the issue's `quantize_dynamic` output.
    * Mirror fixture under `examples/onnx/quantized_qdq/` plus
      `quantized_qdq` added to `tests/integration_tests.rs::TESTS[]`
      (bumped from `[&str; 100]` to `[&str; 101]`); two `seq!` macro
      ranges bumped from `0..99`/`0..=99` to `0..=100` so the new
      fixture (and the previously-uncovered `large_mlp` at idx 99) get
      picked up by 36 existing mock/prove/accuracy wrappers.

  Verified end-to-end on the issue's `face_landmark_quantized.onnx`:

    * `ezkl gen-settings -M face_landmark_quantized.onnx` → succeeds
      (45 dynamic-quantize fusions auto-rewritten transparently).
    * `ezkl dequantize -M face_landmark_quantized.onnx -O face_clean.onnx`
      → reports the per-pattern rewrite counts and writes the cleaned
      model.
    * `ezkl gen-settings -M face_landmark_quantized.onnx
      --disable-quantization-fixup` → safety-net error listing 44
      unrecognised quantization operators with an actionable message
      pointing at `ezkl dequantize`.

  The dequantize pass adds essentially zero overhead to `gen-settings`
  wall time (the ~90 s on this model is intrinsic to tract analysis +
  ezkl circuit construction; quantized-vs-cleaned timings are within
  noise).

  Adds `prost = "0.11"` as a direct optional dep (gated on the existing
  `onnx` feature) so we can `Message::decode`/`encode` `tract_onnx::pb`
  types directly. Tract already pulled prost in transitively; the direct
  dep just pins the major version we compile against.

Development

Successfully merging this pull request may close these issues.

Ezkl unable to use model from post-training quantization
