fix: pre-quantized models run optimize, only skip quantize by DingmaomaoBJTU · Pull Request #794 · microsoft/winml-cli

DingmaomaoBJTU · 2026-05-29T12:52:17Z

Summary

Pre-quantized (QDQ) models were incorrectly skipping optimize_onnx. Our custom optimizer is safe for QDQ models — it removes redundant Q/DQ identity pairs without fusing QDQ into QLinear* ops. This PR changes the behavior so pre-quantized models run the full optimize+autoconf+analyze loop and only skip quantize.

Also fixes a bug where winml build -m model.onnx (without -c) would fail with "not a valid JSON file" because generate_build_config wasn't receiving the onnx_path parameter.

Changes

build/onnx.py & build/hf.py:
- Pre-quantized models now run full optimize+autoconf+analyze, only skip quantize.
- Cache is_quantized_onnx() result to a local is_qdq variable to avoid redundant ONNX file reads.
- skip_optimize: restore original semantics — runs optimize_onnx (applies config.optim flags) but skips the autoconf re-optimization loop. Stage labels now accurately reflect actual work done.
- Fix misleading log messages (distinguish QDQ detection vs skip_optimize flag).
commands/build.py: Pass onnx_path to generate_build_config when input is a .onnx file path.
Tests: Updated expectations to match new behavior.

DingmaomaoBJTU

Overall this is a clean, well-motivated fix. The root cause (ORT Level 2 optimizer fusing QDQ patterns) is clearly explained and the change is minimal. Tests are updated correctly. A couple of minor suggestions below.

DingmaomaoBJTU

Overall a clean, well-motivated fix. The root cause is clearly explained and the change is minimal. Tests are updated correctly. A few minor suggestions inline.

DingmaomaoBJTU

Good fix overall. The root cause is well-identified and the solution is minimal and correct: skipping run_optimize_analyze_loop() entirely preserves QDQ structure for downstream EPs. Tests are properly updated. One minor issue noted inline.

DingmaomaoBJTU

Good fix -- the root cause analysis is clear and the change is minimal. The pre-quantized path now correctly preserves the QDQ structure by skipping the ORT optimizer entirely. A few minor suggestions inline.

DingmaomaoBJTU

Clean, well-motivated fix. Skipping optimize_onnx entirely for pre-quantized models is the right call — ORT Level 2 fusing QDQ patterns into QLinearConv is a known foot-gun for QNN/DML EPs. The test updates are consistent. One minor nit below.

DingmaomaoBJTU

Good fix — the root-cause is clear and the change is minimal. Two minor issues noted inline.

DingmaomaoBJTU

Good fix - the root cause (ORT Level 2 optimizer fusing QDQ patterns) is clearly identified and the solution is minimal. Two minor nits inline.

DingmaomaoBJTU

Good fix — the root cause is clear and the change is minimal. A couple of minor suggestions below.

DingmaomaoBJTU

Good fix. The root cause (ORT Level 2 optimizer fusing QDQ patterns) is clearly identified and the change is minimal and correct. Tests are well updated. A couple of minor suggestions below.

DingmaomaoBJTU

Good fix overall. The root cause is well-explained and the change is minimal and correct. Left a couple of inline nits.

DingmaomaoBJTU

Reviewed — this looks good. The fix is clean and minimal. One minor observation:

src/winml/modelkit/build/onnx.py: The zeroed-out tuple �nalyze_iters, analyze_unsupported, analyze_details = 0, 0, {} is fine for now, but if �uild_onnx_model ever adds new return fields from the analyze loop, this line silently falls out of sync. Consider adding a brief comment noting this must track the return shape of
un_optimize_analyze_loop.

Tests are correctly updated. The docstring on est_pre_quantized_runs_analyze_only accurately reflects the new behavior.

DingmaomaoBJTU

Clean fix with good test coverage. The logic change is correct - skipping run_optimize_analyze_loop() entirely for pre-quantized models prevents ORT Level 2 QDQ fusion. Two minor issues noted inline.

DingmaomaoBJTU

Clean, well-motivated fix. Skipping optimize entirely for pre-quantized models is the correct approach. Tests properly updated.

DingmaomaoBJTU

Clean, well-scoped fix. The root cause analysis is solid and the test updates properly reflect the new behavior. One minor suggestion inline.

DingmaomaoBJTU

Good fix - skipping optimize_onnx() entirely for pre-quantized models is the correct approach to preserve QDQ structure. A couple of minor nits below.

DingmaomaoBJTU

Good fix overall — the root cause is well explained and the approach is correct. Skipping the ORT optimizer entirely for pre-quantized models makes sense to preserve QDQ structure. A couple of minor nits below.

DingmaomaoBJTU

Good fix. Skipping ORT Level 2 optimization for pre-quantized models avoids QDQ pattern fusion that breaks EP compatibility. Code is clean, tests updated correctly.

DingmaomaoBJTU

Good fix — the root cause is clear and the change is minimal. A couple of minor observations inline.

DingmaomaoBJTU

Good fix. The root cause (ORT L2 optimizer fusing QDQ patterns) is clearly identified and the change is minimal and correct. Tests are properly updated. A couple of minor observations inline.

DingmaomaoBJTU

Good fix - the root cause and verification are clearly documented. The code change is minimal and correct: pre-quantized models now bypass optimize_onnx() entirely, preserving QDQ structure for downstream EPs. Two minor nits below.

DingmaomaoBJTU

Good fix. The root cause (ORT Level 2 optimizer fusing QDQ patterns breaking EP compatibility) is well-identified and the solution is minimal. Tests updated correctly to assert_not_called(). One minor suggestion below.

DingmaomaoBJTU

Clean fix overall. A couple inline suggestions.

DingmaomaoBJTU

Good fix overall - the root cause is clear and the change is minimal. The tests are well-updated to match the new behavior. Left a couple of minor suggestions.

DingmaomaoBJTU

Good fix overall — the root cause and solution are well-motivated. Skipping ORT Level 2 optimization for pre-quantized models is the correct approach to preserve QDQ structure. A couple of minor observations below.

DingmaomaoBJTU

Good fix overall. The root cause is well-identified and the change is minimal and correct. A couple of minor nits below.

DingmaomaoBJTU

Good fix overall — the root cause analysis is solid and the change is minimal. A couple of minor nits below.

DingmaomaoBJTU

Good fix overall — the root cause is clear and the change is minimal. A few observations inline.

DingmaomaoBJTU

Clean, well-motivated fix. The root cause (ORT L2 optimizer fusing QDQ patterns) is clearly documented and the solution is appropriately minimal. A few inline suggestions.

DingmaomaoBJTU

Good fix overall. The root cause is well-identified and the solution is clean — skipping ORT Level 2 optimization entirely for pre-quantized models prevents QDQ fusion that breaks EP compatibility. The tests correctly reflect the new behavior. One minor nit: see inline comment about a stale log message.

DingmaomaoBJTU

Good fix overall - the root cause is well-explained and the test updates are consistent. A couple of minor suggestions below.

DingmaomaoBJTU

Good fix - the root cause analysis is clear and the change is minimal. A couple of minor observations below.

DingmaomaoBJTU

Good fix - clear root cause, clean implementation. The QDQ fusion was silently regressing latency, and skipping optimize entirely is the right call. A couple of inline notes below.

DingmaomaoBJTU

Clean, surgical fix with a well-documented root cause. The change is straightforward and the tests are properly updated. One consideration below.

DingmaomaoBJTU

Overall the fix is correct and well-motivated. The PR description clearly explains the root cause and includes verification. A couple of minor suggestions below.

DingmaomaoBJTU

Overall this is a clean, well-motivated fix. The root cause is clearly identified and the solution is minimal and correct. A couple of minor observations below.

DingmaomaoBJTU

LGTM - clean, well-scoped fix. One minor observation inline.

DingmaomaoBJTU

Good fix. The root cause is well-identified (ORT Level 2 optimizer fusing QDQ patterns breaks QNN/DML EP compatibility). A few minor observations below.

DingmaomaoBJTU

Good fix overall. The root cause (ORT Level 2 optimizer fusing QDQ patterns) is clearly identified and the solution is minimal and correct. The test updates properly reflect the new behavior. A few minor observations inline.

DingmaomaoBJTU

Overall LGTM - clean, well-scoped fix. One minor suggestion below.

DingmaomaoBJTU

Good fix overall. The root cause (ORT Level 2 optimizer fusing QDQ patterns) is well-explained and the change is minimal and correct. Left a couple of inline suggestions below.

DingmaomaoBJTU

Good fix with clear root cause analysis. The change correctly prevents ORT Level 2 optimization from fusing QDQ patterns. One minor nit on a stale log message.

DingmaomaoBJTU

Review Summary: Clean, well-scoped fix. The root cause (ORT L2 optimizer fusing QDQ patterns) is clearly explained, the fix is minimal and correct, and test assertions are properly updated from �ssert_called_once() to �ssert_not_called(). The zero-initialization of �nalyze_iters, analyze_unsupported, analyze_details is appropriate since no analysis runs. current_path retains its prior value which is correct for pass-through. LGTM — no issues found.

DingmaomaoBJTU

Good fix! The root cause is well-identified and the approach is correct - completely skipping ORT optimization preserves QDQ patterns. A couple of minor nits below.

DingmaomaoBJTU

Overall this is a clean fix. The root cause is clearly identified and the solution is correct - pre-quantized models should bypass ORT Level 2 optimization entirely to preserve QDQ graph structure. Tests are properly updated. One minor nit on the log message.

DingmaomaoBJTU

Clean, well-motivated fix. The root cause (ORT Level 2 optimizer fusing QDQ patterns) is clearly identified and the solution — skipping the optimizer entirely rather than running it with special flags — is the correct approach for preserving QDQ structure that downstream EPs require. Tests properly updated. One minor observation: the \ est_skip_optimize_kwarg\ test at line 428 also changes to \�ssert_not_called()\ — confirming that the \skip_optimize\ kwarg path now also fully skips (consistent behavior). LGTM.

xieofxie

Nice catch on the QDQ-skips-optimize issue and the -m model.onnx build-config bug. Two concerns worth addressing before merge.

1. skip_optimize=True no longer actually skips optimization. run_optimize_analyze_loop always calls optimize_onnx internally (see build/common.py:83), so the new control flow appends "optimize" to stages_skipped but still runs the full optimize+autoconf+analyze path. The log messages and stages_skipped then disagree with reality, and test_skip_optimize_kwarg (both test_hf.py and test_onnx.py, lines around 860 / 413) literally encodes the contradiction:

assert "optimize" in result.stages_skipped     # claims skip
mock_pipeline["optimize"].assert_called_once()  # but it ran

Those two assertions can't both be true in a sane reading of "skip_optimize". They were untouched by this PR, but the underlying behavior changed around them — please reconcile (assert_not_called() if the intent is really skip, or rename the flag if the intent is "label as skipped while still running").

2. Silent semantic regression for direct skip_optimize=True callers. Before this PR, skip_optimize=True passed no max_optim_iterations, so the autoconf re-optimization loop didn't run (max_optim_iterations=0 default). After this PR, max_optim_iterations=hack_max_optim_iterations is passed unconditionally, so skip_optimize=True now triggers a full autoconf loop where it used to skip it. Not mentioned in the PR description.

Individual suggestions on the lines below. The pre-quantized fix and the commands/build.py fix both look correct in intent — just want the skip_optimize semantics tightened up so labels match work, and the is_quantized_onnx calls de-duplicated.

ORT Level 2 optimization fuses QDQ patterns (e.g. DQ->Conv->Q into QLinearConv), breaking QNN/DML EP compatibility. Pre-quantized models passed to build were being optimized despite the log saying 'skipping optimize', causing ~30x latency regression (500+ms vs 17ms) due to CPU fallback. Now build_onnx_model truly skips the optimize stage for pre-quantized models, preserving the original QDQ graph structure.

Address reviewer feedback: - build/hf.py: skip optimize_onnx for pre-quantized models (same fix as build/onnx.py, prevents QDQ-to-QLinear fusion that breaks EP compat) - commands/build.py: detect ONNX file input and route to generate_onnx_build_config via onnx_path param, fixing 'not a valid JSON file' error when running winml build -m model.onnx without -c - Update test_hf.py assertions to match new skip behavior

xieofxie

Two blocking concerns before merge — the implementation looks plausible, but the PR description argues for the opposite behavior and rests on a contradicted technical claim.

xieofxie · 2026-06-16T08:12:00Z

+        # Skip optimize entirely for pre-quantized models. ORT Level 2
+        # optimization fuses QDQ patterns (e.g. DQ→Conv→Q → QLinearConv),
+        # which breaks QNN/DML EP compatibility and causes CPU fallback.
+        analyze_iters, analyze_unsupported = 0, 0
+        analyze_details: dict[str, Any] = {}


The PR description contradicts this code. The title/body say pre-quantized models now run the full optimize+autoconf+analyze loop and only skip quantize, and that "our custom optimizer is safe for QDQ models." This code does the opposite — it removes run_optimize_analyze_loop and skips optimize entirely (the updated tests flip optimize.assert_called_once() → assert_not_called()).

None of the "Changes" bullets match the diff: there is no is_qdq variable (still is_pre_quantized), no log message is changed, and skip_optimize now skips optimize rather than "restoring original semantics — runs optimize_onnx." Please rewrite the description to match the implementation before merge so the commit history / bisect aren't misleading.

xieofxie · 2026-06-16T08:12:00Z

-            )
-        )
+        # Skip optimize entirely for pre-quantized models. ORT Level 2
+        # optimization fuses QDQ patterns (e.g. DQ→Conv→Q → QLinearConv),


Unresolved technical premise. This comment says ORT Level 2 fuses QDQ→QLinear (so optimize must be skipped), but the PR description claims the optimizer is safe and does not fuse QDQ. These can't both be true, and the PR's correctness hinges on which is.

Note run_optimize_analyze_loop calls optimize_onnx(model, output, **config.optim) unconditionally (common.py:83), so whether fusion happens depends on the graph-optimization level config.optim applies to a QDQ model. Please confirm empirically — does optimize actually emit QLinearConv/QLinearMatMul on a sample QDQ model? — and state the verified reason here.

xieofxie · 2026-06-16T08:14:08Z

@@ -160,17 +160,11 @@ def build_onnx_model(
            "Skipping optimize + quantize, running analyze-only."


Stale/misleading log message. This still logs "...running analyze-only," but after this change the analyze loop is gone — analyze_iters/analyze_unsupported are hardcoded to 0 and no analysis runs. The message now actively misdescribes the behavior (and ironically the PR description claims to fix misleading logs, yet no log line is touched in the diff). Suggest updating to something like "Pre-quantized model: skipping optimize + quantize to preserve QDQ structure." Same applies to the matching line in hf.py.

DingmaomaoBJTU requested a review from a team as a code owner May 29, 2026 12:52

DingmaomaoBJTU commented May 29, 2026

View reviewed changes

DingmaomaoBJTU commented May 30, 2026

View reviewed changes

DingmaomaoBJTU commented May 31, 2026

View reviewed changes

DingmaomaoBJTU commented Jun 1, 2026

View reviewed changes

Comment thread src/winml/modelkit/build/onnx.py

Comment thread src/winml/modelkit/build/onnx.py Outdated

DingmaomaoBJTU commented Jun 1, 2026

View reviewed changes

Comment thread src/winml/modelkit/build/onnx.py

DingmaomaoBJTU commented Jun 1, 2026

View reviewed changes

DingmaomaoBJTU force-pushed the dingmaomaobjtu/fix-prequantized-optimize-skip branch from f83aa5c to cebbf4d Compare June 8, 2026 05:52

DingmaomaoBJTU changed the title ~~fix: skip optimize_onnx for pre-quantized models~~ fix: pre-quantized models run optimize, only skip quantize Jun 8, 2026

DingmaomaoBJTU force-pushed the dingmaomaobjtu/fix-prequantized-optimize-skip branch 2 times, most recently from 1c2dee5 to 51e4841 Compare June 8, 2026 06:31

xieofxie reviewed Jun 8, 2026

View reviewed changes

DingmaomaoBJTU force-pushed the dingmaomaobjtu/fix-prequantized-optimize-skip branch 2 times, most recently from bdee3ff to 71eacb7 Compare June 8, 2026 07:51

DingmaomaoBJTU force-pushed the dingmaomaobjtu/fix-prequantized-optimize-skip branch from 71eacb7 to d7b1c38 Compare June 16, 2026 06:49

github-actions Bot added 2 commits June 16, 2026 15:03

fix: add type annotations to satisfy mypy var-annotated

bb0c2fd

DingmaomaoBJTU force-pushed the dingmaomaobjtu/fix-prequantized-optimize-skip branch from 13711d6 to bb0c2fd Compare June 16, 2026 07:29

xieofxie reviewed Jun 16, 2026

View reviewed changes

		@@ -160,17 +160,11 @@ def build_onnx_model(
		"Skipping optimize + quantize, running analyze-only."

Conversation

DingmaomaoBJTU commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

DingmaomaoBJTU commented May 29, 2026 •

edited

Loading