research: add autoconfig POC with QNN NPU catalog sweep by DingmaomaoBJTU · Pull Request #891 · microsoft/winml-cli

DingmaomaoBJTU · 2026-06-15T02:30:48Z

What this PR adds

research/autoconfig/ — an automated config search POC that sweeps opset versions (17–21), execution providers, and graph optimizations to find the best winml-cli build config for a given model on Windows hardware.

Key findings from QNN NPU catalog sweep (8 models, Snapdragon X Elite)

npu-001: opset21 gives +24–31% on DINOv2 family — NOT a general ViT property

Rigorously validated with fresh quantized.onnx builds, 3×500-iter sessions:

| Model | opset17 | opset21 | Gain |
| --- | --- | --- | --- |
| facebook/dinov2-small | 7.18 ms | 4.98 ms | +30.6% ✅ |
| facebook/dinov2-base | 34.56 ms | 26.23 ms | +24.1% ✅ |
| facebook/dino-vitb16 | 19.92 ms | 20.07 ms | -0.7% (NEUTRAL) ← critical control |
| microsoft/rad-dino | 274.98 ms | 275.36 ms | -0.1% (CPU-bound) |

Key discriminant: dino-vitb16 is the same ViT-B size as dinov2-base, but gets zero benefit from opset21. The speedup is specific to the DINOv2 architecture — mechanism TBD (DINOv2-specific op patterns in opset21 ONNX export, not the original kMaxSupportedOpset bypass mechanism which doesn't apply to ORT 1.24.x).
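The Gain column appears to be the relative latency reduction against the opset17 baseline. A minimal sketch of that arithmetic, assuming this definition (the helper name `opset_gain_pct` is illustrative and not taken from the PR's scripts):

```python
def opset_gain_pct(baseline_ms: float, candidate_ms: float) -> float:
    """Relative latency reduction of the candidate vs the baseline, in percent.

    Positive means the candidate (here: opset21) is faster.
    """
    if baseline_ms <= 0:
        raise ValueError("baseline latency must be positive")
    return (baseline_ms - candidate_ms) / baseline_ms * 100.0


# (opset17 ms, opset21 ms) as reported in the table above.
rows = {
    "facebook/dinov2-small": (7.18, 4.98),
    "facebook/dinov2-base": (34.56, 26.23),
    "microsoft/rad-dino": (274.98, 275.36),
}

gains = {model: round(opset_gain_pct(a, b), 1) for model, (a, b) in rows.items()}
for model, gain in gains.items():
    print(f"{model}: {gain:+.1f}%")
```

With the rounded table values this reproduces +30.6%, +24.1% and -0.1%. The dino-vitb16 row is left out because the rounded latencies give roughly -0.75%, so the reported -0.7% was presumably computed from unrounded measurements.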

npu-006: conv fusions cause catastrophic regression on Conv-dominant models only

| Model | No fusions | With fusions | Regression |
| --- | --- | --- | --- |
| microsoft/resnet-18 | ~1–4 ms | ~132–135 ms | +4900% 🔥 |
| facebook/dinov2-base | 34.56 ms | 25.92 ms | -25% (FASTER) |
| facebook/dino-vitb16 | 19.92 ms | 20.12 ms | +1% (neutral) |

Hazard is proportional to Conv op density. Attention-dominant models are safe or slightly benefit.
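One way to turn "Conv op density" into a number is the share of `Conv` nodes among all graph nodes. The sketch below assumes that definition and uses the 20% threshold quoted in a later commit message on this PR; the op lists are made up for illustration and are not measured from the PR's models.

```python
from collections import Counter
from typing import Iterable

# Threshold from a later commit message in this PR: "Conv% of total ops > 20%".
CONV_FUSION_THRESHOLD = 0.20


def conv_fraction(op_types: Iterable[str]) -> float:
    """Fraction of nodes whose op type is exactly 'Conv' (an assumed definition)."""
    counts = Counter(op_types)
    total = sum(counts.values())
    return counts["Conv"] / total if total else 0.0


def conv_fusions_risky(op_types: Iterable[str]) -> bool:
    """True when the graph is Conv-dense enough that conv fusions should be skipped."""
    return conv_fraction(op_types) > CONV_FUSION_THRESHOLD


# Hypothetical op-type lists. With the onnx package, the real list could come from
# (node.op_type for node in onnx.load(path).graph.node).
conv_heavy = ["Conv", "Relu", "Conv", "Add", "Relu", "Conv", "GlobalAveragePool", "Gemm"]
attention_heavy = ["Conv"] + ["MatMul", "Softmax", "Add", "Reshape", "Transpose"] * 4

print(conv_fusions_risky(conv_heavy))       # Conv is 3 of 8 nodes
print(conv_fusions_risky(attention_heavy))  # Conv is 1 of 21 nodes
```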

npu-007: DVFS thermal noise requires session-level averaging

QNN NPU CV is always 0.1–2.0+. Use 3×500-iter sessions with 30s cool-down. Trust gains >10% only.
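The protocol above can be sketched as a small runner. This is an illustration under assumptions, not the code in `validation_sweep.py`: `bench_once` is a hypothetical callable that runs one session and returns its p50 latency in ms, and averaging the per-session p50s with a plain mean is assumed (the per-session numbers quoted later in this PR are consistent with that).

```python
import statistics
import time
from typing import Callable, List, Sequence


def run_sessions(
    bench_once: Callable[[int], float],
    sessions: int = 3,
    iters: int = 500,
    cooldown_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[float]:
    """Run `sessions` benchmark sessions with a cool-down between them.

    Returns one p50 latency (ms) per session.
    """
    p50s = []
    for i in range(sessions):
        if i > 0:
            sleep(cooldown_s)  # let the NPU cool down before the next session
        p50s.append(bench_once(iters))
    return p50s


def is_trustworthy_gain(
    baseline_p50s: Sequence[float],
    candidate_p50s: Sequence[float],
    min_gain_pct: float = 10.0,
) -> bool:
    """Apply the 'trust gains >10% only' rule to session-level means."""
    base = statistics.fmean(baseline_p50s)
    cand = statistics.fmean(candidate_p50s)
    return (base - cand) / base * 100.0 > min_gain_pct


# Usage with a fake benchmark and no real sleeping (values are hypothetical).
fake_p50s = iter([7.2, 7.1, 7.3])
baseline = run_sessions(lambda iters: next(fake_p50s), sleep=lambda _: None)
print(baseline, is_trustworthy_gain(baseline, [5.0, 4.9, 5.1]))
```

Passing `sleep` as a parameter keeps the cool-down testable without waiting 30 seconds per session.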

Included files

Core scripts

autoconfig.py — main search loop (ConvNext CPU baseline)
catalog_qnn_sweep.py — 8-model QNN NPU catalog sweep
analyze_graph.py — ONNX graph analysis helper
validation_sweep.py — focused npu-001/npu-006 validation sweep (NEW)
gen_report_v3.py, autoconfig_diagram.html

Knowledge base (`ep_knowledge/`)

qnn_npu.json — 7 findings (npu-001 through npu-007), continuously updated with validation data
cpu.json, dml.json, qnn_gpu.json

Benchmark results (`catalog-qnn-sweep/`)

SUMMARY.md — original 8-model sweep results
VALIDATION_SUMMARY.md — 3-model validation sweep with full per-session data and cross-model comparison table
Per-model results.json and results_v2.json for dinov2-base, rad-dino, dino-vitb16

Design docs (`docs/`)

agent-design.md — winml-cli agent layer design (Diagnostic / Decision / Cross-Device / Regression / Recommendation agents)
skills-design.md — WinML CLI Skills Design (11 skills, competitive analysis, feature gaps)
ep-knowledge-review.md — statistical audit of ep_knowledge findings

Feature gaps identified

FusedConv detection in analyze_graph.py — needed to gate npu-006 rule automatically
DVFS-aware perf protocol — current winml perf doesn't expose session-level averaging
Budget-aware sweep — skip expensive hypotheses when time budget exhausted
Mechanism investigation for npu-001 — graph dump comparing Transpose counts at opset17 vs opset21

Status: Research POC — not production code. Scripts run standalone; not integrated into the winml CLI yet.

Adds research/autoconfig/, an automated config search POC that sweeps opset versions (17-21), execution providers, and graph optimizations to find the best winml-cli build config for a given model on Windows hardware.

Key findings from 8-model QNN NPU catalog sweep:
- npu-001: opset 21 bypass gives +25-31% on Conv+residual models (MobileViT, DINOv2)
- npu-006: conv fusions (conv-bn/add/activation) cause 4900% regression on ResNet-18 QNN NPU
- npu-007: DVFS thermal noise requires session-level averaging (3x500 iters) for reliable results

Includes ep_knowledge/ KB with confirmed findings per EP, and catalog-qnn-sweep/ with per-model benchmark results and cross-model pattern analysis.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds research/autoconfig/docs/agent-design.md, a strategic design for the agent layer of winml-cli, covering:
- winml-cli vs Olive distinction (UX + Windows-first + explainability)
- Why autoconfig search is a sub-tool, not the agent entry point
- 5 agent types: Diagnostic, Decision Guidance, Cross-Device Confidence, Regression Detection, Model Recommendation
- Autoconfig's role within the agent framework
- Key concerns and open questions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds research/autoconfig/docs/skills-design.md, a full design doc for the winml-cli skills/agent layer, including:
- 11 skill designs (use-winml-cli, optimize-for-device, ep-compatibility-check, debug-accuracy-drop, and others)
- Competitive analysis (Apple coremltools, ExecuTorch, AI Hub, NVIDIA ModelOpt, OpenVINO, Olive)
- Top 5 feature gaps
- Validation confidence levels (L1-L5)
- Structured output requirements
- QNN NPU catalog sweep findings (npu-001/006/007)
- FusedConv unfuse feature request

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

+import json
+
+
+results = json.load(open(r"ablation-search\results.json"))
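The scanner's exact finding on this line is not shown on the page. A possible tightening, assuming the concerns are the file handle that is never closed explicitly and the Windows-only backslash path, is sketched here; `load_results` and `RESULTS_PATH` are illustrative names, not part of `gen_report_v3.py`.

```python
import json
from pathlib import Path
from typing import Any

# Same relative location as in the diff, built portably instead of with a backslash.
RESULTS_PATH = Path("ablation-search") / "results.json"


def load_results(path: Path = RESULTS_PATH) -> Any:
    """Load a results JSON file and close the handle deterministically."""
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

Calling `load_results()` with no argument reads the same file as the original one-liner, relative to the current working directory.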


…ping skills
- Split skill catalog into two ranked categories by the 'does it touch code?' discriminator: User (config-only) and Contributor (code changes)
- Merge overlapping skills (12 -> 9):
  - check-model-feasibility = find-a-model + ep-compatibility-check
  - ship-to-winapp = validate-before-ship + prepare-for-winapp
  - autoconfig absorbs optimize-for-device as its manual mode
- Add self-contained HTML render of the design doc for easier reading

xieofxie · 2026-06-16T01:14:34Z

+
+    {
+      "id": "cpu-005",
+      "title": "Baseline (no extra flags) is the optimal config for ConvNext CPU",


xieofxie · 2026-06-16T01:15:04Z

+
+    {
+      "id": "cpu-001",
+      "title": "opset 19+ causes severe regression on CPU EP (3-4x slowdown)",


Critical issues found and corrected:

npu-001 (opset 21 speedup):
- mechanism_confirmed changed TRUE → FALSE. The kMaxSupportedOpset bypass requires ORT < 1.18; the sweep used onnxruntime-windowsml 1.24.5 where kMaxSupportedOpset >= 22. The bypass mechanism does not apply. The speedup for DINOv2/MobileViT is empirically real but the WHY is now unknown.
- ResNet-18 removed from 'benefits' list: sub-ms model, 3-session ranges span 4x for the same config (pure DVFS noise). Reported +20.2% was noise.
- MobileViT magnitude corrected: h1 had DVFS spike inflating median to 11.72ms; actual gain is ~20-26% not 26.5%.
- DINOv2 finding kept: 3-session data shows non-overlapping distributions.
- Added per-session raw data analysis and required follow-up experiments.

npu-002 / npu-003 (W8A16 speedup, compile speedup):
- scope changed from 'General / all vision models' to 'ConvNext only' (both findings from 1 model; magnitude claims not transferable)
- confidence reduced from 'high' to 'medium'

npu-004 (W8A8 accuracy collapse):
- confidence changed from 'medium' to 'very_low / anecdote'
- Finding has NO recorded data (experiment 'aborted early, numbers not saved'). Cannot be treated as a KB finding until re-run with recorded numbers.

npu-005 (QNN Hub comparison):
- Added fairness caveat: comparing qairt-stack model on ORT QNN EP is not a valid comparison. Finding is trivially true (use right tool for right stack) but not informative.

npu-006 (conv fusions catastrophic):
- No confidence change: this is the most statistically solid finding.
- Added session-level evidence note: h4 CV=0.016 (extremely stable, unusual for QNN NPU), consistent with deterministic CPU fallback hypothesis.

search_space_rules:
- opset recommendation changed from 'Conv+residual' to 'Conv+attention hybrid' to reflect actual validated models (DINOv2 is attention-dominant, not Conv+residual in the traditional sense)

New file: docs/ep-knowledge-review.md
- Full statistical analysis of per-session data
- ORT version dependency explained
- Additional models needed for validation
- Minimum experiment protocol

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…eneral ViT

Run validation_sweep.py across 3 new models to rigorously test npu-001 (opset21 speedup) and npu-006 (conv fusion regression) hypotheses.

KEY FINDINGS:

npu-001 (opset21 speedup):
- facebook/dinov2-base: +24.1% (opset17 34.56ms -> opset21 26.23ms); 3-session full bench, fresh quantized.onnx builds, very stable
- microsoft/rad-dino: -0.1% NEUTRAL -- model runs on CPU (~275ms), QNN NPU cannot accelerate ViT-L; opset irrelevant when CPU-bound
- facebook/dino-vitb16: -0.7% NEUTRAL -- critical control proving the speedup is NOT a general ViT property; DINOv2-specific op patterns must explain the difference

Combined with original catalog data: dinov2-small +30.6%, dinov2-base +24.1% (both confirmed); dino-vitb16 NEUTRAL (confirmed control) -> scope is DINOv2 family

npu-006 (conv fusions):
- dinov2-base: fusions -25% (faster) -- attention-dominant, benign
- dino-vitb16: fusions +1% (neutral) -- no meaningful Conv ops to fuse

Combined with original resnet-18 +4900% -> hazard is conv-density-gated

Script fixes in validation_sweep.py:
- bench_screen parsed d.get('p50_ms') instead of d['latency_ms']['p50']
- Reuse check accepted any .onnx (including truncated export.onnx)
- Model selection preferred optimized.onnx over quantized.onnx

Updated files:
- ep_knowledge/qnn_npu.json: npu-001 scope narrowed to DINOv2-family, validated_models expanded with dino-vitb16 (negative control) and dinov2-base (positive), rad-dino (CPU-bound); npu-006 scope updated
- catalog-qnn-sweep/VALIDATION_SUMMARY.md: full cross-model results table
- catalog-qnn-sweep/{dinov2-base,rad-dino,dino-vitb16}/results_v2.json
- catalog-qnn-sweep/.gitignore: exclude val_h*/ build artifact dirs

+        if complete_models:
+            print(f"  [reuse] existing build in {hyp_dir.name}", flush=True)
+            ok = True
+            build_out = "(reused)"


+            p50 = lat.get("p50") if isinstance(lat, dict) else None
+            if p50:
+                p50s.append(round(p50, 3))
+        except Exception:


…nism invalidated, confidence calibrated

Merge structural improvements from local review into KB (smart merge, preserving validation sweep data from 2026-06-16):

npu-001:
- Add mechanism_invalidation field (explicit statement of INVALIDATION with cause: ORT 1.24.5 kMaxSupportedOpset>=22, bypass does not apply)
- Add critical_caveats array (4 caveats incl. DINOv2-specific scope note)
- Downgrade confidence to 'medium-high on empirical / low on mechanism' (was 'high' which was overclaiming given unknown mechanism)

npu-002/003:
- Add follow_up_required fields (FP32 baselines on MobileViT/DINOv2/ResNet)

npu-004:
- Update action_for_autoconfig: 'Do NOT use to skip W8A8 without running eval first' (was 'Treat as potentially risky' which was still prescriptive without data)

search_space_rules:
- Rename recommended_order_conv_attention_hybrid -> recommended_order_conv_residual to match local review terminology

NOTE: Validation sweep data (dinov2-base +24.1%, dino-vitb16 NEUTRAL, rad-dino CPU-bound) from 2026-06-16 is preserved, not overwritten.

…d NOT Transpose elimination

Task 3 investigation: loaded dinov2-small opset17 (h0) and opset21 (h3) optimized.onnx and quantized.onnx from catalog_qnn_sweep builds; counted op types with onnx.load().

Key finding: Transpose count is IDENTICAL (49 nodes) in both opsets.
- opset17 optimized: 391 total, 49 Transpose, 121 Reshape
- opset21 optimized: 439 total, 49 Transpose, 169 Reshape (+48)
- opset17 quantized: 1398 total, 49 Transpose, 615 DQ, 392 Q
- opset21 quantized: 1542 total, 49 Transpose, 663 DQ, 440 Q (+48 QDQ pairs)

Rules out: NHWC Transpose-elimination as speedup cause, fewer-ops as explanation.
Consistent with: QNN EP scheduling/partitioning difference triggered by +48 Reshape nodes.

Also: kMaxSupportedOpset confirmed >= 23 in ORT 1.24.4 (C:\\tmp env), reaffirming that the original bypass mechanism does NOT apply.

Updated npu-001 critical_caveats, follow_up_required, and added transpose_analysis_2026_06_16 section with raw op counts.
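The op-count comparison above can be reproduced as a plain dictionary diff. The sketch below uses only the counts reported in this commit message; `op_count_delta` is an illustrative helper, and the comment on how to obtain counts with the `onnx` package describes one possible approach rather than the script actually used.

```python
from typing import Dict


def op_count_delta(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Per-op-type change in node count, omitting op types that did not change."""
    delta = {}
    for op in sorted(set(before) | set(after)):
        diff = after.get(op, 0) - before.get(op, 0)
        if diff:
            delta[op] = diff
    return delta


# Counts as reported above. With the onnx package, a full count could be built via
# collections.Counter(node.op_type for node in onnx.load(path).graph.node).
optimized_17 = {"Transpose": 49, "Reshape": 121}
optimized_21 = {"Transpose": 49, "Reshape": 169}
quantized_17 = {"Transpose": 49, "DQ": 615, "Q": 392}
quantized_21 = {"Transpose": 49, "DQ": 663, "Q": 440}

print(op_count_delta(optimized_17, optimized_21))  # only Reshape changes
print(op_count_delta(quantized_17, quantized_21))  # only DQ and Q change
```

The reported totals fit the same picture: 439 - 391 = 48 for the optimized graphs, and 1542 - 1398 = 144 for the quantized graphs, which would be consistent with 48 extra Reshape nodes each gaining one Q and one DQ node.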

…DINOv2-specific

New benchmark results (2026-06-17, QNN NPU Snapdragon X Elite, 3x500-iter W8A16):

BAAI/bge-small-en-v1.5 (BERT/sentence-similarity):
- h0=10.617ms [10.52, 10.32, 11.01]
- h3=9.840ms [10.25, 9.33, 9.94]
- opset21 gain +7.3% -- MARGINAL / INCONCLUSIVE (CV=0.3, ranges barely non-overlapping)
- Unusual vs all other NLP models (distilbert -0.1%, MiniLM -0.7%, roberta +0.1%)
- Needs 5+ sessions to differentiate from DVFS noise.

rizvandwiki/gender-classification (plain ViT):
- h0=14.326ms [14.15, 14.94, 13.89]
- h3=13.830ms [13.70, 13.92, 13.87]
- opset21 gain +3.5% -- NEUTRAL (ranges overlap 13.89/13.92ms, CV=0.35)

CRITICAL FINDING: this ViT model has IDENTICAL op counts to DINOv2-small (49 Transpose, 121 Reshape, ~72 Gemm) yet shows NO benefit. Confirms npu-001 is not explainable by op-count profiles or general ViT architecture.

Combined with Transpose analysis (Task 3): opset17 and opset21 DINOv2-small have identical Transpose node counts (49). The speedup mechanism is NOT Transpose elimination. The effect is specific to DINOv2 family at a level below op-count visibility -- possibly quantization behavior, tensor layout, or QNN EP partitioning.

Also updated: models_tested list (+5 entries), validated_models sections, scope and confidence statements, task completion notes in follow_up_required.
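The per-session numbers in this commit message are enough to re-derive the headline figures. The sketch below assumes the reported h0/h3 values are plain means of the three session p50s and that "ranges overlap" means the min-max intervals intersect; `summarize` and `ranges_overlap` are illustrative names.

```python
from statistics import fmean
from typing import Dict, Sequence


def ranges_overlap(a: Sequence[float], b: Sequence[float]) -> bool:
    """True when the [min, max] intervals of the two session sets intersect."""
    return min(a) <= max(b) and min(b) <= max(a)


def summarize(h0: Sequence[float], h3: Sequence[float]) -> Dict[str, object]:
    """Session-level means, relative gain of h3 over h0, and range overlap."""
    m0, m3 = fmean(h0), fmean(h3)
    return {
        "h0_ms": m0,
        "h3_ms": m3,
        "gain_pct": round((m0 - m3) / m0 * 100.0, 1),
        "overlap": ranges_overlap(h0, h3),
    }


# Per-session p50s (ms) as quoted in the commit message above.
bge = summarize([10.52, 10.32, 11.01], [10.25, 9.33, 9.94])
vit = summarize([14.15, 14.94, 13.89], [13.70, 13.92, 13.87])

print(bge["gain_pct"], bge["overlap"])  # gain near +7.3%, ranges do not overlap
print(vit["gain_pct"], vit["overlap"])  # gain near +3.5%, ranges overlap
```

For bge-small the ranges are separated by only 0.07 ms (10.25 vs 10.32), which matches the "barely non-overlapping" wording and explains why the result is still treated as inconclusive under the >10% rule.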

…ndings, fix mechanism claims

cpu.json:
- cpu-001: mechanism_confirmed true->false. Data is real (opset 17 best) but the kMaxSupportedOpset gate hypothesis doesn't explain the non-monotonic pattern (opset22=85ms partial recovery while 19/20/21 all ~150-170ms). Two separate kMaxSupportedOpset constants exist (NHWC gate vs Transpose Optimizer gate); the CPU one is unverified. Added note on this distinction.
- cpu-006: mechanism_confirmed true->false (derived from cpu-001). Meta-rule (EP isolation) remains valid. Added note that NPU/CPU experiments used different models (DINOv2 vs ConvNext) -- comparison is directional only.

dml.json:
- dml-001: INVALIDATED as 'DML is faster'. DML p50=16.9ms vs QNN GPU p50=17.7ms: diff = 0.8ms = 0.82 sigma of GPU measurement -- distributions OVERLAP. Retained: DML IS more stable (std 0.52 vs 0.97), that difference is real.
- dml-002: HEADLINE CORRECTED. p50 with NHWC is marginally BETTER (16.5 vs 16.9ms), not worse. The actual finding is NHWC increases tail latency (p90 +19%) and variance (std 3.6x worse). Action unchanged (avoid NHWC) but for stability reasons, not p50.

qnn_gpu.json:
- gpu-003: Downgraded from medium to low confidence. Single experiment, 34% gap is above noise level but needs replication before citing as 'NEVER use compile'.
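The dml-001 correction rests on one piece of arithmetic: the p50 gap expressed in units of the QNN GPU standard deviation. A minimal sketch, assuming 0.97 ms is the QNN GPU std and 0.52 ms the DML std as the wording above suggests (`diff_in_sigmas` is an illustrative name):

```python
def diff_in_sigmas(a_ms: float, b_ms: float, sigma_ms: float) -> float:
    """Absolute latency difference expressed in multiples of one standard deviation."""
    if sigma_ms <= 0:
        raise ValueError("sigma must be positive")
    return abs(a_ms - b_ms) / sigma_ms


dml_p50, qnn_gpu_p50 = 16.9, 17.7  # ms, from dml-001
qnn_gpu_std = 0.97                 # ms, assumed to be the QNN GPU std

z = diff_in_sigmas(dml_p50, qnn_gpu_p50, qnn_gpu_std)
print(f"{z:.2f} sigma")  # below 1 sigma, so the gap is within measurement spread
```

A gap smaller than one standard deviation of the slower measurement is why the "DML is faster" headline was withdrawn while the stability difference was kept.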

Key corrections:
- Bench protocol: QNN NPU CV 0.10-1.2 is normal (DVFS); never reject on CV. Protocol is 3x500-iter always, not gated on CV.
- Phase 4 conv fusions: add npu-006 hard gate: FusedConv not supported by QNN EP -> CPU fallback -> +4900% regression on Conv-dense models. Rule: skip all conv-*-fusion if Conv% of total ops > 20%.
- Diagnosis table: add npu-006 catastrophic regression row.
- Gate 2 lesson: DINOv2 opset21 +24-31% is real but mechanism UNKNOWN. Two hypotheses ruled out: kMaxSupportedOpset bypass (ORT>=23), Transpose elimination (count identical opset17/21). +48 Reshape nodes only diff found. ViT models with identical op counts see no benefit -- effect below topology.
- DML vs QNN GPU: correct 'consistently faster' claim -- 0.8ms diff = 0.82sigma, distributions overlap. Real finding: DML is more stable (std 0.52 vs 0.97).
- EP table: update QNN NPU to 'architecture-dependent', add conv-fusion caveat; DML note corrected; CPU note: mechanism uncertain (two kMaxSupportedOpset).
- Actionable findings: replace 'mechanism CONFIRMED' with full invalidation log.

… loop

Phase 0: new analyze step sets 3 EP-specific flags before any experiment:
- conv_fusions_blocked: QNN NPU + Conv% > 20% -> skip all conv-*-fusion
- nhwc_blocked: QNN GPU / DML -> skip nhwc-transformer (dml-002)
- opset_sweep_blocked: CPU EP -> never sweep opset (cpu-001, fixed at 17)
- bench_protocol: 'npu' if QNN NPU -> always 3-session, no CV gate

Phase 1 skip_set: 3 new hard blocks wired from Phase 0 flags:
- conv fusions blocked when npu-006 risk detected
- nhwc-transformer blocked for GPU/DML EPs
- opset sweep blocked for CPU EP
- Conv bottleneck queue respects conv_fusions_blocked flag

Phase 2 loop:
- Hypothesis rule 2a: start with W8A16 (not W8A8); W8A8 is high-risk for LN/GELU
- W8A8 early exit: if top-1 <= 15% on first W8A8 attempt -> skip all W8A8 variants
- PERF step: full EP-aware bench protocol with 3-session NPU path, CV gate for CPU/GPU, s0 JIT exclusion rule, and non-overlapping range requirement for KEEP
- Post-convergence: mandatory compile for QNN NPU (+1.7x validated), explicit compile-skip guard for GPU/DML (compile regresses on Adreno X1-85).
- Hypothesis generation: opset sweep is now EP-qualified: CPU always blocked, GPU/DML not validated (skip), QNN NPU full sweep 17-21 with scope note.
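The Phase 0 flags and the Phase 1 skip set described above can be sketched as a pure function of the target EP and the Conv share. This is an illustration, not the PR's implementation: the EP identifiers (`"qnn_npu"`, `"qnn_gpu"`, `"dml"`, `"cpu"`), the `"cv_gated"` protocol name, and the `"opset-sweep"` label are hypothetical, while the flag names, `conv-*-fusion`, and `nhwc-transformer` come from the commit message.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Phase0Flags:
    conv_fusions_blocked: bool
    nhwc_blocked: bool
    opset_sweep_blocked: bool
    bench_protocol: str


def analyze(ep: str, conv_pct: float) -> Phase0Flags:
    """Derive the EP-specific flags before any experiment runs.

    `ep` uses hypothetical identifiers; `conv_pct` is Conv nodes as a
    percentage of all graph nodes.
    """
    return Phase0Flags(
        conv_fusions_blocked=(ep == "qnn_npu" and conv_pct > 20.0),  # npu-006
        nhwc_blocked=ep in {"qnn_gpu", "dml"},                       # dml-002
        opset_sweep_blocked=(ep == "cpu"),                           # cpu-001
        bench_protocol="npu" if ep == "qnn_npu" else "cv_gated",     # name assumed
    )


def skip_set(flags: Phase0Flags) -> Set[str]:
    """Hypotheses that Phase 1 should never schedule, given the Phase 0 flags."""
    skipped = set()
    if flags.conv_fusions_blocked:
        skipped.add("conv-*-fusion")
    if flags.nhwc_blocked:
        skipped.add("nhwc-transformer")
    if flags.opset_sweep_blocked:
        skipped.add("opset-sweep")
    return skipped


print(skip_set(analyze("qnn_npu", conv_pct=35.0)))
print(skip_set(analyze("cpu", conv_pct=35.0)))
```

Keeping the analysis step free of side effects means the hard blocks can be unit-tested without building or benchmarking any model.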

DingmaomaoBJTU requested a review from a team as a code owner June 15, 2026 02:30

github-actions Bot and others added 2 commits June 15, 2026 10:32

github-advanced-security AI found potential problems Jun 15, 2026


Comment thread research/autoconfig/gen_report_v3.py

import json

results = json.load(open(r"ablation-search\results.json"))

xieofxie reviewed Jun 16, 2026


github-actions Bot and others added 2 commits June 16, 2026 14:33

github-advanced-security AI found potential problems Jun 16, 2026


github-actions Bot added 6 commits June 16, 2026 19:12

DingmaomaoBJTU wants to merge 12 commits into main from dingmaomaobjtu/research-autoconfig-poc

3 participants
