research: add autoconfig POC with QNN NPU catalog sweep#891
Open
DingmaomaoBJTU wants to merge 12 commits into
Adds research/autoconfig/ -- an automated config search POC that sweeps opset versions (17-21), execution providers, and graph optimizations to find the best winml-cli build config for a given model on Windows hardware.

Key findings from 8-model QNN NPU catalog sweep:
- npu-001: opset 21 bypass gives +25-31% on Conv+residual models (MobileViT, DINOv2)
- npu-006: conv fusions (conv-bn/add/activation) cause 4900% regression on ResNet-18 QNN NPU
- npu-007: DVFS thermal noise requires session-level averaging (3x500 iters) for reliable results

Includes ep_knowledge/ KB with confirmed findings per EP, and catalog-qnn-sweep/ with per-model benchmark results and cross-model pattern analysis.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
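A minimal sketch of the kind of sweep described above, assuming a simple exhaustive grid. Only the opset range (17-21) comes from the commit message; the EP names, the `conv-bn-fusion` flag string, and the `enumerate_configs` / `best_config` helpers are hypothetical and are not taken from autoconfig.py.

```python
from itertools import product

# Opset range 17-21 is from the PR text. The EP names and the optimization
# flag below are illustrative placeholders, not the real winml-cli options.
OPSETS = range(17, 22)
EPS = ["cpu", "qnn_npu"]
GRAPH_OPTS = [(), ("conv-bn-fusion",)]


def enumerate_configs(opsets=OPSETS, eps=EPS, graph_opts=GRAPH_OPTS):
    """Return every (opset, ep, optimizations) combination as a dict."""
    return [
        {"opset": opset, "ep": ep, "optimizations": list(opts)}
        for opset, ep, opts in product(opsets, eps, graph_opts)
    ]


def best_config(configs, measure):
    """Pick the config with the lowest latency.

    `measure` is any callable mapping a config dict to a latency in ms,
    for example a wrapper around a build-and-benchmark step.
    """
    return min(configs, key=measure)
```

In a real run `measure` would build and benchmark each candidate; passing it in as a callable keeps the enumeration testable without hardware.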
Adds research/autoconfig/docs/agent-design.md -- strategic design for the agent layer of winml-cli, covering:
- winml-cli vs Olive distinction (UX + Windows-first + explainability)
- Why autoconfig search is a sub-tool, not the agent entry point
- 5 agent types: Diagnostic, Decision Guidance, Cross-Device Confidence, Regression Detection, Model Recommendation
- Autoconfig's role within the agent framework
- Key concerns and open questions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds research/autoconfig/docs/skills-design.md -- full design doc for the winml-cli skills/agent layer, including:
- 11 skill designs (use-winml-cli, optimize-for-device, ep-compatibility-check, debug-accuracy-drop, and others)
- Competitive analysis (Apple coremltools, ExecuTorch, AI Hub, NVIDIA ModelOpt, OpenVINO, Olive)
- Top 5 feature gaps
- Validation confidence levels (L1-L5)
- Structured output requirements
- QNN NPU catalog sweep findings (npu-001/006/007)
- FusedConv unfuse feature request

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
import json

results = json.load(open(r"ablation-search\results.json"))
…ping skills

- Split skill catalog into two ranked categories by the 'does it touch code?' discriminator: User (config-only) and Contributor (code changes)
- Merge overlapping skills (12 -> 9):
  - check-model-feasibility = find-a-model + ep-compatibility-check
  - ship-to-winapp = validate-before-ship + prepare-for-winapp
  - autoconfig absorbs optimize-for-device as its manual mode
- Add self-contained HTML render of the design doc for easier reading
xieofxie reviewed Jun 16, 2026

{
  "id": "cpu-005",
  "title": "Baseline (no extra flags) is the optimal config for ConvNext CPU",
xieofxie reviewed Jun 16, 2026

{
  "id": "cpu-001",
  "title": "opset 19+ causes severe regression on CPU EP (3-4x slowdown)",
Critical issues found and corrected:

npu-001 (opset 21 speedup):
- mechanism_confirmed changed TRUE → FALSE. The kMaxSupportedOpset bypass requires ORT < 1.18; the sweep used onnxruntime-windowsml 1.24.5 where kMaxSupportedOpset >= 22. The bypass mechanism does not apply. The speedup for DINOv2/MobileViT is empirically real but the WHY is now unknown.
- ResNet-18 removed from 'benefits' list -- sub-ms model, 3-session ranges span 4x for the same config (pure DVFS noise). Reported +20.2% was noise.
- MobileViT magnitude corrected: h1 had DVFS spike inflating median to 11.72ms; actual gain is ~20-26% not 26.5%.
- DINOv2 finding kept: 3-session data shows non-overlapping distributions.
- Added per-session raw data analysis and required follow-up experiments.

npu-002 / npu-003 (W8A16 speedup, compile speedup):
- scope changed from 'General / all vision models' to 'ConvNext only' (both findings from 1 model; magnitude claims not transferable)
- confidence reduced from 'high' to 'medium'

npu-004 (W8A8 accuracy collapse):
- confidence changed from 'medium' to 'very_low / anecdote'
- Finding has NO recorded data (experiment 'aborted early, numbers not saved'). Cannot be treated as a KB finding until re-run with recorded numbers.

npu-005 (QNN Hub comparison):
- Added fairness caveat: comparing qairt-stack model on ORT QNN EP is not a valid comparison. Finding is trivially true (use right tool for right stack) but not informative.

npu-006 (conv fusions catastrophic):
- No confidence change -- this is the most statistically solid finding.
- Added session-level evidence note: h4 CV=0.016 (extremely stable, unusual for QNN NPU), consistent with deterministic CPU fallback hypothesis.

search_space_rules:
- opset recommendation changed from 'Conv+residual' to 'Conv+attention hybrid' to reflect actual validated models (DINOv2 is attention-dominant, not Conv+residual in the traditional sense)

New file: docs/ep-knowledge-review.md
- Full statistical analysis of per-session data
- ORT version dependency explained
- Additional models needed for validation
- Minimum experiment protocol

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eneral ViT
Run validation_sweep.py across 3 new models to rigorously test npu-001
(opset21 speedup) and npu-006 (conv fusion regression) hypotheses.
KEY FINDINGS:
npu-001 (opset21 speedup):
- facebook/dinov2-base: +24.1% (opset17 34.56ms -> opset21 26.23ms)
3-session full bench, fresh quantized.onnx builds, very stable
- microsoft/rad-dino: -0.1% NEUTRAL -- model runs on CPU (~275ms),
QNN NPU cannot accelerate ViT-L; opset irrelevant when CPU-bound
- facebook/dino-vitb16: -0.7% NEUTRAL -- critical control proving the
speedup is NOT a general ViT property; DINOv2-specific op patterns
must explain the difference
Combined with original catalog data:
dinov2-small +30.6%, dinov2-base +24.1% (both confirmed)
dino-vitb16 NEUTRAL (confirmed control) -> scope is DINOv2 family
npu-006 (conv fusions):
- dinov2-base: fusions -25% (faster) -- attention-dominant, benign
- dino-vitb16: fusions +1% (neutral) -- no meaningful Conv ops to fuse
Combined with original resnet-18 +4900% -> hazard is conv-density-gated
Script fixes in validation_sweep.py:
- bench_screen parsed d.get('p50_ms') instead of d['latency_ms']['p50']
- Reuse check accepted any .onnx (including truncated export.onnx)
- Model selection preferred optimized.onnx over quantized.onnx
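The first script fix above (reading the nested `latency_ms.p50` field instead of a flat `p50_ms` key) can be sketched as a small defensive parser. The result shape `{"latency_ms": {"p50": ...}}` is taken from the commit message; the helper name `extract_p50` is hypothetical and this is not the code in validation_sweep.py.

```python
def extract_p50(result):
    """Return the p50 latency in ms from a benchmark result dict, or None.

    Expects the nested shape {"latency_ms": {"p50": <number>}}. A flat
    "p50_ms" key (the shape the old parser looked for) is deliberately
    ignored, so a schema mismatch yields None instead of a silent zero.
    """
    lat = result.get("latency_ms") if isinstance(result, dict) else None
    p50 = lat.get("p50") if isinstance(lat, dict) else None
    # Reject missing, non-numeric, boolean, or non-positive values.
    if isinstance(p50, bool) or not isinstance(p50, (int, float)) or p50 <= 0:
        return None
    return round(float(p50), 3)
```

Returning `None` on any mismatch lets the caller distinguish "no measurement" from a real latency, which matters when medians are computed over a handful of sessions.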
Updated files:
- ep_knowledge/qnn_npu.json: npu-001 scope narrowed to DINOv2-family,
validated_models expanded with dino-vitb16 (negative control) and
dinov2-base (positive), rad-dino (CPU-bound); npu-006 scope updated
- catalog-qnn-sweep/VALIDATION_SUMMARY.md: full cross-model results table
- catalog-qnn-sweep/{dinov2-base,rad-dino,dino-vitb16}/results_v2.json
- catalog-qnn-sweep/.gitignore: exclude val_h*/ build artifact dirs
if complete_models:
    print(f" [reuse] existing build in {hyp_dir.name}", flush=True)
    ok = True
    build_out = "(reused)"
    p50 = lat.get("p50") if isinstance(lat, dict) else None
    if p50:
        p50s.append(round(p50, 3))
except Exception:
…nism invalidated, confidence calibrated

Merge structural improvements from local review into KB (smart merge, preserving validation sweep data from 2026-06-16):

npu-001:
- Add mechanism_invalidation field (explicit statement of INVALIDATION with cause: ORT 1.24.5 kMaxSupportedOpset>=22, bypass does not apply)
- Add critical_caveats array (4 caveats incl. DINOv2-specific scope note)
- Downgrade confidence to 'medium-high on empirical / low on mechanism' (was 'high' which was overclaiming given unknown mechanism)

npu-002/003:
- Add follow_up_required fields (FP32 baselines on MobileViT/DINOv2/ResNet)

npu-004:
- Update action_for_autoconfig: 'Do NOT use to skip W8A8 without running eval first' (was 'Treat as potentially risky' which was still prescriptive without data)

search_space_rules:
- Rename recommended_order_conv_attention_hybrid -> recommended_order_conv_residual to match local review terminology

NOTE: Validation sweep data (dinov2-base +24.1%, dino-vitb16 NEUTRAL, rad-dino CPU-bound) from 2026-06-16 is preserved -- not overwritten.
…d NOT Transpose elimination

Task 3 investigation: loaded dinov2-small opset17 (h0) and opset21 (h3) optimized.onnx and quantized.onnx from catalog_qnn_sweep builds; counted op types with onnx.load().

Key finding: Transpose count is IDENTICAL (49 nodes) in both opsets.
- opset17 optimized: 391 total, 49 Transpose, 121 Reshape
- opset21 optimized: 439 total, 49 Transpose, 169 Reshape (+48)
- opset17 quantized: 1398 total, 49 Transpose, 615 DQ, 392 Q
- opset21 quantized: 1542 total, 49 Transpose, 663 DQ, 440 Q (+48 QDQ pairs)

Rules out: NHWC Transpose-elimination as speedup cause, fewer-ops as explanation.
Consistent with: QNN EP scheduling/partitioning difference triggered by +48 Reshape nodes.

Also: kMaxSupportedOpset confirmed >= 23 in ORT 1.24.4 (C:\\tmp env), reaffirming that the original bypass mechanism does NOT apply.

Updated npu-001 critical_caveats, follow_up_required, and added transpose_analysis_2026_06_16 section with raw op counts.
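The op-count comparison described in this commit could look roughly like the sketch below. It assumes the third-party `onnx` package, whose `onnx.load()` returns a model with a `graph.node` list where each node has an `op_type`. The helper names are hypothetical, and only top-level graph nodes are counted (nodes inside control-flow subgraphs are not visited).

```python
from collections import Counter


def count_op_types(nodes):
    """Count nodes per op_type; `nodes` is any iterable with .op_type."""
    return Counter(node.op_type for node in nodes)


def op_count_delta(before, after):
    """Return {op_type: after - before} for op types whose count changed.

    Both arguments must be Counters so that missing op types read as 0.
    """
    keys = set(before) | set(after)
    return {k: after[k] - before[k] for k in keys if after[k] != before[k]}


def load_op_counts(path):
    """Load an ONNX file and count its top-level op types."""
    import onnx  # third-party; imported lazily so the helpers above work without it

    model = onnx.load(path)
    return count_op_types(model.graph.node)
```

With the optimized-model counts quoted above (49 Transpose in both, 121 vs 169 Reshape), `op_count_delta` would report only a Reshape difference of 48.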
…DINOv2-specific

New benchmark results (2026-06-17, QNN NPU Snapdragon X Elite, 3x500-iter W8A16):

BAAI/bge-small-en-v1.5 (BERT/sentence-similarity):
  h0=10.617ms [10.52, 10.32, 11.01]
  h3=9.840ms [10.25, 9.33, 9.94]
  opset21 gain +7.3% -- MARGINAL / INCONCLUSIVE (CV=0.3, ranges barely non-overlapping)
  Unusual vs all other NLP models (distilbert -0.1%, MiniLM -0.7%, roberta +0.1%)
  Needs 5+ sessions to differentiate from DVFS noise.

rizvandwiki/gender-classification (plain ViT):
  h0=14.326ms [14.15, 14.94, 13.89]
  h3=13.830ms [13.70, 13.92, 13.87]
  opset21 gain +3.5% -- NEUTRAL (ranges overlap 13.89/13.92ms, CV=0.35)

CRITICAL FINDING: this ViT model has IDENTICAL op counts to DINOv2-small (49 Transpose, 121 Reshape, ~72 Gemm) yet shows NO benefit. Confirms npu-001 is not explainable by op-count profiles or general ViT architecture.

Combined with Transpose analysis (Task 3): opset17 and opset21 DINOv2-small have identical Transpose node counts (49). The speedup mechanism is NOT Transpose elimination. The effect is specific to DINOv2 family at a level below op-count visibility -- possibly quantization behavior, tensor layout, or QNN EP partitioning.

Also updated: models_tested list (+5 entries), validated_models sections, scope and confidence statements, task completion notes in follow_up_required.
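The per-session comparison used in these results (mean of session values, percentage gain, and a range-overlap check) can be reproduced with a short sketch. The function name and return shape are hypothetical; the input numbers in the usage note are the session values quoted in the commit message.

```python
from statistics import mean


def compare_sessions(baseline_ms, candidate_ms):
    """Compare two lists of per-session latencies (ms).

    Returns the mean of each list, the gain of the candidate over the
    baseline in percent (positive = candidate is faster), and whether the
    two min/max ranges are fully disjoint.
    """
    base_mean = mean(baseline_ms)
    cand_mean = mean(candidate_ms)
    gain_pct = (base_mean - cand_mean) / base_mean * 100.0
    non_overlapping = (
        max(candidate_ms) < min(baseline_ms)
        or max(baseline_ms) < min(candidate_ms)
    )
    return {
        "baseline_mean_ms": base_mean,
        "candidate_mean_ms": cand_mean,
        "gain_pct": gain_pct,
        "non_overlapping": non_overlapping,
    }
```

For bge-small, `compare_sessions([10.52, 10.32, 11.01], [10.25, 9.33, 9.94])` gives a gain of about 7.3% with disjoint ranges (10.25 < 10.32); for the plain ViT, `compare_sessions([14.15, 14.94, 13.89], [13.70, 13.92, 13.87])` gives about 3.5% with overlapping ranges (13.89 < 13.92).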
…ndings, fix mechanism claims

cpu.json:
- cpu-001: mechanism_confirmed true->false. Data is real (opset 17 best) but the kMaxSupportedOpset gate hypothesis doesn't explain the non-monotonic pattern (opset22=85ms partial recovery while 19/20/21 all ~150-170ms). Two separate kMaxSupportedOpset constants exist (NHWC gate vs Transpose Optimizer gate); the CPU one is unverified. Added note on this distinction.
- cpu-006: mechanism_confirmed true->false (derived from cpu-001). Meta-rule (EP isolation) remains valid. Added note that NPU/CPU experiments used different models (DINOv2 vs ConvNext) -- comparison is directional only.

dml.json:
- dml-001: INVALIDATED as 'DML is faster'. DML p50=16.9ms vs QNN GPU p50=17.7ms: diff = 0.8ms = 0.82 sigma of GPU measurement -- distributions OVERLAP. Retained: DML IS more stable (std 0.52 vs 0.97), that difference is real.
- dml-002: HEADLINE CORRECTED. p50 with NHWC is marginally BETTER (16.5 vs 16.9ms), not worse. The actual finding is NHWC increases tail latency (p90 +19%) and variance (std 3.6x worse). Action unchanged (avoid NHWC) but for stability reasons, not p50.

qnn_gpu.json:
- gpu-003: Downgraded from medium to low confidence. Single experiment, 34% gap is above noise level but needs replication before citing as 'NEVER use compile'.
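The "0.82 sigma" figure for dml-001 is the p50 difference divided by the QNN GPU standard deviation. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the 2-sigma threshold is an illustrative assumption rather than a rule from the KB.

```python
def diff_in_sigma(p50_a_ms, p50_b_ms, std_ref_ms):
    """Express a p50 difference in units of a reference standard deviation."""
    return abs(p50_a_ms - p50_b_ms) / std_ref_ms


def is_distinguishable(p50_a_ms, p50_b_ms, std_ref_ms, threshold_sigma=2.0):
    """Treat two medians as distinguishable only above a sigma threshold.

    The default of 2.0 is an assumed cut-off for illustration.
    """
    return diff_in_sigma(p50_a_ms, p50_b_ms, std_ref_ms) >= threshold_sigma
```

With the values above, `diff_in_sigma(16.9, 17.7, 0.97)` is roughly 0.82, well under any reasonable threshold, which is why the 'DML is faster' headline was withdrawn.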
Key corrections:
- Bench protocol: QNN NPU CV 0.10-1.2 is normal (DVFS); never reject on CV. Protocol is 3x500-iter always, not gated on CV.
- Phase 4 conv fusions: add npu-006 hard gate -- FusedConv not supported by QNN EP -> CPU fallback -> +4900% regression on Conv-dense models. Rule: skip all conv-*-fusion if Conv% of total ops > 20%.
- Diagnosis table: add npu-006 catastrophic regression row.
- Gate 2 lesson: DINOv2 opset21 +24-31% is real but mechanism UNKNOWN. Two hypotheses ruled out: kMaxSupportedOpset bypass (kMaxSupportedOpset >= 23 in the ORT build used), Transpose elimination (count identical opset17/21). +48 Reshape nodes is the only diff found. ViT models with identical op counts see no benefit -- effect is below topology.
- DML vs QNN GPU: correct 'consistently faster' claim -- 0.8ms diff = 0.82 sigma, distributions overlap. Real finding: DML is more stable (std 0.52 vs 0.97).
- EP table: update QNN NPU to 'architecture-dependent', add conv-fusion caveat; DML note corrected; CPU note: mechanism uncertain (two kMaxSupportedOpset constants).
- Actionable findings: replace 'mechanism CONFIRMED' with full invalidation log.
… loop
Phase 0 — new analyze step sets 3 EP-specific flags before any experiment:
conv_fusions_blocked: QNN NPU + Conv% > 20% -> skip all conv-*-fusion
nhwc_blocked: QNN GPU / DML -> skip nhwc-transformer (dml-002)
opset_sweep_blocked: CPU EP -> never sweep opset (cpu-001, fixed at 17)
bench_protocol: 'npu' if QNN NPU -> always 3-session, no CV gate
Phase 1 skip_set — 3 new hard blocks wired from Phase 0 flags:
conv fusions blocked when npu-006 risk detected
nhwc-transformer blocked for GPU/DML EPs
opset sweep blocked for CPU EP
Conv bottleneck queue respects conv_fusions_blocked flag
Phase 2 loop:
Hypothesis rule 2a: start with W8A16 (not W8A8); W8A8 is high-risk for LN/GELU
W8A8 early exit: if top-1 <= 15% on first W8A8 attempt -> skip all W8A8 variants
PERF step: full EP-aware bench protocol with 3-session NPU path, CV gate for
CPU/GPU, s0 JIT exclusion rule, and non-overlapping range requirement for KEEP
Post-convergence: mandatory compile for QNN NPU (+1.7x validated), explicit
compile-skip guard for GPU/DML (compile regresses on Adreno X1-85).
Hypothesis generation: opset sweep is now EP-qualified — CPU always blocked,
GPU/DML not validated (skip), QNN NPU full sweep 17-21 with scope note.
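The Phase 0 flags listed in this commit can be sketched as a pure function of the target EP and the model's Conv share. The flag names and the 20% threshold come from the commit message; the EP identifier strings, the `Phase0Flags` type, the `analyze` signature, and the `"default"` protocol value are assumptions for illustration.

```python
from dataclasses import dataclass

CONV_FRACTION_LIMIT = 0.20  # "Conv% > 20%" gate from npu-006


@dataclass(frozen=True)
class Phase0Flags:
    conv_fusions_blocked: bool
    nhwc_blocked: bool
    opset_sweep_blocked: bool
    bench_protocol: str


def analyze(ep, conv_ops, total_ops):
    """Derive EP-specific flags before any experiment runs.

    `ep` is one of the assumed identifiers "cpu", "dml", "qnn_gpu",
    "qnn_npu"; `conv_ops` / `total_ops` are node counts from the graph.
    """
    conv_fraction = conv_ops / total_ops if total_ops else 0.0
    return Phase0Flags(
        # npu-006: skip conv-*-fusion on Conv-dense models for QNN NPU.
        conv_fusions_blocked=(ep == "qnn_npu" and conv_fraction > CONV_FRACTION_LIMIT),
        # dml-002: skip nhwc-transformer on QNN GPU and DML.
        nhwc_blocked=ep in ("qnn_gpu", "dml"),
        # cpu-001: never sweep opset on the CPU EP.
        opset_sweep_blocked=(ep == "cpu"),
        # NPU always uses the 3-session protocol; the other label is assumed.
        bench_protocol="npu" if ep == "qnn_npu" else "default",
    )
```

Keeping the flags in one immutable object makes it straightforward for the Phase 1 skip set to consume them without re-deriving EP rules.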
What this PR adds
research/autoconfig/: an automated config search POC that sweeps opset versions (17–21), execution providers, and graph optimizations to find the best winml-cli build config for a given model on Windows hardware.
Key findings from QNN NPU catalog sweep (8 models, Snapdragon X Elite)
npu-001: opset21 gives +24–31% on DINOv2 family — NOT a general ViT property
Rigorously validated with fresh quantized.onnx builds, 3×500-iter sessions:
Key discriminant: dino-vitb16 is the same ViT-B size as dinov2-base, but gets zero benefit from opset21. The speedup is specific to the DINOv2 architecture — mechanism TBD (DINOv2-specific op patterns in opset21 ONNX export, not the original kMaxSupportedOpset bypass mechanism which doesn't apply to ORT 1.24.x).
npu-006: conv fusions cause catastrophic regression on Conv-dominant models only
Hazard is proportional to Conv op density. Attention-dominant models are safe or slightly benefit.
npu-007: DVFS thermal noise requires session-level averaging
QNN NPU CV is always 0.1–2.0+. Use 3×500-iter sessions with 30s cool-down. Trust gains >10% only.
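A minimal sketch of the session-level protocol described above (3 sessions of 500 iterations with a 30 s cool-down between them), assuming the workload is exposed as a zero-argument callable. The `bench_sessions` name, its parameters, and the choice to report the mean of per-session medians are assumptions, not the behaviour of `winml perf` or the PR's scripts.

```python
import time
from statistics import mean, median


def bench_sessions(run_once, sessions=3, iters=500, cooldown_s=30.0, sleep=time.sleep):
    """Time `run_once` in several sessions separated by a cool-down.

    Returns the per-session median latency (ms) and the mean of those
    medians. `sleep` is injectable so tests can skip the real wait.
    """
    session_p50s = []
    for s in range(sessions):
        if s > 0:
            sleep(cooldown_s)  # let the device cool before the next session
        samples = []
        for _ in range(iters):
            t0 = time.perf_counter()
            run_once()
            samples.append((time.perf_counter() - t0) * 1000.0)
        session_p50s.append(median(samples))
    return {"session_p50_ms": session_p50s, "mean_p50_ms": mean(session_p50s)}
```

Reporting each session's median alongside the mean keeps the raw spread visible, which is what the range-overlap checks elsewhere in this PR rely on.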
Included files
Core scripts
- autoconfig.py: main search loop (ConvNext CPU baseline)
- catalog_qnn_sweep.py: 8-model QNN NPU catalog sweep
- analyze_graph.py: ONNX graph analysis helper
- validation_sweep.py: focused npu-001/npu-006 validation sweep (NEW)
- gen_report_v3.py, autoconfig_diagram.html
Knowledge base (ep_knowledge/)
- qnn_npu.json: 7 findings (npu-001 through npu-007), continuously updated with validation data
- cpu.json, dml.json, qnn_gpu.json
Benchmark results (catalog-qnn-sweep/)
- SUMMARY.md: original 8-model sweep results
- VALIDATION_SUMMARY.md: 3-model validation sweep with full per-session data and cross-model comparison table
- results.json and results_v2.json for dinov2-base, rad-dino, dino-vitb16
Design docs (docs/)
- agent-design.md: winml-cli agent layer design (Diagnostic / Decision / Cross-Device / Regression / Recommendation agents)
- skills-design.md: WinML CLI Skills Design (11 skills, competitive analysis, feature gaps)
- ep-knowledge-review.md: statistical audit of ep_knowledge findings
Feature gaps identified
- analyze_graph.py: needed to gate npu-006 rule automatically
- winml perf doesn't expose session-level averaging
Status: Research POC, not production code. Scripts run standalone; not integrated into the winml CLI yet.