Feat/sam3 onnx hub support (issue #324)#582
Conversation
There was a problem hiding this comment.
Pull request overview
Adds support for Hub-hosted pre-exported ONNX inputs (<org>/<repo>/<path>.onnx) to enable SAM 3 consumption via Scenario D, and fixes pre-quantized (QDQ + QOperator) detection/routing so already-quantized models skip ORT optimization/quantization stages that can crash on unsupported integer kernels.
Changes:
- Introduces
loader/onnx_hub.pyto recognize and download Hub-hosted ONNX files (plus best-effort.onnx_datasidecars), and wires resolution into multiple entry points (CLI + inference/model loading). - Expands
is_quantized_onnxto detect QOperator quantized models and adds shared quant-op constants. - Updates build pipelines to honor “pre-quantized” routing (skip optimize/quantize) and adds regression/unit/integration tests for SAM 3 and Hub-ONNX refs.
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/onnx/test_detection.py | New tests for quantized (QDQ + QOperator) and compiled ONNX detection. |
| tests/unit/loader/test_onnx_hub.py | New unit tests for Hub-ONNX path detection/splitting/downloading behavior. |
| tests/unit/inference/test_engine.py | Ensures Hub-ONNX refs are resolved before inference routing logic. |
| tests/unit/commands/test_perf_cli.py | Verifies winml perf resolves Hub-ONNX refs before ONNX path validation. |
| tests/unit/commands/test_hub_onnx_ref.py | New CLI tests for wmk config/wmk build with Hub-hosted ONNX refs. |
| tests/unit/commands/test_eval.py | Adds Hub-ONNX resolution coverage in eval model-path resolution. |
| tests/unit/build/test_onnx.py | Regression tests ensuring pre-quantized ONNX skips optimize/quantize stages. |
| tests/unit/build/test_hf.py | Regression tests ensuring pre-quantized exported ONNX skips optimize/quantize stages. |
| tests/integration/test_sam3_e2e.py | End-to-end integration tests for SAM3 decoder + encoder via Hub-hosted ONNX artifacts. |
| src/winml/modelkit/onnx/detection.py | Updates quantized detection to use unified QDQ+QOperator op set. |
| src/winml/modelkit/models/auto.py | Resolves Hub-ONNX refs before fast-path ONNX/HF routing in from_pretrained. |
| src/winml/modelkit/loader/onnx_hub.py | New implementation for Hub-hosted ONNX ref resolution via hf_hub_download. |
| src/winml/modelkit/loader/init.py | Re-exports Hub-ONNX helper APIs. |
| src/winml/modelkit/inference/engine.py | Normalizes Hub-ONNX refs to local paths in load and load_schema_only. |
| src/winml/modelkit/data/hub_models.json | Adds SAM3 encoder/decoder catalog entries using Hub-ONNX refs. |
| src/winml/modelkit/compiler/utils.py | Adds QOperator + union quantization op-type constants. |
| src/winml/modelkit/compiler/init.py | Exposes new quantization op-type constants from compiler package. |
| src/winml/modelkit/commands/perf.py | Resolves Hub-ONNX refs prior to ONNX path checks. |
| src/winml/modelkit/commands/inspect.py | Treats Hub-ONNX refs as ONNX inputs for consistent “not supported” messaging. |
| src/winml/modelkit/commands/eval.py | Resolves Hub-ONNX refs before validating ONNX file existence. |
| src/winml/modelkit/commands/config.py | Resolves Hub-ONNX refs before dispatching config generation path. |
| src/winml/modelkit/commands/build.py | Resolves Hub-ONNX refs and adds skip-optimize plumbing for ONNX pipeline. |
| src/winml/modelkit/build/onnx.py | Ensures pre-quantized models skip ORT optimize and don’t crash on QOperator ops. |
| src/winml/modelkit/build/hf.py | Same as above for HF-exported ONNX artifacts that are already quantized. |
| src/winml/modelkit/build/common.py | Adds skip_optimize to the shared optimize/analyze loop. |
| README.md | Documents Hub-hosted ONNX input form and supported commands. |
Comments suppressed due to low confidence (1)
src/winml/modelkit/commands/build.py:1162
--no-optimizesetsextra_kwargs["skip_optimize"], but_build_onnx_pipeline()never reads it. As a result, the CLI flag has no effect unlessis_quantized_onnx()happens to detect the model as pre-quantized. Consider consumingextra_kwargs.pop("skip_optimize", False)and combining it withis_pre_quantized(and settingmax_iters=0when skipping) so users can force skipping the optimize/autoconf stage for problematic ONNX inputs.
from ..onnx import copy_onnx_model, is_quantized_onnx
max_iters: int = extra_kwargs.pop("hack_max_optim_iterations", 3)
# ── Validate + setup ─────────────────────────────────────────
if not onnx_path.exists():
raise FileNotFoundError(f"ONNX file not found: {onnx_path}")
try:
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # ``ConvInteger``. Skip the optimize pass and the autoconf re-optim | ||
| # loop; analyze still runs lint-only. |
| logger.info( | ||
| "Pre-quantized model detected (QDQ nodes present). " | ||
| "Pre-quantized model detected (QDQ or QOperator nodes present). " | ||
| "Skipping optimize + quantize, running analyze-only." | ||
| ) | ||
| stages_skipped.append("optimize") | ||
| # Optimize+analyze only, no autoconf re-optimization | ||
| # Analyze-only: skip ORT-based graph optimization (no kernel for | ||
| # QOperator ops like ConvInteger on the host EP), no autoconf loop. | ||
| current_path, _, analyze_iters, analyze_unsupported, analyze_details = ( |
| if is_pre_quantized: | ||
| logger.info( | ||
| "Pre-quantized model detected (QDQ nodes present). " | ||
| "Pre-quantized model detected (QDQ or QOperator nodes present). " | ||
| "Skipping optimize + quantize, running analyze-only." | ||
| ) | ||
| stages_skipped.append("optimize") | ||
| # Optimize+analyze only, no autoconf re-optimization | ||
| # Analyze-only: skip ORT-based graph optimization (no kernel for | ||
| # QOperator ops like ConvInteger on the host EP), no autoconf loop. | ||
| current_path, _, analyze_iterations, analyze_unsupported_nodes, analyze_details = ( |
| ep: Target execution provider for the analyzer. | ||
| device: Target device for the analyzer. | ||
| max_optim_iterations: Maximum autoconf re-optimization rounds. | ||
| 0 means optimize+analyze only (no autoconf re-optimization). | ||
| skip_optimize: When True, skip the initial ``optimize_onnx`` call and | ||
| just copy the input model to ``optimized_path``. Used for | ||
| pre-quantized models (QDQ or QOperator format) where ORT-based |
DingmaomaoBJTU
left a comment
There was a problem hiding this comment.
(supersedes the earlier review — consolidated with design feedback)
Code-level issues
See inline comments below.
One additional nit not inlined (existing code, not touched by this PR): _run_quantize_stage at line 952 of commands/build.py still says "(QDQ nodes already present)". Since this PR broadens detection to cover QOperator, the message should read "(QDQ or QOperator nodes already present)" for consistency with the log messages updated in build/onnx.py and build/hf.py.
Design feedback
Three structural concerns — the bug fixes (QOperator detection, skip_optimize) are solid, but the way the Hub-ONNX input form is introduced creates maintenance surface that should be addressed before or shortly after merge.
1. "Hub-hosted ONNX" is not a distinct input type — it is a download step
org/repo/path/file.onnx -> hf_hub_download() -> /local/cache/file.onnx
After the download, this is just a local .onnx file. The downstream pipeline does not (and should not) care that it once lived on the Hub. Yet the PR threads detection logic (is_hf_onnx_path / maybe_resolve_hf_onnx_path) through 7+ command entry points, giving it the status of a persistent input type. This inflates both the concept count and the code surface.
A cleaner model is: resolve once at a single entry point, return a local path, downstream is unaware.
2. Model input identification needs a single resolver — is_hub_model already exists
hub_utils.py already has is_hub_model() with comprehensive local-path rejection (checks Path.exists(), ./, ../, ~/, Windows drive letters). The new is_hf_onnx_path reimplements only Path.exists(), missing the other cases.
The codebase now has three parallel detection mechanisms with no shared logic:
| Input form | Detector | Local-path rejection |
|---|---|---|
HF model ID (org/model) |
is_hub_model() |
Full (exists, ./, ../, ~/, Win drive) |
| Local ONNX file | scattered path.suffix == ".onnx" checks |
N/A |
Hub-hosted ONNX (org/repo/path.onnx) |
is_hf_onnx_path() (new) |
Partial (only Path.exists()) |
Suggested direction: a single resolve_model_input() that classifies and resolves in one place, reusing is_hub_model's rejection logic. Each command calls it once; adding a fourth input form means changing one function, not 7+.
def resolve_model_input(value: str) -> ModelInput:
# 1. Local-path rejection (reuse is_hub_model logic)
# 2. If org/repo/path.onnx -> download -> return as local_onnx
# 3. If org/model -> return as hf_id
# 4. If local .onnx exists -> return as local_onnx
# No persistent "hub_onnx" type needed3. Pre-quantized ONNX handling should be decided once, not scattered across three pipelines
The detect-and-skip logic lives independently in three places with inconsistent behavior:
| Location | Detects via | Skips optimize | Skips quantize |
|---|---|---|---|
build/onnx.py |
is_quantized_onnx() |
skip_optimize=True |
Explicit |
build/hf.py |
is_quantized_onnx() |
skip_optimize=True |
Explicit |
commands/build.py |
is_quantized_onnx() |
skip_optimize=True |
Relies on user config having quant: null |
Plus _run_quantize_stage has its own is_quantized_onnx() guard (line 951) — a fourth detection point.
The CLI pipeline is the odd one out: if a user provides a config with quant settings for a pre-quantized model, it will attempt to re-quantize. The library functions protect against this; the CLI does not.
Suggested direction: detect once at pipeline entry, stamp the result onto WinMLBuildConfig (e.g. config.skip_optimize = True; config.quant = None), and have all downstream stages read config. This also eliminates redundant is_quantized_onnx calls on the same model.
4. No discovery mechanism for Hub ONNX files
The current UX requires the user to already know the exact file path inside a Hub repo (onnx/vision_encoder_int8.onnx), the right variant (fp32 vs int8, encoder vs decoder), and the task. Compare with the HF model ID flow where architecture, task, and export are all auto-detected.
The only discovery path is the static hub_models.json catalog, which is manually curated. If this is an intentional V1 scope limitation, it should be documented as such. Otherwise, consider accepting a repo ID (onnx-community/sam3-tracker-ONNX) and listing available .onnx files, or reading repo metadata to auto-configure task and roles.
| # is downloaded once and treated as a local .onnx path thereafter. | ||
| from ..loader import maybe_resolve_hf_onnx_path | ||
|
|
||
| model_path = maybe_resolve_hf_onnx_path(str(model_path)) or str(model_path) |
There was a problem hiding this comment.
Dead code: or str(model_path) never triggers. maybe_resolve_hf_onnx_path(str(x)) always returns a non-None string when given a non-None string — it only returns None when the input is None. Since str(model_path) is never None, the or branch is unreachable.
Every other call site in this PR uses the simpler pattern:
model_id = maybe_resolve_hf_onnx_path(model_id)Suggestion: drop the or to match the other sites, or add a comment explaining the defensive intent. Same applies to load_schema_only below (line 383).
| **onnx_kwargs, | ||
| **config.optim, | ||
| ) | ||
| # 1. Optimize (or skip for pre-quantized models) |
There was a problem hiding this comment.
skip_optimize=True + max_optim_iterations > 0 is not guarded. All current callers pair skip_optimize=True with max_optim_iterations=0, so this is not a live bug. However, the function contract allows a caller to pass both skip_optimize=True and max_optim_iterations=3, in which case the autoconf loop would discover flags and call optimize_onnx on a pre-quantized model — the exact crash this fix prevents.
Consider making the invariant self-enforcing:
if skip_optimize:
max_optim_iterations = 0 # re-optimize would crash on pre-quantized models| # Union of all quantization op types (QDQ + QOperator). Use this for | ||
| # "is the model already quantized?" detection regardless of which format | ||
| # the producer used. | ||
| QUANTIZATION_OP_TYPES: frozenset[str] = QDQ_OP_TYPES | QOPERATOR_OP_TYPES |
There was a problem hiding this comment.
Consider adding DynamicQuantizeLinear (and DynamicQuantizeMatMul). DynamicQuantizeLinear is a standard ONNX op (opset 11+) used in dynamic quantization. Models quantized via onnxruntime.quantization with QuantType.QUInt8 in dynamic mode contain this op instead of static QDQ pairs or QOperator fused ops.
Without it, a dynamically-quantized model would not be detected by is_quantized_onnx and would be routed through optimize + quantize. If this is an intentional exclusion (SAM 3 uses static QOperator), a comment noting the limitation would help.
| ``.onnx`` file, and best-effort fetches an optional | ||
| ``<filename>.onnx_data`` sidecar so the ONNX loader can find external | ||
| initializers. | ||
|
|
There was a problem hiding this comment.
Case-sensitive .onnx check vs case-insensitive gate in eval.py. This uses model_id.endswith(".onnx") (case-sensitive), while eval.py:_resolve_model_path gates on Path(value).suffix.lower() == ".onnx" (case-insensitive). A Hub ref ending in .ONNX would pass the eval gate but fail here, producing a confusing "ONNX file not found" error.
Probably fine in practice (Hub repos use lowercase), but model_id.lower().endswith(".onnx") would close the gap.
| # Hub-hosted ONNX (e.g. ``onnx-community/sam3-tracker-ONNX/onnx/...``) | ||
| # is downloaded once and treated as a local .onnx path thereafter. | ||
| from ..loader import is_hf_onnx_path, resolve_hf_onnx_path | ||
|
|
There was a problem hiding this comment.
role=path sub-model entries are not wired to the Hub resolver. The role=path branch above (around line 434) validates Path(path).exists() for each sub-model entry but does not call is_hf_onnx_path / resolve_hf_onnx_path. A Hub ref like image-encoder=onnx-community/sam3-tracker-ONNX/onnx/vision_encoder_int8.onnx would fail with "ONNX file not found".
Not blocking for current SAM 3 usage (single-model), but worth a TODO if multi-role eval on Hub-hosted models is planned.
Questions on validation and evaluationThe bug fixes are well-tested (synthetic ONNX models, mock-based CLI tests, regression assertions) — nice work on the silent-skip audit and the QOperator coverage. A few questions on the SAM 3 model validation side — would be great if you could share any additional results you have. What the PR verifies (solid foundation)
These are all great and necessary. A few additional data points would help reviewers feel confident about the end-to-end quality: Would love to see
Task inconsistency between catalog and testsSmall thing I noticed — the encoder's task differs between
The encoder only produces image embeddings, so SuggestionIf some of the above are blocked (NPU availability, no |
SAM 3 (facebook/sam3) requires transformers>=5, but optimum-onnx pins
transformers<4.58, so the standard HF + Optimum export route SAM 2 uses
is blocked. This change wires SAM 3 in through the existing pre-exported
ONNX (Scenario D) pipeline by recognizing path-style Hub references
('org/repo/path/to/file.onnx') and downloading the file once via
huggingface_hub.
Changes:
- New src/winml/modelkit/loader/onnx_hub.py: is_hf_onnx_path,
resolve_hf_onnx_path. Mirrors the is_xxx/resolve_xxx pair pattern
used by is_compiled_onnx/is_quantized_onnx.
- Wire into wmk config, wmk build, and WinMLAutoModel.from_pretrained
with the same 2-line 'if is_hf_onnx_path(x): x = str(resolve_hf_onnx_path(x))'
pattern.
- Add 2 sam3_tracker entries to hub_models.json so 'wmk hub --model-type
sam3_tracker' lists them.
- Tests: 12 unit tests for the resolver, 2 CLI plumbing tests, and 3
end-to-end integration tests (slow/network/integration).
The existing build_onnx_model pipeline runs unchanged on the resolved
local path: the int8 ONNX is auto-detected as quantized via
is_quantized_onnx, the quantization stage is skipped, and the artifact
flows through Optimize -> Analyze<->Optimize -> Compile -> Finalize.
consumer. Also fix two latent bugs in the build pipeline that any QOperator-quantized model would have hit. Background ---------- Issue #324 asks for SAM 2-style native HuggingFace export support for ``facebook/sam3`` (Sam3*IOConfig, Sam3ModelPatcher, etc.). That path is blocked by an upstream constraint: ``optimum-onnx`` pins ``transformers<4.58``, but ``facebook/sam3`` requires ``transformers>=5`` (the ``Sam3Model`` class only exists there). Resolving the pin would need either an upstream optimum-onnx PR or vendoring SAM 3 patcher code that bypasses optimum entirely. Instead, this PR introduces a generic "Hub-hosted ONNX file" input form and lets SAM 3 ride on the existing pre-exported-ONNX (Scenario D) pipeline that already worked for any local ``.onnx`` file. The infrastructure is reusable for any future model with similar version constraints (Whisper / Phi / RWKV / etc. all ship pre-exported ONNX repos on the Hub today). What's added ------------ 1. Hub-hosted ONNX URI resolver - ``loader/onnx_hub.py``: ``is_hf_onnx_path()``, ``resolve_hf_onnx_path()``, ``maybe_resolve_hf_onnx_path()`` - Recognizes inputs of the form ``<org>/<repo>/<path/to/file>.onnx``, downloads via ``huggingface_hub.hf_hub_download``, returns the local cache path. Falls through unchanged for HF model IDs / local paths / ``None``. - Best-effort ``.onnx_data`` sidecar fetch for >2 GiB models. ``EntryNotFoundError`` is expected (inlined weights); ``OSError`` surfaces as a WARNING (disk/permission/network problems should not be silently dropped — the model would later fail to load with a confusing error). 2. CLI wiring (every command that accepts a model identifier) - ``wmk config`` / ``wmk build``: resolve at the top of the command - ``wmk inspect``: friendly "ONNX inspection not yet supported" error for Hub-ONNX refs (matches local .onnx UX) - ``wmk run`` / ``wmk serve``: ``InferenceEngine.load()`` and ``load_schema_only()`` resolve before routing - ``wmk perf``: resolve before the ``Path(model_id).suffix == '.onnx' and exists()`` check (otherwise Hub refs are mistaken for missing local files and rejected with FileNotFoundError) - ``wmk eval``: ``_resolve_model_path`` resolves before the local existence check - ``WinMLAutoModel.from_pretrained``: resolves before HF/ONNX dispatch - Stage-tool commands (``analyze``/``optimize``/``quantize``/ ``compile``/``export``) intentionally NOT wired — they take ``click.Path(exists=True)`` and operate on local files only. 3. SAM 3 catalog entries (``data/hub_models.json``) - Two entries for ``onnx-community/sam3-tracker-ONNX``: the vision encoder and the prompt-encoder + mask-decoder. Note: was already present in the base branch — this PR does not modify it. 4. Integration tests (``tests/integration/test_sam3_e2e.py``) - 4 decoder tests + 2 encoder tests, marked ``@slow @network @integration`` - Asserts: Hub URI resolves, quantization detected, build produces ``model.onnx``, autoconf produces an ``optimization_config``, and for the encoder: pre-quantized round-trip preserves the ``ConvInteger`` / ``MatMulInteger`` ops byte-identically. - Skips narrowed to ``HfHubHTTPError`` / ``OSError`` only — real bugs in the build/analyze pipeline will surface as test failures rather than green skips. Bugs fixed (would affect any QOperator-quantized model, not just SAM 3) --------------------------------------------------------------------- A. ``is_quantized_onnx`` only detected QDQ format (``QuantizeLinear`` / ``DequantizeLinear``). The SAM 3 vision encoder uses ``QuantFormat.QOperator`` (no QDQ pairs, just integer ops: ``ConvInteger``, ``MatMulInteger``, ``QLinear*``). Previously misclassified as not quantized → routed through the optimize + quantize stages → tried to re-quantize an already-int8 model. Fix: ``compiler/utils.py`` adds ``QOPERATOR_OP_TYPES`` and ``QUANTIZATION_OP_TYPES = QDQ ∪ QOperator``. ``onnx/detection.py`` uses the union. B. The ``is_pre_quantized`` branches in ``build_onnx_model``, ``build_hf_model``, and the CLI's ``_build_onnx_pipeline`` logged "skipping optimize" but still invoked ``optimize_onnx`` → ``ort_graph`` → loaded the model into an ORT session. For QOperator models on hosts without a CPU ``ConvInteger`` kernel (e.g. ``onnxruntime-windowsml`` 1.23.x), this crashes the build stage with ``NOT_IMPLEMENTED``. Fix: ``build/common.py::run_optimize_analyze_loop`` gains a real ``skip_optimize: bool`` knob that bypasses ``optimize_onnx`` and the autoconf re-optim loop, just copying the input as the "optimized" artifact. All three pre-quantized branches now pass ``skip_optimize=True``. The downstream behavior (skip quantize + skip compile when configured) is unchanged. Verification ------------ - ``onnx.checker.check_model(full_check=True)`` passes on built artifacts - Built decoder produces NUMERICALLY IDENTICAL outputs to input decoder (``max|built - input| = 0.0`` across all 3 outputs) — pre-quantized round-trip is a true pass-through, not just structurally similar - Encoder runtime feasibility on CPU is identical to input encoder (both fail on CPU because of upstream ORT ``ConvInteger`` kernel gap; encoder requires NPU EP — unchanged from input) - Decoder real inference produces sane SAM-shaped outputs: ``iou_scores ∈ [0, 1]``, ``pred_masks`` logits span both signs, ``object_score_logits`` non-degenerate Test count ---------- - 4518 unit tests pass (+12 new regression tests across: ``test_onnx_hub.py``, ``test_detection.py``, ``test_eval.py``, ``test_perf_cli.py``, ``test_engine.py``) - 6 integration tests pass (live HF download, ~30s) - Ruff check + format clean on all 24 changed files Silent-skip audit (per SAM 2 review feedback) --------------------------------------------- Removed ``except Exception: pytest.skip(...)`` patterns from SAM 3 integration tests — they were swallowing real bugs (including the ``ConvInteger`` regression fixed in this PR). All skips now narrowed to ``HfHubHTTPError`` / ``OSError`` (network) or specific runtime exceptions; ``RuntimeError`` from ``build_onnx_model`` and ``analyze_onnx`` now fails loudly. Removed unnecessary ``pytest.importorskip("huggingface_hub")`` (it's a hard transitive dep). Sidecar download ``OSError`` now logs WARNING instead of DEBUG. Known limitations (not addressed in this PR) -------------------------------------------- - SAM 3 encoder requires NPU EP (QNN / OpenVINO / VitisAI) because ``onnxruntime-windowsml`` ships no CPU kernel for ``ConvInteger(10)``. This is true for both the input and built artifact — our build preserves runtime behavior exactly. Decoder uses ``MatMulInteger`` and runs on either CPU or NPU. - Catalog entries for SAM 3 have ``quantization: null`` so ``wmk perf`` falls back to default random-input shapes that violate the SAM 3 decoder's internal reshape constraints. Populating ``quantization.input_tensors`` with proper shape hints (the pattern every other catalog entry follows) is the recommended fix; out of scope for this PR.
df2f38a to
6e47c07
Compare
6e47c07 to
4464ce6
Compare
| # ``normalize_model_arg`` returns ``str | None`` per its signature; | ||
| # the ``or model`` keeps the narrowed ``str`` type for downstream use. | ||
| hf_model: str = cli_utils.normalize_model_arg(model) or model | ||
| model = hf_model |
4464ce6 to
5bf6fb7
Compare
…on evaluator Follow-up to the SAM 3 / Hub-hosted ONNX commits, addressing inline review feedback from @DingmaomaoBJTU, Copilot review, and CodeQL. Refactor: single model-input classifier and resolver ---------------------------------------------------- New module ``winml.modelkit.utils.model_input`` is the single entry point for classifying a ``-m/--model`` value (one of ``hub_onnx``, ``hf_id``, ``local_onnx``, ``build_dir``, ``invalid``) and resolving it (download for ``hub_onnx``, pass-through otherwise). Replaces the previous scattered detection across ``loader/onnx_hub`` (``is_hf_onnx_path``, ``maybe_resolve_hf_onnx_path``), ``utils/cli`` (``is_onnx_file_path``), and ad-hoc ``Path(value).suffix == ".onnx"`` checks in commands. Callers updated: ``commands/build``, ``config``, ``eval``, ``inspect``, ``perf``; ``inference/engine`` (``load`` and ``load_schema_only``); ``models/auto.WinMLAutoModel.from_pretrained``. Dead code removed ----------------- * ``resolve_model_input(...).local_path or str(model_path)`` -- the ``or`` branch was unreachable; ``local_path`` is set for every resolvable input. * Case-sensitive ``.onnx`` check in ``loader/onnx_hub`` (now consistent with case-insensitive ``.lower()`` check in callers). Build pipeline -------------- * ``ensure_pre_quantized_stamped`` is now the single defensive detection point for pre-quantized models in library entry points; the unified CLI path stamps the config up front, library entry points only run detection when needed. * ``run_optimize_analyze_loop`` enforces ``max_optim_iterations = 0`` whenever ``skip_optimize=True``; ORT lacks kernels for ``ConvInteger`` / ``MatMulInteger`` on host EPs, so re-optimize would crash for the same reason the initial optimize is skipped. * Skip-log message updated to ``"QDQ or QOperator nodes present"`` to match ``is_quantized_onnx`` accepting both quantization formats. * ``compiler/utils.QUANTIZATION_OP_TYPES`` extended with ``DynamicQuantizeLinear`` / ``DynamicQuantizeMatMul``; exported via ``__all__`` to satisfy CodeQL unused-global warning. Mask-generation evaluator (response to "would love to see eval") ---------------------------------------------------------------- * ``winml.modelkit.eval.mask_generation_evaluator`` drives encoder + decoder ORT sessions for SAM-family promptable mask generation. Profile-dispatch design supports SAM 3 (1008 input, mean=std=0.5, direct resize) and SAM 2.1 (1024 input, ImageNet mean/std, longest-side-pad). Verified preprocessing is byte-correct against ``onnx-community/sam3-tracker-ONNX/preprocessor_config.json``. * ``winml.modelkit.datasets.mask_generation`` -- ``MaskGenerationDataset`` wraps ``mattmdjaga/human_parsing_dataset`` with binary foreground/background masks. * ``winml.modelkit.eval.metrics.binary_segmentation`` -- mIoU + Dice on binary masks. * Composite SAM 3 entry in ``models_with_acc.json``. Reference scripts ----------------- * ``scripts/sam3_reference_check.py`` -- spot-check against the published reference ``iou_scores`` from the model card. * ``scripts/mask_generation_eval.py`` -- generic harness for any SAM-family ONNX pair with ``--preset`` (sam2 / sam3) and ``--dataset`` (human_parsing / coco). * ``scripts/sam3_smoke_eval.py`` -- back-compat shim that delegates to the generic harness with the SAM 3 preset. Test cleanups (CodeQL + Copilot) -------------------------------- * ``tests/integration/test_sam3_e2e.py`` fixtures: explicit ``raise`` after ``pytest.skip`` to make control flow obvious. * ``tests/unit/onnx/test_detection.py``: consolidate ``import onnx`` / ``from onnx import ...`` into a single import form. * ``tests/unit/core/test_onnx_utils.py``: expected keys updated for new ``input_symbolic_shapes`` field in ``get_io_config`` output. Other ----- * ``.gitignore``: ignore stray ``<UUID>.data`` ORT external-data sidecars at repo root, ``quantized_info.csv`` Quark side-effect file, and ``scripts/_*.py`` / ``scripts/_*.json`` private debug scripts.
5bf6fb7 to
a0e9c7c
Compare
Thanks for the detailed review and for calling this out. This has been addressed in the latest push, but via a small flow refactor rather than just a string edit: _run_quantize_stage no longer carries the old “(QDQ nodes already present)” wording path. Pre-quantized detection is now decided earlier and stamped on config (skip_optimize=True, quant=None), so _run_quantize_stage exits on config.quant is None for both QDQ and QOperator cases. For consistency, the stage-level logs in build/onnx.py and build/hf.py now use “pre-quantized model” wording (covering QDQ + QOperator). If you prefer an explicit QDQ/QOperator mention in the commands/build.py skip path as well, I can add that as a follow-up log line too. |
Thanks for the detailed consolidated feedback. This is very helpful. On the quantize-stage wording nit, I agree on the consistency goal. In the current branch, the old skip-message path you referenced is no longer the active behavior. The quantize stage now exits based on config state (quant is None) for pre-quantized inputs, and the pre-quantized wording is handled in the build pipelines. See src/winml/modelkit/commands/build.py, src/winml/modelkit/build/onnx.py, and src/winml/modelkit/build/hf.py. On design point 1 (Hub-hosted ONNX as download step): I agree with what you suggested. The current implementation is now centered on single-point normalization and resolution, where entry points resolve to local paths and downstream pipeline stages operate on local ONNX paths. See src/winml/modelkit/utils/cli.py and src/winml/modelkit/utils/model_input.py. |
This is addressed in the latest revision by introducing a single resolver path and routing command entry points through it. We now use src/winml/modelkit/utils/model_input.py as the classification and resolution center, and src/winml/modelkit/utils/cli.py normalize_model_arg delegates to resolve_model_input so callers do not have to implement their own Hub-vs-local detection. Also, local-path rejection is no longer limited to Path.exists-style checks; model_input reuses shared local-path logic via _is_local_path from src/winml/modelkit/utils/hub_utils.py, which covers the ./, ../, ~/, and Windows-drive cases you listed. On the hub_onnx type: agreed on the principle that downstream should operate on local ONNX after resolution. In the current shape, hub_onnx is an internal classification state used before resolution; after resolve_model_input, downstream paths consume local_path (for example src/winml/modelkit/inference/engine.py and src/winml/modelkit/models/auto.py). |
This is addressed in the latest revision. We now decide pre-quantized status once and stamp it onto config, then downstream stages read config state instead of re-detecting: A shared stamping helper sets skip_optimize true and clears quant when the input is pre-quantized (or when skip_optimize is forced). |
You are right that the current Hub ONNX UX is still artifact-path driven: users must provide a fully qualified file path, and we do not yet support repo-only input or automatic task and role inference from repo metadata. So the limitation you described is real. What this PR does improve is the failure and guidance path: when a Hub ONNX path is incorrect, we now surface available ONNX files from that repo in the error hint, so users can recover without guessing blindly. That is a usability improvement, but it is not full discovery. Below is the status as of now:
I will track the full discovery work as follow-up so this does not remain an implicit V1 constraint. |
I ran an informal offline mask-generation eval locally and can share a baseline: Model: onnx-community/sam3-tracker-ONNX (int8 encoder and decoder) mIoU: 0.9494 |
Thanks for pointing this. This support has been added in this PR. The eval path now includes a dedicated mask-generation evaluator and binary-segmentation metrics: Task routing for mask-generation is wired in src/winml/modelkit/eval/evaluate.py. |
Yes, I do have preliminary perf numbers, but they were collected via manual ORT benchmarking scripts rather than wmk perf because of the current shape-hints limitation you mentioned. What I measured so far: Scope: SAM 3 encoder ONNX (pixel_values shape [1,3,1008,1008]), not full end-to-end tracker pipeline. |
I validated against the published onnx-community/sam3-tracker-ONNX reference example (same image and prompt) and confirmed the pipeline is producing close outputs, but I did not run a direct facebook/sam3 PyTorch-to-ONNX parity comparison in this change set. For the reference spot-check, the IoU scores were: Reference: [0.931315, 0.037516, 0.512856] |
Thanks for raising these, In the current branch I have addressed the following:
|
This PR’s functional pieces are done and passing (Hub-ONNX resolution, pre-quantized QDQ/QOperator handling, and test coverage), but I can’t say we have stable end-to-end NPU perf yet in this environment. When I reproduced the suggested flow (winml.modelkit sys + winml.modelkit perf -m microsoft/resnet-50 --device npu --monitor --verbose), sys detected AMD NPU hardware, but perf failed because EP setup did not complete (MIGraphXExecutionProvider download/install failed, resulting EPs were CPU + DML only, then ValueError: Device 'npu' requested but no compatible EP is available). I do have script-based measurements from a different path: SAM3 encoder run (VitisAI + CPU fallback) showed min 32172 ms, p50 33406 ms, mean 33621.8 ms, max 35109 ms; CPU baseline for the same encoder was min 21734 ms, p50 23532 ms, mean 25097.0 ms, max 34219 ms; and ResNet50 script run (VitisAI + CPU fallback) averaged 35.87 ms (min 33.08 ms). We have preliminary mixed-provider numbers from custom scripts, but not a clean reproducible winml perf NPU benchmark in this setup; given that, I propose we merge this PR for correctness and move full end-to-end NPU validation/benchmarking to a dedicated follow-up PR. |
DingmaomaoBJTU
left a comment
There was a problem hiding this comment.
PR Review: SAM3 ONNX Hub Support (#324)
This PR is well-structured with a clean ModelInput classifier/resolver design, robust sidecar handling, and excellent error enrichment for missing files. Several issues need attention before merge.
See inline comments for details.
| "model_id": "onnx-community/sam3-tracker-ONNX/onnx/vision_encoder_int8.onnx", | ||
| "task": "image-feature-extraction", | ||
| "model_type": "sam3_tracker", | ||
| "supported_eps": {}, |
There was a problem hiding this comment.
Bug: supported_eps: {} makes these entries invisible to EP-based catalog lookups.
catalog.py line 112 computes {normalize_ep_name(k) for k in (m.get('supported_eps') or {})} — an empty dict yields an empty set, so wmk list --ep cpu (or any EP filter) will never surface these models. They should list at minimum CPU support:
json "supported_eps": {"cpu": ["CPU"]}
All other models in this file have populated supported_eps.
There was a problem hiding this comment.
You're right, empty supported_eps: {} made them invisible to catalog filters. I've corrected both SAM3 entries to separate model_id (org/repo) from onnx_path (file path) and added "supported_eps": {"cpu": ["CPU"]}.
Now wmk list --ep cpu will surface them correctly.
| "size_mb": 157.2 | ||
| }, | ||
| { | ||
| "model_id": "onnx-community/sam3-tracker-ONNX/onnx/vision_encoder_int8.onnx", |
There was a problem hiding this comment.
Warning: model_id uses a full ONNX path rather than an HF repo id.
Every other entry in hub_models.json uses the org/repo form (e.g. microsoft/resnet-50). Setting model_id to onnx-community/sam3-tracker-ONNX/onnx/vision_encoder_int8.onnx deviates from the schema. Any code that passes model_id to �pi.model_info() or displays it as a clickable HF link will break.
Consider using model_id: 'onnx-community/sam3-tracker-ONNX' and adding a new onnx_path field for the sub-file path.
There was a problem hiding this comment.
I've updated both SAM3 entries to use the standard org/repo format and added an onnx_path field for the sub-file. Now model_id works correctly with hf_api.model_info() and HF links, while onnx_path specifies the target file within the repo.
| logger.debug("Could not list files for %s: %s", repo_id, list_err) | ||
| return ( | ||
| f"Could not list available .onnx files in '{repo_id}' " | ||
| f"(see https://huggingface.co/{repo_id}/tree/main)." |
There was a problem hiding this comment.
Warning: hint URL is hardcoded to /tree/main regardless of revision. If the user specified --revision v1.0 and the file is missing, the error directs them to main instead. Fix: branch = revision or 'main'; use f'https://huggingface.co/{repo_id}/tree/{branch}'
There was a problem hiding this comment.
the error messages were hardcoded to /tree/main regardless of the revision parameter. I've updated both error messages in _format_available_onnx_files() to use:
branch = revision or "main"
f"https://huggingface.co/{repo_id}/tree/{branch}"
Now if a user specifies --revision v1.0 and a file is missing, the error correctly directs them to /tree/v1.0.
| "input_points": points, | ||
| "input_labels": labels, | ||
| "input_boxes": box, | ||
| "image_embeddings.0": emb["image_embeddings.0"], |
There was a problem hiding this comment.
Bug: bare dict key access on hardcoded embedding names will raise an opaque KeyError for any encoder whose output names differ. For example a future SAM variant or a re-exported model could use 'image_embedding_0' instead of 'image_embeddings.0'. Add a guard before this block: missing = [k for k in ('image_embeddings.0','image_embeddings.1','image_embeddings.2') if k not in emb]; if missing: raise ValueError(f'Encoder outputs missing {missing}; got {sorted(emb)}. Expected SAM 3/SAM 2.1 ONNX export.')
There was a problem hiding this comment.
Bare dict access without validation would fail silently with opaque KeyError if embedding names differ. I've added a validation guard before accessing the embedding keys:
required_embeds = ("image_embeddings.0", "image_embeddings.1", "image_embeddings.2")
missing = [k for k in required_embeds if k not in emb]
if missing:
raise ValueError(
f"Encoder outputs missing {missing}; got {sorted(emb.keys())}. "
"Expected SAM 3 or SAM 2.1 ONNX export with multi-scale embeddings."
)
Now users get a clear error message explaining what embeddings are expected vs. what was received.
| f"Duplicate role {role!r} in -m options.", | ||
| param_hint="-m/--model", | ||
| ) | ||
| path = cli_utils.normalize_model_arg(path) or path |
There was a problem hiding this comment.
Warning: if Hub download fails (network error, auth required, wrong repo), normalize_model_arg() raises FileNotFoundError or RepositoryNotFoundError which propagates as a raw Python traceback. Wrap it for a clean UX: try: path = cli_utils.normalize_model_arg(path) or path; except (FileNotFoundError, OSError) as e: raise click.BadParameter(str(e), param_hint='-m/--model') from e. Same pattern needed in config.py, build.py, and perf.py.
There was a problem hiding this comment.
Wrapped all normalize_model_arg() calls with exception handling to convert Hub download failures to clean CLI messages instead of raw tracebacks. Applied to eval.py, perf.py, and config.py.
| seed=ds.seed, | ||
| ) | ||
|
|
||
| def prepare_pipeline(self) -> Pipeline | None: # type: ignore[override] |
There was a problem hiding this comment.
Info: text prompt_mode='text' is accepted by MaskGenerationDataset (validates text_col is set) and by prepare_data(), so dataset creation succeeds. But _build_decoder_inputs() raises ValueError for text mode (line 537). That exception is swallowed by the per-sample except-Exception in compute() and logged as a WARNING, so the user gets mIoU=0.0 and num_samples=0 with no clear top-level error. Either raise in init when prompt_mode='text': raise ValueError('Text-prompt mode not yet supported...'), or elevate the per-sample ValueError to a top-level failure when ALL samples fail.
There was a problem hiding this comment.
Added early validation in init to reject text-prompt mode upfront with a clear error message instead of silently failing with mIoU=0.0.
Adds SAM 3 support and introduces a generic Hub-hosted ONNX input form (
<org>/<repo>/<path>.onnx) that downloads pre-exported ONNX files from HuggingFace viahuggingface_hub. SAM 3 is the first consumer.The standard SAM 2-style export route is blocked:
optimum-onnxpinstransformers<4.58, butfacebook/sam3requirestransformers>=5. This PR uses the pre-exportedonnx-community/sam3-tracker-ONNXartifacts via the existing Scenario D pipeline.Changes
loader/onnx_hub.py)wmk config,wmk build,wmk inspect,wmk run,wmk serve,wmk perf,wmk evalis_quantized_onnxto detect QOperator format (ConvInteger,MatMulInteger,QLinear*) — was QDQ-onlyis_pre_quantizedbuild branch to truly skip the optimize stage (previously crashed on QOperator models)Tests
Limitations
ConvIntegerkernel inonnxruntime-windowsml); decoder runs on CPU + NPU.