feat: optimized QASYMM8_SIGNED->F32 direct convolution path by alvoron · Pull Request #1298 · ARM-software/ComputeLibrary

alvoron · 2026-06-17T16:05:04Z

Motivation

Inference frameworks (e.g., OpenVINO) often run int8-quantized activations with float32 output. Previously this required a multi-step chain: quantized GEMM accumulating into int32, a separate GEMMLowp output-stage operator, and a dequantize/cast step. This adds a single-kernel path that takes QASYMM8_SIGNED input and weights and writes F32 output directly, reducing memory traffic and operator overhead.

Dependency

Requires #1297 to be merged first.
That branch provides the dequant_a_offset/dequant_b_offset fields on AsmGemmInfo, the relaxed type guard in CpuGemmAssemblyDispatch::validate (removing the "Only S32 output" restriction for S8 input), and the create_arm_gemm_dequant changes that pass offsets into DequantizeFloat with the correct combined scale. Without it, this branch does not build on its own.

Technical approach

The existing DequantizeFloat output stage is extended with a_offset and b_offset fields. GemmInterleaved is taught to:

- Pack row sums of A into the A panel (via transforms_quantized with multiplier = 1) when b_offset != 0, for per-row offset correction.
- Compute column sums of B (weights) during set_pretransposed_B_array when a_offset != 0, stored alongside the pretransposed weight buffer.
- Apply all three correction terms in dequantize_block_32<float> at merge time.

K-blocking is conservatively disabled for the DequantizeFloat + MergeStep case (matching the existing Requantize32 policy) since row sums must cover full K.
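For reviewers who want to check the arithmetic, here is a minimal scalar sketch of the scheme described above: row sums of A and column sums of B are taken over the full K, and the offset corrections are applied to the raw int32 accumulator before scaling. It is not the GemmInterleaved code. `DequantParams`, `gemm_s8_f32_corrected` and `gemm_s8_f32_reference` are hypothetical names used only for illustration, the layouts are assumed to be plain row-major, and the quantization convention assumed is real = scale · (q − offset).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical parameter bundle for this sketch (not the library's DequantizeFloat).
struct DequantParams
{
    int32_t a_offset; // zero-point of A (activations)
    int32_t b_offset; // zero-point of B (weights)
    float   scale;    // combined scale, i.e. scale_a * scale_b
};

// A is M x K, B is K x N, both row-major; bias has N entries.
// Computes the raw int8 x int8 accumulator and corrects it with row/column sums.
std::vector<float> gemm_s8_f32_corrected(const std::vector<int8_t> &a, const std::vector<int8_t> &b,
                                         const std::vector<float> &bias,
                                         std::size_t M, std::size_t N, std::size_t K,
                                         const DequantParams &qp)
{
    // Row sums of A and column sums of B, each covering the full K dimension.
    std::vector<int32_t> row_sum(M, 0);
    std::vector<int32_t> col_sum(N, 0);
    for (std::size_t m = 0; m < M; ++m)
        for (std::size_t k = 0; k < K; ++k)
            row_sum[m] += a[m * K + k];
    for (std::size_t k = 0; k < K; ++k)
        for (std::size_t n = 0; n < N; ++n)
            col_sum[n] += b[k * N + n];

    std::vector<float> out(M * N);
    for (std::size_t m = 0; m < M; ++m)
    {
        for (std::size_t n = 0; n < N; ++n)
        {
            int32_t acc = 0; // raw accumulator, offsets not yet applied
            for (std::size_t k = 0; k < K; ++k)
                acc += static_cast<int32_t>(a[m * K + k]) * static_cast<int32_t>(b[k * N + n]);

            // Per-column term, per-row term, and the cross-term.
            const int32_t corrected = acc - qp.a_offset * col_sum[n] - qp.b_offset * row_sum[m]
                                      + qp.a_offset * qp.b_offset * static_cast<int32_t>(K);
            out[m * N + n] = static_cast<float>(corrected) * qp.scale + bias[n];
        }
    }
    return out;
}

// Reference: subtract the offsets from every element before multiplying.
std::vector<float> gemm_s8_f32_reference(const std::vector<int8_t> &a, const std::vector<int8_t> &b,
                                         const std::vector<float> &bias,
                                         std::size_t M, std::size_t N, std::size_t K,
                                         const DequantParams &qp)
{
    std::vector<float> out(M * N);
    for (std::size_t m = 0; m < M; ++m)
    {
        for (std::size_t n = 0; n < N; ++n)
        {
            int32_t acc = 0;
            for (std::size_t k = 0; k < K; ++k)
                acc += (a[m * K + k] - qp.a_offset) * (b[k * N + n] - qp.b_offset);
            out[m * N + n] = static_cast<float>(acc) * qp.scale + bias[n];
        }
    }
    return out;
}
```

Both functions reduce to the same integer before the float conversion, so their outputs should match exactly for inputs that do not overflow int32. If the row sums covered only a slice of K, the per-row term would no longer match the full accumulator, which is the reason given above for disabling K-blocking.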

CpuGemmDirectConv2d detects the QASYMM8_SIGNED→F32 path by type, reads zero-points from QuantizationInfo, and passes them to the dispatch via AsmGemmInfo. CpuConv2d::get_convolution_method automatically routes NHWC QASYMM8_SIGNED→F32 convolutions to GEMM_CONV2D (no explicit flag).

Also fixes a latent bug in dequantize_block_32<float>: the expression val * qp.scale (integer × float, silently truncating for large accumulators) is corrected to static_cast<float>(val) * qp.scale.

Asymmetric correction formula

out[m,n] = (raw_acc[m,n]
− a_offset · Σ_k b[k,n] (per-column, from col sums)
− b_offset · Σ_k a[m,k] (per-row, from row sums)
+ a_offset · b_offset · K (cross-term)
) · (scale_a · scale_b) + bias[n]
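
For reference, the formula can be recovered by expanding the dequantized product, assuming the usual affine convention real = scale · (q − offset). The symbols s_a, s_b, o_a, o_b below are shorthand introduced here for scale_a, scale_b, a_offset and b_offset; the sign convention of the offsets as stored in the library is not checked here.

```latex
\begin{aligned}
\mathrm{out}[m,n]
  &= \sum_{k=1}^{K} s_a\,\bigl(a[m,k] - o_a\bigr)\; s_b\,\bigl(b[k,n] - o_b\bigr) + \mathrm{bias}[n] \\
  &= s_a s_b \Bigl( \underbrace{\sum_{k} a[m,k]\, b[k,n]}_{\mathrm{raw\_acc}[m,n]}
       \;-\; o_a \sum_{k} b[k,n]
       \;-\; o_b \sum_{k} a[m,k]
       \;+\; o_a\, o_b\, K \Bigr) + \mathrm{bias}[n]
\end{aligned}
```

The column-sum term depends only on the weights and the row-sum term only on the activations, while the cross-term is a constant for a given K.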

Signed-off-by: Aleksandr Voron <aleksandr.voron@intel.com>

feat: optimized QASYMM8_SIGNED->F32 direct convolution path

6e974d7

Signed-off-by: Aleksandr Voron <aleksandr.voron@intel.com>

alvoron wants to merge 1 commit into ARM-software:main from alvoron:alvoron_direct_i8_f32_conv