[ET-VK][qconv] Enable im2col to handle grouped convolution by SS-JIA · Pull Request #17793 · pytorch/executorch

SS-JIA · 2026-03-02T21:03:38Z

Stack from ghstack (oldest at bottom):

Previously, the im2col + pointwise GEMM path (`q8ta_conv2d_im2col`) only
supported non-grouped convolutions (`groups=1`). This diff extends it to handle
grouped convolutions as well, providing 1.2-3.6x speedups on Mali GPUs.

The key changes are:

**PW GEMM shader (`q8ta_conv2d_pw.glsl`)**: Added `K4_per_group` and
`OC4_per_group` as push constants. The shader now computes a group index from
the output channel block (`group_idx = oc_block_idx / OC4_per_group`) and
offsets the im2col input read by `group_idx * K4_per_group`. For non-grouped
cases (`groups=1`), `group_idx` is always 0, so behavior is unchanged.
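The group indexing described above can be mirrored on the CPU. The following is a minimal sketch, not the shader source: the struct `GroupOffset` and the function `group_offset_for_block` are hypothetical names introduced for illustration, and only the two formulas come from the description.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical CPU-side mirror of the shader's group indexing.
struct GroupOffset {
  uint32_t group_idx;  // convolution group that owns this output channel block
  uint32_t k4_start;   // first K4 block of that group's slice of the im2col input
};

inline GroupOffset group_offset_for_block(
    uint32_t oc_block_idx,
    uint32_t OC4_per_group,
    uint32_t K4_per_group) {
  assert(OC4_per_group > 0);
  // group_idx = oc_block_idx / OC4_per_group (integer division)
  const uint32_t group_idx = oc_block_idx / OC4_per_group;
  // The im2col read is offset by group_idx * K4_per_group.
  return {group_idx, group_idx * K4_per_group};
}
```

With `groups=1`, `OC4_per_group` covers every output channel block, so the division always yields 0 and the offset vanishes.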

**PW node (`Q8taConv2dPW.cpp`)**: `add_q8ta_conv2d_pw_node` now accepts a
`groups` parameter (default=1) and computes `K4_per_group` and `OC4_per_group`
internally from the input/output tensor dimensions. `K4_per_group` and
`OC4_per_group` were previously specialization constants; they are now push
constants to avoid shader variant explosion when `groups` varies.
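One plausible way to derive the two per-group values is sketched below. The exact formula in `Q8taConv2dPW.cpp` is not shown in this description, so this is an assumption: it takes the total K length of the im2col output and the total number of output channels, and requires that each group's share is a multiple of 4. The names `PerGroupBlocks` and `per_group_blocks` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical holder for the two values passed as push constants.
struct PerGroupBlocks {
  uint32_t K4_per_group;   // blocks of 4 along K owned by one group
  uint32_t OC4_per_group;  // blocks of 4 output channels owned by one group
};

// K_total:  K length of the im2col output across all groups (assumed input).
// OC_total: number of output channels across all groups (assumed input).
inline PerGroupBlocks per_group_blocks(
    uint32_t K_total,
    uint32_t OC_total,
    uint32_t groups = 1) {
  assert(groups > 0);
  // Assumption: each group's share divides evenly into blocks of 4.
  assert(K_total % (4 * groups) == 0);
  assert(OC_total % (4 * groups) == 0);
  return {K_total / groups / 4, OC_total / groups / 4};
}
```

For example, `per_group_blocks(144, 32, 2)` yields 18 K4 blocks and 4 OC4 blocks per group. Because these are plain runtime values, changing `groups` changes only the data handed to the shader, not the compiled variant.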

**Im2col node (`Q8taConv2dIm2Col.cpp`)**: Removed the `groups == 1` assertion
from `add_q8ta_im2col_node`. The im2col shader already handles groups correctly
(each group's K range is contiguous in the output buffer). The `q8ta_conv2d_im2col`
operator now passes the `groups` value through to the PW node.
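To see why a contiguous per-group K range is sufficient, here is a small CPU reference sketch comparing a direct grouped convolution with im2col followed by a per-group GEMM. It is an illustration under simplifying assumptions (stride 1, no padding, integer data, per-channel rather than 4-wide blocks) and does not reproduce the shader's memory layout; `ConvShape`, `conv_direct`, and `conv_im2col` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Hypothetical shape descriptor: stride 1, no padding, flat CHW buffers.
struct ConvShape {
  int C, H, W;     // input channels, height, width
  int OC, KH, KW;  // output channels, kernel height, kernel width
  int groups;
  int icg() const { return C / groups; }      // input channels per group
  int ocg() const { return OC / groups; }     // output channels per group
  int OH() const { return H - KH + 1; }
  int OW() const { return W - KW + 1; }
  int Kg() const { return icg() * KH * KW; }  // K length of one group
};

// Reference: direct grouped convolution.
inline std::vector<int> conv_direct(const ConvShape& s,
                                    const std::vector<int>& in,
                                    const std::vector<int>& w) {
  std::vector<int> out(s.OC * s.OH() * s.OW(), 0);
  for (int oc = 0; oc < s.OC; ++oc) {
    const int g = oc / s.ocg();
    for (int oy = 0; oy < s.OH(); ++oy)
      for (int ox = 0; ox < s.OW(); ++ox) {
        int acc = 0;
        for (int ic = 0; ic < s.icg(); ++ic)
          for (int ky = 0; ky < s.KH; ++ky)
            for (int kx = 0; kx < s.KW; ++kx) {
              const int c = g * s.icg() + ic;
              acc += in[(c * s.H + oy + ky) * s.W + ox + kx] *
                     w[((oc * s.icg() + ic) * s.KH + ky) * s.KW + kx];
            }
        out[(oc * s.OH() + oy) * s.OW() + ox] = acc;
      }
  }
  return out;
}

// im2col followed by a GEMM that reads only the owning group's K range.
inline std::vector<int> conv_im2col(const ConvShape& s,
                                    const std::vector<int>& in,
                                    const std::vector<int>& w) {
  assert(s.C % s.groups == 0 && s.OC % s.groups == 0);
  const int K = s.groups * s.Kg();  // full row length, all groups
  const int P = s.OH() * s.OW();    // number of output pixels
  std::vector<int> cols(P * K);
  // Row p holds the patch for output pixel p; channel c lands at
  // k = (c*KH + ky)*KW + kx, so group g occupies [g*Kg, (g+1)*Kg).
  for (int oy = 0; oy < s.OH(); ++oy)
    for (int ox = 0; ox < s.OW(); ++ox)
      for (int c = 0; c < s.C; ++c)
        for (int ky = 0; ky < s.KH; ++ky)
          for (int kx = 0; kx < s.KW; ++kx)
            cols[(oy * s.OW() + ox) * K + (c * s.KH + ky) * s.KW + kx] =
                in[(c * s.H + oy + ky) * s.W + ox + kx];

  std::vector<int> out(s.OC * P, 0);
  for (int oc = 0; oc < s.OC; ++oc) {
    const int k_start = (oc / s.ocg()) * s.Kg();  // group offset into the row
    for (int p = 0; p < P; ++p) {
      int acc = 0;
      for (int k = 0; k < s.Kg(); ++k)
        acc += cols[p * K + k_start + k] * w[oc * s.Kg() + k];
      out[oc * P + p] = acc;
    }
  }
  return out;
}
```

Both functions should produce identical outputs for any valid shape, since the only group-specific step in the GEMM is the `k_start` offset.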

**Dispatch heuristic (`Q8taConv2d.cpp`)**: Updated `q8ta_conv2d` with
device-aware dispatch. On Mali, im2col is used for all eligible cases (grouped
and ungrouped) since it provides 1.2-3.6x speedups. On Adreno, im2col is only
used for ungrouped convolutions (`groups=1`) where `in_channels_per_group >= 32` or
`spatial_out <= 4096`, since grouped convolutions show 0.7-0.95x regression with
im2col. The heuristic uses `graph.device_is_mali()` to select the path.
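The decision logic can be sketched as a standalone predicate. This is an approximation under stated assumptions, not the code in `Q8taConv2d.cpp`: `ConvParams` and `should_use_im2col` are hypothetical names, the `eligible` flag stands in for whatever eligibility checks the operator performs, and the non-Mali branch is assumed to apply the Adreno rule because the description mentions only `graph.device_is_mali()`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bundle of the values the heuristic inspects.
struct ConvParams {
  int64_t groups;
  int64_t in_channels_per_group;
  int64_t spatial_out;  // assumed to be the number of output pixels
};

inline bool should_use_im2col(bool device_is_mali,
                              bool eligible,
                              const ConvParams& p) {
  assert(p.groups >= 1);
  if (!eligible) {
    return false;  // im2col cannot be used at all
  }
  if (device_is_mali) {
    return true;  // Mali: all eligible cases, grouped or not
  }
  // Non-Mali (Adreno rule): ungrouped only, and only when the channel count
  // is large enough or the output is small enough.
  return p.groups == 1 &&
         (p.in_channels_per_group >= 32 || p.spatial_out <= 4096);
}
```

For instance, a 4-group convolution with 16 input channels per group would take the im2col path on Mali but not on Adreno under this sketch.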

**Tests (`test_q8ta_conv2d.cpp`)**: Updated im2col test eligibility from
`groups == 1 && channels.in % 4 == 0` to `in_channels_per_group % 4 == 0`,
enabling im2col testing for grouped cases. Added SceneX v9 256x256 grouped
convolution configs.
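The effect of the eligibility change can be illustrated with two small predicates. These are stand-ins written for this note, assuming `in_channels_per_group` equals `channels.in / groups`; the function names are hypothetical and do not appear in the test file.

```cpp
#include <cassert>

// Old rule: only ungrouped convolutions with a multiple-of-4 channel count.
inline bool im2col_testable_old(int groups, int in_channels) {
  return groups == 1 && in_channels % 4 == 0;
}

// New rule: any group count, as long as each group's channel count is a
// multiple of 4.
inline bool im2col_testable_new(int groups, int in_channels) {
  assert(groups > 0 && in_channels % groups == 0);
  const int in_channels_per_group = in_channels / groups;
  return in_channels_per_group % 4 == 0;
}
```

A config with 32 input channels in 4 groups is rejected by the old rule and accepted by the new one, while 24 channels in 4 groups (6 per group) is rejected by both.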

Differential Revision: [D94949480](https://our.internmc.facebook.com/intern/diff/D94949480/)


pytorch-bot · 2026-03-02T21:03:42Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17793

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 12 New Failures, 2 Unrelated Failures

As of commit 0979460 with merge base ae41854:

NEW FAILURES - The following jobs have failed:

pull / test-coreml-bc-macos (macos-m2-stable) / macos-job (gh)
An action could not be found at the URI 'https://api.github.com/repos/actions/checkout/tarball/11bd71901bbe5b1630ceea73d27597364c9af683' (D95C:209FC0:1101B0:147814:69A5FB41)
Test Metal Backend / export-model-metal-artifact (mistralai, Voxtral-Mini-3B-2507, non-quantized) / macos-job (gh)
An action could not be found at the URI 'https://api.github.com/repos/actions/checkout/tarball/11bd71901bbe5b1630ceea73d27597364c9af683' (D951:2DE155:10D239:1449A6:69A5FB3D)
Test Metal Backend / export-model-metal-artifact (mistralai, Voxtral-Mini-3B-2507, quantized-int4-metal) / macos-job (gh)
An action could not be found at the URI 'https://api.github.com/repos/actions/checkout/tarball/11bd71901bbe5b1630ceea73d27597364c9af683' (D4B0:2A601:1221A9:159990:69A5FB36)
Test Metal Backend / export-model-metal-artifact (mistralai, Voxtral-Mini-4B-Realtime-2602, quantized-int4-metal) / macos-job (gh)
An action could not be found at the URI 'https://api.github.com/repos/actions/checkout/tarball/11bd71901bbe5b1630ceea73d27597364c9af683' (D6D5:2CA45F:115A16:14D303:69A5FB39)
Test Metal Backend / export-model-metal-artifact (nvidia, parakeet-tdt, non-quantized) / macos-job (gh)
An action could not be found at the URI 'https://api.github.com/repos/actions/checkout/tarball/11bd71901bbe5b1630ceea73d27597364c9af683' (D6CA:2DE155:10D0B6:1447AE:69A5FB36)
Test Metal Backend / export-model-metal-artifact (nvidia, parakeet-tdt, quantized-int4-metal) / macos-job (gh)
An action could not be found at the URI 'https://api.github.com/repos/actions/checkout/tarball/11bd71901bbe5b1630ceea73d27597364c9af683' (D6E0:13B30E:10FC66:147527:69A5FB3D)
Test Metal Backend / export-model-metal-artifact (openai, whisper-large-v3-turbo, non-quantized) / macos-job (gh)
An action could not be found at the URI 'https://api.github.com/repos/actions/checkout/tarball/11bd71901bbe5b1630ceea73d27597364c9af683' (D4BB:1856C:10A852:141FF0:69A5FB39)
Test Metal Backend / export-model-metal-artifact (openai, whisper-large-v3-turbo, quantized-int4-metal) / macos-job (gh)
An action could not be found at the URI 'https://api.github.com/repos/actions/checkout/tarball/11bd71901bbe5b1630ceea73d27597364c9af683' (D4C7:23578:10D646:145028:69A5FB3D)
Test Metal Backend / export-model-metal-artifact (openai, whisper-small, non-quantized) / macos-job (gh)
An action could not be found at the URI 'https://api.github.com/repos/actions/checkout/tarball/11bd71901bbe5b1630ceea73d27597364c9af683' (EEE1:350925:FA507:131D35:69A5FB3D)
Test Metal Backend / export-model-metal-artifact (openai, whisper-small, quantized-int4-metal) / macos-job (gh)
An action could not be found at the URI 'https://api.github.com/repos/actions/checkout/tarball/11bd71901bbe5b1630ceea73d27597364c9af683' (D947:273D84:100523:137CF5:69A5FB39)
Test Metal Backend / test-executorch-metal-build / macos-job (gh)
An action could not be found at the URI 'https://api.github.com/repos/actions/checkout/tarball/11bd71901bbe5b1630ceea73d27597364c9af683' (EEC9:193E62:FB6CE:13308B:69A5FB36)
Test Metal Backend / test-metal-backend-modules / macos-job (gh)
An action could not be found at the URI 'https://api.github.com/repos/actions/checkout/tarball/11bd71901bbe5b1630ceea73d27597364c9af683' (EED6:23578:10D5A4:144F53:69A5FB39)

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

Test CUDA Windows Export and E2E / test-model-cuda-windows-e2e (mistralai, Voxtral-Mini-3B-2507, non-quantized) / windows-job (gh) (trunk failure)
Process completed with exit code 1.
Test CUDA Windows Export and E2E / test-model-cuda-windows-e2e (mistralai, Voxtral-Mini-3B-2507, quantized-int4-weight-only) / windows-job (gh) (trunk failure)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-03-02T21:04:50Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with `release notes:`. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Differential Revision: [D94949480](https://our.internmc.facebook.com/intern/diff/D94949480/)
ghstack-source-id: 346525921
Pull Request resolved: #17793

meta-cla bot added the CLA Signed label Mar 2, 2026

This was referenced Mar 2, 2026

[ET-VK][ez] Use tree reduction in q8ta_linear_gemv shader #17792

Merged

[ET-VK][qconv] Add dynamic PACKED_INT8_CONV2D memory layout for device-adaptive conv2d #17794

Merged

[ET-VK][testing] Add GPU device name override for on-device model tests #17795

Merged

meta-codesync bot added the fb-exported and meta-exported labels Mar 2, 2026

manuelcandales approved these changes Mar 2, 2026

meta-codesync bot merged commit e754d74 into gh/SS-JIA/454/base Mar 3, 2026
192 of 211 checks passed

meta-codesync bot deleted the gh/SS-JIA/454/head branch March 3, 2026 08:28

meta-codesync bot temporarily deployed to cherry-pick-bot March 3, 2026 08:28 Inactive

pytorchbot mentioned this pull request Mar 3, 2026

[ET-VK][qconv] Enable im2col to handle grouped convolution #17809

Merged
