[ET-VK][ez] Use tree reduction in q8ta_linear_gemv shader by pytorchbot · Pull Request #17808 · pytorch/executorch

pytorchbot · 2026-03-03T08:29:02Z

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #17792 by @SS-JIA
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/SS-JIA/453/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/SS-JIA/453/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/SS-JIA/453/orig
Differential Revision: D94949137
@diff-train-skip-merge

Replace the serial O(WGS) reduction loop with a tree reduction pattern (O(log2(WGS))), where WGS is the workgroup size. Previously, only thread 0 summed all 64 partial accumulators sequentially. Now all threads participate in a classic halving reduction, matching the pattern already used in linear_q4gsw_coop.glsl.

Authored by Claude.

Differential Revision: [D94949137](https://our.internmc.facebook.com/intern/diff/D94949137/) ghstack-source-id: 346524552 Pull Request resolved: #17792
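The difference between the serial loop and the halving reduction can be sketched on the CPU as follows. This is an illustrative C++ simulation, not the shader source: `serial_reduce` and `tree_reduce` are hypothetical names, the accumulators are assumed to be 32-bit integers, and the inner loop over `tid` stands in for threads that would run concurrently on the GPU with a workgroup barrier between strides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Old pattern: a single thread walks every partial sum, O(WGS) additions.
inline std::int32_t serial_reduce(const std::vector<std::int32_t>& partial) {
  std::int32_t sum = 0;
  for (std::int32_t v : partial) {
    sum += v;
  }
  return sum;
}

// New pattern: halving reduction over a copy of the partial sums.
// Optionally reports how many strides (parallel steps) were needed.
inline std::int32_t tree_reduce(
    std::vector<std::int32_t> partial,
    int* steps = nullptr) {
  const std::size_t n = partial.size();
  assert(n > 0 && (n & (n - 1)) == 0); // halving needs a power-of-two size
  int count = 0;
  for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
    // On the GPU, every thread with tid < stride performs this add at the
    // same time; a barrier then separates one stride from the next.
    for (std::size_t tid = 0; tid < stride; ++tid) {
      partial[tid] += partial[tid + stride];
    }
    ++count;
  }
  if (steps != nullptr) {
    *steps = count;
  }
  return partial[0];
}
```

With 64 partial accumulators the loop runs for strides 32, 16, 8, 4, 2, and 1, so six parallel steps replace the 64 sequential additions of the serial version.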

pytorch-bot · 2026-03-03T08:29:06Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17808

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-03-03T08:29:59Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Previously, the im2col + pointwise GEMM path (`q8ta_conv2d_im2col`) only supported non-grouped convolutions (groups=1). This diff extends it to handle grouped convolutions as well, providing significant speedups on Mali GPUs. The key changes are:

**PW GEMM shader (`q8ta_conv2d_pw.glsl`)**: Added `K4_per_group` and `OC4_per_group` as push constants. The shader now computes a group index from the output channel block (`group_idx = oc_block_idx / OC4_per_group`) and offsets the im2col input read by `group_idx * K4_per_group`. For non-grouped cases (groups=1), `group_idx` is always 0, so behavior is unchanged.

**PW node (`Q8taConv2dPW.cpp`)**: `add_q8ta_conv2d_pw_node` now accepts a `groups` parameter (default=1) and computes `K4_per_group` and `OC4_per_group` internally from the input/output tensor dimensions. `K4_per_group` and `OC4_per_group` were previously specialization constants; they are now push constants to avoid shader variant explosion when groups varies.

**Im2col node (`Q8taConv2dIm2Col.cpp`)**: Removed the `groups == 1` assertion from `add_q8ta_im2col_node`. The im2col shader already handles groups correctly (each group's K range is contiguous in the output buffer). The `q8ta_conv2d_im2col` operator now passes the groups value through to the PW node.

**Dispatch heuristic (`Q8taConv2d.cpp`)**: Updated `q8ta_conv2d` with device-aware dispatch. On Mali, im2col is used for all eligible cases (grouped and ungrouped) since it provides 1.2-3.6x speedups. On Adreno, im2col is only used for ungrouped convolutions (groups=1) where in_channels_per_group >= 32 or spatial_out <= 4096, since grouped convolutions show 0.7-0.95x regression with im2col. The heuristic uses `graph.device_is_mali()` to select the path.

**Tests (`test_q8ta_conv2d.cpp`)**: Updated im2col test eligibility from `groups == 1 && channels.in % 4 == 0` to `in_channels_per_group % 4 == 0`, enabling im2col testing for grouped cases. Added SceneX v9 256x256 grouped convolution configs.
Differential Revision: [D94949480](https://our.internmc.facebook.com/intern/diff/D94949480/) ghstack-source-id: 346525921 Pull Request resolved: #17793
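The group indexing described for the PW GEMM shader can be sketched in C++ as below. This is a minimal illustration rather than the shader code: `GroupParams`, `group_index`, and `im2col_k4_offset` are hypothetical names, and the sketch assumes `K4_per_group` and `OC4_per_group` count blocks of four channels per group, which the `4` suffix suggests but the description does not spell out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the two push constants named in the description.
struct GroupParams {
  std::int32_t K4_per_group;  // im2col K blocks that belong to one group
  std::int32_t OC4_per_group; // output-channel blocks that belong to one group
};

// Group that a given output-channel block belongs to.
inline std::int32_t group_index(
    std::int32_t oc_block_idx,
    const GroupParams& p) {
  assert(p.OC4_per_group > 0);
  return oc_block_idx / p.OC4_per_group;
}

// Offset, in K blocks, at which this output-channel block starts reading
// the im2col input. Each group's K range is contiguous, so the offset is
// simply the group index times the per-group K extent.
inline std::int32_t im2col_k4_offset(
    std::int32_t oc_block_idx,
    const GroupParams& p) {
  return group_index(oc_block_idx, p) * p.K4_per_group;
}
```

When groups=1, `OC4_per_group` covers every output-channel block, so the integer division always yields 0 and the read offset vanishes, which is why the non-grouped path is unchanged.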

…e-adaptive conv2d

Performance testing of quantized int8 convolutions reveals that different algorithms perform better on different GPU architectures: im2col is faster on Mali while direct convolution is faster on Adreno. The optimal memory layout differs per algorithm (4C for im2col, 4C1W for direct convolution).

This introduces a new "dynamic" memory layout PACKED_INT8_CONV2D that is serialized at export time and resolved to a concrete layout at runtime based on the device's GPU architecture. The resolution logic in ResolveLayouts.cpp mirrors the im2col vs direct convolution decision in Q8taConv2d.cpp.

Differential Revision: [D94949134](https://our.internmc.facebook.com/intern/diff/D94949134/) ghstack-source-id: 346525918 Pull Request resolved: #17794
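The dispatch rule described for `q8ta_conv2d`, and a layout resolution that mirrors it, can be sketched together as below. This is one reading of the prose, not the code in Q8taConv2d.cpp or ResolveLayouts.cpp: every type and function name is hypothetical, only the two GPU families discussed are modeled, im2col eligibility is assumed to be the `in_channels_per_group % 4 == 0` condition from the updated tests, and `spatial_out` is assumed to mean output height times output width.

```cpp
#include <cassert>
#include <cstdint>

enum class GpuKind { Mali, Adreno };            // only the families discussed
enum class ConvAlgo { Im2col, Direct };
enum class Int8Layout { Packed4C, Packed4C1W }; // hypothetical names for 4C / 4C1W

struct ConvShape {
  std::int64_t groups;
  std::int64_t in_channels_per_group;
  std::int64_t spatial_out; // assumed: output height * output width
};

// Assumed eligibility, taken from the updated test condition.
inline bool im2col_eligible(const ConvShape& s) {
  return s.in_channels_per_group % 4 == 0;
}

inline ConvAlgo pick_algo(GpuKind gpu, const ConvShape& s) {
  assert(s.groups > 0);
  if (!im2col_eligible(s)) {
    return ConvAlgo::Direct;
  }
  if (gpu == GpuKind::Mali) {
    return ConvAlgo::Im2col; // grouped and ungrouped alike
  }
  // Adreno: ungrouped only, and only for the shapes the description lists.
  const bool use_im2col = s.groups == 1 &&
      (s.in_channels_per_group >= 32 || s.spatial_out <= 4096);
  return use_im2col ? ConvAlgo::Im2col : ConvAlgo::Direct;
}

// Resolve the dynamic layout to the one preferred by the chosen algorithm.
inline Int8Layout resolve_layout(GpuKind gpu, const ConvShape& s) {
  return pick_algo(gpu, s) == ConvAlgo::Im2col ? Int8Layout::Packed4C
                                               : Int8Layout::Packed4C1W;
}
```

Keeping the layout decision as a thin wrapper over the algorithm decision is one way to stop the two from drifting apart; whether the actual sources share code this way is not stated in the description.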

Add the ability to override the Vulkan device name at runtime so that device-adaptive code paths (e.g. memory layout selection) can be tested on hardware that doesn't match the overridden device type.

PhysicalDevice::override_device_name() and Adapter::override_device_name() are added behind VULKAN_DEBUG. The device type detection logic is refactored into a reusable determine_device_type() helper to avoid duplication between the constructor and the override function.

All test binaries in fb/test/models/ (classification, greenscreen, scenex, skin_seg) now accept --gpu_name to invoke the override before loading the model. The Skycastle CI workflows are updated to re-run classification and greenscreen tests with --gpu_name Mali-G715 in addition to the default run.

Differential Revision: [D94949136](https://our.internmc.facebook.com/intern/diff/D94949136/) ghstack-source-id: 346525920 Pull Request resolved: #17795
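The refactoring described here, with one helper shared by the constructor and the override, can be sketched as follows. This is a simplified illustration and not the backend's code: `FakePhysicalDevice` and the reduced `DeviceType` enum are hypothetical, the case-insensitive substring match is an assumption about how a name maps to a device type, and the override is unconditional here whereas the description places it behind VULKAN_DEBUG.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <utility>

enum class DeviceType { Unknown, Adreno, Mali }; // reduced, hypothetical set

// Shared helper: classify a device from its name.
// The substring rule is an assumption made for illustration.
inline DeviceType determine_device_type(std::string name) {
  std::transform(name.begin(), name.end(), name.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  if (name.find("adreno") != std::string::npos) {
    return DeviceType::Adreno;
  }
  if (name.find("mali") != std::string::npos) {
    return DeviceType::Mali;
  }
  return DeviceType::Unknown;
}

// Hypothetical stand-in for the physical-device record.
struct FakePhysicalDevice {
  std::string device_name;
  DeviceType device_type;

  explicit FakePhysicalDevice(std::string name)
      : device_name(std::move(name)),
        device_type(determine_device_type(device_name)) {}

  // Replaces the reported name and re-derives the type with the same helper
  // the constructor uses, so the two code paths cannot disagree.
  void override_device_name(const std::string& new_name) {
    assert(!new_name.empty());
    device_name = new_name;
    device_type = determine_device_type(new_name);
  }
};
```

A test harness in this style would call `override_device_name("Mali-G715")` before loading the model, so that later device-dependent decisions see a Mali device even on other hardware.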

pytorchbot requested a review from SS-JIA as a code owner March 3, 2026 08:29

meta-cla bot added the CLA Signed label Mar 3, 2026

ssjia added 3 commits March 3, 2026 09:58

SS-JIA approved these changes Mar 3, 2026

SS-JIA merged commit 1a75394 into main Mar 3, 2026
176 checks passed

SS-JIA deleted the gh/SS-JIA/453/orig branch March 3, 2026 15:04

SS-JIA merged 4 commits into main from gh/SS-JIA/453/orig
