
[ET-VK][qconv] Add dynamic PACKED_INT8_CONV2D memory layout for device-adaptive conv2d#17794

Merged
meta-codesync[bot] merged 1 commit into gh/SS-JIA/455/base from gh/SS-JIA/455/head
Mar 3, 2026
Conversation

@SS-JIA
Contributor

@SS-JIA SS-JIA commented Mar 2, 2026

Stack from ghstack (oldest at bottom):

Performance testing of quantized int8 convolutions reveals that different
algorithms perform better on different GPU architectures: im2col is faster on
Mali while direct convolution is faster on Adreno. The optimal memory layout
differs per algorithm (4C for im2col, 4C1W for direct convolution).

This introduces a new "dynamic" memory layout PACKED_INT8_CONV2D that is
serialized at export time and resolved to a concrete layout at runtime based
on the device's GPU architecture. The resolution logic in ResolveLayouts.cpp
mirrors the im2col vs direct convolution decision in Q8taConv2d.cpp.

Differential Revision: D94949134

@pytorch-bot

pytorch-bot bot commented Mar 2, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17794

Note: Links to docs will display an error until the docs builds have been completed.

❌ 14 New Failures, 4 Unrelated Failures

As of commit 31f1d0b with merge base ae41854:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Mar 2, 2026
@github-actions

github-actions bot commented Mar 2, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e., would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track of your important work and include it in the next release notes.

To add a label, comment with pytorchbot, for example:
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-codesync meta-codesync bot merged commit 968542d into gh/SS-JIA/455/base Mar 3, 2026
186 of 213 checks passed
@meta-codesync meta-codesync bot deleted the gh/SS-JIA/455/head branch March 3, 2026 08:28
@meta-codesync meta-codesync bot temporarily deployed to cherry-pick-bot March 3, 2026 08:28 Inactive
SS-JIA pushed a commit that referenced this pull request Mar 3, 2026
…e-adaptive conv2d

ghstack-source-id: 346525918
Pull Request resolved: #17794


2 participants