Some performance bottlenecks for UAV loads

Thanks for the useful tool.
I added support for UAV loads in a fork: https://github.com/ash3D/perftest/tree/UAV_load (branch UAV_load). The results turned out to be somewhat slower than SRV on NVIDIA Kepler (GeForce GTX 760M). Previously, in a similar benchmark, I had obtained higher performance with UAV than with SRV under certain conditions. So I started to experiment with the shaders and eventually achieved about a 2X speedup. The things I tried:

- Loop unrolling.
This improved 1d and 2d raw buffer loads but significantly worsened 3d and 4d loads on Kepler. Unrolling typed UAV buffer and texture loads caused crashes during benchmark execution on Kepler (Intel worked fine).
I also tried partial loop unrolling. This eliminated the big slowdown for 3d/4d loads on Kepler, but in general partial unrolling performed close to the original loop (without unrolling) and was often slower. Different unroll factors worked best for different conditions (load width, access pattern) in a somewhat unpredictable way.

- Loop iteration count reduction.
The big 3d/4d load slowdown on Kepler with the unrolled loop suggested reducing the iteration count. Unexpectedly, this also improved 1d/2d performance (a 2X reduction led to >2X performance). Even more unexpectedly, it improved the performance of the original loop without unrolling.
This behavior was observed on Kepler only. I briefly tested Intel and Fermi and found mostly linear performance scaling there.

- Remove read start address and address mask.
Reading the start address from the cbuffer (used for the unaligned tests) harmed UAV performance even when the value is 0. The address mask, which is intended to prevent the compiler from merging multiple narrow loads, also affected wide load performance. It seems that NVIDIA GPUs perform wide raw buffer loads sequentially anyway, so the gains from removing the address mask here apparently come from something else. There are other places, though, where the address mask does appear to prevent narrow load merging on Kepler (e.g. scalar 8/16/32-bit texture SRV loads).
Removing the address mask also fixed the big slowdown for 3d/4d loads on Kepler with the unrolled loop.
I also experimented with applying the address mask differently: `&= ~mask` instead of `|= mask`. Unexpectedly, this improved performance. In some specific cases performance oddly became even better than with no mask at all.
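
To make the two mask forms concrete, here is a minimal C++ sketch of the address arithmetic only. The helper names are hypothetical and are not taken from the benchmark's shaders, and the zero mask value is an assumption about what is supplied at runtime. It only shows that the two forms are arithmetically equivalent for a zero mask; it says nothing about why the GPU executes them at different speeds.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; they are not taken from the benchmark.

// Variant 1: OR the mask into the address (the `|= mask` form).
constexpr std::uint32_t applyOrMask(std::uint32_t address, std::uint32_t mask) {
    return address | mask;
}

// Variant 2: clear the mask bits from the address (the `&= ~mask` form).
constexpr std::uint32_t applyAndNotMask(std::uint32_t address, std::uint32_t mask) {
    return address & ~mask;
}

// With a zero mask (assumed runtime value) both variants leave the address unchanged.
static_assert(applyOrMask(0x40u, 0u) == 0x40u, "OR with 0 is a no-op");
static_assert(applyAndNotMask(0x40u, 0u) == 0x40u, "AND with ~0 is a no-op");

// With a nonzero mask the variants differ: OR sets bits, AND-NOT clears them.
static_assert(applyOrMask(0x44u, 0x0Cu) == 0x4Cu, "OR sets the mask bits");
static_assert(applyAndNotMask(0x44u, 0x0Cu) == 0x40u, "AND-NOT clears the mask bits");
```

In a shader the mask would come from a constant buffer, so the compiler cannot fold either form away at compile time; any performance difference between them would then come from the generated code or the hardware, not from the computed addresses.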

The modifications above also affected SRV performance to some extent, but UAV performance was much more sensitive.

The results ultimately came close to the expected theoretical peak rates of the Kepler architecture. NVIDIA GPUs implement SRV loads in the read-only TMU pipeline, so performance differs from CUDA, which uses the read/write LSU pipeline. It also differs significantly from AMD GCN. In GCN, all 4 of the 32-bit fetch units used for bilinear texture sampling can be utilized for buffer accesses (for wide loads/stores or coalesced 1d access). NVIDIA TMU fetch units have been 64-bit since Fermi (it can filter 64-bit RGBA16F textures at full rate), but apparently only 1 of the 4 is used for buffer reads. I have observed similar behavior before with GT200, except that its fetch units are 32-bit.
UAV accesses are served by the LSU pipeline on NVIDIA GPUs. Kepler has a 2:1 LD/ST-to-TMU ratio, but UAVs are cached in L2 only. Initially UAV loads were slower than SRV in the benchmark, but after the shader modifications described above UAV became faster than SRV for invariant loads. The ratio is still not 2X but close to it. Linear and random UAV load performance varied over a wide range (probably due to the increased L2 access rate) and can be much faster or slower than SRV depending on the case. SRV performance is very stable (it is the same for invariant/linear/random reads).
I also briefly tested NVIDIA Fermi2 (GeForce GTX 460). Fermi has an L1 cache for the LSU pipeline (combined with shared memory), so UAV performance turned out to be better. Invariant UAV reads are 2X faster than SRV ones. Linear and random UAV read performance is still not as stable as SRV but is much better than on Kepler. Fermi also does not suffer from the big performance drop for 3d/4d UAV raw loads with a long unrolled loop.
