Commit c6e5ca8

Add preliminary gpu_timeline layer
1 parent 4e477fe commit c6e5ca8

36 files changed: +4589 −17 lines changed

layer_example/source/layer_device_functions.hpp

Lines changed: 2 additions & 0 deletions

```diff
@@ -23,6 +23,8 @@
  * ----------------------------------------------------------------------------
  */

+#include <vulkan/vulkan.h>
+
 #include "framework/utils.hpp"

 /* See Vulkan API for documentation. */
```

layer_gpu_timeline/CMakeLists.txt

Lines changed: 3 additions & 1 deletion

```diff
@@ -35,5 +35,7 @@ set(LGL_CONFIG_LOG 1)
 include(../source_common/compiler_helper.cmake)

 # Build steps
-add_subdirectory(source)
+add_subdirectory(../source_common/comms source_common/comms)
 add_subdirectory(../source_common/framework source_common/framework)
+add_subdirectory(../source_common/trackers source_common/trackers)
+add_subdirectory(source)
```

layer_gpu_timeline/README_LAYER.md

Lines changed: 164 additions & 0 deletions

# Layer: GPU Timeline

This layer is used with Arm GPUs for tracking submitted schedulable workloads
and emitting semantic information about them. This data can be combined with
the raw workload execution timing information captured using the Android
Perfetto service, providing developers with a richer debug visualization.

## What devices?

The Arm GPU driver integration with the Perfetto render stages scheduler event
trace is supported at production quality since the r47p0 driver version.
However, associating semantics from this layer relies on a further integration
with debug labels, which requires an r51p0 or later driver version.

## What workloads?

A schedulable workload is the smallest workload that the Arm GPU command stream
scheduler will issue to the GPU hardware work queues. This includes the
following workload types:

* Render passes, split into:
  * Vertex or Binning phase
  * Fragment or Main phase
* Compute dispatches
* Trace rays
* Transfers to a buffer
* Transfers to an image

Most workloads are dispatched using a single API call, and are trivial to
manage in the layer. However, render passes are more complex and need extra
handling. In particular:

* Render passes are issued using multiple API calls.
* Useful render pass properties, such as draw count, are not known until the
  render pass recording has ended.
* Dynamic render passes using `vkCmdBeginRendering()` and `vkCmdEndRendering()`
  can be suspended and resumed across command buffer boundaries. Properties
  such as draw count are not defined by the scope of a single command buffer.

## Tracking workloads

This layer tracks workloads encoded in command buffers, and emits semantic
metadata for each workload via a communications side-channel. A host tool
combines the semantic data stream with the Perfetto data stream, using debug
label tags injected by the layer as a common cross-reference to link across
the streams.

### Workload labelling

Command stream labelling is implemented using `vkCmdDebugMarkerBeginEXT()`
and `vkCmdDebugMarkerEndEXT()`, wrapping one layer-owned `tagID` label around
each semantic workload. This `tagID` can unambiguously refer to this workload
encoding, and metadata that we do not expect to change per submit will be
emitted using the matching `tagID` as the sole identifier.
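
As an illustration, the sketch below wraps a single workload in a `tagID`
label. The `t<ID>` label text, the `nextTagID()` helper, and passing the
extension entry points as parameters are assumptions made for the sketch; the
real layer resolves these through its dispatch chain.

```cpp
// Illustrative sketch only: wrap one recorded workload in a layer-owned
// tagID debug label. The "t<ID>" label format and nextTagID() helper are
// assumptions, not the layer's actual implementation.
#include <atomic>
#include <cstdint>
#include <string>

#include <vulkan/vulkan.h>

static std::atomic<uint64_t> tagCounter { 1 };

static uint64_t nextTagID() { return tagCounter.fetch_add(1); }

uint64_t wrapWorkload(
    VkCommandBuffer commandBuffer,
    PFN_vkCmdDebugMarkerBeginEXT pfnMarkerBegin,
    PFN_vkCmdDebugMarkerEndEXT pfnMarkerEnd)
{
    const uint64_t tagID = nextTagID();
    const std::string label = "t" + std::to_string(tagID);

    VkDebugMarkerMarkerInfoEXT markerInfo {};
    markerInfo.sType = VK_STRUCTURE_TYPE_DEBUG_MARKER_MARKER_INFO_EXT;
    markerInfo.pMarkerName = label.c_str();

    pfnMarkerBegin(commandBuffer, &markerInfo);
    // ... record the wrapped workload here, e.g. vkCmdDispatch() ...
    pfnMarkerEnd(commandBuffer);

    return tagID;
}
```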

_**TODO:** Dynamic `submitID` tracking is not yet implemented._

The `tagID` label is encoded into the recorded command buffer which means, for
reusable command buffers, it is not an unambiguous identifier of a specific
running workload. To allow us to disambiguate specific workload instances, the
layer can optionally add an outer wrapper of `submitID` labels around each
submitted command buffer. This wrapper is only generated if the submit contains
any command buffers that require the generation of a per-submit annex (see the
following section for when this is needed).

The `submitID.tagID` pair of IDs uniquely identifies a specific running
workload, and can be used to attach an instance-specific metadata annex to a
specific submitted workload rather than to the shared recorded command buffer.

### Workload metadata for split render passes

_**TODO:** Split render pass tracking is not yet implemented._

Dynamic render passes can be split across multiple Begin/End pairs, including
being split across command buffer boundaries. If these splits occur within a
single primary command buffer, or its secondaries, they are handled
transparently by the layer, and the render pass appears as a single message as
if no splits occurred. If these splits occur across primary command buffer
boundaries, then some additional work is required.

In our design a `tagID` debug marker is only started when the render pass first
starts (not on resume), and stopped at the end of the render pass (not on
suspend). The same `tagID` is used to refer to all parts of the render pass,
no matter how many times it was suspended and resumed.

If a render pass splits across command buffers, we cannot precompute metrics
based on `tagID` alone, even if the command buffers are one-time use. This is
because we do not know what combination of submitted command buffers will be
used, and so we cannot know what the render pass contains until submit time.
Split render passes will emit a `submitID.tagID` metadata annex containing
the parameters that can only be known at submit time.

### Workload metadata for compute dispatches

_**TODO:** Compute workgroup parsing from the SPIR-V is not yet implemented._

Compute workload dispatch is simple to track, but one of the metadata items we
want to export is the total size of the work space (`work_group_count *
work_group_size`).

The work group count is defined by the API call, but may be an indirect
parameter (see indirect tracking below).

The work group size is defined by the program pipeline, and is defined in the
SPIR-V via a literal or a build-time specialization constant. To support this
use case we will need to parse the SPIR-V when the pipeline is built, if
SPIR-V is available.
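
As a sketch of what that parsing could involve, the following scans a SPIR-V
module for a literal `OpExecutionMode ... LocalSize` declaration. It
deliberately ignores specialization constants and `OpExecutionModeId`, which a
complete implementation would also have to handle.

```cpp
// Illustrative sketch only: find a literal workgroup size in a SPIR-V module.
// Specialization constants and OpExecutionModeId are deliberately ignored.
#include <array>
#include <cstdint>
#include <optional>
#include <vector>

std::optional<std::array<uint32_t, 3>> findLocalSize(const std::vector<uint32_t>& spirv)
{
    constexpr uint32_t opExecutionMode = 16;  // SPIR-V opcode
    constexpr uint32_t modeLocalSize = 17;    // LocalSize execution mode

    // The module header is 5 words; instructions follow it.
    for (size_t i = 5; i < spirv.size();)
    {
        const uint32_t opcode = spirv[i] & 0xFFFFu;
        const uint32_t wordCount = spirv[i] >> 16;
        if ((wordCount == 0) || (i + wordCount > spirv.size()))
        {
            break;  // Malformed module
        }

        // OpExecutionMode <entry point> LocalSize <x> <y> <z>
        if ((opcode == opExecutionMode) && (wordCount == 6) && (spirv[i + 2] == modeLocalSize))
        {
            return std::array<uint32_t, 3> { spirv[i + 3], spirv[i + 4], spirv[i + 5] };
        }

        i += wordCount;
    }

    return std::nullopt;
}
```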

### Workload metadata for indirect calls

_**TODO:** Indirect parameter tracking is not yet implemented._

One of the valuable pieces of metadata that we want to present is the size of
each workload. For render passes this is captured at API call time, but for
other workloads the size can be an indirect parameter that is not known when
the triggering API call is made.

To capture indirect parameters we insert a transfer that copies the indirect
parameters into a layer-owned buffer. To ensure exclusive use of the buffer and
avoid data corruption, each buffer region used is unique to a specific `tagID`.
Attempting to submit the same command buffer multiple times will result in
the workload being serialized to avoid racy access to the buffer. Once the
buffer has been retrieved by the layer, a metadata annex containing the
indirect parameters will be emitted using the `submitID.tagID` pair. This may
be some time later than the original submit.
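
A minimal sketch of that capture transfer follows, sized here for a
`VkDrawIndirectCommand`. The buffer handles, offsets, and direct API call are
illustrative assumptions; the layer would record through its dispatch table
and size the copy per command type.

```cpp
// Illustrative sketch only: stage the application's indirect parameters into
// the layer-owned buffer region reserved for this tagID.
#include <vulkan/vulkan.h>

void captureIndirectParameters(
    VkCommandBuffer commandBuffer,
    VkBuffer userIndirectBuffer,
    VkDeviceSize userOffset,
    VkBuffer layerStagingBuffer,
    VkDeviceSize regionOffsetForTag)  // Unique per tagID to avoid racy reuse
{
    VkBufferCopy region {};
    region.srcOffset = userOffset;
    region.dstOffset = regionOffsetForTag;
    region.size = sizeof(VkDrawIndirectCommand);  // Size per command type

    vkCmdCopyBuffer(commandBuffer, userIndirectBuffer, layerStagingBuffer, 1, &region);
}
```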

### Workload metadata for user-defined labels

The workload metadata captures user-defined labels that the application
provides using `vkCmdDebugMarkerBeginEXT()` and `vkCmdDebugMarkerEndEXT()`.
These are a stack-based debug mechanism where `Begin` pushes a new entry on to
the stack, and `End` pops the most recent level off the stack.

Workloads are labelled with the stack values that existed when the workload
was started. For render passes this is the value on the stack when, e.g.,
`vkCmdBeginRenderPass()` was called. We do not capture any labels that exist
inside the render pass.

The debug label stack belongs to the queue, not to the command buffer, so the
value of the label stack is not known until submit time. The debug information
for a specific `submitID.tagID` pair is therefore provided as an annex at
submit time once the stack can be resolved.

## Message protocol

For each workload in a command buffer, or part-workload in the case of a
suspended render pass, we record a JSON metadata blob containing the payload
we want to send.

The low-level protocol message contains:

* Message type: `uint8_t`
* Sequence ID: `uint64_t` (optional, implied by message type)
* Tag ID: `uint64_t`
* JSON length: `uint32_t`
* JSON payload: `uint8_t[]`

Each workload will read whatever properties it can from the `tagID` metadata
and will then merge in all fields from any subsequent `sequenceID.tagID`
metadata that matches.
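
As an illustration, a minimal encoder for this framing might look like the
sketch below. The function name and the use of host byte order are
assumptions; the actual wire format is defined by `timeline_comms.cpp`.

```cpp
// Illustrative sketch only: frame one protocol message. Host byte order is
// assumed here; the real encoder defines the actual wire format.
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

std::vector<uint8_t> encodeMessage(
    uint8_t messageType,
    std::optional<uint64_t> sequenceID,  // Present only when the type implies it
    uint64_t tagID,
    const std::string& jsonPayload)
{
    std::vector<uint8_t> out;

    auto append = [&out](const void* data, size_t size) {
        const auto* bytes = static_cast<const uint8_t*>(data);
        out.insert(out.end(), bytes, bytes + size);
    };

    out.push_back(messageType);

    if (sequenceID)
    {
        const uint64_t sid = *sequenceID;
        append(&sid, sizeof(sid));
    }

    append(&tagID, sizeof(tagID));

    const auto jsonLength = static_cast<uint32_t>(jsonPayload.size());
    append(&jsonLength, sizeof(jsonLength));
    append(jsonPayload.data(), jsonPayload.size());

    return out;
}
```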

- - -

_Copyright © 2024, Arm Limited and contributors._
Lines changed: 155 additions & 0 deletions

# Layer: GPU Timeline - Command Buffer Modelling

One of the main challenges of this layer is modelling behavior in queues
and command buffers that is not known until submit time, and then taking
appropriate actions based on the combination of both the head state of the
queue and the content of the pre-recorded command buffers.

Our design to solve this is a lightweight software command stream which is
recorded when a command buffer is recorded, and then executed when the
command buffer is submitted to the queue. Just like a real hardware command
stream, these commands can update state or trigger some other action we need
performed.

## Layer commands

**MARKER_BEGIN(const std::string\*):**

* Push a new marker on to the queue debug label stack.

**MARKER_END():**

* Pop the latest marker from the queue debug label stack.

**RENDERPASS_BEGIN(const json\*):**

* Set the current workload to a new render pass with the passed metadata.

**RENDERPASS_RESUME(const json\*):**

* Update the current workload, which must be a render pass, with extra
  draw count metadata.

**COMPUTE_DISPATCH_BEGIN(const json\*):**

* Set the current workload to a new compute dispatch with the passed metadata.

**TRACE_RAYS_BEGIN(const json\*):**

* Set the current workload to a new trace rays workload with the passed
  metadata.

**BUFFER_TRANSFER_BEGIN(const json\*):**

* Set the current workload to a new buffer transfer.

**IMAGE_TRANSFER(const json\*):**

* Set the current workload to a new image transfer.

**WORKLOAD_END():**

* Mark the current workload as complete, and emit a built metadata entry for
  it.

## Layer command recording

Command buffer recording effectively builds two separate state structures for
the layer.

The first is a per-workload or per-restart JSON structure that contains the
metadata we need for that workload. For partial workloads - e.g. a dynamic
render pass begin that has been suspended - this metadata will be partial and
rely on later restart metadata to complete it.

The second is the layer "command stream" that contains the bytecode commands
to execute when the command buffer is submitted to the queue. These commands
are very simple, consisting of a list of command+pointer pairs, where the
pointer value may be unused by some commands. Commands are stored in a
`std::vector`, but we reserve enough memory to store 256 commands without
reallocating, which is enough for the majority of command buffers we see in
real applications.
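
As a sketch, the recorded stream could look like the following; the type
names are illustrative, not the layer's actual ones.

```cpp
// Illustrative sketch only: the recorded layer command stream.
#include <cstdint>
#include <vector>

enum class LayerCommand : uint8_t
{
    MARKER_BEGIN,
    MARKER_END,
    RENDERPASS_BEGIN,
    RENDERPASS_RESUME,
    COMPUTE_DISPATCH_BEGIN,
    TRACE_RAYS_BEGIN,
    BUFFER_TRANSFER_BEGIN,
    IMAGE_TRANSFER,
    WORKLOAD_END,
};

// One command+pointer pair; the payload may be unused by some commands.
struct LayerCommandEntry
{
    LayerCommand command;
    const void* payload;  // e.g. a const std::string* or const json*
};

class LayerCommandStream
{
public:
    LayerCommandStream()
    {
        // Avoid reallocation for the vast majority of real command buffers.
        commands.reserve(256);
    }

    void record(LayerCommand command, const void* payload = nullptr)
    {
        commands.push_back({ command, payload });
    }

    const std::vector<LayerCommandEntry>& getCommands() const { return commands; }

private:
    std::vector<LayerCommandEntry> commands;
};
```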

The command stream for a secondary command buffer is inlined into the primary
command buffer during recording.

### Recording sequence

When the application records a new workload:

* A `tagID` is assigned and recorded as a `vkCmdDebugMarkerBeginEXT()` label
  in the Vulkan command stream _before_ the new workload is written to the
  command stream.
* If the workload uses indirect parameters, a transfer job to copy the
  indirect parameters into a layer-owned buffer is emitted _before_ the new
  workload. No additional barrier is needed because application barriers must
  have already ensured that the indirect parameter buffer is valid.
* A proxy workload object is created in the layer storing the assigned
  `tagID` and all settings that are known at command recording time.
* A layer command stream command is recorded into the submit time stream
  indicating `<TYPE>_BEGIN` with a pointer to the proxy workload. Note that
  this JSON may be modified later for some workloads.
* If the workload uses indirect parameters, a layer command stream command is
  recorded into the resolve time stream, which will handle cleanup and
  emitting the `submitID.tagID` annex message for the indirect data.
* If the command buffer is not ONE_TIME_SUBMIT, contains any workload using
  indirect parameters, or contains incomplete render passes, the command
  buffer is marked as needing a `submitID` wrapper.
* The user command is written to the Vulkan command stream.

When the application resumes a render pass workload:

* A `tagID` of zero is assigned, but not emitted to the command stream.
* A layer command stream command is recorded into the submit time stream
  indicating `<TYPE>_RESUME` with a pointer to the proxy workload. Note that
  this JSON may be modified later for some workloads.
* The user command is written to the Vulkan command stream.

When the application ends a workload:

* For render pass workloads, any statistics accumulated since the last begin
  are rolled up into the proxy workload object.
* For render pass workloads, the user command is written to the Vulkan
  command stream.
* The command stream label scope is closed using `vkCmdDebugMarkerEndEXT()`.

## Layer command playback

The persistent state for command playback belongs to the queues the command
buffers are submitted to. The command stream bytecode is run by a bytecode
interpreter associated with the state of the current queue, giving the
interpreter access to the current `submitID` and queue debug label stack.
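
As a sketch, and reusing the illustrative `LayerCommandEntry` type from the
recording sketch above, playback could look like the following; the
queue-state fields are assumptions.

```cpp
// Illustrative sketch only: per-queue playback of the recorded stream,
// using the LayerCommandEntry type from the recording sketch above.
#include <cstdint>
#include <string>
#include <vector>

struct QueueState
{
    uint64_t nextSubmitID { 1 };
    std::vector<const std::string*> debugLabelStack;
};

void playback(QueueState& state, const std::vector<LayerCommandEntry>& commands)
{
    for (const auto& entry : commands)
    {
        switch (entry.command)
        {
        case LayerCommand::MARKER_BEGIN:
            state.debugLabelStack.push_back(static_cast<const std::string*>(entry.payload));
            break;
        case LayerCommand::MARKER_END:
            state.debugLabelStack.pop_back();
            break;
        default:
            // Workload commands resolve metadata against the current queue
            // state and emit protocol messages; elided in this sketch.
            break;
        }
    }
}
```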

### Submitting sequence

For each command buffer in the user submit:

* If the command buffer needs a `submitID` we allocate a unique `submitID` and
  create two new command buffers that wrap the user command buffer with an
  additional debug label stack layer containing the `s<ID>` string. We inject
  a layer command stream async command to handle freeing these command
  buffers.
* The layer processes the submit-time layer commands, executing each command
  to either update the queue state or emit a metadata message.
* If there are any async layer commands, either recorded in the command buffer
  or from the wrapping command buffers, we will need to add an async handler.
  This cannot safely use the user fence or depend on any user object lifetime,
  so we will add a layer-owned timeline semaphore to the submit which we can
  wait on to determine when it is safe to trigger the async work (see the
  sketch below).
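
A minimal sketch of that layer-owned timeline semaphore follows, using core
Vulkan 1.2 timeline semaphores and direct API calls for brevity; a real layer
would route these calls through its dispatch table.

```cpp
// Illustrative sketch only: a layer-owned timeline semaphore used to learn
// when a submit has completed without touching any user-owned objects.
#include <cstdint>

#include <vulkan/vulkan.h>

VkSemaphore createTimelineSemaphore(VkDevice device)
{
    VkSemaphoreTypeCreateInfo typeInfo {};
    typeInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO;
    typeInfo.semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE;
    typeInfo.initialValue = 0;

    VkSemaphoreCreateInfo createInfo {};
    createInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO;
    createInfo.pNext = &typeInfo;

    VkSemaphore semaphore = VK_NULL_HANDLE;
    vkCreateSemaphore(device, &createInfo, nullptr, &semaphore);
    return semaphore;
}

// Async worker: block until the submit that signals `value` has completed,
// after which the deferred layer work can run safely.
void waitForSubmit(VkDevice device, VkSemaphore semaphore, uint64_t value)
{
    VkSemaphoreWaitInfo waitInfo {};
    waitInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO;
    waitInfo.semaphoreCount = 1;
    waitInfo.pSemaphores = &semaphore;
    waitInfo.pValues = &value;

    vkWaitSemaphores(device, &waitInfo, UINT64_MAX);
}
```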

## Future: Async commands

One of our longer-term goals is to be able to capture indirect parameters,
which will be available after-the-fact once the GPU has processed the command
buffer. Once we have the data we can emit an annex message containing
parameters for each indirect `submitID.tagID` pair in the command buffer.

We need to be able to emit the metadata after the commands are complete,
and correctly synchronize use of the indirect capture staging buffer
if command buffers are reissued. My current thinking is that we would
implement this using additional layer commands that are processed on submit,
including support for async commands that run in a separate thread and
wait on the command buffer completion fence before running.

- - -

_Copyright © 2024, Arm Limited and contributors._

layer_gpu_timeline/source/CMakeLists.txt

Lines changed: 12 additions & 1 deletion

```diff
@@ -43,7 +43,16 @@ add_library(
     ${VK_LAYER} SHARED
         device.cpp
         entry.cpp
-        instance.cpp)
+        instance.cpp
+        layer_device_functions_command_buffer.cpp
+        layer_device_functions_command_pool.cpp
+        layer_device_functions_debug.cpp
+        layer_device_functions_dispatch.cpp
+        layer_device_functions_draw_call.cpp
+        layer_device_functions_queue.cpp
+        layer_device_functions_render_pass.cpp
+        layer_device_functions_trace_rays.cpp
+        timeline_comms.cpp)

 target_include_directories(
     ${VK_LAYER} PRIVATE
@@ -59,7 +68,9 @@ lgl_set_build_options(${VK_LAYER})

 target_link_libraries(
     ${VK_LAYER}
+        lib_layer_comms
         lib_layer_framework
+        lib_layer_trackers
         $<$<PLATFORM_ID:Android>:log>)

 if (CMAKE_BUILD_TYPE STREQUAL "Release")
```
