Commit c6e5ca8

Add preliminary gpu_timeline layer
1 parent 4e477fe commit c6e5ca8

36 files changed: +4589 −17 lines changed

layer_example/source/layer_device_functions.hpp

Lines changed: 2 additions & 0 deletions

```diff
@@ -23,6 +23,8 @@
  * ----------------------------------------------------------------------------
  */

+#include <vulkan/vulkan.h>
+
 #include "framework/utils.hpp"

 /* See Vulkan API for documentation. */
```

layer_gpu_timeline/CMakeLists.txt

Lines changed: 3 additions & 1 deletion

```diff
@@ -35,5 +35,7 @@ set(LGL_CONFIG_LOG 1)
 include(../source_common/compiler_helper.cmake)

 # Build steps
-add_subdirectory(source)
+add_subdirectory(../source_common/comms source_common/comms)
 add_subdirectory(../source_common/framework source_common/framework)
+add_subdirectory(../source_common/trackers source_common/trackers)
+add_subdirectory(source)
```

layer_gpu_timeline/README_LAYER.md

Lines changed: 164 additions & 0 deletions

# Layer: GPU Timeline

This layer is used with Arm GPUs for tracking submitted schedulable workloads
and emitting semantic information about them. This data can be combined with
the raw workload execution timing information captured using the Android
Perfetto service, providing developers with a richer debug visualization.

## What devices?

The Arm GPU driver integration with the Perfetto render stages scheduler event
trace is supported at production quality since the r47p0 driver version.
However, associating semantics from this layer relies on a further integration
with debug labels, which requires an r51p0 or later driver version.

## What workloads?

A schedulable workload is the smallest workload that the Arm GPU command stream
scheduler will issue to the GPU hardware work queues. This includes the
following workload types:

* Render passes, split into:
  * Vertex or Binning phase
  * Fragment or Main phase
* Compute dispatches
* Trace rays
* Transfers to a buffer
* Transfers to an image

Most workloads are dispatched using a single API call, and are trivial to
manage in the layer. However, render passes are more complex and need extra
handling. In particular:

* Render passes are issued using multiple API calls.
* Useful render pass properties, such as draw count, are not known until the
  render pass recording has ended.
* Dynamic render passes using `vkCmdBeginRendering()` and `vkCmdEndRendering()`
  can be suspended and resumed across command buffer boundaries. Properties
  such as draw count are not defined by the scope of a single command buffer.

## Tracking workloads

This layer tracks workloads encoded in command buffers, and emits semantic
metadata for each workload via a communications side-channel. A host tool
combines the semantic data stream with the Perfetto data stream, using debug
label tags injected by the layer as a common cross-reference to link across
the streams.

### Workload labelling

Command stream labelling is implemented using `vkCmdDebugMarkerBeginEXT()`
and `vkCmdDebugMarkerEndEXT()`, wrapping one layer-owned `tagID` label around
each semantic workload. This `tagID` can unambiguously refer to this workload
encoding, and metadata that we do not expect to change per submit will be
emitted using the matching `tagID` as the sole identifier.
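
As an illustration, the sketch below wraps a single workload in a `tagID`
label. The `t<ID>` label text, the `nextTagID()` helper, and passing the
extension entry points as parameters are assumptions made for the sketch; the
real layer resolves these through its dispatch chain.

```cpp
// Illustrative sketch only: wrap one recorded workload in a layer-owned
// tagID debug label. The "t<ID>" label format and nextTagID() helper are
// assumptions, not the layer's actual implementation.
#include <atomic>
#include <cstdint>
#include <string>

#include <vulkan/vulkan.h>

static std::atomic<uint64_t> tagCounter { 1 };

static uint64_t nextTagID() { return tagCounter.fetch_add(1); }

uint64_t wrapWorkload(
    VkCommandBuffer commandBuffer,
    PFN_vkCmdDebugMarkerBeginEXT pfnMarkerBegin,
    PFN_vkCmdDebugMarkerEndEXT pfnMarkerEnd)
{
    const uint64_t tagID = nextTagID();
    const std::string label = "t" + std::to_string(tagID);

    VkDebugMarkerMarkerInfoEXT markerInfo {};
    markerInfo.sType = VK_STRUCTURE_TYPE_DEBUG_MARKER_MARKER_INFO_EXT;
    markerInfo.pMarkerName = label.c_str();

    pfnMarkerBegin(commandBuffer, &markerInfo);
    // ... record the wrapped workload here, e.g. vkCmdDispatch() ...
    pfnMarkerEnd(commandBuffer);

    return tagID;
}
```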

_**TODO:** Dynamic `submitID` tracking is not yet implemented._

The `tagID` label is encoded into the recorded command buffer which means, for
reusable command buffers, it is not an unambiguous identifier of a specific
running workload. To allow us to disambiguate specific workload instances, the
layer can optionally add an outer wrapper of `submitID` labels around each
submitted command buffer. This wrapper is only generated if the submit contains
any command buffers that require the generation of a per-submit annex (see the
following section for when this is needed).

The `submitID.tagID` pair of IDs uniquely identifies a specific running
workload, and can be used to attach an instance-specific metadata annex to a
specific submitted workload rather than to the shared recorded command buffer.

### Workload metadata for split render passes

_**TODO:** Split render pass tracking is not yet implemented._

Dynamic render passes can be split across multiple Begin/End pairs, including
being split across command buffer boundaries. If these splits occur within a
single primary command buffer, or its secondaries, they are handled
transparently by the layer, and the render pass appears as a single message as
if no splits occurred. If these splits occur across primary command buffer
boundaries, then some additional work is required.

In our design a `tagID` debug marker is only started when the render pass first
starts (not on resume), and stopped at the end of the render pass (not on
suspend). The same `tagID` is used to refer to all parts of the render pass,
no matter how many times it was suspended and resumed.

If a render pass splits across command buffers, we cannot precompute metrics
based on `tagID` alone, even if the command buffers are one-time use. This is
because we do not know what combination of submitted command buffers will be
used, and so we cannot know what the render pass contains until submit time.
Split render passes will emit a `submitID.tagID` metadata annex containing
the parameters that can only be known at submit time.

### Workload metadata for compute dispatches

_**TODO:** Compute workgroup parsing from the SPIR-V is not yet implemented._

Compute workload dispatch is simple to track, but one of the metadata items we
want to export is the total size of the work space (`work_group_count *
work_group_size`).

The work group count is defined by the API call, but may be an indirect
parameter (see indirect tracking below).

The work group size is defined by the program pipeline, and is defined in the
SPIR-V via a literal or a build-time specialization constant. To support this
use case we will need to parse the SPIR-V when the pipeline is built, if
SPIR-V is available.
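
As a sketch of what that parsing could involve, the following scans a SPIR-V
module for a literal `OpExecutionMode ... LocalSize` declaration. It
deliberately ignores specialization constants and `OpExecutionModeId`, which a
complete implementation would also have to handle.

```cpp
// Illustrative sketch only: find a literal workgroup size in a SPIR-V module.
// Specialization constants and OpExecutionModeId are deliberately ignored.
#include <array>
#include <cstdint>
#include <optional>
#include <vector>

std::optional<std::array<uint32_t, 3>> findLocalSize(const std::vector<uint32_t>& spirv)
{
    constexpr uint32_t opExecutionMode = 16;  // SPIR-V opcode
    constexpr uint32_t modeLocalSize = 17;    // LocalSize execution mode

    // The module header is 5 words; instructions follow it.
    for (size_t i = 5; i < spirv.size();)
    {
        const uint32_t opcode = spirv[i] & 0xFFFFu;
        const uint32_t wordCount = spirv[i] >> 16;
        if ((wordCount == 0) || (i + wordCount > spirv.size()))
        {
            break;  // Malformed module
        }

        // OpExecutionMode <entry point> LocalSize <x> <y> <z>
        if ((opcode == opExecutionMode) && (wordCount == 6) && (spirv[i + 2] == modeLocalSize))
        {
            return std::array<uint32_t, 3> { spirv[i + 3], spirv[i + 4], spirv[i + 5] };
        }

        i += wordCount;
    }

    return std::nullopt;
}
```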

### Workload metadata for indirect calls

_**TODO:** Indirect parameter tracking is not yet implemented._

One of the valuable pieces of metadata that we want to present is the size of
each workload. For render passes this is captured at API call time, but for
other workloads the size can be an indirect parameter that is not known when
the triggering API call is made.

To capture indirect parameters we insert a transfer that copies the indirect
parameters into a layer-owned buffer. To ensure exclusive use of the buffer and
avoid data corruption, each buffer region used is unique to a specific `tagID`.
Attempting to submit the same command buffer multiple times will result in
the workload being serialized to avoid racy access to the buffer. Once the
buffer has been retrieved by the layer, a metadata annex containing the
indirect parameters will be emitted using the `submitID.tagID` pair. This may
be some time later than the original submit.
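
A minimal sketch of that capture transfer follows, sized here for a
`VkDrawIndirectCommand`. The buffer handles, offsets, and direct API call are
illustrative assumptions; the layer would record through its dispatch table
and size the copy per command type.

```cpp
// Illustrative sketch only: stage the application's indirect parameters into
// the layer-owned buffer region reserved for this tagID.
#include <vulkan/vulkan.h>

void captureIndirectParameters(
    VkCommandBuffer commandBuffer,
    VkBuffer userIndirectBuffer,
    VkDeviceSize userOffset,
    VkBuffer layerStagingBuffer,
    VkDeviceSize regionOffsetForTag)  // Unique per tagID to avoid racy reuse
{
    VkBufferCopy region {};
    region.srcOffset = userOffset;
    region.dstOffset = regionOffsetForTag;
    region.size = sizeof(VkDrawIndirectCommand);  // Size per command type

    vkCmdCopyBuffer(commandBuffer, userIndirectBuffer, layerStagingBuffer, 1, &region);
}
```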

### Workload metadata for user-defined labels

The workload metadata captures user-defined labels that the application
provides using `vkCmdDebugMarkerBeginEXT()` and `vkCmdDebugMarkerEndEXT()`.
These are a stack-based debug mechanism where `Begin` pushes a new entry on to
the stack, and `End` pops the most recent level off the stack.

Workloads are labelled with the stack values that existed when the workload
was started. For render passes this is the value on the stack when, e.g.,
`vkCmdBeginRenderPass()` was called. We do not capture any labels that exist
inside the render pass.

The debug label stack belongs to the queue, not to the command buffer, so the
value of the label stack is not known until submit time. The debug information
for a specific `submitID.tagID` pair is therefore provided as an annex at
submit time once the stack can be resolved.

## Message protocol

For each workload in a command buffer, or part-workload in the case of a
suspended render pass, we record a JSON metadata blob containing the payload
we want to send.

The low-level protocol message contains:

* Message type: `uint8_t`
* Sequence ID: `uint64_t` (optional, implied by message type)
* Tag ID: `uint64_t`
* JSON length: `uint32_t`
* JSON payload: `uint8_t[]`

Each workload will read whatever properties it can from the `tagID` metadata
and will then merge in all fields from any subsequent `sequenceID.tagID`
metadata that matches.
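
As an illustration, a minimal encoder for this framing might look like the
sketch below. The function name and the use of host byte order are
assumptions; the actual wire format is defined by `timeline_comms.cpp`.

```cpp
// Illustrative sketch only: frame one protocol message. Host byte order is
// assumed here; the real encoder defines the actual wire format.
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

std::vector<uint8_t> encodeMessage(
    uint8_t messageType,
    std::optional<uint64_t> sequenceID,  // Present only when the type implies it
    uint64_t tagID,
    const std::string& jsonPayload)
{
    std::vector<uint8_t> out;

    auto append = [&out](const void* data, size_t size) {
        const auto* bytes = static_cast<const uint8_t*>(data);
        out.insert(out.end(), bytes, bytes + size);
    };

    out.push_back(messageType);

    if (sequenceID)
    {
        const uint64_t sid = *sequenceID;
        append(&sid, sizeof(sid));
    }

    append(&tagID, sizeof(tagID));

    const auto jsonLength = static_cast<uint32_t>(jsonPayload.size());
    append(&jsonLength, sizeof(jsonLength));
    append(jsonPayload.data(), jsonPayload.size());

    return out;
}
```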

- - -

_Copyright © 2024, Arm Limited and contributors._
Lines changed: 155 additions & 0 deletions

# Layer: GPU Timeline - Command Buffer Modelling

One of the main challenges of this layer is modelling behavior in queues
and command buffers that is not known until submit time, and then taking
appropriate actions based on the combination of both the head state of the
queue and the content of the pre-recorded command buffers.

Our design to solve this is a lightweight software command stream which is
recorded when a command buffer is recorded, and then executed when the
command buffer is submitted to the queue. Just like a real hardware command
stream, these commands can update state or trigger some other action we need
performed.

## Layer commands

**MARKER_BEGIN(const std::string\*):**

* Push a new marker on to the queue debug label stack.

**MARKER_END():**

* Pop the latest marker from the queue debug label stack.

**RENDERPASS_BEGIN(const json\*):**

* Set the current workload to a new render pass with the passed metadata.

**RENDERPASS_RESUME(const json\*):**

* Update the current workload, which must be a render pass, with extra
  draw count metadata.

**COMPUTE_DISPATCH_BEGIN(const json\*):**

* Set the current workload to a new compute dispatch with the passed metadata.

**TRACE_RAYS_BEGIN(const json\*):**

* Set the current workload to a new trace rays workload with the passed
  metadata.

**BUFFER_TRANSFER_BEGIN(const json\*):**

* Set the current workload to a new buffer transfer.

**IMAGE_TRANSFER(const json\*):**

* Set the current workload to a new image transfer.

**WORKLOAD_END():**

* Mark the current workload as complete, and emit a built metadata entry for
  it.

## Layer command recording

Command buffer recording effectively builds two separate state structures for
the layer.

The first is a per-workload or per-restart JSON structure that contains the
metadata we need for that workload. For partial workloads - e.g. a dynamic
render pass begin that has been suspended - this metadata will be partial and
rely on later restart metadata to complete it.

The second is the layer "command stream" that contains the bytecode commands
to execute when the command buffer is submitted to the queue. These commands
are very simple, consisting of a list of command+pointer pairs, where the
pointer value may be unused by some commands. Commands are stored in a
`std::vector`, but we reserve enough memory to store 256 commands without
reallocating, which is enough for the majority of command buffers we see in
real applications.
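
As a sketch, the recorded stream could look like the following; the type
names are illustrative, not the layer's actual ones.

```cpp
// Illustrative sketch only: the recorded layer command stream.
#include <cstdint>
#include <vector>

enum class LayerCommand : uint8_t
{
    MARKER_BEGIN,
    MARKER_END,
    RENDERPASS_BEGIN,
    RENDERPASS_RESUME,
    COMPUTE_DISPATCH_BEGIN,
    TRACE_RAYS_BEGIN,
    BUFFER_TRANSFER_BEGIN,
    IMAGE_TRANSFER,
    WORKLOAD_END,
};

// One command+pointer pair; the payload may be unused by some commands.
struct LayerCommandEntry
{
    LayerCommand command;
    const void* payload;  // e.g. a const std::string* or const json*
};

class LayerCommandStream
{
public:
    LayerCommandStream()
    {
        // Avoid reallocation for the vast majority of real command buffers.
        commands.reserve(256);
    }

    void record(LayerCommand command, const void* payload = nullptr)
    {
        commands.push_back({ command, payload });
    }

    const std::vector<LayerCommandEntry>& getCommands() const { return commands; }

private:
    std::vector<LayerCommandEntry> commands;
};
```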

The command stream for a secondary command buffer is inlined into the primary
command buffer during recording.

### Recording sequence

When the application records a new workload:

* A `tagID` is assigned and recorded as a `vkCmdDebugMarkerBeginEXT()` label
  in the Vulkan command stream _before_ the new workload is written to the
  command stream.
* If the workload uses indirect parameters, a transfer job to copy the
  indirect parameters into a layer-owned buffer is emitted _before_ the new
  workload. No additional barrier is needed because application barriers must
  have already ensured that the indirect parameter buffer is valid.
* A proxy workload object is created in the layer storing the assigned
  `tagID` and all settings that are known at command recording time.
* A layer command stream command is recorded into the submit time stream
  indicating `<TYPE>_BEGIN` with a pointer to the proxy workload. Note that
  this JSON may be modified later for some workloads.
* If the workload uses indirect parameters, a layer command stream command is
  recorded into the resolve time stream, which will handle cleanup and
  emitting the `submitID.tagID` annex message for the indirect data.
* If the command buffer is not ONE_TIME_SUBMIT, contains any workload using
  indirect parameters, or contains incomplete render passes, the command
  buffer is marked as needing a `submitID` wrapper.
* The user command is written to the Vulkan command stream.

When the application resumes a render pass workload:

* A `tagID` of zero is assigned, but not emitted to the command stream.
* A layer command stream command is recorded into the submit time stream
  indicating `<TYPE>_RESUME` with a pointer to the proxy workload. Note that
  this JSON may be modified later for some workloads.
* The user command is written to the Vulkan command stream.

When the application ends a workload:

* For render pass workloads, any statistics accumulated since the last begin
  are rolled up into the proxy workload object.
* For render pass workloads, the user command is written to the Vulkan
  command stream.
* The command stream label scope is closed using `vkCmdDebugMarkerEndEXT()`.

## Layer command playback

The persistent state for command playback belongs to the queues the command
buffers are submitted to. The command stream bytecode is run by a bytecode
interpreter associated with the state of the current queue, giving the
interpreter access to the current `submitID` and queue debug label stack.
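
As a sketch, and reusing the illustrative `LayerCommandEntry` type from the
recording sketch above, playback could look like the following; the
queue-state fields are assumptions.

```cpp
// Illustrative sketch only: per-queue playback of the recorded stream,
// using the LayerCommandEntry type from the recording sketch above.
#include <cstdint>
#include <string>
#include <vector>

struct QueueState
{
    uint64_t nextSubmitID { 1 };
    std::vector<const std::string*> debugLabelStack;
};

void playback(QueueState& state, const std::vector<LayerCommandEntry>& commands)
{
    for (const auto& entry : commands)
    {
        switch (entry.command)
        {
        case LayerCommand::MARKER_BEGIN:
            state.debugLabelStack.push_back(static_cast<const std::string*>(entry.payload));
            break;
        case LayerCommand::MARKER_END:
            state.debugLabelStack.pop_back();
            break;
        default:
            // Workload commands resolve metadata against the current queue
            // state and emit protocol messages; elided in this sketch.
            break;
        }
    }
}
```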

### Submitting sequence

For each command buffer in the user submit:

* If the command buffer needs a `submitID` we allocate a unique `submitID` and
  create two new command buffers that wrap the user command buffer with an
  additional debug label stack layer containing the `s<ID>` string. We inject
  a layer command stream async command to handle freeing these command
  buffers.
* The layer processes the submit-time layer commands, executing each command
  to either update the queue state or emit a metadata message.
* If there are any async layer commands, either recorded in the command buffer
  or from the wrapping command buffers, we will need to add an async handler.
  This cannot safely use the user fence or depend on any user object lifetime,
  so we will add a layer-owned timeline semaphore to the submit which we can
  wait on to determine when it is safe to trigger the async work (see the
  sketch below).
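
A minimal sketch of that layer-owned timeline semaphore follows, using core
Vulkan 1.2 timeline semaphores and direct API calls for brevity; a real layer
would route these calls through its dispatch table.

```cpp
// Illustrative sketch only: a layer-owned timeline semaphore used to learn
// when a submit has completed without touching any user-owned objects.
#include <cstdint>

#include <vulkan/vulkan.h>

VkSemaphore createTimelineSemaphore(VkDevice device)
{
    VkSemaphoreTypeCreateInfo typeInfo {};
    typeInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO;
    typeInfo.semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE;
    typeInfo.initialValue = 0;

    VkSemaphoreCreateInfo createInfo {};
    createInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO;
    createInfo.pNext = &typeInfo;

    VkSemaphore semaphore = VK_NULL_HANDLE;
    vkCreateSemaphore(device, &createInfo, nullptr, &semaphore);
    return semaphore;
}

// Async worker: block until the submit that signals `value` has completed,
// after which the deferred layer work can run safely.
void waitForSubmit(VkDevice device, VkSemaphore semaphore, uint64_t value)
{
    VkSemaphoreWaitInfo waitInfo {};
    waitInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO;
    waitInfo.semaphoreCount = 1;
    waitInfo.pSemaphores = &semaphore;
    waitInfo.pValues = &value;

    vkWaitSemaphores(device, &waitInfo, UINT64_MAX);
}
```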

## Future: Async commands

One of our longer-term goals is to be able to capture indirect parameters,
which will be available after-the-fact once the GPU has processed the command
buffer. Once we have the data we can emit an annex message containing
parameters for each indirect `submitID.tagID` pair in the command buffer.

We need to be able to emit the metadata after the commands are complete,
and correctly synchronize use of the indirect capture staging buffer
if command buffers are reissued. My current thinking is that we would
implement this using additional layer commands that are processed on submit,
including support for async commands that run in a separate thread and
wait on the command buffer completion fence before running.

- - -

_Copyright © 2024, Arm Limited and contributors._

layer_gpu_timeline/source/CMakeLists.txt

Lines changed: 12 additions & 1 deletion

```diff
@@ -43,7 +43,16 @@ add_library(
     ${VK_LAYER} SHARED
         device.cpp
         entry.cpp
-        instance.cpp)
+        instance.cpp
+        layer_device_functions_command_buffer.cpp
+        layer_device_functions_command_pool.cpp
+        layer_device_functions_debug.cpp
+        layer_device_functions_dispatch.cpp
+        layer_device_functions_draw_call.cpp
+        layer_device_functions_queue.cpp
+        layer_device_functions_render_pass.cpp
+        layer_device_functions_trace_rays.cpp
+        timeline_comms.cpp)

 target_include_directories(
     ${VK_LAYER} PRIVATE
@@ -59,7 +68,9 @@ lgl_set_build_options(${VK_LAYER})

 target_link_libraries(
     ${VK_LAYER}
+        lib_layer_comms
         lib_layer_framework
+        lib_layer_trackers
         $<$<PLATFORM_ID:Android>:log>)

 if (CMAKE_BUILD_TYPE STREQUAL "Release")
```
