[CK_TILE] Generate random tensor values with multiple threads #3324

DDEle · 2025-11-28T02:04:26Z

Proposed changes

ck_tile version of f9bf275 (merged with #2297)

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

I have added tests relevant to the introduced functionality, and the unit tests are passing locally
I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
I have added inline documentation which enables the maintainers with understanding the motivation
I have removed the stale documentation which is no longer relevant after this pull request
(If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
I have run clang-format on all changed files
Any dependent changes have been merged

Copilot

Pull request overview

This PR introduces multi-threaded random tensor value generation for the ck_tile library, replacing the previous single-threaded or opt-in multi-threaded approach with an always-on deterministic multi-threaded implementation. The changes ensure reproducible results across different thread counts by using a block-based distribution strategy with RNG state management via discard().

Key Changes

Refactored FillUniformDistribution to always use multi-threading with deterministic block-based random number generation
Added CPU core management utilities (get_available_cpu_cores() and cpu_core_guard) for testing different thread configurations
Updated the template parameter to allow type deduction (T = void)

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
`include/ck_tile/host/fill.hpp`	Replaced opt-in multi-threading with always-on deterministic block-based multi-threaded filling; changed template parameter to support type deduction
`include/ck_tile/host/joinable_thread.hpp`	Added `get_available_cpu_cores()` function and `cpu_core_guard` class for CPU affinity management in tests
`test/ck_tile/utility/test_fill.cpp`	New comprehensive test suite validating deterministic behavior across different sizes and thread counts
`test/ck_tile/utility/CMakeLists.txt`	Registered the new test executable
`example/ck_tile/18_flatmm/mxgemm/run_mx_flatmm.inc`	Updated to use new template syntax with type deduction

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-11-28T13:45:20Z

test/ck_tile/utility/test_fill.cpp

+    const int cpu_cores = max(32U, get_available_cpu_cores());
+    for(auto num_threads_diff : {-3, -1})
+    {
+        cpu_core_guard cg(max(cpu_cores + num_threads_diff, 1));


The test attempts to set CPU cores to max(cpu_cores + num_threads_diff, 1) where num_threads_diff is negative (-3, -1). However, cpu_cores is computed as max(32U, get_available_cpu_cores()), so if the system has fewer than 32 cores (say 16), this could try to set affinity to cores that don't exist (e.g., cores 29, 31 on a 16-core system). The cpu_core_guard should be limited to the actual number of available cores. Consider: cpu_core_guard cg(min(max(cpu_cores + num_threads_diff, 1), get_available_cpu_cores()));

Suggested change

cpu_core_guard cg(max(cpu_cores + num_threads_diff, 1));

cpu_core_guard cg(min(max(cpu_cores + num_threads_diff, 1), get_available_cpu_cores()));

Copilot · 2025-11-28T13:45:21Z

include/ck_tile/host/fill.hpp

 * @example
 *
 *     // Direct usage without creating a separate variable:
 *     ck_tile::FillUniformDistribution<ADataType>{-1.f, 1.f}(a_host_tensor);


[nitpick] The example in the documentation shows usage with an explicit template parameter FillUniformDistribution<ADataType>, but the new API allows (and the code examples in run_mx_flatmm.inc demonstrate) using FillUniformDistribution<> with template type deduction. The example should be updated to reflect the new usage pattern: ck_tile::FillUniformDistribution<>{-1.f, 1.f}(a_host_tensor);

Suggested change

* ck_tile::FillUniformDistribution<ADataType>{-1.f, 1.f}(a_host_tensor);

* ck_tile::FillUniformDistribution<>{-1.f, 1.f}(a_host_tensor);

Copilot · 2025-11-28T13:45:21Z

test/ck_tile/utility/test_fill.cpp

+
+#include <gtest/gtest.h>
+#include <vector>
+#include <cstring>


Missing #include <chrono> header. The code uses std::chrono::high_resolution_clock and related types but doesn't include the necessary header. This will cause compilation errors.

Suggested change

#include <cstring>

#include <cstring>

#include <chrono>

Copilot · 2025-11-28T13:45:22Z

test/ck_tile/utility/test_fill.cpp

+    std::cout << "Taking " << sec << " sec to fill 1GB of data of type " << typeid(T).name()
+              << std::endl;
+
+    const int cpu_cores = max(32U, get_available_cpu_cores());


Type mismatch: cpu_cores is declared as int but get_available_cpu_cores() returns unsigned int. This could cause sign-related issues when performing arithmetic operations. Change to const unsigned int cpu_cores = max(32U, get_available_cpu_cores()); or cast appropriately when using in arithmetic.

Suggested change

const int cpu_cores = max(32U, get_available_cpu_cores());

const unsigned int cpu_cores = max(32U, get_available_cpu_cores());

Copilot · 2025-11-28T13:45:22Z

include/ck_tile/host/joinable_thread.hpp

+        sched_getaffinity(0, sizeof(cpu_set_t), &original_cpu_set_);
+
+        // set new cpu set
+        cpu_set_t new_cpu_set;
+        CPU_ZERO(&new_cpu_set);
+        for(unsigned int i = 0; i < num_cores; ++i)
+        {
+#pragma clang diagnostic push
+#pragma clang diagnostic ignored "-Wold-style-cast"
+            CPU_SET(i, &new_cpu_set); // NOLINT(old-style-cast)
+#pragma clang diagnostic pop
+        }
+        sched_setaffinity(0, sizeof(cpu_set_t), &new_cpu_set);


Missing error handling for sched_getaffinity and sched_setaffinity system calls. If these calls fail, the cpu affinity won't be properly set or restored, which could lead to unexpected behavior. Consider adding error checks and potentially logging warnings if the operations fail.

Copilot · 2025-11-28T13:45:22Z

include/ck_tile/host/joinable_thread.hpp

+    {
+        // restore original cpu set
+        sched_setaffinity(0, sizeof(cpu_set_t), &original_cpu_set_);
+    }


The cpu_core_guard class is empty on non-Linux platforms. This means creating an instance has no effect on non-Linux systems, but the test code uses it assuming it will work. Consider adding a public constructor and destructor for non-Linux platforms to avoid potential compilation issues and make the class behavior more explicit.

Suggested change

}

}

#else

public:

cpu_core_guard(unsigned int) {}

~cpu_core_guard() {}

Copilot · 2025-11-28T13:45:23Z

include/ck_tile/host/fill.hpp

+        std::minstd_rand gen(seed_.has_value() ? *seed_ : std::random_device{}());
+        std::uniform_real_distribution<float> dis(a_, b_);
+
+        std::vector<joinable_thread> threads(num_thread);


The vector is pre-sized with num_thread elements but then reserve() is called for num_thread - 1. The reserve() call after construction is redundant and incorrect - reserve is meant to allocate capacity without changing size. Since the vector already has num_thread elements, this reserve call has no useful effect. Remove this line or change the initialization to std::vector<joinable_thread> threads; threads.reserve(num_thread - 1);

Suggested change

std::vector<joinable_thread> threads(num_thread);

std::vector<joinable_thread> threads;

Copilot · 2025-11-28T13:45:23Z

include/ck_tile/host/fill.hpp

+
+        // max 80 threads; at least 2MB per thread
+        const size_t available_cpu_cores = get_available_cpu_cores();
+        const int num_thread =


Type inconsistency: num_thread is declared as int but is used with size_t values in calculations. The result of min(80UL, available_cpu_cores, integer_divide_ceil(...)) should be size_t. This could lead to type conversion warnings or issues. Change the type to const size_t num_thread = ... for consistency.

Suggested change

const int num_thread =

const size_t num_thread =

[CK_TILE] Optimize Fill

f672eec

DDEle requested review from ThomasNing, afagaj, andriy-ca, aosewski, asleepzzz, bartekxk, carlushuang, cgmillette, coderfeli, geyyer, illsilin, poyenc, qianfengz, shumway, tenpercent and vidyasagar-amd as code owners November 28, 2025 02:04

DDEle added 2 commits November 28, 2025 10:04

Merge branch 'develop' into fill-opt

16c1931

Merge branch 'develop' into fill-opt

735e1a5

AviralGoelAMD requested a review from Copilot November 28, 2025 13:39

Copilot started reviewing on behalf of AviralGoelAMD November 28, 2025 13:40 View session

Copilot finished reviewing on behalf of AviralGoelAMD November 28, 2025 13:43

Copilot AI reviewed Nov 28, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[CK_TILE] Generate random tensor values with multiple threads #3324

[CK_TILE] Generate random tensor values with multiple threads #3324

DDEle commented Nov 28, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Nov 28, 2025

Uh oh!

Copilot AI Nov 28, 2025

Uh oh!

Copilot AI Nov 28, 2025

Uh oh!

Copilot AI Nov 28, 2025

Uh oh!

Copilot AI Nov 28, 2025

Uh oh!

Copilot AI Nov 28, 2025

Uh oh!

Copilot AI Nov 28, 2025

Uh oh!

Copilot AI Nov 28, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	cpu_core_guard cg(max(cpu_cores + num_threads_diff, 1));
	cpu_core_guard cg(min(max(cpu_cores + num_threads_diff, 1), get_available_cpu_cores()));

	* ck_tile::FillUniformDistribution<ADataType>{-1.f, 1.f}(a_host_tensor);
	* ck_tile::FillUniformDistribution<>{-1.f, 1.f}(a_host_tensor);

	const int cpu_cores = max(32U, get_available_cpu_cores());
	const unsigned int cpu_cores = max(32U, get_available_cpu_cores());

-    }
+    }
+#else
+    public:
+    cpu_core_guard(unsigned int) {}
+    ~cpu_core_guard() {}

	std::vector<joinable_thread> threads(num_thread);
	std::vector<joinable_thread> threads;

[CK_TILE] Generate random tensor values with multiple threads #3324

Are you sure you want to change the base?

[CK_TILE] Generate random tensor values with multiple threads #3324

Conversation

DDEle commented Nov 28, 2025

Proposed changes

Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Key Changes

Reviewed changes

Uh oh!

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants