-
Notifications
You must be signed in to change notification settings - Fork 256
[CK_TILE] Generate random tensor values with multiple threads #3324
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces multi-threaded random tensor value generation for the ck_tile library, replacing the previous single-threaded or opt-in multi-threaded approach with an always-on deterministic multi-threaded implementation. The changes ensure reproducible results across different thread counts by using a block-based distribution strategy with RNG state management via discard().
Key Changes
- Refactored
FillUniformDistributionto always use multi-threading with deterministic block-based random number generation - Added CPU core management utilities (
get_available_cpu_cores()andcpu_core_guard) for testing different thread configurations - Updated the template parameter to allow type deduction (
T = void)
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
include/ck_tile/host/fill.hpp |
Replaced opt-in multi-threading with always-on deterministic block-based multi-threaded filling; changed template parameter to support type deduction |
include/ck_tile/host/joinable_thread.hpp |
Added get_available_cpu_cores() function and cpu_core_guard class for CPU affinity management in tests |
test/ck_tile/utility/test_fill.cpp |
New comprehensive test suite validating deterministic behavior across different sizes and thread counts |
test/ck_tile/utility/CMakeLists.txt |
Registered the new test executable |
example/ck_tile/18_flatmm/mxgemm/run_mx_flatmm.inc |
Updated to use new template syntax with type deduction |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const int cpu_cores = max(32U, get_available_cpu_cores()); | ||
| for(auto num_threads_diff : {-3, -1}) | ||
| { | ||
| cpu_core_guard cg(max(cpu_cores + num_threads_diff, 1)); |
Copilot
AI
Nov 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test attempts to set CPU cores to max(cpu_cores + num_threads_diff, 1) where num_threads_diff is negative (-3, -1). However, cpu_cores is computed as max(32U, get_available_cpu_cores()), so if the system has fewer than 32 cores (say 16), this could try to set affinity to cores that don't exist (e.g., cores 29, 31 on a 16-core system). The cpu_core_guard should be limited to the actual number of available cores. Consider: cpu_core_guard cg(min(max(cpu_cores + num_threads_diff, 1), get_available_cpu_cores()));
| cpu_core_guard cg(max(cpu_cores + num_threads_diff, 1)); | |
| cpu_core_guard cg(min(max(cpu_cores + num_threads_diff, 1), get_available_cpu_cores())); |
| * @example | ||
| * | ||
| * // Direct usage without creating a separate variable: | ||
| * ck_tile::FillUniformDistribution<ADataType>{-1.f, 1.f}(a_host_tensor); |
Copilot
AI
Nov 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The example in the documentation shows usage with an explicit template parameter FillUniformDistribution<ADataType>, but the new API allows (and the code examples in run_mx_flatmm.inc demonstrate) using FillUniformDistribution<> with template type deduction. The example should be updated to reflect the new usage pattern: ck_tile::FillUniformDistribution<>{-1.f, 1.f}(a_host_tensor);
| * ck_tile::FillUniformDistribution<ADataType>{-1.f, 1.f}(a_host_tensor); | |
| * ck_tile::FillUniformDistribution<>{-1.f, 1.f}(a_host_tensor); |
|
|
||
| #include <gtest/gtest.h> | ||
| #include <vector> | ||
| #include <cstring> |
Copilot
AI
Nov 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing #include <chrono> header. The code uses std::chrono::high_resolution_clock and related types but doesn't include the necessary header. This will cause compilation errors.
| #include <cstring> | |
| #include <cstring> | |
| #include <chrono> |
| std::cout << "Taking " << sec << " sec to fill 1GB of data of type " << typeid(T).name() | ||
| << std::endl; | ||
|
|
||
| const int cpu_cores = max(32U, get_available_cpu_cores()); |
Copilot
AI
Nov 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type mismatch: cpu_cores is declared as int but get_available_cpu_cores() returns unsigned int. This could cause sign-related issues when performing arithmetic operations. Change to const unsigned int cpu_cores = max(32U, get_available_cpu_cores()); or cast appropriately when using in arithmetic.
| const int cpu_cores = max(32U, get_available_cpu_cores()); | |
| const unsigned int cpu_cores = max(32U, get_available_cpu_cores()); |
| sched_getaffinity(0, sizeof(cpu_set_t), &original_cpu_set_); | ||
|
|
||
| // set new cpu set | ||
| cpu_set_t new_cpu_set; | ||
| CPU_ZERO(&new_cpu_set); | ||
| for(unsigned int i = 0; i < num_cores; ++i) | ||
| { | ||
| #pragma clang diagnostic push | ||
| #pragma clang diagnostic ignored "-Wold-style-cast" | ||
| CPU_SET(i, &new_cpu_set); // NOLINT(old-style-cast) | ||
| #pragma clang diagnostic pop | ||
| } | ||
| sched_setaffinity(0, sizeof(cpu_set_t), &new_cpu_set); |
Copilot
AI
Nov 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing error handling for sched_getaffinity and sched_setaffinity system calls. If these calls fail, the cpu affinity won't be properly set or restored, which could lead to unexpected behavior. Consider adding error checks and potentially logging warnings if the operations fail.
| { | ||
| // restore original cpu set | ||
| sched_setaffinity(0, sizeof(cpu_set_t), &original_cpu_set_); | ||
| } |
Copilot
AI
Nov 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cpu_core_guard class is empty on non-Linux platforms. This means creating an instance has no effect on non-Linux systems, but the test code uses it assuming it will work. Consider adding a public constructor and destructor for non-Linux platforms to avoid potential compilation issues and make the class behavior more explicit.
| } | |
| } | |
| #else | |
| public: | |
| cpu_core_guard(unsigned int) {} | |
| ~cpu_core_guard() {} |
| std::minstd_rand gen(seed_.has_value() ? *seed_ : std::random_device{}()); | ||
| std::uniform_real_distribution<float> dis(a_, b_); | ||
|
|
||
| std::vector<joinable_thread> threads(num_thread); |
Copilot
AI
Nov 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The vector is pre-sized with num_thread elements but then reserve() is called for num_thread - 1. The reserve() call after construction is redundant and incorrect - reserve is meant to allocate capacity without changing size. Since the vector already has num_thread elements, this reserve call has no useful effect. Remove this line or change the initialization to std::vector<joinable_thread> threads; threads.reserve(num_thread - 1);
| std::vector<joinable_thread> threads(num_thread); | |
| std::vector<joinable_thread> threads; |
|
|
||
| // max 80 threads; at least 2MB per thread | ||
| const size_t available_cpu_cores = get_available_cpu_cores(); | ||
| const int num_thread = |
Copilot
AI
Nov 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type inconsistency: num_thread is declared as int but is used with size_t values in calculations. The result of min(80UL, available_cpu_cores, integer_divide_ceil(...)) should be size_t. This could lead to type conversion warnings or issues. Change the type to const size_t num_thread = ... for consistency.
| const int num_thread = | |
| const size_t num_thread = |
Proposed changes
ck_tile version of f9bf275 (merged with #2297)
Checklist
Please put an
xinto the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.clang-formaton all changed files