Skip to content

Conversation

@DDEle
Copy link
Contributor

@DDEle DDEle commented Nov 28, 2025

Proposed changes

ck_tile version of f9bf275 (merged with #2297)

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers with understanding the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces multi-threaded random tensor value generation for the ck_tile library, replacing the previous single-threaded or opt-in multi-threaded approach with an always-on deterministic multi-threaded implementation. The changes ensure reproducible results across different thread counts by using a block-based distribution strategy with RNG state management via discard().

Key Changes

  • Refactored FillUniformDistribution to always use multi-threading with deterministic block-based random number generation
  • Added CPU core management utilities (get_available_cpu_cores() and cpu_core_guard) for testing different thread configurations
  • Updated the template parameter to allow type deduction (T = void)

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
include/ck_tile/host/fill.hpp Replaced opt-in multi-threading with always-on deterministic block-based multi-threaded filling; changed template parameter to support type deduction
include/ck_tile/host/joinable_thread.hpp Added get_available_cpu_cores() function and cpu_core_guard class for CPU affinity management in tests
test/ck_tile/utility/test_fill.cpp New comprehensive test suite validating deterministic behavior across different sizes and thread counts
test/ck_tile/utility/CMakeLists.txt Registered the new test executable
example/ck_tile/18_flatmm/mxgemm/run_mx_flatmm.inc Updated to use new template syntax with type deduction

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

const int cpu_cores = max(32U, get_available_cpu_cores());
for(auto num_threads_diff : {-3, -1})
{
cpu_core_guard cg(max(cpu_cores + num_threads_diff, 1));
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test attempts to set CPU cores to max(cpu_cores + num_threads_diff, 1) where num_threads_diff is negative (-3, -1). However, cpu_cores is computed as max(32U, get_available_cpu_cores()), so if the system has fewer than 32 cores (say 16), this could try to set affinity to cores that don't exist (e.g., cores 29, 31 on a 16-core system). The cpu_core_guard should be limited to the actual number of available cores. Consider: cpu_core_guard cg(min(max(cpu_cores + num_threads_diff, 1), get_available_cpu_cores()));

Suggested change
cpu_core_guard cg(max(cpu_cores + num_threads_diff, 1));
cpu_core_guard cg(min(max(cpu_cores + num_threads_diff, 1), get_available_cpu_cores()));

Copilot uses AI. Check for mistakes.
* @example
*
* // Direct usage without creating a separate variable:
* ck_tile::FillUniformDistribution<ADataType>{-1.f, 1.f}(a_host_tensor);
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The example in the documentation shows usage with an explicit template parameter FillUniformDistribution<ADataType>, but the new API allows (and the code examples in run_mx_flatmm.inc demonstrate) using FillUniformDistribution<> with template type deduction. The example should be updated to reflect the new usage pattern: ck_tile::FillUniformDistribution<>{-1.f, 1.f}(a_host_tensor);

Suggested change
* ck_tile::FillUniformDistribution<ADataType>{-1.f, 1.f}(a_host_tensor);
* ck_tile::FillUniformDistribution<>{-1.f, 1.f}(a_host_tensor);

Copilot uses AI. Check for mistakes.

#include <gtest/gtest.h>
#include <vector>
#include <cstring>
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing #include <chrono> header. The code uses std::chrono::high_resolution_clock and related types but doesn't include the necessary header. This will cause compilation errors.

Suggested change
#include <cstring>
#include <cstring>
#include <chrono>

Copilot uses AI. Check for mistakes.
std::cout << "Taking " << sec << " sec to fill 1GB of data of type " << typeid(T).name()
<< std::endl;

const int cpu_cores = max(32U, get_available_cpu_cores());
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type mismatch: cpu_cores is declared as int but get_available_cpu_cores() returns unsigned int. This could cause sign-related issues when performing arithmetic operations. Change to const unsigned int cpu_cores = max(32U, get_available_cpu_cores()); or cast appropriately when using in arithmetic.

Suggested change
const int cpu_cores = max(32U, get_available_cpu_cores());
const unsigned int cpu_cores = max(32U, get_available_cpu_cores());

Copilot uses AI. Check for mistakes.
Comment on lines +55 to +67
sched_getaffinity(0, sizeof(cpu_set_t), &original_cpu_set_);

// set new cpu set
cpu_set_t new_cpu_set;
CPU_ZERO(&new_cpu_set);
for(unsigned int i = 0; i < num_cores; ++i)
{
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wold-style-cast"
CPU_SET(i, &new_cpu_set); // NOLINT(old-style-cast)
#pragma clang diagnostic pop
}
sched_setaffinity(0, sizeof(cpu_set_t), &new_cpu_set);
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing error handling for sched_getaffinity and sched_setaffinity system calls. If these calls fail, the cpu affinity won't be properly set or restored, which could lead to unexpected behavior. Consider adding error checks and potentially logging warnings if the operations fail.

Copilot uses AI. Check for mistakes.
{
// restore original cpu set
sched_setaffinity(0, sizeof(cpu_set_t), &original_cpu_set_);
}
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cpu_core_guard class is empty on non-Linux platforms. This means creating an instance has no effect on non-Linux systems, but the test code uses it assuming it will work. Consider adding a public constructor and destructor for non-Linux platforms to avoid potential compilation issues and make the class behavior more explicit.

Suggested change
}
}
#else
public:
cpu_core_guard(unsigned int) {}
~cpu_core_guard() {}

Copilot uses AI. Check for mistakes.
std::minstd_rand gen(seed_.has_value() ? *seed_ : std::random_device{}());
std::uniform_real_distribution<float> dis(a_, b_);

std::vector<joinable_thread> threads(num_thread);
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The vector is pre-sized with num_thread elements but then reserve() is called for num_thread - 1. The reserve() call after construction is redundant and incorrect - reserve is meant to allocate capacity without changing size. Since the vector already has num_thread elements, this reserve call has no useful effect. Remove this line or change the initialization to std::vector<joinable_thread> threads; threads.reserve(num_thread - 1);

Suggested change
std::vector<joinable_thread> threads(num_thread);
std::vector<joinable_thread> threads;

Copilot uses AI. Check for mistakes.

// max 80 threads; at least 2MB per thread
const size_t available_cpu_cores = get_available_cpu_cores();
const int num_thread =
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type inconsistency: num_thread is declared as int but is used with size_t values in calculations. The result of min(80UL, available_cpu_cores, integer_divide_ceil(...)) should be size_t. This could lead to type conversion warnings or issues. Change the type to const size_t num_thread = ... for consistency.

Suggested change
const int num_thread =
const size_t num_thread =

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants