Add cudf::memory_resources for separate temporary/output allocation control #20934
…d output allocations

This large-scale refactoring introduces the cudf::memory_resources class that enables fine-grained control over memory allocation by separating temporary (intermediate) allocations from output (returned) allocations.

Key changes:
- Add cudf::memory_resources class in utilities/memory_resource.hpp with:
  * Two-argument constructor for explicit output and temporary MRs
  * Single-argument constructor for backward API compatibility
  * get_output_mr() and get_temporary_mr() accessor methods
  * Implicit conversion from rmm::device_async_resource_ref
- Update 562 files across the codebase:
  * 197 public API headers converted to accept cudf::memory_resources
  * 365+ implementation files updated to use resources.get_temporary_mr()
  * All Thrust exec_policy calls now include memory resource parameter
  * All rmm::device_uvector and device_buffer allocations updated
- Add validation support:
  * LIBCUDF_ERROR_ON_CURRENT_DEVICE_RESOURCE_REF environment variable
  * Enables strict checking that resources are threaded through all code paths
  * Helpful error messages for debugging
- Create comprehensive test suite:
  * memory_resources_tests.cpp with 20+ test cases
  * memory_resources_validation_tests.cpp with 10+ validation tests
  * Tests cover constructors, separate pools, tracking, validation mode, edge cases
- Add extensive documentation:
  * Implementation plan and design decisions
  * Refactoring summary with patterns used
  * Detailed list of targeted fixes applied
  * Test documentation and debugging guide
  * Validation script for verifying completeness

Patterns updated throughout codebase:
- rmm::device_async_resource_ref mr → cudf::memory_resources resources
- cudf::get_current_device_resource_ref() → resources.get_temporary_mr()
- rmm::exec_policy(stream) → rmm::exec_policy(stream, resources.get_temporary_mr())
- Function calls: pass entire resources object, not just get_output_mr()

Maintains backward API compatibility through implicit conversion while enabling new functionality to use separate memory pools for optimization and profiling.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
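The validation mode described in the commit message could be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: only the environment variable name comes from the commit message, while `validation_enabled()` and `check_no_default_resource()` are hypothetical helper names, and the real implementation may cache the lookup or raise a different exception type.

```cpp
#include <cstdlib>
#include <stdexcept>

// Returns true when the validation environment variable is set to any value.
// The variable name is from the commit message; this helper is hypothetical.
inline bool validation_enabled()
{
  return std::getenv("LIBCUDF_ERROR_ON_CURRENT_DEVICE_RESOURCE_REF") != nullptr;
}

// Hypothetical guard meant for a fallback path that would otherwise use the
// current device resource instead of an explicitly passed one.
inline void check_no_default_resource()
{
  if (validation_enabled()) {
    throw std::runtime_error(
      "current device resource used while "
      "LIBCUDF_ERROR_ON_CURRENT_DEVICE_RESOURCE_REF is set; "
      "pass cudf::memory_resources explicitly");
  }
}
```

With the variable unset the guard is a no-op, so the strict check only costs anything when someone opts in, for example in a CI job that verifies resources are threaded through every code path.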
/ok to test
@karthikeyann, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
/ok to test 663a2bb
Would you be willing to break up this PR so it's just the design of cudf::memory_resources with a sample implementation for one API, and then expand to all of cuDF in a later PR? I'd like to separate design from broad implementation.
std::getenv("LIBCUDF_ERROR_ON_CURRENT_DEVICE_RESOURCE_REF") != nullptr;

if (validation_enabled) {
  throw std::runtime_error(
Should this be simplified to CUDF_EXPECTS?
I'm resolving merge conflicts on this. My
Summary
This PR introduces the cudf::memory_resources class that enables fine-grained control over memory allocation by separating temporary (intermediate) allocations from output (returned) allocations throughout libcudf. Attempts to solve #20780.
Motivation
Currently, all libcudf APIs use a single memory resource for both output data and temporary allocations. This limits the ability to:
Changes
Core Infrastructure
- cudf::memory_resources in utilities/memory_resource.hpp
- Two-argument constructor taking output_mr and temporary_mr
- Accessors get_output_mr() and get_temporary_mr()
- Implicit conversion from rmm::device_async_resource_ref for API compatibility
Refactoring Scope
- Public API headers converted to accept cudf::memory_resources
- Implementation files use resources.get_temporary_mr() for temporary allocations
- device_uvector and device_buffer updated throughout
Validation Infrastructure
- LIBCUDF_ERROR_ON_CURRENT_DEVICE_RESOURCE_REF environment variable
- validate_refactoring.sh - all checks pass ✅
Testing
- MEMORY_RESOURCES_TESTS_README.md
Documentation
API Changes
Backward Compatible ✅
Existing code continues to work due to implicit conversion:
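A minimal sketch of that conversion, using stand-in types rather than the real RMM and libcudf headers. `resource_ref` is a hypothetical placeholder for rmm::device_async_resource_ref, the `memory_resources` class models only what this PR description states (a converting one-argument constructor, a two-argument constructor, and the two accessors), and `legacy_style_api` is an invented function name.

```cpp
#include <string>
#include <utility>

// Stand-in for rmm::device_async_resource_ref (hypothetical, illustration only).
struct resource_ref {
  std::string name;
};

// Simplified model of cudf::memory_resources as described in this PR.
class memory_resources {
 public:
  // Non-explicit single-argument constructor: one resource serves both roles,
  // which is what lets existing call sites compile unchanged.
  memory_resources(resource_ref mr) : output_mr_{mr}, temporary_mr_{mr} {}

  // Two-argument constructor: separate output and temporary resources.
  memory_resources(resource_ref output_mr, resource_ref temporary_mr)
    : output_mr_{std::move(output_mr)}, temporary_mr_{std::move(temporary_mr)}
  {
  }

  resource_ref const& get_output_mr() const { return output_mr_; }
  resource_ref const& get_temporary_mr() const { return temporary_mr_; }

 private:
  resource_ref output_mr_;
  resource_ref temporary_mr_;
};

// Hypothetical API that takes memory_resources instead of a single resource.
// It reports which resource would back outputs and which would back temporaries.
std::string legacy_style_api(memory_resources resources)
{
  return resources.get_output_mr().name + "/" + resources.get_temporary_mr().name;
}
```

In this model, a caller that passes a single `resource_ref` gets it used for both output and temporary allocations, while `legacy_style_api({out, tmp})` opts in to separate resources without any change to the function signature.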