Fix runtime lib loading logic #2297
base: main
Conversation
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
1 file reviewed, 1 comment
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. New critical issues identified: The refactored _load_cuda_library function has a duplicate @functools.lru_cache decorator on line 320 that will cause a syntax error. Additionally, lines 334–336 now return only the first handle from multi-handle returns (previously _load_nvidia_cuda_library returned all handles), breaking libraries that may provide multiple versions. Most critically, lines 356–357 directly assign the tuple (bool, list) from _load_cuda_library_from_python to _CUBLAS_LIB_CTYPES and _CUDART_LIB_CTYPES instead of unpacking the handle, which will cause runtime failures when these variables are later used as ctypes handles. The previously reported if-elif-else logic error on lines 269–274 remains unfixed.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| transformer_engine/common/__init__.py | 1/5 | Refactored library loading to prioritize system over Python packages, but introduces duplicate decorator, loses multi-handle semantics, and incorrectly assigns tuple return values to ctypes handle globals |
Confidence score: 0/5
- This PR will cause immediate failures in production due to multiple syntax and logic errors
- Score reflects four distinct critical issues: duplicate decorator causing syntax error, incorrect return-value unpacking causing type errors, lost multi-handle semantics breaking library loading, and unfixed if-elif-else fallthrough from previous review
- Pay close attention to lines 320 (duplicate decorator), 334–336 (lost multi-handle return), 356–357 (incorrect tuple assignment), and 269–274 (unfixed if-elif-else logic)
1 file reviewed, 3 comments
_CUBLAS_LIB_CTYPES = _load_cuda_library_from_python("cublas")
_CUDART_LIB_CTYPES = _load_cuda_library_from_python("cuda_runtime")
logic: these calls skip system search and only look in python packages, but they don't error if not found (returns (False, [])); this will silently fail and assign False/[] to the ctypes vars instead of a handle
Why do we only load those libraries from Python?
Shouldn't we prioritize system installs of the CUDA Runtime and cuBLAS?
You're absolutely right! Looking at the code, there's an inconsistency in the loading logic. Lines 352-354 use _load_cuda_library() which properly prioritizes system installations over Python packages:

_CUDNN_LIB_CTYPES = _load_cuda_library("cudnn")
_NVRTC_LIB_CTYPES = _load_cuda_library("nvrtc")
_CURAND_LIB_CTYPES = _load_cuda_library("curand")

But lines 355-356 skip the system search entirely and only look in Python packages:

_CUBLAS_LIB_CTYPES = _load_cuda_library_from_python("cublas")
_CUDART_LIB_CTYPES = _load_cuda_library_from_python("cuda_runtime")

This inconsistency can lead to exactly the bug described in the PR: loading incorrect shared objects with mismatching versions from PyPI packages when a proper system installation exists.
For consistency with the other CUDA libraries and to prioritize system installations, these should be:
- _CUBLAS_LIB_CTYPES = _load_cuda_library_from_python("cublas")
- _CUDART_LIB_CTYPES = _load_cuda_library_from_python("cuda_runtime")
+ _CUBLAS_LIB_CTYPES = _load_cuda_library("cublas")
+ _CUDART_LIB_CTYPES = _load_cuda_library("cuda_runtime")
This would ensure that cuBLAS and CUDA Runtime follow the same loading priority as the other CUDA libraries: system first, then Python packages as fallback.
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. The developer has updated transformer_engine/common/__init__.py to address multiple critical bugs identified in previous reviews. Specifically, the changes fix: (1) the if-elif-else fallthrough issue by converting line 271 to elif to prevent unintended overwrites, (2) the duplicate decorator on line 360 which is removed, (3) the cublas/cudart loading logic which is now wrapped in _load_cuda_library_from_system to properly search system paths and raise errors on failure rather than silently assigning (False, []), and (4) the single-handle return in _load_cuda_library which now returns all matched handles. The glob pattern on lines 270/272/274 still uses a wildcard (lib*{ext}.*[0-9]) that matches all versioned libraries in the directory instead of filtering specifically by the requested lib_name, which could lead to incorrect library selection when multiple CUDA libraries coexist, but this is a pre-existing pattern carried over from the original code and may be intentional to capture multi-part library names (e.g., libcudart.so.12.0).
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| transformer_engine/common/__init__.py | 4/5 | Fixed if-elif fallthrough, duplicate decorator, and cublas/cudart loading; all handles now returned; glob pattern still non-specific but matches prior behavior |
Confidence score: 4/5
- This PR is now safe to merge with minimal risk after addressing the previous critical issues
- Score reflects that all previously identified logic errors (fallthrough, duplicate decorator, silent failures, single-handle return) have been corrected; remaining uncertainty is the wildcard glob pattern which may be by design and does not introduce new risk
- No files require additional attention; all critical bugs have been resolved
1 file reviewed, 2 comments
if os.path.isdir(os.path.join(nvidia_dir, "cu13")):
    so_paths = glob.glob(os.path.join(nvidia_dir, "cu13", f"lib/lib*{ext}.*[0-9]"))
elif os.path.isdir(os.path.join(nvidia_dir, lib_name)):
    so_paths = glob.glob(os.path.join(nvidia_dir, lib_name, f"lib/lib*{ext}.*[0-9]"))
else:
    so_paths = glob.glob(os.path.join(nvidia_dir, f"cuda_{lib_name}", f"lib/lib*{ext}.*[0-9]"))
logic: if-elif chain issue already reported; besides the missing elif on line 271, the wildcard lib*{ext} pattern (lines 270, 272, 274) will match ALL .so files in those directories, not just the requested library. Filter by lib_name after globbing or embed lib_name in the pattern. Should the glob patterns filter to lib{lib_name}*{ext}.*[0-9] or post-filter by basename?
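A quick check of the two pattern choices, using a throwaway directory. The directory layout and library names are made up for illustration; only the glob shapes come from the snippet above:

```python
import glob
import os
import tempfile

ext = ".so"
lib_name = "cudart"

# Fake package dir containing several versioned CUDA libs side by side.
nvidia_dir = tempfile.mkdtemp()
libdir = os.path.join(nvidia_dir, "cu13", "lib")
os.makedirs(libdir)
for name in ("libcudart.so.12.0", "libcublas.so.12.1", "libnvrtc.so.12.1"):
    open(os.path.join(libdir, name), "w").close()

# Current wildcard: matches every versioned library in the directory.
all_libs = glob.glob(os.path.join(nvidia_dir, "cu13", f"lib/lib*{ext}.*[0-9]"))

# Embedding lib_name narrows the match to the requested library only.
wanted = glob.glob(os.path.join(nvidia_dir, "cu13", f"lib/lib{lib_name}*{ext}.*[0-9]"))

print(len(all_libs), len(wanted))  # 3 1
```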
libs = glob.glob(f"{path}/**/lib{lib_name}{_get_sys_extension()}*", recursive=True)
libs.sort(reverse=True, key=os.path.basename)
if libs:
    return True, ctypes.CDLL(libs[0], mode=ctypes.RTLD_GLOBAL)
style: glob uses lib{lib_name}* but does not filter out stub libraries; for nvrtc this previously filtered stubs explicitly (old code line ~336). Consider adding not ("stub" in x) filter.
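A sketch of the suggested filter. The paths are made up; only the stub check is the point:

```python
import os

# Hypothetical glob results: a real nvrtc plus the toolkit's stubs copy.
libs = [
    "/usr/local/cuda/lib64/libnvrtc.so.12",
    "/usr/local/cuda/lib64/stubs/libnvrtc.so",
]

# Drop stub libraries before sorting/loading, as the old nvrtc code did.
libs = [x for x in libs if "stub" not in x]
libs.sort(reverse=True, key=os.path.basename)

print(libs)  # only the real library survives
```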
/te-ci
# `_load_cuda_library` is used for packages that must be loaded
# during runtime. Both system and pypi packages are searched
# and an error is thrown if not found.
_CUDNN_LIB_CTYPES = _load_cuda_library("cudnn")
Do we also need to handle the case where cuDNN is installed in CUDNN_PATH instead of CUDA_HOME?
_CUBLAS_LIB_CTYPES = _load_cuda_library_from_python("cublas")
_CUDART_LIB_CTYPES = _load_cuda_library_from_python("cuda_runtime")
Shouldn't we prioritize system installs of the CUDA Runtime and cuBLAS?
Description
This is a small refactor of the runtime library loading logic to be more consistent and avoid duplication. The main point is to prioritize system installations and check python packages only as a last-ditch attempt to find the library.
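The intended priority can be sketched as follows. The function names, hard-coded directory, and the nvidia namespace lookup are illustrative assumptions, not the code in this PR:

```python
import glob
import os
import tempfile

def _newest(lib_name, dirs, ext=".so"):
    """Newest non-stub match for lib{lib_name}{ext}* under the given dirs."""
    hits = []
    for d in dirs:
        hits += glob.glob(os.path.join(d, f"lib{lib_name}{ext}*"))
    hits = [h for h in hits if "stub" not in h]
    return sorted(hits, key=os.path.basename, reverse=True)[:1]

def find_library(lib_name):
    """Return a path: system paths (incl. LD_LIBRARY_PATH) first, pypi last."""
    system_dirs = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(":") if d]
    system_dirs.append("/usr/local/cuda/lib64")
    hit = _newest(lib_name, system_dirs)
    if not hit:  # last-ditch attempt: pip-installed nvidia wheels
        try:
            import nvidia
            pkg_root = os.path.dirname(nvidia.__file__)
            hit = _newest(lib_name, glob.glob(os.path.join(pkg_root, "*", "lib")))
        except ImportError:
            pass
    if not hit:
        raise FileNotFoundError(f"could not locate lib{lib_name}")
    return hit[0]  # caller loads via ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)

# Demo: a fake lib on LD_LIBRARY_PATH wins over any python package.
d = tempfile.mkdtemp()
open(os.path.join(d, "libcurand.so.10"), "w").close()
os.environ["LD_LIBRARY_PATH"] = d
print(find_library("curand"))
```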
Fixes a bug where the incorrect shared object is loaded (with mismatching versions) due to the presence of PyPI packages installed by pytorch/jax etc.

Type of change

Changes
- Consistent loading logic for curand, cudnn etc.
- Search system installations and LD_LIBRARY_PATH before checking python packages.
- Dropped the ldconfig search as redundant and brute force.

Checklist:
curand,cudnnetc.LD_LIBRARY_PATHbefore checking python packages.ldconfigas redundant and brute force.Checklist: