Fix runtime lib loading logic #2297
base: main
Conversation
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
1 file reviewed, 1 comment
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. New critical issues identified: The refactored _load_cuda_library function has a duplicate @functools.lru_cache decorator on line 320 that will cause a syntax error. Additionally, lines 334–336 now return only the first handle from multi-handle returns (previously _load_nvidia_cuda_library returned all handles), breaking libraries that may provide multiple versions. Most critically, lines 356–357 directly assign the tuple (bool, list) from _load_cuda_library_from_python to _CUBLAS_LIB_CTYPES and _CUDART_LIB_CTYPES instead of unpacking the handle, which will cause runtime failures when these variables are later used as ctypes handles. The previously reported if-elif-else logic error on lines 269–274 remains unfixed.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| transformer_engine/common/__init__.py | 1/5 | Refactored library loading to prioritize system over Python packages, but introduces duplicate decorator, loses multi-handle semantics, and incorrectly assigns tuple return values to ctypes handle globals |
Confidence score: 0/5
- This PR will cause immediate failures in production due to multiple syntax and logic errors
- Score reflects four distinct critical issues: duplicate decorator causing syntax error, incorrect return-value unpacking causing type errors, lost multi-handle semantics breaking library loading, and unfixed if-elif-else fallthrough from previous review
- Pay close attention to lines 320 (duplicate decorator), 334–336 (lost multi-handle return), 356–357 (incorrect tuple assignment), and 269–274 (unfixed if-elif-else logic)
1 file reviewed, 3 comments
_CUBLAS_LIB_CTYPES = _load_cuda_library_from_python("cublas")
_CUDART_LIB_CTYPES = _load_cuda_library_from_python("cuda_runtime")
logic: these calls skip system search and only look in python packages, but they don't error if not found (returns (False, [])); this will silently fail and assign False/[] to the ctypes vars instead of a handle
Why do we only load those libraries from Python?
Shouldn't we prioritize system installs of the CUDA Runtime and cuBLAS?
You're absolutely right! Looking at the code, there's an inconsistency in the loading logic. Lines 352-354 use _load_cuda_library() which properly prioritizes system installations over Python packages:

_CUDNN_LIB_CTYPES = _load_cuda_library("cudnn")
_NVRTC_LIB_CTYPES = _load_cuda_library("nvrtc")
_CURAND_LIB_CTYPES = _load_cuda_library("curand")

But lines 355-356 skip the system search entirely and only look in Python packages:

_CUBLAS_LIB_CTYPES = _load_cuda_library_from_python("cublas")
_CUDART_LIB_CTYPES = _load_cuda_library_from_python("cuda_runtime")

This inconsistency can lead to exactly the bug described in the PR: loading incorrect shared objects with mismatching versions from PyPI packages when a proper system installation exists.
For consistency with the other CUDA libraries and to prioritize system installations, these should be:
- _CUBLAS_LIB_CTYPES = _load_cuda_library_from_python("cublas")
- _CUDART_LIB_CTYPES = _load_cuda_library_from_python("cuda_runtime")
+ _CUBLAS_LIB_CTYPES = _load_cuda_library("cublas")
+ _CUDART_LIB_CTYPES = _load_cuda_library("cuda_runtime")
This would ensure that cuBLAS and CUDA Runtime follow the same loading priority as the other CUDA libraries: system first, then Python packages as fallback.
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. The developer has updated transformer_engine/common/__init__.py to address multiple critical bugs identified in previous reviews. Specifically, the changes fix: (1) the if-elif-else fallthrough issue by converting line 271 to elif to prevent unintended overwrites, (2) the duplicate decorator on line 360 which is removed, (3) the cublas/cudart loading logic which is now wrapped in _load_cuda_library_from_system to properly search system paths and raise errors on failure rather than silently assigning (False, []), and (4) the single-handle return in _load_cuda_library which now returns all matched handles. The glob pattern on lines 270/272/274 still uses a wildcard (lib*{ext}.*[0-9]) that matches all versioned libraries in the directory instead of filtering specifically by the requested lib_name, which could lead to incorrect library selection when multiple CUDA libraries coexist, but this is a pre-existing pattern carried over from the original code and may be intentional to capture multi-part library names (e.g., libcudart.so.12.0).
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| transformer_engine/common/__init__.py | 4/5 | Fixed if-elif fallthrough, duplicate decorator, and cublas/cudart loading; all handles now returned; glob pattern still non-specific but matches prior behavior |
Confidence score: 4/5
- This PR is now safe to merge with minimal risk after addressing the previous critical issues
- Score reflects that all previously identified logic errors (fallthrough, duplicate decorator, silent failures, single-handle return) have been corrected; remaining uncertainty is the wildcard glob pattern which may be by design and does not introduce new risk
- No files require additional attention; all critical bugs have been resolved
1 file reviewed, 2 comments
if os.path.isdir(os.path.join(nvidia_dir, "cu13")):
    so_paths = glob.glob(os.path.join(nvidia_dir, "cu13", f"lib/lib*{ext}.*[0-9]"))
elif os.path.isdir(os.path.join(nvidia_dir, lib_name)):
    so_paths = glob.glob(os.path.join(nvidia_dir, lib_name, f"lib/lib*{ext}.*[0-9]"))
else:
    so_paths = glob.glob(os.path.join(nvidia_dir, f"cuda_{lib_name}", f"lib/lib*{ext}.*[0-9]"))
logic: if-elif chain issue already reported; besides the missing elif on line 271, the wildcard lib*{ext} pattern (lines 270, 272, 274) will match ALL .so files in those directories, not just the requested library. Filter by lib_name after globbing or embed lib_name in the pattern. Should the glob patterns filter to lib{lib_name}*{ext}.*[0-9] or post-filter by basename?
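A quick check of the two pattern choices, using a throwaway directory. The directory layout and library names are made up for illustration; only the glob shapes come from the snippet above:

```python
import glob
import os
import tempfile

ext = ".so"
lib_name = "cudart"

# Fake package dir containing several versioned CUDA libs side by side.
nvidia_dir = tempfile.mkdtemp()
libdir = os.path.join(nvidia_dir, "cu13", "lib")
os.makedirs(libdir)
for name in ("libcudart.so.12.0", "libcublas.so.12.1", "libnvrtc.so.12.1"):
    open(os.path.join(libdir, name), "w").close()

# Current wildcard: matches every versioned library in the directory.
all_libs = glob.glob(os.path.join(nvidia_dir, "cu13", f"lib/lib*{ext}.*[0-9]"))

# Embedding lib_name narrows the match to the requested library only.
wanted = glob.glob(os.path.join(nvidia_dir, "cu13", f"lib/lib{lib_name}*{ext}.*[0-9]"))

print(len(all_libs), len(wanted))  # 3 1
```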
libs = glob.glob(f"{path}/**/lib{lib_name}{_get_sys_extension()}*", recursive=True)
libs.sort(reverse=True, key=os.path.basename)
if libs:
    return True, ctypes.CDLL(libs[0], mode=ctypes.RTLD_GLOBAL)
style: glob uses lib{lib_name}* but does not filter out stub libraries; for nvrtc this previously filtered stubs explicitly (old code line ~336). Consider adding not ("stub" in x) filter.
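A sketch of the suggested filter. The paths are made up; only the stub check is the point:

```python
import os

# Hypothetical glob results: a real nvrtc plus the toolkit's stubs copy.
libs = [
    "/usr/local/cuda/lib64/libnvrtc.so.12",
    "/usr/local/cuda/lib64/stubs/libnvrtc.so",
]

# Drop stub libraries before sorting/loading, as the old nvrtc code did.
libs = [x for x in libs if "stub" not in x]
libs.sort(reverse=True, key=os.path.basename)

print(libs)  # only the real library survives
```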
/te-ci
# `_load_cuda_library` is used for packages that must be loaded
# during runtime. Both system and pypi packages are searched
# and an error is thrown if not found.
_CUDNN_LIB_CTYPES = _load_cuda_library("cudnn")
Do we also need to handle the case where cuDNN is installed in CUDNN_PATH instead of CUDA_HOME?
_CUBLAS_LIB_CTYPES = _load_cuda_library_from_python("cublas")
_CUDART_LIB_CTYPES = _load_cuda_library_from_python("cuda_runtime")
Shouldn't we prioritize system installs of the CUDA Runtime and cuBLAS?
Description
This is a small refactor of the runtime library loading logic to be more consistent and avoid duplication. The main point is to prioritize system installations and check python packages only as a last-ditch attempt to find the library.
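The intended priority can be sketched as follows. The function names, hard-coded directory, and the nvidia namespace lookup are illustrative assumptions, not the code in this PR:

```python
import glob
import os
import tempfile

def _newest(lib_name, dirs, ext=".so"):
    """Newest non-stub match for lib{lib_name}{ext}* under the given dirs."""
    hits = []
    for d in dirs:
        hits += glob.glob(os.path.join(d, f"lib{lib_name}{ext}*"))
    hits = [h for h in hits if "stub" not in h]
    return sorted(hits, key=os.path.basename, reverse=True)[:1]

def find_library(lib_name):
    """Return a path: system paths (incl. LD_LIBRARY_PATH) first, pypi last."""
    system_dirs = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(":") if d]
    system_dirs.append("/usr/local/cuda/lib64")
    hit = _newest(lib_name, system_dirs)
    if not hit:  # last-ditch attempt: pip-installed nvidia wheels
        try:
            import nvidia
            pkg_root = os.path.dirname(nvidia.__file__)
            hit = _newest(lib_name, glob.glob(os.path.join(pkg_root, "*", "lib")))
        except ImportError:
            pass
    if not hit:
        raise FileNotFoundError(f"could not locate lib{lib_name}")
    return hit[0]  # caller loads via ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)

# Demo: a fake lib on LD_LIBRARY_PATH wins over any python package.
d = tempfile.mkdtemp()
open(os.path.join(d, "libcurand.so.10"), "w").close()
os.environ["LD_LIBRARY_PATH"] = d
print(find_library("curand"))
```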
Fixes a bug where the incorrect shared object is loaded (with mismatching versions) due to the presence of PyPI packages installed by pytorch/jax etc.

Type of change

Changes
- Consistent loading logic for curand, cudnn etc.
- Search system installations and LD_LIBRARY_PATH before checking python packages.
- Dropped the ldconfig search as redundant and brute force.

Checklist:
curand,cudnnetc.LD_LIBRARY_PATHbefore checking python packages.ldconfigas redundant and brute force.Checklist: