Skip to content

Conversation

@ksivaman
Copy link
Member

Description

This is a small refactor of library loading logic during runtime to be more consistent and avoid duplication. The main point is to check python packages as a last ditch attempt to find the library and prioritize system installations.

Fixes a bug where the incorrect shared object is loaded (with mismatching versions) due to presence of PyPI packages that are installed by pytorch/jax etc.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Remove duplication of loading logic for various libs such as curand, cudnn etc.
  • Prioritize loading packages via system, e.g. LD_LIBRARY_PATH before checking python packages.
  • Remove search via ldconfig as redundant and brute force.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
@ksivaman ksivaman requested a review from ptrendx October 23, 2025 16:14
Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. New critical issues identified: The refactored _load_cuda_library function has a duplicate @functools.lru_cache decorator on line 320 that will cause a syntax error. Additionally, lines 334–336 now return only the first handle from multi-handle returns (previously _load_nvidia_cuda_library returned all handles), breaking libraries that may provide multiple versions. Most critically, lines356–357 directly assign the tuple (bool, list) from _load_cuda_library_from_python to _CUBLAS_LIB_CTYPES and _CUDART_LIB_CTYPES instead of unpacking the handle, which will cause runtime failures when these variables are later used as ctypes handles. The previously reported if-elif-else logic error on lines 269–274 remains unfixed.

Important Files Changed

Filename Score Overview
transformer_engine/common/init.py 1/5 Refactored library loading to prioritize system over Python packages, but introduces duplicate decorator, loses multi-handle semantics, and incorrectly assigns tuple return values to ctypes handle globals

Confidence score: 0/5

  • This PR will cause immediate failures in production due to multiple syntax and logic errors
  • Score reflects four distinct critical issues: duplicate decorator causing syntax error, incorrect return-value unpacking causing type errors, lost multi-handle semantics breaking library loading, and unfixed if-elif-else fallthrough from previous review
  • Pay close attention to lines 320(duplicate decorator), 334–336 (lost multi-handle return), 356–357 (incorrect tuple assignment), and 269–274 (unfixed if-elif-else logic)

1 file reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +356 to +357
_CUBLAS_LIB_CTYPES = _load_cuda_library_from_python("cublas")
_CUDART_LIB_CTYPES = _load_cuda_library_from_python("cuda_runtime")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: these calls skip system search and only look in python packages, but they don't error if not found (returns (False, [])); this will silently fail and assign False/[] to the ctypes vars instead of a handle

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we only load those libraries from Python?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we prioritize system installs of the CUDA Runtime and cuBLAS?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right! Looking at the code, there's an inconsistency in the loading logic. Lines 352-354 use _load_cuda_library() which properly prioritizes system installations over Python packages:

_CUDNN_LIB_CTYPES = _load_cuda_library("cudnn")
_NVRTC_LIB_CTYPES = _load_cuda_library("nvrtc")
_CURAND_LIB_CTYPES = _load_cuda_library("curand")

But lines 355-356 skip the system search entirely and only look in Python packages:

_CUBLAS_LIB_CTYPES = _load_cuda_library_from_python("cublas")
_CUDART_LIB_CTYPES = _load_cuda_library_from_python("cuda_runtime")

This inconsistency can lead to exactly the bug described in the PR - loading incorrect shared objects with mismatching versions from PyPI packages when a proper system installation exists.

For consistency with the other CUDA libraries and to prioritize system installations, these should be:

Suggested change
_CUBLAS_LIB_CTYPES = _load_cuda_library_from_python("cublas")
_CUDART_LIB_CTYPES = _load_cuda_library_from_python("cuda_runtime")
_CUBLAS_LIB_CTYPES = _load_cuda_library("cublas")
_CUDART_LIB_CTYPES = _load_cuda_library("cuda_runtime")

This would ensure that cuBLAS and CUDA Runtime follow the same loading priority as the other CUDA libraries: system first, then Python packages as fallback.

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The developer has updated transformer_engine/common/__init__.py to address multiple critical bugs identified in previous reviews. Specifically, the changes fix: (1) the if-elif-else fallthrough issue by converting line 271 to elif to prevent unintended overwrites, (2) the duplicate decorator on line 360 which is removed, (3) the cublas/cudart loading logic which is now wrapped in _load_cuda_library_from_system to properly search system paths and raise errors on failure rather than silently assigning (False, []), and (4) the single-handle return in _load_cuda_library which now returns all matched handles. The glob pattern on lines 270/272/274 still uses a wildcard (lib*{ext}.*[0-9]) that matches all versioned libraries in the directory instead of filtering specifically by the requested lib_name, which could lead to incorrect library selection when multiple CUDA libraries coexist, but this is a pre-existing pattern carried over from the original code and may be intentional to capture multi-part library names (e.g., libcudart.so.12.0).

Important Files Changed

Filename Score Overview
transformer_engine/common/init.py 4/5 Fixed if-elif fallthrough, duplicate decorator, and cublas/cudart loading; all handles now returned; glob pattern still non-specific but matches prior behavior

Confidence score: 4/5

  • This PR is now safe to merge with minimal risk after addressing the previous critical issues
  • Score reflects that all previously identified logic errors (fallthrough, duplicate decorator, silent failures, single-handle return) have been corrected; remaining uncertainty is the wildcard glob pattern which may be by design and does not introduce new risk
  • No files require additional attention; all critical bugs have been resolved

1 file reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +269 to +274
if os.path.isdir(os.path.join(nvidia_dir, "cu13")):
so_paths = glob.glob(os.path.join(nvidia_dir, "cu13", f"lib/lib*{ext}.*[0-9]"))
elif os.path.isdir(os.path.join(nvidia_dir, lib_name)):
so_paths = glob.glob(os.path.join(nvidia_dir, lib_name, f"lib/lib*{ext}.*[0-9]"))
else:
so_paths = glob.glob(os.path.join(nvidia_dir, f"cuda_{lib_name}", f"lib/lib*{ext}.*[0-9]"))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: if-elif chain issue already reported; besides the missing elif on line 271, the wildcard lib*{ext} pattern (lines 270, 272, 274) will match ALL .so files in those directories, not just the requested library. Filter by lib_name after globbing or embed lib_name in the pattern. Should the glob patterns filter to lib{lib_name}*{ext}.*[0-9] or post-filter by basename?

Comment on lines +306 to +309
libs = glob.glob(f"{path}/**/lib{lib_name}{_get_sys_extension()}*", recursive=True)
libs.sort(reverse=True, key=os.path.basename)
if libs:
return True, ctypes.CDLL(libs[0], mode=ctypes.RTLD_GLOBAL)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: glob uses lib{lib_name}* but does not filter out stub libraries; for nvrtc this previously filtered stubs explicitly (old code line ~336). Consider adding not ("stub" in x) filter.

@ksivaman
Copy link
Member Author

/te-ci

@ksivaman ksivaman requested review from ptrendx and removed request for ptrendx October 28, 2025 20:18
# `_load_cuda_library` is used for packages that must be loaded
# during runtime. Both system and pypi packages are searched
# and an error is thrown if not found.
_CUDNN_LIB_CTYPES = _load_cuda_library("cudnn")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also need to handle the case where cuDNN is installed in CUDNN_PATH instead of CUDA_HOME?

Comment on lines +356 to +357
_CUBLAS_LIB_CTYPES = _load_cuda_library_from_python("cublas")
_CUDART_LIB_CTYPES = _load_cuda_library_from_python("cuda_runtime")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we prioritize system installs of the CUDA Runtime and cuBLAS?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants