@ksivaman
Description

This is a small refactor of the runtime library-loading logic to make it more consistent and avoid duplication. The main point is to prioritize system installations and check Python packages only as a last-ditch attempt to find the library.

Fixes a bug where the wrong shared object (with a mismatched version) is loaded because of PyPI packages installed by PyTorch, JAX, etc.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Remove duplicated loading logic for libraries such as cuRAND and cuDNN.
  • Prioritize loading from system locations (e.g. LD_LIBRARY_PATH) before checking Python packages.
  • Remove the ldconfig search, which was redundant and brute-force.
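
The priority order described in the list above could be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `find_in_python_packages`, `load_library`, the `nvidia` subdirectory layout, and the Linux-style `lib<name>.so` naming are all assumptions made for the example.

```python
import ctypes
import glob
import os
import sysconfig


def find_in_python_packages(lib_name, site_packages=None):
    """Return candidate shared objects for lib_name under <site-packages>/nvidia.

    Hypothetical helper: the directory layout is an assumption for this sketch.
    """
    site_packages = site_packages or sysconfig.get_paths()["purelib"]
    pattern = os.path.join(
        site_packages, "nvidia", "**", f"lib{lib_name}.so.*[0-9]"
    )
    return sorted(glob.glob(pattern, recursive=True))


def load_library(lib_name, site_packages=None):
    """Try the system loader first, then fall back to Python packages."""
    try:
        # A bare file name lets the dynamic loader search LD_LIBRARY_PATH
        # and the other system locations before anything else is tried.
        return ctypes.CDLL(f"lib{lib_name}.so", mode=ctypes.RTLD_GLOBAL)
    except OSError:
        pass

    # Last resort: shared objects shipped inside Python packages.
    for path in find_in_python_packages(lib_name, site_packages):
        try:
            return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
        except OSError:
            continue

    raise RuntimeError(f"Could not load lib{lib_name}.so from system or Python packages")
```

Trying the system loader first means a copy bundled with an unrelated Python package is only picked up when no system installation is visible.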

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
@ksivaman requested a review from ptrendx on October 23, 2025, 16:14

@greptile-apps (bot) left a comment

1 file reviewed, 1 comment

Comment on lines +269 to +274
if os.path.isdir(os.path.join(nvidia_dir, "cu13")):
    so_paths = glob.glob(os.path.join(nvidia_dir, "cu13", f"lib/lib*{ext}.*[0-9]"))
if os.path.isdir(os.path.join(nvidia_dir, lib_name)):
    so_paths = glob.glob(os.path.join(nvidia_dir, lib_name, f"lib/lib*{ext}.*[0-9]"))
else:
    so_paths = glob.glob(os.path.join(nvidia_dir, f"cuda_{lib_name}", f"lib/lib*{ext}.*[0-9]"))
logic: if-elif-else logic error: line 269 checks cu13 but doesn't prevent fallthrough. If cu13 exists, line 271 will also execute (checking the lib_name dir), and either that branch or its else branch will overwrite so_paths. Line 271 should use elif.
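
One way the suggested fix might look, as a sketch only: the function name `find_so_paths` and the default `ext=".so"` are assumptions for illustration, and the precedence (cu13 first, then `lib_name`, then `cuda_<lib_name>`) is the one implied by the comment above.

```python
import glob
import os


def find_so_paths(nvidia_dir, lib_name, ext=".so"):
    """Pick exactly one subdirectory, then glob for versioned shared objects."""
    if os.path.isdir(os.path.join(nvidia_dir, "cu13")):
        subdir = "cu13"
    elif os.path.isdir(os.path.join(nvidia_dir, lib_name)):
        # Only reached when cu13 is absent, so it cannot overwrite that result.
        subdir = lib_name
    else:
        subdir = f"cuda_{lib_name}"
    return glob.glob(os.path.join(nvidia_dir, subdir, f"lib/lib*{ext}.*[0-9]"))
```

Choosing the subdirectory first and globbing once keeps the three branches mutually exclusive and avoids repeating the pattern.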

@greptile-apps (bot) left a comment

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. New critical issues identified: The refactored _load_cuda_library function has a duplicate @functools.lru_cache decorator on line 320 that will cause a syntax error. Additionally, lines 334–336 now return only the first handle from multi-handle returns (previously _load_nvidia_cuda_library returned all handles), breaking libraries that may provide multiple versions. Most critically, lines 356–357 directly assign the tuple (bool, list) from _load_cuda_library_from_python to _CUBLAS_LIB_CTYPES and _CUDART_LIB_CTYPES instead of unpacking the handle, which will cause runtime failures when these variables are later used as ctypes handles. The previously reported if-elif-else logic error on lines 269–274 remains unfixed.

Important Files Changed

Filename | Score | Overview
transformer_engine/common/__init__.py | 1/5 | Refactored library loading to prioritize system over Python packages, but introduces a duplicate decorator, loses multi-handle semantics, and incorrectly assigns tuple return values to ctypes handle globals

Confidence score: 0/5

  • This PR will cause immediate failures in production due to multiple syntax and logic errors
  • Score reflects four distinct critical issues: duplicate decorator causing syntax error, incorrect return-value unpacking causing type errors, lost multi-handle semantics breaking library loading, and unfixed if-elif-else fallthrough from previous review
  • Pay close attention to lines 320 (duplicate decorator), 334–336 (lost multi-handle return), 356–357 (incorrect tuple assignment), and 269–274 (unfixed if-elif-else logic)
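
The tuple-assignment issue flagged for lines 356–357 can be illustrated with a small stand-in. The stub `_load_from_python_stub` below is hypothetical and returns placeholder strings instead of real ctypes handles; it only mirrors the (bool, list) return shape described in the summary.

```python
from typing import Any, List, Tuple


def _load_from_python_stub(lib_name: str) -> Tuple[bool, List[Any]]:
    """Hypothetical stand-in for a loader that returns (found, handles)."""
    handles = [f"<handle for {lib_name}>"]
    return True, handles


# Problematic: the variable ends up holding the whole (bool, list) tuple,
# so later code that expects a library handle receives a tuple instead.
wrong = _load_from_python_stub("cublas")

# Unpack first, then keep only the handle that callers expect.
found, handles = _load_from_python_stub("cublas")
cublas_handle = handles[0] if found and handles else None
```

With the unpacked form, a failed lookup leaves `cublas_handle` as `None`, which callers can check explicitly.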

1 file reviewed, 3 comments

# Attempt to locate cuRAND in Python dist-packages
found, handle = _load_nvidia_cuda_library("curand")
@functools.lru_cache(maxsize=None)
syntax: duplicate decorator; remove this extra line

Comment on lines +334 to +336
found, handle = _load_cuda_library_from_python(lib_name)
if found:
    return handle
logic: if _load_cuda_library_from_python returns multiple handles, only the first is returned here; all handles should be returned to match the previous behavior
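
A sketch of a wrapper that keeps every handle, under stated assumptions: `load_cuda_library`, `system_loader`, and `python_loader` are hypothetical names, the system loader is assumed to return a single handle or `None`, and the Python-package loader is assumed to return `(found, handles)`.

```python
def load_cuda_library(lib_name, system_loader, python_loader):
    """Return a list of handles: system install first, Python packages last."""
    handle = system_loader(lib_name)
    if handle is not None:
        # Wrap the single system handle so callers always receive a list.
        return [handle]

    found, handles = python_loader(lib_name)
    if found:
        # Keep every handle instead of dropping all but the first.
        return list(handles)

    raise RuntimeError(f"Unable to locate {lib_name}")
```

Returning a list in both branches gives callers one shape to handle, whichever source the library came from.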
