Fix runtime lib loading logic #2297
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
1 file reviewed, 1 comment
    if os.path.isdir(os.path.join(nvidia_dir, "cu13")):
        so_paths = glob.glob(os.path.join(nvidia_dir, "cu13", f"lib/lib*{ext}.*[0-9]"))
    if os.path.isdir(os.path.join(nvidia_dir, lib_name)):
        so_paths = glob.glob(os.path.join(nvidia_dir, lib_name, f"lib/lib*{ext}.*[0-9]"))
    else:
        so_paths = glob.glob(os.path.join(nvidia_dir, f"cuda_{lib_name}", f"lib/lib*{ext}.*[0-9]"))
logic: if-elif-else logic error: line 269 checks `cu13` but doesn't prevent fallthrough. If `cu13` exists, line 271 still executes (checking the `lib_name` dir), and because that second `if` has an `else`, one of its two branches always overwrites `so_paths`. Should use `elif` on line 271.
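A minimal sketch of the suggested `elif` chain, so that the first matching directory layout wins. The function name `find_so_paths` and the default `ext=".so"` are assumptions made for illustration; only `nvidia_dir`, `lib_name`, `ext`, and the glob pattern come from the diff above.

```python
import glob
import os


def find_so_paths(nvidia_dir: str, lib_name: str, ext: str = ".so") -> list:
    """Return versioned shared objects from the first layout that exists.

    Hypothetical helper: the name and signature are illustrative only.
    """
    pattern = f"lib/lib*{ext}.*[0-9]"
    if os.path.isdir(os.path.join(nvidia_dir, "cu13")):
        # Highest priority: a matching cu13 layout stops the search here.
        so_paths = glob.glob(os.path.join(nvidia_dir, "cu13", pattern))
    elif os.path.isdir(os.path.join(nvidia_dir, lib_name)):
        # Only reached when cu13 is absent.
        so_paths = glob.glob(os.path.join(nvidia_dir, lib_name, pattern))
    else:
        # Fallback layout with the cuda_ prefix.
        so_paths = glob.glob(os.path.join(nvidia_dir, f"cuda_{lib_name}", pattern))
    return so_paths
```

With a plain `if` on the second check, the `cu13` result would be discarded whenever `cu13` exists, because either the second `if` body or its `else` would run afterwards.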
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. New critical issues identified: The refactored `_load_cuda_library` function has a duplicate `@functools.lru_cache` decorator on line 320 that will cause a syntax error. Additionally, lines 334–336 now return only the first handle from multi-handle returns (previously `_load_nvidia_cuda_library` returned all handles), breaking libraries that may provide multiple versions. Most critically, lines 356–357 directly assign the tuple (bool, list) from `_load_cuda_library_from_python` to `_CUBLAS_LIB_CTYPES` and `_CUDART_LIB_CTYPES` instead of unpacking the handle, which will cause runtime failures when these variables are later used as ctypes handles. The previously reported if-elif-else logic error on lines 269–274 remains unfixed.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| `transformer_engine/common/__init__.py` | 1/5 | Refactored library loading to prioritize system over Python packages, but introduces duplicate decorator, loses multi-handle semantics, and incorrectly assigns tuple return values to ctypes handle globals |
Confidence score: 0/5
- This PR will cause immediate failures in production due to multiple syntax and logic errors
- Score reflects four distinct critical issues: duplicate decorator causing syntax error, incorrect return-value unpacking causing type errors, lost multi-handle semantics breaking library loading, and unfixed if-elif-else fallthrough from previous review
- Pay close attention to lines 320 (duplicate decorator), 334–336 (lost multi-handle return), 356–357 (incorrect tuple assignment), and 269–274 (unfixed if-elif-else logic)
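The tuple-assignment issue flagged for lines 356–357 can be illustrated with a small sketch. The loader below is a hypothetical stand-in that only mimics the `(bool, list)` return shape described in the summary; it does not load any real library, and the handle is a placeholder string rather than a ctypes object.

```python
from typing import Any, List, Tuple


def fake_load_from_python(lib_name: str) -> Tuple[bool, List[Any]]:
    """Hypothetical stand-in returning (found, handles), as described above."""
    return True, [f"<handle:{lib_name}>"]


# Problematic pattern: the variable ends up holding the whole tuple,
# so later code that expects a handle receives (True, [...]) instead.
cublas_wrong = fake_load_from_python("cublas")

# Unpack first, then keep a handle only when the lookup succeeded.
found, handles = fake_load_from_python("cublas")
cublas_handle = handles[0] if found and handles else None
```

Whether to keep a single handle or the whole list is a design choice for the PR author; the point of the sketch is only that the `found` flag has to be separated from the handles before assignment.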
1 file reviewed, 3 comments
    # Attempt to locate cuRAND in Python dist-packages
    found, handle = _load_nvidia_cuda_library("curand")
    @functools.lru_cache(maxsize=None)
syntax: duplicate decorator; remove this extra line
    found, handle = _load_cuda_library_from_python(lib_name)
    if found:
        return handle
logic: if _load_cuda_library_from_python returns multiple handles, only the first is returned here; all handles should be returned to match the previous behavior
Description
This is a small refactor of the runtime library loading logic to make it more consistent and avoid duplication. The main point is to prioritize system installations and check Python packages only as a last-ditch attempt to find the library.
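The priority order described above could be sketched roughly as follows. This is not the PR's implementation: the function name `load_system_first`, the `site-packages/nvidia` location, and the recursive glob are assumptions made for illustration.

```python
import ctypes
import glob
import os
import sysconfig
from typing import Optional


def load_system_first(lib_name: str, ext: str = ".so") -> Optional[ctypes.CDLL]:
    """Try the system loader first, then fall back to Python packages.

    Hypothetical helper: name, paths, and search pattern are illustrative.
    """
    # 1. System installation: passing a bare file name lets the dynamic
    #    loader search LD_LIBRARY_PATH and the standard system locations.
    try:
        return ctypes.CDLL(f"lib{lib_name}{ext}", mode=ctypes.RTLD_GLOBAL)
    except OSError:
        pass

    # 2. Last resort: versioned shared objects shipped inside PyPI packages
    #    (assumed here to live under site-packages/nvidia).
    nvidia_dir = os.path.join(sysconfig.get_path("purelib"), "nvidia")
    pattern = os.path.join(nvidia_dir, "**", f"lib{lib_name}{ext}.*[0-9]")
    for path in sorted(glob.glob(pattern, recursive=True)):
        try:
            return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
        except OSError:
            continue
    return None
```

Trying the system loader first means a PyPI copy with a mismatching version is only picked up when no system copy can be loaded at all.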
Fixes a bug where the incorrect shared object is loaded (with mismatching versions) due to the presence of PyPI packages that are installed by pytorch/jax etc.
Type of change
Changes
- `curand`, `cudnn` etc.
- `LD_LIBRARY_PATH` before checking python packages.
- `ldconfig` as redundant and brute force.

Checklist: