Refactor: Enhance GPU Memory Leak Test for read_region
#874
Problem:
The existing GPU memory leak test, `test_read_region_cuda_memleak`, has been observed to fail sporadically (as noted in https://github.com/rapidsai/build-infra/issues/228#issuecomment-2852570027). The previous logic compared memory usage at iteration 5 against iteration 9 (`mem_usage_history[5] - mem_usage_history[9]`). That comparison was susceptible to transient memory fluctuations caused by Python's garbage collection timing and CuPy's memory pool management, so a failure did not necessarily indicate a persistent memory leak.

Solution:
This PR refactors `test_read_region_cuda_memleak` to improve its stability and accuracy in detecting genuine GPU memory leaks:

Explicit Memory Management:
- The result of `img.read_region(device="cuda")` is now explicitly deleted (`del region_data`) within each test loop iteration.
- `cp.get_default_memory_pool().free_all_blocks()` is called immediately after deletion to encourage prompt reclamation of GPU memory by the CuPy memory pool.

Revised Assertion Strategy:
- GPU memory usage is recorded after the warm-up iterations (`mem_after_warmup`) and at the very end of all iterations (`mem_at_end`).
- The test fails only if the increase (`mem_at_end - mem_after_warmup`) over the subsequent test iterations exceeds a defined `leak_threshold_mib` (currently set to 30MB over 7 active iterations). This directly targets sustained memory growth, which is characteristic of a leak.

Improved Clarity:
Benefits:
This change aims to provide a more reliable and meaningful assessment of GPU memory usage for the `read_region` operation.
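
For reviewers who want the shape of the revised assertion without opening the diff, here is a minimal sketch of the warm-up-then-compare strategy described under Solution. It is not the test code from this PR. The helper name `measure_growth_mib` and its parameters `run_once`, `get_used_mib`, and `release` are hypothetical, and the split of 3 warm-up iterations out of 10 is an assumption chosen so that 7 active iterations remain. Memory measurement is passed in as a callback, so the sketch runs without a GPU.

```python
import gc
from typing import Callable


def measure_growth_mib(
    run_once: Callable[[], object],
    get_used_mib: Callable[[], float],
    release: Callable[[], object] = gc.collect,
    warmup_iters: int = 3,   # assumed value, not taken from the PR
    total_iters: int = 10,   # assumed value: leaves 7 active iterations
) -> float:
    """Return memory growth in MiB between the end of warm-up and the last iteration.

    run_once     -- hypothetical callback that performs one operation and drops its result
    get_used_mib -- hypothetical callback that reports current memory usage in MiB
    release      -- hypothetical callback that asks the allocator to return cached memory
    """
    if not 1 <= warmup_iters < total_iters:
        raise ValueError("warmup_iters must be >= 1 and smaller than total_iters")

    mem_after_warmup = 0.0
    for i in range(total_iters):
        run_once()
        release()
        # Take the baseline only once caches and pools have settled.
        if i == warmup_iters - 1:
            mem_after_warmup = get_used_mib()

    mem_at_end = get_used_mib()
    return mem_at_end - mem_after_warmup


if __name__ == "__main__":
    # Simulated usage: a fake workload that leaks 10 MiB per call.
    used = [100.0]

    def leaky() -> None:
        used[0] += 10.0

    leak_threshold_mib = 30
    growth = measure_growth_mib(leaky, lambda: used[0])
    print(f"growth: {growth} MiB, leak suspected: {growth > leak_threshold_mib}")
```

In the actual test, `run_once` would correspond to calling `img.read_region(device="cuda")` and deleting the result, and `release` to `cp.get_default_memory_pool().free_all_blocks()`. Comparing one post-warm-up baseline against the final reading means a single noisy sample mid-run cannot fail the test on its own; only growth that persists to the end counts.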