Refactor: Enhance GPU Memory Leak Test for `read_region` #874

gigony · 2025-05-06T18:56:43Z

Problem:

The existing GPU memory leak test, test_read_region_cuda_memleak, has been observed to fail sporadically (as noted in https://github.com/rapidsai/build-infra/issues/228#issuecomment-2852570027). The previous logic, which compared memory usage at iteration 5 against iteration 9 (mem_usage_history[5] - mem_usage_history[9]), was susceptible to transient memory fluctuations caused by Python's garbage collection timing and CuPy's memory pool management, rather than necessarily indicating a persistent memory leak.

Solution:

This PR refactors test_read_region_cuda_memleak to improve its stability and accuracy in detecting genuine GPU memory leaks:

Explicit Memory Management:
- The CuPy array returned by img.read_region(device="cuda") is now explicitly deleted (del region_data) within each test loop iteration.
- cp.get_default_memory_pool().free_all_blocks() is called immediately after deletion to encourage prompt reclamation of GPU memory by the CuPy memory pool.
Revised Assertion Strategy:
- The test now includes a "warm-up" phase (a few initial iterations) to allow for any one-time memory allocations or cache population.
- Memory usage is recorded before the warm-up (mem_after_warmup) and at the very end of all iterations (mem_at_end).
- The core assertion now checks if the cumulative memory increase (mem_at_end - mem_after_warmup) over the subsequent test iterations exceeds a defined leak_threshold_mib (currently set to 30MB over 7 active iterations). This directly targets sustained memory growth, which is characteristic of a leak.
Improved Clarity:
- Comments within the test have been updated to clearly explain the new methodology, the purpose of the warm-up phase, and the rationale behind the chosen leak threshold.

Benefits:

Increased Test Stability: By explicitly managing memory and using a cumulative check, the test should be far less prone to flakiness caused by system noise or GC timing.
More Accurate Leak Detection: The new logic more accurately reflects how a memory leak would manifest (i.e., sustained growth over time despite attempts to free resources).
Clearer Test Intent: The refactored code and comments make the test's purpose and methodology easier to understand.

This change aims to provide a more reliable and meaningful assessment of GPU memory usage for the read_region operation.

The `test_read_region_cuda_memleak` test was prone to sporadic failures due to its reliance on measuring memory differences between arbitrary points (iterations 5 and 9). This could be affected by transient memory spikes or dips unrelated to a consistent leak, and by the timing of Python's garbage collection and CuPy's memory pool heuristics. This commit refactors the test to: - Implement explicit deletion of the CuPy array (`region_data`) created by `img.read_region()` within each iteration. - Add an explicit call to `cp.get_default_memory_pool().free_all_blocks()` after deletion to force more immediate reclamation of GPU memory. - Change the assertion logic to: 1. Perform a few "warm-up" iterations to allow for initial allocations. 2. Measure memory after the warm-up phase. 3. Measure memory after all test iterations have completed. 4. Assert that the cumulative memory increase between the warm-up baseline and the final measurement is below a defined threshold (e.g., 30MB over 7 active iterations). - This approach provides a more robust measure of actual memory leaks by looking for sustained memory growth after explicit freeing attempts, rather than being sensitive to short-term fluctuations. - Updated comments to clarify the new logic and the rationale for the chosen memory threshold. This change aims to reduce test flakiness and provide a more reliable indicator of potential GPU memory leaks in the `read_region` functionality. Signed-off-by: Gigon Bae <[email protected]>

gigony requested a review from a team as a code owner May 6, 2025 18:56

gigony self-assigned this May 6, 2025

gigony requested a review from jakirkham May 6, 2025 18:57

gigony added improvement Improves an existing functionality non-breaking Introduces a non-breaking change maintenance labels May 6, 2025

gigony added this to cucim May 6, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor: Enhance GPU Memory Leak Test for `read_region` #874

Refactor: Enhance GPU Memory Leak Test for `read_region` #874

gigony commented May 6, 2025

Refactor: Enhance GPU Memory Leak Test for read_region #874

Are you sure you want to change the base?

Refactor: Enhance GPU Memory Leak Test for read_region #874

Conversation

gigony commented May 6, 2025

Refactor: Enhance GPU Memory Leak Test for `read_region` #874

Refactor: Enhance GPU Memory Leak Test for `read_region` #874