Skip to content

Refactor: Enhance GPU Memory Leak Test for read_region #874

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: branch-25.06
Choose a base branch
from

Conversation

gigony
Copy link
Contributor

@gigony gigony commented May 6, 2025

Problem:

The existing GPU memory leak test, test_read_region_cuda_memleak, has been observed to fail sporadically (as noted in https://github.com/rapidsai/build-infra/issues/228#issuecomment-2852570027). The previous logic, which compared memory usage at iteration 5 against iteration 9 (mem_usage_history[5] - mem_usage_history[9]), was susceptible to transient memory fluctuations caused by Python's garbage collection timing and CuPy's memory pool management, rather than necessarily indicating a persistent memory leak.

Solution:

This PR refactors test_read_region_cuda_memleak to improve its stability and accuracy in detecting genuine GPU memory leaks:

  1. Explicit Memory Management:

    • The CuPy array returned by img.read_region(device="cuda") is now explicitly deleted (del region_data) within each test loop iteration.
    • cp.get_default_memory_pool().free_all_blocks() is called immediately after deletion to encourage prompt reclamation of GPU memory by the CuPy memory pool.
  2. Revised Assertion Strategy:

    • The test now includes a "warm-up" phase (a few initial iterations) to allow for any one-time memory allocations or cache population.
    • Memory usage is recorded before the warm-up (mem_after_warmup) and at the very end of all iterations (mem_at_end).
    • The core assertion now checks if the cumulative memory increase (mem_at_end - mem_after_warmup) over the subsequent test iterations exceeds a defined leak_threshold_mib (currently set to 30MB over 7 active iterations). This directly targets sustained memory growth, which is characteristic of a leak.
  3. Improved Clarity:

    • Comments within the test have been updated to clearly explain the new methodology, the purpose of the warm-up phase, and the rationale behind the chosen leak threshold.

Benefits:

  • Increased Test Stability: By explicitly managing memory and using a cumulative check, the test should be far less prone to flakiness caused by system noise or GC timing.
  • More Accurate Leak Detection: The new logic more accurately reflects how a memory leak would manifest (i.e., sustained growth over time despite attempts to free resources).
  • Clearer Test Intent: The refactored code and comments make the test's purpose and methodology easier to understand.

This change aims to provide a more reliable and meaningful assessment of GPU memory usage for the read_region operation.

The `test_read_region_cuda_memleak` test was prone to sporadic failures
due to its reliance on measuring memory differences between arbitrary
points (iterations 5 and 9). This could be affected by transient memory
spikes or dips unrelated to a consistent leak, and by the timing of
Python's garbage collection and CuPy's memory pool heuristics.

This commit refactors the test to:
- Implement explicit deletion of the CuPy array (`region_data`) created
  by `img.read_region()` within each iteration.
- Add an explicit call to `cp.get_default_memory_pool().free_all_blocks()`
  after deletion to force more immediate reclamation of GPU memory.
- Change the assertion logic to:
    1. Perform a few "warm-up" iterations to allow for initial
       allocations.
    2. Measure memory after the warm-up phase.
    3. Measure memory after all test iterations have completed.
    4. Assert that the cumulative memory increase between the warm-up
       baseline and the final measurement is below a defined threshold
       (e.g., 30MB over 7 active iterations).
- This approach provides a more robust measure of actual memory leaks
  by looking for sustained memory growth after explicit freeing attempts,
  rather than being sensitive to short-term fluctuations.
- Updated comments to clarify the new logic and the rationale for the
  chosen memory threshold.

This change aims to reduce test flakiness and provide a more reliable
indicator of potential GPU memory leaks in the `read_region` functionality.

Signed-off-by: Gigon Bae <[email protected]>
@gigony gigony requested a review from a team as a code owner May 6, 2025 18:56
@gigony gigony self-assigned this May 6, 2025
@gigony gigony requested a review from jakirkham May 6, 2025 18:57
@gigony gigony added improvement Improves an existing functionality non-breaking Introduces a non-breaking change maintenance labels May 6, 2025
@gigony gigony added this to cucim May 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improves an existing functionality maintenance non-breaking Introduces a non-breaking change
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

1 participant