
cvd remove fails silently with concurrent instances, causing GPU memory leak #528

@leroylim20

Summary

cvd remove --group_name=cvd_N silently fails to remove instances when other CVDs are running in the same container. Stopped instances remain in the fleet as "Running", retain their gfxstream GPU contexts, and GPU memory grows monotonically with each create/stop/remove cycle until cvd create hangs.

Environment

  • Image: us-docker.pkg.dev/android-cuttlefish-artifacts/cuttlefish-orchestration/cuttlefish-orchestration:stable
  • cvd version: 1.53.0 (VCS: b5244c8a7b84afd55a07b7617b63bf0beed0c735)
  • Host: AWS g5g.metal (aarch64, 64 vCPU, 128 GB RAM, 2× NVIDIA T4G)
  • NVIDIA driver: 580.159.03
  • OS: Amazon Linux 2023 (EKS-optimized AL2023_ARM_64_NVIDIA AMI)
  • GPU mode: gfxstream
  • Build: aosp-android-latest-release/aosp_cf_arm64_only_phone-userdebug (build ID 15357239)

Steps to Reproduce

  1. Run the cuttlefish-orchestration container in privileged mode with GPU access
  2. Launch 4+ CVDs with --gpu_mode=gfxstream:
    for i in 1 2 3 4; do
      HOME=/tmp/cvd-images cvd create \
        --host_path=/tmp/cvd-images --product_path=/tmp/cvd-images \
        --cpus=4 --memory_mb=4096 --gpu_mode=gfxstream \
        --x_res=540 --y_res=960 --dpi=240 --daemon \
        --enable_audio=false --report_anonymous_usage_stats=n \
        --modem_simulator_count=0
    done
  3. Record baseline GPU memory: nvidia-smi --query-gpu=memory.used --format=csv,noheader
  4. Rotate one CVD:
    # Create a new one
    HOME=/tmp/cvd-images cvd create --host_path=/tmp/cvd-images --product_path=/tmp/cvd-images \
      --cpus=4 --memory_mb=4096 --gpu_mode=gfxstream ...
    
    # Stop it
    HOME=/tmp/cvd-images cvd stop --group_name=cvd_5
    
    # Remove it
    HOME=/tmp/cvd-images cvd remove --group_name=cvd_5
  5. Check fleet: HOME=/tmp/cvd-images cvd fleet — instance is still there
  6. Repeat steps 4-5. Each cycle adds ~120 MiB GPU memory that is never freed.
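The GPU memory reading from step 3 can be wrapped in a small helper so that every cycle is measured the same way. This is a minimal sketch, not part of the original reproduction: the function names `sum_gpu_mib` and `gpu_used_mib` are hypothetical, and summing across devices is an assumption made because the host exposes two T4G GPUs.

```shell
# Hypothetical helper: sum the leading number of each line on stdin.
# Accepts nvidia-smi CSV output such as "417 MiB" (one line per GPU).
sum_gpu_mib() {
  awk '{ total += $1 } END { print total + 0 }'
}

# Hypothetical helper: total GPU memory in use across all devices, in MiB.
# Uses the same nvidia-smi query as step 3.
gpu_used_mib() {
  nvidia-smi --query-gpu=memory.used --format=csv,noheader | sum_gpu_mib
}

# Example use around one rotation (commands from step 4 go in between):
#   before=$(gpu_used_mib)
#   ... create / stop / remove ...
#   after=$(gpu_used_mib)
#   echo "cycle added $((after - before)) MiB"
```

Keeping the parsing in `sum_gpu_mib` separate from the `nvidia-smi` call makes it possible to check the arithmetic on a machine without a GPU.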

Observed Behavior

Cycle     Instance Count      GPU Memory
Baseline  5                   417 MiB
1         6 (should be 5)     536 MiB
2         7 (should be 5)     673 MiB
3         8 (should be 5)     794 MiB
4         9 (should be 5)     900 MiB
5         10 (should be 5)    1003 MiB
6         11 (should be 5)    1129 MiB
7         12 (should be 5)    1235 MiB
8         HUNG                cvd create blocks at "Starting"
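The per-cycle growth can be read directly off these numbers. The sketch below only does arithmetic on the readings reported above; the function names `per_cycle_growth` and `mean_growth` are hypothetical.

```shell
# Hypothetical helper: print the difference between consecutive readings.
# Arguments: baseline reading followed by each cycle's reading, in MiB.
per_cycle_growth() {
  printf '%s\n' "$@" | awk 'NR > 1 { print $1 - prev } { prev = $1 }'
}

# Hypothetical helper: mean growth per cycle, rounded to the nearest MiB.
mean_growth() {
  printf '%s\n' "$@" | awk '
    NR == 1 { first = $1 }
            { last = $1 }
    END     { if (NR > 1) printf "%.0f\n", (last - first) / (NR - 1) }'
}

# Readings from the baseline and cycles 1 through 7.
per_cycle_growth 417 536 673 794 900 1003 1129 1235
# prints 119, 137, 121, 106, 103, 126, 106 (one per line)

mean_growth 417 536 673 794 900 1003 1129 1235
# prints 117
```

The individual deltas range from 103 to 137 MiB, with none of the growth reclaimed between cycles.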

Expected Behavior

  • cvd stop --group_name=cvd_N stops the instance
  • cvd remove --group_name=cvd_N fully removes it from the fleet and releases GPU resources
  • Instance count remains constant across rotation cycles
  • GPU memory returns to baseline after remove
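The last expectation can be phrased as a check that a CI job could run after each rotation. This is only a sketch: the function name `within_baseline` and the tolerance value are hypothetical, and the readings are assumed to be plain MiB integers.

```shell
# Hypothetical check: succeed if the current reading has returned to
# the baseline, allowing a small tolerance for allocator noise.
# Arguments: baseline MiB, current MiB, tolerance MiB.
within_baseline() {
  baseline=$1
  current=$2
  tolerance=$3
  [ $((current - baseline)) -le "$tolerance" ]
}

# With the reported numbers, the reading after cycle 1 fails the check
# for an assumed tolerance of 16 MiB:
if within_baseline 417 536 16; then
  echo "GPU memory returned to baseline"
else
  echo "GPU memory did not return to baseline"
fi
```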

Key Observations

  1. Single-CVD rotation works fine — when no other instances are running, cvd stop + cvd remove properly releases GPU memory back to 0 MiB
  2. Multi-instance rotation leaks — the presence of other running CVDs prevents cleanup of the stopped one
  3. The issue was observed with --gpu_mode=gfxstream on ARM64 + NVIDIA. We did not test x86 or swiftshader modes, so we cannot say whether it is specific to this configuration.
  4. The cvd fleet output shows stopped instances remain with "status" : "Running" even after cvd stop completes
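Whether a group is still listed after cvd remove can be checked mechanically. The sketch below assumes that group names appear as quoted JSON strings in the cvd fleet output; the report only confirms the `"status" : "Running"` form, so treat the matching rule as an assumption. The function name `fleet_lists_group` is hypothetical.

```shell
# Hypothetical check: succeed if the fleet output on stdin mentions the
# given group name as a quoted string. Matching on the surrounding
# quotes keeps cvd_5 from matching cvd_50.
fleet_lists_group() {
  grep -qF -- "\"$1\""
}

# Example use after the remove in step 4:
#   if HOME=/tmp/cvd-images cvd fleet | fleet_lists_group cvd_5; then
#     echo "cvd_5 is still in the fleet"
#   fi
```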

Impact

This makes it impossible to run a test automation pipeline that creates/destroys CVDs in a loop (common pattern for mobile CI). GPU memory exhaustion causes the entire container to hang after a variable number of rotations depending on GPU capacity and concurrent instance count.

Workarounds (confirmed)

  • Use cvd powerwash instead of stop/remove/create to reset devices between test iterations
  • Restart the entire container periodically to release all GPU state
  • Limit rotation frequency and monitor GPU memory
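For the last two workarounds, a rough estimate of how many rotations remain before a restart is needed can be derived from the current reading and the observed growth. This is a sketch under stated assumptions: the function name `rotations_left` is hypothetical, the memory budget is a value the operator picks, and growth is treated as constant per cycle even though the measurements vary.

```shell
# Hypothetical estimate: whole rotations that fit before the budget is hit.
# Arguments: current MiB, budget MiB, expected growth per rotation in MiB.
rotations_left() {
  used=$1
  budget=$2
  growth=$3
  if [ "$growth" -le 0 ] || [ "$used" -ge "$budget" ]; then
    echo 0
    return
  fi
  echo $(( (budget - used) / growth ))
}

# Starting from the 417 MiB baseline with ~120 MiB per rotation and an
# assumed budget of 1235 MiB (the last reading before the hang):
rotations_left 417 1235 120
# prints 6
```

A pipeline could restart the container once this estimate reaches zero rather than waiting for cvd create to hang.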
