Summary
cvd remove --group_name=cvd_N silently fails to remove instances when other CVDs are running in the same container. Stopped instances remain in the fleet as "Running", retain their gfxstream GPU contexts, and GPU memory grows monotonically with each create/stop/remove cycle until cvd create hangs.
Environment
- Image: us-docker.pkg.dev/android-cuttlefish-artifacts/cuttlefish-orchestration/cuttlefish-orchestration:stable
- cvd version: 1.53.0 (VCS: b5244c8a7b84afd55a07b7617b63bf0beed0c735)
- Host: AWS g5g.metal (aarch64, 64 vCPU, 128 GB RAM, 2× NVIDIA T4G)
- NVIDIA driver: 580.159.03
- OS: Amazon Linux 2023 (EKS-optimized AL2023_ARM_64_NVIDIA AMI)
- GPU mode: gfxstream
- Build: aosp-android-latest-release/aosp_cf_arm64_only_phone-userdebug (build ID 15357239)
Steps to Reproduce
- Run the cuttlefish-orchestration container in privileged mode with GPU access
- Launch 4+ CVDs with --gpu_mode=gfxstream:
for i in 1 2 3 4; do
  HOME=/tmp/cvd-images cvd create \
    --host_path=/tmp/cvd-images --product_path=/tmp/cvd-images \
    --cpus=4 --memory_mb=4096 --gpu_mode=gfxstream \
    --x_res=540 --y_res=960 --dpi=240 --daemon \
    --enable_audio=false --report_anonymous_usage_stats=n \
    --modem_simulator_count=0
done
- Record baseline GPU memory:
nvidia-smi --query-gpu=memory.used --format=csv,noheader
- Rotate one CVD:
# Create a new one
HOME=/tmp/cvd-images cvd create --host_path=/tmp/cvd-images --product_path=/tmp/cvd-images \
--cpus=4 --memory_mb=4096 --gpu_mode=gfxstream ...
# Stop it
HOME=/tmp/cvd-images cvd stop --group_name=cvd_5
# Remove it
HOME=/tmp/cvd-images cvd remove --group_name=cvd_5
- Check fleet:
HOME=/tmp/cvd-images cvd fleet
The removed instance is still listed.
- Repeat steps 4-5. Each cycle adds roughly 100-140 MiB of GPU memory (about 117 MiB on average across the cycles recorded under Observed Behavior) that is never freed.
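The steps above can be scripted so that every cycle logs the fleet size and GPU memory. This is a minimal sketch, not the exact script used for this report. It assumes that the flags elided as `...` in the rotation step match the launch loop, that each new group is auto-named `cvd_N` with N increasing by one per create (including after a failed remove), that counting `"status"` lines in the `cvd fleet` output approximates the instance count, and that `memory.used` should be summed across both GPUs. The function names and the `CVD_HOME` variable are invented for the sketch.

```shell
#!/bin/sh
# Sketch of the rotation loop from the reproduction steps; one CSV line per cycle.
CVD_HOME="${CVD_HOME:-/tmp/cvd-images}"   # hypothetical variable, same path as above

# Sum memory.used over all GPUs, in MiB (assumption: the total is what matters).
gpu_used_mib() {
    nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits |
        awk '{ total += $1 } END { print total + 0 }'
}

# Approximate the instance count by counting "status" lines in the fleet output.
fleet_instance_count() {
    HOME="$CVD_HOME" cvd fleet 2>/dev/null | grep -c '"status"' || true
}

# Create one CVD, then stop and remove the group named $1.
# Assumption: the create flags are the same as in the launch loop.
rotate_once() {
    HOME="$CVD_HOME" cvd create \
        --host_path="$CVD_HOME" --product_path="$CVD_HOME" \
        --cpus=4 --memory_mb=4096 --gpu_mode=gfxstream \
        --x_res=540 --y_res=960 --dpi=240 --daemon \
        --enable_audio=false --report_anonymous_usage_stats=n \
        --modem_simulator_count=0 &&
    HOME="$CVD_HOME" cvd stop --group_name="$1" &&
    HOME="$CVD_HOME" cvd remove --group_name="$1"
}

# run_cycles FIRST COUNT: rotate COUNT groups, starting at cvd_FIRST.
run_cycles() {
    first="$1"
    count="$2"
    echo "cycle,group,instances,gpu_mib"
    echo "baseline,-,$(fleet_instance_count),$(gpu_used_mib)"
    i=0
    while [ "$i" -lt "$count" ]; do
        group="cvd_$((first + i))"
        rotate_once "$group" >/dev/null 2>&1 ||
            echo "rotation of $group reported an error" >&2
        i=$((i + 1))
        echo "$i,$group,$(fleet_instance_count),$(gpu_used_mib)"
    done
}
```

With four CVDs already running, `run_cycles 5 8` would attempt eight rotations starting at cvd_5. Because cvd create eventually blocks, running the script under a coreutils `timeout` may be worthwhile.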
Observed Behavior
| Cycle | Instance Count | GPU Memory |
|---|---|---|
| Baseline | 5 | 417 MiB |
| 1 | 6 (should be 5) | 536 MiB |
| 2 | 7 (should be 5) | 673 MiB |
| 3 | 8 (should be 5) | 794 MiB |
| 4 | 9 (should be 5) | 900 MiB |
| 5 | 10 (should be 5) | 1003 MiB |
| 6 | 11 (should be 5) | 1129 MiB |
| 7 | 12 (should be 5) | 1235 MiB |
| 8 | HUNG | cvd create blocks at "Starting" |
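To make the growth rate explicit, the per-cycle increase can be computed from the GPU memory column. The snippet below is only arithmetic on the readings in the table; the `gpu_growth` helper name is invented for the example.

```shell
#!/bin/sh
# GPU memory readings in MiB copied from the table: baseline, then cycles 1-7.
readings="417 536 673 794 900 1003 1129 1235"

# Print the increase for each cycle and the mean increase.
gpu_growth() {
    # $1 is left unquoted on purpose so each reading lands on its own line.
    printf '%s\n' $1 | awk '
        NR > 1 { d = $1 - prev; sum += d; printf "cycle %d: +%d MiB\n", NR - 1, d }
        { prev = $1 }
        END { if (NR > 1) printf "mean: %.1f MiB per cycle\n", sum / (NR - 1) }'
}

gpu_growth "$readings"
# cycle 1: +119 MiB
# cycle 2: +137 MiB
# cycle 3: +121 MiB
# cycle 4: +106 MiB
# cycle 5: +103 MiB
# cycle 6: +126 MiB
# cycle 7: +106 MiB
# mean: 116.9 MiB per cycle
```

The increase never drops to zero or goes negative, which is consistent with each rotated instance keeping its GPU allocation.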
Expected Behavior
- cvd stop --group_name=cvd_N stops the instance
- cvd remove --group_name=cvd_N fully removes it from the fleet and releases GPU resources
- Instance count remains constant across rotation cycles
- GPU memory returns to baseline after remove
Key Observations
- Single-CVD rotation works fine: when no other instances are running, cvd stop + cvd remove properly releases GPU memory back to 0 MiB
- Multi-instance rotation leaks: the presence of other running CVDs prevents cleanup of the stopped one
- The issue was observed with --gpu_mode=gfxstream on ARM64 + NVIDIA. We did not test x86 or swiftshader modes.
- The cvd fleet output still lists stopped instances with "status" : "Running" even after cvd stop completes
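The stale status in the last observation can be checked mechanically by comparing the number of "Running" entries before and after a stop. This is a rough sketch that relies only on the `"status" : "Running"` text quoted above. It assumes one instance per group, and the helper names and the `CVD_HOME` variable are invented for the example.

```shell
#!/bin/sh
CVD_HOME="${CVD_HOME:-/tmp/cvd-images}"   # hypothetical variable, same path as above

# Count fleet entries whose status line reads "Running".
running_count() {
    HOME="$CVD_HOME" cvd fleet 2>/dev/null |
        grep -c '"status" *: *"Running"' || true
}

# Stop the group named $1 and report whether the Running count dropped.
# Assumption: the group holds a single instance.
stop_and_verify() {
    before=$(running_count)
    HOME="$CVD_HOME" cvd stop --group_name="$1" >/dev/null 2>&1
    after=$(running_count)
    if [ "$after" -lt "$before" ]; then
        echo "ok: $1 stopped (Running: $before -> $after)"
    else
        echo "stale: $1 still counted as Running ($before -> $after)"
        return 1
    fi
}
```

In the multi-instance case described here, `stop_and_verify cvd_5` would be expected to print the "stale" line and return non-zero.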
Impact
This makes it impossible to run a test automation pipeline that creates/destroys CVDs in a loop (common pattern for mobile CI). GPU memory exhaustion causes the entire container to hang after a variable number of rotations depending on GPU capacity and concurrent instance count.
Workarounds (confirmed)
- Use cvd powerwash instead of stop/remove/create to reset devices between test iterations
- Restart the entire container periodically to release all GPU state
- Limit rotation frequency and monitor GPU memory
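For the monitoring workaround, a pipeline could gate each rotation on current GPU memory use. The following is a sketch under stated assumptions: `memory.used` is summed across all GPUs, and the limit is a value the operator picks (the 1100 MiB in the usage note is purely illustrative, chosen below the 1235 MiB reading that preceded the hang). The function names are invented for the example.

```shell
#!/bin/sh
# Sum memory.used over all GPUs, in MiB.
gpu_used_mib() {
    nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits |
        awk '{ total += $1 } END { print total + 0 }'
}

# Succeed only while total GPU memory use is below the limit in MiB ($1).
gpu_below_limit() {
    used=$(gpu_used_mib)
    if [ "$used" -lt "$1" ]; then
        return 0
    fi
    echo "GPU memory at $used MiB (limit $1 MiB): skip rotation or restart the container" >&2
    return 1
}
```

A CI job might call `gpu_below_limit 1100 || exit 1` before each create and fall back to a container restart when the check fails.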