cvd remove fails silently with concurrent instances, causing GPU memory leak

## Summary

`cvd remove --group_name=cvd_N` silently fails to remove instances when other CVDs are running in the same container. Stopped instances remain in the fleet as "Running" and retain their gfxstream GPU contexts, so GPU memory grows monotonically with each create/stop/remove cycle until `cvd create` hangs.

## Environment

- **Image**: `us-docker.pkg.dev/android-cuttlefish-artifacts/cuttlefish-orchestration/cuttlefish-orchestration:stable`
- **cvd version**: 1.53.0 (VCS: b5244c8a7b84afd55a07b7617b63bf0beed0c735)
- **Host**: AWS g5g.metal (aarch64, 64 vCPU, 128 GB RAM, 2× NVIDIA T4G)
- **NVIDIA driver**: 580.159.03
- **OS**: Amazon Linux 2023 (EKS-optimized AL2023_ARM_64_NVIDIA AMI)
- **GPU mode**: gfxstream
- **Build**: `aosp-android-latest-release/aosp_cf_arm64_only_phone-userdebug` (build ID 15357239)

## Steps to Reproduce

1. Run the cuttlefish-orchestration container in privileged mode with GPU access
2. Launch 4+ CVDs with `--gpu_mode=gfxstream`:
   ```bash
   for i in 1 2 3 4; do
     HOME=/tmp/cvd-images cvd create \
       --host_path=/tmp/cvd-images --product_path=/tmp/cvd-images \
       --cpus=4 --memory_mb=4096 --gpu_mode=gfxstream \
       --x_res=540 --y_res=960 --dpi=240 --daemon \
       --enable_audio=false --report_anonymous_usage_stats=n \
       --modem_simulator_count=0
   done
   ```
3. Record baseline GPU memory: `nvidia-smi --query-gpu=memory.used --format=csv,noheader`
4. Rotate one CVD:
   ```bash
   # Create a new one
   HOME=/tmp/cvd-images cvd create --host_path=/tmp/cvd-images --product_path=/tmp/cvd-images \
     --cpus=4 --memory_mb=4096 --gpu_mode=gfxstream ...

   # Stop it
   HOME=/tmp/cvd-images cvd stop --group_name=cvd_5

   # Remove it
   HOME=/tmp/cvd-images cvd remove --group_name=cvd_5
   ```
5. Check the fleet with `HOME=/tmp/cvd-images cvd fleet`: the removed instance is still listed.
6. Repeat steps 4-5 with the group name of the newly created instance. Each cycle adds ~120 MiB of GPU memory that is never freed.
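
To make the measurements in steps 3 and 6 comparable on a host with more than one GPU, the per-GPU lines printed by `nvidia-smi` can be summed into one number. This is a minimal sketch; the helper name `sum_gpu_mib` is hypothetical, and whether the figures in this report are per-GPU or summed is an assumption.

```bash
# Sum per-GPU "memory.used" lines (e.g. "417 MiB") read from stdin
# into a single MiB total. Prints 0 when there is no input.
sum_gpu_mib() {
  awk '{ total += $1 } END { print total + 0 }'
}
```

Usage, assuming `nvidia-smi` is on the PATH: `nvidia-smi --query-gpu=memory.used --format=csv,noheader | sum_gpu_mib`.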

## Observed Behavior

| Cycle | Instance Count | GPU Memory |
|-------|---------------|------------|
| Baseline | 5 | 417 MiB |
| 1 | 6 (should be 5) | 536 MiB |
| 2 | 7 (should be 5) | 673 MiB |
| 3 | 8 (should be 5) | 794 MiB |
| 4 | 9 (should be 5) | 900 MiB |
| 5 | 10 (should be 5) | 1003 MiB |
| 6 | 11 (should be 5) | 1129 MiB |
| 7 | 12 (should be 5) | 1235 MiB |
| 8 | **HUNG** | — `cvd create` blocks at "Starting" |
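
The per-cycle growth can be derived from the table directly. The sketch below hard-codes the eight samples above (baseline plus cycles 1-7) and prints each delta and the mean; the function name `leak_summary` is hypothetical.

```bash
# GPU memory samples in MiB from the table above: baseline, then cycles 1-7.
samples="417 536 673 794 900 1003 1129 1235"

# Print the growth for each cycle and the mean growth across all cycles.
leak_summary() {
  echo "$1" | tr ' ' '\n' | awk '
    NR > 1 { delta = $1 - prev; sum += delta
             printf "cycle %d: +%d MiB\n", NR - 1, delta }
    { prev = $1 }
    END { if (NR > 1) printf "mean: %.1f MiB/cycle\n", sum / (NR - 1) }'
}

leak_summary "$samples"
```

The mean comes out at about 117 MiB per cycle, consistent with the ~120 MiB estimate in the reproduction steps.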

## Expected Behavior

- `cvd stop --group_name=cvd_N` stops the instance
- `cvd remove --group_name=cvd_N` fully removes it from the fleet and releases GPU resources
- Instance count remains constant across rotation cycles
- GPU memory returns to baseline after remove

## Key Observations

1. **Single-CVD rotation works fine** — when no other instances are running, `cvd stop` + `cvd remove` properly releases GPU memory back to 0 MiB
2. **Multi-instance rotation leaks** — the presence of other running CVDs prevents cleanup of the stopped one
3. The issue was observed with `--gpu_mode=gfxstream` on ARM64 + NVIDIA. We did not test x86 or swiftshader modes, so we cannot say whether it is specific to this configuration.
4. The `cvd fleet` output shows stopped instances remain with `"status" : "Running"` even after `cvd stop` completes
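
Observation 4 can be checked mechanically by counting the `"status" : "Running"` entries in the fleet output before and after a rotation. This is a rough sketch: the helper name `count_running` is hypothetical, and it matches on text rather than parsing the JSON, so it assumes only the key/value spelling quoted above.

```bash
# Count instances reported as "Running" in cvd fleet output read from stdin.
# Whitespace around the colon is allowed to vary.
count_running() {
  grep -o '"status"[[:space:]]*:[[:space:]]*"Running"' | wc -l | tr -d ' '
}
```

Usage: `HOME=/tmp/cvd-images cvd fleet | count_running`. If the count after `cvd stop` and `cvd remove` is higher than before the rotation, the instance was not cleaned up.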

## Impact

This makes it impossible to run a test automation pipeline that creates and destroys CVDs in a loop (a common pattern for mobile CI). GPU memory exhaustion causes the entire container to hang after a variable number of rotations, depending on GPU capacity and concurrent instance count.

## Workarounds (confirmed)

- Use `cvd powerwash` instead of stop/remove/create to reset devices between test iterations
- Restart the entire container periodically to release all GPU state
- Limit rotation frequency and monitor GPU memory
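
For the monitoring workaround, a pipeline can compare current GPU memory against the recorded baseline and stop rotating (or restart the container) once growth passes a budget. This is only a sketch: the function name `gpu_within_budget` and the 500 MiB budget in the usage note are hypothetical and would need tuning per host.

```bash
# Succeed (exit 0) when used GPU memory has grown by at most
# max_growth_mib over the baseline; fail (exit 1) otherwise.
# All three arguments are integers in MiB.
gpu_within_budget() {
  baseline_mib=$1
  used_mib=$2
  max_growth_mib=$3
  [ $(( used_mib - baseline_mib )) -le "$max_growth_mib" ]
}
```

Usage: `gpu_within_budget 417 "$current_mib" 500 || echo "GPU budget exceeded, restart the container"`, where `current_mib` is a fresh reading from `nvidia-smi`.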
