Add AWS nested virtualization test harness#100

Closed
rgarcia wants to merge 2 commits into main from rgarcia/aws-nested-virt-test

Conversation


@rgarcia rgarcia commented Feb 13, 2026

Summary

  • Adds a Go test program at tests/aws/ that launches c8id instances with AWS's new nested virtualization feature (CpuOptions.NestedVirtualization=enabled), installs hypeman, and benchmarks VM spin-up times and CoreMark performance
  • Supports both regular nested-virt instances and bare metal for comparison, with auto-detection of Debian AMIs, security groups, and subnets
  • Runs smoke tests with both Cloud Hypervisor and QEMU hypervisors, plus CoreMark CPU benchmarks on host and inside VMs

Results

| Metric | Bare Metal (c8id.metal-48xl) | Nested Virt (c8id.4xlarge) |
| --- | --- | --- |
| Boot to Running | 18s | 18s |
| Boot to SSH | 1m 29s | 48s |
| Hypeman Install | 4m 19s | 2m 9s |
| Cloud Hypervisor Launch | 250ms | 191ms |
| QEMU Launch | 410ms | 318ms |
| CoreMark Host | 33,300 iter/s | 32,806 iter/s |
| CoreMark VM | 32,854 iter/s (1.3% overhead) | N/A (L2 VMs crash) |

Key Findings

  1. Nested virt works for L1: Both Cloud Hypervisor and QEMU VMs launch and run successfully inside nested-virt instances. Host CPU performance is within ~1.5% of bare metal.
  2. L2 VMs (VM-inside-VM) crash immediately: Both Cloud Hypervisor (enters Shutdown state) and QEMU (socket connection refused; the process exits) fail when launching a VM inside a VM on nested-virt instances.
  3. Faster time-to-first-VM: Nested virt instances reach SSH ~40s faster than bare metal and install hypeman ~2 min faster, likely because bare metal instances spend extra time initializing hardware at boot.
  4. Same per-vCPU pricing: No premium for nested virtualization, but you can use smaller instance sizes (e.g., c8id.4xlarge at ~$0.88/hr vs c8id.metal-48xl at ~$10.56/hr) if you only need a few VMs.
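The 1.3% overhead figure in the table comes straight from the CoreMark scores; a quick sketch of the arithmetic:

```go
package main

import "fmt"

// coremarkOverhead returns the percentage slowdown of a VM score
// relative to the host score.
func coremarkOverhead(host, vm float64) float64 {
	return (host - vm) / host * 100
}

func main() {
	// Bare-metal numbers from the results table above.
	host := 33300.0 // CoreMark on the host, iter/s
	vm := 32854.0   // CoreMark inside the L1 VM, iter/s
	fmt.Printf("L1 VM overhead: %.1f%%\n", coremarkOverhead(host, vm)) // → 1.3%
}
```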

Usage

```sh
cd tests/aws
go run main.go \
  --instance-type c8id.4xlarge \
  --key-name <your-key> \
  --key-path <path-to-pem> \
  --profile <aws-profile>
```

Test plan

  • Tested on c8id.4xlarge (nested virt) — smoke tests pass, CoreMark host runs, VM benchmark gracefully reports N/A
  • Tested on c8id.metal-48xl (bare metal) — smoke tests pass, CoreMark host + VM both run, 1.3% overhead measured
  • Verified cleanup: instances terminated and security groups deleted after each run

🤖 Generated with Claude Code


Note

Medium Risk
Adds new code that programmatically creates and tears down AWS resources (instances/security groups) and executes remote commands, so misconfiguration can lead to unexpected cost or security exposure if run improperly, though it’s isolated to tests/aws/.

Overview
Adds a standalone Go-based AWS test harness under tests/aws/ that provisions an EC2 instance (optionally enabling CpuOptions.NestedVirtualization), installs hypeman via cloud-init, and validates /dev/kvm + service health over SSH.

The harness can auto-resolve a Debian 12 AMI and default subnet, create a temporary SSH-only security group, run smoke tests for both cloud-hypervisor and QEMU (using hypeman exec for verification), and optionally run VM launch latency and CoreMark benchmarks before cleaning up resources (or keeping the instance via --keep).

Written by Cursor Bugbot for commit 9a1f936. This will update automatically on new commits.

Go program that launches a c8id instance with nested virtualization
(CpuOptions.NestedVirtualization=enabled), installs hypeman, and runs
smoke tests with both Cloud Hypervisor and QEMU, plus CoreMark benchmarks.

Key findings from testing:
- c8id.4xlarge: 18s boot, 48s SSH, ~2m install, CH 191ms, QEMU 318ms
- c8id.metal-48xl: 18s boot, 1m29s SSH, ~4m install, CH 250ms, QEMU 410ms
- Host CoreMark: ~32,800-33,300 iter/s (bare metal vs nested nearly identical)
- L1 VM CoreMark on bare metal: 32,854 iter/s (1.3% overhead)
- L2 VMs (VM-inside-VM) crash immediately on nested virt instances
  (both Cloud Hypervisor and QEMU — the QEMU process exits, socket refused)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```go
// Treat as warning if we at least got the host score.
if hostScore > 0 {
	logf("CoreMark VM benchmark failed (host score available): %v", err)
} else {
```

VM benchmark failures are silently ignored

Medium Severity

run() treats any runCoreMark VM-side failure as a warning whenever hostScore > 0, so the program exits successfully even when the VM benchmark is broken. This hides real regressions in hypeman/hypervisor behavior and can report misleading benchmark results from tests/aws/main.go.



The rmmod/modprobe approach for disabling APICv corrupts VMX state and
makes VM crashes significantly more frequent. Replace with modprobe.d
config that takes effect on reboot.

Through extensive testing, identified the root cause of nested virt VM
crashes: TAP networking triggers a Nitro hypervisor bug where VMCS
VM-Exit interrupt info is set to 0xffffffff. VMs without TAP (user-mode
networking, vsock-only) work fine.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
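The modprobe.d change described in the commit above might look like the following (the filename is illustrative; `enable_apicv` is the kvm_intel module parameter that controls APICv, and the option only takes effect when the module is loaded at boot):

```sh
# /etc/modprobe.d/kvm-nested-virt.conf (illustrative path)
# Disable APICv at kvm_intel load time rather than rmmod/modprobe
# on a live system, which corrupts VMX state.
options kvm_intel enable_apicv=0
```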

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.

Bugbot Autofix is ON. A Cloud Agent has been kicked off to fix the reported issues.


```go
if len(chTimes) == 0 && len(qemuTimes) == 0 {
	return nil, nil, fmt.Errorf("all launch benchmark iterations failed")
}
```

Partial benchmark failures misreported as valid stats

Medium Severity

runLaunchBenchmark only returns an error when both result slices are empty. If one hypervisor has zero successful iterations, computeStats gets an empty slice and returns zero durations, and the summary prints 0s metrics as if they were real benchmark results.


```go
	continue
}
if strings.Contains(full, "ready") {
	return nil
```

Image readiness check can return too early

Low Severity

waitForImageReady verifies that `alpine` appears in the `hypeman image list -q` output, but then returns success if the full image list contains "ready" anywhere. A different image being ready can satisfy this check while alpine is still not ready.


```go
	GroupId: aws.String(createdSGID),
}); err != nil {
	logf("Warning: failed to delete security group: %v", err)
}
```

Keep mode leaks temporary security groups

Low Severity

When --keep is set, the code keeps the instance but still tries to delete createdSGID. AWS rejects deleting a security group attached to a running instance, so the temporary group is left behind. This causes predictable resource leakage in keep-mode runs.



cursor bot commented Feb 13, 2026

Bugbot Autofix prepared fixes for 3 of the 3 bugs found in the latest run.

  • ✅ Fixed: Partial benchmark failures misreported as valid stats
    • Added guards to only compute and display stats for hypervisors with non-empty result slices, showing explicit failure messages instead of misleading 0s values.
  • ✅ Fixed: Image readiness check can return too early
    • Changed the readiness check to verify that the specific line containing 'alpine' also contains 'ready', instead of checking for 'ready' anywhere in the full image list.
  • ✅ Fixed: Keep mode leaks temporary security groups
    • When --keep is set and the instance is kept running, the security group deletion is now skipped entirely since the running instance still references it.


Or push these changes by commenting:

@cursor push f75d0b7a3c
Preview (f75d0b7a3c)
diff --git a/tests/aws/main.go b/tests/aws/main.go
--- a/tests/aws/main.go
+++ b/tests/aws/main.go
@@ -99,19 +99,25 @@
 			logf("Keeping instance %s (--keep flag set)", instanceID)
 		}
 		if createdSGID != "" {
-			if instanceID != "" && !*keep {
-				logf("Waiting for instance to terminate before deleting security group...")
-				w := ec2.NewInstanceTerminatedWaiter(svc)
-				_ = w.Wait(cleanCtx, &ec2.DescribeInstancesInput{
-					InstanceIds: []string{instanceID},
-				}, 5*time.Minute)
+			if instanceID != "" && *keep {
+				// Instance is kept running and still references the SG;
+				// attempting to delete it would fail, so skip cleanup.
+				logf("Keeping security group %s (attached to kept instance %s)", createdSGID, instanceID)
+			} else {
+				if instanceID != "" {
+					logf("Waiting for instance to terminate before deleting security group...")
+					w := ec2.NewInstanceTerminatedWaiter(svc)
+					_ = w.Wait(cleanCtx, &ec2.DescribeInstancesInput{
+						InstanceIds: []string{instanceID},
+					}, 5*time.Minute)
+				}
+				logf("Deleting security group %s...", createdSGID)
+				if _, err := svc.DeleteSecurityGroup(cleanCtx, &ec2.DeleteSecurityGroupInput{
+					GroupId: aws.String(createdSGID),
+				}); err != nil {
+					logf("Warning: failed to delete security group: %v", err)
+				}
 			}
-			logf("Deleting security group %s...", createdSGID)
-			if _, err := svc.DeleteSecurityGroup(cleanCtx, &ec2.DeleteSecurityGroupInput{
-				GroupId: aws.String(createdSGID),
-			}); err != nil {
-				logf("Warning: failed to delete security group: %v", err)
-			}
 		}
 	}()
 
@@ -403,31 +409,39 @@
 			logf("Launch benchmark failed: %v", err)
 			return 1
 		}
-		chStats := computeStats(chLaunchTimes)
-		qemuStats := computeStats(qemuLaunchTimes)
 
 		fmt.Println()
 		logf("VM Launch Benchmark (50 iterations):")
-		logf("  Cloud Hypervisor: median=%s avg=%s min=%s max=%s p95=%s",
-			chStats.median.Round(time.Millisecond), chStats.avg.Round(time.Millisecond),
-			chStats.min.Round(time.Millisecond), chStats.max.Round(time.Millisecond),
-			chStats.p95.Round(time.Millisecond))
-		logf("  QEMU:             median=%s avg=%s min=%s max=%s p95=%s",
-			qemuStats.median.Round(time.Millisecond), qemuStats.avg.Round(time.Millisecond),
-			qemuStats.min.Round(time.Millisecond), qemuStats.max.Round(time.Millisecond),
-			qemuStats.p95.Round(time.Millisecond))
 
-		// Store stats for final summary.
-		timings["ch_median"] = chStats.median
-		timings["ch_avg"] = chStats.avg
-		timings["ch_min"] = chStats.min
-		timings["ch_max"] = chStats.max
-		timings["ch_p95"] = chStats.p95
-		timings["qemu_median"] = qemuStats.median
-		timings["qemu_avg"] = qemuStats.avg
-		timings["qemu_min"] = qemuStats.min
-		timings["qemu_max"] = qemuStats.max
-		timings["qemu_p95"] = qemuStats.p95
+		if len(chLaunchTimes) > 0 {
+			chStats := computeStats(chLaunchTimes)
+			logf("  Cloud Hypervisor: median=%s avg=%s min=%s max=%s p95=%s",
+				chStats.median.Round(time.Millisecond), chStats.avg.Round(time.Millisecond),
+				chStats.min.Round(time.Millisecond), chStats.max.Round(time.Millisecond),
+				chStats.p95.Round(time.Millisecond))
+			timings["ch_median"] = chStats.median
+			timings["ch_avg"] = chStats.avg
+			timings["ch_min"] = chStats.min
+			timings["ch_max"] = chStats.max
+			timings["ch_p95"] = chStats.p95
+		} else {
+			logf("  Cloud Hypervisor: all iterations failed — no stats available")
+		}
+
+		if len(qemuLaunchTimes) > 0 {
+			qemuStats := computeStats(qemuLaunchTimes)
+			logf("  QEMU:             median=%s avg=%s min=%s max=%s p95=%s",
+				qemuStats.median.Round(time.Millisecond), qemuStats.avg.Round(time.Millisecond),
+				qemuStats.min.Round(time.Millisecond), qemuStats.max.Round(time.Millisecond),
+				qemuStats.p95.Round(time.Millisecond))
+			timings["qemu_median"] = qemuStats.median
+			timings["qemu_avg"] = qemuStats.avg
+			timings["qemu_min"] = qemuStats.min
+			timings["qemu_max"] = qemuStats.max
+			timings["qemu_p95"] = qemuStats.p95
+		} else {
+			logf("  QEMU:             all iterations failed — no stats available")
+		}
 	} else {
 		logf("Skipping smoke test (--skip-smoke)")
 	}
@@ -466,23 +480,33 @@
 	if _, ok := timings["smoke"]; ok {
 		fmt.Printf("  Launch -> Smoke Test:   %s\n", timings["smoke"].Round(time.Second))
 	}
-	if _, ok := timings["ch_median"]; ok {
+	_, hasCH := timings["ch_median"]
+	_, hasQEMU := timings["qemu_median"]
+	if hasCH || hasQEMU {
 		fmt.Println()
 		fmt.Println("VM Launch Latency (50 iterations):")
-		fmt.Println("  Cloud Hypervisor:")
-		fmt.Printf("    Median: %s  Avg: %s  Min: %s  Max: %s  P95: %s\n",
-			timings["ch_median"].Round(time.Millisecond),
-			timings["ch_avg"].Round(time.Millisecond),
-			timings["ch_min"].Round(time.Millisecond),
-			timings["ch_max"].Round(time.Millisecond),
-			timings["ch_p95"].Round(time.Millisecond))
-		fmt.Println("  QEMU:")
-		fmt.Printf("    Median: %s  Avg: %s  Min: %s  Max: %s  P95: %s\n",
-			timings["qemu_median"].Round(time.Millisecond),
-			timings["qemu_avg"].Round(time.Millisecond),
-			timings["qemu_min"].Round(time.Millisecond),
-			timings["qemu_max"].Round(time.Millisecond),
-			timings["qemu_p95"].Round(time.Millisecond))
+		if hasCH {
+			fmt.Println("  Cloud Hypervisor:")
+			fmt.Printf("    Median: %s  Avg: %s  Min: %s  Max: %s  P95: %s\n",
+				timings["ch_median"].Round(time.Millisecond),
+				timings["ch_avg"].Round(time.Millisecond),
+				timings["ch_min"].Round(time.Millisecond),
+				timings["ch_max"].Round(time.Millisecond),
+				timings["ch_p95"].Round(time.Millisecond))
+		} else {
+			fmt.Println("  Cloud Hypervisor: all iterations failed")
+		}
+		if hasQEMU {
+			fmt.Println("  QEMU:")
+			fmt.Printf("    Median: %s  Avg: %s  Min: %s  Max: %s  P95: %s\n",
+				timings["qemu_median"].Round(time.Millisecond),
+				timings["qemu_avg"].Round(time.Millisecond),
+				timings["qemu_min"].Round(time.Millisecond),
+				timings["qemu_max"].Round(time.Millisecond),
+				timings["qemu_p95"].Round(time.Millisecond))
+		} else {
+			fmt.Println("  QEMU: all iterations failed")
+		}
 	}
 	if !*skipBenchmark && hostScore > 0 {
 		fmt.Println()
@@ -849,7 +873,16 @@
 				if err != nil {
 					continue
 				}
-				if strings.Contains(full, "ready") {
+				// Check that the specific alpine line shows "ready",
+				// not just any image in the list.
+				alpineReady := false
+				for _, line := range strings.Split(full, "\n") {
+					if strings.Contains(line, "alpine") && strings.Contains(line, "ready") {
+						alpineReady = true
+						break
+					}
+				}
+				if alpineReady {
 					return nil
 				}
 				logf("Image status: %s", strings.TrimSpace(full))

@rgarcia rgarcia closed this Feb 15, 2026