CPU Utilization Performance Regression in Container v3.4.1 vs v3.2.5
Description
We're experiencing a significant performance regression after upgrading from DLT container image v3.2.5 to v3.4.1. Load tests that previously ran successfully with v3.2.5 now fail with CPU utilization >90%, and containers are forcefully shut down.
Environment Details
- Current Working Version: v3.2.5
- Problematic Version: v3.4.1
- AWS Region: Multiple regions tested
- ECS Configuration:
  - Fargate platform
  - 8 vCPU, 20GB RAM
  - Container Insights enabled
- Test Scale: 5,000 concurrent users
- Test Duration: Extended load tests (>30 minutes)
Issue Details
What Works (v3.2.5)
- 5K user load tests complete successfully
- CPU utilization remains <70% throughout test
- No container shutdowns or timeouts
- Stable performance across multiple test runs
What Fails (v3.4.1)
- Same 5K user load tests fail with CPU >90%
- Containers are forcefully shut down due to resource exhaustion
- Tests that previously took 30+ minutes now fail within 10-15 minutes
- Issue occurs consistently across multiple attempts
Root Cause Analysis Performed
1. CloudWatch Logs Investigation
We identified excessive S3 HeadObject operations causing CPU spikes:
ERROR - botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: The specified key does not exist.
Key: Container_IPs/testId_IPHOSTS_region.txt
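To put a number on this, the error volume over a test window can be counted by filtering the container log group directly; a minimal Node.js (AWS SDK v2) sketch, where the log group name and time window are assumptions about the specific deployment:

// Count HeadObject 404 errors emitted by the load-test containers during a test window.
// NOTE: the log group name is an assumption; substitute the one created by your stack.
const AWS = require('aws-sdk');
const logs = new AWS.CloudWatchLogs({ region: process.env.AWS_REGION });

async function countHeadObject404s(logGroupName, startTime, endTime) {
  let count = 0;
  let nextToken;
  do {
    const resp = await logs.filterLogEvents({
      logGroupName,                        // e.g. the ECS log group for the DLT containers
      filterPattern: '"HeadObject" "404"', // matches the botocore ClientError lines above
      startTime,
      endTime,
      nextToken
    }).promise();
    count += resp.events.length;
    nextToken = resp.nextToken;
  } while (nextToken);
  return count;
}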
2. Container Behavior Changes
Between v3.2.5 and v3.4.1, the container now expects Container_IPs files in S3 for multi-task coordination:
- v3.2.5: Uses environment variables (IPNETWORK/IPHOSTS) as fallback
- v3.4.1: Aggressively polls S3 for Container_IPs files, causing 404 errors and retry loops
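To illustrate the difference, this is the kind of launch-time wiring the v3.2.5 fallback allows: the worker host list is handed to the task as plain environment variables rather than read from S3. This is a hypothetical sketch of the mechanism (AWS SDK v2 ecs.runTask), not the shipped DLT code; all environment variable names other than IPNETWORK/IPHOSTS, and the IPHOSTS value format, are assumptions:

// Hypothetical illustration: passing the worker host list to a task as environment
// variables at launch (the v3.2.5-era fallback path) instead of having containers poll S3.
const AWS = require('aws-sdk');
const ecs = new AWS.ECS({ region: process.env.AWS_REGION });

await ecs.runTask({
  cluster: process.env.TASK_CLUSTER,            // assumption: cluster name exposed to the Lambda
  taskDefinition: process.env.TASK_DEFINITION,  // assumption
  launchType: 'FARGATE',
  networkConfiguration: {
    awsvpcConfiguration: {
      subnets: [process.env.SUBNET_A],                  // assumptions: VPC wiring from the stack
      securityGroups: [process.env.ECS_SECURITY_GROUP]
    }
  },
  overrides: {
    containerOverrides: [{
      name: process.env.TASK_IMAGE,             // container name within the task definition
      environment: [
        { name: 'IPNETWORK', value: '10.0.0.0/24' },      // illustrative value
        { name: 'IPHOSTS', value: ipAddresses.join(',') } // worker IPs gathered by the Lambda;
                                                          // exact separator/format not verified
      ]
    }]
  }
}).promise();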
3. Workaround Implemented
We've temporarily resolved the issue by creating the expected Container_IPs files:
// In the task-runner Lambda: create the Container_IPs file v3.4.1 expects, for compatibility.
// `s3` is the Lambda's existing AWS SDK v2 client (const s3 = new AWS.S3();).
if (ipAddresses.length > 1) {
  const containerIpsContent = ipAddresses.join('\n');
  await s3.putObject({
    Bucket: process.env.SCENARIOS_BUCKET,
    Key: `Container_IPs/${testId}_IPHOSTS_${region}.txt`,
    Body: containerIpsContent,
    ContentType: 'text/plain'
  }).promise();
}
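A sanity check could be added right after the upload to confirm the object actually landed before the ECS tasks start; a sketch using the same SDK v2 client, with the NotFound handling and warning message only illustrative:

// Confirm the Container_IPs object exists so v3.4.1 never enters the 404/retry path.
const key = `Container_IPs/${testId}_IPHOSTS_${region}.txt`;
const exists = await s3.headObject({ Bucket: process.env.SCENARIOS_BUCKET, Key: key })
  .promise()
  .then(() => true)
  .catch((err) => (err.code === 'NotFound' ? false : Promise.reject(err)));
if (!exists) {
  console.warn(`Container_IPs file missing for ${testId} in ${region}; containers will poll S3`);
}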
Performance Impact Analysis
Container Resource Changes (Dockerfile comparison)
v3.2.5 Base:
- Base image: blazemeter/taurus:1.16.27
- Python: 3.10
- JDK: OpenJDK 11/17
v3.4.1 Changes:
- Base image: amazonlinux:2023-minimal
- Python: 3.11
- JDK: java-21-amazon-corretto
- Added K6 and Locust framework support
Suspected Performance Issues
- S3 Retry Loops: Missing Container_IPs files cause aggressive S3 polling
- JDK 21 Performance: Potential JVM tuning incompatibility with new JDK version
- Framework Detection Overhead: New framework detection logic may impact performance
- Base Image Overhead: Amazon Linux 2023 vs optimized Taurus image
JVM Configuration Attempts
We've tried various JVM optimizations for JDK 21:
JAVA_OPTS="-Xms4g -Xmx12g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+DisableExplicitGC -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler"
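Since we have not confirmed that Amazon Corretto 21 actually ships a JVMCI compiler, a more conservative baseline without the experimental flags may also be worth comparing; the values below simply mirror the 8 vCPU / 20GB task size and are a suggestion, not a validated configuration:

JAVA_OPTS="-Xms4g -Xmx12g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:ActiveProcessorCount=8 -XX:+AlwaysPreTouch"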
Expected Behavior
Container v3.4.1 should maintain the same performance characteristics as v3.2.5 for equivalent test loads, especially when provided with adequate resources (8 vCPU, 20GB RAM).
Actual Behavior
CPU utilization spikes to >90%, causing container shutdowns and making the solution unusable for production load testing scenarios.
Steps to Reproduce
1. Deploy the DLT solution with container image v3.4.1
2. Configure the ECS task with 8 vCPU, 20GB RAM
3. Run a load test with 5,000 concurrent users for 30+ minutes
4. Observe CPU utilization metrics in CloudWatch (see the sketch after this list)
5. The container is terminated due to resource exhaustion
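For step 4, the utilization can also be pulled programmatically rather than read from the console; a Node.js (AWS SDK v2) sketch, where the Container Insights metric and dimension choice are assumptions that may need adjusting for the cluster at hand:

// Pull cluster-level CPU usage from ECS Container Insights for a given test window.
// Assumes Container Insights is enabled (it is in our environment) and that the
// cluster name is known to the caller.
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch({ region: process.env.AWS_REGION });

async function fetchCpuStats(clusterName, startTime, endTime) {
  const resp = await cloudwatch.getMetricStatistics({
    Namespace: 'ECS/ContainerInsights',
    MetricName: 'CpuUtilized',               // CPU units consumed; compare against CpuReserved
    Dimensions: [{ Name: 'ClusterName', Value: clusterName }],
    StartTime: startTime,
    EndTime: endTime,
    Period: 60,
    Statistics: ['Average', 'Maximum']
  }).promise();
  return resp.Datapoints.sort((a, b) => a.Timestamp - b.Timestamp);
}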
Workaround Status
- ✅ Container_IPs file creation: Resolves S3 404 errors
- ❌ CPU performance: Still experiencing high CPU utilization
- ⚠️ Production impact: Tests that worked in v3.2.5 cannot run in v3.4.1
Request for AWS Team
- Performance regression investigation: Why does v3.4.1 consume significantly more CPU than v3.2.5 for identical workloads?
- JDK 21 optimization guidance: Are there recommended JVM parameters for large-scale load testing with the new OpenJDK 21?
- Framework detection overhead: Does the new K6/Locust framework detection impact JMeter performance?
- Container_IPs documentation: Should the Lambda functions automatically create these files, or is this expected user configuration?
- Performance benchmarks: Are there known performance differences between v3.2.5 and v3.4.1 that users should be aware of?
Additional Context
- We maintain extensive customizations to the DLT UI and Lambda functions
- Cannot upgrade to the latest DLT version due to custom code dependencies
- Need to upgrade container image only for security patches while maintaining performance
- This affects production load testing capabilities for enterprise applications
Priority: High - Blocks production load testing capabilities
Impact: Performance regression prevents using updated container versions
Workaround: Partial (resolves S3 errors but not CPU performance)