
CPU Utilization Performance Regression in Container v3.4.1 vs v3.2.5 #257

@smkorde

Description

We're experiencing a significant performance regression after upgrading from DLT container image v3.2.5 to v3.4.1. Load tests that previously ran successfully on v3.2.5 now fail with CPU utilization >90% and containers being forcibly shut down.

Environment Details

  • Current Working Version: v3.2.5
  • Problematic Version: v3.4.1
  • AWS Region: Multiple regions tested
  • ECS Configuration:
    • Fargate platform
    • 8 vCPU, 20GB RAM
    • Container Insights enabled
  • Test Scale: 5,000 concurrent users
  • Test Duration: Extended load tests (>30 minutes)

Issue Details

What Works (v3.2.5)

  • 5K user load tests complete successfully
  • CPU utilization remains <70% throughout test
  • No container shutdowns or timeouts
  • Stable performance across multiple test runs

What Fails (v3.4.1)

  • Same 5K user load tests fail with CPU >90%
  • Containers are forcibly shut down due to resource exhaustion
  • Tests that previously took 30+ minutes now fail within 10-15 minutes
  • Issue occurs consistently across multiple attempts

Root Cause Analysis Performed

1. CloudWatch Logs Investigation

We identified excessive S3 HeadObject operations causing CPU spikes:

ERROR - botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: The specified key does not exist.
Key: Container_IPs/testId_IPHOSTS_region.txt

2. Container Behavior Changes

Between v3.2.5 and v3.4.1, the container now expects Container_IPs files in S3 for multi-task coordination:

  • v3.2.5: Uses environment variables (IPNETWORK/IPHOSTS) as fallback
  • v3.4.1: Aggressively polls S3 for Container_IPs files, causing 404 errors and retry loops (a sketch of this pattern follows below)
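
For illustration, here is a minimal sketch of the bounded polling-with-fallback pattern we would expect, in Node.js against the same aws-sdk v2 S3 client style as the workaround below. The key format and the SCENARIOS_BUCKET/IPHOSTS names mirror what appears elsewhere in this issue; the function name, attempt count, and backoff values are ours for illustration only and are not taken from the actual container code.

// Illustrative sketch only - not the actual container entrypoint code.
// Polls for the Container_IPs file with bounded retries and exponential
// backoff, then falls back to the IPHOSTS environment variable (the
// v3.2.5-style behavior described above).
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function resolveHostIps(testId, region, maxAttempts = 5) {
  const params = {
    Bucket: process.env.SCENARIOS_BUCKET,
    Key: `Container_IPs/${testId}_IPHOSTS_${region}.txt`
  };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await s3.headObject(params).promise();            // 404s if the key is missing
      const obj = await s3.getObject(params).promise();
      return obj.Body.toString('utf-8').split('\n').filter(Boolean);
    } catch (err) {
      if (err.statusCode !== 404) throw err;            // only retry on "key not found"
      // Back off between attempts instead of retrying in a tight loop,
      // which is what we suspect is burning CPU in v3.4.1.
      await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 1000));
    }
  }
  // Fall back to the IPHOSTS environment variable (format assumed here to be
  // comma-separated), as v3.2.5 reportedly did.
  return (process.env.IPHOSTS || '').split(',').filter(Boolean);
}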

3. Workaround Implemented

We've temporarily resolved the 404 errors by pre-creating the expected Container_IPs files:

// In the task-runner Lambda: create Container_IPs files for v3.4.1 compatibility.
// s3 is an aws-sdk v2 S3 client (assumed; matches the .promise() style below).
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// ipAddresses, testId, and region come from the task-runner's existing logic.
if (ipAddresses.length > 1) {
  const containerIpsContent = ipAddresses.join('\n');
  await s3.putObject({
    Bucket: process.env.SCENARIOS_BUCKET,
    Key: `Container_IPs/${testId}_IPHOSTS_${region}.txt`,
    Body: containerIpsContent,
    ContentType: 'text/plain'
  }).promise();
}
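
One note on this workaround: the task-runner Lambda's execution role needs s3:PutObject on the Container_IPs/ prefix of the scenarios bucket, otherwise the putObject call above fails and the 404 retry loop persists.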

Performance Impact Analysis

Container Resource Changes (Dockerfile comparison)

v3.2.5 Base:

  • Base image: blazemeter/taurus:1.16.27
  • Python: 3.10
  • JDK: OpenJDK 11/17

v3.4.1 Changes:

  • Base image: amazonlinux:2023-minimal
  • Python: 3.11
  • JDK: java-21-amazon-corretto
  • Added K6 and Locust framework support

Suspected Performance Issues

  1. S3 Retry Loops: Missing Container_IPs files cause aggressive S3 polling
  2. JDK 21 Performance: Potential JVM tuning incompatibility with new JDK version
  3. Framework Detection Overhead: New framework detection logic may impact performance
  4. Base Image Overhead: Amazon Linux 2023 vs optimized Taurus image

JVM Configuration Attempts

We've tried various JVM optimizations for JDK 21:

JAVA_OPTS="-Xms4g -Xmx12g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+DisableExplicitGC -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler"

Expected Behavior

Container v3.4.1 should maintain the same performance characteristics as v3.2.5 for equivalent test loads, especially when provided with adequate resources (8 vCPU, 20GB RAM).

Actual Behavior

CPU utilization spikes to >90%, causing container shutdown and making the solution unusable for production load testing scenarios.

Steps to Reproduce

  1. Deploy DLT solution with container image v3.4.1
  2. Configure ECS task with 8 vCPU, 20GB RAM
  3. Run load test with 5,000 concurrent users for 30+ minutes
  4. Observe CPU utilization metrics in CloudWatch (a sample metrics query is sketched after this list)
  5. Container will be terminated due to resource exhaustion
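
For step 4, a rough sketch of pulling the cluster-level CPU numbers from Container Insights with the aws-sdk v2 CloudWatch client; the cluster name, time window, and period below are placeholders on our side, not DLT defaults.

// Illustrative sketch: read Container Insights CPU metrics for the DLT cluster.
// Requires Container Insights to be enabled on the cluster (it is in our setup).
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

async function getCpuDatapoints(clusterName, startTime, endTime) {
  const { Datapoints } = await cloudwatch.getMetricStatistics({
    Namespace: 'ECS/ContainerInsights',
    MetricName: 'CpuUtilized',                 // CPU units used; divide by CpuReserved for a percentage
    Dimensions: [{ Name: 'ClusterName', Value: clusterName }],
    StartTime: startTime,
    EndTime: endTime,
    Period: 60,                                // 1-minute datapoints
    Statistics: ['Average', 'Maximum']
  }).promise();
  return Datapoints.sort((a, b) => a.Timestamp - b.Timestamp);
}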

Workaround Status

  • Container_IPs file creation: Resolves S3 404 errors
  • CPU performance: Still experiencing high CPU utilization
  • ⚠️ Production impact: Tests that worked in v3.2.5 cannot run in v3.4.1

Request for AWS Team

  1. Performance regression investigation: Why does v3.4.1 consume significantly more CPU than v3.2.5 for identical workloads?

  2. JDK 21 optimization guidance: Are there recommended JVM parameters for large-scale load testing with the new OpenJDK 21?

  3. Framework detection overhead: Does the new K6/Locust framework detection impact JMeter performance?

  4. Container_IPs documentation: Should the Lambda functions automatically create these files, or is this expected user configuration?

  5. Performance benchmarks: Are there known performance differences between v3.2.5 and v3.4.1 that users should be aware of?

Additional Context

  • We maintain extensive customizations to the DLT UI and Lambda functions
  • Cannot upgrade to the latest DLT version due to custom code dependencies
  • Need to upgrade container image only for security patches while maintaining performance
  • This affects production load testing capabilities for enterprise applications

Priority: High - Blocks production load testing capabilities
Impact: Performance regression prevents using updated container versions
Workaround: Partial (resolves S3 errors but not CPU performance)
