
CPU Utilization Performance Regression in Container v3.4.1 vs v3.2.5 #257

@smkorde

Description

We're experiencing a significant performance regression after upgrading from DLT container image v3.2.5 to v3.4.1. Load tests that previously ran successfully on v3.2.5 now fail with CPU utilization >90% and containers being forcibly shut down.

Environment Details

  • Current Working Version: v3.2.5
  • Problematic Version: v3.4.1
  • AWS Region: Multiple regions tested
  • ECS Configuration:
    • Fargate platform
    • 8 vCPU, 20GB RAM
    • Container Insights enabled
  • Test Scale: 5,000 concurrent users
  • Test Duration: Extended load tests (>30 minutes)

Issue Details

What Works (v3.2.5)

  • 5K user load tests complete successfully
  • CPU utilization remains <70% throughout test
  • No container shutdowns or timeouts
  • Stable performance across multiple test runs

What Fails (v3.4.1)

  • Same 5K user load tests fail with CPU >90%
  • Containers are forcibly shut down due to resource exhaustion
  • Tests that previously took 30+ minutes now fail within 10-15 minutes
  • Issue occurs consistently across multiple attempts

Root Cause Analysis Performed

1. CloudWatch Logs Investigation

We identified excessive S3 HeadObject operations causing CPU spikes:

ERROR - botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: The specified key does not exist.
Key: Container_IPs/testId_IPHOSTS_region.txt

2. Container Behavior Changes

Between v3.2.5 and v3.4.1, the container now expects Container_IPs files in S3 for multi-task coordination:

  • v3.2.5: Uses environment variables (IPNETWORK/IPHOSTS) as fallback
  • v3.4.1: Aggressively polls S3 for Container_IPs files, causing 404 errors and retry loops (a sketch of this pattern follows below)
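
For illustration, here is a minimal sketch of the bounded polling-with-fallback pattern we would expect, in Node.js against the same aws-sdk v2 S3 client style as the workaround below. The key format and the SCENARIOS_BUCKET/IPHOSTS names mirror what appears elsewhere in this issue; the function name, attempt count, and backoff values are ours for illustration only and are not taken from the actual container code.

// Illustrative sketch only - not the actual container entrypoint code.
// Polls for the Container_IPs file with bounded retries and exponential
// backoff, then falls back to the IPHOSTS environment variable (the
// v3.2.5-style behavior described above).
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function resolveHostIps(testId, region, maxAttempts = 5) {
  const params = {
    Bucket: process.env.SCENARIOS_BUCKET,
    Key: `Container_IPs/${testId}_IPHOSTS_${region}.txt`
  };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await s3.headObject(params).promise();            // 404s if the key is missing
      const obj = await s3.getObject(params).promise();
      return obj.Body.toString('utf-8').split('\n').filter(Boolean);
    } catch (err) {
      if (err.statusCode !== 404) throw err;            // only retry on "key not found"
      // Back off between attempts instead of retrying in a tight loop,
      // which is what we suspect is burning CPU in v3.4.1.
      await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 1000));
    }
  }
  // Fall back to the IPHOSTS environment variable (format assumed here to be
  // comma-separated), as v3.2.5 reportedly did.
  return (process.env.IPHOSTS || '').split(',').filter(Boolean);
}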

3. Workaround Implemented

We've temporarily resolved the 404 errors by pre-creating the expected Container_IPs files:

// In the task-runner Lambda: create Container_IPs files for v3.4.1 compatibility.
// s3 is an aws-sdk v2 S3 client (assumed; matches the .promise() style below).
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// ipAddresses, testId, and region come from the task-runner's existing logic.
if (ipAddresses.length > 1) {
  const containerIpsContent = ipAddresses.join('\n');
  await s3.putObject({
    Bucket: process.env.SCENARIOS_BUCKET,
    Key: `Container_IPs/${testId}_IPHOSTS_${region}.txt`,
    Body: containerIpsContent,
    ContentType: 'text/plain'
  }).promise();
}
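
One note on this workaround: the task-runner Lambda's execution role needs s3:PutObject on the Container_IPs/ prefix of the scenarios bucket, otherwise the putObject call above fails and the 404 retry loop persists.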

Performance Impact Analysis

Container Resource Changes (Dockerfile comparison)

v3.2.5 Base:

  • Base image: blazemeter/taurus:1.16.27
  • Python: 3.10
  • JDK: OpenJDK 11/17

v3.4.1 Changes:

  • Base image: amazonlinux:2023-minimal
  • Python: 3.11
  • JDK: java-21-amazon-corretto
  • Added K6 and Locust framework support

Suspected Performance Issues

  1. S3 Retry Loops: Missing Container_IPs files cause aggressive S3 polling
  2. JDK 21 Performance: Potential JVM tuning incompatibility with new JDK version
  3. Framework Detection Overhead: New framework detection logic may impact performance
  4. Base Image Overhead: Amazon Linux 2023 vs optimized Taurus image

JVM Configuration Attempts

We've tried various JVM optimizations for JDK 21:

JAVA_OPTS="-Xms4g -Xmx12g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+DisableExplicitGC -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler"

Expected Behavior

Container v3.4.1 should maintain the same performance characteristics as v3.2.5 for equivalent test loads, especially when provided with adequate resources (8 vCPU, 20GB RAM).

Actual Behavior

CPU utilization spikes to >90%, causing container shutdown and making the solution unusable for production load testing scenarios.

Steps to Reproduce

  1. Deploy DLT solution with container image v3.4.1
  2. Configure ECS task with 8 vCPU, 20GB RAM
  3. Run load test with 5,000 concurrent users for 30+ minutes
  4. Observe CPU utilization metrics in CloudWatch (a sample metrics query is sketched after this list)
  5. Container will be terminated due to resource exhaustion
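
For step 4, a rough sketch of pulling the cluster-level CPU numbers from Container Insights with the aws-sdk v2 CloudWatch client; the cluster name, time window, and period below are placeholders on our side, not DLT defaults.

// Illustrative sketch: read Container Insights CPU metrics for the DLT cluster.
// Requires Container Insights to be enabled on the cluster (it is in our setup).
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

async function getCpuDatapoints(clusterName, startTime, endTime) {
  const { Datapoints } = await cloudwatch.getMetricStatistics({
    Namespace: 'ECS/ContainerInsights',
    MetricName: 'CpuUtilized',                 // CPU units used; divide by CpuReserved for a percentage
    Dimensions: [{ Name: 'ClusterName', Value: clusterName }],
    StartTime: startTime,
    EndTime: endTime,
    Period: 60,                                // 1-minute datapoints
    Statistics: ['Average', 'Maximum']
  }).promise();
  return Datapoints.sort((a, b) => a.Timestamp - b.Timestamp);
}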

Workaround Status

  • Container_IPs file creation: Resolves S3 404 errors
  • CPU performance: Still experiencing high CPU utilization
  • ⚠️ Production impact: Tests that worked in v3.2.5 cannot run in v3.4.1

Request for AWS Team

  1. Performance regression investigation: Why does v3.4.1 consume significantly more CPU than v3.2.5 for identical workloads?

  2. JDK 21 optimization guidance: Are there recommended JVM parameters for large-scale load testing with the new OpenJDK 21?

  3. Framework detection overhead: Does the new K6/Locust framework detection impact JMeter performance?

  4. Container_IPs documentation: Should the Lambda functions automatically create these files, or is this expected user configuration?

  5. Performance benchmarks: Are there known performance differences between v3.2.5 and v3.4.1 that users should be aware of?

Additional Context

  • We maintain extensive customizations to the DLT UI and Lambda functions
  • Cannot upgrade to the latest DLT version due to custom code dependencies
  • Need to upgrade container image only for security patches while maintaining performance
  • This affects production load testing capabilities for enterprise applications

Priority: High - Blocks production load testing capabilities
Impact: Performance regression prevents using updated container versions
Workaround: Partial (resolves S3 errors but not CPU performance)
