Skip to content

Conversation

sanderegg
Copy link
Member

@sanderegg sanderegg commented Aug 12, 2025

What do these changes do?

This PR:

  • changes the ENV for SUBNET_ID to become a JSONized array where multiple subnet IDs might be given (this concerns autoscaling and clusters-keeper services)
  • launching of instances will iterate over the multiple AvailabilityZone(s) before failing for InsufficientInstanceCapacity (changes in aws-library), test-driven by test_ec2_client.py

NOTE1: warm-buffers that were launched to pre-pull images and then stopped at some time, might still fail when needed as the warm-buffer is then hard-linked to the specific AZ where it was launched first. Therefore there is no guarantee that it can be started again at any time.
NOTE2: In this case, autoscaling should have coped and tried to launch a fresh instance probably in a different AZ. The additional tests in test_modules_cluster_scaling_dynami.py show that this fails and is a pretty big bug, which will need to be fixed: #8273

Partial bugfix: if starting warm buffers fail then the autoscaling service will still take care of scaling down and up for other instance types than warm buffer types

Related issue/s

How to test

Dev-ops

@sanderegg sanderegg added this to the Voyager milestone Aug 12, 2025
@sanderegg sanderegg self-assigned this Aug 12, 2025
@sanderegg sanderegg added a:autoscaling autoscaling service in simcore's stack a:clusters-keeper labels Aug 12, 2025
Copy link

codecov bot commented Aug 12, 2025

Codecov Report

❌ Patch coverage is 86.81319% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.92%. Comparing base (3e8d9ed) to head (553d30b).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8210      +/-   ##
==========================================
- Coverage   87.95%   87.92%   -0.03%     
==========================================
  Files        1930     1926       -4     
  Lines       74773    74762      -11     
  Branches     1306     1309       +3     
==========================================
- Hits        65767    65738      -29     
- Misses       8615     8632      +17     
- Partials      391      392       +1     
Flag Coverage Δ
integrationtests 64.11% <ø> (+0.02%) ⬆️
unittests 86.60% <86.81%> (-0.02%) ⬇️
Components Coverage Δ
pkg_aws_library 93.13% <84.81%> (-0.80%) ⬇️
pkg_celery_library 87.37% <ø> (ø)
pkg_dask_task_models_library 79.62% <ø> (ø)
pkg_models_library 93.09% <ø> (ø)
pkg_notifications_library 85.26% <ø> (ø)
pkg_postgres_database 88.02% <ø> (ø)
pkg_service_integration 70.19% <ø> (ø)
pkg_service_library 71.74% <ø> (ø)
pkg_settings_library 90.17% <ø> (ø)
pkg_simcore_sdk 85.03% <ø> (ø)
agent 93.53% <ø> (ø)
api_server 92.73% <ø> (ø)
autoscaling 95.90% <100.00%> (+<0.01%) ⬆️
catalog 92.34% <ø> (ø)
clusters_keeper 99.13% <100.00%> (ø)
dask_sidecar 92.37% <ø> (ø)
datcore_adapter 97.94% <ø> (ø)
director 75.81% <ø> (ø)
director_v2 90.87% <ø> (-0.02%) ⬇️
dynamic_scheduler 96.27% <ø> (ø)
dynamic_sidecar 90.18% <ø> (-0.19%) ⬇️
efs_guardian 89.62% <ø> (ø)
invitations 91.44% <ø> (ø)
payments 92.61% <ø> (ø)
resource_usage_tracker 91.76% <ø> (-0.43%) ⬇️
storage 86.49% <ø> (ø)
webclient ∅ <ø> (∅)
webserver 88.06% <ø> (+<0.01%) ⬆️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3e8d9ed...553d30b. Read the comment docs.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

mergify bot commented Aug 12, 2025

🧪 CI Insights

Here's what we observed from your CI run for 553d30b.

✅ Passed Jobs With Interesting Signals

Pipeline Job Signal Health on master Retries 🔍 CI Insights 📄 Logs
CI integration-tests Base branch is healthy, but retries were needed. Could be early signs of flakiness 👀 Healthy 1 View View

@sanderegg sanderegg changed the title Enlarge AZ calls 🎨EC2 launches using multiple AZs Aug 14, 2025
@sanderegg sanderegg force-pushed the ec2/allow-multiple-az branch 5 times, most recently from 40c92dc to ae240fa Compare August 26, 2025 09:35
@sanderegg sanderegg changed the title 🎨EC2 launches using multiple AZs 🎨Autoscaling: Allow EC2 launches in multiple AvailabilityZones Aug 26, 2025
@sanderegg sanderegg force-pushed the ec2/allow-multiple-az branch 2 times, most recently from 03b003f to 9c37d09 Compare August 28, 2025 07:09
@sanderegg sanderegg requested a review from Copilot August 28, 2025 07:12
Copilot

This comment was marked as outdated.

@sanderegg sanderegg changed the title 🎨Autoscaling: Allow EC2 launches in multiple AvailabilityZones 🎨Autoscaling: Allow EC2 launches in multiple AvailabilityZones ⚠️ (DevOPS) 🚨 Aug 28, 2025
@sanderegg sanderegg force-pushed the ec2/allow-multiple-az branch from 547d58b to 3be802d Compare August 28, 2025 12:17
@sanderegg sanderegg changed the title 🎨Autoscaling: Allow EC2 launches in multiple AvailabilityZones ⚠️ (DevOPS) 🚨 🎨🐛Autoscaling: Allow EC2 launches in multiple AvailabilityZones ⚠️ (DevOPS) 🚨 Aug 28, 2025
@sanderegg sanderegg requested a review from Copilot August 28, 2025 12:54
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enables autoscaling services to launch EC2 instances across multiple Availability Zones to improve resilience against capacity limitations. The main purpose is to handle InsufficientInstanceCapacity errors by iterating through multiple subnets in different AZs before failing.

Key Changes:

  • Environment variable change from EC2_INSTANCES_SUBNET_ID (single) to EC2_INSTANCES_SUBNET_IDS (array) for autoscaling and clusters-keeper services
  • Enhanced EC2 launch logic in aws-library to cycle through multiple subnets when encountering capacity issues
  • Addition of new error handling for EC2InsufficientCapacityError and EC2SubnetsNotEnoughIPsError

Reviewed Changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
services/docker-compose.yml Updated environment variable names to use plural form for subnet IDs
services/autoscaling/src/simcore_service_autoscaling/core/settings.py Changed field type from single string to list of strings for subnet configuration
services/clusters-keeper/src/simcore_service_clusters_keeper/core/settings.py Changed field type from single string to list of strings for subnet configuration
packages/aws-library/src/aws_library/ec2/_client.py Enhanced launch_instances to iterate through multiple subnets and handle capacity errors
packages/aws-library/src/aws_library/ec2/_errors.py Added new exception types for insufficient capacity and IP address errors
packages/aws-library/tests/test_ec2_client.py Added comprehensive tests for multi-subnet capacity handling
services/autoscaling/tests/unit/test_modules_cluster_scaling_dynamic.py Added tests for multi-subnet scenarios and warm buffer failure handling

@sanderegg sanderegg force-pushed the ec2/allow-multiple-az branch from 3be802d to cb6f177 Compare August 28, 2025 12:57
@sanderegg sanderegg marked this pull request as ready for review August 28, 2025 12:57
@sanderegg sanderegg force-pushed the ec2/allow-multiple-az branch from f52dc5a to db5f577 Compare August 29, 2025 12:40
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a:autoscaling autoscaling service in simcore's stack a:aws-library a:clusters-keeper
Projects
None yet
3 participants