🎨🐛Autoscaling: Allow EC2 launches in multiple AvailabilityZones ⚠️ (DevOPS) 🚨 #8210
Codecov Report ❌
Coverage Diff
Metric | master | #8210 | +/- |
---|---|---|---|
Coverage | 87.95% | 87.92% | -0.03% |
Files | 1930 | 1926 | -4 |
Lines | 74773 | 74762 | -11 |
Branches | 1306 | 1309 | +3 |
Hits | 65767 | 65738 | -29 |
Misses | 8615 | 8632 | +17 |
Partials | 391 | 392 | +1 |
Pull Request Overview
This PR enables autoscaling services to launch EC2 instances across multiple Availability Zones to improve resilience against capacity limitations. The main purpose is to handle `InsufficientInstanceCapacity` errors by iterating through multiple subnets in different AZs before failing.
Key Changes:
- Environment variable change from `EC2_INSTANCES_SUBNET_ID` (single) to `EC2_INSTANCES_SUBNET_IDS` (array) for autoscaling and clusters-keeper services
- Enhanced EC2 launch logic in aws-library to cycle through multiple subnets when encountering capacity issues
- Addition of new error handling for `EC2InsufficientCapacityError` and `EC2SubnetsNotEnoughIPsError`
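The subnet-cycling behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the code in `aws_library/ec2/_client.py`: the function `launch_with_subnet_fallback`, both exception classes, and the `fake_launch` helper are hypothetical names invented for this example, and the real EC2 call is replaced by an injected callable so the sketch runs without AWS access.

```python
from collections.abc import Callable, Sequence


class InsufficientCapacityError(Exception):
    """Hypothetical stand-in for an InsufficientInstanceCapacity failure in one subnet."""


class AllSubnetsExhaustedError(Exception):
    """Hypothetical error raised once every configured subnet has been tried."""


def launch_with_subnet_fallback(
    subnet_ids: Sequence[str],
    launch_in_subnet: Callable[[str], list[str]],
) -> list[str]:
    """Try each subnet in order; return the instance IDs of the first successful launch."""
    if not subnet_ids:
        raise ValueError("at least one subnet ID is required")
    for subnet_id in subnet_ids:
        try:
            return launch_in_subnet(subnet_id)
        except InsufficientCapacityError:
            # No capacity in this subnet's AZ right now: move on to the next subnet.
            continue
    raise AllSubnetsExhaustedError(f"no capacity in any of: {', '.join(subnet_ids)}")


def fake_launch(subnet_id: str) -> list[str]:
    # Fake launcher for illustration only: just "subnet-b" has capacity.
    if subnet_id != "subnet-b":
        raise InsufficientCapacityError(subnet_id)
    return [f"i-launched-in-{subnet_id}"]


print(launch_with_subnet_fallback(["subnet-a", "subnet-b", "subnet-c"], fake_launch))
```

Only the capacity error triggers the fallback in this sketch; any other exception propagates immediately, since retrying in another AZ would not help with, say, an invalid instance type.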
Reviewed Changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 6 comments.
File | Description |
---|---|
services/docker-compose.yml | Updated environment variable names to use plural form for subnet IDs |
services/autoscaling/src/simcore_service_autoscaling/core/settings.py | Changed field type from single string to list of strings for subnet configuration |
services/clusters-keeper/src/simcore_service_clusters_keeper/core/settings.py | Changed field type from single string to list of strings for subnet configuration |
packages/aws-library/src/aws_library/ec2/_client.py | Enhanced launch_instances to iterate through multiple subnets and handle capacity errors |
packages/aws-library/src/aws_library/ec2/_errors.py | Added new exception types for insufficient capacity and IP address errors |
packages/aws-library/tests/test_ec2_client.py | Added comprehensive tests for multi-subnet capacity handling |
services/autoscaling/tests/unit/test_modules_cluster_scaling_dynamic.py | Added tests for multi-subnet scenarios and warm buffer failure handling |
Co-authored-by: Copilot <[email protected]>
What do these changes do?
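This PR:
- allows configuring multiple subnets, and therefore multiple Availability Zones, for EC2 launches (`autoscaling` and `clusters-keeper` services)
- retries the launch in another subnet on `InsufficientInstanceCapacity` (changes in `aws-library`), test-driven by `test_ec2_client.py`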
NOTE1: warm buffers that were launched to pre-pull images and later stopped might still fail to start when needed, because a warm buffer is tied to the specific AZ where it was first launched. Therefore there is no guarantee that it can be started again at any given time.
NOTE2: In this case, `autoscaling` should cope and try to launch a fresh instance, probably in a different AZ. The additional tests in `test_modules_cluster_scaling_dynamic.py` show that this currently fails, which is a pretty big bug that will need to be fixed: #8273
Partial bugfix: if starting warm buffers fails, the autoscaling service still takes care of scaling up and down for instance types other than the warm-buffer types.
Related issue/s
How to test
Dev-ops
- Replace `EC2_INSTANCES_SUBNET_ID` with `EC2_INSTANCES_SUBNET_IDS`, which takes a JSON list of subnet IDs in autoscaling and clusters-keeper (e.g. `EC2_INSTANCES_SUBNET_IDS`, `PRIMARY_EC2_INSTANCES_SUBNET_IDS`, `WORKERS_EC2_INSTANCES_SUBNET_IDS`)
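As a rough illustration of the new value format, the sketch below parses a JSON list of subnet IDs from the environment using only the standard library. It is not how the services' `settings.py` reads the variable; `parse_subnet_ids` is a hypothetical helper and the subnet IDs are placeholders.

```python
import json
import os


def parse_subnet_ids(raw: str) -> list[str]:
    """Parse a JSON list of subnet IDs; reject anything but a non-empty list of strings."""
    value = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on a bare ID
    if not isinstance(value, list) or not value:
        raise ValueError("expected a non-empty JSON list of subnet IDs")
    if not all(isinstance(item, str) and item for item in value):
        raise ValueError("every subnet ID must be a non-empty string")
    return value


# Placeholder subnet IDs, for illustration only.
os.environ["EC2_INSTANCES_SUBNET_IDS"] = '["subnet-aaa111", "subnet-bbb222"]'
print(parse_subnet_ids(os.environ["EC2_INSTANCES_SUBNET_IDS"]))
```

A value in the old single-ID form (for example a bare `subnet-aaa111`) is not valid JSON and is rejected here, which is the kind of failure to expect if a deployment is not migrated to the list form.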