# RFC 192 - `minCapacity` ensures workers do not get unnecessarily killed
* Comments: [#192](https://github.com/taskcluster/taskcluster-rfcs/pull/192)
* Proposed by: @JohanLorenzo

# Summary

`worker-manager` allows `minCapacity` to be set, ensuring a certain number of workers
are available at any given time. Unlike what currently happens, these workers
shouldn't be killed unless `minCapacity` is exceeded.

## Motivation - why now?

As far as I can remember, the current behavior has always existed. This year, the
Engineering Effectiveness org is optimizing the cost of the Firefox CI instance.
[Bug 1899511](https://bugzilla.mozilla.org/show_bug.cgi?id=1899511) made a change that
uncovered the problem with the current behavior: workers get killed after 2
minutes and new ones get spawned.

# Details

In the current implementation, workers are in charge of knowing when they have to shut
down. Given that `docker-worker` is officially not supported anymore and we can't
cut a new release of it, let's change the config `worker-manager` gives to all
workers, `docker-worker` included.
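
For illustration, here is a rough sketch of the pieces involved: the worker pool's `minCapacity` on the worker-manager side, and the idle-shutdown setting (`afterIdleSeconds`) that gets passed down to workers such as `docker-worker`. Only the field names mentioned in this RFC are taken from the source; the surrounding structure is assumed, not exact.

```typescript
// Illustrative only: a simplified view of a worker pool definition and the
// idle-shutdown setting worker-manager hands to a newly provisioned worker.
// `minCapacity` and `afterIdleSeconds` come from this RFC; the rest of the
// shape is an assumption for the sake of the example.

interface WorkerPoolConfig {
  minCapacity: number;   // number of workers that should always be available
  maxCapacity: number;   // upper bound the provisioner may scale up to
  launchConfigs: Array<{
    workerConfig: {
      shutdown: {
        enabled: boolean;
        afterIdleSeconds: number;  // how long a worker may sit idle before shutting down
      };
    };
  }>;
}

// Example pool: keep at least 2 workers around; idle workers currently shut
// down after 15 minutes (the value this RFC proposes to raise for the workers
// that back minCapacity).
const examplePool: WorkerPoolConfig = {
  minCapacity: 2,
  maxCapacity: 50,
  launchConfigs: [
    { workerConfig: { shutdown: { enabled: true, afterIdleSeconds: 900 } } },
  ],
};
```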

## When `minCapacity` is exceeded

In this case, nothing should change. `worker-manager` sends the same config to workers
as it always did.

## When `minCapacity` is not yet met

Here, `worker-manager` should increase `afterIdleSeconds` to a much higher value (e.g.
24 hours). This way, workers remain online long enough and we don't kill them too often.
In case one of these long-lived workers gets killed by an external factor (say, the
cloud provider reclaims the spot instance), then `minCapacity` won't be met and a new
long-lived one will be created.
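
A minimal sketch of the proposed decision, assuming worker-manager knows the pool's currently running capacity when it builds the config for a new worker (function name and constants are hypothetical):

```typescript
// Hypothetical sketch of the proposal: when the pool has not yet reached
// minCapacity, hand out a much longer idle timeout so the new worker becomes
// one of the long-lived workers backing minCapacity.

const DEFAULT_AFTER_IDLE_SECONDS = 900;        // e.g. 15 minutes (illustrative)
const LONG_LIVED_AFTER_IDLE_SECONDS = 86400;   // e.g. 24 hours, as suggested above

function afterIdleSecondsFor(runningCapacity: number, minCapacity: number): number {
  // Pool is still below minCapacity: this worker should stick around.
  if (runningCapacity < minCapacity) {
    return LONG_LIVED_AFTER_IDLE_SECONDS;
  }
  // minCapacity already satisfied: behave exactly as today.
  return DEFAULT_AFTER_IDLE_SECONDS;
}
```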

### What if we deploy new worker images?

Long-lived workers will have to be killed if there's a change in their config, including
their image.

> **Review comment:** This goes along with the #191 proposal quite well. We could probably
> create a more generic endpoint which would answer whether a worker should terminate: […]
> However, I'm not totally sure how to solve this issue if all workers ask at once whether
> they could shut down, and all get […]
>
> **Reply:** Nice! 👍 I guess that's an okay behavior, at least for a first iteration. Maybe […]
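
To illustrate the generic endpoint idea floated in the comment above: nothing like this exists in worker-manager today, and the endpoint name, URL, and shapes below are invented purely for illustration.

```typescript
// Hypothetical "should I terminate?" check, sketching the idea from the
// review comment above. Not an existing worker-manager API; all names and
// shapes are made up for illustration.

interface ShouldTerminateResponse {
  terminate: boolean;   // true if the worker may shut itself down now
  reason?: string;      // e.g. "new worker image deployed", "above minCapacity"
}

// A worker would poll this periodically instead of relying purely on a
// locally configured idle timeout.
async function shouldTerminate(workerPoolId: string, workerId: string): Promise<ShouldTerminateResponse> {
  const res = await fetch(
    `https://worker-manager.example.com/api/worker-pool/${workerPoolId}/worker/${workerId}/should-terminate`,
  );
  return res.json();
}
```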

### What if short-lived workers are taken into account in `minCapacity`?

When this happens, the short-lived worker will eventually get killed, bringing the number
of workers below `minCapacity`. Then, `worker-manager` will spawn a new long-lived one.

## How to ensure these behaviors are correctly implemented?

We should leverage telemetry to know how long workers live and what config they got
from `worker-manager`. This will help us find any gaps in this plan.
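
As a rough sketch of the kind of record that could support this, here is an illustrative telemetry event emitted when a worker terminates. The RFC does not specify a mechanism; the fields below only show what would let us correlate worker lifetimes with the config they received.

```typescript
// Illustrative shape of a telemetry record emitted on worker termination.
// The exact logging mechanism is not specified by this RFC; this only shows
// which fields would let us verify the proposed behavior.

interface WorkerLifetimeEvent {
  workerPoolId: string;
  workerId: string;
  startedAt: string;            // ISO 8601
  terminatedAt: string;         // ISO 8601
  lifetimeSeconds: number;
  afterIdleSeconds: number;     // the value worker-manager handed to this worker
  countedTowardsMinCapacity: boolean;
  terminationReason: 'idle' | 'spot-reclaim' | 'config-change' | 'other';
}

function logWorkerLifetime(event: WorkerLifetimeEvent): void {
  // Emit as a structured log line; a metrics pipeline can compare
  // lifetimeSeconds against afterIdleSeconds to spot workers dying too early.
  console.log(JSON.stringify({ type: 'worker-lifetime', ...event }));
}
```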

# Implementation

TODO

> **Review comment:** "Tweaking" idle seconds could be a good solution for existing workers
> that are unlikely to get upgraded soon. There should be some upper limit on what
> worker-manager could set for such workers, to balance the `minCapacity` promise against
> the additional cost of running them for too long.

> **Reply:** To me, if we set `minCapacity` then we expect to have a number of workers at
> any time of the day. I would expect these workers to keep running. If some worker types
> shouldn't be running for too long, I wonder if we should set `minCapacity` to 0.

> **Reply:** That's true! Indeed, if someone sets a min value he probably knows what he's
> doing, no need to restrict.

> **Reply:** Short term, I think this isn't a problem. Long term, something probabilistic
> could work: p(no) = 0.5 if exactly at `minCapacity`, p lower if higher than `minCapacity`.
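
One possible reading of that probabilistic suggestion, sketched for illustration only (the comment does not specify the exact curve, so the decay below is an assumption):

```typescript
// One possible reading of the probabilistic idea above, for illustration only.
// When asked "may this worker terminate?", answer "no" with probability 0.5
// when the pool sits exactly at minCapacity, and with a lower probability the
// further the pool is above minCapacity.

function probabilityOfAnsweringNo(runningCapacity: number, minCapacity: number): number {
  if (runningCapacity < minCapacity) {
    return 1; // below minCapacity: never let a worker go
  }
  const excess = runningCapacity - minCapacity;
  return 0.5 / (1 + excess); // 0.5 at minCapacity, decreasing as capacity grows
}

// e.g. probabilityOfAnsweringNo(5, 5) === 0.5, probabilityOfAnsweringNo(6, 5) === 0.25
```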

> **Reply:** I wonder if we need a probabilistic model at this stage. To me, there are
> already several random factors impacting workers (e.g. spot instance termination,
> network issues), so having a deterministic model here may help diagnose what goes wrong
> with the implementation. I may be missing something in my understanding of the problem,
> though.