RFC#0192 - ensures workers do not get unnecessarily killed
JohanLorenzo committed Jun 20, 2024
1 parent 802091b commit a90eb39
Showing 2 changed files with 59 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -69,3 +69,4 @@ See [mechanics](mechanics.md) for more detail.
| RFC#182 | [Allow remote references to .taskcluster.yml files processed by Taskcluster-GitHub](rfcs/0182-taskcluster-yml-remote-references.md) |
| RFC#189 | [Batch APIs for task definition, status and index path](rfcs/0189-batch-task-apis.md) |
| RFC#191 | [Worker Manager launch configurations](rfcs/0191-worker-manager-launch-configs.md) |
| RFC#192 | [`minCapacity` ensures workers do not get unnecessarily killed](rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md) |
58 changes: 58 additions & 0 deletions rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md
@@ -0,0 +1,58 @@
# RFC 192 - `minCapacity` ensures workers do not get unnecessarily killed
* Comments: [#192](https://github.com/taskcluster/taskcluster-rfcs/pull/192)
* Proposed by: @JohanLorenzo

# Summary

`worker-manager` allows `minCapacity` to be set, ensuring a certain number of workers
are available at any given time. Unlike what currently happens, these workers
shouldn't be killed unless `minCapacity` is exceeded.

## Motivation - why now?

As far as I can remember, the current behavior has always existed. This year, the
Engineering Effectiveness org is optimizing the cost of the Firefox CI instance.
[Bug 1899511](https://bugzilla.mozilla.org/show_bug.cgi?id=1899511) made a change that
uncovered the problem with the current behavior: a worker gets killed after 2
minutes and a new one gets spawned.


# Details

In the current implementation, workers are in charge of knowing when they have to shut
down. Since `docker-worker` is officially not supported anymore and we can't cut a new
release of it, let's change the config `worker-manager` gives to all workers,
`docker-worker` included.
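
To make this concrete, here is a minimal sketch of the kind of config a worker receives
today. The RFC only names `afterIdleSeconds`; the surrounding structure and the
`shutdown` grouping are assumptions for illustration, not the actual worker-manager
schema.

```typescript
// Hedged sketch, not the real worker-manager schema: only `afterIdleSeconds`
// comes from the RFC, everything else is an illustrative assumption.
interface ShutdownConfig {
  enabled: boolean;
  afterIdleSeconds: number; // worker shuts itself down after idling this long
}

interface WorkerConfig {
  shutdown: ShutdownConfig;
  // ...other worker settings omitted
}

// Roughly what a worker receives today, matching the ~2 minute shutdowns
// observed in bug 1899511:
const currentConfig: WorkerConfig = {
  shutdown: { enabled: true, afterIdleSeconds: 120 },
};
```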

## When `minCapacity` is exceeded

In this case, nothing should change. `worker-manager` sends the same config to workers
as it always did.

## When `minCapacity` is not yet met

Here, `worker-manager` should increase `afterIdleSeconds` to a much higher value (e.g.
24 hours). This way, workers remain online long enough and we don't kill them too often.
In case one of these long-lived workers gets killed by an external factor (say, the
cloud provider reclaims the spot instance), then `minCapacity` won't be met and a new
long-lived one will be created.
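
A minimal sketch of the proposed decision follows, assuming `worker-manager` knows the
pool's currently running capacity when it provisions a worker. The function name and
constant values are illustrative, not part of this RFC.

```typescript
// Illustrative sketch of the proposed provisioning decision.
const LONG_LIVED_AFTER_IDLE_SECONDS = 24 * 60 * 60; // e.g. 24 hours
const DEFAULT_AFTER_IDLE_SECONDS = 120;             // the existing short timeout

function afterIdleSecondsFor(minCapacity: number, runningCapacity: number): number {
  if (runningCapacity < minCapacity) {
    // Below minCapacity: hand out a long idle timeout so the worker stays
    // online instead of being killed and respawned repeatedly.
    return LONG_LIVED_AFTER_IDLE_SECONDS;
  }
  // minCapacity already met or exceeded: keep the current behavior.
  return DEFAULT_AFTER_IDLE_SECONDS;
}
```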

### What if we deploy new worker images?

Long-lived workers will have to be killed if there's a change in their config, including
their image.
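
One possible way to detect such a change, sketched below, is to fingerprint the config
(image included) a worker was started with and compare it against the pool's current
config. These helpers are assumptions, not something this RFC specifies.

```typescript
import { createHash } from "crypto";

// Hypothetical helper: fingerprint a config object so changes can be detected.
function configFingerprint(config: unknown): string {
  // Note: JSON.stringify is key-order sensitive; a real implementation would
  // use a canonical serialization.
  return createHash("sha256").update(JSON.stringify(config)).digest("hex");
}

// A long-lived worker needs recycling when its startup config no longer
// matches the pool's current config.
function needsRecycling(startedWithConfig: unknown, currentPoolConfig: unknown): boolean {
  return configFingerprint(startedWithConfig) !== configFingerprint(currentPoolConfig);
}
```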

### What if short-lived workers are taken into account in `minCapacity`?

When this happens, the short-lived worker will eventually get killed, dropping the number
of workers below `minCapacity`. Then, `worker-manager` will spawn a new long-lived one.

## How to ensure these behaviors are correctly implemented?

We should leverage telemetry to know how long workers live and what config they got
from `worker-manager`. This will help us find any gaps in this plan.
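
As an illustration of the kind of measurement this implies, the sketch below records how
long each worker lived and which idle timeout `worker-manager` handed it. The event shape
and `reportWorkerLifetime` function are hypothetical.

```typescript
// Hypothetical telemetry event for worker lifetimes.
interface WorkerLifetimeEvent {
  workerPoolId: string;
  workerId: string;
  afterIdleSeconds: number; // idle timeout the worker was configured with
  lifetimeSeconds: number;  // terminated timestamp minus started timestamp
}

function reportWorkerLifetime(event: WorkerLifetimeEvent): void {
  // In practice this would be sent to whatever telemetry backend Firefox CI uses.
  console.log(JSON.stringify({ type: "worker-lifetime", ...event }));
}
```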


# Implementation

TODO
