RFC#0192 - Ensure workers do not get unnecessarily killed #192

Open · wants to merge 1 commit into base: `main`
1 change: 1 addition & 0 deletions README.md
@@ -69,3 +69,4 @@ See [mechanics](mechanics.md) for more detail.
| RFC#182 | [Allow remote references to .taskcluster.yml files processed by Taskcluster-GitHub](rfcs/0182-taskcluster-yml-remote-references.md) |
| RFC#189 | [Batch APIs for task definition, status and index path](rfcs/0189-batch-task-apis.md) |
| RFC#191 | [Worker Manager launch configurations](rfcs/0191-worker-manager-launch-configs.md) |
| RFC#192 | [`minCapacity` ensures workers do not get unnecessarily killed](rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md) |
58 changes: 58 additions & 0 deletions rfcs/0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md
@@ -0,0 +1,58 @@
# RFC 192 - `minCapacity` ensures workers do not get unnecessarily killed
* Comments: [#192](https://github.com/taskcluster/taskcluster-rfcs/pull/192)
* Proposed by: @JohanLorenzo

# Summary

`worker-manager` allows `minCapacity` to be set, ensuring a certain number of workers
are available at any given time. Unlike the current behavior, these workers
shouldn't be killed unless `minCapacity` is exceeded.

## Motivation - why now?

As far as I can remember, the current behavior has always existed. This year, the
Engineering Effectiveness org is optimizing the cost of the Firefox CI instance.
[Bug 1899511](https://bugzilla.mozilla.org/show_bug.cgi?id=1899511) made a change that
actually uncovered the problem with the current behavior: workers get killed after 2
minutes and a new one gets spawned.


# Details

In the current implementation, workers are in charge of knowing when they have to shut
down. Given that `docker-worker` is officially not supported anymore and we can't
cut a new release and use it, let's change what config `worker-manager` gives to all
workers, `docker-worker` included.
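
For illustration only, here is a rough sketch of the slice of worker config this proposal would tweak. Apart from `afterIdleSeconds`, which this RFC names, the shape and field names are assumptions, not the actual worker-manager schema.

```typescript
// Illustrative sketch only: apart from `afterIdleSeconds`, the shape and
// field names below are assumptions, not the real worker-manager schema.
interface WorkerShutdownConfig {
  // How long a worker may stay idle before shutting itself down.
  afterIdleSeconds: number;
}

interface WorkerConfig {
  shutdown: WorkerShutdownConfig;
  // ...the rest of the worker-type configuration is passed through unchanged.
}
```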

## When `minCapacity` is exceeded

In this case, nothing should change. `worker-manager` sends the same config to workers
as it always did.

## When `minCapacity` is not yet met

Here, `worker-manager` should increase `afterIdleSeconds` to a much higher value (e.g.:
24 hours). This way, workers remain online long enough and we don't kill them too often.
In case one of these long-lived workers gets killed by an external factor (say: the
cloud provider reclaims the spot instance), then `minCapacity` won't be met and a new
long-lived one will be created.

**Contributor:** "Tweaking" idle seconds could be a good solution for existing workers that are unlikely to get upgraded soon.

There should be some upper limit on what worker-manager could set for such workers, to balance the `minCapacity` promise against the additional cost of running them for too long.

**Author:** To me, if we set `minCapacity` then we expect to have a number of workers at any time of the day. I would expect these workers to keep running. If some worker types shouldn't be running for too long, I wonder if we should set `minCapacity` to 0.

**Contributor:** That's true! Indeed, if someone sets a min value they probably know what they're doing; no need to restrict.

**Contributor:** Short term, I think this isn't a problem.

Long term, something probabilistic could work: p(no) = 0.5 if exactly at `minCapacity`, a lower p if above `minCapacity`.

**Author:** I wonder if we need a probabilistic model at this stage. To me, there are already several random factors impacting workers (e.g.: spot instance termination, network issues), so having a deterministic model here may help diagnose what goes wrong with the implementation. I may be missing something in my understanding of the problem, though.
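
To make the two cases above concrete, here is a minimal sketch of how `worker-manager` could pick the `afterIdleSeconds` value it hands to a newly provisioned worker. Only `minCapacity`, `afterIdleSeconds`, and the 24-hour example come from this RFC; the function and parameter names are hypothetical.

```typescript
// Hypothetical sketch: choose the idle timeout to hand to a new worker.
// Only `minCapacity`, `afterIdleSeconds` and the 24-hour example come from
// the RFC text; the names and structure here are illustrative.
const LONG_LIVED_AFTER_IDLE_SECONDS = 24 * 60 * 60; // e.g. 24 hours

function pickAfterIdleSeconds(
  currentCapacity: number,         // capacity currently running or requested
  minCapacity: number,             // the pool's minCapacity setting
  defaultAfterIdleSeconds: number, // the pool's usual idle timeout
): number {
  if (currentCapacity < minCapacity) {
    // Below minCapacity: this worker is meant to be long-lived, so don't let
    // it shut itself down after a short idle period.
    return LONG_LIVED_AFTER_IDLE_SECONDS;
  }
  // At or above minCapacity: keep today's behavior unchanged.
  return defaultAfterIdleSeconds;
}
```

With this shape, the "exceeded" case keeps today's behavior untouched, while the "not yet met" case only changes the value handed out at provisioning time.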

### What if we deploy new worker images?

Long-lived workers will have to be killed if there's a change in their config, including
their image.

**Contributor:** This goes along with the #191 proposal quite well.
With the introduction of launch configurations, workers would be able to query worker-manager periodically to see if their config is still active, and restart/shut down if not.

We could probably create a more generic endpoint which would answer whether a worker should terminate:

* yes if the config changed
* no if `minCapacity` would no longer be met.

However, I'm not totally sure how to solve the issue where all workers ask at once whether they can shut down and all get "yes"... just to have zero running workers afterwards.

**Author:**

> This goes along with the #191 proposal quite well.

Nice! 👍

> However, I'm not totally sure how to solve the issue where all workers ask at once whether they can shut down and all get "yes"... just to have zero running workers afterwards.

I guess that's an okay behavior, at least for a first iteration. Maybe worker-manager could remember how many workers are being shut down so it doesn't tell more workers to turn off.
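
For illustration, here is a rough sketch of the termination-decision endpoint discussed in the thread above, folding in the author's suggestion that worker-manager remember how many workers are already shutting down. No such endpoint exists today; every name here is hypothetical.

```typescript
// Hypothetical sketch of a "should this worker terminate?" decision.
// It says yes only when the worker's launch config is stale AND the pool
// would still satisfy minCapacity after it (and every already-approved
// shutdown) goes away. Nothing here is an existing worker-manager API.
interface PoolState {
  minCapacity: number;
  runningCapacity: number;   // workers currently alive in the pool
  pendingShutdowns: number;  // workers already told to terminate
  activeLaunchConfigId: string;
}

function shouldTerminate(pool: PoolState, workerLaunchConfigId: string): boolean {
  const configIsStale = workerLaunchConfigId !== pool.activeLaunchConfigId;
  const capacityAfterShutdown = pool.runningCapacity - pool.pendingShutdowns - 1;

  if (configIsStale && capacityAfterShutdown >= pool.minCapacity) {
    pool.pendingShutdowns += 1; // later callers see the reduced capacity
    return true;
  }
  return false;
}
```

A replacement worker with the new launch configuration would be provisioned separately; once it raises `runningCapacity`, the next stale worker asking the same question gets a "yes", so the pool rolls over without ever dropping below `minCapacity`.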

### What if short-lived workers are taken into account in `minCapacity`?

When this happens, the short-lived worker will eventually get killed, bringing the number
of workers below `minCapacity`. Then, `worker-manager` will spawn a new long-lived one.

## How to ensure these behaviors are correctly implemented?

We should leverage telemetry to know how long workers live and what config they got
from `worker-manager`. This will help us find any gaps in this plan.
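
As a sketch of what such telemetry could capture (all names here are assumptions for illustration, not an existing schema):

```typescript
// Hypothetical per-worker lifetime record; field names are assumptions.
interface WorkerLifetimeRecord {
  workerPoolId: string;
  workerId: string;
  // The idle timeout worker-manager handed to this worker at provisioning.
  afterIdleSecondsReceived: number;
  startedAt: Date;
  terminatedAt: Date;
  // Why the worker went away, to tell expected shutdowns from external ones.
  terminationReason: 'idle-shutdown' | 'spot-termination' | 'config-change' | 'unknown';
}
```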


# Implementation

TODO
1 change: 1 addition & 0 deletions rfcs/README.md
@@ -57,3 +57,4 @@
| RFC#182 | [Allow remote references to .taskcluster.yml files processed by Taskcluster-GitHub](0182-taskcluster-yml-remote-references.md) |
| RFC#189 | [Batch APIs for task definition, status and index path](0189-batch-task-apis.md) |
| RFC#191 | [Worker Manager launch configurations](0191-worker-manager-launch-configs.md) |
| RFC#192 | [`minCapacity` ensures workers do not get unnecessarily killed](0192-min-capacity-ensures-workers-do-not-get-unnecessarily-killed.md) |