Epic: Worker Monitoring #608

josephjclark · 2024-02-21T09:31:05Z

An epic issue to have oversight over monitoring on the worker.

The high-level brief is: we need better visibility of what's going on inside the worker, especially when things go wrong.

We should consider metrics tracking, Sentry reporting, email notification, Grafana, etc.

Things we want

- When a worker claims something (we do this already in Lightning, but it will be useful later to track drift between what Lightning thinks and what the worker knows).
- When a worker has to kill a run or job.
- Memory sampling for:
  - At least the whole process tree
  - engine
  - workers? (we will need to think about how useful this is on its own, since they are 'disposable processes' and picking them out of a crowd in monitoring may not be that useful)
- CPU usage? (might just be solved by monitoring the pod directly)
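As a rough sketch of the memory sampling idea, the readings could be rolled up per role and for the whole tree. Everything here is an assumption for illustration: the `MemorySample` shape, the role names and `summariseMemory` are hypothetical, not anything in the worker today. How the samples get collected (for example from `process.memoryUsage().rss` inside each process) is left out.

```typescript
// Hypothetical roles for processes in the worker's tree (assumed naming).
type Role = "main" | "engine" | "worker";

// Hypothetical sample shape: one reading per process.
interface MemorySample {
  pid: number;
  role: Role;
  rssBytes: number; // resident set size reported for that process
}

interface MemorySummary {
  treeBytes: number; // sum over every process in the tree
  byRole: Record<Role, number>; // sum per role
  workerCount: number; // how many disposable worker processes were sampled
}

// Roll a batch of samples up into per-role and whole-tree totals.
function summariseMemory(samples: MemorySample[]): MemorySummary {
  const byRole: Record<Role, number> = { main: 0, engine: 0, worker: 0 };
  let treeBytes = 0;
  let workerCount = 0;
  for (const sample of samples) {
    byRole[sample.role] += sample.rssBytes;
    treeBytes += sample.rssBytes;
    if (sample.role === "worker") {
      workerCount += 1;
    }
  }
  return { treeBytes, byRole, workerCount };
}
```

Reporting workers as an aggregate plus a count sidesteps the "picking them out of a crowd" problem. Note that summing RSS across processes can over-count memory shared between them, so the tree total is an upper bound rather than an exact figure.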

We need to figure out the best approach for integrating this into Prometheus. Do we expose an aggregate HTTP service (or use Lightning for that) that collects up the metrics?

We probably don't want to use service discovery for monitoring? Do we?
There is an advantage to workers exposing their own /metrics server: it makes the worker better for everyone.
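For a sense of what a per-worker /metrics response might contain, here is a minimal sketch that renders gauges in the Prometheus text exposition format. The `Gauge` shape, `renderMetrics` and the metric names in the usage note are hypothetical. A real implementation would more likely use an existing Prometheus client library than hand-roll this. The sketch assumes samples sharing a metric name are passed next to each other.

```typescript
// Hypothetical gauge shape for illustration only.
interface Gauge {
  name: string;
  help: string;
  value: number;
  labels?: Record<string, string>;
}

// Escape backslash, double quote and newline in label values,
// as the text exposition format requires.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render gauges as Prometheus text. HELP and TYPE are emitted once per
// metric name, the first time that name is seen.
function renderMetrics(gauges: Gauge[]): string {
  const lines: string[] = [];
  const seen: Record<string, boolean> = {};
  for (const gauge of gauges) {
    if (!seen[gauge.name]) {
      seen[gauge.name] = true;
      lines.push(`# HELP ${gauge.name} ${gauge.help}`);
      lines.push(`# TYPE ${gauge.name} gauge`);
    }
    const labels = gauge.labels ?? {};
    const rendered = Object.keys(labels)
      .map((key) => `${key}="${escapeLabelValue(labels[key])}"`)
      .join(",");
    lines.push(rendered ? `${gauge.name}{${rendered}} ${gauge.value}` : `${gauge.name} ${gauge.value}`);
  }
  return lines.join("\n") + "\n";
}
```

Calling `renderMetrics([{ name: "worker_active_runs", help: "Runs currently claimed", value: 2 }])` yields a HELP line, a TYPE line and `worker_active_runs 2`. The same renderer would work whether each worker serves it directly or an aggregate service collects it.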


josephjclark · 2024-08-22T11:11:47Z

This keeps coming up so I think we want to spend some time on it.

I think there are two separate but related big issues right now:

- Benchmarking: local tests on the worker's performance. We want to better understand our current performance and how it scales. This also lets us verify that future improvements are helping.
- Transparency: we need to better understand what the worker is doing in live environments. Does this mean more eventing? More logging? Can we have a live dashboard? Can we output performance metrics?

Some quick thoughts about possible performance bottlenecks:

- Adaptor installation and compilation are in the main thread. A worker which is compiling code cannot pick up new work.
  - There is no compiler caching.
  - We do want to move compilation into the thread; there's an issue around that.
- So actually, tests on how large jobs (lots of compiler work) and maybe large inputs (lots of main-thread JSON parsing) would be useful. How do those things affect compiler performance?
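A local test for the large-input case could start as small as the sketch below, which times `JSON.parse` against synthetic payloads of increasing size. The payload shape and the `makePayload` and `benchmarkParse` helpers are made up for illustration and are not real run data. `Date.now()` only has millisecond resolution, so small inputs will mostly read as 0.

```typescript
// Build a synthetic JSON string with `items` records (not a real run payload).
function makePayload(items: number): string {
  const data: { id: number; name: string }[] = [];
  for (let i = 0; i < items; i++) {
    data.push({ id: i, name: "item-" + i });
  }
  return JSON.stringify({ data });
}

interface Timing {
  items: number;
  chars: number; // length of the JSON string
  medianMs: number; // median wall-clock parse time across repeats
}

// Median of a list of numbers (averages the two middle values when even).
function median(values: number[]): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Parse each payload `repeats` times on the current thread and report the median.
function benchmarkParse(sizes: number[], repeats: number = 5): Timing[] {
  return sizes.map((items) => {
    const json = makePayload(items);
    const durations: number[] = [];
    for (let r = 0; r < repeats; r++) {
      const start = Date.now();
      JSON.parse(json);
      durations.push(Date.now() - start);
    }
    return { items, chars: json.length, medianMs: median(durations) };
  });
}
```

Running something like `benchmarkParse([1000, 100000, 1000000])` gives a first look at how main-thread parse time grows with input size. The same harness shape could wrap a compile step to cover the large-job case.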

josephjclark · 2025-03-11T10:14:38Z

What would be useful for debugging sometimes (i.e. right now) is to see a) the JSON sent to the worker for each run, and b) the compiled execution plan from that incoming run.

It's expensive, annoying and unreadable to just log the JSON of these data structures.

Can we post them somewhere where they're easily accessible?

This would also help us reproduce runs that are lost or broken, because we should be able to get the exact input data.

taylordowns2000 added this to v2 Feb 21, 2024

github-project-automation bot moved this to New Issues in v2 Feb 21, 2024

christad92 moved this from New Issues to Icebox in v2 Feb 22, 2024

christad92 added the epic label Feb 22, 2024

christad92 assigned josephjclark May 8, 2024

josephjclark removed their assignment Jun 19, 2024

christad92 moved this from Icebox to Backlog in v2 Jul 4, 2024

josephjclark self-assigned this Aug 22, 2024

josephjclark commented Feb 21, 2024 • edited by stuartc