Better explain logcontext in `run_in_background(...)`, `run_as_background_process(...)`, and the sentinel logcontext #18900
Conversation
Nothing from the Synapse homeserver should be logged with the sentinel context. i.e.
we should always know which server the logs are coming from.
This is aspirational since we currently go against this in a few places. This is being addressed in #18870
…as_background_process`? We start to see some of the tests fail:

```
SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests.crypto.test_keyring
```
# When the caller yield/await's the returned deferred, it may yield
# control back to the reactor. To avoid leaking the current logcontext to the
# reactor (which would then get erroneously picked up by the next thing the reactor
# does) while the deferred runs in the reactor event loop, we reset the logcontext
# and add a callback to the deferred to restore it so the caller's logcontext is
# active when the deferred completes.
Is this more accurate than the previous explanation?
The "yield
will block" scenario is coming from the situation described in the PR description of matrix-org/synapse#3170
So, it turns out that if you have a first
Deferred
D1
, you can add a callback which returns anotherDeferred
D2
, andD2
must then complete before any further callbacks onD1
will execute (and later callbacks onD1
get the result ofD2
rather thanD2
itself).So,
D1
might havecalled=True
(as in, it has started running its callbacks), but any new callbacks added toD1
won't get run untilD2
completes - so if youyield D1
in aninlineCallbacks
function, youryield
will 'block'.In conclusion: some of our assumptions in
logcontext
were invalid. We need to make sure that we don't optimise out the logcontext juggling when this situation happens. Fortunately, it is easy to detect by checkingD1.paused
.
I feel like this may be irrelevant to what I think is actually happening (as explained now) 🤔
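To make the quoted chaining behaviour concrete, here's a tiny standalone Twisted demo (nothing Synapse-specific; `d1`/`d2` are just illustrative names):

```python
from twisted.internet import defer

d1 = defer.Deferred()
d2 = defer.Deferred()

# A callback on D1 that returns another Deferred (D2) pauses D1's callback
# chain until D2 fires.
d1.addCallback(lambda _ignored: d2)
d1.addCallback(lambda result: print("later callback got:", result))

d1.callback("first")   # D1 has started running its callbacks...
assert d1.called       # ...so called=True...
assert d1.paused       # ...but it is paused, waiting on D2

d2.callback("second")  # now the later callback runs with D2's result
assert not d1.paused
```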
I think that is related to the `d.paused` check?
I feel like that detail might not even matter compared to this other reason why we would set `sentinel` here.

That detail might still be good to also include here, but I have no confidence even with this scenario explanation, and it feels like the "why" half is missing, which is what I would want filled in.
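For reference, my mental model of the juggling described in the quoted comment above is roughly the following (a simplified sketch using the `synapse.logging.context` helpers; not claiming this is the exact implementation):

```python
from synapse.logging.context import SENTINEL_CONTEXT, set_current_context

def _restore_context_cb(result, context):
    # Callback that puts `context` back and passes the result through untouched.
    set_current_context(context)
    return result

def make_deferred_yieldable_sketch(deferred):
    if deferred.called and not deferred.paused:
        # Already complete: nothing will run in the reactor, so no juggling needed.
        return deferred

    # Reset to the sentinel so the current logcontext doesn't leak into the
    # reactor while the deferred is pending...
    caller_context = set_current_context(SENTINEL_CONTEXT)
    # ...and restore the caller's logcontext when the deferred completes, so it
    # is active again when the caller's `await` resumes.
    deferred.addBoth(_restore_context_cb, caller_context)
    return deferred
```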
# remember the `sentinel` context as its previous context to return to when it
# exits and yields control back to the reactor.
#
# TODO: Why can't we simplify to using `return run_in_background(run)`?
I'd like to explore this further but I haven't come up with an answer yet.
Currently, some tests fail when we try this -> #18900 (comment)
It probably directly relates to #18900 (comment)
I think this PR is a good step forward and I'd rather leave the note in until we either comment "why" or simplify.
I think it's likely because `run_in_background` starts the `run()` function with the calling logcontext, rather than the sentinel context it expects. So when `BackgroundProcessLoggingContext` exits, it would set the context back to the calling context rather than the sentinel context. Though it'd get reset back to the sentinel context again due to `run_in_background`, I can imagine that entering the calling context is unexpected at that point.
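To illustrate, here's a heavily simplified sketch of the shape being discussed (not the real `run_as_background_process`; error handling and metrics omitted), showing why the context that `run()` starts in matters:

```python
from twisted.internet import defer

from synapse.logging.context import PreserveLoggingContext
from synapse.metrics.background_process_metrics import BackgroundProcessLoggingContext

def run_as_background_process_sketch(desc, func, *args, **kwargs):
    async def run():
        # BackgroundProcessLoggingContext remembers whatever context is active
        # when it is entered and restores it on exit. Started from the sentinel
        # (as below), it restores the sentinel; started via
        # `run_in_background(run)`, it would be entered from the *calling*
        # context and restore that instead.
        with BackgroundProcessLoggingContext(desc):
            return await func(*args, **kwargs)

    # Enter the sentinel context before kicking off the coroutine.
    with PreserveLoggingContext():
        return defer.ensureDeferred(run())
```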
# by spawning off another deferred, we are effectively
# adding a new exit point.)
d.addBoth(_set_context_cb, ctx)
d.addBoth(_set_context_cb, SENTINEL_CONTEXT)
`run_in_background` has been updated to match the docstring (set the sentinel context when the deferred completes), which matches how things should work in my mind.

Can you come up with a reason why we were previously trying to restore some other context?
This was introduced in matrix-org/synapse#3170 without an explanation around this specific piece.
Hmmhmhmhmhmhm, I wonder if there is some awfulness with deferred chaining where the execution doesn't actually go back to the reactor but back to some other context.
We have chained deferred tests that still pass from that PR 🤷
To give this an update with more hindsight; this logic was redundant with the early return and it is safe to remove this complexity ✅
It seems like this actually has to do with completed vs incomplete deferreds...
To explain how things previously worked without the early-return shortcut:
With the normal case of **incomplete awaitable**, we store the `calling_context` and the `f` function is called and runs until it yields to the reactor. Because `f` follows the logcontext rules, it sets the `sentinel` logcontext. Then in `run_in_background(...)`, we restore the `calling_context`, store the current `ctx` (which is `sentinel`) and return. When the deferred completes, we restore `ctx` (which is `sentinel`) before yielding to the reactor again (all good ✅).

With the other case where we see a **completed awaitable**, we store the `calling_context` and the `f` function is called and runs to completion (no logcontext change). *This is where the shortcut would kick in, but I'm going to continue explaining as if we commented out the shortcut.* -- Then in `run_in_background(...)`, we restore the `calling_context` and store the current `ctx` (which is the same as the `calling_context`). Because the deferred is already completed, our extra callback is called immediately and we restore `ctx` (which is the same as the `calling_context`). Since we never yield to the reactor, the `calling_context` is perfect as that's what we want again (all good ✅).

But this also means that our early-return shortcut is no longer just an optimization and is *necessary* to act correctly in the **completed awaitable** case, as we want to return with the `calling_context` and not reset to the `sentinel` context. I've updated the comment in #18964 to explain the necessity as it's currently just described as an optimization.

But because we made the same change to `run_coroutine_in_background(...)`, which didn't have the same early-return shortcut, we regressed the correct behavior ❌. This is being fixed in #18964
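To make that concrete, here's a simplified sketch of why the early return is load-bearing rather than just an optimization (assumptions: `f` returns a coroutine or Deferred, the helper names mirror `synapse.logging.context`, and error handling is omitted):

```python
from twisted.internet import defer

from synapse.logging.context import (
    SENTINEL_CONTEXT,
    current_context,
    set_current_context,
)

def _set_context_cb(result, context):
    set_current_context(context)
    return result

def run_in_background_sketch(f, *args, **kwargs):
    calling_context = current_context()

    # Calling `f` (and, for a coroutine, wrapping it in ensureDeferred) starts
    # it running in the caller's context until it first yields to the reactor.
    res = f(*args, **kwargs)
    d = res if isinstance(res, defer.Deferred) else defer.ensureDeferred(res)

    # If `f` yielded to the reactor, it set the sentinel (per the logcontext
    # rules); either way, put the caller's context back before returning.
    set_current_context(calling_context)

    if d.called and not d.paused:
        # Completed awaitable: we never yielded to the reactor, so the
        # calling_context is already the right one. Resetting to the sentinel
        # here would be wrong -- this is the case the early return protects.
        return d

    # Incomplete awaitable: when it eventually completes, control is about to
    # go back to the reactor, so drop to the sentinel at that point.
    d.addBoth(_set_context_cb, SENTINEL_CONTEXT)
    return d
```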
# Wrap the coroutine in a deferred, which will have the side effect of executing the
# coroutine in the background.
d = defer.ensureDeferred(coroutine)
Re: `run_coroutine_in_background`: Why do we have this specialized version of `run_in_background`? Is this really a worthwhile shortcut?
This was introduced in #17884
Yeah, it makes it a bit more ergonomic to use the `gather_optional_coroutines` function, as you can do:

```python
await gather_optional_coroutines(
    func1(arg1), func2(arg2), func3(arg3),
)
```

which otherwise you'd need to do something like:

```python
await gather_optional(
    lambda: func1(arg1), lambda: func2(arg2), lambda: func3(arg3)
)
```
I see, ergonomic helper 👍
Updated the docstring to explain this ✅
"""Calls a function, ensuring that the current context is restored after | ||
return from the function, and that the sentinel context is set once the | ||
deferred returned by the function completes. | ||
> `run_in_background` (or `run_as_background_process` if we're not going to later wait on the result)
>
> -- @erikjohnston, https://github.com/element-hq/synapse/pull/17884/files#r1822356809

Is this distinction actually true?

It seems like both can be used to wait for a result. Both need to be used with `make_deferred_yieldable(...)`.

`run_as_background_process` just adds some metrics around what's being run.
The difference is:

- `run_in_background` makes the function run with the context of the calling function. This then needs to be awaited on later within the same context to ensure tracing works correctly.
- `run_as_background_process` starts the function with a new context, plus measures the resource usage.
For the caller, they interact the same (as far as I can tell):

- Both preserve the calling context
- Both return a Deferred that does not follow the Synapse logcontext rules, which means they need to be used with `make_deferred_yieldable` to get access to the result.

The only difference is internal, as you pointed out:

- `run_in_background` makes the function run with the context of the calling function.
- `run_as_background_process` starts the function with a new context, plus measures the resource usage.

But I think this already answers the question in this thread: the off-the-cuff comment on the distinction wasn't accurate ⏩
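A small usage sketch of that caller-side symmetry (the `do_work` coroutine is hypothetical, and the import paths are my assumption of where these helpers live):

```python
from synapse.logging.context import make_deferred_yieldable, run_in_background
from synapse.metrics.background_process_metrics import run_as_background_process

async def do_work() -> str:
    # Hypothetical coroutine standing in for real work.
    return "done"

async def caller() -> None:
    # Starts do_work under the caller's logcontext.
    d1 = run_in_background(do_work)
    # Starts do_work under a fresh "example-task" context and records metrics.
    d2 = run_as_background_process("example-task", do_work)

    # Neither deferred follows the logcontext rules, so wrap before awaiting.
    await make_deferred_yieldable(d1)
    await make_deferred_yieldable(d2)
```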
docs/log_contexts.md
Whenever yielding control back to the reactor (event loop), the `sentinel` logcontext is
set to avoid leaking the current logcontext into the other reactor which would
erroneously get attached to the next operation picked up by the event loop.
i.e. the event loop / reactor code runs in the sentinel context. It's only when it calls into application code that a context gets activated. This means the reactor should be started in the sentinel context, and any time an awaitable yields back to the reactor it should reset the context to be the sentinel
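In code terms, that's roughly the pattern below (illustrative only, using `LoggingContext`/`PreserveLoggingContext` from `synapse.logging.context`):

```python
from synapse.logging.context import LoggingContext, PreserveLoggingContext

def handle_something():
    with LoggingContext("main"):
        ...  # application work here is attributed to the "main" context

        with PreserveLoggingContext():
            # Inside this block the sentinel is the active context, so anything
            # the reactor runs while we're "away" isn't mis-attributed to "main".
            ...  # e.g. waiting on a deferred that doesn't follow the logcontext rules
```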
I've incorporated your "in other words" example into the sentinel docs ✅
Thanks for the review and knowledge transfer snippets to enhance this even further 🦧
run_in_background(...) and run_as_background_process(...)
run_in_background(...), run_as_background_process(...), and the sentinel logcontext
…ckground(...)` or `run_as_background_process(...)` Follow-up to #18900
#18870) Remove `sentinel` logcontext where we log in `setup`, `start`, and exit.

Instead of having one giant PR that removes all places we use `sentinel` logcontext, I've decided to tackle this more piece-meal. This PR covers the parts if you just startup Synapse and exit it with no requests or activity going on in between.

Part of #18905 (Remove `sentinel` logcontext where we log in Synapse). Prerequisite for #18868.

Logging with the `sentinel` logcontext means we won't know which server the log came from.

### Why

https://github.com/element-hq/synapse/blob/9cc400177822805e2a08d4d934daad6f3bc2a4df/docs/log_contexts.md#L71-L81 (docs updated in #18900)

### Testing strategy

1. Run Synapse normally and with `daemonize: true`: `poetry run synapse_homeserver --config-path homeserver.yaml`
1. Execute some requests
1. Shutdown the server
1. Look for any bad log entries in your homeserver logs:
   - `Expected logging context sentinel but found main`
   - `Expected logging context main was lost`
   - `Expected previous context`
   - `utime went backwards!`/`stime went backwards!`
   - `Called stop on logcontext POST-0 without recording a start rusage`
1. Look for any logs coming from the `sentinel` context

With these changes, you should only see the following logs (not from Synapse) using the `sentinel` context if you start up Synapse and exit:

`homeserver.log`
```
2025-09-10 14:45:39,924 - asyncio - 64 - DEBUG - sentinel - Using selector: EpollSelector
2025-09-10 14:45:40,562 - twisted - 281 - INFO - sentinel - Received SIGINT, shutting down.
2025-09-10 14:45:40,562 - twisted - 281 - INFO - sentinel - (TCP Port 9322 Closed)
2025-09-10 14:45:40,563 - twisted - 281 - INFO - sentinel - (TCP Port 8008 Closed)
2025-09-10 14:45:40,563 - twisted - 281 - INFO - sentinel - (TCP Port 9093 Closed)
2025-09-10 14:45:40,564 - twisted - 281 - INFO - sentinel - Main loop terminated.
```
#18964) Regressed in #18900 (comment) (see conversation there for more context)

### How is this a regression?

> To give this an update with more hindsight; this logic *was* redundant with the early return and it is safe to remove this complexity :white_check_mark:
>
> It seems like this actually has to do with completed vs incomplete deferreds...
>
> To explain how things previously worked *without* the early-return shortcut:
>
> With the normal case of **incomplete awaitable**, we store the `calling_context` and the `f` function is called and runs until it yields to the reactor. Because `f` follows the logcontext rules, it sets the `sentinel` logcontext. Then in `run_in_background(...)`, we restore the `calling_context`, store the current `ctx` (which is `sentinel`) and return. When the deferred completes, we restore `ctx` (which is `sentinel`) before yielding to the reactor again (all good :white_check_mark:)
>
> With the other case where we see a **completed awaitable**, we store the `calling_context` and the `f` function is called and runs to completion (no logcontext change). *This is where the shortcut would kick in but I'm going to continue explaining as if we commented out the shortcut.* -- Then in `run_in_background(...)`, we restore the `calling_context`, store the current `ctx` (which is same as the `calling_context`). Because the deferred is already completed, our extra callback is called immediately and we restore `ctx` (which is same as the `calling_context`). Since we never yield to the reactor, the `calling_context` is perfect as that's what we want again (all good :white_check_mark:)
>
> ---
>
> But this also means that our early-return shortcut is no longer just an optimization and is *necessary* to act correctly in the **completed awaitable** case as we want to return with the `calling_context` and not reset to the `sentinel` context. I've updated the comment in #18964 to explain the necessity as it's currently just described as an optimization.
>
> But because we made the same change to `run_coroutine_in_background(...)` which didn't have the same early-return shortcut, we regressed the correct behavior ❌. This is being fixed in #18964
>
> *-- @MadLittleMods, #18900 (comment)*

### How did we find this problem?

Spawning from @wrjlewis [seeing](https://matrix.to/#/!SGNQGPGUwtcPBUotTL:matrix.org/$h3TxxPVlqC6BTL07dbrsz6PmaUoZxLiXnSTEY-QYDtA?via=jki.re&via=matrix.org&via=element.io) `Starting metrics collection 'typing.get_new_events' from sentinel context: metrics will be lost` in the logs:

<details>
<summary>More logs</summary>

```
synapse.http.request_metrics - 222 - ERROR - sentinel - Trying to stop RequestMetrics in the sentinel context.
2025-09-23 14:43:19,712 - synapse.util.metrics - 212 - WARNING - sentinel - Starting metrics collection 'typing.get_new_events' from sentinel context: metrics will be lost
2025-09-23 14:43:19,713 - synapse.rest.client.sync - 851 - INFO - sentinel - Client has disconnected; not serializing response.
2025-09-23 14:43:19,713 - synapse.http.server - 825 - WARNING - sentinel - Not sending response to request <XForwardedForRequest at 0x7f23e8111ed0 method='POST' uri='/_matrix/client/unstable/org.matrix.simplified_msc3575/sync?pos=281963%2Fs929324_147053_10_2652457_147960_2013_25554_4709564_0_164_2&timeout=30000' clientproto='HTTP/1.1' site='8008'>, already disconnected.
2025-09-23 14:43:19,713 - synapse.access.http.8008 - 515 - INFO - sentinel - 92.40.194.87 - 8008 - {@me:wi11.co.uk} Processed request: 30.005sec/-8.041sec (0.001sec, 0.000sec) (0.000sec/0.002sec/2) 0B 200! "POST /_matrix/client/unstable/org.matrix.simplified_msc3575/
```
</details>

From the logs there, we can see things relating to `typing.get_new_events` and `/_matrix/client/unstable/org.matrix.simplified_msc3575/sync` which led me to trying out Sliding Sync with the typing extension enabled and allowed me to reproduce the problem locally. Sliding Sync is a unique scenario as it's the only place we use `gather_optional_coroutines(...)` -> `run_coroutine_in_background(...)` (introduced in #17884) to exhibit this behavior.

### Testing strategy

1. Configure Synapse to enable [MSC4186](matrix-org/matrix-spec-proposals#4186): Simplified Sliding Sync which is actually under [MSC3575](matrix-org/matrix-spec-proposals#3575)
   ```yaml
   experimental_features:
     msc3575_enabled: true
   ```
1. Start synapse: `poetry run synapse_homeserver --config-path homeserver.yaml`
1. Make a Sliding Sync request with one of the extensions enabled
   ```http
   POST http://localhost:8008/_matrix/client/unstable/org.matrix.simplified_msc3575/sync
   {
     "lists": {},
     "room_subscriptions": {
       "!FlgJYGQKAIvAscfBhq:my.synapse.linux.server": {
         "required_state": [],
         "timeline_limit": 1
       }
     },
     "extensions": {
       "typing": {
         "enabled": true
       }
     }
   }
   ```
1. Open your homeserver logs and notice warnings about `Starting ... from sentinel context: metrics will be lost`
Better explain `LoggingContext` in `run_in_background(...)` and `run_as_background_process(...)`. Also adds a section in the docs explaining the `sentinel` logcontext. Spawning from #18870
Testing strategy

- Run Synapse normally and with `daemonize: true`: `poetry run synapse_homeserver --config-path homeserver.yaml`
  - Look for any bad log entries in your homeserver logs:
    - `Expected logging context sentinel but found main`
    - `Expected logging context main was lost`
    - `Expected previous context`
    - `utime went backwards!`/`stime went backwards!`
    - `Called stop on logcontext POST-0 without recording a start rusage`
    - `Background process re-entered without a proc`
- Twisted trial tests: look for `Test starting with non-sentinel logging context ...`
Dev notes

Todo

- `Expected logging context main was lost`/`Expected logging context GET-0 was lost` issues (start Synapse, `curl http://localhost:8008/_matrix/client/versions`, exit Synapse)

Pull Request Checklist