libstore/filetransfer,http-binary-cache-store: disable on substituter 500s instead of throwing #13971
base: master
Conversation
6a8a68a to cdccd5f
Not sure I agree with this. In my experience, 500 errors are typically transient (e.g. the server is out of memory, ran into a connection limit, etc.). OTOH, "502 Bad Gateway" is likely to be a configuration error where retrying doesn't make sense, so it makes sense to add that one to the list of exceptions.
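To make the distinction concrete, here is a minimal C++ sketch of the classification being discussed; it is not the actual Nix source, and `FailureKind`/`classifyServerError` are hypothetical names. It assumes 501, 505 and 511 are already treated as non-retryable and shows 502 added as suggested in this comment.

```cpp
#include <cstdint>

enum class FailureKind { Transient, Terminal };

// Hypothetical helper consulted before scheduling a retry:
// 501 (Not Implemented), 505 (HTTP Version Not Supported) and
// 511 (Network Authentication Required) are already non-retryable;
// the suggestion here is to also treat 502 (Bad Gateway) as terminal.
FailureKind classifyServerError(std::uint16_t status)
{
    switch (status) {
    case 501:
    case 502: // proposed addition from this review comment
    case 505:
    case 511:
        return FailureKind::Terminal;
    default:
        // e.g. 500, 503, 504 stay retryable under this reading
        return FailureKind::Transient;
    }
}
```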
One example where it could make sense to retry is when you are redeploying your binary cache at the same time as the other machines.
Let's suppose that a server OOMs; the kernel will kill a process to resolve the OOM. If the substituter is running in a monolithic deployment, it is unlikely to be killed, since the kernel will choose short-lived processes rather than long-lived ones like the server, which shouldn't give us an error. But if it were in a container or any other environment with a cgroup memory limit (resulting in an OOM), then it would certainly be killed. What happens next? The substituter has to restart, during which you can't use it.

Now, the ones I personally care about most, 502s and 504s, are unlikely to ever come back up without manual intervention, which we all agree on, so let's suppose for the time being that the time for those to come back up is infinite. Meanwhile, the timescale to download a path (from the next substituter) is seconds (obviously that is entirely network dependent). I suppose the point I'm trying to make is that I don't think the duration of a 500 is on the same scale as what is usually one 'transaction' (one use of a nix command) in which you query the substituter for paths.

@Mic92 makes an interesting point about redeploying your cache at the same time as other machines, which results in funky behaviour regardless of how you choose to handle it. There is no way for us to know what a 503 is caused by, so we are either ignoring that the server is inaccessible at that point and keep trying (which does not make sense unless we have hindsight of it shortly becoming accessible), or giving up on it immediately and potentially throwing if it were the only substituter and it became unavailable only momentarily.

If the concern is that you want to more heavily prefer one cache over others (if there is some cost from going over the network or such), we could decrease the initial disable time from 60s and make it an exponential backoff instead (see the sketch below). If it failed initially, it would be tried again soon after (given that all paths haven't been found already); this would also be very easy to expose as a configuration option to increase or decrease the initial timeout/disable time.

If this is a sticking point then I am willing to compromise and only add 502 and 504 to the special case, even if I think it will be an incomplete solution.
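As a rough illustration of the exponential backoff idea above, here is a hedged, self-contained C++ sketch; `SubstituterBackoff` and the 5-second/300-second values are assumptions for illustration, not part of this PR.

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical per-substituter backoff state: start with a short disable
// window, double it on each consecutive failure, and cap it.
struct SubstituterBackoff {
    std::chrono::seconds initialDelay{5};  // assumed initial disable time
    std::chrono::seconds maxDelay{300};    // assumed cap
    unsigned failures = 0;

    std::chrono::seconds nextDisableFor()
    {
        // Double the window per failure (shift capped to avoid overflow).
        auto delay = initialDelay * (1u << std::min(failures, 6u));
        ++failures;
        return std::min(delay, maxDelay);
    }

    void reset() { failures = 0; } // call after a successful request
};
```

With something like this, a cache that fails once is retried within seconds, while one that keeps returning 5xx is quickly backed off for the rest of the run.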
```diff
 {
     auto state(_state.lock());
-    if (state->enabled && settings.tryFallback) {
+    if (state->enabled) {
```
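For readers without the full file open, here is a self-contained sketch of the disable logic this hunk sits in, reconstructed from the diff context; `DisableState`, the function name and the 60-second window are assumptions, not copied from the Nix source.

```cpp
#include <chrono>
#include <cstdio>

// Assumed shape of the per-store state guarded by _state.lock() in the real code.
struct DisableState {
    bool enabled = true;
    std::chrono::steady_clock::time_point disabledUntil{};
};

// Sketch: after this change, the cache is disabled whenever it is still
// enabled, rather than only when settings.tryFallback is set; whether to
// throw or fall back is decided later in substitution-goal/store-api (#13301).
void maybeDisableSketch(DisableState & state)
{
    if (state.enabled) { // before this PR: state.enabled && settings.tryFallback
        std::fprintf(stderr, "disabling binary cache for 60 seconds\n");
        state.enabled = false;
        state.disabledUntil = std::chrono::steady_clock::now() + std::chrono::seconds(60);
    }
}
```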
We can PR this PR regardless, right?
hi, what does this mean?
Motivation
Right now, if a substituter returns a 5xx status other than 501, 505 or 511, Nix will keep trying to use it at least four more times, even though it is broken or misconfigured, before trying the next one (given the necessary patch is applied; otherwise it throws), resulting in unnecessary log spam.
Also, if you are receiving 504s (such as when an nginx proxy is trying to connect to a substituter but timing out), this can lead to rather nasty waits. I have personally seen it take 20-30 seconds, timing out on each individual request, before it times out completely and gives up. Awful.
This doesn't really make sense, because 500s won't go away unless the server is fixed. This change makes Nix disable an HTTP substituter on the first 500 it gets and immediately switch to the next substituter (a more intuitive default behaviour).
This also preserves `settings.tryFallback`: Nix will still throw if it is false and all the substituters have been disabled because they have had (terminal) errors. (That logic is unrelated to the HTTP binary cache store and lives in substitution-goal/store-api; see #13301.)
Context
linked to #13301
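For context, an illustrative `nix.conf` snippet (the first cache URL and the values are examples, not recommendations); `fallback` is the nix.conf option behind `settings.tryFallback`:

```
# Example configuration (illustrative values).
# With this change, a substituter that returns a 500 is disabled for a while
# and the next one in the list is tried immediately; if every substituter has
# been disabled and fallback is false, the command still throws, as before.
substituters = https://cache.example.org https://cache.nixos.org
fallback = false
```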
Add 👍 to pull requests you find important.
The Nix maintainer team uses a GitHub project board to schedule and track reviews.