Conversation

@philipwilk (Contributor) commented Sep 12, 2025:

Motivation

Right now, if a substituter returns a 5xx status other than 501, 505 or 511, Nix will keep trying to use it at least four more times before trying the next one, even though it is broken or misconfigured (given the necessary patch is applied; otherwise it throws), resulting in unnecessary log spam.
Also, if you are receiving 504s (for example, an nginx proxy in front of the substituter that keeps timing out), this can lead to rather nasty waits: I have personally seen it take 20-30 s, timing out on each individual request, before finally giving up entirely. Awful.

This doesn't really make sense, because 5xx errors won't go away unless the server is fixed. This change makes Nix disable an HTTP substituter on the first 5xx it gets and immediately switch to the next substituter, which seems like the intuitive default behaviour.

This also preserves `settings.tryFallback`: Nix will still throw if it is false and all the substituters have been disabled because of (terminal) errors. (That logic is unrelated to the HTTP binary cache store and lives in substitution-goal/store-api; see #13301.)
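
A minimal sketch (not the actual Nix source) of the classification this change argues for; the helper name is hypothetical, and a real implementation would sit in the HTTP binary cache store's error handling:

```cpp
// Hypothetical helper: under this proposal, any 5xx response is treated as
// terminal for the substituter that returned it, so Nix disables that
// substituter and immediately moves on to the next one instead of retrying.
static bool isTerminalServerError(long httpStatus)
{
    return httpStatus >= 500 && httpStatus < 600;
}
```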

Context

linked to #13301

![nix error log](https://github.com/user-attachments/assets/cc2a7911-4298-43fa-a3b6-b68eaf88132a)

github-actions bot added the `store` label (Issues and pull requests concerning the Nix store) on Sep 12, 2025
@philipwilk force-pushed the disable500dsubstituters branch from 6a8a68a to cdccd5f on September 13, 2025 at 00:22
@edolstra (Member) commented:

Not sure I agree with this. In my experience, 500 errors typically are transient (e.g. the server is out of memory, ran into a connection limit, etc).

OTOH, "502 Bad Gateway" is likely to be a configuration error where retrying doesn't make sense. So it makes sense to add that one to the list of exceptions.
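
For contrast with the blanket-5xx sketch in the description above, here is a hedged sketch of the exception-list approach; the status codes come from the discussion, and the function name is hypothetical:

```cpp
// Hypothetical helper for the alternative approach: keep retrying on generic
// 5xx responses (assumed transient), but give up immediately on the codes
// already special-cased today (501, 505, 511) plus 502 as suggested here.
static bool isNonRetryableStatus(long httpStatus)
{
    switch (httpStatus) {
        case 501: // Not Implemented
        case 502: // Bad Gateway: usually a proxy misconfiguration
        case 505: // HTTP Version Not Supported
        case 511: // Network Authentication Required
            return true;
        default:
            return false;
    }
}
```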

@Mic92 (Member) commented Sep 15, 2025:

> Not sure I agree with this. In my experience, 500 errors typically are transient (e.g. the server is out of memory, ran into a connection limit, etc).
>
> OTOH, "502 Bad Gateway" is likely to be a configuration error where retrying doesn't make sense. So it makes sense to add that one to the list of exceptions.

One example where it could make sense to retry is when you are redeploying your binary cache at the same time as the other machines.

@philipwilk (Contributor, author) commented:

Let's suppose a server OOMs: the kernel will kill a process to resolve it. If the substituter is running in a monolithic deployment, it is unlikely to be the one killed, since the kernel will choose short-lived processes over long-lived ones like the server, so we shouldn't even see an error. But if it is running in a container or any other cgroup-limited environment whose memory limit triggers the OOM, it will certainly be killed. What happens next? The substituter has to restart, during which you can't use it.
Or take the other example, a server hitting its connection limit: you either get some cookie back (not an error, fine) or you get dropped and have to try again, just like everyone else trying to reach that server at that point in time. Eventually you'll get through, but given that the server is under heavy load, performance will be less than desirable. Will it be worth it?

Now, the ones I personally care about most, 502s and 504s, are unlikely to ever come back without manual intervention, which we all agree on, so (for the argument I'll make shortly) let's suppose for the time being that the time for those to recover is infinite.

Meanwhile, the timescale to download a path from the next substituter is on the order of seconds (obviously entirely network-dependent).

I suppose the point I'm trying to make is that the duration of a 5xx outage is not on the same scale as what is usually one 'transaction' (one invocation of a nix command) querying the substituter for paths.

@Mic92 makes an interesting point about redeploying your cache at the same time as other machines, which results in funky behaviour regardless of how you choose to handle it. There is no way for us to know what caused the 503, so we either ignore that the server is inaccessible at that point and keep trying (which only makes sense with the hindsight that it will shortly become accessible again), or we give up on it immediately and potentially throw if it were the only substituter and it was only momentarily unavailable.
In such a case, IMO moving the retries to the end of the query would make more sense: instead of retrying a potentially terminal cache immediately and wasting time, mark it as disabled, then, once all other caches have been visited and none of them had the path, revisit the disabled caches and retry them, on the assumption that visiting the next cache is near-instantaneous compared to waiting for this one to respond.

If the concern is that you want to prefer one cache much more heavily over others (because of some cost of going over the network or similar), we could decrease the initial disable time from 60 s and make it an exponential backoff instead: if a cache failed initially, it would be retried again soon after (provided not all paths have been found already). It would also be very easy to add a configuration option to increase or decrease the initial timeout/disable time.
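
A minimal sketch of that exponential-backoff idea, assuming a hypothetical per-substituter state struct and helper functions (none of this is the existing Nix code):

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical per-substituter state: instead of a fixed 60 s disable window,
// double the window after every failure, up to a cap.
struct SubstituterBackoff
{
    std::chrono::steady_clock::time_point disabledUntil{};
    std::chrono::seconds disableFor{5}; // hypothetical initial window, could be a setting
};

// Called when a request to this substituter fails with a 5xx.
void markFailed(SubstituterBackoff & b)
{
    b.disabledUntil = std::chrono::steady_clock::now() + b.disableFor;
    b.disableFor = std::min(b.disableFor * 2, std::chrono::seconds{600}); // cap the backoff
}

// Called before revisiting a previously failed substituter, e.g. after all
// other caches have been queried and the path is still missing.
bool mayRetry(const SubstituterBackoff & b)
{
    return std::chrono::steady_clock::now() >= b.disabledUntil;
}
```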

If this is a sticking point, then I am willing to compromise and only add 502s and 504s to the special case, even though I think it will be an incomplete solution.

A review comment thread was left on this hunk of the diff:

```diff
 {
     auto state(_state.lock());
-    if (state->enabled && settings.tryFallback) {
+    if (state->enabled) {
```

A member commented:

We can PR this PR regardless, right?

@philipwilk (Contributor, author) replied:

hi, what does this mean?
