Conversation

@philipwilk (Contributor) commented Sep 12, 2025:

Motivation

Right now, if a substituter returns a 5xx status other than 501, 505 or 511, Nix will keep trying to use it at least four more times before trying the next one, even though it is broken or misconfigured (given the necessary patch is applied; otherwise it throws), resulting in unnecessary log spam.
Also, if you are receiving 504s (for example, an nginx proxy in front of the substituter that keeps timing out), this can lead to rather nasty waits: I have personally seen it take 20-30 s, timing out on each individual request, before finally giving up entirely. Awful.

This doesn't really make sense, because 5xx errors won't go away unless the server is fixed. This change makes Nix disable an HTTP substituter on the first 5xx it gets and immediately switch to the next substituter, which seems like the intuitive default behaviour.

This also preserves `settings.tryFallback`: Nix will still throw if it is false and all the substituters have been disabled because of (terminal) errors. (That logic is unrelated to the HTTP binary cache store and lives in substitution-goal/store-api; see #13301.)
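
A minimal sketch (not the actual Nix source) of the classification this change argues for; the helper name is hypothetical, and a real implementation would sit in the HTTP binary cache store's error handling:

```cpp
// Hypothetical helper: under this proposal, any 5xx response is treated as
// terminal for the substituter that returned it, so Nix disables that
// substituter and immediately moves on to the next one instead of retrying.
static bool isTerminalServerError(long httpStatus)
{
    return httpStatus >= 500 && httpStatus < 600;
}
```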

Context

linked to #13301

![nix error log](https://github.com/user-attachments/assets/cc2a7911-4298-43fa-a3b6-b68eaf88132a)

github-actions bot added the `store` label (Issues and pull requests concerning the Nix store) on Sep 12, 2025
@philipwilk force-pushed the disable500dsubstituters branch from 6a8a68a to cdccd5f on September 13, 2025 at 00:22
@edolstra (Member) commented:

Not sure I agree with this. In my experience, 500 errors typically are transient (e.g. the server is out of memory, ran into a connection limit, etc).

OTOH, "502 Bad Gateway" is likely to be a configuration error where retrying doesn't make sense. So it makes sense to add that one to the list of exceptions.
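
For contrast with the blanket-5xx sketch in the description above, here is a hedged sketch of the exception-list approach; the status codes come from the discussion, and the function name is hypothetical:

```cpp
// Hypothetical helper for the alternative approach: keep retrying on generic
// 5xx responses (assumed transient), but give up immediately on the codes
// already special-cased today (501, 505, 511) plus 502 as suggested here.
static bool isNonRetryableStatus(long httpStatus)
{
    switch (httpStatus) {
        case 501: // Not Implemented
        case 502: // Bad Gateway: usually a proxy misconfiguration
        case 505: // HTTP Version Not Supported
        case 511: // Network Authentication Required
            return true;
        default:
            return false;
    }
}
```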

@Mic92 (Member) commented Sep 15, 2025:

> Not sure I agree with this. In my experience, 500 errors typically are transient (e.g. the server is out of memory, ran into a connection limit, etc).
>
> OTOH, "502 Bad Gateway" is likely to be a configuration error where retrying doesn't make sense. So it makes sense to add that one to the list of exceptions.

One example where it could make sense to retry is when you are redeploying your binary cache at the same time as the other machines.

@philipwilk (Contributor, author) commented:

Let's suppose a server OOMs: the kernel will kill a process to resolve it. If the substituter is running in a monolithic deployment, it is unlikely to be the one killed, since the kernel will choose short-lived processes over long-lived ones like the server, so we shouldn't even see an error. But if it is running in a container or any other cgroup-limited environment whose memory limit triggers the OOM, it will certainly be killed. What happens next? The substituter has to restart, during which you can't use it.
Or take the other example, a server hitting its connection limit: you either get some cookie back (not an error, fine) or you get dropped and have to try again, just like everyone else trying to reach that server at that point in time. Eventually you'll get through, but given that the server is under heavy load, performance will be less than desirable. Will it be worth it?

Now, the ones I personally care about most, 502s and 504s, are unlikely to ever come back without manual intervention, which we all agree on, so (for the argument I'll make shortly) let's suppose for the time being that the time for those to recover is infinite.

Meanwhile, the timescale to download a path from the next substituter is on the order of seconds (obviously entirely network-dependent).

I suppose the point I'm trying to make is that the duration of a 5xx outage is not on the same scale as what is usually one 'transaction' (one invocation of a nix command) querying the substituter for paths.

@Mic92 makes an interesting point about redeploying your cache at the same time as other machines, which results in funky behaviour regardless of how you choose to handle it. There is no way for us to know what caused the 503, so we either ignore that the server is inaccessible at that point and keep trying (which only makes sense with the hindsight that it will shortly become accessible again), or we give up on it immediately and potentially throw if it were the only substituter and it was only momentarily unavailable.
In such a case, IMO moving the retries to the end of the query would make more sense: instead of retrying a potentially terminal cache immediately and wasting time, mark it as disabled, then, once all other caches have been visited and none of them had the path, revisit the disabled caches and retry them, on the assumption that visiting the next cache is near-instantaneous compared to waiting for this one to respond.

If the concern is that you want to prefer one cache much more heavily over others (because of some cost of going over the network or similar), we could decrease the initial disable time from 60 s and make it an exponential backoff instead: if a cache failed initially, it would be retried again soon after (provided not all paths have been found already). It would also be very easy to add a configuration option to increase or decrease the initial timeout/disable time.
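
A minimal sketch of that exponential-backoff idea, assuming a hypothetical per-substituter state struct and helper functions (none of this is the existing Nix code):

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical per-substituter state: instead of a fixed 60 s disable window,
// double the window after every failure, up to a cap.
struct SubstituterBackoff
{
    std::chrono::steady_clock::time_point disabledUntil{};
    std::chrono::seconds disableFor{5}; // hypothetical initial window, could be a setting
};

// Called when a request to this substituter fails with a 5xx.
void markFailed(SubstituterBackoff & b)
{
    b.disabledUntil = std::chrono::steady_clock::now() + b.disableFor;
    b.disableFor = std::min(b.disableFor * 2, std::chrono::seconds{600}); // cap the backoff
}

// Called before revisiting a previously failed substituter, e.g. after all
// other caches have been queried and the path is still missing.
bool mayRetry(const SubstituterBackoff & b)
{
    return std::chrono::steady_clock::now() >= b.disabledUntil;
}
```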

If this is a sticking point, then I am willing to compromise and only add 502s and 504s to the special case, even though I think it will be an incomplete solution.

A review comment thread was left on this hunk of the diff:

```diff
 {
     auto state(_state.lock());
-    if (state->enabled && settings.tryFallback) {
+    if (state->enabled) {
```

A member commented:

We can PR this PR regardless, right?

@philipwilk (Contributor, author) replied:

hi, what does this mean?
