Skip to content

Conversation

@philipwilk
Copy link
Contributor

@philipwilk philipwilk commented May 30, 2025

Motivation

Nix currently hard fails if a substituter is inaccessible, even when they are other substituters available, unless fallback = true.
This breaks nix build, run, shell et al entirely.
This would modify the default behaviour so that nix would actually use the other available substituters and not hard error.

Here is an example before vs after when using dotenv where I have manually stopped my own cache to trigger this issue, before and after the patch. The initial error is really frustrating because there is other caches available.
image
image

Context

#3514 (comment) is the earliest issue I could find, but there are many duplicates.

There is an initial PR at #7188, but this appears to have been abandoned - over 2 years with no activity, then a no comment review in jan. There was a subsequent PR at #8983 but this was closed without merge - over a year without activity.

I have visualised the current and proposed flows. I believe my logic flows line up with what is suggested in #7188 (comment) but correct me if I am wrong.
Current behaviour:
current
Proposed behaviour:
proposed

Charts in lucid

Possible issues to think about:

  • I could not figure out where the curl error is created... I can't figure out how to swallow it and turn it into a warn or better yet, a debug log.
  • Unfortunately, in contrast with the previous point, I'm not sure how verbose we want to warns/traces to be - personally I think that the warn that a substituter has been disabled (when it happens) is sufficient, and that the next one is being used, but this is personal preference.

Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

@github-actions github-actions bot added the store Issues and pull requests concerning the Nix store label May 30, 2025
@philipwilk philipwilk marked this pull request as ready for review May 30, 2025 22:24
@philipwilk philipwilk requested a review from Ericson2314 as a code owner May 30, 2025 22:24
@philipwilk
Copy link
Contributor Author

Further consideration:
Potentially a configuration option to mark a substituter as silently fail-able/optional - this would solve the issue of whether we want to log fails (some if not all) or not - such as in the use case of a LAN only cache... would still need to figure out how to silence the curl fails though.

@philipwilk
Copy link
Contributor Author

philipwilk commented Jun 1, 2025

Further consideration: Potentially a configuration option to mark a substituter as silently fail-able/optional - this would solve the issue of whether we want to log fails (some if not all) or not - such as in the use case of a LAN only cache... would still need to figure out how to silence the curl fails though.

Without a json config #8373, I think per-store configuration is nasty to implement (builders/machines config is just list in list) if not impossible, which would also make this currently infeasible...

@philipwilk philipwilk force-pushed the graceful-subfallback branch from e38a214 to 245bc5d Compare June 2, 2025 02:49
@philipwilk philipwilk marked this pull request as draft June 2, 2025 02:50
@philipwilk
Copy link
Contributor Author

philipwilk commented Jun 2, 2025

I think there's a flake (literally) edge case where it still hard errors out in a flake - not sure what is causing this, as I only just tested it working fine... and will need to taken another look at it.

@Jonas-Sander
Copy link

Is this PR still in the works? :)
Would love to see this 🧡

@philipwilk
Copy link
Contributor Author

currently in the process of moving house so dont have the pc to dev on, but will look back into it asap when available.

@philipwilk
Copy link
Contributor Author

philipwilk commented Aug 31, 2025

Took another look just now and i think switching the for loops around in Store::querySubstitutablePathInfos is enough to fix it, it seemed to work when i pulled in packages by running nix shell nixpkgs#(name) in the installPhase environment, but I'm struggling a bit to figure out how to patch nix post 2.26 so I can test it on my whole system in different situations.

(final: prev: {
  nixVersions = prev.nixVersions.extend (
    finalNixV: prevNixV: {
      nixComponents_2_30 =
        (prevNixV.nixComponents_2_30.overrideScope (
          finalScope: prevScope: {
            aws-sdk-cpp = null;
            withAWS = false;
          }
        )).appendPatches
          [ ./ffb2ee8eefc9946a8ebb406cb1fe02734549406d.patch ];
    }
  );
})
EDIT: patch was being flaky, latest version is stable

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/announcing-ncps-a-nix-cache-proxy-server-for-faster-builds/58166/18

@philipwilk philipwilk force-pushed the graceful-subfallback branch 4 times, most recently from fcf5b71 to 9d6a0a4 Compare September 6, 2025 02:25
@philipwilk philipwilk marked this pull request as ready for review September 6, 2025 02:26
@philipwilk
Copy link
Contributor Author

I am reasonably sure that this is the shortest solution.

There is one small edge whereby it can interpret nginx (or whatever your proxy you use in front of your unreliable cache) taking a while to respond on the first request (ex, the moment your cache goes down and nginx is taking exceptionally long trying to connect to it) as a timeout. This is not critical as timeouts are regarded as transient and it tries again momentarily, after which it'll get the 502/504 proxy error and will disable that cache and continue with the next one. This could be solved by increasing the default timeout time for substituters (that way the 502/504 would actually get received the first time around instead of dropped). The main impact this has for users is that a timeout is logged as an error whereas disablement due to a cache being gone is not.

@philipwilk
Copy link
Contributor Author

also quick CTA out to interested people to try break this patch rn while nixcon is going on and i have time to look at it (message me on matrix and/or find me on saturday or sunday idk)

@Ericson2314
Copy link
Member

@philipwilk nice work!

One small thing, instead of checking whether we're at the end of iterators, I rather you just save the last exception. (You an do this with std::move on a nix::Error, for example, or std exception pointer). Then the logic becomes things like:

  • if beginning loop with a saved exception, log it
  • if left loop with a saved exception because no substituters left, throw.

Also, it might be nice to make the 5XX error handling a separate commit?

@philipwilk philipwilk force-pushed the graceful-subfallback branch 3 times, most recently from 6490863 to 0c6826c Compare September 6, 2025 11:58
@philipwilk
Copy link
Contributor Author

philipwilk commented Sep 6, 2025

@Ericson2314 like that?

@Ericson2314
Copy link
Member

Sorry, but are you vibe coding this? I bunch of other stuff changed in this revision that seems orthogonal to what I requested.

@philipwilk
Copy link
Contributor Author

no, which part of do you think is vibed? i amended it to be two commits, the first one lets it continue to the next substituter if it fails, but that isnt very useful without the second commit that makes it disabled broken servers (that have 500'd) instead of hard throwing

@Ericson2314
Copy link
Member

OK thanks. So I would prefer to do all HTTP status logic in a separate PR, even if the state machine logic is not so useful on its own. Is that OK?

continue;
else
throw e;
continue;
Copy link
Member

@Ericson2314 Ericson2314 Sep 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
continue;
if (!settings.tryFallback)
throw e;
continue;

I think we want this, the saving exception thing should just be for the no-fallback case, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh my bad, I think I am wrong and this is fine. You tell can confirm, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think there has been some confusion when implementing this loop that means tryFallback has been coerced into meaning both "try the next substituter" and "try building the path ourselves" - I am quite sure it should only be used for the latter. Otherwise... that would mean we are ok with trying the next substituter if the first one doesn't have a path, but not if it errors out and gets disabled... really weird.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was some other PR where I left a big comment trying to explain the theory.

Here's what I think: "!tryFallback" is a coherence trick: you only build/use a substituter if you are sure that the previous substituters don't have the store object in question. If you can access them and they don't have it, you are sure. If you can't access it, you don't know, and thus you could be getting/build the "wrong version".

That actually would be a reason to have what I wrote above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#7188 (comment) ah this is the comment I had in mind

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps im ignorant, but in what case could we get a path that has different versions depending on the substituter used, would this not result in a different path entirely?

I don't think the state of the substituter should affect the contents of the path we get back?

Copy link
Member

@Ericson2314 Ericson2314 Sep 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With input-addressing, the a store path may be allowed to point to different contents if the derivation is non-deterministic.

@Ericson2314
Copy link
Member

Ah now that I finished revieweing, I feel better too. Sorry about that --- I think we just had a misunderstanding about what exactly I meant with these loops, and so I was like "huh? what is all this?". Hopefully the comments I just left now make sense!

@philipwilk
Copy link
Contributor Author

OK thanks. So I would prefer to do all HTTP status logic in a separate PR, even if the state machine logic is not so useful on its own. Is that OK?

put that in #13971

@Ericson2314
Copy link
Member

@philipwilk Thank so much! Last thing is can you put more information in the commit message? (We want the history to be meaningful without GitHub access, though that said, you don't need to inline the contents of the issue. Referring to it by number is fine.)

@Ericson2314
Copy link
Member

Actually, maybe I can fix up myself with a squash merge.

@Ericson2314 Ericson2314 enabled auto-merge (squash) September 12, 2025 21:29
@Ericson2314 Ericson2314 merged commit aef431f into NixOS:master Sep 12, 2025
14 checks passed
@github-project-automation github-project-automation bot moved this from Triage to Done in Nix team Sep 12, 2025
philipwilk added a commit to philipwilk/nix that referenced this pull request Sep 13, 2025
… are still enabled

Right now, if a substituter returns a 5xx other than 501, 505 or 511, nix will keep trying to use it even though it is broken/misconfigured at least four more times before trying to use the next one (given the necessary patch is applied, otherwise it throws), resulting in unnecessary log spam.
Also, if you are receiving 504s (such as if an nginx proxy is trying to connect to a substituter but timing out) this can lead to rather nasty waits - I have personally seen it take 20-30s, with it timing out at each individual request, before it timing out completely and giving up. Awful.

This doesn't really make sense because 500s wont go away unless the server is fixed? This change would make nix disable an http substituter on the first 500 it gets and immediately switch to the next substituter (intuitive default behaviour?)

This also preserves `settings.tryFallback` as it will still throw if false and all the substituters have been disabled because they have had (terminal) errors. (that logic is unrelated to the http binary cache store and is in substition-goal/store-api, see NixOS#13301 )

linked to NixOS#13301

![nix error log](https://github.com/user-attachments/assets/cc2a7911-4298-43fa-a3b6-b68eaf88132a)
philipwilk added a commit to philipwilk/nix that referenced this pull request Sep 13, 2025
… 500s instead of throwing

Right now, if a substituter returns a 5xx other than 501, 505 or 511, nix will keep trying to use it even though it is broken/misconfigured at least four more times before trying to use the next one (given the necessary patch is applied, otherwise it throws), resulting in unnecessary log spam.
Also, if you are receiving 504s (such as if an nginx proxy is trying to connect to a substituter but timing out) this can lead to rather nasty waits - I have personally seen it take 20-30s, with it timing out at each individual request, before it timing out completely and giving up. Awful.

This doesn't really make sense because 500s wont go away unless the server is fixed? This change would make nix disable an http substituter on the first 500 it gets and immediately switch to the next substituter (intuitive default behaviour?)

This also preserves `settings.tryFallback` as it will still throw if false and all the substituters have been disabled because they have had (terminal) errors. (that logic is unrelated to the http binary cache store and is in substition-goal/store-api, see NixOS#13301 )

linked to NixOS#13301

![nix error log](https://github.com/user-attachments/assets/cc2a7911-4298-43fa-a3b6-b68eaf88132a)
philipwilk added a commit to philipwilk/nix that referenced this pull request Sep 13, 2025
… are still enabled (NixOS#13301)

Nix currently hard fails if a substituter is inaccessible, even when they are other substituters available, unless `fallback = true`.
This breaks nix build, run, shell et al entirely.
This would modify the default behaviour so that nix would actually use the other available substituters and not hard error.

Here is an example before vs after when using dotenv where I have manually stopped my own cache to trigger this issue, before and after the patch. The initial error is really frustrating because there is other caches available.
![image](https://github.com/user-attachments/assets/b4aec474-52d1-497d-b4e8-6f5737d6acc7)
![image](https://github.com/user-attachments/assets/ee91fcd4-4a1a-4c33-bf88-3aee67ad3cc9)

There is an initial PR at NixOS#7188, but this appears to have been abandoned - over 2 years with no activity, then a no comment review in jan. There was a subsequent PR at NixOS#8983 but this was closed without merge - over a year without activity.
<!-- Non-trivial change: Briefly outline the implementation strategy. -->
I have visualised the current and proposed flows. I believe my logic flows line up with what is suggested in NixOS#7188 (comment) but correct me if I am wrong.
Current behaviour:
![current](https://github.com/user-attachments/assets/d9501b34-274c-4eb3-88c3-9021a482e364)
Proposed behaviour:
![proposed](https://github.com/user-attachments/assets/8236e4f4-21ef-45d7-87e1-6c8d416e8c1c)

[Charts in lucid](https://lucid.app/lucidchart/1b51b08d-6c4f-40e0-bf54-480df322cccf/view)
<!-- Invasive change: Discuss alternative designs or approaches you considered. -->

Possible issues to think about:
- I could not figure out where the curl error is created... I can't figure out how to swallow it and turn it into a warn or better yet, a debug log.
- Unfortunately, in contrast with the previous point, I'm not sure how verbose we want to warns/traces to be - personally I think that the warn that a substituter has been disabled (when it happens) is sufficient, and that the next one is being used, but this is personal preference.
@jakobkukla
Copy link

Sorry to bump this, but does this fix the linked issues? If so, why are they still open? If not, what is still needed to close them? Thanks!

@philipwilk
Copy link
Contributor Author

#7188, #3514 probably

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/life-of-a-pull-request-against-nixos-nixpkgs/21691/15

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

store Issues and pull requests concerning the Nix store

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

5 participants