-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
'Gracefully' fallback from failing substituters #13301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Further consideration: |
Without a json config #8373, I think per-store configuration is nasty to implement (builders/machines config is just list in list) if not impossible, which would also make this currently infeasible... |
e38a214 to
245bc5d
Compare
|
I think there's a flake (literally) edge case where it still hard errors out in a flake - not sure what is causing this, as I only just tested it working fine... and will need to taken another look at it. |
|
Is this PR still in the works? :) |
|
currently in the process of moving house so dont have the pc to dev on, but will look back into it asap when available. |
245bc5d to
ffb2ee8
Compare
|
(final: prev: {
nixVersions = prev.nixVersions.extend (
finalNixV: prevNixV: {
nixComponents_2_30 =
(prevNixV.nixComponents_2_30.overrideScope (
finalScope: prevScope: {
aws-sdk-cpp = null;
withAWS = false;
}
)).appendPatches
[ ./ffb2ee8eefc9946a8ebb406cb1fe02734549406d.patch ];
}
);
}) |
|
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/announcing-ncps-a-nix-cache-proxy-server-for-faster-builds/58166/18 |
fcf5b71 to
9d6a0a4
Compare
9d6a0a4 to
564f1c3
Compare
|
I am reasonably sure that this is the shortest solution. There is one small edge whereby it can interpret nginx (or whatever your proxy you use in front of your unreliable cache) taking a while to respond on the first request (ex, the moment your cache goes down and nginx is taking exceptionally long trying to connect to it) as a timeout. This is not critical as timeouts are regarded as transient and it tries again momentarily, after which it'll get the 502/504 proxy error and will disable that cache and continue with the next one. This could be solved by increasing the default timeout time for substituters (that way the 502/504 would actually get received the first time around instead of dropped). The main impact this has for users is that a timeout is logged as an error whereas disablement due to a cache being gone is not. |
|
also quick CTA out to interested people to try break this patch rn while nixcon is going on and i have time to look at it (message me on matrix and/or find me on saturday or sunday idk) |
|
@philipwilk nice work! One small thing, instead of checking whether we're at the end of iterators, I rather you just save the last exception. (You an do this with
Also, it might be nice to make the 5XX error handling a separate commit? |
6490863 to
0c6826c
Compare
|
@Ericson2314 like that? |
|
Sorry, but are you vibe coding this? I bunch of other stuff changed in this revision that seems orthogonal to what I requested. |
|
no, which part of do you think is vibed? i amended it to be two commits, the first one lets it continue to the next substituter if it fails, but that isnt very useful without the second commit that makes it disabled broken servers (that have 500'd) instead of hard throwing |
|
OK thanks. So I would prefer to do all HTTP status logic in a separate PR, even if the state machine logic is not so useful on its own. Is that OK? |
| continue; | ||
| else | ||
| throw e; | ||
| continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| continue; | |
| if (!settings.tryFallback) | |
| throw e; | |
| continue; |
I think we want this, the saving exception thing should just be for the no-fallback case, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh my bad, I think I am wrong and this is fine. You tell can confirm, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think there has been some confusion when implementing this loop that means tryFallback has been coerced into meaning both "try the next substituter" and "try building the path ourselves" - I am quite sure it should only be used for the latter. Otherwise... that would mean we are ok with trying the next substituter if the first one doesn't have a path, but not if it errors out and gets disabled... really weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was some other PR where I left a big comment trying to explain the theory.
Here's what I think: "!tryFallback" is a coherence trick: you only build/use a substituter if you are sure that the previous substituters don't have the store object in question. If you can access them and they don't have it, you are sure. If you can't access it, you don't know, and thus you could be getting/build the "wrong version".
That actually would be a reason to have what I wrote above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#7188 (comment) ah this is the comment I had in mind
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps im ignorant, but in what case could we get a path that has different versions depending on the substituter used, would this not result in a different path entirely?
I don't think the state of the substituter should affect the contents of the path we get back?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With input-addressing, the a store path may be allowed to point to different contents if the derivation is non-deterministic.
|
Ah now that I finished revieweing, I feel better too. Sorry about that --- I think we just had a misunderstanding about what exactly I meant with these loops, and so I was like "huh? what is all this?". Hopefully the comments I just left now make sense! |
… are still enabled
0c6826c to
11d7c80
Compare
put that in #13971 |
|
@philipwilk Thank so much! Last thing is can you put more information in the commit message? (We want the history to be meaningful without GitHub access, though that said, you don't need to inline the contents of the issue. Referring to it by number is fine.) |
|
Actually, maybe I can fix up myself with a squash merge. |
… are still enabled Right now, if a substituter returns a 5xx other than 501, 505 or 511, nix will keep trying to use it even though it is broken/misconfigured at least four more times before trying to use the next one (given the necessary patch is applied, otherwise it throws), resulting in unnecessary log spam. Also, if you are receiving 504s (such as if an nginx proxy is trying to connect to a substituter but timing out) this can lead to rather nasty waits - I have personally seen it take 20-30s, with it timing out at each individual request, before it timing out completely and giving up. Awful. This doesn't really make sense because 500s wont go away unless the server is fixed? This change would make nix disable an http substituter on the first 500 it gets and immediately switch to the next substituter (intuitive default behaviour?) This also preserves `settings.tryFallback` as it will still throw if false and all the substituters have been disabled because they have had (terminal) errors. (that logic is unrelated to the http binary cache store and is in substition-goal/store-api, see NixOS#13301 ) linked to NixOS#13301 
… 500s instead of throwing Right now, if a substituter returns a 5xx other than 501, 505 or 511, nix will keep trying to use it even though it is broken/misconfigured at least four more times before trying to use the next one (given the necessary patch is applied, otherwise it throws), resulting in unnecessary log spam. Also, if you are receiving 504s (such as if an nginx proxy is trying to connect to a substituter but timing out) this can lead to rather nasty waits - I have personally seen it take 20-30s, with it timing out at each individual request, before it timing out completely and giving up. Awful. This doesn't really make sense because 500s wont go away unless the server is fixed? This change would make nix disable an http substituter on the first 500 it gets and immediately switch to the next substituter (intuitive default behaviour?) This also preserves `settings.tryFallback` as it will still throw if false and all the substituters have been disabled because they have had (terminal) errors. (that logic is unrelated to the http binary cache store and is in substition-goal/store-api, see NixOS#13301 ) linked to NixOS#13301 
… are still enabled (NixOS#13301) Nix currently hard fails if a substituter is inaccessible, even when they are other substituters available, unless `fallback = true`. This breaks nix build, run, shell et al entirely. This would modify the default behaviour so that nix would actually use the other available substituters and not hard error. Here is an example before vs after when using dotenv where I have manually stopped my own cache to trigger this issue, before and after the patch. The initial error is really frustrating because there is other caches available.   There is an initial PR at NixOS#7188, but this appears to have been abandoned - over 2 years with no activity, then a no comment review in jan. There was a subsequent PR at NixOS#8983 but this was closed without merge - over a year without activity. <!-- Non-trivial change: Briefly outline the implementation strategy. --> I have visualised the current and proposed flows. I believe my logic flows line up with what is suggested in NixOS#7188 (comment) but correct me if I am wrong. Current behaviour:  Proposed behaviour:  [Charts in lucid](https://lucid.app/lucidchart/1b51b08d-6c4f-40e0-bf54-480df322cccf/view) <!-- Invasive change: Discuss alternative designs or approaches you considered. --> Possible issues to think about: - I could not figure out where the curl error is created... I can't figure out how to swallow it and turn it into a warn or better yet, a debug log. - Unfortunately, in contrast with the previous point, I'm not sure how verbose we want to warns/traces to be - personally I think that the warn that a substituter has been disabled (when it happens) is sufficient, and that the next one is being used, but this is personal preference.
|
Sorry to bump this, but does this fix the linked issues? If so, why are they still open? If not, what is still needed to close them? Thanks! |
|
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/life-of-a-pull-request-against-nixos-nixpkgs/21691/15 |
Motivation
Nix currently hard fails if a substituter is inaccessible, even when they are other substituters available, unless
fallback = true.This breaks nix build, run, shell et al entirely.
This would modify the default behaviour so that nix would actually use the other available substituters and not hard error.
Here is an example before vs after when using dotenv where I have manually stopped my own cache to trigger this issue, before and after the patch. The initial error is really frustrating because there is other caches available.


Context
#3514 (comment) is the earliest issue I could find, but there are many duplicates.
There is an initial PR at #7188, but this appears to have been abandoned - over 2 years with no activity, then a no comment review in jan. There was a subsequent PR at #8983 but this was closed without merge - over a year without activity.
I have visualised the current and proposed flows. I believe my logic flows line up with what is suggested in #7188 (comment) but correct me if I am wrong.


Current behaviour:
Proposed behaviour:
Charts in lucid
Possible issues to think about:
Add 👍 to pull requests you find important.
The Nix maintainer team uses a GitHub project board to schedule and track reviews.