Document why Lix is used on builders #554
Among other reasons, I have begun work on what I hope is the last batch of changes before it is OK to try out CA derivations on |
There's like two or three random segfault/assert failure bugs on the daemon that have persisted since CppNix 2.18. One is mishandling of pthread_cancel (usage of which was removed in lix), one is the signal handler thread dying on daemon fork (fixed in lix). These were fixed since we looked at the core dumps happening on Lix's own infrastructure and our own personal machines. This is a communication fault between CppNix and infra: the random crashes in infra (and likely on CppNix devs' own personal machines; one of these crashes happens on most unexpected disconnections of the nix daemon!) aren't getting triaged into bugs and fixed. Equally, something like the Lix overlay doesn't exist for CppNix HEAD, which leads me to believe it gets inferior testing since it's not super feasible to run the entire software suite against HEAD like Lix supports, so the prerelease testing will necessarily be weaker. From a broader perspective, I'm not sure why CppNix should receive preferential treatment from nixpkgs and infra. This has been an unstated baseline assumption in all of these discussions: as soon as anyone suggests anything should not use CppNix we get arguments of the form "this is NixOS not LixOS" which reject the discussion rather than consider the actual trade-offs of the decision (which are real!! lix objectively cares less about flake evolution for example). Should it not be infra's prerogative to ship an infrastructure that works well and is maintainable, regardless of what they have to do to achieve that? Though I'm surprised that nobody has reported the "fatal: exception not rethrown" pthread_cancel one in the CppNix bug tracker; it's certainly happened a bunch at work on 2.19. |
Just to be clear about where the blame lies here: you can for example look at NixOS/nix#9961 which is from the last time I personally tried running a recent CppNix on h.n.o and failed due to regressions and instability. Happy 1 year anniversary to that issue 🍰! 4 comments from various infra team members over a year, 0 response from CppNix maintainers other than setting a tag and then promptly forgetting about it. |
Why are we testing experimental CppNix features on production Hydra, particularly if the interfaces (like narinfo) and derivation behaviour aren't stable yet? I don't think we should be doing this: it risks non-reproducible or otherwise "weird" cache paths that test edge cases in narinfo handling, are unreachable, or similar. Can we not afford a test cluster to do such experiments? Edit: why is CppNix's repo CI not using HEAD for testing with decent metrics/core collection/etc? This is what we did at Lix and it's worked excellently. |
That's exactly it though. This is the NixOS project, with infra running largely on hardware funded by the NixOS Foundation, so it should be running/dogfooding Nix. The Lix project is of course free to run its own infra. |
@delroth Thank you for highlighting that issue.
This may have been a label from a bulk operation. I'm sorry that this has fallen through the cracks. We receive many bug reports, and a crash involving an outdated version may not have been considered a high priority, fwiw. |
Is this your personal opinion or is this the position of the NixOS Foundation board, of which you are the chairman? EDIT: quoting the bylaws:
Nothing being done here is incompatible with the stated, written on paper, legal purpose of the NixOS Foundation. Lix is very much an implementation of a purely functional software deployment model. |
It's my understanding that this bug persists in Nix to this day, no? |
@delroth My personal opinion. While the bylaws are phrased broadly, the foundation (as hinted at by its name) was created concretely to support the Nix/NixOS projects, not just any projects related to purely functional configuration management (such as Guix or Lix). The foundation can of course decide to adopt other projects in the future. |
Why should the infra team give a preference to a specific implementation though? When there are two competing implementations, it makes sense to use the one that's more fit for purpose, and since 2.19 it's been pretty clear that reliability and stability is not a goal of the CppNix team. Like, the infra team didn't make it up, CppNix was crashing on h.n.o builders, Lix doesn't crash, hence you're providing an inferior alternative for that purpose. Fix it and make your competing option the better one instead of relying on external pressure? As a NixOS user: the infra team's priority should be making sure NixOS can deliver updates in a consistent and timely way. If Lix allows that better than CppNix, that sounds like a big win. |
To state the obvious: NixOS is based on Nix, so that means we should strive to run our infrastructure on both Nix and NixOS. We shouldn't switch to unrelated external projects just because that's convenient with respect to some bugs (which should be fixed, obviously). For instance, we shouldn't switch to Ubuntu on our infra if NixOS has some stability issues. |
Well, it seems like what happened here is that the formal organization diverged from the actual social organization and the people actually working on the distro are much closer to Lix than to CppNix, and are much more able to get bugs fixed in Lix. Though I will point out that the only reason this argument has held any water is because of the CppNix team's insistence that Nix always means CppNix (including by creating a fake intentional name collision project), rather than Nix the technology, of which Lix is a compatible implementation. NixOS running Lix is very much still NixOS and is still a Nix based operating system. NixOS and nixpkgs is so much bigger than a Nix implementation as a project, and the needs of the bigger project should really be considered first and foremost: infra that isn't spending time chasing down bugs, users having a reliable Nix implementation, etc. Most people who use and develop NixOS don't care what the Nix implementation is because they both are very similar and have a stable derivation interface, etc. The fact that CppNix can be removed from Hydra build boxes for a couple of months and still have a NixOS that looks and behaves like NixOS is an indication that Nix the technology is not just CppNix. Meta note: The infra team runs the infra and takes responsibility for it. By pulling rank, you are undermining their autonomy. |
I'd really just like to figure out what the bug is so we can track it. Whether/when it gets fixed is a separate issue and up to the Nix team. And whether infra uses Nix or Lix is also a separate issue and up to the infra team. If anybody strongly disagrees with decisions or the direction of Nix projects, you can now escalate to the Nix Steering Committee. I think you all know by now that this dated disagreement won't suddenly be resolved by arguing about it once more in this new issue. |
In fact, looking at the track record of compatibility between CppNix and itself (again: NixOS/nix#9961, which is a landmine on every version update in a heterogeneous infra, including h.n.o, and which the CppNix team has ignored for > 1 year), it could very well be argued that Lix is a more compatible Nix implementation than CppNix. Would it be completely in good faith? Probably not. Would it be completely out of left field? Also probably not. |
Yeah, me too. There was never an intent to hide these away, but I set these machines up in a busy late December 2024 and forgot about it. And now the coredumps have apparently already been garbage collected.
In fact, we seem to be running a mix of Nix and Lix right now, as it apparently was helpful at the time the decision was made. I would appreciate infra remaining able to make these calls as needed. I'm super disappointed that as a community we keep trading these blows in public, when everyone knows it's not a great look. Maybe try reaching out directly next time? |
Was told this [on Matrix](https://matrix.to/#/!RROtHmAaQIkiJzJZZE:nixos.org/$_zH2bGUSUChkNFNjxwpCeWL_OVa-9XELobMhsqCOinE?via=nixos.org&via=matrix.org&via=nixos.dev) by K900 after asking about it
This was introduced without any linked discussion/explanation, so let's document it now. I asked about the reason on Matrix and was told the answer by @K900. Though I'm leaving this as a draft until we actually know what the segfault was.
Nothing against Lix, but on the Nix Hydra we should really be dog-fooding Nix. Imo if there's a bug, it should be reported so it can be fixed before upgrading.
Pinging @NixOS/nix-team in case anybody has a clue what the problem could be, and how it could be fixed.
Note that evals on the coordinator still use Nix, it's only builder machines that are using Lix for now.