Document why Lix is used on builders #554

Merged
merged 1 commit into from
Feb 12, 2025

infinisil
Member

@infinisil infinisil commented Feb 11, 2025

This was introduced without any linked discussion/explanation, so let's document it now. I asked about the reason on Matrix and was told the answer by @K900. Though I'm leaving this as a draft until we actually know what the segfault was.

Nothing against Lix, but on the Nix Hydra we should really be dog-fooding Nix. Imo if there's a bug, it should be reported so it can be fixed before upgrading.

Pinging @NixOS/nix-team in case anybody has a clue what the problem could be, and how it could be fixed.

Note that evals on the coordinator still use Nix, it's only builder machines that are using Lix for now.

@Ericson2314
Member

Among other reasons, I have begun work on what I hope is the last batch of changes before it is OK to try out CA derivations on hydra.nixos.org, and Lix will not have those changes, so getting the builders running Nix again would block that.

@lf-
Member

lf- commented Feb 11, 2025

There's like two or three random segfault/assert failure bugs on the daemon that have persisted since CppNix 2.18. One is mishandling of pthread_cancel (usage of which was removed in lix), one is the signal handler thread dying on daemon fork (fixed in lix). These were fixed since we looked at the core dumps happening on Lix's own infrastructure and our own personal machines. This is a communication fault between CppNix and infra: the random crashes in infra (and likely on CppNix devs' own personal machines; one of these crashes happens on most unexpected disconnections of the nix daemon!) aren't getting triaged into bugs and fixed. Equally, something like the Lix overlay doesn't exist for CppNix HEAD, which leads me to believe it gets inferior testing since it's not super feasible to run the entire software suite against HEAD like Lix supports, so the prerelease testing will necessarily be weaker.

From a broader perspective, I'm not sure why CppNix should receive preferential treatment from nixpkgs and infra. This has been an unstated baseline assumption in all of these discussions: as soon as anyone suggests anything should not use CppNix we get arguments of the form "this is NixOS not LixOS" which reject the discussion rather than consider the actual trade-offs of the decision (which are real!! lix objectively cares less about flake evolution for example). Should it not be infra's prerogative to ship an infrastructure that works well and is maintainable, regardless of what they have to do to achieve that?

Though I'm surprised that nobody has reported the "fatal: exception not rethrown" pthread_cancel one in the CppNix bug tracker; it's certainly happened a bunch at work on 2.19.

@delroth
Contributor

delroth commented Feb 11, 2025

> communication fault between CppNix and infra

> aren't getting triaged into bugs

Just to be clear about where the blame lies here: you can for example look at NixOS/nix#9961 which is from the last time I personally tried running a recent CppNix on h.n.o and failed due to regressions and instability. Happy 1 year anniversary to that issue 🍰! 4 comments from various infra team members over a year, 0 response from CppNix maintainers other than setting a tag and then promptly forgetting about it.

@lf-
Member

lf- commented Feb 11, 2025

> Among other reasons, I have begun work on what I hope is the last batch of changes before it is OK to try out CA derivations on hydra.nixos.org, and Lix will not have those changes, so getting the builders running Nix again would block that.

Why are we testing experimental CppNix features on production Hydra, particularly if the interfaces (like narinfo) and derivation behaviour aren't stable yet? I don't think we should be doing this, it risks non-reproducible or otherwise "weird" cache paths that test edge cases in narinfo handling or aren't reachable or such. Can we not afford a test cluster to do such experiments?

Edit: why is CppNix's repo CI not using HEAD for testing with decent metrics/core collection/etc? This is what we did at Lix and it's worked excellently.

@edolstra
Member

> "this is NixOS not LixOS"

That's exactly it though. This is the NixOS project, with infra running largely on hardware funded by the NixOS Foundation, so it should be running/dogfooding Nix. The Lix project is of course free to run its own infra.

@roberth
Member

roberth commented Feb 11, 2025

@delroth Thank you for highlighting that issue.

> setting a tag and then promptly forgetting about it.

This may have been a label from a bulk operation.

I'm sorry that this has fallen through the cracks. We receive many bug reports, and a crash involving an outdated version may not have been considered a high priority, fwiw.

@delroth
Contributor

delroth commented Feb 11, 2025

> by the NixOS Foundation, so it should be running/dogfooding Nix

Is this your personal opinion or is this the position of the NixOS Foundation board, of which you are the chairman?

EDIT: quoting the bylaws:

> The purpose of the foundation is: to develop, propagate, and promote the adoption of a purely functional software deployment model and to support open-source projects that implement that model, as well as other activities that relate to, pertain to, and/or can be conducive to the foregoing in the broadest sense.

Nothing being done here is incompatible with the stated, written on paper, legal purpose of the NixOS Foundation. Lix is very much an implementation of a purely functional software deployment model.

@winterqt
Member

> We receive many bug reports, and a crash involving an outdated version may not have been considered a high priority, fwiw.

It's my understanding that this bug persists in Nix to this day, no?

@edolstra
Member

@delroth My personal opinion.

While the bylaws are phrased broadly, the foundation (as hinted at by its name) was created concretely to support the Nix/NixOS projects, not just any projects related to purely functional configuration management (such as Guix or Lix). The foundation can of course decide to adopt other projects in the future.

@delroth
Contributor

delroth commented Feb 11, 2025

> While the bylaws are phrased broadly, the foundation (as hinted at by its name) was created concretely to support the Nix/NixOS projects, not just any projects related to purely functional configuration management (such as Guix or Lix). The foundation can of course decide to adopt other projects in the future.

Why should the infra team give a preference to a specific implementation though? When there are two competing implementations, it makes sense to use the one that's more fit for purpose, and since 2.19 it's been pretty clear that reliability and stability is not a goal of the CppNix team. Like, the infra team didn't make it up, CppNix was crashing on h.n.o builders, Lix doesn't crash, hence you're providing an inferior alternative for that purpose. Fix it and make your competing option the better one instead of relying on external pressure?

As a NixOS user: the infra team's priority should be making sure NixOS can deliver updates in a consistent and timely way. If Lix allows that better than CppNix, that sounds like a big win.

@edolstra
Member

To state the obvious: NixOS is based on Nix, so that means we should strive to run our infrastructure on both Nix and NixOS. We shouldn't switch to unrelated external projects just because that's convenient with respect to some bugs (which should be fixed, obviously). For instance, we shouldn't switch to Ubuntu on our infra if NixOS has some stability issues.

@lf-
Member

lf- commented Feb 12, 2025

> To state the obvious: NixOS is based on Nix, so that means we should strive to run our infrastructure on both Nix and NixOS. We shouldn't switch to unrelated external projects just because that's convenient with respect to some bugs (which should be fixed, obviously). For instance, we shouldn't switch to Ubuntu on our infra if NixOS has some stability issues.

Well, it seems like what happened here is that the formal organization diverged from the actual social organization and the people actually working on the distro are much closer to Lix than to CppNix, and are much more able to get bugs fixed in Lix.

Though I will point out that the only reason this argument has held any water is because of the CppNix team's insistence that Nix always means CppNix (including by creating a fake intentional name collision project), rather than Nix the technology, of which Lix is a compatible implementation. NixOS running Lix is very much still NixOS and is still a Nix based operating system. NixOS and nixpkgs is so much bigger than a Nix implementation as a project, and the needs of the bigger project should really be considered first and foremost: infra that isn't spending time chasing down bugs, users having a reliable Nix implementation, etc.

Most people who use and develop NixOS don't care what the Nix implementation is because they both are very similar and have a stable derivation interface, etc. The fact that CppNix can be removed from Hydra build boxes for a couple of months and still have a NixOS that looks and behaves like NixOS is an indication that Nix the technology is not just CppNix.

Meta note: The infra team runs the infra and takes responsibility for it. By pulling rank, you are undermining their autonomy.

@infinisil
Member Author

infinisil commented Feb 12, 2025

I'd really just like to figure out what the bug is so we can track it. Whether/when it gets fixed is a separate issue and up to the Nix team. And whether infra uses Nix or Lix is also a separate issue and up to the infra team.

If anybody strongly disagrees with decisions or the direction of Nix projects, you can now escalate to the Nix Steering Committee. I think you all know by now that this dated disagreement won't suddenly be resolved by arguing about it once more in this new issue.

@delroth
Contributor

delroth commented Feb 12, 2025

> Most people who use and develop NixOS don't care what the Nix implementation is because they both are very similar and have a stable derivation interface, etc. The fact that CppNix can be removed from Hydra build boxes for a couple of months and still have a NixOS that looks and behaves like NixOS is an indication that Nix the technology is not just CppNix.

In fact, looking at the track record of compatibility between CppNix and itself (again, see NixOS/nix#9961: it's a landmine on every version update in a heterogeneous infra, including h.n.o, and the CppNix team has ignored it for > 1 year), it could very well be argued that Lix is a more compatible Nix implementation than CppNix. Would it be completely in good faith? Probably not. Would it be completely out of left field? Also probably not.

@mweinelt
Member

> I'd really just like to figure out what the bug is so we can track it. Whether/when it gets fixed is a separate issue and up to the Nix team.

Yeah, me too. There was never an intent to hide these away, but I set these machines up in a busy late December 2024 and forgot about it. And now the coredumps have apparently already been garbage collected.

> And whether infra uses Nix or Lix is also a separate issue and up to the infra team.

In fact, we seem to be running a mix of Nix and Lix right now, as it apparently was helpful at the time the decision was made. I would appreciate it if infra could keep making these calls as needed.

I'm super disappointed that as a community we keep trading these blows in public, when everyone knows it's not a great look. Maybe try reaching out directly next time?

Mic92 marked this pull request as ready for review on February 12, 2025 10:27
Mic92 requested a review from a team as a code owner on February 12, 2025 10:27
NixOS locked as too heated and limited conversation to collaborators on Feb 12, 2025
Mic92 enabled auto-merge on February 12, 2025 10:28
Mic92 merged commit feea1ec into main on Feb 12, 2025
8 checks passed
Mic92 deleted the lix-expl branch on February 12, 2025 10:29
9 participants