120 changes: 116 additions & 4 deletions src/libexpr/include/nix/expr/value.hh
@@ -14,6 +14,10 @@

#include <nlohmann/json_fwd.hpp>

#if defined(__x86_64__) && defined(__SSE2__)
# include <emmintrin.h>
#endif

namespace nix {

struct Value;
@@ -362,6 +366,11 @@ protected:
{
return internalType;
}

static bool isAtomic()
{
return false;
}
};

namespace detail {
@@ -395,7 +404,11 @@ class alignas(16) ValueStorage<ptrSize, std::enable_if_t<detail::useBitPackedVal

using PackedPointer = typename PackedPointerTypeStruct<ptrSize>::type;
using Payload = std::array<PackedPointer, 2>;
Payload payload = {};
#if defined(__x86_64__) && defined(__SSE2__)
__m128i payloadWords;
#else
Payload payloadWords = {};
#endif

static constexpr int discriminatorBits = 3;
static constexpr PackedPointer discriminatorMask = (PackedPointer(1) << discriminatorBits) - 1;
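As an aside, the low-bit tagging these constants enable can be illustrated with a small standalone sketch (not part of the diff): a 16-byte-aligned address has its low 3 bits clear, so a discriminator can be OR'ed in and masked back out.

    #include <cstdint>

    int main()
    {
        alignas(16) static int object;
        auto addr = reinterpret_cast<uintptr_t>(&object); // low 3 bits are 0
        uintptr_t mask = (uintptr_t(1) << 3) - 1;         // discriminatorMask
        uintptr_t tagged = addr | 0x5;                    // tag with a discriminator
        uintptr_t discriminator = tagged & mask;          // recovers 0x5
        uintptr_t original = tagged & ~mask;              // untagPointer
        return (discriminator == 0x5 && original == addr) ? 0 : 1;
    }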
@@ -439,16 +452,64 @@ class alignas(16) ValueStorage<ptrSize, std::enable_if_t<detail::useBitPackedVal
pdPairOfPointers, //< layout: Pair of pointers payload
};

#if defined(__x86_64__) && defined(__SSE2__)
/* Why do we even bother with hand-rolling these arch-specific intrinsics
* and don't use libatomic directly for 16 byte atomics? Here's why:
* - https://gcc.gnu.org/legacy-ml/gcc/2018-02/msg00224.html
* - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84563
*
* Basically, we don't really ever want to go through libatomic. As it
* so happens on x86_64 with AVX, MOVDQA/MOVAPS instructions (16-byte aligned
* 128-bit loads and stores) are atomic [^]. Note that
* these instructions are not part of AVX but rather SSE2, which is x86_64-v1.
* They are just not guaranteed to be atomic without AVX.

Contributor:

I don't understand the optimization done by this PR.
AFAIK there is no performance difference between MOVDQA and MOVDQU on modern hardware, if the memory is aligned. There won't be any unaligned accesses.

I doubt that atomic stores are sufficient in the multi-threaded evaluator. Acquire/release semantics are required. I don't think that there is any way around something like LOCK CMPXCHG16B (maybe for writes), while MOVDQA could be used for reads.
Is it guaranteed that the compiler cannot reorder memory accesses across intrinsics?

xokdvium (Contributor Author), Sep 12, 2025:

> Acquire/release semantics are required.

Indeed, libatomic uses an mfence for stores. I was waiting for @edolstra for feedback on what would be required for the parallel eval branch.

> CMPXCHG16B

Yes, DCAS is also necessary for thunk locking.

Basically the whole idea of this patch is to show that this hackery (https://github.com/DeterminateSystems/nix-src/blob/766f43aa6acb1b3578db488c19fbbedf04ed9f24/src/libexpr/include/nix/expr/value.hh#L526-L528, https://github.com/DeterminateSystems/nix-src/blob/766f43aa6acb1b3578db488c19fbbedf04ed9f24/src/libexpr/include/nix/expr/value.hh#L471-L472) can be avoided. Instead of splitting the ValueStorage in this dubious way we'd instead have CMPXCHG16B in place of (https://github.com/DeterminateSystems/nix-src/blob/766f43aa6acb1b3578db488c19fbbedf04ed9f24/src/libexpr/include/nix/expr/eval-inline.hh#L106-L113)
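For illustration, a minimal sketch (not the PR's code; the names are hypothetical) of what such a DCAS-based thunk update could look like, assuming a 16-byte-aligned two-word payload and compilation with -mcx16 so that GCC/Clang inline the builtin instead of calling libatomic:

    #include <array>
    #include <cstdint>

    using Payload = std::array<uint64_t, 2>;

    // Hypothetical sketch: swap in a new payload only if the thunk still
    // holds the expected (unevaluated) payload. With -mcx16 this lowers to
    // LOCK CMPXCHG16B; the storage pointer must be 16-byte aligned.
    bool tryUpdateThunk(Payload * storage, Payload expected, Payload desired) noexcept
    {
        return __atomic_compare_exchange(
            storage, &expected, &desired,
            /* weak */ false, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
    }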

xokdvium (Contributor Author):

> I don't understand the optimization done by this PR.

This isn't really an optimization, more of laying the groundwork to facilitate parallel eval. It's missing some pieces, like release semantics for stores, which would really just be achieved with __atomic_compare_exchange in places where thunks get updated. So updatePayload doesn't really need to have release semantics, but the atomic load is crucial and can't be achieved by other means. Standard atomics go through libatomic's IFUNCs, which is really bad for performance, as noted by Eelco.

Contributor:

Compiler builtins could be used, e.g. std::atomic<unsigned __int128>, but this produces function calls.
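For reference, the dismissed alternative would look something like this (a sketch; on current GCC/Clang these loads and stores typically compile to calls into libatomic, e.g. __atomic_load_16, rather than inline instructions):

    #include <atomic>

    // Portable but slow path: a standard 16-byte atomic. The loads/stores
    // below go through libatomic's runtime dispatch instead of a plain
    // MOVDQA, which is the overhead this PR is trying to avoid.
    std::atomic<unsigned __int128> cell;

    unsigned __int128 read() { return cell.load(std::memory_order_acquire); }
    void write(unsigned __int128 v) { cell.store(v, std::memory_order_release); }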

xokdvium (Contributor Author):

> Standard atomics go through libatomic's IFUNCs, which is really bad for performance, as noted by Eelco.

That's what I mention above, yes. The default implementation of 16-byte atomics is useless.

NaN-git (Contributor), Sep 12, 2025:

> > Acquire/release semantics are required.
>
> Indeed, libatomic uses an mfence for stores. I was waiting for @edolstra for feedback on what would be required for the parallel eval branch.

Indeed, no LOCK CMPXCHG16B or mfence instruction is needed, because MOVDQA pairs have acquire/release semantics when the feature flag CPUID.01H:ECX.AVX[bit 28] is set and memory_order_seq_cst semantics are not required.
Thus, dynamic dispatch is required because of older CPUs.

Hence your implementation is OK from the hardware point of view, except that the compiler needs to know about this, i.e. that no memory accesses are reordered across the atomic memory accesses and that the atomic memory accesses aren't optimized out.

xokdvium (Contributor Author):

> Thus, dynamic dispatch is required because of older CPUs.

Yeah, the idea is to not support parallel eval without AVX. Seems like even a 10-year-old potato like Ivy Bridge has it, so it's not a big deal.

> the compiler needs to know about this

We can plonk down std::atomic_thread_fence(std::memory_order_acq_rel). On x86 that should be purely an optimization barrier, without any instructions being emitted.
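Concretely, the load-side pattern under discussion would look something like this (a sketch, assuming the AVX atomicity guarantee above; the fence constrains only the compiler and emits no instruction on x86-64):

    #include <emmintrin.h>
    #include <atomic>

    // Sketch: 16-byte aligned SSE2 load plus a compiler-level acquire fence.
    // MOVDQA supplies the hardware atomicity (given AVX); the fence keeps the
    // compiler from reordering later loads above it.
    __m128i atomicLoad16(const __m128i * p) noexcept
    {
        __m128i v = _mm_load_si128(p);
        std::atomic_thread_fence(std::memory_order_acquire);
        return v;
    }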

xokdvium (Contributor Author):

Thanks @NaN-git btw, your expertise in x86 and memory models is much appreciated.

Contributor:

AVX specifically? Put another way, would recent feature-full versions of ARM processors (Graviton, Apple's M-series, etc.) be supported?

*
* For more details see:
* - [^] Intel® 64 and IA-32 Architectures Software Developer’s Manual (10.1.1 Guaranteed Atomic Operations).
Contributor:

in Volume 3A

* - https://patchwork.sourceware.org/project/gcc/patch/YhxkfzGEEQ9KHbBC@tucnak/
* - https://ibraheem.ca/posts/128-bit-atomics/
* - https://rigtorp.se/isatomic/
*/
[[gnu::always_inline]]
void updatePayload(Payload payload) noexcept
{
/* This intrinsic corresponds to MOVAPS. Note that Value (and thus the first member
payloadWords) is 16-byte aligned. */
_mm_store_si128(&payloadWords, std::bit_cast<__m128i>(payload));
}

[[gnu::always_inline]]
Payload loadPayload() const noexcept
{
/* This intrinsic corresponds to MOVDQA. Note that Value (and thus the first member
payloadWords) is 16-byte aligned. */
__m128i res = _mm_load_si128(&payloadWords);
return std::bit_cast<Payload>(res);
}
#else
[[gnu::always_inline]]
void updatePayload(Payload payload) noexcept
{
payloadWords = payload;
}

[[gnu::always_inline]]
Payload loadPayload() const noexcept
{
return payloadWords;
}
#endif

template<typename T>
requires std::is_pointer_v<T>
static T untagPointer(PackedPointer val) noexcept
{
return std::bit_cast<T>(val & ~discriminatorMask);
}

PrimaryDiscriminator getPrimaryDiscriminator() const noexcept
PrimaryDiscriminator getPrimaryDiscriminator(PackedPointer firstDWord) const noexcept
{
return static_cast<PrimaryDiscriminator>(payload[0] & discriminatorMask);
return static_cast<PrimaryDiscriminator>(firstDWord & discriminatorMask);
}

static void assertAligned(PackedPointer val) noexcept
@@ -459,25 +520,30 @@ class alignas(16) ValueStorage<ptrSize, std::enable_if_t<detail::useBitPackedVal
template<InternalType type>
void setSingleDWordPayload(PackedPointer untaggedVal) noexcept
{
Payload payload;
/* There are plenty of free upper bits in the first dword, which is
used only for the discriminator. */
payload[0] = static_cast<int>(pdSingleDWord) | (static_cast<int>(type) << discriminatorBits);
payload[1] = untaggedVal;
updatePayload(payload);
}

template<PrimaryDiscriminator discriminator, typename T, typename U>
void setUntaggablePayload(T * firstPtrField, U untaggableField) noexcept
{
Payload payload;
static_assert(discriminator >= pdListN && discriminator <= pdPath);
auto firstFieldPayload = std::bit_cast<PackedPointer>(firstPtrField);
assertAligned(firstFieldPayload);
payload[0] = static_cast<int>(discriminator) | firstFieldPayload;
payload[1] = std::bit_cast<PackedPointer>(untaggableField);
updatePayload(payload);
}

template<InternalType type, typename T, typename U>
void setPairOfPointersPayload(T * firstPtrField, U * secondPtrField) noexcept
{
Payload payload;
static_assert(type >= tListSmall && type <= tLambda);
{
auto firstFieldPayload = std::bit_cast<PackedPointer>(firstPtrField);
Expand All @@ -489,21 +555,58 @@ class alignas(16) ValueStorage<ptrSize, std::enable_if_t<detail::useBitPackedVal
assertAligned(secondFieldPayload);
payload[1] = (type - tListSmall) | secondFieldPayload;
}
updatePayload(payload);
}

template<typename T, typename U>
requires std::is_pointer_v<T> && std::is_pointer_v<U>
void getPairOfPointersPayload(T & firstPtrField, U & secondPtrField) const noexcept
{
Payload payload = loadPayload();
firstPtrField = untagPointer<T>(payload[0]);
secondPtrField = untagPointer<U>(payload[1]);
}

public:
ValueStorage()
{
updatePayload({});
}

ValueStorage(const ValueStorage & other)
{
updatePayload(other.loadPayload());
}

ValueStorage & operator=(const ValueStorage & other)
{
updatePayload(other.loadPayload());
return *this;
}

ValueStorage(ValueStorage && other) noexcept
{
updatePayload(other.loadPayload());
other.updatePayload({}); // Zero out rhs
}

ValueStorage & operator=(ValueStorage && other) noexcept
{
updatePayload(other.loadPayload());
other.updatePayload({}); // Zero out rhs
return *this;
}

~ValueStorage() noexcept {}

protected:
static bool isAtomic();

/** Get internal type currently occupying the storage. */
InternalType getInternalType() const noexcept
{
switch (auto pd = getPrimaryDiscriminator()) {
Payload payload = loadPayload();
switch (auto pd = getPrimaryDiscriminator(payload[0])) {
case pdUninitialized:
/* Discriminator value of zero is used to distinguish uninitialized values. */
return tUninitialized;
@@ -545,52 +648,61 @@ protected:

void getStorage(NixInt & integer) const noexcept
{
Payload payload = loadPayload();
/* PackedPointerType -> int64_t here is well-formed, since the standard requires
this conversion to follow 2's complement rules. This is just a no-op. */
integer = NixInt(payload[1]);
}

void getStorage(bool & boolean) const noexcept
{
Payload payload = loadPayload();
boolean = payload[1];
}

void getStorage(Null & null) const noexcept {}

void getStorage(NixFloat & fpoint) const noexcept
{
Payload payload = loadPayload();
fpoint = std::bit_cast<NixFloat>(payload[1]);
}

void getStorage(ExternalValueBase *& external) const noexcept
{
Payload payload = loadPayload();
external = std::bit_cast<ExternalValueBase *>(payload[1]);
}

void getStorage(PrimOp *& primOp) const noexcept
{
Payload payload = loadPayload();
primOp = std::bit_cast<PrimOp *>(payload[1]);
}

void getStorage(Bindings *& attrs) const noexcept
{
Payload payload = loadPayload();
attrs = std::bit_cast<Bindings *>(payload[1]);
}

void getStorage(List & list) const noexcept
{
Payload payload = loadPayload();
list.elems = untagPointer<decltype(list.elems)>(payload[0]);
list.size = payload[1];
}

void getStorage(StringWithContext & string) const noexcept
{
Payload payload = loadPayload();
string.context = untagPointer<decltype(string.context)>(payload[0]);
string.c_str = std::bit_cast<const char *>(payload[1]);
}

void getStorage(Path & path) const noexcept
{
Payload payload = loadPayload();
path.accessor = untagPointer<decltype(path.accessor)>(payload[0]);
path.path = std::bit_cast<const char *>(payload[1]);
}
4 changes: 4 additions & 0 deletions src/libexpr/meson.build
@@ -82,6 +82,10 @@ configdata_priv.set(

deps_other += toml11

cpuid = dependency('libcpuid', 'cpuid', required : false)
configdata_priv.set('HAVE_LIBCPUID', cpuid.found().to_int())
deps_private += cpuid

config_priv_h = configure_file(
configuration : configdata_priv,
output : 'expr-config-private.hh',
4 changes: 3 additions & 1 deletion src/libexpr/package.nix
@@ -14,6 +14,7 @@
boehmgc,
nlohmann_json,
toml11,
libcpuid,

# Configuration Options

@@ -64,7 +65,8 @@ mkMesonLibrary (finalAttrs: {

buildInputs = [
toml11
];
]
++ lib.optional stdenv.hostPlatform.isx86_64 libcpuid;

propagatedBuildInputs = [
nix-util
21 changes: 21 additions & 0 deletions src/libexpr/value.cc
@@ -1,5 +1,11 @@
#include "nix/expr/value.hh"

#include "expr-config-private.hh"

#if HAVE_LIBCPUID
# include <libcpuid/libcpuid.h>
#endif

namespace nix {

Value Value::vEmptyList = []() {
@@ -26,4 +32,19 @@ Value Value::vFalse = []() {
return res;
}();

template<>
bool ValueStorage<8>::isAtomic()
{
#if HAVE_LIBCPUID
struct cpu_id_t data;

if (cpu_identify(NULL, &data) < 0)
return false;

return data.flags[CPU_FEATURE_AVX];
#else
return false; // Can't tell
#endif
}

} // namespace nix
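A hypothetical consumer of this check (not in the diff; the helper name and policy are made up for illustration) might gate parallel evaluation at startup:

    // Hypothetical glue, not part of this PR: valueStorageIsAtomic() stands
    // in for ValueStorage<8>::isAtomic(), which is protected in the real
    // class. Without AVX, 16-byte loads/stores are not guaranteed atomic,
    // so fall back to a single evaluation thread.
    bool valueStorageIsAtomic();

    int chooseEvalThreads(int requested)
    {
        return (requested > 1 && !valueStorageIsAtomic()) ? 1 : requested;
    }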