libexpr: Use 16-byte atomic loads/stores MOVAPS/MOVDQA for ValueStorage #13964
Draft: xokdvium wants to merge 1 commit into NixOS:master from xokdvium:value-double-width-atomics.
Changes from the single commit, in `src/libexpr/include/nix/expr/value.hh`:

```diff
@@ -14,6 +14,10 @@
 
 #include <nlohmann/json_fwd.hpp>
 
+#if defined(__x86_64__) && defined(__SSE2__)
+#  include <emmintrin.h>
+#endif
+
 namespace nix {
 
 struct Value;
@@ -362,6 +366,11 @@ protected:
     {
         return internalType;
     }
+
+    static bool isAtomic()
+    {
+        return false;
+    }
 };
 
 namespace detail {
@@ -395,7 +404,11 @@ class alignas(16) ValueStorage<ptrSize, std::enable_if_t<detail::useBitPackedVal
 
     using PackedPointer = typename PackedPointerTypeStruct<ptrSize>::type;
     using Payload = std::array<PackedPointer, 2>;
-    Payload payload = {};
+#if defined(__x86_64__) && defined(__SSE2__)
+    __m128i payloadWords;
+#else
+    Payload payloadWords = {};
+#endif
 
     static constexpr int discriminatorBits = 3;
     static constexpr PackedPointer discriminatorMask = (PackedPointer(1) << discriminatorBits) - 1;
@@ -439,16 +452,64 @@ class alignas(16) ValueStorage<ptrSize, std::enable_if_t<detail::useBitPackedVal
         pdPairOfPointers, //< layout: Pair of pointers payload
     };
 
+#if defined(__x86_64__) && defined(__SSE2__)
+    /* Why do we even bother with hand-rolling these arch-specific intrinsics
+     * and don't use libatomic directly for 16-byte atomics? Here's why:
+     * - https://gcc.gnu.org/legacy-ml/gcc/2018-02/msg00224.html
+     * - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84563
+     *
+     * Basically, we don't really ever want to go through libatomic. As it
+     * so happens, on x86_64 with AVX, MOVDQA/MOVAPS instructions (16-byte
+     * aligned 128-bit loads and stores) are atomic [^]. Note that these
+     * instructions are not part of AVX but rather SSE2, which is x86_64-v1;
+     * they are just not guaranteed to be atomic without AVX.
+     *
+     * For more details see:
+     * - [^] Intel® 64 and IA-32 Architectures Software Developer's Manual (10.1.1 Guaranteed Atomic Operations).
+     * - https://patchwork.sourceware.org/project/gcc/patch/YhxkfzGEEQ9KHbBC@tucnak/
+     * - https://ibraheem.ca/posts/128-bit-atomics/
+     * - https://rigtorp.se/isatomic/
+     */
+    [[gnu::always_inline]]
+    void updatePayload(Payload payload) noexcept
+    {
+        /* This intrinsic corresponds to MOVAPS. Note that Value (and thus the
+           first member payloadWords) is 16-byte aligned. */
+        _mm_store_si128(&payloadWords, std::bit_cast<__m128i>(payload));
+    }
+
+    [[gnu::always_inline]]
+    Payload loadPayload() const noexcept
+    {
+        /* This intrinsic corresponds to MOVDQA. Note that Value (and thus the
+           first member payloadWords) is 16-byte aligned. */
+        __m128i res = _mm_load_si128(&payloadWords);
+        return std::bit_cast<Payload>(res);
+    }
+#else
+    [[gnu::always_inline]]
+    void updatePayload(Payload payload) noexcept
+    {
+        payloadWords = payload;
+    }
+
+    [[gnu::always_inline]]
+    Payload loadPayload() const noexcept
+    {
+        return payloadWords;
+    }
+#endif
+
     template<typename T>
         requires std::is_pointer_v<T>
     static T untagPointer(PackedPointer val) noexcept
     {
         return std::bit_cast<T>(val & ~discriminatorMask);
     }
 
-    PrimaryDiscriminator getPrimaryDiscriminator() const noexcept
+    PrimaryDiscriminator getPrimaryDiscriminator(PackedPointer firstDWord) const noexcept
     {
-        return static_cast<PrimaryDiscriminator>(payload[0] & discriminatorMask);
+        return static_cast<PrimaryDiscriminator>(firstDWord & discriminatorMask);
     }
 
     static void assertAligned(PackedPointer val) noexcept
@@ -459,25 +520,30 @@ class alignas(16) ValueStorage<ptrSize, std::enable_if_t<detail::useBitPackedVal
     template<InternalType type>
     void setSingleDWordPayload(PackedPointer untaggedVal) noexcept
     {
+        Payload payload;
         /* There's plenty of free upper bits in the first dword, which is
           used only for the discriminator. */
         payload[0] = static_cast<int>(pdSingleDWord) | (static_cast<int>(type) << discriminatorBits);
        payload[1] = untaggedVal;
+        updatePayload(payload);
     }
 
     template<PrimaryDiscriminator discriminator, typename T, typename U>
     void setUntaggablePayload(T * firstPtrField, U untaggableField) noexcept
     {
+        Payload payload;
         static_assert(discriminator >= pdListN && discriminator <= pdPath);
         auto firstFieldPayload = std::bit_cast<PackedPointer>(firstPtrField);
         assertAligned(firstFieldPayload);
         payload[0] = static_cast<int>(discriminator) | firstFieldPayload;
         payload[1] = std::bit_cast<PackedPointer>(untaggableField);
+        updatePayload(payload);
     }
 
     template<InternalType type, typename T, typename U>
     void setPairOfPointersPayload(T * firstPtrField, U * secondPtrField) noexcept
     {
+        Payload payload;
         static_assert(type >= tListSmall && type <= tLambda);
         {
             auto firstFieldPayload = std::bit_cast<PackedPointer>(firstPtrField);
@@ -489,21 +555,58 @@ class alignas(16) ValueStorage<ptrSize, std::enable_if_t<detail::useBitPackedVal
             assertAligned(secondFieldPayload);
             payload[1] = (type - tListSmall) | secondFieldPayload;
         }
+        updatePayload(payload);
     }
 
     template<typename T, typename U>
         requires std::is_pointer_v<T> && std::is_pointer_v<U>
     void getPairOfPointersPayload(T & firstPtrField, U & secondPtrField) const noexcept
     {
+        Payload payload = loadPayload();
         firstPtrField = untagPointer<T>(payload[0]);
         secondPtrField = untagPointer<U>(payload[1]);
     }
 
+public:
+    ValueStorage()
+    {
+        updatePayload({});
+    }
+
+    ValueStorage(const ValueStorage & other)
+    {
+        updatePayload(other.loadPayload());
+    }
+
+    ValueStorage & operator=(const ValueStorage & other)
+    {
+        updatePayload(other.loadPayload());
+        return *this;
+    }
+
+    ValueStorage(ValueStorage && other) noexcept
+    {
+        updatePayload(other.loadPayload());
+        other.updatePayload({}); // Zero out rhs
+    }
+
+    ValueStorage & operator=(ValueStorage && other) noexcept
+    {
+        updatePayload(other.loadPayload());
+        other.updatePayload({}); // Zero out rhs
+        return *this;
+    }
+
+    ~ValueStorage() noexcept {}
+
 protected:
+    static bool isAtomic();
+
     /** Get internal type currently occupying the storage. */
     InternalType getInternalType() const noexcept
     {
-        switch (auto pd = getPrimaryDiscriminator()) {
+        Payload payload = loadPayload();
+        switch (auto pd = getPrimaryDiscriminator(payload[0])) {
         case pdUninitialized:
             /* Discriminator value of zero is used to distinguish uninitialized values. */
             return tUninitialized;
@@ -545,52 +648,61 @@ protected:
 
     void getStorage(NixInt & integer) const noexcept
     {
+        Payload payload = loadPayload();
         /* PackedPointerType -> int64_t here is well-formed, since the standard requires
           this conversion to follow 2's complement rules. This is just a no-op. */
         integer = NixInt(payload[1]);
     }
 
     void getStorage(bool & boolean) const noexcept
     {
+        Payload payload = loadPayload();
         boolean = payload[1];
     }
 
     void getStorage(Null & null) const noexcept {}
 
     void getStorage(NixFloat & fpoint) const noexcept
     {
+        Payload payload = loadPayload();
         fpoint = std::bit_cast<NixFloat>(payload[1]);
     }
 
     void getStorage(ExternalValueBase *& external) const noexcept
     {
+        Payload payload = loadPayload();
         external = std::bit_cast<ExternalValueBase *>(payload[1]);
     }
 
     void getStorage(PrimOp *& primOp) const noexcept
     {
+        Payload payload = loadPayload();
         primOp = std::bit_cast<PrimOp *>(payload[1]);
     }
 
     void getStorage(Bindings *& attrs) const noexcept
     {
+        Payload payload = loadPayload();
         attrs = std::bit_cast<Bindings *>(payload[1]);
     }
 
     void getStorage(List & list) const noexcept
     {
+        Payload payload = loadPayload();
         list.elems = untagPointer<decltype(list.elems)>(payload[0]);
         list.size = payload[1];
     }
 
     void getStorage(StringWithContext & string) const noexcept
     {
+        Payload payload = loadPayload();
         string.context = untagPointer<decltype(string.context)>(payload[0]);
         string.c_str = std::bit_cast<const char *>(payload[1]);
     }
 
     void getStorage(Path & path) const noexcept
     {
+        Payload payload = loadPayload();
         path.accessor = untagPointer<decltype(path.accessor)>(payload[0]);
         path.path = std::bit_cast<const char *>(payload[1]);
     }
```

(A review comment attached to the Intel SDM citation above notes that §10.1.1 is found in Volume 3A.)
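To make the pattern concrete, here is a minimal standalone sketch (not code from the PR; `TwoWordCell` and its members are invented for illustration) of the same aligned 16-byte load/store idea, combined with the compiler-only fences discussed in the review thread below. It assumes an x86_64 target where AVX is known to be present; the atomicity of the aligned 16-byte access is a hardware guarantee, not something the C++ memory model promises for intrinsics.

```cpp
#include <array>
#include <atomic>
#include <bit>
#include <cstdint>
#include <emmintrin.h>

// Illustrative only: a 16-byte-aligned two-word cell read and written with
// single aligned 128-bit accesses (MOVAPS/MOVDQA), which are atomic on
// x86_64 CPUs with AVX per Intel SDM Vol. 3A §10.1.1.
struct alignas(16) TwoWordCell
{
    using Payload = std::array<std::uint64_t, 2>;

    __m128i words; // deliberately uninitialized, as in the PR's payloadWords

    void store(Payload p) noexcept
    {
        // Release fence *before* the store: keeps earlier writes from being
        // reordered after it. On x86 this is a compiler-only barrier.
        std::atomic_thread_fence(std::memory_order_release);
        _mm_store_si128(&words, std::bit_cast<__m128i>(p)); // MOVAPS/MOVDQA
    }

    Payload load() const noexcept
    {
        __m128i v = _mm_load_si128(&words); // MOVDQA, atomic with AVX
        // Acquire fence *after* the load: keeps later reads from being
        // reordered before it.
        std::atomic_thread_fence(std::memory_order_acquire);
        return std::bit_cast<Payload>(v);
    }
};
```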
Reviewer (@NaN-git): I don't understand the optimization done by this PR.

AFAIK there is no performance difference between MOVDQA and MOVDQU on modern hardware if the memory is aligned, and there won't be any unaligned accesses here.

I doubt that atomic stores are sufficient in the multi-threaded evaluator; acquire/release semantics are required. I don't think there is any way around something like LOCK CMPXCHG16B (maybe for writes), while MOVDQA could be used for reads.

Is it guaranteed that the compiler cannot reorder memory accesses across the intrinsics?
Author (@xokdvium): Indeed, libatomic uses an mfence for stores. I was waiting for feedback from @edolstra on what would be required for the parallel eval branch. Yes, DCAS is also necessary for thunk locking.

Basically, the whole idea of this patch is to show that this hackery (https://github.com/DeterminateSystems/nix-src/blob/766f43aa6acb1b3578db488c19fbbedf04ed9f24/src/libexpr/include/nix/expr/value.hh#L526-L528, https://github.com/DeterminateSystems/nix-src/blob/766f43aa6acb1b3578db488c19fbbedf04ed9f24/src/libexpr/include/nix/expr/value.hh#L471-L472) can be avoided. Instead of splitting the ValueStorage in this dubious way, we'd have CMPXCHG16B in place of https://github.com/DeterminateSystems/nix-src/blob/766f43aa6acb1b3578db488c19fbbedf04ed9f24/src/libexpr/include/nix/expr/eval-inline.hh#L106-L113.
Author (@xokdvium): This isn't really an optimization; it's more about laying groundwork to facilitate parallel eval. It's missing some pieces, like release semantics for stores, which would really just be achieved with __atomic_compare_exchange in the places where thunks get updated. So updatePayload doesn't really need to have release semantics, but the atomic load is crucial and can't be achieved by other means: standard atomics go through libatomic's IFUNCs, which is really bad for performance, as noted by Eelco.
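A hypothetical sketch of what such a thunk update could look like (not from this PR or the parallel eval branch; `Slot` and `tryPublish` are invented names):

```cpp
#include <cstdint>

// Hypothetical 16-byte slot; 16-byte alignment is required for CMPXCHG16B.
struct alignas(16) Slot
{
    unsigned __int128 bits;
};

// Publish `desired` only if the slot still holds `expected`; on failure,
// `expected` is updated with the value actually observed. With -mcx16,
// Clang lowers this to LOCK CMPXCHG16B; GCC may still emit a libatomic
// call for 16-byte types (the very issue cited in the diff comment).
inline bool tryPublish(Slot & slot, unsigned __int128 & expected, unsigned __int128 desired) noexcept
{
    return __atomic_compare_exchange_n(
        &slot.bits,
        &expected,
        desired,
        /* weak = */ false,
        __ATOMIC_ACQ_REL,   // success: release the published payload
        __ATOMIC_ACQUIRE);  // failure: acquire the observed payload
}
```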
Reviewer: Compiler builtins could be used, e.g. std::atomic<unsigned __int128>, but this produces function calls.
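For illustration (a trivial snippet, not from the PR): on x86_64 with GCC or Clang, the load below is typically lowered to a libatomic call such as __atomic_load_16 rather than an inline 16-byte load.

```cpp
#include <atomic>

std::atomic<unsigned __int128> cell{};

unsigned __int128 read()
{
    // Typically a call into libatomic (dispatched via IFUNC on glibc
    // targets) rather than a single inline MOVDQA.
    return cell.load(std::memory_order_acquire);
}
```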
Author (@xokdvium): That's what I mention above, yes. The default implementation of 16-byte atomics is useless.
Reviewer (@NaN-git): Indeed, no LOCK CMPXCHG16B or mfence instruction is needed, because MOVDQA pairs have acquire/release semantics when the feature flag CPUID.01H:ECX.AVX[bit 28] is set and memory_order_seq_cst semantics are not required. Thus, dynamic dispatch is required because of older CPUs.

Hence your implementation is OK from the hardware point of view, except that the compiler needs to know about this, i.e. that no memory accesses are reordered across the atomic memory accesses and that the atomic memory accesses aren't optimized out.
Author (@xokdvium): Yeah, the idea is to not support parallel eval without AVX. Even a 10-year-old potato like Ivy Bridge has it, so it's not a big deal.

We can plonk down a std::atomic_thread_fence(std::memory_order_acq_rel). On x86 that should be purely an optimization barrier, without any instructions being emitted.
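A small illustration of that claim (illustrative, not from the PR): on x86-64, the function below compiles to no fence instruction at all, whereas a seq_cst fence would emit an MFENCE.

```cpp
#include <atomic>

void compilerOnlyBarrier()
{
    // On x86-64 this is purely a compiler reordering barrier; the strong
    // hardware memory model already provides acquire/release ordering.
    std::atomic_thread_fence(std::memory_order_acq_rel);
}
```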
Author (@xokdvium): Thanks @NaN-git, by the way; your expertise in x86 and memory models is much appreciated.
Reviewer: AVX specifically? Put another way, would recent feature-full versions of ARM processors (Graviton, Apple's M-series, etc.) be supported?