On Intel Lunar Lake, performance collapses when TBLIS 2.0 beta oversubscribes threads. Profiling shows most of the execution time is spent in BLIS's bli_thrcomm_barrier_atomic.
TBLIS 1.3 does not suffer from the same issue.
In both cases:
-- mutex type selected: pthread_spinlock
-- barrier type selected: spin_barrier
-- thread model selected: openmp
The haswell kernel was selected in both cases.
File to reproduce: https://github.com/DiamonDinoia/nda/blob/nda-tensor-merge-master/benchmarks/tensor_contract_tblis.cpp
PS: my educated guess is that the spinlock sits on the same cache line as some other data, causing false sharing.
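
To make the guess concrete, here is a minimal sketch of what I mean. It is not the BLIS code: UnpaddedBarrier, PaddedBarrier, barrier_wait and run_rounds are hypothetical names, the 64-byte cache line is an assumption, and the barrier is a generic sense-reversing spin barrier written for illustration. In the unpadded layout the barrier words share a line with unrelated hot data, so every write to that data invalidates the line the waiters are spinning on; the padded layout gives each field its own line. The wait loop also yields while spinning, which is one possible way to keep an oversubscribed run from burning whole time slices on waiters.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines (typical for x86-64, not queried from hardware).
constexpr std::size_t kCacheLine = 64;

// Hypothetical layout where the barrier words share a cache line with hot data.
struct UnpaddedBarrier {
    std::atomic<int> arrived{0};
    std::atomic<int> sense{0};
    long hot_data = 0;  // written frequently by one thread
};

// Hypothetical layout where every field gets its own cache line.
struct PaddedBarrier {
    alignas(kCacheLine) std::atomic<int> arrived{0};
    alignas(kCacheLine) std::atomic<int> sense{0};
    alignas(kCacheLine) long hot_data = 0;
};

static_assert(sizeof(UnpaddedBarrier) <= kCacheLine, "all fields fit in one line");
static_assert(sizeof(PaddedBarrier) == 3 * kCacheLine, "one line per field");

// Generic sense-reversing spin barrier (illustrative, not the BLIS implementation).
template <class Barrier>
void barrier_wait(Barrier& b, int n_threads) {
    const int my_sense = b.sense.load(std::memory_order_acquire);
    if (b.arrived.fetch_add(1, std::memory_order_acq_rel) == n_threads - 1) {
        // Last thread to arrive: reset the counter, then release the waiters.
        b.arrived.store(0, std::memory_order_relaxed);
        b.sense.store(my_sense ^ 1, std::memory_order_release);
    } else {
        // Yield while spinning so that, with more threads than cores,
        // the thread that still has to arrive can get scheduled.
        while (b.sense.load(std::memory_order_acquire) == my_sense)
            std::this_thread::yield();
    }
}

// Runs `rounds` barrier phases on `n_threads` threads and returns how many
// times a thread left the barrier before all threads had entered it.
template <class Barrier>
int run_rounds(int n_threads, int rounds) {
    assert(n_threads > 0);
    Barrier b;
    std::atomic<int> counter{0};
    std::atomic<int> violations{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < n_threads; ++t) {
        pool.emplace_back([&, t] {
            for (int r = 0; r < rounds; ++r) {
                if (t == 0) ++b.hot_data;  // single writer touching the hot field
                counter.fetch_add(1);
                barrier_wait(b, n_threads);
                if (counter.load() < n_threads * (r + 1)) violations.fetch_add(1);
            }
        });
    }
    for (auto& th : pool) th.join();
    return violations.load();
}
```

Timing run_rounds<UnpaddedBarrier> against run_rounds<PaddedBarrier> with more threads than cores would show whether the layout matters on a given machine. I have not measured this, so treat it as a way to test the hypothesis rather than a confirmed diagnosis.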