Substantial drop in performance with new version, especially on ARM

Hello again,

I've been playing around with the new version of TBLIS and, to my surprise, observed a substantial performance drop between its old version and the new one. 

Here's the piece of code I've been using for my benchmarks:

```cpp
#include <iostream>
#include <chrono>
#include <random>

#include "omp.h"

#define TBLIS_ENABLE_COMPAT 1

#include "tblis/frame/base/aligned_allocator.hpp"

#include "tblis/tblis.h"

//Function to set TBLIS thread count
extern "C" {
    void tblis_set_num_threads(unsigned num_threads);
    unsigned tblis_get_num_threads();
}

using Clock = std::chrono::steady_clock;
using Seconds = std::chrono::duration<double>;

// Aligned allocator for better SIMD/cacheline behavior
using Al64 = tblis::aligned_allocator<double, 64>;

// Convenient alias
using Tensor = tblis::tensor<double, Al64>;


int main() {
    
    // Force TBLIS to use all threads ( if needed ?)
    //tblis_set_num_threads(omp_get_max_threads());
    std::cout << "TBLIS using " << tblis_get_num_threads() << " threads\n";
    
    // Dimensions
    const long Bn = 40;   // batch b
    const long I  = 80;   // i
    const long J  = 100;  // j
    const long K  = 60;   // k
    const long P  = 60;   // p
    const long Q  = 80;   // q
    const long R  = 40;   // r
    
    // Declare the tensors
    Tensor A({Bn, I, K, P}, MArray::COLUMN_MAJOR);                              // A[b,i,k,p]
    Tensor B({Bn, K, J, Q, R}, MArray::COLUMN_MAJOR);                           // B[b,k,j,q,r]
    Tensor C({P,  Q, R}, MArray::COLUMN_MAJOR);                                 // C[p,q,r]
    Tensor BC({Bn, K, J, P}, MArray::uninitialized, MArray::COLUMN_MAJOR);      // temp: BC[b,k,j,p]
    Tensor D ({Bn, I, J, P}, MArray::uninitialized, MArray::COLUMN_MAJOR);      // result: D[b,i,j,p]
    
    
    // Fill A,B,C with random doubles
#pragma omp parallel
    {
        std::mt19937_64 rng(42 + omp_get_thread_num()*1337);
        std::uniform_real_distribution<double> dist(0.0,1.0);
#pragma omp for
        for (std::size_t t=0; t<A.size(); ++t) A.data()[t]=dist(rng);
#pragma omp for
        for (std::size_t t=0; t<B.size(); ++t) B.data()[t]=dist(rng);
#pragma omp for
        for (std::size_t t=0; t<C.size(); ++t) C.data()[t]=dist(rng);
    }
    
    double total_time_TBLIS = 0;
    
    int MaxIter = 40;
    
    // Loop for average evaluation time
    for(int i=0; i<MaxIter;i++){
        
        auto t0 = Clock::now();
        
        // Contract B and C over (q,r):  BC[b,k,j,p] = sum_{q,r} B[b,k,j,q,r] * C[p,q,r]
        tblis::mult<double>(1.0,
                            B,  "bkjqr",
                            C,  "pqr",
                            0.0,
                            BC, "bkjp");
        
        //Contract A and BC over (k):   D[b,i,j,p]  = sum_{k}   A[b,i,k,p]   * BC[b,k,j,p]
        tblis::mult<double>(1.0,
                            A,  "bikp",
                            BC, "bkjp",
                            0.0,
                            D,  "bijp");
        
        auto t1 = Clock::now();
        
        total_time_TBLIS += Seconds(t1 - t0).count();
    }
    
    double time_TBLIS = total_time_TBLIS/MaxIter;
    
    std::cout << "TBLIS time average: " << time_TBLIS << " s\n";
    
    return 0;
}
``` 

I've been running this code on two different machines:
#### Mac
OS: macOS Sequoia 15.5
Chip: Apple M3 Pro
Total Number of Cores: 11 (5 performance and 6 efficiency)
Memory: 18 GB

#### Linux
OS: Ubuntu 24.04. 3 LTS
Chip: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Total Number of Cores: 8
Memory: 16 GB

So one ARM and one x86 architecture. Here's the result of my benchmarks:

<div align="center">

| Version | Arch | Threads|Average Time (s) |
|---------|---------|---------|----------|
| old     | ARM       | 11|0.734512    |
| new     | ARM       | 11|2.92892     |
| old     | x86       |4| 1.75475    |
| new     | x86       | 4|2.21599     |

</div>

From my understanding, the former version of TBLIS didn't come with microkernels for ARM architectures. While one of the advantages of the newer version is to use BLIS' plugins to generates those microkernels. I was therefore expecting an interesting performance gain on my Mac machine but the actual opposite seems to happen.
On my Mac the new version is about 4 times slower while on my linux machine it's still 1.3 times slower.

### **Is this an abnormal behavior or a misunderstanding from me ?**

Both version have been install without specifying dependancies (letting the `./configure` download and build BLIS), with GCC-14 on Linux and GCC-15 on Mac.

#### Old version:
```text
commit 4de1919dfe194f5e47dfc93660ce4206d8e12c4e (HEAD -> master, origin/master, origin/HEAD)
Merge: 7d979b6 7436875
Author: Devin Matthews <damatthews@smu.edu>
Date:   Sat Apr 22 14:57:23 2023 -0500
```
#### New version:
```text
commit 9b95712966cb8804be51c62bfd6207957f62bc6f (HEAD -> develop, origin/develop, origin/HEAD)
Author: Christopher Hillenbrand <chillenb.lists@gmail.com>
Date:   Mon Aug 18 22:16:35 2025 -0400
```

## Quick word on threading
I observed that by default TBLIS only uses 4 threads out of the 8 available on my Intel chip. If I try to manually constraint the library to use all the threads: `tblis_set_num_threads(omp_get_max_threads());` the average evaluation time remains the same.
<div align="center">

| Version | Arch | Threads|Average Time (s) |
|---------|---------|---------|----------|
| new     | x86       |4| 2.21599    |
| new     | x86       | 8|2.16684     |

</div>
I tried to do the same analysis with smaller dimensions (to stay away from memory-bandwidth issues) and observed the same phenomenon. Can someone explain to me what is happening here ?

Thank you again for all your work !

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Substantial drop in performance with new version, especially on ARM #76

Mac

Linux

Is this an abnormal behavior or a misunderstanding from me ?

Old version:

New version:

Quick word on threading

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Substantial drop in performance with new version, especially on ARM #76

Description

Mac

Linux

Is this an abnormal behavior or a misunderstanding from me ?

Old version:

New version:

Quick word on threading

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions