Skip to content

Substantial drop in performance with new version, especially on ARM #76

@JMartinon

Description

@JMartinon

Hello again,

I've been playing around with the new version of TBLIS and, to my surprise, observed a substantial performance drop between its old version and the new one.

Here's the piece of code I've been using for my benchmarks:

#include <iostream>
#include <chrono>
#include <random>

#include "omp.h"

#define TBLIS_ENABLE_COMPAT 1

#include "tblis/frame/base/aligned_allocator.hpp"

#include "tblis/tblis.h"

//Function to set TBLIS thread count
extern "C" {
    void tblis_set_num_threads(unsigned num_threads);
    unsigned tblis_get_num_threads();
}

using Clock = std::chrono::steady_clock;
using Seconds = std::chrono::duration<double>;

// Aligned allocator for better SIMD/cacheline behavior
using Al64 = tblis::aligned_allocator<double, 64>;

// Convenient alias
using Tensor = tblis::tensor<double, Al64>;


int main() {
    
    // Force TBLIS to use all threads ( if needed ?)
    //tblis_set_num_threads(omp_get_max_threads());
    std::cout << "TBLIS using " << tblis_get_num_threads() << " threads\n";
    
    // Dimensions
    const long Bn = 40;   // batch b
    const long I  = 80;   // i
    const long J  = 100;  // j
    const long K  = 60;   // k
    const long P  = 60;   // p
    const long Q  = 80;   // q
    const long R  = 40;   // r
    
    // Declare the tensors
    Tensor A({Bn, I, K, P}, MArray::COLUMN_MAJOR);                              // A[b,i,k,p]
    Tensor B({Bn, K, J, Q, R}, MArray::COLUMN_MAJOR);                           // B[b,k,j,q,r]
    Tensor C({P,  Q, R}, MArray::COLUMN_MAJOR);                                 // C[p,q,r]
    Tensor BC({Bn, K, J, P}, MArray::uninitialized, MArray::COLUMN_MAJOR);      // temp: BC[b,k,j,p]
    Tensor D ({Bn, I, J, P}, MArray::uninitialized, MArray::COLUMN_MAJOR);      // result: D[b,i,j,p]
    
    
    // Fill A,B,C with random doubles
#pragma omp parallel
    {
        std::mt19937_64 rng(42 + omp_get_thread_num()*1337);
        std::uniform_real_distribution<double> dist(0.0,1.0);
#pragma omp for
        for (std::size_t t=0; t<A.size(); ++t) A.data()[t]=dist(rng);
#pragma omp for
        for (std::size_t t=0; t<B.size(); ++t) B.data()[t]=dist(rng);
#pragma omp for
        for (std::size_t t=0; t<C.size(); ++t) C.data()[t]=dist(rng);
    }
    
    double total_time_TBLIS = 0;
    
    int MaxIter = 40;
    
    // Loop for average evaluation time
    for(int i=0; i<MaxIter;i++){
        
        auto t0 = Clock::now();
        
        // Contract B and C over (q,r):  BC[b,k,j,p] = sum_{q,r} B[b,k,j,q,r] * C[p,q,r]
        tblis::mult<double>(1.0,
                            B,  "bkjqr",
                            C,  "pqr",
                            0.0,
                            BC, "bkjp");
        
        //Contract A and BC over (k):   D[b,i,j,p]  = sum_{k}   A[b,i,k,p]   * BC[b,k,j,p]
        tblis::mult<double>(1.0,
                            A,  "bikp",
                            BC, "bkjp",
                            0.0,
                            D,  "bijp");
        
        auto t1 = Clock::now();
        
        total_time_TBLIS += Seconds(t1 - t0).count();
    }
    
    double time_TBLIS = total_time_TBLIS/MaxIter;
    
    std::cout << "TBLIS time average: " << time_TBLIS << " s\n";
    
    return 0;
}

I've been running this code on two different machines:

Mac

OS: macOS Sequoia 15.5
Chip: Apple M3 Pro
Total Number of Cores: 11 (5 performance and 6 efficiency)
Memory: 18 GB

Linux

OS: Ubuntu 24.04. 3 LTS
Chip: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Total Number of Cores: 8
Memory: 16 GB

So one ARM and one x86 architecture. Here's the result of my benchmarks:

Version Arch Threads Average Time (s)
old ARM 11 0.734512
new ARM 11 2.92892
old x86 4 1.75475
new x86 4 2.21599

From my understanding, the former version of TBLIS didn't come with microkernels for ARM architectures. While one of the advantages of the newer version is to use BLIS' plugins to generates those microkernels. I was therefore expecting an interesting performance gain on my Mac machine but the actual opposite seems to happen.
On my Mac the new version is about 4 times slower while on my linux machine it's still 1.3 times slower.

Is this an abnormal behavior or a misunderstanding from me ?

Both version have been install without specifying dependancies (letting the ./configure download and build BLIS), with GCC-14 on Linux and GCC-15 on Mac.

Old version:

commit 4de1919dfe194f5e47dfc93660ce4206d8e12c4e (HEAD -> master, origin/master, origin/HEAD)
Merge: 7d979b6 7436875
Author: Devin Matthews <damatthews@smu.edu>
Date:   Sat Apr 22 14:57:23 2023 -0500

New version:

commit 9b95712966cb8804be51c62bfd6207957f62bc6f (HEAD -> develop, origin/develop, origin/HEAD)
Author: Christopher Hillenbrand <chillenb.lists@gmail.com>
Date:   Mon Aug 18 22:16:35 2025 -0400

Quick word on threading

I observed that by default TBLIS only uses 4 threads out of the 8 available on my Intel chip. If I try to manually constraint the library to use all the threads: tblis_set_num_threads(omp_get_max_threads()); the average evaluation time remains the same.

Version Arch Threads Average Time (s)
new x86 4 2.21599
new x86 8 2.16684
I tried to do the same analysis with smaller dimensions (to stay away from memory-bandwidth issues) and observed the same phenomenon. Can someone explain to me what is happening here ?

Thank you again for all your work !

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions