-
Notifications
You must be signed in to change notification settings - Fork 37
Description
Hello again,
I've been playing around with the new version of TBLIS and, to my surprise, observed a substantial performance drop between its old version and the new one.
Here's the piece of code I've been using for my benchmarks:
#include <iostream>
#include <chrono>
#include <random>
#include "omp.h"
#define TBLIS_ENABLE_COMPAT 1
#include "tblis/frame/base/aligned_allocator.hpp"
#include "tblis/tblis.h"
//Function to set TBLIS thread count
extern "C" {
void tblis_set_num_threads(unsigned num_threads);
unsigned tblis_get_num_threads();
}
using Clock = std::chrono::steady_clock;
using Seconds = std::chrono::duration<double>;
// Aligned allocator for better SIMD/cacheline behavior
using Al64 = tblis::aligned_allocator<double, 64>;
// Convenient alias
using Tensor = tblis::tensor<double, Al64>;
int main() {
// Force TBLIS to use all threads ( if needed ?)
//tblis_set_num_threads(omp_get_max_threads());
std::cout << "TBLIS using " << tblis_get_num_threads() << " threads\n";
// Dimensions
const long Bn = 40; // batch b
const long I = 80; // i
const long J = 100; // j
const long K = 60; // k
const long P = 60; // p
const long Q = 80; // q
const long R = 40; // r
// Declare the tensors
Tensor A({Bn, I, K, P}, MArray::COLUMN_MAJOR); // A[b,i,k,p]
Tensor B({Bn, K, J, Q, R}, MArray::COLUMN_MAJOR); // B[b,k,j,q,r]
Tensor C({P, Q, R}, MArray::COLUMN_MAJOR); // C[p,q,r]
Tensor BC({Bn, K, J, P}, MArray::uninitialized, MArray::COLUMN_MAJOR); // temp: BC[b,k,j,p]
Tensor D ({Bn, I, J, P}, MArray::uninitialized, MArray::COLUMN_MAJOR); // result: D[b,i,j,p]
// Fill A,B,C with random doubles
#pragma omp parallel
{
std::mt19937_64 rng(42 + omp_get_thread_num()*1337);
std::uniform_real_distribution<double> dist(0.0,1.0);
#pragma omp for
for (std::size_t t=0; t<A.size(); ++t) A.data()[t]=dist(rng);
#pragma omp for
for (std::size_t t=0; t<B.size(); ++t) B.data()[t]=dist(rng);
#pragma omp for
for (std::size_t t=0; t<C.size(); ++t) C.data()[t]=dist(rng);
}
double total_time_TBLIS = 0;
int MaxIter = 40;
// Loop for average evaluation time
for(int i=0; i<MaxIter;i++){
auto t0 = Clock::now();
// Contract B and C over (q,r): BC[b,k,j,p] = sum_{q,r} B[b,k,j,q,r] * C[p,q,r]
tblis::mult<double>(1.0,
B, "bkjqr",
C, "pqr",
0.0,
BC, "bkjp");
//Contract A and BC over (k): D[b,i,j,p] = sum_{k} A[b,i,k,p] * BC[b,k,j,p]
tblis::mult<double>(1.0,
A, "bikp",
BC, "bkjp",
0.0,
D, "bijp");
auto t1 = Clock::now();
total_time_TBLIS += Seconds(t1 - t0).count();
}
double time_TBLIS = total_time_TBLIS/MaxIter;
std::cout << "TBLIS time average: " << time_TBLIS << " s\n";
return 0;
}I've been running this code on two different machines:
Mac
OS: macOS Sequoia 15.5
Chip: Apple M3 Pro
Total Number of Cores: 11 (5 performance and 6 efficiency)
Memory: 18 GB
Linux
OS: Ubuntu 24.04. 3 LTS
Chip: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Total Number of Cores: 8
Memory: 16 GB
So one ARM and one x86 architecture. Here's the result of my benchmarks:
| Version | Arch | Threads | Average Time (s) |
|---|---|---|---|
| old | ARM | 11 | 0.734512 |
| new | ARM | 11 | 2.92892 |
| old | x86 | 4 | 1.75475 |
| new | x86 | 4 | 2.21599 |
From my understanding, the former version of TBLIS didn't come with microkernels for ARM architectures. While one of the advantages of the newer version is to use BLIS' plugins to generates those microkernels. I was therefore expecting an interesting performance gain on my Mac machine but the actual opposite seems to happen.
On my Mac the new version is about 4 times slower while on my linux machine it's still 1.3 times slower.
Is this an abnormal behavior or a misunderstanding from me ?
Both version have been install without specifying dependancies (letting the ./configure download and build BLIS), with GCC-14 on Linux and GCC-15 on Mac.
Old version:
commit 4de1919dfe194f5e47dfc93660ce4206d8e12c4e (HEAD -> master, origin/master, origin/HEAD)
Merge: 7d979b6 7436875
Author: Devin Matthews <damatthews@smu.edu>
Date: Sat Apr 22 14:57:23 2023 -0500
New version:
commit 9b95712966cb8804be51c62bfd6207957f62bc6f (HEAD -> develop, origin/develop, origin/HEAD)
Author: Christopher Hillenbrand <chillenb.lists@gmail.com>
Date: Mon Aug 18 22:16:35 2025 -0400
Quick word on threading
I observed that by default TBLIS only uses 4 threads out of the 8 available on my Intel chip. If I try to manually constraint the library to use all the threads: tblis_set_num_threads(omp_get_max_threads()); the average evaluation time remains the same.
| Version | Arch | Threads | Average Time (s) |
|---|---|---|---|
| new | x86 | 4 | 2.21599 |
| new | x86 | 8 | 2.16684 |
Thank you again for all your work !