If you have ever doubted the raw power and speed of compiled systems languages, try running a 400x400 matrix multiplication benchmark across different environments. The results speak for themselves.
Recently, I ran exactly this test—multiplying large matrices in Python, C++, and C. The results were a stark reminder of how language architecture impacts execution time:
Python: ~19.86 seconds
C++: ~1.01 seconds
C: ~0.37 seconds
A nearly 20-second wait in Python versus a fraction of a second in C and C++. It’s a classic demonstration of performance tiers, but it also raises a fascinating question: Why is C still beating C++ by a factor of three in this scenario?
Let’s break down the engineering reality behind these numbers.
Python: Interpreted vs. Compiled
Python is a fantastic language for rapid prototyping, data science, and scripting. However, when it comes to nested loops running heavy math, standard Python hits a bottleneck.
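Because standard Python (CPython) compiles source to bytecode and then interprets that bytecode one instruction at a time, every loop iteration pays interpreter overhead. Furthermore, Python is dynamically typed. Every time Python executes an operation like A[i][j], it has to look up the object's type at runtime, dispatch to the right indexing and arithmetic routines, update reference counts, and often allocate a new integer object for the result.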
C and C++, by contrast, are statically typed and compiled directly down to machine code before execution. The CPU knows exactly how many bytes an integer requires and executes raw instructions at full hardware speed.
(Note: This is exactly why heavy data science tools in Python use libraries like NumPy, whose core routines are written in C under the hood!)
C vs. C++: It’s Not the Compiler
When looking at the benchmark, a common piece of misinformation often pops up: "C is faster than C++ because C compiles directly to machine code, while C++ has to convert to assembly first."
This is entirely false. Both C and C++ compile using the exact same pipeline. Whether you use gcc or g++, the compiler translates your code into an Intermediate Representation, optimizes it, generates assembly code tailored to your CPU architecture, and outputs a binary file of 1s and 0s.
If you write identical raw loops and arrays in C and C++, the compiler will often output the exact same machine code, resulting in identical speeds.
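To make "identical raw loops and arrays" concrete, here is a minimal sketch written in the subset shared by both languages, so the same file should build with either gcc (as C) or g++ (as C++). The function name `matmul_raw` and the size `N = 4` are placeholders chosen for illustration, not taken from the original benchmark.

```cpp
#include <stddef.h>

enum { N = 4 }; /* placeholder size; the benchmark in this post used 400 */

/* Naive triple-loop multiply on fixed-size 2D arrays: C = A * B.
   Nothing here is C++-specific, so both compilers see the same loops. */
void matmul_raw(long long A[N][N], long long B[N][N], long long C[N][N]) {
    for (size_t i = 0; i < N; ++i) {
        for (size_t j = 0; j < N; ++j) {
            long long sum = 0;
            for (size_t k = 0; k < N; ++k) {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
```

Comparing the assembly that each compiler emits for this function at the same optimization level is a simple way to check the claim on your own machine.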
The Real Reason: Memory Layout and CPU Caching
The real reason for the speed gap between C (0.37s) and C++ (1.01s) comes down to how data structures manage your computer’s RAM.
In the C++ benchmark code, the matrices were allocated using nested vectors:
C++
std::vector<std::vector<long long>> A;
While std::vector is convenient and safe, a vector-of-vectors creates a fragmented layout in memory. The outer vector holds an array of inner vector objects, each of which points to its own separately heap-allocated row, so consecutive rows are not guaranteed to sit next to each other in RAM.
When your CPU performs matrix multiplication, it relies heavily on its fast L1 and L2 caches. Memory is pulled into those caches one cache line at a time (typically 64 bytes), and the hardware prefetcher works best when the data it needs next lies in one continuous block.
In C: The matrix is typically declared as a traditional 2D array (long long A[SIZE][SIZE]), storing the entire matrix as one continuous block of memory. The CPU can stream through it with one address calculation per element and no extra pointer to follow.
In C++ (with nested vectors): The CPU has to constantly jump around to different memory addresses to find the next row. This results in CPU cache misses, which severely throttle performance.
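The layout difference described above can be inspected directly. The sketch below collects the starting address of each row and checks whether every row begins exactly one row-length after the previous one. The helper names (`rows_are_contiguous`, `row_starts_nested`, `row_starts_flat`) are hypothetical, and the flat buffer stands in for a C-style 2D array.

```cpp
#include <cstddef>
#include <vector>

// True when every row starts exactly `cols` elements after the previous one,
// i.e. the whole matrix occupies one unbroken block of memory.
bool rows_are_contiguous(const std::vector<const long long*>& row_starts,
                         std::size_t cols) {
    for (std::size_t r = 1; r < row_starts.size(); ++r) {
        if (row_starts[r] != row_starts[r - 1] + cols) return false;
    }
    return true;
}

// Row start addresses for a vector-of-vectors: one heap allocation per row.
std::vector<const long long*> row_starts_nested(
        const std::vector<std::vector<long long>>& m) {
    std::vector<const long long*> starts;
    for (const auto& row : m) starts.push_back(row.data());
    return starts;
}

// Row start addresses for a flat buffer that stores its rows back to back.
std::vector<const long long*> row_starts_flat(const std::vector<long long>& m,
                                              std::size_t rows,
                                              std::size_t cols) {
    std::vector<const long long*> starts;
    for (std::size_t r = 0; r < rows; ++r) starts.push_back(m.data() + r * cols);
    return starts;
}
```

For the flat buffer the check always passes. For the nested vector it will usually fail, but where each row lands is up to the allocator, so the result is not guaranteed either way.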
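Additionally, constructing a std::vector with a size value-initializes every slot to zero before you even begin your math loops. That is one extra pass over the data that an uninitialized local C array skips, although for a 400x400 matrix it is a small cost next to the 64 million multiply-adds in the main loop.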
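The zero-fill behavior is easy to confirm. This small check, with the hypothetical name `sized_vector_is_zeroed`, builds a sized vector and verifies that every element is already 0 before any user code has written to it.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// std::vector<long long>(n) value-initializes its elements, so every slot
// holds 0 as soon as the constructor returns.
bool sized_vector_is_zeroed(std::size_t n) {
    std::vector<long long> v(n);
    return std::all_of(v.begin(), v.end(),
                       [](long long x) { return x == 0; });
}
```

Calling `sized_vector_is_zeroed(400 * 400)` returns true. A local C array gives no such guarantee, because its contents are indeterminate until you write to them.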
The Takeaway: Control is Performance
C isn't faster than C++ because of how it compiles; it's faster because its limitations force you to write simpler, memory-contiguous code.
C++ gives you incredible abstractions like std::vector, but performance optimization requires understanding the hidden costs of those abstractions. By flattening that nested vector into a single 1D array (std::vector<long long> A(SIZE * SIZE)) and indexing it as A[i * SIZE + j], you force contiguous memory, and the C++ version should land close to the C one when both are built with the same optimization flags.
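A flattened version might look like the sketch below. It assumes square matrices stored row by row, and the function name `matmul_flat` is a placeholder rather than the code used in the original benchmark.

```cpp
#include <cstddef>
#include <vector>

// Multiply two n x n matrices stored row-major in flat vectors: C = A * B.
// Element (i, j) lives at index i * n + j, so each matrix is one contiguous block.
std::vector<long long> matmul_flat(const std::vector<long long>& A,
                                   const std::vector<long long>& B,
                                   std::size_t n) {
    std::vector<long long> C(n * n);  // value-initialized to 0
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            long long sum = 0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;
        }
    }
    return C;
}
```

For example, `matmul_flat({1, 2, 3, 4}, {5, 6, 7, 8}, 2)` yields `{19, 22, 43, 50}`. The loop order is the same naive i-j-k order as a typical benchmark, so only the storage layout changes.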