This issue is to capture some performance analysis of the C implementation versus the Rust implementation, as well as to explore the effect of David Tolnay's proposed improvements.
These tests were run on SmartOS on a single-socket Haswell server (Xeon E3-1270 v3) running at 3.50GHz. All of the tests were run bound to a processor set containing a single core; all were bound to one logical CPU within that core, with the other logical CPU forced to be idle. cpustat was used to gather CPU performance counter data; each number denotes one run with pic0 programmed to that CPU performance counter.
The input file (~30MB compressed) contains 3.9M state changes, and in the default config will generate a ~6MB SVG.
- statemap-gcc is the node.js/C hybrid compiled with GCC 7.3.0.
- %Δ is the percentage delta relative to statemap-gcc.
- statemap-clang is the node.js/C hybrid compiled with clang 6.0.1.
- statemap-rust is the Rust implementation compiled with 1.29.0 (--release) but without any additional optimizations.
- statemap-dtolnay-deser is dtolnay's deserializing optimization, as outlined in his pull request, also compiled with 1.29.0.
- statemap-dtolnay-deser-nostring is dtolnay's deserializing optimization plus his elimination of the transient string allocation used to parse the time field of StatemapInputDatum, also compiled with 1.29.0.
First, and to get it out of the way, here is the GCC-compiled C version relative to the Clang-compiled C version:
| Counter | statemap-gcc | statemap-clang | %Δ |
|---|---|---|---|
| cpu_clk_unhalted.thread_p | 32166437125 | 30479632976 | -5.2% |
| inst_retired.any_p | 49110875829 | 50526767701 | 2.9% |
| cpu_clk_unhalted.ref_p | 918870673 | 870925736 | -5.2% |
| mem_uops_retired.stlb_miss_loads | 8651386 | 8461734 | -2.2% |
| mem_uops_retired.stlb_miss_stores | 268802 | 240486 | -10.5% |
| mem_uops_retired.lock_loads | 7791528 | 6192277 | -20.5% |
| mem_uops_retired.split_loads | 107969 | 121187 | 12.2% |
| mem_uops_retired.split_stores | 196934 | 157650 | -19.9% |
| mem_uops_retired.all_loads | 11977544999 | 11862421792 | -1.0% |
| mem_uops_retired.all_stores | 3911589945 | 3717098293 | -5.0% |
| mem_load_uops_retired.l1_hit | 9337365435 | 9285834500 | -0.6% |
| mem_load_uops_retired.l2_hit | 1205703362 | 1214733389 | 0.7% |
| mem_load_uops_retired.l3_hit | 66771301 | 67133832 | 0.5% |
| mem_load_uops_retired.l1_miss | 1276311911 | 1290571092 | 1.1% |
| mem_load_uops_retired.l2_miss | 69671774 | 70013995 | 0.5% |
| mem_load_uops_retired.l3_miss | 2544750 | 2123972 | -16.5% |
| mem_load_uops_retired.hit_lfb | 1393631815 | 1376606034 | -1.2% |
| mem_load_uops_l3_hit_retired.xsnp_miss | 435 | 547 | 25.7% |
| mem_load_uops_l3_hit_retired.xsnp_hit | 1269 | 1140 | -10.2% |
| mem_load_uops_l3_hit_retired.xsnp_hitm | 820 | 1100 | 34.1% |
| mem_load_uops_l3_hit_retired.xsnp_none | 67846758 | 67620920 | -0.3% |
| mem_load_uops_l3_miss_retired.local_dram | 2543699 | 2082507 | -18.1% |
There is a significant delta here (a 5% improvement in run-time), which appears to be due to better memory behavior rather than fewer instructions (the Clang-compiled version actually retires 1.4B more), with CPI dropping from 0.65 to 0.60.
Now here is the C version relative to the Rust version:
| Counter | statemap-gcc | statemap-rust | %Δ |
|---|---|---|---|
| cpu_clk_unhalted.thread_p | 32166437125 | 23127271226 | -28.1% |
| inst_retired.any_p | 49110875829 | 48752136699 | -0.7% |
| cpu_clk_unhalted.ref_p | 918870673 | 660493684 | -28.1% |
| mem_uops_retired.stlb_miss_loads | 8651386 | 2353178 | -72.8% |
| mem_uops_retired.stlb_miss_stores | 268802 | 1000684 | 272.3% |
| mem_uops_retired.lock_loads | 7791528 | 51737 | -99.3% |
| mem_uops_retired.split_loads | 107969 | 52745125 | 48752.1% |
| mem_uops_retired.split_stores | 196934 | 41814301 | 21132.6% |
| mem_uops_retired.all_loads | 11977544999 | 9035048050 | -24.6% |
| mem_uops_retired.all_stores | 3911589945 | 6627038769 | 69.4% |
| mem_load_uops_retired.l1_hit | 9337365435 | 8756546174 | -6.2% |
| mem_load_uops_retired.l2_hit | 1205703362 | 70967580 | -94.1% |
| mem_load_uops_retired.l3_hit | 66771301 | 33323740 | -50.1% |
| mem_load_uops_retired.l1_miss | 1276311911 | 105524579 | -91.7% |
| mem_load_uops_retired.l2_miss | 69671774 | 34616966 | -50.3% |
| mem_load_uops_retired.l3_miss | 2544750 | 1364435 | -46.4% |
| mem_load_uops_retired.hit_lfb | 1393631815 | 157897686 | -88.7% |
| mem_load_uops_l3_hit_retired.xsnp_miss | 435 | 526 | 20.9% |
| mem_load_uops_l3_hit_retired.xsnp_hit | 1269 | 740 | -41.7% |
| mem_load_uops_l3_hit_retired.xsnp_hitm | 820 | 517 | -37.0% |
| mem_load_uops_l3_hit_retired.xsnp_none | 67846758 | 33376449 | -50.8% |
| mem_load_uops_l3_miss_retired.local_dram | 2543699 | 1301381 | -48.8% |
The Rust version is issuing a remarkably similar number of instructions (within less than one percent!), but with a decidedly different mix: just three quarters of the loads of the C version and (interestingly) many more stores. The CPI drops from 0.65 to 0.47, indicating much better memory behavior -- and indeed the L1 misses, L2 misses and L3 misses are all way down. The L1 hits as an absolute number are actually quite high relative to the loads, giving Rust a 96.9% L1 hit rate versus the C version's 77.9% hit rate. Rust also lives much better in the L2, where it has half the L2 misses of the C version.
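The CPI and L1 hit-rate figures above fall directly out of the raw counters; here is a minimal sketch of the arithmetic, using `cpu_clk_unhalted.thread_p`, `inst_retired.any_p`, `mem_load_uops_retired.l1_hit`, and `mem_uops_retired.all_loads` from the tables:

```rust
// Cycles per instruction: unhalted thread cycles over retired instructions.
fn cpi(cycles: u64, instructions: u64) -> f64 {
    cycles as f64 / instructions as f64
}

// L1 hit rate: retired load uops that hit the L1, over all retired loads.
fn l1_hit_rate(l1_hits: u64, all_loads: u64) -> f64 {
    l1_hits as f64 / all_loads as f64
}

fn main() {
    let c_cpi = cpi(32_166_437_125, 49_110_875_829);    // ~0.65
    let rust_cpi = cpi(23_127_271_226, 48_752_136_699); // ~0.47

    let c_l1 = l1_hit_rate(9_337_365_435, 11_977_544_999);    // ~77.9%
    let rust_l1 = l1_hit_rate(8_756_546_174, 9_035_048_050);  // ~96.9%

    println!("CPI:         C {:.2} vs Rust {:.2}", c_cpi, rust_cpi);
    println!("L1 hit rate: C {:.1}% vs Rust {:.1}%", c_l1 * 100.0, rust_l1 * 100.0);
}
```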
Now for dtolnay's deserializing improvement (that is, leaning on serde_json::StreamDeserializer instead of scanning ahead for the end of the current JSON blob). This improvement ditches the double lexing. We would expect the number of loads to be reduced by roughly the size of the input, but because these happen within the current JSON blob being parsed (and because these blobs by and large fit in the L1), we would expect these to mainly be L1/L2 hits. Here's the data:
| Counter | statemap-rust | %Δ | statemap-dt-deser | %Δ |
|---|---|---|---|---|
| cpu_clk_unhalted.thread_p | 23127271226 | -28.1% | 21390540509 | -33.5% |
| inst_retired.any_p | 48752136699 | -0.7% | 43365716806 | -11.7% |
| cpu_clk_unhalted.ref_p | 660493684 | -28.1% | 610826326 | -33.5% |
| mem_uops_retired.stlb_miss_loads | 2353178 | -72.8% | 2344946 | -72.9% |
| mem_uops_retired.stlb_miss_stores | 1000684 | 272.3% | 1017404 | 278.5% |
| mem_uops_retired.lock_loads | 51737 | -99.3% | 35968 | -99.5% |
| mem_uops_retired.split_loads | 52745125 | 48752.1% | 54414644 | 50298.4% |
| mem_uops_retired.split_stores | 41814301 | 21132.6% | 43441378 | 21958.9% |
| mem_uops_retired.all_loads | 9035048050 | -24.6% | 8736708559 | -27.1% |
| mem_uops_retired.all_stores | 6627038769 | 69.4% | 6576131085 | 68.1% |
| mem_load_uops_retired.l1_hit | 8756546174 | -6.2% | 8469374960 | -9.3% |
| mem_load_uops_retired.l2_hit | 70967580 | -94.1% | 71849934 | -94.0% |
| mem_load_uops_retired.l3_hit | 33323740 | -50.1% | 33118819 | -50.4% |
| mem_load_uops_retired.l1_miss | 105524579 | -91.7% | 106343931 | -91.7% |
| mem_load_uops_retired.l2_miss | 34616966 | -50.3% | 34514347 | -50.5% |
| mem_load_uops_retired.l3_miss | 1364435 | -46.4% | 1222417 | -52.0% |
| mem_load_uops_retired.hit_lfb | 157897686 | -88.7% | 159649530 | -88.5% |
| mem_load_uops_l3_hit_retired.xsnp_miss | 526 | 20.9% | 408 | -6.2% |
| mem_load_uops_l3_hit_retired.xsnp_hit | 740 | -41.7% | 731 | -42.4% |
| mem_load_uops_l3_hit_retired.xsnp_hitm | 517 | -37.0% | 538 | -34.4% |
| mem_load_uops_l3_hit_retired.xsnp_none | 33376449 | -50.8% | 33313705 | -50.9% |
| mem_load_uops_l3_miss_retired.local_dram | 1301381 | -48.8% | 1241504 | -51.2% |
This is roughly what we would expect, though the L3 misses have been reduced by more than we would expect. Note, too, that while the loads have dropped by ~298M, the total retired instruction count has dropped by 5.3B. This is very significant! And the fact that cycles dropped by only 1.7B shows that the instructions we eliminated were (as we expected) well behaved in terms of memory -- they had a CPI of 0.32, which caused our overall CPI to actually rise (from 0.47 to 0.49). All of this is a long way of saying: the double lexing was well behaved with respect to memory, but it was a bunch of unnecessary work, and dtolnay's fix is a really nice win!
The second fix was to avoid String creation for StatemapInputDatum's time member, as it is, after all, going to be turned directly into a u64. We would expect this to eliminate an allocation followed by a deallocation:
| Counter | statemap-dt-deser | %Δ | statemap-dt-deser-nostring | %Δ |
|---|---|---|---|---|
| cpu_clk_unhalted.thread_p | 21390540509 | -33.5% | 21377152084 | -33.5% |
| inst_retired.any_p | 43365716806 | -11.7% | 42850414294 | -12.7% |
| cpu_clk_unhalted.ref_p | 610826326 | -33.5% | 610429113 | -33.6% |
| mem_uops_retired.stlb_miss_loads | 2344946 | -72.9% | 2310256 | -73.3% |
| mem_uops_retired.stlb_miss_stores | 1017404 | 278.5% | 1005312 | 274.0% |
| mem_uops_retired.lock_loads | 35968 | -99.5% | 52584 | -99.3% |
| mem_uops_retired.split_loads | 54414644 | 50298.4% | 40418209 | 37335.0% |
| mem_uops_retired.split_stores | 43441378 | 21958.9% | 36115193 | 18238.7% |
| mem_uops_retired.all_loads | 8736708559 | -27.1% | 8592065288 | -28.3% |
| mem_uops_retired.all_stores | 6576131085 | 68.1% | 6395531505 | 63.5% |
| mem_load_uops_retired.l1_hit | 8469374960 | -9.3% | 8320089097 | -10.9% |
| mem_load_uops_retired.l2_hit | 71849934 | -94.0% | 73450488 | -93.9% |
| mem_load_uops_retired.l3_hit | 33118819 | -50.4% | 34009737 | -49.1% |
| mem_load_uops_retired.l1_miss | 106343931 | -91.7% | 108820312 | -91.5% |
| mem_load_uops_retired.l2_miss | 34514347 | -50.5% | 35343590 | -49.3% |
| mem_load_uops_retired.l3_miss | 1222417 | -52.0% | 1276716 | -49.8% |
| mem_load_uops_retired.hit_lfb | 159649530 | -88.5% | 158752995 | -88.6% |
| mem_load_uops_l3_hit_retired.xsnp_miss | 408 | -6.2% | 247 | -43.2% |
| mem_load_uops_l3_hit_retired.xsnp_hit | 731 | -42.4% | 629 | -50.4% |
| mem_load_uops_l3_hit_retired.xsnp_hitm | 538 | -34.4% | 634 | -22.7% |
| mem_load_uops_l3_hit_retired.xsnp_none | 33313705 | -50.9% | 34020114 | -49.9% |
| mem_load_uops_l3_miss_retired.local_dram | 1241504 | -51.2% | 1368773 | -46.2% |
This is also a win -- 515M fewer instructions, including 144M fewer loads and 180M fewer stores. This win doesn't show up as much in overall wall clock time (where the delta will be in the noise, at least for this load), but it's clearly a win nonetheless and should be integrated.
For those who are curious, here is much more raw data, albeit from a different run. (That is, exact numbers will vary, but the trends should be the same.)