This issue is to capture some performance analysis of the C implementation versus the Rust implementation, as well as to explore the effect of David Tolnay's proposed improvements.
These tests were run on SmartOS on a single-socket Haswell server (Xeon E3-1270 v3) running at 3.50GHz. All of the tests were run bound to a processor set containing a single core; all were bound to one logical CPU within that core, with the other logical CPU forced to be idle. cpustat was used to gather CPU performance counter data; each number denotes one run with pic0 programmed to that CPU performance counter.
The input file (~30MB compressed) contains 3.9M state changes, and in the default config will generate a ~6MB SVG.
- statemap-gcc is the node.js/C hybrid compiled with GCC 7.3.0.
- %Δ is the percentage delta relative to statemap-gcc.
- statemap-clang is the node.js/C hybrid compiled with clang 6.0.1.
- statemap-rust is the Rust implementation compiled with 1.29.0 (--release) but without any additional optimizations.
- statemap-dtolnay-deser is dtolnay's deserializing optimization, as outlined in his pull request, also compiled with 1.29.0.
- statemap-dtolnay-deser-nostring is dtolnay's deserializing optimization plus his elimination of the transient string allocation used to parse the time field of StatemapInputDatum, also compiled with 1.29.0.
First, and to get it out of the way, here is the GCC-compiled C version relative to the Clang-compiled C version:
| Counter | statemap-gcc | statemap-clang | %Δ |
|---|---|---|---|
| cpu_clk_unhalted.thread_p | 32166437125 | 30479632976 | -5.2% |
| inst_retired.any_p | 49110875829 | 50526767701 | 2.9% |
| cpu_clk_unhalted.ref_p | 918870673 | 870925736 | -5.2% |
| mem_uops_retired.stlb_miss_loads | 8651386 | 8461734 | -2.2% |
| mem_uops_retired.stlb_miss_stores | 268802 | 240486 | -10.5% |
| mem_uops_retired.lock_loads | 7791528 | 6192277 | -20.5% |
| mem_uops_retired.split_loads | 107969 | 121187 | 12.2% |
| mem_uops_retired.split_stores | 196934 | 157650 | -19.9% |
| mem_uops_retired.all_loads | 11977544999 | 11862421792 | -1.0% |
| mem_uops_retired.all_stores | 3911589945 | 3717098293 | -5.0% |
| mem_load_uops_retired.l1_hit | 9337365435 | 9285834500 | -0.6% |
| mem_load_uops_retired.l2_hit | 1205703362 | 1214733389 | 0.7% |
| mem_load_uops_retired.l3_hit | 66771301 | 67133832 | 0.5% |
| mem_load_uops_retired.l1_miss | 1276311911 | 1290571092 | 1.1% |
| mem_load_uops_retired.l2_miss | 69671774 | 70013995 | 0.5% |
| mem_load_uops_retired.l3_miss | 2544750 | 2123972 | -16.5% |
| mem_load_uops_retired.hit_lfb | 1393631815 | 1376606034 | -1.2% |
| mem_load_uops_l3_hit_retired.xsnp_miss | 435 | 547 | 25.7% |
| mem_load_uops_l3_hit_retired.xsnp_hit | 1269 | 1140 | -10.2% |
| mem_load_uops_l3_hit_retired.xsnp_hitm | 820 | 1100 | 34.1% |
| mem_load_uops_l3_hit_retired.xsnp_none | 67846758 | 67620920 | -0.3% |
| mem_load_uops_l3_miss_retired.local_dram | 2543699 | 2082507 | -18.1% |
There is a significant delta here (a 5% improvement in run-time), which appears to be due to better memory behavior rather than fewer instructions (the Clang-compiled version actually retires 1.4B more), with CPI dropping from 0.65 to 0.60.
Now here is the C version relative to the Rust version:
| Counter | statemap-gcc | statemap-rust | %Δ |
|---|---|---|---|
| cpu_clk_unhalted.thread_p | 32166437125 | 23127271226 | -28.1% |
| inst_retired.any_p | 49110875829 | 48752136699 | -0.7% |
| cpu_clk_unhalted.ref_p | 918870673 | 660493684 | -28.1% |
| mem_uops_retired.stlb_miss_loads | 8651386 | 2353178 | -72.8% |
| mem_uops_retired.stlb_miss_stores | 268802 | 1000684 | 272.3% |
| mem_uops_retired.lock_loads | 7791528 | 51737 | -99.3% |
| mem_uops_retired.split_loads | 107969 | 52745125 | 48752.1% |
| mem_uops_retired.split_stores | 196934 | 41814301 | 21132.6% |
| mem_uops_retired.all_loads | 11977544999 | 9035048050 | -24.6% |
| mem_uops_retired.all_stores | 3911589945 | 6627038769 | 69.4% |
| mem_load_uops_retired.l1_hit | 9337365435 | 8756546174 | -6.2% |
| mem_load_uops_retired.l2_hit | 1205703362 | 70967580 | -94.1% |
| mem_load_uops_retired.l3_hit | 66771301 | 33323740 | -50.1% |
| mem_load_uops_retired.l1_miss | 1276311911 | 105524579 | -91.7% |
| mem_load_uops_retired.l2_miss | 69671774 | 34616966 | -50.3% |
| mem_load_uops_retired.l3_miss | 2544750 | 1364435 | -46.4% |
| mem_load_uops_retired.hit_lfb | 1393631815 | 157897686 | -88.7% |
| mem_load_uops_l3_hit_retired.xsnp_miss | 435 | 526 | 20.9% |
| mem_load_uops_l3_hit_retired.xsnp_hit | 1269 | 740 | -41.7% |
| mem_load_uops_l3_hit_retired.xsnp_hitm | 820 | 517 | -37.0% |
| mem_load_uops_l3_hit_retired.xsnp_none | 67846758 | 33376449 | -50.8% |
| mem_load_uops_l3_miss_retired.local_dram | 2543699 | 1301381 | -48.8% |
The Rust version is issuing a remarkably similar number of instructions (within less than one percent!), but with a decidedly different mix: just three quarters of the loads of the C version and (interestingly) many more stores. The CPI drops from 0.65 to 0.47, indicating much better memory behavior -- and indeed the L1 misses, L2 misses and L3 misses are all way down. The L1 hits as an absolute number are actually quite high relative to the loads, giving Rust a 96.9% L1 hit rate versus the C version's 77.9% hit rate. Rust also lives much better in the L2, where it has half the L2 misses of the C version.
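The CPI and L1 hit-rate figures above fall directly out of the raw counters; here is a minimal sketch of the arithmetic, using `cpu_clk_unhalted.thread_p`, `inst_retired.any_p`, `mem_load_uops_retired.l1_hit`, and `mem_uops_retired.all_loads` from the tables:

```rust
// Cycles per instruction: unhalted thread cycles over retired instructions.
fn cpi(cycles: u64, instructions: u64) -> f64 {
    cycles as f64 / instructions as f64
}

// L1 hit rate: retired load uops that hit the L1, over all retired loads.
fn l1_hit_rate(l1_hits: u64, all_loads: u64) -> f64 {
    l1_hits as f64 / all_loads as f64
}

fn main() {
    let c_cpi = cpi(32_166_437_125, 49_110_875_829);    // ~0.65
    let rust_cpi = cpi(23_127_271_226, 48_752_136_699); // ~0.47

    let c_l1 = l1_hit_rate(9_337_365_435, 11_977_544_999);    // ~77.9%
    let rust_l1 = l1_hit_rate(8_756_546_174, 9_035_048_050);  // ~96.9%

    println!("CPI:         C {:.2} vs Rust {:.2}", c_cpi, rust_cpi);
    println!("L1 hit rate: C {:.1}% vs Rust {:.1}%", c_l1 * 100.0, rust_l1 * 100.0);
}
```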
Now for dtolnay's deserializing improvement (that is, leaning on serde_json::StreamDeserializer instead of scanning ahead for the end of the current JSON blob). This improvement ditches the double lexing. We would expect the number of loads to be reduced by roughly the size of the input, but because these happen within the current JSON blob being parsed (and because these blobs by and large fit in the L1), we would expect these to mainly be L1/L2 hits. Here's the data:
| Counter | statemap-rust | %Δ | statemap-dt-deser | %Δ |
|---|---|---|---|---|
| cpu_clk_unhalted.thread_p | 23127271226 | -28.1% | 21390540509 | -33.5% |
| inst_retired.any_p | 48752136699 | -0.7% | 43365716806 | -11.7% |
| cpu_clk_unhalted.ref_p | 660493684 | -28.1% | 610826326 | -33.5% |
| mem_uops_retired.stlb_miss_loads | 2353178 | -72.8% | 2344946 | -72.9% |
| mem_uops_retired.stlb_miss_stores | 1000684 | 272.3% | 1017404 | 278.5% |
| mem_uops_retired.lock_loads | 51737 | -99.3% | 35968 | -99.5% |
| mem_uops_retired.split_loads | 52745125 | 48752.1% | 54414644 | 50298.4% |
| mem_uops_retired.split_stores | 41814301 | 21132.6% | 43441378 | 21958.9% |
| mem_uops_retired.all_loads | 9035048050 | -24.6% | 8736708559 | -27.1% |
| mem_uops_retired.all_stores | 6627038769 | 69.4% | 6576131085 | 68.1% |
| mem_load_uops_retired.l1_hit | 8756546174 | -6.2% | 8469374960 | -9.3% |
| mem_load_uops_retired.l2_hit | 70967580 | -94.1% | 71849934 | -94.0% |
| mem_load_uops_retired.l3_hit | 33323740 | -50.1% | 33118819 | -50.4% |
| mem_load_uops_retired.l1_miss | 105524579 | -91.7% | 106343931 | -91.7% |
| mem_load_uops_retired.l2_miss | 34616966 | -50.3% | 34514347 | -50.5% |
| mem_load_uops_retired.l3_miss | 1364435 | -46.4% | 1222417 | -52.0% |
| mem_load_uops_retired.hit_lfb | 157897686 | -88.7% | 159649530 | -88.5% |
| mem_load_uops_l3_hit_retired.xsnp_miss | 526 | 20.9% | 408 | -6.2% |
| mem_load_uops_l3_hit_retired.xsnp_hit | 740 | -41.7% | 731 | -42.4% |
| mem_load_uops_l3_hit_retired.xsnp_hitm | 517 | -37.0% | 538 | -34.4% |
| mem_load_uops_l3_hit_retired.xsnp_none | 33376449 | -50.8% | 33313705 | -50.9% |
| mem_load_uops_l3_miss_retired.local_dram | 1301381 | -48.8% | 1241504 | -51.2% |
This is roughly what we would expect, though the L3 misses have been reduced by more than we would expect. Note, too, that while the loads have dropped by ~298M, the total retired instruction count has dropped by 5.3B. This is very significant! And the fact that cycles dropped by only 1.7B shows that the instructions we eliminated were (as we expected) well behaved in terms of memory -- they had a CPI of 0.32, which caused our overall CPI to actually rise (from 0.47 to 0.49). All of this is a long way of saying: the double lexing was well behaved with respect to memory, but it was a bunch of unnecessary work, and dtolnay's fix is a really nice win!
The second fix was to avoid String creation for StatemapInputDatum's time member, as it is, after all, going to be turned directly into a u64. We would expect this to eliminate an allocation followed by a deallocation:
| Counter | statemap-dt-deser | %Δ | statemap-dt-deser-nostring | %Δ |
|---|---|---|---|---|
| cpu_clk_unhalted.thread_p | 21390540509 | -33.5% | 21377152084 | -33.5% |
| inst_retired.any_p | 43365716806 | -11.7% | 42850414294 | -12.7% |
| cpu_clk_unhalted.ref_p | 610826326 | -33.5% | 610429113 | -33.6% |
| mem_uops_retired.stlb_miss_loads | 2344946 | -72.9% | 2310256 | -73.3% |
| mem_uops_retired.stlb_miss_stores | 1017404 | 278.5% | 1005312 | 274.0% |
| mem_uops_retired.lock_loads | 35968 | -99.5% | 52584 | -99.3% |
| mem_uops_retired.split_loads | 54414644 | 50298.4% | 40418209 | 37335.0% |
| mem_uops_retired.split_stores | 43441378 | 21958.9% | 36115193 | 18238.7% |
| mem_uops_retired.all_loads | 8736708559 | -27.1% | 8592065288 | -28.3% |
| mem_uops_retired.all_stores | 6576131085 | 68.1% | 6395531505 | 63.5% |
| mem_load_uops_retired.l1_hit | 8469374960 | -9.3% | 8320089097 | -10.9% |
| mem_load_uops_retired.l2_hit | 71849934 | -94.0% | 73450488 | -93.9% |
| mem_load_uops_retired.l3_hit | 33118819 | -50.4% | 34009737 | -49.1% |
| mem_load_uops_retired.l1_miss | 106343931 | -91.7% | 108820312 | -91.5% |
| mem_load_uops_retired.l2_miss | 34514347 | -50.5% | 35343590 | -49.3% |
| mem_load_uops_retired.l3_miss | 1222417 | -52.0% | 1276716 | -49.8% |
| mem_load_uops_retired.hit_lfb | 159649530 | -88.5% | 158752995 | -88.6% |
| mem_load_uops_l3_hit_retired.xsnp_miss | 408 | -6.2% | 247 | -43.2% |
| mem_load_uops_l3_hit_retired.xsnp_hit | 731 | -42.4% | 629 | -50.4% |
| mem_load_uops_l3_hit_retired.xsnp_hitm | 538 | -34.4% | 634 | -22.7% |
| mem_load_uops_l3_hit_retired.xsnp_none | 33313705 | -50.9% | 34020114 | -49.9% |
| mem_load_uops_l3_miss_retired.local_dram | 1241504 | -51.2% | 1368773 | -46.2% |
This is also a win -- 515M fewer instructions, including 144M fewer loads and 180M fewer stores. This win doesn't show up as much in overall wall clock time (where the delta will be in the noise, at least for this load), but it's clearly a win nonetheless and should be integrated.
For those who are curious, here is much more raw data, albeit from a different run. (That is, exact numbers will vary, but the trends should be the same.)