About 18x speedup in the build phase. #69

NathanPB · 2025-04-28T00:19:36Z

I got a large speedup for two very commonly used functions, auto_planting_window_doy_shape and lookup_wth. A Tanzania simulation on a 28-core, 4 GHz processor went from ~626 seconds to ~35 seconds after I implemented a simple lookup map in the find_closest_vector_coords function in io.py, with early returns when the lookup misses.

Also, the shapely dependency was removed, as it isn't necessary any longer (it was used solely in the algorithm that I replaced).

The improvement may be smaller for simulations with fewer sites. The more sites a simulation has, the more it benefits from these changes.
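The lookup-map idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `coord_key`, `build_index`, `find_properties`, and `planting_window`, the `start_doy`/`end_doy` fields, and the rounding precision are all hypothetical. It also assumes that site coordinates coincide with the points stored in the vector file, so an exact (rounded) match can stand in for a nearest-point search.

```python
from typing import Dict, Iterable, Optional, Tuple

Key = Tuple[float, float]


def coord_key(lng: float, lat: float, ndigits: int = 4) -> Key:
    """Round coordinates so tiny float differences map to the same key."""
    return (round(lng, ndigits), round(lat, ndigits))


def build_index(points: Iterable[Tuple[float, float, dict]],
                ndigits: int = 4) -> Dict[Key, dict]:
    """Build the lookup map once, instead of scanning all points per query."""
    index: Dict[Key, dict] = {}
    for lng, lat, props in points:
        # Keep the first entry if two points round to the same key.
        index.setdefault(coord_key(lng, lat, ndigits), props)
    return index


def find_properties(index: Dict[Key, dict], lng: float, lat: float,
                    ndigits: int = 4) -> Optional[dict]:
    """Constant-time lookup; returns None when the site is not in the map."""
    return index.get(coord_key(lng, lat, ndigits))


def planting_window(index: Dict[Key, dict], lng: float,
                    lat: float) -> Optional[Tuple[int, int]]:
    """Hypothetical caller showing the early return on a lookup miss."""
    props = find_properties(index, lng, lat)
    if props is None:
        return None  # early return: nothing to compute for this site
    return props["start_doy"], props["end_doy"]


if __name__ == "__main__":
    idx = build_index([(35.75, -6.125, {"start_doy": 300, "end_doy": 360})])
    print(planting_window(idx, 35.75, -6.125))
    print(planting_window(idx, 36.0, -6.0))
```

Building the map costs one pass over the points, after which each site lookup is a single dictionary access. That is consistent with the observation that simulations with more sites benefit more.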

Performance Reports

These performance reports were taken with perf on my machine, running a sample GSSAT2 simulation in Tanzania, configured for 28 CPU cores @ 4 GHz, with 64 GB of standard DDR4 memory. I can provide the exact data and configuration files if needed.

Summary

This optimization reduces total runtime from 626 seconds to 35 seconds, a 17.8x speedup. It also cuts CPU work substantially: 13x fewer instructions executed (85 trillion → 6.4 trillion) and 19x fewer cycles (69.7 trillion → 3.7 trillion). Instructions per cycle rose (1.22 → 1.75), idle frontend cycles fell (35.7% → 20.4%), and the branch-miss rate dropped (0.94% → 0.55%).

✨ Disclaimer: Paragraph written by AI.
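The ratios quoted in the summary can be re-derived from the raw counters in the Before and After reports. The snippet below is only an arithmetic check; the dictionary layout and the `ratio` helper are illustrative choices, while the numbers are copied from the perf output.

```python
# Counters copied from the Before / After perf reports.
before = {
    "seconds": 625.953,
    "instructions": 85_137_469_787_762,
    "cycles": 69_737_313_303_103,
}
after = {
    "seconds": 35.234,
    "instructions": 6_434_303_689_973,
    "cycles": 3_674_947_238_272,
}


def ratio(metric: str) -> float:
    """How many times smaller the 'after' value is than the 'before' value."""
    return before[metric] / after[metric]


def ipc(run: dict) -> float:
    """Instructions per cycle for one run."""
    return run["instructions"] / run["cycles"]


if __name__ == "__main__":
    for metric in ("seconds", "instructions", "cycles"):
        print(f"{metric}: {ratio(metric):.1f}x")
    print(f"IPC before: {ipc(before):.2f}, after: {ipc(after):.2f}")
```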

Before

Performance counter stats for 'python -X perf -m pythia --clean-work-dir --setup test2/pythia-config.json' (5 runs):

     17,438,093.49 msec task-clock                       #   27.858 CPUs utilized               ( +-  0.08% )
           568,468      context-switches                 #   32.599 /sec                        ( +-  2.15% )
            16,541      cpu-migrations                   #    0.949 /sec                        ( +-  1.30% )
           341,541      page-faults                      #   19.586 /sec                        ( +-  0.52% )
85,137,469,787,762      instructions                     #    1.22  insn per cycle
                                                  #    0.29  stalled cycles per insn     ( +-  0.08% )
69,737,313,303,103      cycles                           #    3.999 GHz                         ( +-  0.08% )
24,862,264,765,548      stalled-cycles-frontend          #   35.65% frontend cycles idle        ( +-  0.25% )
18,281,479,410,162      branches                         #    1.048 G/sec                       ( +-  0.07% )
   172,136,315,036      branch-misses                    #    0.94% of all branches             ( +-  0.75% )

           625.953 +- 0.429 seconds time elapsed  ( +-  0.07% )

After

Performance counter stats for 'python -X perf -m pythia --clean-work-dir --setup test2/pythia-config.json' (3 runs):

        919,616.65 msec task-clock                       #   26.100 CPUs utilized               ( +-  0.14% )
           143,583      context-switches                 #  156.134 /sec                        ( +-  4.34% )
             8,313      cpu-migrations                   #    9.040 /sec                        ( +-  5.51% )
           259,900      page-faults                      #  282.618 /sec                        ( +-  0.11% )
 6,434,303,689,973      instructions                     #    1.75  insn per cycle
                                                  #    0.12  stalled cycles per insn     ( +-  0.01% )
 3,674,947,238,272      cycles                           #    3.996 GHz                         ( +-  0.14% )
   750,853,165,462      stalled-cycles-frontend          #   20.43% frontend cycles idle        ( +-  0.51% )
 1,430,992,318,806      branches                         #    1.556 G/sec                       ( +-  0.01% )
     7,936,273,176      branch-misses                    #    0.55% of all branches             ( +-  1.64% )

            35.234 +- 0.108 seconds time elapsed  ( +-  0.31% )

…sest_vector_coords function. The following report originates from running this function twice per pixel on Tanzania 10x10 km grid:

BEFORE (041cd34) ->

Performance counter stats for 'python -X perf -m pythia --clean-work-dir --setup test2/pythia-config.json' (5 runs):

     17,438,093.49 msec task-clock                       #   27.858 CPUs utilized               ( +-  0.08% )
           568,468      context-switches                 #   32.599 /sec                        ( +-  2.15% )
            16,541      cpu-migrations                   #    0.949 /sec                        ( +-  1.30% )
           341,541      page-faults                      #   19.586 /sec                        ( +-  0.52% )
85,137,469,787,762      instructions                     #    1.22  insn per cycle
                                                         #    0.29  stalled cycles per insn     ( +-  0.08% )
69,737,313,303,103      cycles                           #    3.999 GHz                         ( +-  0.08% )
24,862,264,765,548      stalled-cycles-frontend          #   35.65% frontend cycles idle        ( +-  0.25% )
18,281,479,410,162      branches                         #    1.048 G/sec                       ( +-  0.07% )
   172,136,315,036      branch-misses                    #    0.94% of all branches             ( +-  0.75% )

           625.953 +- 0.429 seconds time elapsed  ( +-  0.07% )

AFTER ->

Performance counter stats for 'python -X perf -m pythia --clean-work-dir --setup test2/pythia-config.json' (5 runs):

      4,436,558.37 msec task-clock                       #   27.514 CPUs utilized               ( +-  0.22% )
           255,284      context-switches                 #   57.541 /sec                        ( +-  1.80% )
            14,824      cpu-migrations                   #    3.341 /sec                        ( +-  1.76% )
           223,269      page-faults                      #   50.325 /sec                        ( +-  0.11% )
22,793,963,559,968      instructions                     #    1.28  insn per cycle
                                                         #    0.27  stalled cycles per insn     ( +-  0.15% )
17,739,586,160,894      cycles                           #    3.999 GHz                         ( +-  0.22% )
 6,203,511,000,548      stalled-cycles-frontend          #   34.97% frontend cycles idle        ( +-  0.34% )
 4,890,648,814,890      branches                         #    1.102 G/sec                       ( +-  0.13% )
    42,420,497,493      branch-misses                    #    0.87% of all branches             ( +-  0.64% )

           161.248 +- 0.403 seconds time elapsed  ( +-  0.25% )

…vector_coords``. The following report originates from running this function twice per pixel on Tanzania 10x10 km grid:

BEFORE (4ad13a6) ->

Performance counter stats for 'python -X perf -m pythia --clean-work-dir --setup test2/pythia-config.json' (3 runs):

      4,435,650.67 msec task-clock                       #   27.544 CPUs utilized               ( +-  0.39% )
           220,607      context-switches                 #   49.735 /sec                        ( +-  3.27% )
            13,995      cpu-migrations                   #    3.155 /sec                        ( +-  0.83% )
           222,164      page-faults                      #   50.086 /sec                        ( +-  0.21% )
22,727,605,668,167      instructions                     #    1.28  insn per cycle
                                                         #    0.28  stalled cycles per insn     ( +-  0.06% )
17,736,864,542,617      cycles                           #    3.999 GHz                         ( +-  0.39% )
 6,306,883,751,606      stalled-cycles-frontend          #   35.56% frontend cycles idle        ( +-  0.86% )
 4,876,195,822,590      branches                         #    1.099 G/sec                       ( +-  0.06% )
    41,536,653,902      branch-misses                    #    0.85% of all branches             ( +-  1.52% )

           161.038 +- 0.870 seconds time elapsed  ( +-  0.54% )

AFTER ->

Performance counter stats for 'python -X perf -m pythia --clean-work-dir --setup test2/pythia-config.json' (3 runs):

        919,616.65 msec task-clock                       #   26.100 CPUs utilized               ( +-  0.14% )
           143,583      context-switches                 #  156.134 /sec                        ( +-  4.34% )
             8,313      cpu-migrations                   #    9.040 /sec                        ( +-  5.51% )
           259,900      page-faults                      #  282.618 /sec                        ( +-  0.11% )
 6,434,303,689,973      instructions                     #    1.75  insn per cycle
                                                         #    0.12  stalled cycles per insn     ( +-  0.01% )
 3,674,947,238,272      cycles                           #    3.996 GHz                         ( +-  0.14% )
   750,853,165,462      stalled-cycles-frontend          #   20.43% frontend cycles idle        ( +-  0.51% )
 1,430,992,318,806      branches                         #    1.556 G/sec                       ( +-  0.01% )
     7,936,273,176      branch-misses                    #    0.55% of all branches             ( +-  1.64% )

            35.234 +- 0.108 seconds time elapsed  ( +-  0.31% )
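Taken together, the two commit-message reports suggest the overall gain came in two steps. The sketch below only divides the elapsed times quoted in those reports; the stage labels are descriptive guesses, since the commit messages are truncated here.

```python
# Elapsed seconds copied from the BEFORE/AFTER reports in the two commit messages.
first_before = 625.953   # BEFORE (041cd34)
first_after = 161.248    # AFTER of the first reported change
second_before = 161.038  # BEFORE (4ad13a6), re-measured with 3 runs
second_after = 35.234    # AFTER of the second reported change

step_one = first_before / first_after
step_two = second_before / second_after
overall = first_before / second_after

if __name__ == "__main__":
    print(f"first step:  {step_one:.2f}x")
    print(f"second step: {step_two:.2f}x")
    print(f"overall:     {overall:.2f}x")
```

The product of the two steps differs slightly from the overall figure because the intermediate state was measured twice (161.248 s and 161.038 s).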

NathanPB added 3 commits April 27, 2025 20:01

chore: remove shapely dependency

4ad13a6

wpavan merged commit e871f66 into DSSAT:main May 15, 2025
