About 18x speedup in the build phase. #69

Merged
merged 3 commits into from
May 15, 2025

NathanPB
Contributor

@NathanPB NathanPB commented Apr 28, 2025

I managed to get a massive speedup for two very commonly used functions (auto_planting_window_doy_shape and lookup_wth). A Tanzania simulation on a 28-core, 4 GHz processor went from ~626 seconds to ~35 seconds after I implemented a simple lookup map in the io.py/find_closest_vector_coords function, followed by early returns if the lookup misses.

Also, the shapely dependency was removed, as it isn't necessary any longer (it was used solely in the algorithm that I replaced).

The performance improvement might be smaller for simulations with fewer sites: the more sites in the simulation, the more it benefits from the changes.
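
To illustrate the lookup-map idea described above, here is a minimal sketch, not the code from this PR. It assumes site coordinates match the vector coordinates once rounded to a fixed precision; the names `build_coord_lookup`, `find_coord_index`, `lookup_value` and the `precision` parameter are hypothetical and do not come from pythia.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Coord = Tuple[float, float]  # (longitude, latitude)


def build_coord_lookup(coords: Sequence[Coord], precision: int = 4) -> Dict[Coord, int]:
    """Index every coordinate pair by its rounded value. Built once, O(n)."""
    lookup: Dict[Coord, int] = {}
    for idx, (lng, lat) in enumerate(coords):
        # setdefault keeps the first index if two points round to the same key.
        lookup.setdefault((round(lng, precision), round(lat, precision)), idx)
    return lookup


def find_coord_index(
    lookup: Dict[Coord, int], lng: float, lat: float, precision: int = 4
) -> Optional[int]:
    """Constant-time dictionary lookup; returns None when the point is unknown."""
    return lookup.get((round(lng, precision), round(lat, precision)))


def lookup_value(lookup: Dict[Coord, int], values: List[str], lng: float, lat: float) -> Optional[str]:
    """Hypothetical caller showing the early return on a lookup miss."""
    idx = find_coord_index(lookup, lng, lat)
    if idx is None:
        return None  # early return: skip all further work for this site
    return values[idx]


# Made-up coordinates, for illustration only.
coords = [(35.25, -6.75), (35.35, -6.75), (35.25, -6.85)]
lookup = build_coord_lookup(coords)
print(lookup_value(lookup, ["a", "b", "c"], 35.25, -6.85))
print(lookup_value(lookup, ["a", "b", "c"], 1.0, 1.0))
```

Under these assumptions the map is built once and every site lookup is a single dictionary access, which is consistent with the gain growing with the number of sites.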

Performance Reports

These performance reports were taken with perf on my machine, running a sample GSSAT2 simulation in Tanzania, configured for 28 CPU cores @ 4 GHz, with 64 GB of standard DDR4 memory. I can provide the exact data and configuration files if needed.

Summary

This optimization reduces total runtime from 626 seconds to 35 seconds, a 17.8x speedup. It also cuts CPU work: 13x fewer instructions executed (85.1 trillion → 6.4 trillion) and 19x fewer cycles (69.7 trillion → 3.7 trillion). Instructions per cycle rose (1.22 → 1.75), idle frontend cycles fell (35.7% → 20.4%), and the branch miss rate dropped (0.94% → 0.55%).

✨ Disclaimer: Paragraph written by AI.
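
The ratios quoted in the summary can be re-derived from the raw counters. The short check below uses only figures copied from the Before and After reports in this description; the dictionary layout and the helper names `ratio` and `ipc` are just for illustration.

```python
# Values copied from the "Before" and "After" perf reports in this PR.
before = {
    "elapsed_s": 625.953,
    "instructions": 85_137_469_787_762,
    "cycles": 69_737_313_303_103,
}
after = {
    "elapsed_s": 35.234,
    "instructions": 6_434_303_689_973,
    "cycles": 3_674_947_238_272,
}


def ratio(metric: str) -> float:
    """How many times smaller the metric is after the change."""
    return before[metric] / after[metric]


def ipc(report: dict) -> float:
    """Instructions per cycle for one report."""
    return report["instructions"] / report["cycles"]


for metric in before:
    print(f"{metric}: {ratio(metric):.1f}x")
print(f"IPC: {ipc(before):.2f} -> {ipc(after):.2f}")
```

This should reproduce the roughly 17.8x, 13x and 19x factors and the 1.22 → 1.75 instructions-per-cycle figures stated above.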

Before

Performance counter stats for 'python -X perf -m pythia --clean-work-dir --setup test2/pythia-config.json' (5 runs):

     17,438,093.49 msec task-clock                       #   27.858 CPUs utilized               ( +-  0.08% )
           568,468      context-switches                 #   32.599 /sec                        ( +-  2.15% )
            16,541      cpu-migrations                   #    0.949 /sec                        ( +-  1.30% )
           341,541      page-faults                      #   19.586 /sec                        ( +-  0.52% )
85,137,469,787,762      instructions                     #    1.22  insn per cycle
                                                  #    0.29  stalled cycles per insn     ( +-  0.08% )
69,737,313,303,103      cycles                           #    3.999 GHz                         ( +-  0.08% )
24,862,264,765,548      stalled-cycles-frontend          #   35.65% frontend cycles idle        ( +-  0.25% )
18,281,479,410,162      branches                         #    1.048 G/sec                       ( +-  0.07% )
   172,136,315,036      branch-misses                    #    0.94% of all branches             ( +-  0.75% )

           625.953 +- 0.429 seconds time elapsed  ( +-  0.07% )

After

Performance counter stats for 'python -X perf -m pythia --clean-work-dir --setup test2/pythia-config.json' (3 runs):

        919,616.65 msec task-clock                       #   26.100 CPUs utilized               ( +-  0.14% )
           143,583      context-switches                 #  156.134 /sec                        ( +-  4.34% )
             8,313      cpu-migrations                   #    9.040 /sec                        ( +-  5.51% )
           259,900      page-faults                      #  282.618 /sec                        ( +-  0.11% )
 6,434,303,689,973      instructions                     #    1.75  insn per cycle
                                                  #    0.12  stalled cycles per insn     ( +-  0.01% )
 3,674,947,238,272      cycles                           #    3.996 GHz                         ( +-  0.14% )
   750,853,165,462      stalled-cycles-frontend          #   20.43% frontend cycles idle        ( +-  0.51% )
 1,430,992,318,806      branches                         #    1.556 G/sec                       ( +-  0.01% )
     7,936,273,176      branch-misses                    #    0.55% of all branches             ( +-  1.64% )

            35.234 +- 0.108 seconds time elapsed  ( +-  0.31% )

…sest_vector_coords function.

The following report originates from running this function twice per pixel on a Tanzania 10x10 km grid:

BEFORE (041cd34) ->
Performance counter stats for 'python -X perf -m pythia --clean-work-dir --setup test2/pythia-config.json' (5 runs):

     17,438,093.49 msec task-clock                       #   27.858 CPUs utilized               ( +-  0.08% )
           568,468      context-switches                 #   32.599 /sec                        ( +-  2.15% )
            16,541      cpu-migrations                   #    0.949 /sec                        ( +-  1.30% )
           341,541      page-faults                      #   19.586 /sec                        ( +-  0.52% )
85,137,469,787,762      instructions                     #    1.22  insn per cycle
                                                  #    0.29  stalled cycles per insn     ( +-  0.08% )
69,737,313,303,103      cycles                           #    3.999 GHz                         ( +-  0.08% )
24,862,264,765,548      stalled-cycles-frontend          #   35.65% frontend cycles idle        ( +-  0.25% )
18,281,479,410,162      branches                         #    1.048 G/sec                       ( +-  0.07% )
   172,136,315,036      branch-misses                    #    0.94% of all branches             ( +-  0.75% )

           625.953 +- 0.429 seconds time elapsed  ( +-  0.07% )

AFTER ->
Performance counter stats for 'python -X perf -m pythia --clean-work-dir --setup test2/pythia-config.json' (5 runs):

      4,436,558.37 msec task-clock                       #   27.514 CPUs utilized               ( +-  0.22% )
           255,284      context-switches                 #   57.541 /sec                        ( +-  1.80% )
            14,824      cpu-migrations                   #    3.341 /sec                        ( +-  1.76% )
           223,269      page-faults                      #   50.325 /sec                        ( +-  0.11% )
22,793,963,559,968      instructions                     #    1.28  insn per cycle
                                                  #    0.27  stalled cycles per insn     ( +-  0.15% )
17,739,586,160,894      cycles                           #    3.999 GHz                         ( +-  0.22% )
 6,203,511,000,548      stalled-cycles-frontend          #   34.97% frontend cycles idle        ( +-  0.34% )
 4,890,648,814,890      branches                         #    1.102 G/sec                       ( +-  0.13% )
    42,420,497,493      branch-misses                    #    0.87% of all branches             ( +-  0.64% )

           161.248 +- 0.403 seconds time elapsed  ( +-  0.25% )
…vector_coords``.

The following report originates from running this function twice per pixel on a Tanzania 10x10 km grid:

BEFORE (4ad13a6) ->
Performance counter stats for 'python -X perf -m pythia --clean-work-dir --setup test2/pythia-config.json' (3 runs):

      4,435,650.67 msec task-clock                       #   27.544 CPUs utilized               ( +-  0.39% )
           220,607      context-switches                 #   49.735 /sec                        ( +-  3.27% )
            13,995      cpu-migrations                   #    3.155 /sec                        ( +-  0.83% )
           222,164      page-faults                      #   50.086 /sec                        ( +-  0.21% )
22,727,605,668,167      instructions                     #    1.28  insn per cycle
                                                  #    0.28  stalled cycles per insn     ( +-  0.06% )
17,736,864,542,617      cycles                           #    3.999 GHz                         ( +-  0.39% )
 6,306,883,751,606      stalled-cycles-frontend          #   35.56% frontend cycles idle        ( +-  0.86% )
 4,876,195,822,590      branches                         #    1.099 G/sec                       ( +-  0.06% )
    41,536,653,902      branch-misses                    #    0.85% of all branches             ( +-  1.52% )

           161.038 +- 0.870 seconds time elapsed  ( +-  0.54% )

AFTER ->
Performance counter stats for 'python -X perf -m pythia --clean-work-dir --setup test2/pythia-config.json' (3 runs):

        919,616.65 msec task-clock                       #   26.100 CPUs utilized               ( +-  0.14% )
           143,583      context-switches                 #  156.134 /sec                        ( +-  4.34% )
             8,313      cpu-migrations                   #    9.040 /sec                        ( +-  5.51% )
           259,900      page-faults                      #  282.618 /sec                        ( +-  0.11% )
 6,434,303,689,973      instructions                     #    1.75  insn per cycle
                                                  #    0.12  stalled cycles per insn     ( +-  0.01% )
 3,674,947,238,272      cycles                           #    3.996 GHz                         ( +-  0.14% )
   750,853,165,462      stalled-cycles-frontend          #   20.43% frontend cycles idle        ( +-  0.51% )
 1,430,992,318,806      branches                         #    1.556 G/sec                       ( +-  0.01% )
     7,936,273,176      branch-misses                    #    0.55% of all branches             ( +-  1.64% )

            35.234 +- 0.108 seconds time elapsed  ( +-  0.31% )
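
The two commit reports above split the overall gain into two steps. As a rough check, the per-step factors can be computed from the elapsed times each report lists; the variable names are illustrative, and the two steps use slightly different baselines (161.248 s vs. 161.038 s) because they come from separate measurement sessions.

```python
# Elapsed times (seconds) copied from the two BEFORE/AFTER commit reports.
first_commit = {"before": 625.953, "after": 161.248}
second_commit = {"before": 161.038, "after": 35.234}


def speedup(report: dict) -> float:
    """Elapsed-time ratio for one before/after pair."""
    return report["before"] / report["after"]


step1 = speedup(first_commit)
step2 = speedup(second_commit)

print(f"first commit:  {step1:.2f}x")
print(f"second commit: {step2:.2f}x")
print(f"combined:      {step1 * step2:.1f}x")
```

The combined factor lands close to the overall figure from the summary; the small gap comes from the differing intermediate baselines.
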
@wpavan wpavan merged commit e871f66 into DSSAT:main May 15, 2025