About 18x speedup in the build phase. #69
Merged
…sest_vector_coords function. The following report originates from running this function twice per pixel on Tanzania 10x10 km grid:

BEFORE (041cd34) -> Performance counter stats for 'python -X perf -m pythia --clean-work-dir --setup test2/pythia-config.json' (5 runs):

         17,438,093.49 msec task-clock                #   27.858 CPUs utilized          ( +- 0.08% )
               568,468      context-switches          #   32.599 /sec                   ( +- 2.15% )
                16,541      cpu-migrations            #    0.949 /sec                   ( +- 1.30% )
               341,541      page-faults               #   19.586 /sec                   ( +- 0.52% )
    85,137,469,787,762      instructions              #    1.22  insn per cycle
                                                      #    0.29  stalled cycles per insn ( +- 0.08% )
    69,737,313,303,103      cycles                    #    3.999 GHz                    ( +- 0.08% )
    24,862,264,765,548      stalled-cycles-frontend   #   35.65% frontend cycles idle   ( +- 0.25% )
    18,281,479,410,162      branches                  #    1.048 G/sec                  ( +- 0.07% )
       172,136,315,036      branch-misses             #    0.94% of all branches        ( +- 0.75% )

    625.953 +- 0.429 seconds time elapsed ( +- 0.07% )

AFTER -> Performance counter stats for 'python -X perf -m pythia --clean-work-dir --setup test2/pythia-config.json' (5 runs):

          4,436,558.37 msec task-clock                #   27.514 CPUs utilized          ( +- 0.22% )
               255,284      context-switches          #   57.541 /sec                   ( +- 1.80% )
                14,824      cpu-migrations            #    3.341 /sec                   ( +- 1.76% )
               223,269      page-faults               #   50.325 /sec                   ( +- 0.11% )
    22,793,963,559,968      instructions              #    1.28  insn per cycle
                                                      #    0.27  stalled cycles per insn ( +- 0.15% )
    17,739,586,160,894      cycles                    #    3.999 GHz                    ( +- 0.22% )
     6,203,511,000,548      stalled-cycles-frontend   #   34.97% frontend cycles idle   ( +- 0.34% )
     4,890,648,814,890      branches                  #    1.102 G/sec                  ( +- 0.13% )
        42,420,497,493      branch-misses             #    0.87% of all branches        ( +- 0.64% )

    161.248 +- 0.403 seconds time elapsed ( +- 0.25% )
…vector_coords``. The following report originates from running this function twice per pixel on Tanzania 10x10 km grid:

BEFORE (4ad13a6) -> Performance counter stats for 'python -X perf -m pythia --clean-work-dir --setup test2/pythia-config.json' (3 runs):

          4,435,650.67 msec task-clock                #   27.544 CPUs utilized          ( +- 0.39% )
               220,607      context-switches          #   49.735 /sec                   ( +- 3.27% )
                13,995      cpu-migrations            #    3.155 /sec                   ( +- 0.83% )
               222,164      page-faults               #   50.086 /sec                   ( +- 0.21% )
    22,727,605,668,167      instructions              #    1.28  insn per cycle
                                                      #    0.28  stalled cycles per insn ( +- 0.06% )
    17,736,864,542,617      cycles                    #    3.999 GHz                    ( +- 0.39% )
     6,306,883,751,606      stalled-cycles-frontend   #   35.56% frontend cycles idle   ( +- 0.86% )
     4,876,195,822,590      branches                  #    1.099 G/sec                  ( +- 0.06% )
        41,536,653,902      branch-misses             #    0.85% of all branches        ( +- 1.52% )

    161.038 +- 0.870 seconds time elapsed ( +- 0.54% )

AFTER -> Performance counter stats for 'python -X perf -m pythia --clean-work-dir --setup test2/pythia-config.json' (3 runs):

            919,616.65 msec task-clock                #   26.100 CPUs utilized          ( +- 0.14% )
               143,583      context-switches          #  156.134 /sec                   ( +- 4.34% )
                 8,313      cpu-migrations            #    9.040 /sec                   ( +- 5.51% )
               259,900      page-faults               #  282.618 /sec                   ( +- 0.11% )
     6,434,303,689,973      instructions              #    1.75  insn per cycle
                                                      #    0.12  stalled cycles per insn ( +- 0.01% )
     3,674,947,238,272      cycles                    #    3.996 GHz                    ( +- 0.14% )
       750,853,165,462      stalled-cycles-frontend   #   20.43% frontend cycles idle   ( +- 0.51% )
     1,430,992,318,806      branches                  #    1.556 G/sec                  ( +- 0.01% )
         7,936,273,176      branch-misses             #    0.55% of all branches        ( +- 1.64% )

    35.234 +- 0.108 seconds time elapsed ( +- 0.31% )
I managed to get a massive speedup for two very commonly used functions (auto_planting_window_doy_shape and lookup_wth). I brought simulations in Tanzania on a 28-core, 4 GHz processor from ~620 seconds to ~35 seconds by implementing a simple lookup map in the io.py/find_closest_vector_coords function, followed by early returns if the lookup misses.

Also, the shapely dependency was removed, as it is no longer necessary (it was used solely in the algorithm that I replaced).

The performance improvements might not be as great for simulations with fewer sites. The more sites in the simulation, the more it benefits from the changes.
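For readers who want a picture of the general technique, here is a minimal sketch of a coordinate lookup map with an early return on a miss. It is not the code from this PR: the function names, the rounding precision used to build keys, and the shape of the stored records are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, Optional, Tuple

Coord = Tuple[float, float]

# Hypothetical key precision; the real code may key its map differently.
PRECISION = 4


def make_key(lng: float, lat: float) -> Coord:
    """Round coordinates so nearly identical points share one dict key."""
    return (round(lng, PRECISION), round(lat, PRECISION))


def build_lookup(features: Iterable[Tuple[Coord, Any]]) -> Dict[Coord, Any]:
    """Build the map once, so each later query is a single dict access
    instead of a scan over every feature."""
    return {make_key(lng, lat): record for (lng, lat), record in features}


def find_coords_sketch(lookup: Dict[Coord, Any], lng: float, lat: float) -> Optional[Any]:
    """Return the stored record for a point, or None when the lookup misses."""
    return lookup.get(make_key(lng, lat))


def planting_window_sketch(
    lookup: Dict[Coord, Any], lng: float, lat: float
) -> Optional[Tuple[int, int]]:
    """Hypothetical caller: bail out early on a miss, do the work otherwise."""
    record = find_coords_sketch(lookup, lng, lat)
    if record is None:
        return None  # early return: nothing more to compute for this site
    return record["start_doy"], record["end_doy"]


# Example data (made up for this sketch).
lookup = build_lookup([((35.0, -6.0), {"start_doy": 300, "end_doy": 330})])
hit = planting_window_sketch(lookup, 35.0, -6.0)
miss = planting_window_sketch(lookup, 10.0, 10.0)
```

The map is built once and reused for every pixel, which is why the benefit should grow with the number of sites.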
Performance Reports
These performance reports were taken with perf on my machine, running a sample GSSAT2 simulation in Tanzania, configured for 28 CPU cores @ 4 GHz, with 64 GB of standard DDR4 memory. I can provide the exact data and configuration files if needed.
Summary
Across both commits, this optimization reduces total runtime from 626 seconds to 35 seconds, a 17.8x speedup. It also cuts CPU work substantially: about 13x fewer instructions executed (85 trillion → 6.4 trillion) and about 19x fewer cycles (69.7 trillion → 3.7 trillion). Instructions per cycle rose (1.22 → 1.75), the share of idle frontend cycles fell (35.7% → 20.4%), and the branch-miss rate dropped (0.94% → 0.55%).
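The ratios in the summary can be rechecked from the raw counters in the perf reports. The short script below is only a convenience for that arithmetic. The numbers are copied from the first BEFORE report (041cd34) and the final AFTER report, and the dictionary layout is my own.

```python
# Counters copied from the perf reports: first BEFORE run vs. final AFTER run.
before = {
    "elapsed_s": 625.953,
    "instructions": 85_137_469_787_762,
    "cycles": 69_737_313_303_103,
}
after = {
    "elapsed_s": 35.234,
    "instructions": 6_434_303_689_973,
    "cycles": 3_674_947_238_272,
}

# Improvement factor for each metric: how many times smaller the AFTER value is.
ratios = {name: before[name] / after[name] for name in before}

# Instructions per cycle, recomputed from the raw counters.
ipc_before = before["instructions"] / before["cycles"]
ipc_after = after["instructions"] / after["cycles"]

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
print(f"IPC: {ipc_before:.2f} -> {ipc_after:.2f}")
```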
Before
After