|
- Estimate the asymptotic complexity (big-O notation) of any R code that depends on some data size N: `references_best()`
- Compare the time/memory of different git versions of R package code: `atime_versions()`
- Continuous performance testing of R packages: `atime_pkg()`
- [{]{style="color: goldenrod"}[line_profiler](https://github.com/pyutils/line_profiler){style="color: goldenrod"}[}]{style="color: goldenrod"} - A Python module for line-by-line profiling of functions. kernprof is a convenient script for running either line_profiler or the Python standard library's cProfile or profile modules, depending on which is available.
- [{]{style="color: #990000"}[pipetime](https://cygei.github.io/pipetime/){style="color: #990000"}[}]{style="color: #990000"} - Measures and logs code execution time within pipes
- [{]{style="color: #990000"}[syrup](https://cran.r-project.org/web/packages/syrup/){style="color: #990000"}[}]{style="color: #990000"} - Measures memory and CPU usage for parallel R code
- [{]{style="color: #990000"}[memuse](https://cran.r-project.org/web/packages/memuse/){style="color: #990000"}[}]{style="color: #990000"} - Memory estimation utilities
|
- Function call overhead
|
- Yes, function calls have overhead. No, it doesn't matter for data pipelines. A function call costs \~100 nanoseconds. Your database query costs 500 milliseconds. Do the math.
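A quick back-of-envelope check of that claim, sketched with the standard library's `timeit` (the exact nanosecond figure varies by machine and interpreter, and the 500 ms query cost is an illustrative assumption):

```python
import timeit

def noop():
    """Empty function: calling it measures pure call overhead."""
    pass

# Average the cost of a call over many repetitions.
n = 1_000_000
seconds_per_call = timeit.timeit(noop, number=n) / n
print(f"~{seconds_per_call * 1e9:.0f} ns per call")

# Compare against a hypothetical 500 ms database query:
query_seconds = 0.5
calls_per_query = query_seconds / seconds_per_call
print(f"one query costs about as much as {calls_per_query:,.0f} function calls")
```

Even on a slow interpreter, one database round-trip buys millions of function calls, which is the "do the math" point.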
- Guidelines ([source](https://dagster.io/blog/when-and-when-not-to-optimize-data-pipelines))
    - Don't optimize if you're:
        - Optimizing code that runs infrequently
        - Optimizing before measuring
        - Optimizing Python/R when the database is the bottleneck
        - Adding complexity to save \<10% runtime
        - Optimizing because it's "more elegant"
        - Using a "faster" library you don't understand
        - Parallel processing without checking if you're I/O-bound
    - Start optimizing if:
        - The pipeline regularly misses its SLA
        - Profiling shows a clear bottleneck (\>50% of runtime)
        - The bottleneck is CPU/memory-bound in your code
        - The optimization has clear ROI (time saved × frequency)
        - You've already fixed the database queries
        - Users are actually waiting for results
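"Profiling shows a clear bottleneck" presupposes a profile; a minimal sketch with the standard library's `cProfile` and `pstats`, where `slow_step` and `fast_step` are stand-ins for real pipeline stages:

```python
import cProfile
import io
import pstats

def slow_step():
    # Stand-in for a CPU-bound transformation (the deliberate bottleneck).
    return sum(i * i for i in range(200_000))

def fast_step():
    # Stand-in for a cheap step.
    return sum(range(1_000))

def pipeline():
    slow_step()
    fast_step()

profiler = cProfile.Profile()
profiler.enable()
pipeline()
profiler.disable()

# Rank functions by cumulative time so a >50%-of-runtime hotspot stands out.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
print(report)
```

The cumulative-time column tells you whether a candidate optimization can possibly pay for itself before you write any of it.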
- Common valuable optimizations ([source](https://dagster.io/blog/when-and-when-not-to-optimize-data-pipelines))
    - Database queries
        - e.g., adding an index, fixing a join, or using proper partitioning
    - Unnecessary data loading
        - Only load the data you need.
        - e.g., Parquet files partitioned by date allow you to read only that date's file, and many data-reading functions let you chunk, filter rows, or select columns while reading.
    - Inefficient iteration patterns
        - e.g., use vectorized functions wherever possible
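A sketch of the loop-versus-vectorized contrast, assuming NumPy is available (any array library makes the same point):

```python
import time

import numpy as np

values = np.arange(100_000, dtype=np.float64)

def loop_total(xs):
    # Python-level loop: one interpreted iteration (and float boxing) per element.
    total = 0.0
    for x in xs:
        total += x
    return total

start = time.perf_counter()
slow = loop_total(values)
loop_time = time.perf_counter() - start

start = time.perf_counter()
fast = values.sum()  # a single C-level reduction over the whole array
vector_time = time.perf_counter() - start

print(f"loop: {loop_time:.5f}s, vectorized: {vector_time:.6f}s")
```

Same answer, but the vectorized call moves the per-element work out of the interpreter.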
    - Serialization/deserialization in tight loops
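A sketch of the serialization pattern with the standard library's `json` module (the record shape is an illustrative assumption): hoisting serialization out of the loop replaces thousands of small calls with one batch call.

```python
import json
import time

records = [{"id": i, "value": i * 0.5} for i in range(10_000)]

# Anti-pattern: one json.dumps call per record inside the loop.
start = time.perf_counter()
per_row = [json.dumps(r) for r in records]
per_row_time = time.perf_counter() - start

# Usually cheaper: serialize the whole batch in one call outside the loop.
start = time.perf_counter()
batched = json.dumps(records)
batched_time = time.perf_counter() - start

print(f"per-row: {per_row_time:.4f}s, batched: {batched_time:.4f}s")
```

The same idea applies to pickling, Parquet writes, and message-queue payloads: batch at the edges instead of serializing inside the hot loop.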
|