
Commit dbe7e65

json >> {orjson}; code-opt >> {line_profiler}; llms-mcp >> {plumber2mcp}; py-pol >> lazy eval ex
1 parent 67ee675 commit dbe7e65

6 files changed (+97, -10 lines)


qmd/apis.qmd

Lines changed: 21 additions & 5 deletions
@@ -745,19 +745,35 @@
     wait_exponential,
     retry_if_exception_type,
 )
+import time
+import requests
+
+# Rate limiting: Don't exceed 100 requests/second
+RATE_LIMIT = 100
+MIN_INTERVAL = 1.0 / RATE_LIMIT  # 0.01 seconds between requests
 
 @retry(
     stop=stop_after_attempt(5),
     wait=wait_exponential(multiplier=1, min=1, max=60),
-    retry=retry_if_exception_type((ConnectionError, TimeoutError, HTTPError)),
+    retry=retry_if_exception_type((requests.HTTPError, ConnectionError)),
     reraise=True
 )
-
-def fetch_from_api(url):
-    """Fetch with automatic exponential backoff on failures."""
+def fetch_with_backoff(url, last_request_time):
+    """Fetch with rate limiting and exponential backoff."""
+    # Rate limiting: ensure minimum interval between requests
+    elapsed = time.time() - last_request_time
+    if elapsed < MIN_INTERVAL:
+        time.sleep(MIN_INTERVAL - elapsed)
+
     response = requests.get(url, timeout=10)
     response.raise_for_status()
-    return response.json()
+    return response.json(), time.time()
+
+# Process requests with rate limiting
+last_request_time = 0
+for url in api_urls:
+    data, last_request_time = fetch_with_backoff(url, last_request_time)
+    process(data)
 ```
 
 - Backs off exponentially on failures; add randomness (jitter) to prevent thundering-herd problems (see the jitter sketch below).
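
A minimal sketch of that jitter suggestion (not part of this commit), using tenacity's built-in `wait_random_exponential`; the `fetch_with_jitter` name is illustrative:

``` python
import requests
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
    retry_if_exception_type,
)

@retry(
    stop=stop_after_attempt(5),
    # Randomized exponential backoff: each retry waits a random amount up to an
    # exponentially growing cap, so simultaneous clients don't retry in lockstep.
    wait=wait_random_exponential(multiplier=1, max=60),
    retry=retry_if_exception_type((requests.HTTPError, ConnectionError)),
    reraise=True,
)
def fetch_with_jitter(url):
    """Fetch a URL, retrying with jittered exponential backoff."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()
```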

qmd/big-data.qmd

Lines changed: 16 additions & 0 deletions
@@ -37,6 +37,22 @@
 #> 40.65 MB
 ```
 
+- Optimization Guidelines ([source](https://dagster.io/blog/when-and-when-not-to-optimize-data-pipelines))
+
+  - pandas/dplyr → polars, collapse, data.table (?), etc.: For CPU-bound transformations on medium data (1-50GB)
+    - Benchmark: 3-5x faster for typical group-by operations
+    - When to switch: pandas/dplyr operations taking \>30 minutes, but data fits on one machine
+  - polars, etc. → spark: For truly large-scale distributed processing (\>50GB)
+    - Benchmark: Can process 1TB+ by distributing across a cluster
+    - When to switch: Need to process multi-TB datasets, running out of memory regularly
+  - polars, etc. → SQL (duckdb, postgres): For aggregations and joins (see the duckdb sketch after this diff)
+    - Benchmark: 10-100x faster because computation happens in the warehouse
+    - When to switch: Always, when the operation can be done in SQL
+  - When *not* to move to a faster, potentially more complicated tool:
+    - "This could be faster" (not a good reason)
+    - Current solution works and meets SLA
+    - You don't have the expertise to operate the new tool
+
 - Benchmarks
 
   - [Antico](https://github.com/AdrianAntico/Benchmarks?tab=readme-ov-file) (2024-06-21) - Tests [{collapse}]{style="color: #990000"}, [{duckdb}]{style="color: #990000"}, [{data.table}]{style="color: #990000"}, [{polars}]{style="color: goldenrod"}, and [{pandas}]{style="color: goldenrod"}
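
A minimal sketch of the "push it into SQL" guideline above, using duckdb's Python API; the `sales.parquet` file and its columns are hypothetical:

``` python
# Sketch: push the group-by into DuckDB instead of loading all rows into a dataframe.
# DuckDB scans the Parquet file directly; only the small aggregate comes back to Python.
import duckdb

con = duckdb.connect()  # in-memory database
result = con.execute(
    """
    SELECT segment, SUM(amount) AS total_amount
    FROM read_parquet('sales.parquet')  -- hypothetical file and columns
    GROUP BY segment
    ORDER BY total_amount DESC
    """
).df()  # collect the aggregated result as a pandas DataFrame
```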

qmd/code-optimization.qmd

Lines changed: 18 additions & 1 deletion
@@ -8,6 +8,7 @@
   - Estimate the asymptotic complexity (big-O notation) of any R code that depends on some data size N: `references_best()`
   - Compare time/memory of different git versions of R package code: `atime_versions()`
   - Continuous performance testing of R packages: `atime_pkg()`
+- [{]{style="color: goldenrod"}[line_profiler](https://github.com/pyutils/line_profiler){style="color: goldenrod"}[}]{style="color: goldenrod"} - A module for doing line-by-line profiling of functions. kernprof is a convenient script for running either line_profiler or the Python standard library's cProfile or profile modules, depending on what is available.
 - [{]{style="color: #990000"}[pipetime](https://cygei.github.io/pipetime/){style="color: #990000"}[}]{style="color: #990000"} - Measures and logs code execution time within pipes
 - [{]{style="color: #990000"}[syrup](https://cran.r-project.org/web/packages/syrup/){style="color: #990000"}[}]{style="color: #990000"} - Measure Memory and CPU Usage for Parallel R Code
 - [{]{style="color: #990000"}[memuse](https://cran.r-project.org/web/packages/memuse/){style="color: #990000"}[}]{style="color: #990000"} - Memory Estimation Utilities
@@ -40,12 +41,28 @@
 - Function call overhead
 
   - Yes, function calls have overhead. No, it doesn't matter for data pipelines. A function call costs \~100 nanoseconds. Your database query costs 500 milliseconds. Do the math.
+  - Guidelines ([source](https://dagster.io/blog/when-and-when-not-to-optimize-data-pipelines))
+    - Don't optimize if you're:
+      - Optimizing code that runs infrequently
+      - Optimizing before measuring
+      - Optimizing Python/R when the database is the bottleneck
+      - Adding complexity to save \<10% runtime
+      - Optimizing because it's "more elegant"
+      - Using a "faster" library you don't understand
+      - Parallel processing without checking if you're I/O-bound
+    - Start optimizing if:
+      - Pipeline misses SLA regularly
+      - Profiling shows a clear bottleneck (\>50% of runtime) (see the line_profiler sketch after this diff)
+      - Bottleneck is CPU/memory-bound in your code
+      - Optimization has clear ROI (time saved × frequency)
+      - You've already fixed the database queries
+      - Users are actually waiting for results
 - Common valuable optimizations ([source](https://dagster.io/blog/when-and-when-not-to-optimize-data-pipelines))
   - Database queries
     - e.g. Adding an index, fixing a join, or using proper partitioning
   - Unnecessary data loading
     - Only load the data you need.
-    - e.g. Parquets partitioned by date allow you to only read that date's parquet file. Lots of data reading functions allow you to filter rows or select columns when reading in data.
+    - e.g. Parquets partitioned by date allow you to only read that date's parquet file. Lots of data reading functions allow you to chunk, filter rows, or select columns when reading in data.
   - Inefficient iteration patterns
     - e.g. use vectorized functions wherever possible
   - Serialization/deserialization in tight loops
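
A sketch of the line_profiler/kernprof workflow mentioned above, for confirming a line-level bottleneck before optimizing; the script and `slow_sum` function are made up for illustration:

``` python
# pipeline_step.py -- hypothetical script; run it under the profiler with:
#   kernprof -l -v pipeline_step.py
# With -l, kernprof injects a `profile` decorator and prints per-line timings,
# which is how you verify that a bottleneck really is >50% of the runtime.

@profile  # provided by kernprof at runtime; remove it for normal runs
def slow_sum(n):
    """Deliberately loop-heavy so the per-line report has something to show."""
    total = 0
    for i in range(n):  # the report attributes most of the time to these lines
        total += i * i
    return total

if __name__ == "__main__":
    slow_sum(5_000_000)
```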

qmd/json.qmd

Lines changed: 4 additions & 3 deletions
@@ -4,13 +4,14 @@
 
 - Packages
 
-  - [{]{style="color: #990000"}[yyjsonr](https://coolbutuseless.github.io/package/yyjsonr/index.html){style="color: #990000"}[}]{style="color: #990000"} - A fast JSON parser/serializer, which converts R data to/from JSON and NDJSON. It is around 2x to 10x faster than jsonlite at both reading and writing JSON.
+  - [{]{style="color: #990000"}[dir2json](https://parmsam.github.io/dir2json-r/){style="color: #990000"}[}]{style="color: #990000"} - A utility for converting directories into JSON format and decoding JSON back into directory structures
+    - Handles a variety of file types within the directory, including text and binary files (e.g., images, PDFs)
+  - [{]{style="color: goldenrod"}[orjson](https://pypi.org/project/orjson/){style="color: goldenrod"}[}]{style="color: goldenrod"} - A fast, correct JSON library for Python. It benchmarks as the fastest Python library for JSON and is more correct than the standard json library or other third-party python libraries. It serializes dataclass, datetime, numpy, and UUID instances natively. (See the round-trip sketch after this diff.)
   - [{]{style="color: #990000"}[RcppSimdJson](https://dirk.eddelbuettel.com/code/rcpp.simdjson.html){style="color: #990000"}[}]{style="color: #990000"} - Comparable to {yyjsonr} in performance.
     - Might be faster than yyjsonr for very large / nested data
   - [{]{style="color: #990000"}[rlowdb](https://cran.r-project.org/web/packages/rlowdb/index.html){style="color: #990000"}[}]{style="color: #990000"} - A lightweight, file-based JSON database. Inspired by '[LowDB](https://github.com/typicode/lowdb)' in 'JavaScript', it generates an intuitive interface for storing, retrieving, updating, and querying structured data without requiring a full-fledged database system.
   - [{]{style="color: #990000"}[unnest](https://vspinu.github.io/unnest/){style="color: #990000"}[}]{style="color: #990000"} - A zero-dependency R package for a very fast single-copy and single-pass unnesting of hierarchical data structures.
-  - [{]{style="color: #990000"}[dir2json](https://parmsam.github.io/dir2json-r/){style="color: #990000"}[}]{style="color: #990000"} - A utility for converting directories into JSON format and decoding JSON back into directory structures
-    - Handles a variety of file types within the directory, including text and binary files (e.g., images, PDFs),
+  - [{]{style="color: #990000"}[yyjsonr](https://coolbutuseless.github.io/package/yyjsonr/index.html){style="color: #990000"}[}]{style="color: #990000"} - A fast JSON parser/serializer, which converts R data to/from JSON and NDJSON. It is around 2x to 10x faster than jsonlite at both reading and writing JSON.
 
 - Also see
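
A minimal round-trip sketch of the orjson behavior described above (datetime, UUID, and numpy handling); the payload is made up:

``` python
# Sketch: orjson.dumps() returns bytes (not str) and serializes datetime, UUID,
# and (with an option flag) numpy arrays natively; orjson.loads() parses them back.
import datetime
import uuid

import numpy as np
import orjson

payload = {
    "id": uuid.uuid4(),
    "ts": datetime.datetime(2024, 6, 21, 12, 0),
    "values": np.array([1, 2, 3]),
}

raw = orjson.dumps(payload, option=orjson.OPT_SERIALIZE_NUMPY)  # -> bytes
data = orjson.loads(raw)  # -> dict of plain Python types
```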

qmd/llms-mcp.qmd

Lines changed: 8 additions & 1 deletion
@@ -13,7 +13,14 @@
 - [{]{style="color: #990000"}[mcptools](https://posit-dev.github.io/mcptools/){style="color: #990000"}[}]{style="color: #990000"} - Allows MCP-enabled tools like Claude Desktop, Claude Code, and VS Code GitHub Copilot to run R code *in the sessions you have running* to answer your questions
   - Works well with [{btw}]{style="color: #990000"}
 - [{]{style="color: #990000"}[mcpr](https://mcpr.opifex.org/){style="color: #990000"}[}]{style="color: #990000"} - Enables R applications to expose capabilities (tools, resources, and prompts) to AI models through a standard JSON-RPC 2.0 interface. It also provides client functionality to connect to and interact with MCP servers
-- [{]{style="color: goldenrod"}[fastapi_mcp](https://fastapi-mcp.tadata.com/getting-started/welcome){style="color: goldenrod"}[}]{style="color: goldenrod"} - Exposes FastAPI endpoints as Model Context Protocol (MCP) tools with Auth
+- [{]{style="color: goldenrod"}[fastapi_mcp](https://fastapi-mcp.tadata.com/getting-started/welcome){style="color: goldenrod"}[}]{style="color: goldenrod"} - Exposes FastAPI endpoints as Model Context Protocol (MCP) tools with Auth (i.e. takes an existing FastAPI application and essentially turns it into an MCP server)
+- [{]{style="color: #990000"}[plumber2mcp](https://arman.aksoy.org/plumber2mcp/){style="color: #990000"}[}]{style="color: #990000"} - Takes a plumber API and exposes its endpoints as MCP utilities (same idea as [{fastapi_mcp}]{style="color: goldenrod"}, but for plumber)
+  - By adding MCP support to your Plumber API, you make your R functions available as:
+    - Tools: AI assistants can call your API endpoints directly
+    - Resources: AI assistants can read documentation, data, and analysis results
+    - Prompts: AI assistants can use pre-defined templates to guide interactions
 - Resources
   - [Docs](https://modelcontextprotocol.io/introduction)
   - [Model Context Protocol servers](https://github.com/modelcontextprotocol/servers) - Links to official mcp servers and community-based servers

qmd/python-polars.qmd

Lines changed: 30 additions & 0 deletions
@@ -126,6 +126,12 @@
 | Conditional | `filter(df, x > 4)` | `df.filter(pl.col("x") > 4)` |
 | Sort rows | `arrange(df, x)` | `df.sort("x")` |
 
+- [Example]{.ribbon-highlight}: Basic
+
+  ``` python
+  filtered_df = df_pl.filter(pl.col("region") == "Europe")
+  ```
+
 ## Mutate {#sec-py-polars-mut .unnumbered}
 
 - Quick Reference
@@ -522,6 +528,12 @@
 |:-----------------------|:-----------------------|:-----------------------|
 | Join dataframes | `left_join(df1, df2, by=x)` | `df1.join(df2, on="x", how="left")` |
 
+- [Example]{.ribbon-highlight}: Basic Left Join
+
+  ``` python
+  merged_df = df_pl.join(pop_df_pl, on="country", how="left")
+  ```
+
 - [Example]{.ribbon-highlight}: Left Join
 
 ::: panel-tabset
@@ -917,6 +929,24 @@
 - `partition_by` splits the dataframe into a list of dataframes (like `split` in base R or `group_split` in [{dplyr}]{style="color: #990000"}). Since [as_dict = True]{.arg-text}, it's actually a dictionary of dataframes
 - This can also be done within the unsplit dataframe using `map_batch` and `map_elements` but it's less efficient. (See the previous section in the article for details)
 
+## Lazy Evaluation {#sec-py-polars-leval .unnumbered}
+
+- [Example]{.ribbon-highlight}: Basic
+
+  ``` python
+  import polars as pl
+
+  df_lazy = (
+      pl.scan_csv("sales.csv")
+      .filter(pl.col("amount") > 100)
+      .group_by("segment")
+      .agg(pl.col("amount").mean())
+      .sort("amount")
+  )
+
+  result = df_lazy.collect()
+  ```
+
 - [Example]{.ribbon-highlight}: Lazily aggregate Delta lake files in AWS S3 ([source](https://dataengineeringcentral.substack.com/p/650gb-of-data-delta-lake-on-s3-polars))
 
   ``` python
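
A follow-up sketch to the lazy-evaluation example above (not part of the commit): `LazyFrame.explain()` prints the optimized query plan, which shows the filter being pushed into the CSV scan before anything is read. Same hypothetical `sales.csv` columns as above; `group_by` is the current polars spelling of the older `groupby`.

``` python
import polars as pl

df_lazy = (
    pl.scan_csv("sales.csv")          # nothing is read yet
    .filter(pl.col("amount") > 100)
    .group_by("segment")
    .agg(pl.col("amount").mean())
)

print(df_lazy.explain())    # optimized logical plan; predicate pushdown is visible here
result = df_lazy.collect()  # execution actually happens at collect()
```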
