
Commit dbe7e65

json >> {orjson}; code-opt >> {line_profiler}; llms-mcp >> {plumber2mcp}; py-pol >> lazy eval ex
1 parent 67ee675 commit dbe7e65

6 files changed (+97, -10 lines)


qmd/apis.qmd

Lines changed: 21 additions & 5 deletions
@@ -745,19 +745,35 @@
     wait_exponential,
     retry_if_exception_type,
 )
+import time
+import requests
+
+# Rate limiting: Don't exceed 100 requests/second
+RATE_LIMIT = 100
+MIN_INTERVAL = 1.0 / RATE_LIMIT  # 0.01 seconds between requests
 
 @retry(
     stop=stop_after_attempt(5),
     wait=wait_exponential(multiplier=1, min=1, max=60),
-    retry=retry_if_exception_type((ConnectionError, TimeoutError, HTTPError)),
+    retry=retry_if_exception_type((requests.HTTPError, ConnectionError)),
     reraise=True
 )
-
-def fetch_from_api(url):
-    """Fetch with automatic exponential backoff on failures."""
+def fetch_with_backoff(url, last_request_time):
+    """Fetch with rate limiting and exponential backoff."""
+    # Rate limiting: ensure minimum interval between requests
+    elapsed = time.time() - last_request_time
+    if elapsed < MIN_INTERVAL:
+        time.sleep(MIN_INTERVAL - elapsed)
+
     response = requests.get(url, timeout=10)
     response.raise_for_status()
-    return response.json()
+    return response.json(), time.time()
+
+# Process requests with rate limiting
+last_request_time = 0
+for url in api_urls:
+    data, last_request_time = fetch_with_backoff(url, last_request_time)
+    process(data)
 ```
 
 - Backs off exponentially on failures; add randomness (jitter) to prevent thundering-herd problems (see the jitter sketch below).
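
A minimal sketch of that jitter suggestion (not part of this commit), using tenacity's built-in `wait_random_exponential`; the `fetch_with_jitter` name is illustrative:

``` python
import requests
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
    retry_if_exception_type,
)

@retry(
    stop=stop_after_attempt(5),
    # Randomized exponential backoff: each retry waits a random amount up to an
    # exponentially growing cap, so simultaneous clients don't retry in lockstep.
    wait=wait_random_exponential(multiplier=1, max=60),
    retry=retry_if_exception_type((requests.HTTPError, ConnectionError)),
    reraise=True,
)
def fetch_with_jitter(url):
    """Fetch a URL, retrying with jittered exponential backoff."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()
```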

qmd/big-data.qmd

Lines changed: 16 additions & 0 deletions
@@ -37,6 +37,22 @@
 #> 40.65 MB
 ```
 
+- Optimization Guidelines ([source](https://dagster.io/blog/when-and-when-not-to-optimize-data-pipelines))
+
+  - pandas/dplyr → polars, collapse, data.table (?), etc.: For CPU-bound transformations on medium data (1-50GB)
+    - Benchmark: 3-5x faster for typical group-by operations
+    - When to switch: pandas/dplyr operations taking \>30 minutes, but data fits on one machine
+  - polars, etc. → spark: For truly large-scale distributed processing (\>50GB)
+    - Benchmark: Can process 1TB+ by distributing across a cluster
+    - When to switch: Need to process multi-TB datasets, running out of memory regularly
+  - polars, etc. → SQL (duckdb, postgres): For aggregations and joins (see the duckdb sketch after this diff)
+    - Benchmark: 10-100x faster because computation happens in the warehouse
+    - When to switch: Always, when the operation can be done in SQL
+  - When *not* to move to a faster, potentially more complicated tool:
+    - "This could be faster" (not a good reason)
+    - Current solution works and meets SLA
+    - You don't have the expertise to operate the new tool
+
 - Benchmarks
 
   - [Antico](https://github.com/AdrianAntico/Benchmarks?tab=readme-ov-file) (2024-06-21) - Tests [{collapse}]{style="color: #990000"}, [{duckdb}]{style="color: #990000"}, [{data.table}]{style="color: #990000"}, [{polars}]{style="color: goldenrod"}, and [{pandas}]{style="color: goldenrod"}
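
A minimal sketch of the "push it into SQL" guideline above, using duckdb's Python API; the `sales.parquet` file and its columns are hypothetical:

``` python
# Sketch: push the group-by into DuckDB instead of loading all rows into a dataframe.
# DuckDB scans the Parquet file directly; only the small aggregate comes back to Python.
import duckdb

con = duckdb.connect()  # in-memory database
result = con.execute(
    """
    SELECT segment, SUM(amount) AS total_amount
    FROM read_parquet('sales.parquet')  -- hypothetical file and columns
    GROUP BY segment
    ORDER BY total_amount DESC
    """
).df()  # collect the aggregated result as a pandas DataFrame
```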

qmd/code-optimization.qmd

Lines changed: 18 additions & 1 deletion
@@ -8,6 +8,7 @@
   - Estimate the asymptotic complexity (big-O notation) of any R code that depends on some data size N: `references_best()`
   - Compare time/memory of different git versions of R package code: `atime_versions()`
   - Continuous performance testing of R packages: `atime_pkg()`
+- [{]{style="color: goldenrod"}[line_profiler](https://github.com/pyutils/line_profiler){style="color: goldenrod"}[}]{style="color: goldenrod"} - A module for doing line-by-line profiling of functions. kernprof is a convenient script for running either line_profiler or the Python standard library's cProfile or profile modules, depending on what is available.
 - [{]{style="color: #990000"}[pipetime](https://cygei.github.io/pipetime/){style="color: #990000"}[}]{style="color: #990000"} - Measures and logs code execution time within pipes
 - [{]{style="color: #990000"}[syrup](https://cran.r-project.org/web/packages/syrup/){style="color: #990000"}[}]{style="color: #990000"} - Measure Memory and CPU Usage for Parallel R Code
 - [{]{style="color: #990000"}[memuse](https://cran.r-project.org/web/packages/memuse/){style="color: #990000"}[}]{style="color: #990000"} - Memory Estimation Utilities
@@ -40,12 +41,28 @@
 - Function call overhead
 
   - Yes, function calls have overhead. No, it doesn't matter for data pipelines. A function call costs \~100 nanoseconds. Your database query costs 500 milliseconds. Do the math.
+  - Guidelines ([source](https://dagster.io/blog/when-and-when-not-to-optimize-data-pipelines))
+    - Don't optimize if you're:
+      - Optimizing code that runs infrequently
+      - Optimizing before measuring
+      - Optimizing Python/R when the database is the bottleneck
+      - Adding complexity to save \<10% runtime
+      - Optimizing because it's "more elegant"
+      - Using a "faster" library you don't understand
+      - Parallel processing without checking if you're I/O-bound
+    - Start optimizing if:
+      - Pipeline misses SLA regularly
+      - Profiling shows a clear bottleneck (\>50% of runtime) (see the line_profiler sketch after this diff)
+      - Bottleneck is CPU/memory-bound in your code
+      - Optimization has clear ROI (time saved × frequency)
+      - You've already fixed the database queries
+      - Users are actually waiting for results
 - Common valuable optimizations ([source](https://dagster.io/blog/when-and-when-not-to-optimize-data-pipelines))
   - Database queries
     - e.g. Adding an index, fixing a join, or using proper partitioning
   - Unnecessary data loading
     - Only load the data you need.
-    - e.g. Parquets partitioned by date allow you to only read that date's parquet file. Lots of data reading functions allow you to filter rows or select columns when reading in data.
+    - e.g. Parquets partitioned by date allow you to only read that date's parquet file. Lots of data reading functions allow you to chunk, filter rows, or select columns when reading in data.
   - Inefficient iteration patterns
     - e.g. use vectorized functions wherever possible
   - Serialization/deserialization in tight loops
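
A sketch of the line_profiler/kernprof workflow mentioned above, for confirming a line-level bottleneck before optimizing; the script and `slow_sum` function are made up for illustration:

``` python
# pipeline_step.py -- hypothetical script; run it under the profiler with:
#   kernprof -l -v pipeline_step.py
# With -l, kernprof injects a `profile` decorator and prints per-line timings,
# which is how you verify that a bottleneck really is >50% of the runtime.

@profile  # provided by kernprof at runtime; remove it for normal runs
def slow_sum(n):
    """Deliberately loop-heavy so the per-line report has something to show."""
    total = 0
    for i in range(n):  # the report attributes most of the time to these lines
        total += i * i
    return total

if __name__ == "__main__":
    slow_sum(5_000_000)
```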

qmd/json.qmd

Lines changed: 4 additions & 3 deletions
@@ -4,13 +4,14 @@
 
 - Packages
 
-  - [{]{style="color: #990000"}[yyjsonr](https://coolbutuseless.github.io/package/yyjsonr/index.html){style="color: #990000"}[}]{style="color: #990000"} - A fast JSON parser/serializer, which converts R data to/from JSON and NDJSON. It is around 2x to 10x faster than jsonlite at both reading and writing JSON.
+  - [{]{style="color: #990000"}[dir2json](https://parmsam.github.io/dir2json-r/){style="color: #990000"}[}]{style="color: #990000"} - A utility for converting directories into JSON format and decoding JSON back into directory structures
+    - Handles a variety of file types within the directory, including text and binary files (e.g., images, PDFs)
+  - [{]{style="color: goldenrod"}[orjson](https://pypi.org/project/orjson/){style="color: goldenrod"}[}]{style="color: goldenrod"} - A fast, correct JSON library for Python. It benchmarks as the fastest Python library for JSON and is more correct than the standard json library or other third-party python libraries. It serializes dataclass, datetime, numpy, and UUID instances natively. (See the round-trip sketch after this diff.)
   - [{]{style="color: #990000"}[RcppSimdJson](https://dirk.eddelbuettel.com/code/rcpp.simdjson.html){style="color: #990000"}[}]{style="color: #990000"} - Comparable to {yyjsonr} in performance.
     - Might be faster than yyjsonr for very large / nested data
   - [{]{style="color: #990000"}[rlowdb](https://cran.r-project.org/web/packages/rlowdb/index.html){style="color: #990000"}[}]{style="color: #990000"} - A lightweight, file-based JSON database. Inspired by '[LowDB](https://github.com/typicode/lowdb)' in 'JavaScript', it generates an intuitive interface for storing, retrieving, updating, and querying structured data without requiring a full-fledged database system.
   - [{]{style="color: #990000"}[unnest](https://vspinu.github.io/unnest/){style="color: #990000"}[}]{style="color: #990000"} - A zero-dependency R package for a very fast single-copy and single-pass unnesting of hierarchical data structures.
-  - [{]{style="color: #990000"}[dir2json](https://parmsam.github.io/dir2json-r/){style="color: #990000"}[}]{style="color: #990000"} - A utility for converting directories into JSON format and decoding JSON back into directory structures
-    - Handles a variety of file types within the directory, including text and binary files (e.g., images, PDFs),
+  - [{]{style="color: #990000"}[yyjsonr](https://coolbutuseless.github.io/package/yyjsonr/index.html){style="color: #990000"}[}]{style="color: #990000"} - A fast JSON parser/serializer, which converts R data to/from JSON and NDJSON. It is around 2x to 10x faster than jsonlite at both reading and writing JSON.
 
 - Also see
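
A minimal round-trip sketch of the orjson behavior described above (datetime, UUID, and numpy handling); the payload is made up:

``` python
# Sketch: orjson.dumps() returns bytes (not str) and serializes datetime, UUID,
# and (with an option flag) numpy arrays natively; orjson.loads() parses them back.
import datetime
import uuid

import numpy as np
import orjson

payload = {
    "id": uuid.uuid4(),
    "ts": datetime.datetime(2024, 6, 21, 12, 0),
    "values": np.array([1, 2, 3]),
}

raw = orjson.dumps(payload, option=orjson.OPT_SERIALIZE_NUMPY)  # -> bytes
data = orjson.loads(raw)  # -> dict of plain Python types
```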

qmd/llms-mcp.qmd

Lines changed: 8 additions & 1 deletion
@@ -13,7 +13,14 @@
 - [{]{style="color: #990000"}[mcptools](https://posit-dev.github.io/mcptools/){style="color: #990000"}[}]{style="color: #990000"} - Allows MCP-enabled tools like Claude Desktop, Claude Code, and VS Code GitHub Copilot to run R code *in the sessions you have running* to answer your questions
   - Works well with [{btw}]{style="color: #990000"}
 - [{]{style="color: #990000"}[mcpr](https://mcpr.opifex.org/){style="color: #990000"}[}]{style="color: #990000"} - Enables R applications to expose capabilities (tools, resources, and prompts) to AI models through a standard JSON-RPC 2.0 interface. It also provides client functionality to connect to and interact with MCP servers
-- [{]{style="color: goldenrod"}[fastapi_mcp](https://fastapi-mcp.tadata.com/getting-started/welcome){style="color: goldenrod"}[}]{style="color: goldenrod"} - Exposes FastAPI endpoints as Model Context Protocol (MCP) tools with Auth
+- [{]{style="color: goldenrod"}[fastapi_mcp](https://fastapi-mcp.tadata.com/getting-started/welcome){style="color: goldenrod"}[}]{style="color: goldenrod"} - Exposes FastAPI endpoints as Model Context Protocol (MCP) tools with Auth (i.e. takes an existing FastAPI application and essentially turns it into an MCP server)
+- [{]{style="color: #990000"}[plumber2mcp](https://arman.aksoy.org/plumber2mcp/){style="color: #990000"}[}]{style="color: #990000"} - Takes a plumber API and exposes its endpoints as MCP utilities (same idea as [{fastapi_mcp}]{style="color: goldenrod"}, but for plumber)
+  - By adding MCP support to your Plumber API, you make your R functions available as:
+    - Tools: AI assistants can call your API endpoints directly
+    - Resources: AI assistants can read documentation, data, and analysis results
+    - Prompts: AI assistants can use pre-defined templates to guide interactions
 - Resources
   - [Docs](https://modelcontextprotocol.io/introduction)
   - [Model Context Protocol servers](https://github.com/modelcontextprotocol/servers) - Links to official mcp servers and community-based servers

qmd/python-polars.qmd

Lines changed: 30 additions & 0 deletions
@@ -126,6 +126,12 @@
 | Conditional | `filter(df, x > 4)` | `df.filter(pl.col("x") > 4)` |
 | Sort rows | `arrange(df, x)` | `df.sort("x")` |
 
+- [Example]{.ribbon-highlight}: Basic
+
+  ``` python
+  filtered_df = df_pl.filter(pl.col("region") == "Europe")
+  ```
+
 ## Mutate {#sec-py-polars-mut .unnumbered}
 
 - Quick Reference
@@ -522,6 +528,12 @@
 |:-----------------------|:-----------------------|:-----------------------|
 | Join dataframes | `left_join(df1, df2, by=x)` | `df1.join(df2, on="x", how="left")` |
 
+- [Example]{.ribbon-highlight}: Basic Left Join
+
+  ``` python
+  merged_df = df_pl.join(pop_df_pl, on="country", how="left")
+  ```
+
 - [Example]{.ribbon-highlight}: Left Join
 
 ::: panel-tabset
@@ -917,6 +929,24 @@
 - `partition_by` splits the dataframe into a list of dataframes (like `split` in base R or `group_split` in [{dplyr}]{style="color: #990000"}). Since [as_dict = True]{.arg-text}, it's actually a dictionary of dataframes
 - This can also be done within the unsplit dataframe using `map_batch` and `map_elements` but it's less efficient. (See the previous section in the article for details)
 
+## Lazy Evaluation {#sec-py-polars-leval .unnumbered}
+
+- [Example]{.ribbon-highlight}: Basic
+
+  ``` python
+  import polars as pl
+
+  df_lazy = (
+      pl.scan_csv("sales.csv")
+      .filter(pl.col("amount") > 100)
+      .group_by("segment")
+      .agg(pl.col("amount").mean())
+      .sort("amount")
+  )
+
+  result = df_lazy.collect()
+  ```
+
 - [Example]{.ribbon-highlight}: Lazily aggregate Delta lake files in AWS S3 ([source](https://dataengineeringcentral.substack.com/p/650gb-of-data-delta-lake-on-s3-polars))
 
   ``` python
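
A follow-up sketch to the lazy-evaluation example above (not part of the commit): `LazyFrame.explain()` prints the optimized query plan, which shows the filter being pushed into the CSV scan before anything is read. Same hypothetical `sales.csv` columns as above; `group_by` is the current polars spelling of the older `groupby`.

``` python
import polars as pl

df_lazy = (
    pl.scan_csv("sales.csv")          # nothing is read yet
    .filter(pl.col("amount") > 100)
    .group_by("segment")
    .agg(pl.col("amount").mean())
)

print(df_lazy.explain())    # optimized logical plan; predicate pushdown is visible here
result = df_lazy.collect()  # execution actually happens at collect()
```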
