Commit c024c62

alg-mark >> prop score usage; alg-rec >> ERR; db-vec >> Weaviate, FAISS
1 parent 78581dd commit c024c62

File tree

5 files changed (+44, -5 lines)

qmd/algorithms-learn-to-rank.qmd

Lines changed: 3 additions & 2 deletions

@@ -35,7 +35,7 @@
 ## Diagnostics {#sec-alg-ltr-diag .unnumbered}

 - Misc
-    - Also see [Algorithms, Recommendation](Algorithms,%20Recommendation) \>\> Metrics
+    - Also see [Algorithms, Recommendation \>\> Metrics](algorithms-recommendation.qmd#sec-alg-recom-metrics){style="color: green"}
     - Use binary relevance metrics if the goal is to assign a binary relevance score to each document.
     - Use graded relevance metrics if the goal is to set a continuous relevance score for each document.
 - Mean Average Precision (MAP)\
@@ -45,7 +45,8 @@
     - Issues
         - It does not consider the ranking of the retrieved items, only the presence or absence of relevant documents.
         - It may not be appropriate for datasets where the relevance of items is not binary, as it does not consider an item's degree of relevance.
-- Mean Reciprocal Rank (MRR)![](./_resources/Algorithms,_Learn-to-Rank.resources/image.2.png){.lightbox width="116"}
+- Mean Reciprocal Rank (MRR)\
+  ![](./_resources/Algorithms,_Learn-to-Rank.resources/image.2.png){.lightbox width="116"}
     - Where Q is the total number of queries and r is the rank of the first relevant document for a query
     - Issue: considers only the first relevant document for the given query
 - Normalized Discounted Cumulative Gain (NDCG)![](./_resources/Algorithms,_Learn-to-Rank.resources/image.3.png){.lightbox width="469"}
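The MRR definition above is easy to make concrete. A minimal sketch (not from the notes or the linked articles; query data is made up) that averages the reciprocal rank of the first relevant result over Q queries:

```python
# Minimal MRR sketch: each query is a list of binary relevance labels
# in ranked order. MRR = (1/Q) * sum over queries of 1/rank of the
# first relevant document (0 if no relevant document is retrieved).

def reciprocal_rank(relevances):
    """Return 1/rank of the first relevant result, or 0.0 if none."""
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """Average the reciprocal ranks across all queries."""
    return sum(reciprocal_rank(q) for q in queries) / len(queries)

# Three toy queries: first relevant hit at ranks 1, 3, and 2.
queries = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
print(mean_reciprocal_rank(queries))  # (1 + 1/3 + 1/2) / 3 ≈ 0.611
```

Note how only the first relevant document per query matters, which is exactly the issue flagged above.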

qmd/algorithms-marketing.qmd

Lines changed: 16 additions & 1 deletion
@@ -16,17 +16,32 @@
 ## Propensity Model {#sec-alg-mark-prop .unnumbered}

-- Uses GA data for your website to model probabilities of a customer purchasing
+- Models the probability of a customer purchasing, so you can efficiently target the customers most likely to buy.
 - Helps marketers to decrease cost per acquisition (CPA) and increase ROI
     - You might want to have a different marketing approach with a customer that is very close to buying than with one who might not even have heard of your product.
     - Also, if you have a limited media budget, you can focus it on customers that have a high likelihood to buy and not spend too much on the ones that are long shots.
+- Don't use one unless there's a real, measurable cost from targeting broadly ([source](https://betterthanrandom.substack.com/p/do-you-really-need-a-lead-scoring))
+    - Example: Sales Team
+        - You have ten account executives and five thousand accounts to cover for opportunities. There is simply no way they can touch everyone. A lead scoring model helps allocate scarce human effort to the places where it is most likely to pay off.
+        - Cost: sales reps' time.
+            - There are only so many humans, only so many calls they can make, only so many hours in the quarter. The cost is visible on the P&L (Profit and Loss statement).
+    - Example: Product Team
+        - Using propensity scores in product surfaces to drive upsell banners, e.g. "Hey, you might love feature X."
+        - Costs
+            - Opportunity cost, because banner space is finite: if we use it to show an upsell nudge, we cannot show potentially more relevant information such as onboarding tips, recent bug fixes, or relevant industry news.
+            - User experience cost, because irrelevant messages can be annoying and degrade user satisfaction with the product.
+        - Issues
+            - Nobody really understands the notion of opportunity cost; it requires nuance.
+            - User experience cost has a very laggy transmission cycle.
+            - The cost of trying to reach all customers is unclear and likely negligible.
 - [Example]{.ribbon-highlight}: Using Google Analytics data
     - Notes from [Scoring Customer Propensity using Machine Learning Models on Google Analytics Data](https://medium.com/artefact-engineering-and-data-science/scoring-customer-propensity-using-machine-learning-models-on-google-analytics-data-ba1126469c1f)
     - Data
         - Used GA360, so the raw data is nested at the session level
             - See [Google, Analytics \>\> Misc](google-analytics-reports.qmd#sec-goog-anal-rep-misc){style="color: green"} \>\> "Google Analytics data in BigQuery" for more details on this type of data
         - After processing you want 1 row per customer
         - GA keeps data for 3 months by default
+        - Product usage logs, interactions with marketing materials, CRM records, past purchase history, support tickets, firmographic information
     - Create features\
       ![](./_resources/Algorithms,_Marketing.resources/1-shiin_SbsaPP8btmh8E3Mg.png)
     - General Features - metrics that give general information about a session
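The sales-team example in this hunk (scarce rep capacity allocated by score) can be sketched in a few lines. This is illustrative only, not from the cited articles; the account names, scores, and capacity numbers are hypothetical:

```python
# Hypothetical sketch: use propensity/lead scores to allocate scarce
# sales-rep capacity. Reps only work the top-scoring accounts that fit
# within total capacity (n_reps * accounts_per_rep).

def target_accounts(scores, n_reps, accounts_per_rep):
    """Return account ids for the highest propensity scores, up to capacity."""
    capacity = n_reps * accounts_per_rep
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:capacity]

# Toy scores from some upstream propensity model.
scores = {"acct_a": 0.91, "acct_b": 0.15, "acct_c": 0.62, "acct_d": 0.78}

# With capacity for only 2 accounts, reps work the two best leads.
print(target_accounts(scores, n_reps=1, accounts_per_rep=2))  # ['acct_a', 'acct_d']
```

This only pays off when capacity is genuinely scarce; with no binding cost (the "reach all customers" case above), ranking adds complexity without changing who gets contacted.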

qmd/algorithms-recommendation.qmd

Lines changed: 19 additions & 0 deletions
@@ -139,6 +139,25 @@
 - [Example]{.ribbon-highlight}: 3 queries\
   ![](./_resources/Algorithms,_Recommendation.resources/Screenshot%20(991).png)

+- **Expected Reciprocal Rank (ERR)**\
+  $$
+  \begin{align}
+  &\text{ERR} = \sum_{r=1}^n \frac{1}{r} \cdot R_r \cdot \prod_{i=1}^{r-1} (1-R_i) \\
+  &\text{where} \;\; R_i = \frac{2^{l_i} - 1}{2^{l_m}}
+  \end{align}
+  $$
+    - See [article](https://towardsdatascience.com/why-map-and-mrr-fail-for-search-ranking-and-what-to-use-instead/) for more details
+    - Computes the expected reciprocal rank of the position at which the user stops because they are satisfied
+    - Assumes a cascade user model wherein a user does the following:
+        - Scans the list from top to bottom
+        - At each rank $i$,
+            - With probability $R_i$, the user is satisfied and stops
+            - With probability $1-R_i$, the user continues looking ahead
+    - $l_m$ is the maximum possible label value
+    - $\frac{2^{l_i} - 1}{2^{l_m}}$ is "graded relevance", so a result can partially satisfy the user
+    - ERR allows multiple relevant items to contribute; early high-quality items reduce the contribution of later items
 - **Normalized Discounted Cumulative Gain (NDCG)** (Mhaskar, 2015)

 $$

qmd/apis-build.qmd

Lines changed: 5 additions & 2 deletions
@@ -35,9 +35,10 @@
 - Alternatives to a standard synchronous API are needed for situations where there are high volumes of requests or endpoints with long-running tasks.
     - e.g. Training machine learning models or performing batch ETL jobs
 - Notes from
-    - [Performance Optimization for Plumber APIs: Async How async programming can improve API performance.](https://joekirincic.com/posts/performance-optimization-for-plumber-apis-async/)
-    - [Performance Optimization for Plumber APIs: Long Running Jobs](https://joekirincic.com/posts/performance-optimization-for-plumber-apis-long-running-jobs/) ([Github](https://github.com/joekirincic/performance-optimization-for-plumber-apis))
+    - [Performance Optimization for Plumber APIs: Async How async programming can improve API performance.](https://joekirincic.com/posts/performance-optimization-for-plumber-apis-async/) ([Github](https://github.com/joekirincic/performance-optimization-for-plumber-apis/tree/main/async))
+    - [Performance Optimization for Plumber APIs: Long Running Jobs](https://joekirincic.com/posts/performance-optimization-for-plumber-apis-long-running-jobs/) ([Github](https://github.com/joekirincic/performance-optimization-for-plumber-apis/tree/main/long-running-jobs))
 - [Asynchronous Programming (Async)]{.underline}
+    - See the article repo for code
     - Executes multiple computations/requests without waiting for each one to finish
     - Single-threaded programming languages like R and Python handle this by starting background R/Python processes in addition to the main one.
     - Async typically benefits apps with a lot of I/O-bound tasks rather than CPU-bound tasks
@@ -63,6 +64,7 @@
     - Is the endpoint primarily I/O-bound?
     - Does it take a long time to execute (e.g. 100-500ms or more)?
 - [Polling Pattern]{.underline}
+    - See the article repo for code
     - AKA [async request-reply pattern](https://learn.microsoft.com/en-us/azure/architecture/patterns/async-request-reply)
     - Take your endpoint and turn it into four: one to accept the job, one to check status, one to retrieve results, and one to cancel.
     - Issues
@@ -78,6 +80,7 @@
     - You don't own the client or can't easily make changes to it. The added complexity webhooks introduce, including backend changes and authentication, may make them a no-go for the client we're integrating with. In these cases, polling is simpler to implement for any client.
     - Updates occur at a high frequency. If our tasks get large amounts of updates, instead of ruthlessly pinging our app for updates, the client can submit requests in batches, thereby controlling the toll polling places on our app.
 - [Webhooks]{.underline}
+    - See the article repo for code
     - An event-driven strategy for communicating with other web services.
     - When the client requests some work to be done, it also provides a URL for the server to post the task result to when the task is complete.
     - Two Endpoints: Submit Task and Cancel Task
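The polling pattern's "turn one endpoint into four" idea can be sketched without a web framework. The article's repo uses R/plumber; this in-memory Python version is illustrative only (function names and the job store are assumptions, not the article's API):

```python
# Polling-pattern sketch: submit returns immediately with a job id
# (HTTP 202 in a real API); the client then polls status, fetches the
# result when done, or cancels. Jobs run on background threads.

import threading
import uuid

jobs = {}  # job_id -> {"status": ..., "result": ...}

def submit(task, *args):
    """Accept the job, start it in the background, return an id immediately."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "running", "result": None}

    def run():
        out = task(*args)
        if jobs[job_id]["status"] != "cancelled":
            jobs[job_id].update(status="done", result=out)

    threading.Thread(target=run).start()
    return job_id

def status(job_id):
    """Endpoint 2: let the client poll for progress."""
    return jobs[job_id]["status"]

def result(job_id):
    """Endpoint 3: retrieve the finished result."""
    return jobs[job_id]["result"]

def cancel(job_id):
    """Endpoint 4: mark the job cancelled so its result is discarded."""
    jobs[job_id]["status"] = "cancelled"
```

A client would call `submit`, then poll `status` until it returns `"done"` before calling `result`, which is exactly the chattiness trade-off the webhook section addresses.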

qmd/db-vector.qmd

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@
 ## Brands {#sec-db-vect-bran .unnumbered}

+- Weaviate, FAISS
 - Qdrant - open source, free, and easy to use ([example](https://towardsdatascience.com/how-i-turned-my-companys-docs-into-a-searchable-database-with-openai-4f2d34bd8736))
 - [Chroma](https://www.trychroma.com/) - Can be used as a local in-memory database ([example](https://towardsdatascience.com/implementing-a-sales-support-agent-with-langchain-63c4761193e7))
 - [Chroma image](https://github.com/cwensel/chroma-embedded) - A Chroma image with built-in support for multiple state-of-the-art embedding models, enabling semantic search across PDFs, source code, and documentation with store-optimized chunking strategies
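At their core, all of these brands (Weaviate, FAISS, Qdrant, Chroma) answer the same query: which stored embeddings are nearest to a query vector. A brute-force cosine-similarity sketch of that operation, with toy 2-d vectors (real engines add ANN indexes such as HNSW or IVF to make this fast at scale):

```python
# Toy nearest-neighbor search over a dict of id -> embedding.
# This is the operation a vector DB indexes and accelerates.

import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search(index, query, k=1):
    """Return the k stored ids most similar to the query vector."""
    ranked = sorted(index, key=lambda i: cosine(index[i], query), reverse=True)
    return ranked[:k]

index = {"doc1": [1.0, 0.0], "doc2": [0.0, 1.0], "doc3": [0.7, 0.7]}
print(search(index, [0.9, 0.1], k=2))  # ['doc1', 'doc3']
```

Brute force is O(n) per query, which is why the brands above exist: they trade exactness for sub-linear approximate search over millions of vectors.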
