Commit c024c62

alg-mark >> prop score usage; alg-rec >> ERR; db-vec >> Weaviate, FAISS
1 parent 78581dd commit c024c62

File tree

5 files changed (+44, -5 lines)

qmd/algorithms-learn-to-rank.qmd

Lines changed: 3 additions & 2 deletions

@@ -35,7 +35,7 @@
 ## Diagnostics {#sec-alg-ltr-diag .unnumbered}

 - Misc
-    - Also see [Algorithms, Recommendation](Algorithms,%20Recommendation) \>\> Metrics
+    - Also see [Algorithms, Recommendation \>\> Metrics](algorithms-recommendation.qmd#sec-alg-recom-metrics){style="color: green"}
     - Use binary relevance metrics if the goal is to assign a binary relevance score to each document.
     - Use graded relevance metrics if the goal is to set a continuous relevance score for each document.
 - Mean Average Precision (MAP)\
@@ -45,7 +45,8 @@
     - Issues
         - It does not consider the ranking of the retrieved items, only the presence or absence of relevant documents.
         - It may not be appropriate for datasets where the relevance of items is not binary, as it does not consider an item's degree of relevance.
-- Mean Reciprocal Rank (MRR)![](./_resources/Algorithms,_Learn-to-Rank.resources/image.2.png){.lightbox width="116"}
+- Mean Reciprocal Rank (MRR)\
+  ![](./_resources/Algorithms,_Learn-to-Rank.resources/image.2.png){.lightbox width="116"}
     - Where Q is the total number of queries and r is the rank of the first relevant document for a query
     - Issue: considers only the first relevant document for the given query
 - Normalized Discounted Cumulative Gain (NDCG)![](./_resources/Algorithms,_Learn-to-Rank.resources/image.3.png){.lightbox width="469"}
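The MRR definition above is easy to make concrete. A minimal sketch (not from the notes or the linked articles; query data is made up) that averages the reciprocal rank of the first relevant result over Q queries:

```python
# Minimal MRR sketch: each query is a list of binary relevance labels
# in ranked order. MRR = (1/Q) * sum over queries of 1/rank of the
# first relevant document (0 if no relevant document is retrieved).

def reciprocal_rank(relevances):
    """Return 1/rank of the first relevant result, or 0.0 if none."""
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """Average the reciprocal ranks across all queries."""
    return sum(reciprocal_rank(q) for q in queries) / len(queries)

# Three toy queries: first relevant hit at ranks 1, 3, and 2.
queries = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
print(mean_reciprocal_rank(queries))  # (1 + 1/3 + 1/2) / 3 ≈ 0.611
```

Note how only the first relevant document per query matters, which is exactly the issue flagged above.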

qmd/algorithms-marketing.qmd

Lines changed: 16 additions & 1 deletion
@@ -16,17 +16,32 @@
 ## Propensity Model {#sec-alg-mark-prop .unnumbered}

-- Uses GA data for your website to model probabilities of a customer purchasing
+- Models the probability of a customer purchasing, so you can efficiently target the customers most likely to buy.
 - Helps marketers to decrease cost per acquisition (CPA) and increase ROI
     - You might want to have a different marketing approach with a customer that is very close to buying than with one who might not even have heard of your product.
     - Also, if you have a limited media budget, you can focus it on customers that have a high likelihood to buy and not spend too much on the ones that are long shots.
+- Don't use one unless there's a real, measurable cost from targeting broadly ([source](https://betterthanrandom.substack.com/p/do-you-really-need-a-lead-scoring))
+    - Example: Sales Team
+        - You have ten account executives and five thousand accounts to cover for opportunities. There is simply no way they can touch everyone. A lead scoring model helps allocate scarce human effort to the places where it is most likely to pay off.
+        - Cost: sales reps' time.
+            - There are only so many humans, only so many calls they can make, only so many hours in the quarter. The cost is visible on the P&L (Profit and Loss statement).
+    - Example: Product Team
+        - Using propensity scores in product surfaces to drive upsell banners, e.g. "Hey, you might love feature X."
+        - Costs
+            - Opportunity cost, because banner space is finite: if we use it to show an upsell nudge, we cannot show potentially more relevant information such as onboarding tips, recent bug fixes, or relevant industry news.
+            - User experience cost, because irrelevant messages can be annoying and degrade user satisfaction with the product.
+        - Issues
+            - Nobody really understands the notion of opportunity cost; it requires nuance.
+            - User experience cost has a very laggy transmission cycle.
+            - The cost of trying to reach all customers is unclear and likely negligible.
 - [Example]{.ribbon-highlight}: Using Google Analytics data
     - Notes from [Scoring Customer Propensity using Machine Learning Models on Google Analytics Data](https://medium.com/artefact-engineering-and-data-science/scoring-customer-propensity-using-machine-learning-models-on-google-analytics-data-ba1126469c1f)
     - Data
         - Used GA360, so the raw data is nested at the session level
             - See [Google, Analytics \>\> Misc](google-analytics-reports.qmd#sec-goog-anal-rep-misc){style="color: green"} \>\> "Google Analytics data in BigQuery" for more details on this type of data
         - After processing you want 1 row per customer
         - GA keeps data for 3 months by default
+        - Product usage logs, interactions with marketing materials, CRM records, past purchase history, support tickets, firmographic information
     - Create features\
       ![](./_resources/Algorithms,_Marketing.resources/1-shiin_SbsaPP8btmh8E3Mg.png)
     - General Features - metrics that give general information about a session
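The sales-team example in this hunk (scarce rep capacity allocated by score) can be sketched in a few lines. This is illustrative only, not from the cited articles; the account names, scores, and capacity numbers are hypothetical:

```python
# Hypothetical sketch: use propensity/lead scores to allocate scarce
# sales-rep capacity. Reps only work the top-scoring accounts that fit
# within total capacity (n_reps * accounts_per_rep).

def target_accounts(scores, n_reps, accounts_per_rep):
    """Return account ids for the highest propensity scores, up to capacity."""
    capacity = n_reps * accounts_per_rep
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:capacity]

# Toy scores from some upstream propensity model.
scores = {"acct_a": 0.91, "acct_b": 0.15, "acct_c": 0.62, "acct_d": 0.78}

# With capacity for only 2 accounts, reps work the two best leads.
print(target_accounts(scores, n_reps=1, accounts_per_rep=2))  # ['acct_a', 'acct_d']
```

This only pays off when capacity is genuinely scarce; with no binding cost (the "reach all customers" case above), ranking adds complexity without changing who gets contacted.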

qmd/algorithms-recommendation.qmd

Lines changed: 19 additions & 0 deletions
@@ -139,6 +139,25 @@
 - [Example]{.ribbon-highlight}: 3 queries\
   ![](./_resources/Algorithms,_Recommendation.resources/Screenshot%20(991).png)

+- **Expected Reciprocal Rank (ERR)**\
+  $$
+  \begin{align}
+  &\text{ERR} = \sum_{r=1}^n \frac{1}{r} \cdot R_r \cdot \prod_{i=1}^{r-1} (1-R_i) \\
+  &\text{where} \;\; R_i = \frac{2^{l_i} - 1}{2^{l_m}}
+  \end{align}
+  $$
+    - See [article](https://towardsdatascience.com/why-map-and-mrr-fail-for-search-ranking-and-what-to-use-instead/) for more details
+    - Computes the expected reciprocal rank of the position at which the user stops because they are satisfied
+    - Assumes a cascade user model wherein a user does the following:
+        - Scans the list from top to bottom
+        - At each rank $i$,
+            - With probability $R_i$, the user is satisfied and stops
+            - With probability $1-R_i$, the user continues looking ahead
+    - $l_m$ is the maximum possible label value
+    - $\frac{2^{l_i} - 1}{2^{l_m}}$ is "graded relevance", so a result can partially satisfy the user
+    - ERR allows multiple relevant items to contribute; early high-quality items reduce the contribution of later items
 - **Normalized Discounted Cumulative Gain (NDCG)** (Mhaskar, 2015)

 $$

qmd/apis-build.qmd

Lines changed: 5 additions & 2 deletions
@@ -35,9 +35,10 @@
 - Alternatives to a standard synchronous API are needed for situations where there are high volumes of requests or endpoints with long-running tasks.
     - e.g. Training machine learning models or performing batch ETL jobs
 - Notes from
-    - [Performance Optimization for Plumber APIs: Async How async programming can improve API performance.](https://joekirincic.com/posts/performance-optimization-for-plumber-apis-async/)
-    - [Performance Optimization for Plumber APIs: Long Running Jobs](https://joekirincic.com/posts/performance-optimization-for-plumber-apis-long-running-jobs/) ([Github](https://github.com/joekirincic/performance-optimization-for-plumber-apis))
+    - [Performance Optimization for Plumber APIs: Async How async programming can improve API performance.](https://joekirincic.com/posts/performance-optimization-for-plumber-apis-async/) ([Github](https://github.com/joekirincic/performance-optimization-for-plumber-apis/tree/main/async))
+    - [Performance Optimization for Plumber APIs: Long Running Jobs](https://joekirincic.com/posts/performance-optimization-for-plumber-apis-long-running-jobs/) ([Github](https://github.com/joekirincic/performance-optimization-for-plumber-apis/tree/main/long-running-jobs))
 - [Asynchronous Programming (Async)]{.underline}
+    - See the article repo for code
     - Executes multiple computations/requests without waiting for each one to finish
     - Single-threaded programming languages like R and Python handle this by starting background R/Python processes in addition to the main one.
     - Async typically benefits apps with a lot of I/O-bound tasks rather than CPU-bound tasks
@@ -63,6 +64,7 @@
     - Is the endpoint primarily I/O-bound?
     - Does it take a long time to execute (e.g. 100-500ms or more)?
 - [Polling Pattern]{.underline}
+    - See the article repo for code
     - AKA [async request-reply pattern](https://learn.microsoft.com/en-us/azure/architecture/patterns/async-request-reply)
     - Take your endpoint and turn it into four: one to accept the job, one to check status, one to retrieve results, and one to cancel.
     - Issues
@@ -78,6 +80,7 @@
     - You don't own the client or can't easily make changes to it. The added complexity webhooks introduce, including backend changes and authentication, may make them a no-go for the client we're integrating with. In these cases, polling is simpler to implement for any client.
     - Updates occur at a high frequency. If our tasks get large amounts of updates, instead of ruthlessly pinging our app for updates, the client can submit requests in batches, thereby controlling the toll polling places on our app.
 - [Webhooks]{.underline}
+    - See the article repo for code
     - An event-driven strategy for communicating with other web services.
     - When the client requests some work to be done, it also provides a URL for the server to post the task result to when the task is complete.
     - Two Endpoints: Submit Task and Cancel Task
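The polling pattern's "turn one endpoint into four" idea can be sketched without a web framework. The article's repo uses R/plumber; this in-memory Python version is illustrative only (function names and the job store are assumptions, not the article's API):

```python
# Polling-pattern sketch: submit returns immediately with a job id
# (HTTP 202 in a real API); the client then polls status, fetches the
# result when done, or cancels. Jobs run on background threads.

import threading
import uuid

jobs = {}  # job_id -> {"status": ..., "result": ...}

def submit(task, *args):
    """Accept the job, start it in the background, return an id immediately."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "running", "result": None}

    def run():
        out = task(*args)
        if jobs[job_id]["status"] != "cancelled":
            jobs[job_id].update(status="done", result=out)

    threading.Thread(target=run).start()
    return job_id

def status(job_id):
    """Endpoint 2: let the client poll for progress."""
    return jobs[job_id]["status"]

def result(job_id):
    """Endpoint 3: retrieve the finished result."""
    return jobs[job_id]["result"]

def cancel(job_id):
    """Endpoint 4: mark the job cancelled so its result is discarded."""
    jobs[job_id]["status"] = "cancelled"
```

A client would call `submit`, then poll `status` until it returns `"done"` before calling `result`, which is exactly the chattiness trade-off the webhook section addresses.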

qmd/db-vector.qmd

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@
 ## Brands {#sec-db-vect-bran .unnumbered}

+- Weaviate, FAISS
 - Qdrant - open source, free, and easy to use ([example](https://towardsdatascience.com/how-i-turned-my-companys-docs-into-a-searchable-database-with-openai-4f2d34bd8736))
 - [Chroma](https://www.trychroma.com/) - Can be used as a local in-memory database ([example](https://towardsdatascience.com/implementing-a-sales-support-agent-with-langchain-63c4761193e7))
 - [Chroma image](https://github.com/cwensel/chroma-embedded) - A Chroma image with built-in support for multiple state-of-the-art embedding models, enabling semantic search across PDFs, source code, and documentation with store-optimized chunking strategies
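At their core, all of these brands (Weaviate, FAISS, Qdrant, Chroma) answer the same query: which stored embeddings are nearest to a query vector. A brute-force cosine-similarity sketch of that operation, with toy 2-d vectors (real engines add ANN indexes such as HNSW or IVF to make this fast at scale):

```python
# Toy nearest-neighbor search over a dict of id -> embedding.
# This is the operation a vector DB indexes and accelerates.

import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search(index, query, k=1):
    """Return the k stored ids most similar to the query vector."""
    ranked = sorted(index, key=lambda i: cosine(index[i], query), reverse=True)
    return ranked[:k]

index = {"doc1": [1.0, 0.0], "doc2": [0.0, 1.0], "doc3": [0.7, 0.7]}
print(search(index, [0.9, 0.1], k=2))  # ['doc1', 'doc3']
```

Brute force is O(n) per query, which is why the brands above exist: they trade exactness for sub-linear approximate search over millions of vectors.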
