Commit 350ba71

committed Feb 23, 2024
docs: Update article with suggestions
1 parent fa89010 commit 350ba71

2 files changed: +14 -2 lines changed


‎.vscode/settings.json

+8
@@ -0,0 +1,8 @@
+{
+    "grammarly.files.include": [
+        "**/README.md",
+        "**/readme.md",
+        "**/*.txt",
+        "**/social_media_retrieval.md"
+    ]
+}

‎docs/use_cases/social_media_retrieval.md

+6 -2
@@ -87,10 +87,14 @@ Before diving into the code, let's look over a LinkedIn post to address the chal
 As you can see, during our preprocessing step, we have to take care of the following aspects that are not compatible with the embedding model:
 - emojis
 - bold, italic text
-- URLs
 - other non-ASCII characters
+- URLs
 - exceed the context window of the embedding model
 
+For example, the emojis, bolded and italic text are represented by Unicode characters that are not available in the vocabulary of the embedding model. Thus, these items cannot be tokenized and passed to the model. That is why we have to remove them or normalize them to something that can be parsed by the tokenizer. The same principle applies to all other non-ASCII characters.
+
+The URLs' value does not provide much value but uselessly fills the context window. Still, it is valuable to know there is a URL in the sentence. That is why we will replace all the URLs with a `[URL]` token. By doing so, we get the best of both worlds.
+
 ## 3. Settings
 
 It is good practice to have a single place to configure your application. We used `pydantic` to quickly implement an `AppSettings` class that contains all the default settings and can be overwritten by other files such as `.env` or `yaml`.
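
Note on the two paragraphs added in this hunk: the cleaning they describe (normalize or drop characters the tokenizer cannot handle, and swap each URL for a `[URL]` token) could look roughly like the sketch below. This is not code from the commit or from the article. The names `clean_post`, `URL_PATTERN`, and `url_token` are hypothetical, and combining NFKC normalization with an ASCII filter is only one assumed way to do it, using nothing beyond the Python standard library.

```python
import re
import unicodedata

# Hypothetical pattern: matches http(s) links and bare "www." links up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def clean_post(text: str, url_token: str = "[URL]") -> str:
    """Illustrative cleaner: replace URLs, fold styled letters, drop non-ASCII."""
    # Replace URLs first, so the later steps only ever see the placeholder token.
    text = URL_PATTERN.sub(url_token, text)
    # NFKC maps "styled" Unicode letters (e.g. mathematical bold/italic) to plain letters.
    text = unicodedata.normalize("NFKC", text)
    # Drop whatever is still non-ASCII (emojis and similar symbols).
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse the whitespace gaps left behind by removed characters.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    # U+1D400 and U+1D401 are mathematical bold "A" and "B"; U+1F680 is a rocket emoji.
    sample = "\U0001D400\U0001D401 \U0001F680 see https://example.com/x now"
    print(clean_post(sample))
```

Normalizing before filtering matters here: bold and italic letters are folded to their plain equivalents instead of being deleted, so the words survive. One known limitation of this sketch is that punctuation directly attached to a URL is swallowed into the `[URL]` token.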
@@ -390,7 +394,7 @@ Ultimately, you load the serialized data to the vector DB.
 
 Here, we will focus on preprocessing a user's query, searching the vector DB, and postprocessing the retrieved posts for maximum results.
 
-To design the retrieval step, we implemented a `QdrantVectorDBRetriever` class to expose all the necessary features for our retrieval client.
+To design the retrieval step, we implemented a `QdrantVectorDBRetriever` class to expose all the necessary features for our retrieval client. In the following sections, we will dive into implementing each class method.
 
 ```python
 class QdrantVectorDBRetriever:
