Commit e8dba71

Merge pull request #18 from davidhou17/DOCSP-49388
(DOCSP-49388): Create notebooks that use Voyage AI models
2 parents c643674 + 595a955 commit e8dba71

File tree

5 files changed: +843 additions, −5 deletions


create-embeddings/README.md

Lines changed: 4 additions & 2 deletions

@@ -11,5 +11,7 @@ new data or from data you already have in MongoDB Atlas.
  |----------|-------------|
  | [open-source-new-data](https://github.com/mongodb/docs-notebooks/blob/main/create-embeddings/open-source-new-data.ipynb) | Generate embeddings from new data using an open-source embedding model |
  | [open-source-existing-data](https://github.com/mongodb/docs-notebooks/blob/main/create-embeddings/open-source-existing-data.ipynb) | Generate embeddings from existing data in Atlas using an open-source embedding model |
- | [openai-new-data](https://github.com/mongodb/docs-notebooks/blob/main/create-embeddings/openai-new-data.ipynb) | Generate embeddings from new data using an OpenAI embedding model |
- | [openai-existing-data](https://github.com/mongodb/docs-notebooks/blob/main/create-embeddings/openai-existing-data.ipynb) | Generate embeddings from existing data in Atlas using an OpenAI embedding model |
+ | [voyage-new-data](https://github.com/mongodb/docs-notebooks/blob/main/create-embeddings/voyage-new-data.ipynb) | Generate embeddings from new data using an embedding model from Voyage AI |
+ | [voyage-existing-data](https://github.com/mongodb/docs-notebooks/blob/main/create-embeddings/voyage-existing-data.ipynb) | Generate embeddings from existing data in Atlas using an embedding model from Voyage AI |
+ | [openai-new-data](https://github.com/mongodb/docs-notebooks/blob/main/create-embeddings/openai-new-data.ipynb) | Generate embeddings from new data using an embedding model from OpenAI |
+ | [openai-existing-data](https://github.com/mongodb/docs-notebooks/blob/main/create-embeddings/openai-existing-data.ipynb) | Generate embeddings from existing data in Atlas using an embedding model from OpenAI |
create-embeddings/voyage-existing-data.ipynb

Lines changed: 271 additions & 0 deletions

@@ -0,0 +1,271 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Atlas Vector Search - Create Embeddings - Voyage AI - Existing Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook is a companion to the [Create Embeddings](https://www.mongodb.com/docs/atlas/atlas-vector-search/create-embeddings/) page. Refer to the page for set-up instructions and detailed explanations.\n",
    "\n",
    "This notebook takes you through how to generate embeddings from **existing data in Atlas** by using the ``voyage-3`` model from Voyage AI.\n",
    "\n",
    "<a target=\"_blank\" href=\"https://colab.research.google.com/github/mongodb/docs-notebooks/blob/main/create-embeddings/voyage-existing-data.ipynb\">\n",
    "  <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
    "</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [],
   "source": [
    "pip install --quiet --upgrade voyageai pymongo"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Use an Embedding Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import voyageai\n",
    "\n",
    "# Specify your Voyage API key and embedding model\n",
    "os.environ[\"VOYAGE_API_KEY\"] = \"<api-key>\"\n",
    "model = \"voyage-3\"\n",
    "vo = voyageai.Client()\n",
    "\n",
    "# Define a function to generate embeddings\n",
    "def get_embedding(data, input_type = \"document\"):\n",
    "    embeddings = vo.embed(\n",
    "        data, model = model, input_type = input_type\n",
    "    ).embeddings\n",
    "    return embeddings[0]\n",
    "\n",
    "# Generate an embedding\n",
    "embedding = get_embedding(\"foo\")\n",
    "print(embedding)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (Optional) Compress your embeddings\n",
    "\n",
    "Optionally, run the following code to define a function that converts your embeddings into BSON `binData` vectors for [efficient storage and retrieval](https://www.mongodb.com/docs/atlas/atlas-vector-search/create-embeddings/#vector-compression)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bson.binary import Binary\n",
    "from bson.binary import BinaryVectorDtype\n",
    "\n",
    "# Define a function to generate BSON vectors\n",
    "def generate_bson_vector(vector, vector_dtype):\n",
    "    return Binary.from_vector(vector, vector_dtype)\n",
    "\n",
    "# Generate BSON vector from the sample float32 embedding\n",
    "bson_float32_embedding = generate_bson_vector(embedding, BinaryVectorDtype.FLOAT32)\n",
    "\n",
    "# Print the converted embedding\n",
    "print(f\"The converted BSON embedding is: {bson_float32_embedding}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate Embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pymongo\n",
    "\n",
    "# Connect to your Atlas cluster\n",
    "mongo_client = pymongo.MongoClient(\"<connection-string>\")\n",
    "db = mongo_client[\"sample_airbnb\"]\n",
    "collection = db[\"listingsAndReviews\"]\n",
    "\n",
    "# Define a filter to exclude documents with null or empty 'summary' fields\n",
    "filter = { \"summary\": { \"$exists\": True, \"$nin\": [ None, \"\" ] } }\n",
    "\n",
    "# Get a subset of documents in the collection\n",
    "documents = collection.find(filter, {\"_id\": 1, \"summary\": 1}).limit(50)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pymongo import UpdateOne\n",
    "\n",
    "# Generate the list of bulk write operations\n",
    "operations = []\n",
    "for doc in documents:\n",
    "    summary = doc[\"summary\"]\n",
    "    # Generate an embedding for this document\n",
    "    embedding = get_embedding(summary)\n",
    "\n",
    "    # Uncomment the following line to convert to BSON vectors\n",
    "    # embedding = generate_bson_vector(embedding, BinaryVectorDtype.FLOAT32)\n",
    "\n",
    "    # Add the update operation to the list\n",
    "    operations.append(UpdateOne(\n",
    "        {\"_id\": doc[\"_id\"]},\n",
    "        {\"$set\": {\n",
    "            \"embedding\": embedding\n",
    "        }}\n",
    "    ))\n",
    "\n",
    "# Execute the bulk write operation\n",
    "updated_doc_count = 0\n",
    "if operations:\n",
    "    result = collection.bulk_write(operations)\n",
    "    updated_doc_count = result.modified_count\n",
    "\n",
    "print(f\"Updated {updated_doc_count} documents.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Index and Query Your Embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pymongo.operations import SearchIndexModel\n",
    "import time\n",
    "\n",
    "# Create your index model, then create the search index\n",
    "search_index_model = SearchIndexModel(\n",
    "    definition = {\n",
    "        \"fields\": [\n",
    "            {\n",
    "                \"type\": \"vector\",\n",
    "                \"path\": \"embedding\",\n",
    "                \"similarity\": \"dotProduct\",\n",
    "                \"numDimensions\": 1024\n",
    "            }\n",
    "        ]\n",
    "    },\n",
    "    name=\"vector_index\",\n",
    "    type=\"vectorSearch\"\n",
    ")\n",
    "result = collection.create_search_index(model=search_index_model)\n",
    "\n",
    "# Wait for initial sync to complete\n",
    "print(\"Polling to check if the index is ready. This may take up to a minute.\")\n",
    "predicate = lambda index: index.get(\"queryable\") is True\n",
    "\n",
    "while True:\n",
    "    indices = list(collection.list_search_indexes(result))\n",
    "    if len(indices) and predicate(indices[0]):\n",
    "        break\n",
    "    time.sleep(5)\n",
    "print(result + \" is ready for querying.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate an embedding for the search query\n",
    "query_embedding = get_embedding(\"beach house\", input_type=\"query\")\n",
    "\n",
    "# Sample vector search pipeline\n",
    "pipeline = [\n",
    "    {\n",
    "        \"$vectorSearch\": {\n",
    "            \"index\": \"vector_index\",\n",
    "            \"queryVector\": query_embedding,\n",
    "            \"path\": \"embedding\",\n",
    "            \"exact\": True,\n",
    "            \"limit\": 5\n",
    "        }\n",
    "    },\n",
    "    {\n",
    "        \"$project\": {\n",
    "            \"_id\": 0,\n",
    "            \"summary\": 1,\n",
    "            \"score\": {\n",
    "                \"$meta\": \"vectorSearchScore\"\n",
    "            }\n",
    "        }\n",
    "    }\n",
    "]\n",
    "\n",
    "# Execute the search\n",
    "results = collection.aggregate(pipeline)\n",
    "\n",
    "# Print results\n",
    "for i in results:\n",
    "    print(i)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
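The notebook's index-readiness loop polls `collection.list_search_indexes` until the new index reports `queryable: True`. The same pattern can be sketched offline in plain Python, with a hypothetical stub standing in for the Atlas call (no cluster required; `wait_until_queryable` and `stub` are illustrative names, not part of PyMongo):

```python
import time

def wait_until_queryable(list_indexes, name, interval=0.01, timeout=2.0):
    # `list_indexes` stands in for collection.list_search_indexes: any callable
    # returning a list of index-status dicts. Returns True once the named index
    # reports queryable=True, or False if the timeout elapses first.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        matches = [ix for ix in list_indexes() if ix.get("name") == name]
        if matches and matches[0].get("queryable") is True:
            return True
        time.sleep(interval)
    return False

# Hypothetical stub: the index becomes queryable on the third poll
states = iter([False, False, True])
def stub():
    return [{"name": "vector_index", "queryable": next(states, True)}]

ready = wait_until_queryable(stub, "vector_index")
print(ready)  # True
```

Adding a timeout, as this sketch does, avoids the notebook's unbounded `while True` loop if index creation fails.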
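The index definition above uses `dotProduct` similarity over the 1024-dimensional vectors that `voyage-3` produces. As a rough intuition for what `$vectorSearch` ranks, here is a toy sketch in plain Python using made-up 3-dimensional unit vectors in place of real embeddings (everything below is illustrative; no Atlas or Voyage AI calls):

```python
import math

def normalize(v):
    # Scale a vector to unit length; for unit vectors, dot product
    # ranking matches cosine-similarity ranking
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy stand-ins for stored document embeddings (field: summary -> vector)
docs = {
    "beach house": normalize([0.9, 0.1, 0.0]),
    "city loft": normalize([0.1, 0.9, 0.2]),
    "mountain cabin": normalize([0.2, 0.1, 0.9]),
}

# Toy stand-in for the query embedding of "beach house"
query = normalize([0.8, 0.2, 0.1])

# Rank documents by dot product, highest score first
ranked = sorted(docs, key=lambda k: dot(query, docs[k]), reverse=True)
print(ranked[0])  # "beach house" points in nearly the same direction
```

The exact `$vectorSearch` query in the notebook (`"exact": True`) performs this kind of exhaustive scoring over every indexed document rather than an approximate nearest-neighbor search.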
