Commit 96dc180

Hugoberry, eyurtsev, and baskaryan authored
community[minor]: Add DuckDB as a vectorstore (langchain-ai#18916)
DuckDB has a cosine similarity function that works with its list and array data types, so it can be used as a vector store.

- **Description:** The latest version of DuckDB features a cosine similarity function, which can be used with its support for list or array column types. This PR surfaces this functionality to langchain.
- **Dependencies:** duckdb 0.10.0
- **Twitter handle:** @igocrite

---------

Co-authored-by: Eugene Yurtsev
Co-authored-by: Bagatur
1 parent fa6397d commit 96dc180
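
The description above relies on DuckDB's cosine similarity over list columns. As a rough, library-free illustration of what that ranking computes (not the code added by this commit), the sketch below scores stored vectors against a query vector and keeps the top `k`. The names `cosine_similarity` and `top_k` are hypothetical helpers introduced only for this example; the default `k=4` mirrors the `similarity_search` default in the diff.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    rows: List[Tuple[str, Sequence[float]]],
    k: int = 4,
) -> List[Tuple[str, float]]:
    """Rank (text, vector) rows by similarity to the query, highest first."""
    scored = [(text, cosine_similarity(vec, query)) for text, vec in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

In the vector store itself this work is pushed down into DuckDB, which sorts by the similarity column in descending order and limits the result to `k` rows.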

5 files changed: +533 -0 lines changed
@@ -0,0 +1,108 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# DuckDB\n",
    "This notebook shows how to use `DuckDB` as a vector store."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install duckdb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want to use OpenAIEmbeddings so we have to get the OpenAI API Key. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import getpass\n",
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings import OpenAIEmbeddings\n",
    "from langchain.vectorstores import DuckDB"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import TextLoader\n",
    "from langchain_text_splitters import CharacterTextSplitter\n",
    "\n",
    "loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n",
    "documents = loader.load()\n",
    "\n",
    "documents = CharacterTextSplitter().split_documents(documents)\n",
    "embeddings = OpenAIEmbeddings()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "docsearch = DuckDB.from_documents(documents, embeddings)\n",
    "\n",
    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
    "docs = docsearch.similarity_search(query)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(docs[0].page_content)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

libs/community/langchain_community/vectorstores/__init__.py

+1
@@ -51,6 +51,7 @@
     "DocArrayHnswSearch": "langchain_community.vectorstores.docarray",
     "DocArrayInMemorySearch": "langchain_community.vectorstores.docarray",
     "DocumentDBVectorSearch": "langchain_community.vectorstores.documentdb",
+    "DuckDB": "langchain_community.vectorstores.duckdb",
     "ElasticKnnSearch": "langchain_community.vectorstores.elastic_vector_search",
     "ElasticVectorSearch": "langchain_community.vectorstores.elastic_vector_search",
     "ElasticsearchStore": "langchain_community.vectorstores.elasticsearch",
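
The hunk above only registers a class name against a module path; the diff does not show how that table is consumed. A common pattern for such name-to-module tables is to import the module lazily on first attribute access. The sketch below illustrates that pattern under this assumption, using standard-library modules as stand-ins; `_module_lookup` and `lazy_getattr` are hypothetical names, not taken from the commit.

```python
import importlib
from typing import Any, Dict

# Hypothetical lookup table: attribute name -> module that defines it.
# Standard-library modules stand in for the vector store modules.
_module_lookup: Dict[str, str] = {
    "dumps": "json",
    "sqrt": "math",
}


def lazy_getattr(name: str) -> Any:
    """Import the owning module only when the name is first requested."""
    if name in _module_lookup:
        module = importlib.import_module(_module_lookup[name])
        return getattr(module, name)
    raise AttributeError(f"no lazily importable attribute named {name!r}")
```

With this kind of lookup, an optional dependency such as `duckdb` only has to be installed when the corresponding class is actually requested.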
@@ -0,0 +1,263 @@
# mypy: disable-error-code=func-returns-value
from __future__ import annotations

import json
import uuid
from typing import Any, Iterable, List, Optional, Type

from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings
from langchain_core.vectorstores import VST, VectorStore


class DuckDB(VectorStore):
    """`DuckDB` vector store.

    This class provides a vector store interface for adding texts and performing
    similarity searches using DuckDB.

    For more information about DuckDB, see: https://duckdb.org/

    This integration requires the `duckdb` Python package.
    You can install it with `pip install duckdb`.

    *Security Notice*: The default DuckDB configuration is not secure.

        By **default**, DuckDB can interact with files across the entire file system,
        which includes abilities to read, write, and list files and directories.
        It can also access some python variables present in the global namespace.

        When using this DuckDB vectorstore, we suggest that you initialize the
        DuckDB connection with a secure configuration.

        For example, you can set `enable_external_access` to `false` in the connection
        configuration to disable external access to the DuckDB connection.

        You can view the DuckDB configuration options here:

        https://duckdb.org/docs/configuration/overview.html

        Please review other relevant security considerations in the DuckDB
        documentation. (e.g., "autoinstall_known_extensions": "false",
        "autoload_known_extensions": "false")

        See https://python.langchain.com/docs/security for more information.

    Args:
        connection: Optional DuckDB connection
        embedding: The embedding function or model to use for generating embeddings.
        vector_key: The column name for storing vectors. Defaults to `embedding`.
        id_key: The column name for storing unique identifiers. Defaults to `id`.
        text_key: The column name for storing text. Defaults to `text`.
        table_name: The name of the table to use for storing embeddings. Defaults to
            `vectorstore`.

    Example:
        .. code-block:: python

            import duckdb
            conn = duckdb.connect(database=':memory:',
                config={
                    # Sample configuration to restrict some DuckDB capabilities
                    # List is not exhaustive. Please review DuckDB documentation.
                    "enable_external_access": "false",
                    "autoinstall_known_extensions": "false",
                    "autoload_known_extensions": "false"
                }
            )
            embedding_function = ... # Define or import your embedding function here
            vector_store = DuckDB(connection=conn, embedding=embedding_function)
            vector_store.add_texts(['text1', 'text2'])
            result = vector_store.similarity_search('text1')
    """

    def __init__(
        self,
        *,
        connection: Optional[Any] = None,
        embedding: Embeddings,
        vector_key: str = "embedding",
        id_key: str = "id",
        text_key: str = "text",
        table_name: str = "vectorstore",
    ):
        """Initialize with DuckDB connection and setup for vector storage."""
        try:
            import duckdb
        except ImportError:
            raise ImportError(
                "Could not import duckdb package. "
                "Please install it with `pip install duckdb`."
            )
        self.duckdb = duckdb
        self._embedding = embedding
        self._vector_key = vector_key
        self._id_key = id_key
        self._text_key = text_key
        self._table_name = table_name

        if self._embedding is None:
            raise ValueError("An embedding function or model must be provided.")

        if connection is None:
            import warnings

            warnings.warn(
                "No DuckDB connection provided. A new connection will be created. "
                "This connection is running in memory and no data will be persisted. "
                "To persist data, specify `connection=duckdb.connect(...)` when using "
                "the API. Please review the documentation of the vectorstore for "
                "security recommendations on configuring the connection."
            )

        self._connection = connection or self.duckdb.connect(
            database=":memory:", config={"enable_external_access": "false"}
        )
        self._ensure_table()
        self._table = self._connection.table(self._table_name)

    @property
    def embeddings(self) -> Optional[Embeddings]:
        """Returns the embedding object used by the vector store."""
        return self._embedding

    def add_texts(
        self,
        texts: Iterable[str],
        metadatas: Optional[List[dict]] = None,
        **kwargs: Any,
    ) -> List[str]:
        """Embed texts and insert them, one row per text, into the DuckDB table.

        Args:
            texts: Iterable of strings to add to the vectorstore.
            metadatas: Optional list of metadatas associated with the texts.
            kwargs: Additional parameters including optional 'ids' to associate
                with the texts.

        Returns:
            List of ids of the added texts.
        """

        # Materialize texts once so a generator is not exhausted before embedding
        texts = list(texts)

        # Use caller-supplied ids if given, otherwise generate new ones
        ids = kwargs.pop("ids", None) or [str(uuid.uuid4()) for _ in texts]
        embeddings = self._embedding.embed_documents(texts)
        for idx, text in enumerate(texts):
            embedding = embeddings[idx]
            # Serialize metadata if present, else default to None
            metadata = (
                json.dumps(metadatas[idx])
                if metadatas and idx < len(metadatas)
                else None
            )
            self._connection.execute(
                f"INSERT INTO {self._table_name} VALUES (?,?,?,?)",
                [ids[idx], text, embedding, metadata],
            )
        return ids

    def similarity_search(
        self, query: str, k: int = 4, **kwargs: Any
    ) -> List[Document]:
        """Performs a similarity search for a given query string.

        Args:
            query: The query string to search for.
            k: The number of similar texts to return.

        Returns:
            A list of Documents most similar to the query.
        """
        embedding = self._embedding.embed_query(query)  # type: ignore
        list_cosine_similarity = self.duckdb.FunctionExpression(
            "list_cosine_similarity",
            self.duckdb.ColumnExpression(self._vector_key),
            self.duckdb.ConstantExpression(embedding),
        )
        docs = (
            self._table.select(
                *[
                    self.duckdb.StarExpression(exclude=[]),
                    list_cosine_similarity.alias("similarity"),
                ]
            )
            .order("similarity desc")
            .limit(k)
            .select(
                self.duckdb.StarExpression(exclude=["similarity", self._vector_key])
            )
            .fetchdf()
        )
        return [
            Document(
                page_content=docs[self._text_key][idx],
                metadata=json.loads(docs["metadata"][idx])
                if docs["metadata"][idx]
                else {},
            )
            for idx in range(len(docs))
        ]

    @classmethod
    def from_texts(
        cls: Type[VST],
        texts: List[str],
        embedding: Embeddings,
        metadatas: Optional[List[dict]] = None,
        **kwargs: Any,
    ) -> DuckDB:
        """Creates an instance of DuckDB and populates it with texts and
        their embeddings.

        Args:
            texts: List of strings to add to the vector store.
            embedding: The embedding function or model to use for generating embeddings.
            metadatas: Optional list of metadata dictionaries associated with the texts.
            **kwargs: Additional keyword arguments including:
                - connection: DuckDB connection. If not provided, a new connection will
                  be created.
                - vector_key: The column name for storing vectors. Default "vector".
                - id_key: The column name for storing unique identifiers. Default "id".
                - text_key: The column name for storing text. Defaults to "text".
                - table_name: The name of the table to use for storing embeddings.
                  Defaults to "embeddings".

        Returns:
            An instance of DuckDB with the provided texts and their embeddings added.
        """

        # Extract kwargs for DuckDB instance creation
        connection = kwargs.get("connection", None)
        vector_key = kwargs.get("vector_key", "vector")
        id_key = kwargs.get("id_key", "id")
        text_key = kwargs.get("text_key", "text")
        table_name = kwargs.get("table_name", "embeddings")

        # Create an instance of DuckDB
        instance = DuckDB(
            connection=connection,
            embedding=embedding,
            vector_key=vector_key,
            id_key=id_key,
            text_key=text_key,
            table_name=table_name,
        )
        # Add texts and their embeddings to the DuckDB vector store
        instance.add_texts(texts, metadatas=metadatas, **kwargs)

        return instance

    def _ensure_table(self) -> None:
        """Ensures the table for storing embeddings exists."""
        create_table_sql = f"""
        CREATE TABLE IF NOT EXISTS {self._table_name} (
            {self._id_key} VARCHAR PRIMARY KEY,
            {self._text_key} VARCHAR,
            {self._vector_key} FLOAT[],
            metadata VARCHAR
        )
        """
        self._connection.execute(create_table_sql)
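
Two details of the row handling in this file are easy to get wrong when `texts` is an arbitrary iterable: a generator can only be consumed once, and metadata is stored as a JSON string that may be NULL. The sketch below isolates both steps without touching DuckDB. `prepare_rows` and `load_metadata` are hypothetical helpers written for illustration, not functions from the commit, and the embedding column is left out to keep the example free of dependencies.

```python
import json
import uuid
from typing import Any, Dict, Iterable, List, Optional, Tuple

Row = Tuple[str, str, Optional[str]]  # (id, text, serialized metadata)


def prepare_rows(
    texts: Iterable[str],
    metadatas: Optional[List[Dict[str, Any]]] = None,
    ids: Optional[List[str]] = None,
) -> List[Row]:
    """Build (id, text, metadata_json) rows ready for insertion."""
    # Materialize first: iterating a generator twice would yield nothing
    # on the second pass.
    texts = list(texts)
    ids = ids or [str(uuid.uuid4()) for _ in texts]
    rows: List[Row] = []
    for idx, text in enumerate(texts):
        # Metadata is optional per row; missing entries are stored as None.
        has_metadata = bool(metadatas) and idx < len(metadatas)
        metadata = json.dumps(metadatas[idx]) if has_metadata else None
        rows.append((ids[idx], text, metadata))
    return rows


def load_metadata(raw: Optional[str]) -> Dict[str, Any]:
    """Invert the serialization; a NULL or empty value maps to an empty dict."""
    return json.loads(raw) if raw else {}
```

Keeping metadata as a JSON string matches the `metadata VARCHAR` column in the table definition, at the cost of not being able to filter on individual metadata keys in SQL without parsing.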
