| Actor | Actor badge |
|---|---|
| Chroma | |
| Milvus | |
| OpenSearch | |
| PGVector | |
| Pinecone | |
| Qdrant | |
| Weaviate | |
The Apify Vector Database Integrations facilitate the transfer of data from Apify Actors to a vector database. This process includes data processing, optional splitting into chunks, embedding computation, and data storage.
These integrations support incremental updates, ensuring that only changed data is updated. This reduces unnecessary embedding computation and storage operations, making it ideal for search and retrieval augmented generation (RAG) use cases.
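The incremental-update idea can be sketched roughly as follows (a minimal illustration with hypothetical names such as `content_checksum` and `split_changed`, not the Actors' actual code): each dataset item is fingerprinted, and only items whose fingerprint differs from what is already stored are re-embedded and written.

```python
import hashlib


def content_checksum(text: str) -> str:
    """Fingerprint of an item's content (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def split_changed(
    new_items: dict[str, str], stored_checksums: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Return (ids needing re-embedding, ids that are unchanged)."""
    changed, unchanged = [], []
    for item_id, text in new_items.items():
        if stored_checksums.get(item_id) == content_checksum(text):
            unchanged.append(item_id)  # skip: no embedding or storage work
        else:
            changed.append(item_id)  # new or modified: re-embed and upsert
    return changed, unchanged
```

Unchanged items cost nothing beyond the checksum comparison, which is what makes repeated crawls of mostly static sites cheap.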
This repository contains Actors for different vector databases. Each Actor performs the following steps:

- Retrieve a dataset as output from an Actor.
- [Optional] Split text data into chunks using LangChain.
- [Optional] Update only changed data.
- Compute embeddings, e.g. using OpenAI or Cohere.
- Save data into the database.
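Put together, the steps above can be sketched end to end (illustrative only: the real Actors use LangChain splitters and provider APIs, while `split_into_chunks` and `embed` here are trivial stand-ins):

```python
def split_into_chunks(text: str, chunk_size: int = 100) -> list[str]:
    # Stand-in for a LangChain text splitter: fixed-size character chunks.
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


def embed(chunks: list[str]) -> list[list[float]]:
    # Placeholder for a real embeddings provider (OpenAI, Cohere, ...).
    return [[float(len(c)), sum(map(ord, c)) % 1000.0] for c in chunks]


def run_pipeline(dataset: list[dict], store: list[dict]) -> None:
    # `dataset` mimics an Actor's output; `store` stands in for the database.
    for item in dataset:
        chunks = split_into_chunks(item["text"])
        for chunk, vector in zip(chunks, embed(chunks)):
            store.append({"text": chunk, "vector": vector})
```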
- Add the database to `docker-compose.yml` for local testing (if the database is available in Docker):

  ```yaml
  version: '3.8'

  services:
    pgvector-container:
      image: pgvector/pgvector:pg16
      environment:
        - POSTGRES_PASSWORD=password
        - POSTGRES_DB=apify
      ports:
        - "5432:5432"
  ```
- Add the `langchain_postgres` dependency to `pyproject.toml`:

  ```shell
  poetry add --group=pgvector "langchain_postgres"
  ```

  and mark the `pgvector` group as optional (in `pyproject.toml`):

  ```toml
  [tool.poetry.group.pgvector]
  optional = true
  ```
- Create a new Actor in the `actors` directory, e.g. `actors/pgvector`, and add the following files:
  - `README.md` - the Actor documentation
  - `.actor/actor.json` - the Actor definition
  - `.actor/input_schema.json` - the Actor input schema
- Create a pydantic model for the Actor input schema. Edit the `Makefile` to generate the model from the input schema:

  ```
  datamodel-codegen --input $(DIRS_WITH_ACTORS)/pgvector/.actor/input_schema.json --output $(DIRS_WITH_CODE)/src/models/pgvector_input_model.py --input-file-type jsonschema --field-constraints
  ```

  and then run:

  ```shell
  make pydantic-model
  ```
- Import the created model in `src/models/__init__.py`:

  ```python
  from .pgvector_input_model import PgvectorIntegration
  ```
- Create a new module (`pgvector.py`) in the `vector_stores` directory, e.g. `vector_stores/pgvector`, and implement the `PGVectorDatabase` class with all required methods.
- Add `pgvector` to `SupportedVectorStores` in `constants.py`:

  ```python
  class SupportedVectorStores(str, enum.Enum):
      pgvector = "pgvector"
  ```
- Add `PGVectorDatabase` to `entrypoint.py`:

  ```python
  if actor_type == SupportedVectorStores.pgvector.value:
      await run_actor(PgvectorIntegration(**actor_input), actor_input)
  ```
- Add `PGVectorDatabase` and `PgvectorIntegration` to `_types.py`:

  ```python
  ActorInputsDb: TypeAlias = ChromaIntegration | PgvectorIntegration | PineconeIntegration | QdrantIntegration
  VectorDb: TypeAlias = ChromaDatabase | PGVectorDatabase | PineconeDatabase | QdrantDatabase
  ```
- Add `PGVectorDatabase` to `vector_stores/vcs.py`:

  ```python
  if isinstance(actor_input, PgvectorIntegration):
      from .vector_stores.pgvector import PGVectorDatabase

      return PGVectorDatabase(actor_input, embeddings)
  ```
- Add a `PGVectorDatabase` fixture to `tests/conftest.py`:

  ```python
  @pytest.fixture()
  def db_pgvector(crawl_1: list[Document]) -> PGVectorDatabase:
      db = PGVectorDatabase(
          actor_input=PgvectorIntegration(
              postgresSqlConnectionStr=os.getenv("POSTGRESQL_CONNECTION_STR"),
              postgresCollectionName=INDEX_NAME,
              embeddingsProvider="OpenAI",
              embeddingsApiKey=os.getenv("OPENAI_API_KEY"),
              datasetFields=["text"],
          ),
          embeddings=embeddings,
      )
      db.unit_test_wait_for_index = 0

      db.delete_all()
      # Insert initially crawled objects
      db.add_documents(documents=crawl_1, ids=[d.metadata["id"] for d in crawl_1])

      yield db

      db.delete_all()
  ```
- Add the `db_pgvector` fixture to `tests/test_vector_stores.py`:

  ```python
  DATABASE_FIXTURES = ["db_pinecone", "db_chroma", "db_qdrant", "db_pgvector"]
  ```
- Update `README.md` in the `actors/pgvector` directory.
- Add `pgvector` to the `README.md` in the root directory.
- Run the tests:

  ```shell
  make test
  ```
- Run the Actor locally:

  ```shell
  export ACTOR_PATH_IN_DOCKER_CONTEXT=actors/pgvector
  apify run -p
  ```
- Set up the Actor on the Apify platform at https://console.apify.com with the following build configuration:

  ```
  Git URL: https://github.com/apify/store-vector-db
  Branch: master
  Folder: actors/pgvector
  ```
- Test the Actor on the Apify platform.
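For orientation, the shape of the generated input model can be sketched from the field names used in the test fixture above. This is a plain-dataclass stand-in with assumed defaults; the real `PgvectorIntegration` is a pydantic model produced by `datamodel-codegen`, not this sketch:

```python
from dataclasses import dataclass, field


@dataclass
class PgvectorIntegrationSketch:
    """Illustrative stand-in for the generated pydantic model; field names
    mirror the fixture in tests/conftest.py, defaults are assumptions."""

    postgresSqlConnectionStr: str
    postgresCollectionName: str
    embeddingsProvider: str = "OpenAI"
    embeddingsApiKey: str | None = None
    datasetFields: list[str] = field(default_factory=lambda: ["text"])
```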
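The interface `PGVectorDatabase` must expose can likewise be inferred from the test fixture above: `add_documents(documents, ids)`, `delete_all()`, and a `unit_test_wait_for_index` attribute. A hypothetical in-memory skeleton (the real class would wrap `langchain_postgres` and talk to the running PGVector container):

```python
class PGVectorDatabaseSketch:
    """Hypothetical skeleton of the required interface; a dict stands in
    for the actual PGVector-backed store."""

    def __init__(self, actor_input, embeddings) -> None:
        self.actor_input = actor_input
        self.embeddings = embeddings
        self.unit_test_wait_for_index = 0  # tests zero this to skip index waits
        self._store: dict[str, object] = {}

    def add_documents(self, documents: list, ids: list[str]) -> None:
        # Upsert semantics: a re-crawled document replaces its old entry.
        self._store.update(zip(ids, documents))

    def delete_all(self) -> None:
        self._store.clear()
```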