
How to use a **single PostgreSQL database** for AnythingLLM database + vector store? #2057

Closed
ericthomas1 opened this issue Aug 6, 2024 · 11 comments


@ericthomas1

ericthomas1 commented Aug 6, 2024

Hello,

Our team has been able to deploy AnythingLLM as an Azure Web App, and by making this change in schema.prisma...:

 datasource db {
   provider = "postgresql"
   url      = env("DATABASE_URL")
 }

...we're able to connect the Web App to an Azure Database for PostgreSQL database.
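
For reference, the DATABASE_URL that Prisma reads is a standard PostgreSQL connection string; a placeholder example (server, user, and database names are illustrative):

 DATABASE_URL="postgresql://anythingllm:<password>@example-server.postgres.database.azure.com:5432/anythingllm?sslmode=require"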

We want to use this single PostgreSQL database as...:

  • ...the AnythingLLM database (this is now working) and
  • ...the /app/server/storage (this is currently mapped to an Azure SMB Fileshare) and
  • ...the vector store (using pgvector extension) (we're currently using Pinecone)

This means that app storage is split into 3 different locations. We want to simplify this but can't quite figure out how to get /app/server/storage or the vector store to use this single PostgreSQL database.

Could you assist?

Thank you

@ericthomas1 ericthomas1 changed the title PostgreSQL state management + vector store for AnythingLLM? How to use a **single PostgreSQL database** for AnythingLLM database + vector store? Aug 6, 2024
@timothycarambat
Member

/app/server/storage is not for databases - it is where static/cached assets are stored, not data intended to live in a database.

PGVector support is already an open issue that is TBD (#873). It frankly isn't as easy as just upserting records, since everyone's table schema will vary, and that detail adds a ton of complexity.

@ericthomas1
Author

Thank you for the reply and details.

I'm not following how everyone's table schema will vary.

Example:

  • When we select Pinecone as the vector store in AnythingLLM, the schema is set (somehow)
  • Can we add this same schema to schema.prisma and have it initialized in the PostgreSQL database as part of setup?
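
For what it's worth, a rough sketch of what that might look like in schema.prisma (model and field names are hypothetical; Prisma can only represent a pgvector column via its Unsupported type, so the generated client can't query the vector directly and raw SQL would still be needed):

 generator client {
   provider        = "prisma-client-js"
   previewFeatures = ["postgresqlExtensions"]
 }

 datasource db {
   provider   = "postgresql"
   url        = env("DATABASE_URL")
   extensions = [vector]
 }

 model Embedding {
   id        BigInt                      @id @default(autoincrement())
   namespace String
   embedding Unsupported("vector(1536)")
 }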

@timothycarambat
Member

What will inevitably occur is people asking "I want the embedding to specifically be a column on x table", which makes the config more complex (since it will also be easier to mess up).

We can just do something like:

 await client.query('CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3))');

and create a sub-table for every collection to use for indexing, which would work in practice the same way as every other provider. That is what we wanted to do, but when we discussed it a long time ago, when the issue first opened, the few people we spoke to all wanted the embeddings to live on specific tables of their own, not AnythingLLM CRUD'ing tables to support embeddings. We could obviously also have one giant table and query it:

 const result = await client.query('SELECT * FROM items WHERE namespace = $1 ORDER BY embedding <-> $2 LIMIT 5', ['namespace', pgvector.toSql([1, 2, 3])]);
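
For concreteness, a self-contained sketch of that one-giant-table approach, using the pg and pgvector npm packages (table/column names are illustrative, not a committed schema):

 const { Client } = require('pg');
 const pgvector = require('pgvector/pg');

 async function main() {
   const client = new Client({ connectionString: process.env.DATABASE_URL });
   await client.connect();

   // Enable the extension (no-op if it already exists). Optionally register
   // pgvector's type parsers on the client; see the pgvector-node docs.
   await client.query('CREATE EXTENSION IF NOT EXISTS vector');

   // One giant table: every workspace's vectors live here, tagged by a namespace column.
   await client.query(
     'CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, namespace text NOT NULL, embedding vector(3))'
   );

   await client.query(
     'INSERT INTO items (namespace, embedding) VALUES ($1, $2)',
     ['my-workspace', pgvector.toSql([1, 2, 3])]
   );

   // Nearest-neighbor search (L2 distance), scoped to a single workspace's namespace.
   const result = await client.query(
     'SELECT * FROM items WHERE namespace = $1 ORDER BY embedding <-> $2 LIMIT 5',
     ['my-workspace', pgvector.toSql([1, 2, 3])]
   );
   console.log(result.rows);

   await client.end();
 }

 main().catch(console.error);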

@ericthomas1
Author

ericthomas1 commented Aug 7, 2024

Hm...yeah, I don't have a mental model of the guts to speak to a solution. I can help organize if it's useful.

Sounds like you're saying:

Option 1:

  • Create small child tables for each workspace document (collection?) that is chunked and embedded (processed) by AnythingLLM.
  • Each child table's vector column is indexed (not sure how AnythingLLM would search across indexes for this option.)

Option 2:

  • Create one large table for all documents that are processed by AnythingLLM
  • Index the vector column of this table

Did I get these options correctly identified?

I'm not exactly sure what ...wanted the embeddings to be on specific tables... refers to.

@timothycarambat
Member

That is correct: we would either create an embedding table for each workspace, or have one giant embedding table with a namespace column that we can index and query against, since we will always know the namespace value.
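
For illustration, the table-per-workspace variant might be a pair of helpers like the ones below (names are hypothetical, not anything committed). It also makes workspace deletion trivial, since dropping the table drops the vectors:

 // Hypothetical helpers for the table-per-workspace option.
 // `client` is a connected pg Client with the vector extension enabled;
 // `workspace` must already be a sanitized slug, since SQL identifiers
 // cannot be parameterized.
 async function createWorkspaceTable(client, workspace) {
   await client.query(
     `CREATE TABLE IF NOT EXISTS "ws_${workspace}" (id bigserial PRIMARY KEY, embedding vector(3))`
   );
   // A per-table HNSW index (pgvector >= 0.5.0) keeps approximate
   // nearest-neighbor search fast within this workspace.
   await client.query(
     `CREATE INDEX IF NOT EXISTS "ws_${workspace}_idx" ON "ws_${workspace}" USING hnsw (embedding vector_l2_ops)`
   );
 }

 async function dropWorkspaceTable(client, workspace) {
   // Deleting a workspace removes all of its vectors in one statement.
   await client.query(`DROP TABLE IF EXISTS "ws_${workspace}"`);
 }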

@ericthomas1
Author

How do these potential PostgreSQL options compare with how AnythingLLM handles Pinecone-as-a-vector-store?

  • I'm fairly certain AnythingLLM natively scrapes sources (docs / websites / GitHub repos / etc.), chunks them, and gets embeddings for each text chunk.
    • It has to have a (temp?) schema for this process.
    • I'm guessing this schema is sent to Pinecone (in this case)

If so, it seems that Option 1 would be a good way to store this schema in PostgreSQL instead.

Some (naïve) rationale:

  1. When a workspace is deleted, the corresponding vectors could be easily deleted as well. (DROP TABLE).
  2. More centralized storage management (AnythingLLM database and vector store are in the same resource)
  3. Further leverages open-source tooling (PostgreSQL)
  4. Saves costs (when compared to Pinecone / other proprietary vector stores)

@timothycarambat
Member

timothycarambat commented Aug 7, 2024

Not that I'm against supporting the PGVector stuff, but there is a very good reason we don't make everyone use PG as their database.

We currently use LanceDB as the vectordb and SQLite as the regular DB. Both of these require no services or additional setup and are file-based. Operations around using these solutions run on disk and the user never even knows they are there.

Setting up a PG instance is easy, but it doesn't port to desktop. The master branch is actually an upstream branch of what winds up in Desktop, and asking users to start a PG instance to use the desktop app, or having us run their DB in the cloud, is simply a non-starter for privacy and cost reasons.

PGVector in the case where someone uses Pinecone certainly saves money, but even PG costs more to run than saving a file to disk like LanceDB :)

As stated, we do intend to support PGVector, since it's still an open issue and it would add to consolidation in your specific stack, but it is not the best solution for every single person. Using PG+vector would have advantages for more replicable instances of AnythingLLM and centralized storage, which is enough of a reason to support it.

@ericthomas1
Author

Totally understand.

RE: Defaults:

  • The issue our org faces is that we haven't found a way to deploy a centralized AnythingLLM instance to an Azure Web App w/ Docker when using the default LanceDB + SQLite.

  • Issues faced:

    • LanceDB: since it runs embedded, in-process with the app, a large multi-user implementation could require some serious compute.
    • SQLite: there is a bug with Azure Storage file shares that doesn't allow SQLite to work.

RE: Desktop:

  • I have no idea what it would take or how the file size would be affected, but is there an option to add a PostgreSQL component to the Desktop version, in similar fashion to how SQLite is embedded and managed?

RE: Cost:

  • For larger orgs, one needs to factor in:
    • data scaling/privacy/security
    • sourcing costs (legal, SOC2 reviews, due diligence, etc.)
    • multi-user scenarios + compute
    • administration of an extra service
    • time-to-value (the above takes a lot of time)
    • etc.
  • Having a single, open-source vector store (PostgreSQL), running in the org's Azure tenant, that doesn't require large compute for multi-user scenarios, is quite desirable ;)

We are very excited about the tools your team has created. We are also working hard to realize them in our environment. Thank you so much for the work you do.

Chat soon!

@timothycarambat
Member

Totally understand why you would want this and I'm fully on the same page. We for sure would not extend this functionality into the desktop client, since that really isn't what the desktop client is for - its sole purpose is on-device single-player mode.

The docker/self-hosted version however makes perfect sense for the reasons and conversation above!

@SeaDude

SeaDude commented Aug 8, 2024

Hello @timothycarambat,

I see that #873 was created exactly 5 months ago. Is there a plan for this enhancement to AnythingLLM?

Thank you

@hhland

hhland commented Feb 28, 2025

I gave it a try on Gitee, and it seems to work.
