
How to use a **single PostgreSQL database** for AnythingLLM database + vector store? #2057

Closed
ericthomas1 opened this issue Aug 6, 2024 · 11 comments


@ericthomas1

ericthomas1 commented Aug 6, 2024

Hello,

Our team has been able to deploy AnythingLLM as an Azure Web App, and by making this change in schema.prisma...:

 datasource db {
   provider = "postgresql"
   url      = env("DATABASE_URL")
 }

...we're able to connect the Web App to an Azure Database for PostgreSQL database.
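
For reference, the DATABASE_URL that Prisma reads is a standard PostgreSQL connection string; a placeholder example (server, user, and database names are illustrative):

 DATABASE_URL="postgresql://anythingllm:<password>@example-server.postgres.database.azure.com:5432/anythingllm?sslmode=require"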

We want to use this single PostgreSQL database as...:

  • ...the AnythingLLM database (this is now working) and
  • ...the /app/server/storage (this is currently mapped to an Azure SMB Fileshare) and
  • ...the vector store (using pgvector extension) (we're currently using Pinecone)

This means that app storage is split into 3 different locations. We want to simplify this but can't quite figure out how to get /app/server/storage or the vector store to use this single PostgreSQL database.

Could you assist?

Thank you

@ericthomas1 ericthomas1 changed the title PostgreSQL state management + vector store for AnythingLLM? How to use a **single PostgreSQL database** for AnythingLLM database + vector store? Aug 6, 2024
@timothycarambat
Member

/app/server/storage is not for databases - it is where static/cached assets are stored, not data intended to live in a database.

PGVector support is already an open issue that is TBD (#873). It frankly isn't as easy as just upserting records, since everyone's table schema will vary, and that detail adds a ton of complexity.

@ericthomas1
Author

Thank you for the reply and details.

I'm not following how everyone's table schema will vary.

Example:

  • When we select Pinecone as the vector store in AnythingLLM, the schema is set (somehow)
  • Can we add this same schema to schema.prisma and have it initialized in the PostgreSQL database as part of setup?
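
For what it's worth, a rough sketch of what that might look like in schema.prisma (model and field names are hypothetical; Prisma can only represent a pgvector column via its Unsupported type, so the generated client can't query the vector directly and raw SQL would still be needed):

 generator client {
   provider        = "prisma-client-js"
   previewFeatures = ["postgresqlExtensions"]
 }

 datasource db {
   provider   = "postgresql"
   url        = env("DATABASE_URL")
   extensions = [vector]
 }

 model Embedding {
   id        BigInt                      @id @default(autoincrement())
   namespace String
   embedding Unsupported("vector(1536)")
 }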

@timothycarambat
Member

What will inevitably occur is people asking "I want the embedding to specifically be a column on x table", which makes the config more complex (since it will also be easier to mess up).

We can just do something like:

 await client.query('CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3))');

and create a sub-table for every collection to use for indexing, which would work in practice the same way as every other provider. That is what we wanted to do, but when we discussed it a long time ago, when the issue first opened, the few people we spoke to all wanted the embeddings to live on specific tables of their own, not AnythingLLM CRUD'ing tables to support embeddings. We could obviously also have one giant table and query it:

 const result = await client.query('SELECT * FROM items WHERE namespace = $1 ORDER BY embedding <-> $2 LIMIT 5', ['namespace', pgvector.toSql([1, 2, 3])]);
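
For concreteness, a self-contained sketch of that one-giant-table approach, using the pg and pgvector npm packages (table/column names are illustrative, not a committed schema):

 const { Client } = require('pg');
 const pgvector = require('pgvector/pg');

 async function main() {
   const client = new Client({ connectionString: process.env.DATABASE_URL });
   await client.connect();

   // Enable the extension (no-op if it already exists). Optionally register
   // pgvector's type parsers on the client; see the pgvector-node docs.
   await client.query('CREATE EXTENSION IF NOT EXISTS vector');

   // One giant table: every workspace's vectors live here, tagged by a namespace column.
   await client.query(
     'CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, namespace text NOT NULL, embedding vector(3))'
   );

   await client.query(
     'INSERT INTO items (namespace, embedding) VALUES ($1, $2)',
     ['my-workspace', pgvector.toSql([1, 2, 3])]
   );

   // Nearest-neighbor search (L2 distance), scoped to a single workspace's namespace.
   const result = await client.query(
     'SELECT * FROM items WHERE namespace = $1 ORDER BY embedding <-> $2 LIMIT 5',
     ['my-workspace', pgvector.toSql([1, 2, 3])]
   );
   console.log(result.rows);

   await client.end();
 }

 main().catch(console.error);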

@ericthomas1
Author

ericthomas1 commented Aug 7, 2024

Hm...yeah, I don't have a mental model of the guts to speak to a solution. I can help organize if it's useful.

Sounds like you're saying:

Option 1:

  • Create small child tables for each workspace document (collection?) that is chunked and embedded (processed) by AnythingLLM.
  • Each child table's vector column is indexed (not sure how AnythingLLM would search across indexes for this option.)

Option 2:

  • Create one large table for all documents that are processed by AnythingLLM
  • Index the vector column of this table

Did I get these options correctly identified?

I'm not exactly sure what ...wanted the embeddings to be on specific tables... refers to.

@timothycarambat
Member

That is correct: we would either create an embedding table for each workspace, or have one giant embedding table with a namespace column that we can index and query against, since we will always know the namespace value.
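
For illustration, the table-per-workspace variant might be a pair of helpers like the ones below (names are hypothetical, not anything committed). It also makes workspace deletion trivial, since dropping the table drops the vectors:

 // Hypothetical helpers for the table-per-workspace option.
 // `client` is a connected pg Client with the vector extension enabled;
 // `workspace` must already be a sanitized slug, since SQL identifiers
 // cannot be parameterized.
 async function createWorkspaceTable(client, workspace) {
   await client.query(
     `CREATE TABLE IF NOT EXISTS "ws_${workspace}" (id bigserial PRIMARY KEY, embedding vector(3))`
   );
   // A per-table HNSW index (pgvector >= 0.5.0) keeps approximate
   // nearest-neighbor search fast within this workspace.
   await client.query(
     `CREATE INDEX IF NOT EXISTS "ws_${workspace}_idx" ON "ws_${workspace}" USING hnsw (embedding vector_l2_ops)`
   );
 }

 async function dropWorkspaceTable(client, workspace) {
   // Deleting a workspace removes all of its vectors in one statement.
   await client.query(`DROP TABLE IF EXISTS "ws_${workspace}"`);
 }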

@ericthomas1
Author

How do these potential PostgreSQL options compare with how AnythingLLM handles Pinecone-as-a-vector-store?

  • I'm fairly certain AnythingLLM natively scrapes sources (docs / websites / GitHub repos / etc.), chunks them, and gets embeddings for each text chunk.
    • It has to have a (temp?) schema for this process.
    • I'm guessing this schema is sent to Pinecone (in this case)

If so, it seems that Option 1 would be a good way to store this schema in PostgreSQL instead.

Some (naïve) rationale:

  1. When a workspace is deleted, the corresponding vectors could be easily deleted as well. (DROP TABLE).
  2. More centralized storage management (AnythingLLM database and vector store are in the same resource)
  3. Further leverages open-source tooling (PostgreSQL)
  4. Saves costs (when compared to Pinecone / other proprietary vector stores)

@timothycarambat
Member

timothycarambat commented Aug 7, 2024

Not that I'm against supporting the PGVector stuff, but there is a very good reason we don't make everyone use PG as their database.

We currently use LanceDB as the vectordb and SQLite as the regular DB. Both of these require no services or additional setup and are file-based. Operations around using these solutions run on disk and the user never even knows they are there.

Setting up a PG instance is easy, but it doesn't port to desktop. The master branch is actually an upstream branch of what winds up in Desktop, and asking users to start a PG instance to use the desktop app, or having us run their DB in the cloud, is simply a non-starter for privacy and cost reasons.

PGVector in the case where someone uses Pinecone certainly saves money, but even PG costs more to run than saving a file to disk like LanceDB :)

As stated, we do intend to support PGVector, since it's still an open issue and it would add to consolidation in your specific stack, but it is not the best solution for every single person. Using PG+vector would have advantages for more replicable instances of AnythingLLM and centralized storage, which is enough of a reason to support it.

@ericthomas1
Author

Totally understand.

RE: Defaults:

  • The issue our org faces is that we haven't found a way to deploy a centralized AnythingLLM instance to an Azure Web App w/ Docker when using the default LanceDB + SQLite.

  • Issues faced:

    • LanceDB: since it runs embedded, in-process with the app, a large multi-user implementation could require some serious compute.
    • SQLite: there is a bug with Azure Storage file shares that doesn't allow SQLite to work.

RE: Desktop:

  • I have no idea what it would take or how the file size would be affected, but is there an option to add a PostgreSQL component to the Desktop version, in similar fashion to how SQLite is embedded and managed?

RE: Cost:

  • For larger orgs, one needs to factor in:
    • data scaling/privacy/security
    • sourcing costs (legal, SOC2 reviews, due diligence, etc.)
    • multi-user scenarios + compute
    • administration of an extra service
    • time-to-value (the above takes a lot of time)
    • etc.
  • Having a single, open-source vector store (PostgreSQL), running in the org's Azure tenant, that doesn't require large compute for multi-user scenarios, is quite desirable ;)

We are very excited about the tools your team has created. We are also working hard to realize them in our environment. Thank you so much for the work you do.

Chat soon!

@timothycarambat
Member

Totally understand why you would want this and I'm fully on the same page. We for sure would not extend this functionality into the desktop client, since that really isn't what the desktop client is for - its sole purpose is on-device single-player mode.

The docker/self-hosted version however makes perfect sense for the reasons and conversation above!

@SeaDude

SeaDude commented Aug 8, 2024

Hello @timothycarambat,

I see that #873 was created exactly 5 months ago. Is there a plan for this enhancement to AnythingLLM?

Thank you

@hhland

hhland commented Feb 28, 2025

I gave it a try on Gitee, and it seems to work.
