Bug/Store RAG Document Metadata Subdocs in Separate Table #223

bedanley · 2025-01-15T17:44:55Z

Store RAG subdocument IDs in a separate table to work around DynamoDB's 400 KB item size limit. RAG document records are now split across two tables: one for document metadata and one for subdocument chunks. This should allow much larger documents to be managed and make queries for RAG document metadata faster.
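
For reviewers, here is a minimal sketch of the two-table split described above. It is not the code in this PR: the function names, the `batch_index` and `subdoc_count` attributes, and the batch size of 1000 are all hypothetical, and the items are plain dicts rather than real DynamoDB calls. It only illustrates keeping the metadata item small while the subdocument IDs live in separate, bounded items.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical batch size: how many subdocument IDs go into one chunk item.
SUBDOC_BATCH_SIZE = 1000


def split_document_items(
    document_id: str,
    document_name: str,
    source: str,
    username: str,
    subdoc_ids: List[str],
    batch_size: int = SUBDOC_BATCH_SIZE,
) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
    """Return (metadata_item, subdoc_items), one entry per target table."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    # Metadata item: stays small because it carries no subdocument IDs.
    metadata_item = {
        "document_id": document_id,
        "document_name": document_name,
        "source": source,
        "username": username,
        "subdoc_count": len(subdoc_ids),
    }

    # Subdocument items: one item per batch of IDs, keyed by document and batch.
    subdoc_items = [
        {
            "document_id": document_id,
            "batch_index": index,
            "subdocs": subdoc_ids[start : start + batch_size],
        }
        for index, start in enumerate(range(0, len(subdoc_ids), batch_size))
    ]
    return metadata_item, subdoc_items


def join_subdoc_items(subdoc_items: List[Dict[str, Any]]) -> List[str]:
    """Rebuild the full ID list from the subdocument items, in batch order."""
    ordered = sorted(subdoc_items, key=lambda item: item["batch_index"])
    return [subdoc_id for item in ordered for subdoc_id in item["subdocs"]]
```

With this shape, listing or fetching document metadata never has to read the ID lists, and the IDs are only joined back together when a caller explicitly needs them.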

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

lib/rag/index.ts

lambda/repository/rag_document_repo.py

lambda/models/domain_objects.py

dustins · 2025-01-15T18:54:06Z

lambda/models/domain_objects.py

+    chunk_size: int
+    chunk_overlap: int


Especially after yesterday's conversation about having different strategies, I could see these fields being their own object so we can easily represent multiple strategies. Something like what we ended up with for features, maybe?

I'm looking into storing a map for the strategy. It may not make it into this release.

Updated to store map

lambda/repository/rag_document_repo.py

dustins · 2025-01-15T19:28:01Z

lambda/models/domain_objects.py

@@ -322,16 +341,48 @@ class RagDocument(BaseModel):
    document_name: str
    source: str
    username: str
-    sub_docs: List[str] = Field(default_factory=lambda: [])
+    subdocs: List[str] = Field(default_factory=lambda: [], exclude=True)


IMHO this feels kinda clunky having this on this object, considering it isn't persisted or returned to the frontend. I'd just use the result of join_docs directly instead of trying to store it on the RagDocument. That way you can stop worrying about that in the find* methods too.

Keeping for now since this is a convenient way to pass along the subdocs as part of the document. It isn't queried with the API unless explicitly requested.

lambda/models/domain_objects.py

dustins · 2025-01-15T22:11:18Z

lambda/repository/pipeline_ingest_documents.py

@@ -159,8 +159,7 @@ def handle_pipeline_ingest_documents(event: Dict[str, Any], context: Any) -> Dic
            document_name=key,
            source=docs[0][0].metadata.get("source"),
            subdocs=all_ids,
-            chunk_size=chunk_size,
-            chunk_overlap=chunk_overlap,
+            chunk_strategy={"chunk_size": str(chunk_size), "chunk_overlap": str(chunk_overlap)},


Only because this is kinda committing us to these values, I think we should have a "name" for the type of strategy so we can use it as a discriminator. Also, just a small thing, but we might want to stop prefixing everything with chunk_ too since we're already in a chunk strategy object.

Removed prefix and added StrategyType to entries.
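
As an illustration of the discriminator idea from this thread, here is a rough sketch of a strategy map with a type entry and unprefixed keys. The names `ChunkStrategyType`, `FixedChunkStrategy`, `strategy_from_map`, and the `"fixed"` value are assumptions made for this example and may not match what was merged. Values are kept as strings, following the `str(chunk_size)` form in the diff above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class ChunkStrategyType(str, Enum):
    """Hypothetical discriminator values; only one strategy is sketched."""

    FIXED = "fixed"


@dataclass(frozen=True)
class FixedChunkStrategy:
    size: int
    overlap: int

    def to_map(self) -> Dict[str, str]:
        # The "type" entry is the discriminator; the other keys drop the
        # chunk_ prefix because they already live in a chunk strategy map.
        return {
            "type": ChunkStrategyType.FIXED.value,
            "size": str(self.size),
            "overlap": str(self.overlap),
        }


def strategy_from_map(data: Dict[str, str]) -> FixedChunkStrategy:
    """Pick the strategy class based on the "type" entry of the stored map."""
    kind = data.get("type")
    if kind == ChunkStrategyType.FIXED.value:
        return FixedChunkStrategy(size=int(data["size"]), overlap=int(data["overlap"]))
    raise ValueError(f"Unknown chunk strategy type: {kind!r}")
```

Adding another strategy later would mean a new enum value, a new class, and one more branch in `strategy_from_map`, without changing how existing records are read.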

bedanley added 3 commits January 15, 2025 00:14

Move subdocs to separate table

17d844e

Update subdoc queries

86d0d19

cleanup

698b997

bedanley changed the title ~~Bug/rag metadata table~~ Bug/Store RAG Document Metadata Subdocs in Separate Table Jan 15, 2025

Merge branch 'develop' into bug/rag-metadata-table

536a2fe

estohlmann reviewed Jan 15, 2025

lib/rag/index.ts (outdated, resolved)

estohlmann reviewed Jan 15, 2025

lambda/repository/rag_document_repo.py (outdated, resolved)

estohlmann previously approved these changes Jan 15, 2025

cleanup

f3857a8

bedanley dismissed estohlmann’s stale review via f3857a8 January 15, 2025 18:48

dustins reviewed Jan 15, 2025

lambda/models/domain_objects.py (resolved)

Merge branch 'develop' into bug/rag-metadata-table

1eb9ee8

estohlmann previously approved these changes Jan 15, 2025

dustins reviewed Jan 15, 2025

lambda/models/domain_objects.py (outdated, resolved)

comments

fca747f

bedanley dismissed estohlmann’s stale review via fca747f January 15, 2025 21:17

bedanley added 2 commits January 15, 2025 21:52

Store chunking strategy as map

884c9ce

Add RAG pipeline write permissions

b9236ca

estohlmann previously approved these changes Jan 15, 2025

dustins reviewed Jan 15, 2025

Add chunk strategy type to map

4f20999

bedanley dismissed estohlmann’s stale review via 4f20999 January 15, 2025 22:19

precommit

cd259b9

estohlmann approved these changes Jan 15, 2025

bedanley merged commit f3fc00e into develop Jan 15, 2025
4 checks passed

bedanley deleted the bug/rag-metadata-table branch January 15, 2025 22:41
