Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug/Store RAG Document Metadata Subdocs in Separate Table #223

Merged
merged 11 commits into from
Jan 15, 2025

Conversation

bedanley
Copy link
Contributor

Saving RAG SubDocument IDs in separate table to overcome Dynamo record limit. RAG Document records are now stored in 2 tables: 1 for metadata and 1 for subdocument chunks. This should allow for much larger documents to be managed and support quicker queries to retrieve RAG document metadata.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@bedanley bedanley changed the title Bug/rag metadata table Bug/Store RAG Document Metadata Subdocs in Separate Table Jan 15, 2025
lib/rag/index.ts Outdated Show resolved Hide resolved
estohlmann
estohlmann previously approved these changes Jan 15, 2025
estohlmann
estohlmann previously approved these changes Jan 15, 2025
Comment on lines 345 to 346
chunk_size: int
chunk_overlap: int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Especially after yesterday's conversation of having different strategies, I could see these fields being their own object so we can easily represent multiple strategies. Something like what we ended up with for features maybe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking into storing a map for the strategy. It may not make it into this release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to store map

lambda/repository/rag_document_repo.py Outdated Show resolved Hide resolved
@@ -322,16 +341,48 @@ class RagDocument(BaseModel):
document_name: str
source: str
username: str
sub_docs: List[str] = Field(default_factory=lambda: [])
subdocs: List[str] = Field(default_factory=lambda: [], exclude=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO this feels kinda clunky having this on this object considering it isn't persisted or returned to the frontend. I'd just use the result of join_docs result directly instead of trying to store it on the RagDocument. That way you can stop worrying about that in the find* methods too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping for now since this is a convenient way to pass along the subdocs as part of the document. It isn't queried with the API unless explicitly requested.

estohlmann
estohlmann previously approved these changes Jan 15, 2025
@@ -159,8 +159,7 @@ def handle_pipeline_ingest_documents(event: Dict[str, Any], context: Any) -> Dic
document_name=key,
source=docs[0][0].metadata.get("source"),
subdocs=all_ids,
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
chunk_strategy={"chunk_size": str(chunk_size), "chunk_overlap": str(chunk_overlap)},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only because this is kinda committing us to these values. I think we should have "name" for the type of strategy so we can use it as a discriminator. Also just a small thing but we might want to stop prefixing everything with chunk_ too since we're already in a chunk strategy object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed prefix and added StrategyType to entries.

@bedanley bedanley merged commit f3fc00e into develop Jan 15, 2025
4 checks passed
@bedanley bedanley deleted the bug/rag-metadata-table branch January 15, 2025 22:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants