-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug/Store RAG Document Metadata Subdocs in Separate Table #223
Conversation
lambda/models/domain_objects.py
Outdated
chunk_size: int | ||
chunk_overlap: int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Especially after yesterday's conversation of having different strategies, I could see these fields being their own object so we can easily represent multiple strategies. Something like what we ended up with for features maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm looking into storing a map for the strategy. It may not make it into this release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to store map
@@ -322,16 +341,48 @@ class RagDocument(BaseModel): | |||
document_name: str | |||
source: str | |||
username: str | |||
sub_docs: List[str] = Field(default_factory=lambda: []) | |||
subdocs: List[str] = Field(default_factory=lambda: [], exclude=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO this feels kinda clunky having this on this object considering it isn't persisted or returned to the frontend. I'd just use the result of join_docs
result directly instead of trying to store it on the RagDocument. That way you can stop worrying about that in the find*
methods too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping for now since this is a convenient way to pass along the subdocs as part of the document. It isn't queried with the API unless explicitly requested.
@@ -159,8 +159,7 @@ def handle_pipeline_ingest_documents(event: Dict[str, Any], context: Any) -> Dic | |||
document_name=key, | |||
source=docs[0][0].metadata.get("source"), | |||
subdocs=all_ids, | |||
chunk_size=chunk_size, | |||
chunk_overlap=chunk_overlap, | |||
chunk_strategy={"chunk_size": str(chunk_size), "chunk_overlap": str(chunk_overlap)}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only because this is kinda committing us to these values. I think we should have "name" for the type of strategy so we can use it as a discriminator. Also just a small thing but we might want to stop prefixing everything with chunk_
too since we're already in a chunk strategy object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed prefix and added StrategyType to entries.
Saving RAG SubDocument IDs in separate table to overcome Dynamo record limit. RAG Document records are now stored in 2 tables: 1 for metadata and 1 for subdocument chunks. This should allow for much larger documents to be managed and support quicker queries to retrieve RAG document metadata.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.