Docsite rag #176
Conversation
Todo: test on full docs; finish DocsiteSearch according to index metadata structure
Thank you @hanna-paasivirta! I've gotten tied up with a production issue but we'll get this reviewed and merged tomorrow :)
This is really nice and clean, thank you! Even I have a fighting chance of understanding it.
I'd like to test before merging but it looks fantastic. Left a couple of questions in the meantime.
)
sleep_time = 30
logger.info(f"Waiting for {sleep_time}s to verify upload count")
time.sleep(sleep_time)
Is sleep really the best way to handle this?
What happens if the update isn't complete after 30 seconds?
I added a while loop and a max wait time instead. This works for the current implementation, where we're not expecting several indexing jobs to run in parallel.
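For reference, a minimal sketch of that kind of polling loop (the function name, the get_count callable, and the default timings here are illustrative, not the actual implementation):

import time
import logging

logger = logging.getLogger(__name__)

def wait_for_upload(get_count, expected_count, max_wait=300, poll_interval=10):
    """Poll until get_count() reports the expected vector count, or give up after max_wait seconds."""
    waited = 0
    while waited < max_wait:
        count = get_count()
        if count >= expected_count:
            logger.info(f"Upload count verified ({count}) after {waited}s")
            return count
        logger.info(f"Waiting for upload count: {count}/{expected_count} after {waited}s")
        time.sleep(poll_interval)
        waited += poll_interval
    raise TimeoutError(f"Upload count not verified within {max_wait}s")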
# Add docs
for docs_type in docs_to_upload:
    # Download and process
    docsite_processor = DocsiteProcessor(docs_type=docs_type)
Sorry, just before we merge and forget about this, can we feed docs_to_ignore into the processor here from the user args?
And can we then update the README with:
a) one example with default values (which will do everything)
b) one example which sets docs_to_upload and docs_to_ignore
Sorry, I forgot to commit this file; I've added it in now along with the README.
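For context, a hedged sketch of what that wiring could look like (reading from user args and the docs_to_ignore keyword are assumptions about the interface, not the committed code):

# Hypothetical: pull both lists from the user-supplied arguments, with permissive defaults
docs_to_upload = args.get("docs_to_upload", ["adaptor_docs", "general_docs"])  # assumed default set
docs_to_ignore = args.get("docs_to_ignore", [])

# Add docs
for docs_type in docs_to_upload:
    # Download and process, skipping anything the user asked to ignore (assumed keyword)
    docsite_processor = DocsiteProcessor(docs_type=docs_type, docs_to_ignore=docs_to_ignore)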
Short Description
Replace the Search service with new embed_docsite and search_docsite services.
Fixes #172
Implementation Details
The embed_docsite service downloads, chunks, processes metadata for, and indexes the OpenFn documentation. The service uses Pinecone as the vector database and OpenAI for text embeddings.
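As an illustration of that pipeline, a minimal sketch assuming the current OpenAI and Pinecone Python clients (the index name, namespace, embedding model, and chunk shape are assumptions, not the service's actual configuration):

from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone()  # reads PINECONE_API_KEY from the environment
index = pc.Index("docsite")  # assumed index name

def embed_and_upsert(chunks, namespace="docsite"):
    """Embed text chunks with OpenAI and upsert them, with metadata, into Pinecone."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",  # assumed embedding model
        input=[chunk["text"] for chunk in chunks],
    )
    vectors = [
        {"id": chunk["id"], "values": item.embedding, "metadata": chunk["metadata"]}
        for chunk, item in zip(chunks, response.data)
    ]
    index.upsert(vectors=vectors, namespace=namespace)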
The search_docsite service searches the documentation through the vector database using an input query.
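Continuing the sketch above, a hedged example of the query side (top_k and the returned fields are assumptions about how the results are consumed):

def search_docsite(query, top_k=5, namespace="docsite"):
    """Embed the input query and return the closest documentation chunks from Pinecone."""
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",  # must match the model used at indexing time
        input=query,
    ).data[0].embedding
    results = index.query(
        vector=embedding,
        top_k=top_k,
        namespace=namespace,
        include_metadata=True,
    )
    return [(match.score, match.metadata) for match in results.matches]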
AI Usage
Please disclose how you've used AI in this work (it's cool, we just want to know!):
You can read more details in our Responsible AI Policy