Docsite rag #176

Merged
merged 23 commits into from
Mar 14, 2025

hanna-paasivirta
Contributor

@hanna-paasivirta hanna-paasivirta commented Feb 21, 2025

Short Description

Replace the Search service with new embed_docsite and search_docsite services.

Fixes #172

Implementation Details

The embed_docsite service downloads, chunks, processes metadata and indexes OpenFn documentation. The service uses Pinecone as a vector database and OpenAI for text embeddings.

The search_docsite service searches the documentation in the vector database using an input query.
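
As a rough illustration of the embed-then-search flow described above, here is a minimal in-memory sketch. It is not the PR's implementation: the bag-of-words `embed` function stands in for OpenAI embeddings, the plain list returned by `build_index` stands in for a Pinecone index, and the function names, chunk sizes, and metadata keys are all assumptions made for this example.

```python
import math
from collections import Counter


def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping word windows (hypothetical chunking strategy)."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(docs):
    """docs maps a document title to its text; returns records with metadata."""
    index = []
    for title, text in docs.items():
        for i, chunk in enumerate(chunk_text(text)):
            index.append({
                "id": f"{title}-{i}",
                "text": chunk,
                "vector": embed(chunk),
                "metadata": {"doc_title": title},  # assumed metadata shape
            })
    return index


def search(index, query, top_k=3):
    """Return the top_k records most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda r: cosine(query_vector, r["vector"]), reverse=True)
    return ranked[:top_k]
```

Calling `build_index` once and then `search(index, "what is an adaptor")` mirrors the split between the two services: indexing is a separate, slower step, and search only reads from the existing index.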

AI Usage

Please disclose how you've used AI in this work (it's cool, we just want to know!):

  • Code generation (copilot but not intellisense)
  • Learning or fact checking
  • Strategy / design
  • Optimisation / refactoring
  • Translation / spellchecking / doc gen
  • Other
  • I have not used AI

You can read more details in our Responsible AI Policy

@hanna-paasivirta
Contributor Author

Todo: test on full docs; finish DocsiteSearch according to index metadata structure

@hanna-paasivirta hanna-paasivirta marked this pull request as ready for review March 3, 2025 13:37
@josephjclark
Collaborator

Thank you @hanna-paasivirta! I've gotten tied up with a production issue but we'll get this reviewed and merged tomorrow :)

Collaborator

@josephjclark josephjclark left a comment

This is really nice and clean, thank you! Even I have a fighting chance of understanding it.

I'd like to test before merging but it looks fantastic. Left a couple of questions in the meantime.

)
sleep_time = 30
logger.info(f"Waiting for {sleep_time}s to verify upload count")
time.sleep(sleep_time)
Collaborator

Is sleep really the best way to handle this?

What happens if the update isn't complete after 30 seconds?

Contributor Author

I added a while loop and a max wait time instead. This works in the current implementation where we’re not expecting several indexing jobs running in parallel.
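
For readers following along, a polling loop with a maximum wait time might look roughly like the sketch below. It is not the code from this PR: `wait_for_count`, `get_count`, and the default timings are hypothetical names and values chosen for illustration, and `get_count` stands in for whatever call reports the number of vectors in the index.

```python
import time


def wait_for_count(get_count, expected, max_wait=150, interval=5,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_count() until it reaches `expected` or `max_wait` seconds pass.

    Returns True if the expected count was observed, False on timeout.
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + max_wait
    while True:
        if get_count() >= expected:
            return True
        if clock() >= deadline:
            return False  # give up; the caller decides whether to warn or fail
        sleep(interval)
```

Returning a boolean instead of raising leaves it to the caller to decide whether an incomplete upload is an error or only worth a log warning. As noted above, comparing against a single expected count assumes no other indexing job is writing to the same index at the same time.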

# Add docs
for docs_type in docs_to_upload:
    # Download and process
    docsite_processor = DocsiteProcessor(docs_type=docs_type)
Collaborator

Sorry, just before we merge and forget about this, can we feed docs_to_ignore into the processor here, from user args?

And can we then update the readme with:
a) one example with default values (which will do everything)
b) one example which sets docs_to_upload and docs_to_ignore

Contributor Author

Sorry I forgot to commit this file, I've added it in now along with the readme.
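
One possible shape for passing both settings from user args into the processor is sketched below. The committed file is not shown in this thread, so everything here is an assumption: the payload keys simply reuse the names from the review comment, the default docs types are placeholders, and the `DocsiteProcessor` class is a stand-in that only models the constructor.

```python
# Placeholder values: the real docs types and defaults are not shown in this thread.
DEFAULT_DOCS_TO_UPLOAD = ["docs_type_a", "docs_type_b"]


class DocsiteProcessor:
    """Stand-in for the PR's DocsiteProcessor; only the constructor is sketched."""

    def __init__(self, docs_type, docs_to_ignore=None):
        self.docs_type = docs_type
        self.docs_to_ignore = list(docs_to_ignore or [])


def build_processors(payload):
    """Create one processor per docs type, falling back to defaults when unset."""
    docs_to_upload = payload.get("docs_to_upload", DEFAULT_DOCS_TO_UPLOAD)
    docs_to_ignore = payload.get("docs_to_ignore", [])
    return [
        DocsiteProcessor(docs_type=docs_type, docs_to_ignore=docs_to_ignore)
        for docs_type in docs_to_upload
    ]
```

With an empty payload, `build_processors({})` covers every default docs type, which matches the "default values will do everything" case requested for the readme. A payload that sets both keys restricts the run to the listed types and skips the ignored items.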

@josephjclark josephjclark changed the base branch from main to release/next March 13, 2025 12:37
@josephjclark josephjclark merged commit c0e391c into release/next Mar 14, 2025
@josephjclark josephjclark deleted the docsite-rag branch March 14, 2025 10:28

Successfully merging this pull request may close these issues.

Docsite search: Add a new docsite search RAG