Docsite rag #176
Conversation
Todo: test on full docs; finish DocsiteSearch according to index metadata structure
Thank you @hanna-paasivirta! I've gotten tied up with a production issue but we'll get this reviewed and merged tomorrow :)
This is really nice and clean, thank you! Even I have a fighting chance of understanding it.
I'd like to test before merging but it looks fantastic. Left a couple of questions in the meantime.
)
sleep_time = 30
logger.info(f"Waiting for {sleep_time}s to verify upload count")
time.sleep(sleep_time)
Is sleep really the best way to handle this?
What happens if the update isn't complete after 30 seconds?
I added a while loop and a max wait time instead. This works for the current implementation, where we're not expecting several indexing jobs to run in parallel.
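For reference, a minimal sketch of that kind of polling loop (the function name, the get_count callable, and the default timings here are illustrative, not the actual implementation):

import time
import logging

logger = logging.getLogger(__name__)

def wait_for_upload(get_count, expected_count, max_wait=300, poll_interval=10):
    """Poll until get_count() reports the expected vector count, or give up after max_wait seconds."""
    waited = 0
    while waited < max_wait:
        count = get_count()
        if count >= expected_count:
            logger.info(f"Upload count verified ({count}) after {waited}s")
            return count
        logger.info(f"Waiting for upload count: {count}/{expected_count} after {waited}s")
        time.sleep(poll_interval)
        waited += poll_interval
    raise TimeoutError(f"Upload count not verified within {max_wait}s")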
# Add docs
for docs_type in docs_to_upload:
    # Download and process
    docsite_processor = DocsiteProcessor(docs_type=docs_type)
Sorry, just before we merge and forget about this, can we feed docs_to_ignore into the processor here from the user args?
And can we then update the README with:
a) one example with default values (which will do everything)
b) one example which sets docs_to_upload and docs_to_ignore
Sorry, I forgot to commit this file; I've added it in now along with the README.
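For context, a hedged sketch of what that wiring could look like (reading from user args and the docs_to_ignore keyword are assumptions about the interface, not the committed code):

# Hypothetical: pull both lists from the user-supplied arguments, with permissive defaults
docs_to_upload = args.get("docs_to_upload", ["adaptor_docs", "general_docs"])  # assumed default set
docs_to_ignore = args.get("docs_to_ignore", [])

# Add docs
for docs_type in docs_to_upload:
    # Download and process, skipping anything the user asked to ignore (assumed keyword)
    docsite_processor = DocsiteProcessor(docs_type=docs_type, docs_to_ignore=docs_to_ignore)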
Short Description
Replace the Search service with new embed_docsite and search_docsite services.
Fixes #172
Implementation Details
The embed_docsite service downloads, chunks, processes metadata for, and indexes the OpenFn documentation. The service uses Pinecone as the vector database and OpenAI for text embeddings.
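As an illustration of that pipeline, a minimal sketch assuming the current OpenAI and Pinecone Python clients (the index name, namespace, embedding model, and chunk shape are assumptions, not the service's actual configuration):

from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone()  # reads PINECONE_API_KEY from the environment
index = pc.Index("docsite")  # assumed index name

def embed_and_upsert(chunks, namespace="docsite"):
    """Embed text chunks with OpenAI and upsert them, with metadata, into Pinecone."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",  # assumed embedding model
        input=[chunk["text"] for chunk in chunks],
    )
    vectors = [
        {"id": chunk["id"], "values": item.embedding, "metadata": chunk["metadata"]}
        for chunk, item in zip(chunks, response.data)
    ]
    index.upsert(vectors=vectors, namespace=namespace)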
The search_docsite service searches the documentation through the vector database using an input query.
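Continuing the sketch above, a hedged example of the query side (top_k and the returned fields are assumptions about how the results are consumed):

def search_docsite(query, top_k=5, namespace="docsite"):
    """Embed the input query and return the closest documentation chunks from Pinecone."""
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",  # must match the model used at indexing time
        input=query,
    ).data[0].embedding
    results = index.query(
        vector=embedding,
        top_k=top_k,
        namespace=namespace,
        include_metadata=True,
    )
    return [(match.score, match.metadata) for match in results.matches]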
AI Usage
Please disclose how you've used AI in this work (it's cool, we just want to know!):
You can read more details in our Responsible AI Policy