InstructLab and Deepsearch #106

52 changes: 52 additions & 0 deletions docs/instructlab-deepsearch-integration.md
@@ -0,0 +1,52 @@

# DeepSearch + InstructLab Integration Proposal
Suggested change
# DeepSearch + InstructLab Integration Proposal
# Document Conversion Proposal


<https://github.com/DS4SD>
Moved below where DeepSearch is first mentioned


## Why is a Conversion Tool Necessary?
Suggested change
## Why is a Conversion Tool Necessary?
## Why is Document Conversion Necessary?


Managing submissions for the open-source InstructLab project has revealed a significant bottleneck in processing
knowledge documents. For the InstructLab backend to effectively utilize these documents, they must be in markdown
format. Currently, we only accept Wikipedia articles, but the built-in conversion tool is inadequate. Internally at
IBM, and other companies, many knowledge submissions are in multiple document formats, including PDF format,
necessitating conversion to markdown before being used in InstructLab.
Comment on lines +8 to +12
Suggested change
Managing submissions for the open-source InstructLab project has revealed a significant bottleneck in processing
knowledge documents. For the InstructLab backend to effectively utilize these documents, they must be in markdown
format. Currently, we only accept Wikipedia articles, but the built-in conversion tool is inadequate. Internally at
IBM, and other companies, many knowledge submissions are in multiple document formats, including PDF format,
necessitating conversion to markdown before being used in InstructLab.
Managing taxonomy submissions in InstructLab has revealed an issue in processing knowledge documents. For InstructLab to handle these documents, they must be in [markdown](https://en.wikipedia.org/wiki/Markdown) format. However, many knowledge submissions are in other document formats, including PDF, which necessitates conversion to markdown before they can be used by InstructLab.


Existing open-source methods, such as PanDoc, are inconsistent. While they preserve text, they struggle with parsing
tables and special symbols, as evidenced by issues in PR #1154 of the taxonomy repo in the InstructLab project. Other
open-source solutions have similar shortcomings.
Comment on lines +14 to +16
Suggested change
Existing open-source methods, such as PanDoc, are inconsistent. While they preserve text, they struggle with parsing
tables and special symbols, as evidenced by issues in PR #1154 of the taxonomy repo in the InstructLab project. Other
open-source solutions have similar shortcomings.
Existing open source tools, such as [Pandoc](https://pandoc.org/), are inconsistent. While they preserve text, they struggle with parsing tables and special symbols, as evidenced by issues in [PR #1154](https://github.com/instructlab/taxonomy/pull/1154) of the taxonomy repo in the InstructLab project. Other open source solutions have similar shortcomings.


## Why DeepSearch?
Suggested change
## Why DeepSearch?
## Proposed Solution


IBM's DeepSearch software excels in document conversion, outperforming traditional open-source methods. Utilizing a
computer vision model layer, it accurately parses content in the files, including titles, headers, and tables.
Additionally, it automatically implements RAG layers for models, which could benefit the InstructLab process in
the future.
Comment on lines +20 to +23
Suggested change
IBM's DeepSearch software excels in document conversion, outperforming traditional open-source methods. Utilizing a
computer vision model layer, it accurately parses content in the files, including titles, headers, and tables.
Additionally, it automatically implements RAG layers for models, which could benefit the InstructLab process in
the future.
IBM's [DeepSearch](https://github.com/DS4SD) software excels in document conversion, outperforming traditional open source methods. Utilizing a computer vision model layer, it accurately parses content in the files, including titles, headers, and tables. Additionally, it automatically implements RAG layers for models, which could benefit the InstructLab process in the future.


## Integration Proposal
Suggested change
## Integration Proposal
### Integration Proposal


To maintain the open-source nature of the project while leveraging the strengths of DeepSearch, we propose a
two-pronged approach:
Comment on lines +27 to +28
Suggested change
To maintain the open-source nature of the project while leveraging the strengths of DeepSearch, we propose a
two-pronged approach:
To maintain the open source nature of the project while leveraging the strengths of DeepSearch, we propose a
two-pronged approach:


### Open-Source Conversion

- Implement a basic document conversion tool in the UI using an open-source method such as PanDoc. This tool will be
lightweight and easily hosted, ensuring it can be used and improved by the community.
Comment on lines +30 to +33
Suggested change
### Open-Source Conversion
- Implement a basic document conversion tool in the UI using an open-source method such as PanDoc. This tool will be
lightweight and easily hosted, ensuring it can be used and improved by the community.
- Open source conversion: Implement a basic document conversion tool in the InstructLab UI using an open source tool such as [Pandoc](https://pandoc.org/). This tool will be lightweight and easily hosted, ensuring it can be used and improved by the community.


### DeepSearch Integration

- Enable the UI to switch the conversion endpoint to DeepSearch, allowing high-fidelity markdown conversions for
backend use. This approach maintains an open-source version while benefiting from DeepSearch's superior
conversion capabilities.
Comment on lines +35 to +39
Suggested change
### DeepSearch Integration
- Enable the UI to switch the conversion endpoint to DeepSearch, allowing high-fidelity markdown conversions for
backend use. This approach maintains an open-source version while benefiting from DeepSearch's superior
conversion capabilities.
- [DeepSearch](https://github.com/DS4SD) conversion: Enable the InstructLab UI to switch the conversion endpoint to DeepSearch, allowing high-fidelity markdown conversions for backend use. This approach uses the open source version of DeepSearch while benefiting from DeepSearch's superior conversion capabilities.


IBM Research and the DeepSearch team will host the DeepSearch endpoint for the open-source community. This
arrangement benefits the community by streamlining contributions and provides data and exposure for the DeepSearch
project. IBM's contribution underscores its commitment to supporting and improving open-source projects.
Comment on lines +41 to +43
Suggested change
IBM Research and the DeepSearch team will host the DeepSearch endpoint for the open-source community. This
arrangement benefits the community by streamlining contributions and provides data and exposure for the DeepSearch
project. IBM's contribution underscores its commitment to supporting and improving open-source projects.
IBM Research and the DeepSearch team will host the DeepSearch endpoint for the InstructLab community. This
arrangement benefits the community by providing a means to handle different document formats.


This integration will highlight the value of DeepSearch, highlighting their potential for those integrating
InstructLab into their workflows. If the volume of community requests becomes unsustainable for the DeepSearch team,
we hope for ample notification to allow the community to find alternative solutions. By then, we anticipate that the
open-source versions will have improved sufficiently, or the value of the integration will justify continued support.
Comment on lines +45 to +48
Suggested change
This integration will highlight the value of DeepSearch, highlighting their potential for those integrating
InstructLab into their workflows. If the volume of community requests becomes unsustainable for the DeepSearch team,
we hope for ample notification to allow the community to find alternative solutions. By then, we anticipate that the
open-source versions will have improved sufficiently, or the value of the integration will justify continued support.
If the volume of community requests becomes unsustainable for the DeepSearch team to manage, we hope for ample notification to allow the community to find alternative solutions. By then, we anticipate that the open source conversion tools will have improved sufficiently, or the value of the integration will justify continued support.


By adopting this two-pronged approach, we ensure the integrity of the open-source project while leveraging IBM's
advanced DeepSearch capabilities. This strategy balances community collaboration with innovative technology,
fostering innovation and improvement in document processing for the InstructLab project.
Comment on lines +50 to +52
Suggested change
By adopting this two-pronged approach, we ensure the integrity of the open-source project while leveraging IBM's
advanced DeepSearch capabilities. This strategy balances community collaboration with innovative technology,
fostering innovation and improvement in document processing for the InstructLab project.
By adopting this two-pronged approach, we ensure the integrity of the open source project while leveraging advanced DeepSearch capabilities. This strategy balances community collaboration with innovative technology,
fostering innovation and improvement in document processing for the InstructLab community.
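
For discussion, the two-pronged approach (a basic local converter plus a switchable high-fidelity endpoint) could be sketched roughly as below. This is only an illustration under stated assumptions, not the proposed implementation: `ConversionService`, `pandoc_convert`, and `make_remote_converter` are hypothetical names, and the HTTP request shape for the remote endpoint is invented for the example and is not DeepSearch's actual API. The local backend shells out to the `pandoc` CLI and assumes it is installed and that the input is a format Pandoc can read.

```python
import pathlib
import subprocess
import tempfile
import urllib.request
from typing import Callable, Dict, Optional

# A converter takes raw document bytes plus the original filename
# and returns markdown text.
Converter = Callable[[bytes, str], str]


def pandoc_convert(data: bytes, filename: str) -> str:
    """Basic local conversion via the pandoc CLI (assumes pandoc is on PATH)."""
    suffix = pathlib.Path(filename).suffix  # pandoc infers the input format from it
    with tempfile.NamedTemporaryFile(suffix=suffix) as tmp:
        tmp.write(data)
        tmp.flush()
        result = subprocess.run(
            ["pandoc", tmp.name, "-t", "gfm"],  # emit GitHub-flavored markdown
            capture_output=True,
            text=True,
            check=True,
        )
    return result.stdout


def make_remote_converter(endpoint_url: str) -> Converter:
    """Build a converter that posts the document to a hosted endpoint.

    ASSUMPTION: the endpoint accepts raw bytes and replies with markdown.
    The real DeepSearch API may look entirely different.
    """

    def convert(data: bytes, filename: str) -> str:
        request = urllib.request.Request(
            endpoint_url,
            data=data,
            headers={
                "Content-Type": "application/octet-stream",
                "X-Filename": filename,  # hypothetical header
            },
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=60) as response:
            return response.read().decode("utf-8")

    return convert


class ConversionService:
    """Selects a conversion backend by name, with a configurable default."""

    def __init__(self, backends: Dict[str, Converter], default: str) -> None:
        if default not in backends:
            raise ValueError(f"unknown default backend: {default}")
        self._backends = dict(backends)
        self._default = default

    def convert(self, data: bytes, filename: str, backend: Optional[str] = None) -> str:
        name = backend or self._default
        if name not in self._backends:
            raise ValueError(f"unknown backend: {name}")
        return self._backends[name](data, filename)
```

With this shape, the UI would only need to pass a backend name (for example `"pandoc"` or `"deepsearch"`), and a deployment without access to the hosted endpoint could register the local converter alone.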