feat: add escaping_underscores option to markdown export #135
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
BREAKING CHANGE: export to text no longer escapes underscores. Add `escaping_underscores` option to `export_to_markdown()`. Set the default value of `escaping_underscores` to `True`. Signed-off-by: Vdaleke <[email protected]>
027900f to e8ce872
@Vdaleke I am in principle OK with this, but I would like to understand which problem it solves. In what examples do we not want to escape underscores?
I gave an explanation in #134. This solves a problem that arises when using docling to prepare documents for RAG, where text search over chunks is used among other things. When the exported document contains escaped underscores but the user query does not, the corresponding chunks are not found.
The output without escaping underscores would be invalid Markdown, e.g. having non-closed
@Vdaleke are you sure you actually need Markdown for your downstream application? Alternatively, you can use the option
note: in case
Hello @dolfim-ibm! In our case we want to use docling for data preparation for RAG. The data should be used both during search and during LLM context processing. LLMs interact better with Markdown, since they were trained largely on it (this depends on the LLM, I think, but in our case we are sure of it). So we want to convert arbitrary files to Markdown for search and for processing by the model. For model processing, escaped underscores don't seem to be a problem. But for search, particularly for text search (as part of hybrid search), text is just text, and if the word A\_B\_C occurs in the text, it is not the same text as a user query containing A_B_C. So we want to use Markdown for model processing, but also Markdown without escaping for search. Using plain text for lookup may be worse, as we may lose the headers and table structure, which can be important when processing for vector search (again as part of hybrid search). We realize that such a flag will generate incorrect syntax, but since docling is also positioned as a tool for preparing data for LLMs, we hope that adding such a flag will not be a problem.
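To make the search mismatch concrete, here is a minimal sketch in Python. It does not use docling: `escape_underscores` and `tokens` are hypothetical helpers that stand in for the escaping done on export and for a naive keyword tokenizer, and the sample chunk text is made up for illustration.

```python
def escape_underscores(text: str) -> str:
    """Hypothetical stand-in for the underscore escaping applied on Markdown export."""
    return text.replace("_", "\\_")


def tokens(text: str) -> set[str]:
    """Naive keyword tokenizer: split on whitespace, strip surrounding punctuation."""
    return {t.strip(".,;:!?()") for t in text.split()}


# Illustrative chunk and query (not taken from a real document).
chunk = "Set A_B_C before calling the loader."
query = "A_B_C"

# The same chunk as it would look with underscores escaped.
escaped_chunk = escape_underscores(chunk)

# Exact-token lookup succeeds on the unescaped text and fails on the escaped text,
# because the escaped token contains backslashes that the query does not.
found_plain = query in tokens(chunk)
found_escaped = query in tokens(escaped_chunk)

print(found_plain, found_escaped)
```

A real keyword index would tokenize differently, but any exact-match step sees the escaped and unescaped forms as different strings, which is the situation described above.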
BREAKING CHANGE: export to text no longer escapes underscores.
Add `escaping_underscores` option to `export_to_markdown()`. Set the default value of `escaping_underscores` to `True`.
resolves #134
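As a rough illustration of the flag semantics described here, the following Python sketch mimics a boolean option that defaults to `True`. The function `export_to_markdown_sketch` is a hypothetical toy, not docling's implementation; the real export may treat code spans, URLs, and other contexts differently.

```python
def export_to_markdown_sketch(text: str, escaping_underscores: bool = True) -> str:
    """Toy model of the option: escape underscores unless the caller opts out."""
    if escaping_underscores:
        # Default behaviour: keep the output valid Markdown by escaping "_".
        return text.replace("_", "\\_")
    # Opt-out: leave underscores untouched, e.g. for text search over chunks.
    return text


default_output = export_to_markdown_sketch("A_B_C")
raw_output = export_to_markdown_sketch("A_B_C", escaping_underscores=False)

print(default_output)
print(raw_output)
```

With the default left at `True`, existing callers of the Markdown export keep the escaped output, and only callers that pass `escaping_underscores=False` get the raw text.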