feat: add escaping_underscores option to markdown export #135

Merged
merged 1 commit into from
Jan 29, 2025

Contributor

@Vdaleke Vdaleke commented Jan 27, 2025

BREAKING CHANGE: export to text no longer escapes underscores.

Add an `escaping_underscores` option to `export_to_markdown()`. Set the default value of `escaping_underscores` to True.

resolves #134
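
For readers skimming the thread, here is a minimal sketch of how such a flag could behave. The function `export_to_markdown_sketch` is a hypothetical stand-in written for illustration and is not docling's implementation; only the option name `escaping_underscores` and its default of True come from this PR.

```python
import re


def export_to_markdown_sketch(text: str, escaping_underscores: bool = True) -> str:
    """Hypothetical stand-in for a Markdown exporter (not docling's code)."""
    if not escaping_underscores:
        # Leave the text untouched so plain-text search can match it.
        return text
    # Escape every underscore that is not already preceded by a backslash.
    return re.sub(r"(?<!\\)_", r"\\_", text)


print(export_to_markdown_sketch("field_name"))                              # field\_name
print(export_to_markdown_sketch("field_name", escaping_underscores=False))  # field_name
```

With the default, the output stays valid Markdown; with the flag turned off, the caller accepts possibly ambiguous Markdown in exchange for unmodified text.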


mergify bot commented Jan 27, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
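
The rule above can be tried locally with a short sketch. It assumes that Mergify's `~=` operator behaves like a regular-expression search, and uses Python's `re` module as an approximation; the helper name `is_conventional` is hypothetical.

```python
import re

# Pattern copied from the merge protection rule quoted above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


print(is_conventional("feat: add escaping_underscores option to markdown export"))  # True
print(is_conventional("feat(export)!: change default escaping"))                    # True
print(is_conventional("Add escaping option"))                                       # False
```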

BREAKING CHANGE: export to text no longer escapes underscores.

Add `escaping_underscores` option to `export_to_markdown()`.
Set default value of `escaping_underscores` to True.

Signed-off-by: Vdaleke <[email protected]>
@Vdaleke Vdaleke force-pushed the feat/escaping-underscores branch from 027900f to e8ce872 Compare January 27, 2025 11:28
@PeterStaar-IBM PeterStaar-IBM self-requested a review January 29, 2025 08:51
@PeterStaar-IBM
Contributor

@Vdaleke I am in principle OK with this, but I would like to understand which problem it solves. In what examples do we not want to escape underscores?

@Vdaleke
Contributor Author

Vdaleke commented Jan 29, 2025

> In what examples do we not want to escape underscores?

I gave an explanation in #134. This solves a problem that arises when docling is used to prepare documents for RAG, where text search over chunks is used among other things. When the document contains escaped underscores but the user query does not, the corresponding chunks are not found.

@dolfim-ibm
Contributor

Output without escaped underscores would be invalid Markdown, e.g. it could contain unclosed `_` characters, which carry semantic meaning.

@Vdaleke are you sure you actually need Markdown for your downstream application? Alternatively, you can use the option strict_text=True, which produces plain text.

Note: if strict_text=True also escapes underscores, I would consider that a bug that should be fixed.

@vladnosiv

> @Vdaleke are you sure you actually need Markdown for your downstream application? Alternatively, you can use the option strict_text=True, which produces plain text.

Hello @dolfim-ibm !

In our case we want to use docling for data preparation for RAG. The data should be used both during search and when the LLM processes its context. LLMs work better with Markdown, since they were largely trained on it (this depends on the LLM, I think, but in our case we are sure of it).

So we want to convert arbitrary files to Markdown for search and for processing by the model. For model processing, escaped underscores don't seem to be a problem. But for search, particularly for text search (as part of hybrid search), text is just text: if the escaped word A\_B\_C occurs in the text, it is not the same text as a user query containing A_B_C.

So we want to use Markdown for model processing, but also Markdown without escaping for search. Using strict text for search may be worse, as we may lose the headers and table structure, which can be important for vector search (again as part of hybrid search).

We realize that such a flag will generate incorrect syntax, but since docling is also positioned as a tool for preparing data for LLMs, we hope that adding such a flag will not be a problem.
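
The text-search mismatch described in this comment can be reproduced with a small sketch. The chunk strings and the `contains` helper are hypothetical examples made up for illustration; a real hybrid-search pipeline would use its own tokenizer and index.

```python
# Two versions of the same chunk: one with Markdown-escaped underscores,
# one exported without escaping. Both strings are made-up examples.
escaped_chunk = r"Set A\_B\_C before running the job."
unescaped_chunk = "Set A_B_C before running the job."

query = "A_B_C"


def contains(chunk: str, query: str) -> bool:
    """Naive case-insensitive substring search over a single chunk."""
    return query.lower() in chunk.lower()


print(contains(escaped_chunk, query))    # False: the backslashes break the match
print(contains(unescaped_chunk, query))  # True
```

The same effect applies to any exact-match or token-based lookup that does not normalize Markdown escapes before indexing.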

@PeterStaar-IBM PeterStaar-IBM merged commit c9739b2 into DS4SD:main Jan 29, 2025
7 checks passed
@Vdaleke Vdaleke deleted the feat/escaping-underscores branch January 29, 2025 13:52