fix: fix single newline handling in MD backend #824

Merged: 1 commit, Jan 28, 2025

Conversation


@vagenas vagenas commented Jan 28, 2025

Resolves #822.

mergify bot commented Jan 28, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2


vagenas commented Jan 28, 2025

For reference, Marko (the Markdown parsing library used under the hood) handles newlines as follows:

  • it captures a BlankLine for multiple (more than one) successive newlines,
  • otherwise, for a single newline, it captures a LineBreak
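
The rule described above can be illustrated with a minimal, standard-library-only sketch. This is not Marko's implementation and does not call Marko; the function name `classify_newlines` and the token labels are hypothetical, chosen only to mirror the BlankLine / LineBreak distinction mentioned in the comment.

```python
import re


def classify_newlines(text: str) -> list[tuple[str, str]]:
    """Label each chunk of `text` as 'text', 'LineBreak' or 'BlankLine'.

    Assumption for illustration: a run of exactly one newline is a
    LineBreak, and a run of two or more newlines is a BlankLine.
    """
    tokens = []
    # Split on runs of newlines; the capturing group keeps the runs.
    for chunk in re.split(r"(\n+)", text):
        if not chunk:
            continue  # re.split can yield empty strings at the edges
        if chunk.startswith("\n"):
            kind = "LineBreak" if len(chunk) == 1 else "BlankLine"
        else:
            kind = "text"
        tokens.append((kind, chunk))
    return tokens


if __name__ == "__main__":
    for kind, chunk in classify_newlines("line one\nline two\n\nnew paragraph"):
        print(kind, repr(chunk))
```

Under this reading, only a BlankLine separates paragraphs, which is the distinction the fix relies on.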


@ceberam ceberam left a comment


While this PR fixes the problem described in the issue, it may also discard hard line breaks, which could be important to the document author.

For instance, consider the following text (rendered here as text):

First line with no space after.
The first line continues.

First line with two spaces after.  
And the next line.

According to the Markdown syntax and, for instance, GitHub Flavored Markdown, the first block renders as a single line, while the second block should render as two lines:

First line with no space after. The first line continues.

First line with two spaces after.
And the next line.

However, this docling fix renders the second block as a single line and thus removes the hard line break. Running docling --from md --to md on the document above would result in:

First line with no space after. The first line continues.

First line with two spaces after. And the next line.


vagenas commented Jan 28, 2025

@ceberam In any case, I would regard the hard line break as only a visual wrap / formatting, so it should not correspond to a new DoclingDocument text item / paragraph.

To make it clear, in the discussed "rendering" example:

First line with no space after. The first line continues.

First line with two spaces after.
And the next line.

these are conceptually still only two paragraphs.

In this sense, the produced MD export is as expected (given that DoclingDocument will capture the paragraphs):

First line with no space after. The first line continues.

First line with two spaces after. And the next line.
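
The two readings discussed in this thread can be compared with a small sketch. It is not docling's code: `to_paragraphs` and its `keep_hard_breaks` flag are hypothetical names, and only the two-trailing-spaces form of a hard line break is handled. With the flag off, every single newline inside a paragraph becomes a space (the behavior described for this fix); with the flag on, a hard break is kept as a newline within the same paragraph.

```python
import re

SAMPLE = (
    "First line with no space after.\n"
    "The first line continues.\n"
    "\n"
    "First line with two spaces after.  \n"
    "And the next line.\n"
)


def to_paragraphs(md: str, keep_hard_breaks: bool = False) -> list[str]:
    """Split `md` into paragraphs on blank lines and join the lines inside each.

    Assumption for illustration: a line ending in two spaces marks a hard
    line break; any other single newline is a soft break.
    """
    paragraphs = []
    for block in re.split(r"\n[ \t]*\n+", md.strip("\n")):
        lines = block.split("\n")
        parts = []
        for i, line in enumerate(lines):
            parts.append(line.strip())
            if i < len(lines) - 1:
                hard = line.endswith("  ")
                # Soft breaks always become a space; hard breaks only
                # survive as a newline when explicitly requested.
                parts.append("\n" if (hard and keep_hard_breaks) else " ")
        paragraphs.append("".join(parts))
    return paragraphs


if __name__ == "__main__":
    print(to_paragraphs(SAMPLE))
    print(to_paragraphs(SAMPLE, keep_hard_breaks=True))
```

Either way the sample yields two paragraphs; the policies differ only in whether the second paragraph keeps its internal line break.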

@vagenas vagenas merged commit 5aed9f8 into main Jan 28, 2025
9 checks passed
@vagenas vagenas deleted the fix-md-single-newline branch January 28, 2025 18:05
Successfully merging this pull request may close these issues.

Bug: Docling misinterprets linebreaks in markdown input as paragraph breaks