fix: fix single newline handling in MD backend (#824)
Signed-off-by: Panos Vagenas <[email protected]>
vagenas authored Jan 28, 2025
1 parent adf6353 commit 5aed9f8
Showing 5 changed files with 170 additions and 8 deletions.
12 changes: 4 additions & 8 deletions docling/backend/md_backend.py
@@ -65,7 +65,7 @@ def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]

         self.in_table = False
         self.md_table_buffer: list[str] = []
-        self.inline_text_buffer = ""
+        self.inline_texts: list[str] = []

         try:
             if isinstance(self.path_or_stream, BytesIO):
@@ -152,15 +152,14 @@ def close_table(self, doc: DoclingDocument):
     def process_inline_text(
         self, parent_element: Optional[NodeItem], doc: DoclingDocument
     ):
-        # self.inline_text_buffer += str(text_in)
-        txt = self.inline_text_buffer.strip()
+        txt = " ".join(self.inline_texts)
         if len(txt) > 0:
             doc.add_text(
                 label=DocItemLabel.PARAGRAPH,
                 parent=parent_element,
                 text=txt,
             )
-        self.inline_text_buffer = ""
+        self.inline_texts = []

     def iterate_elements(
         self,
@@ -266,9 +265,7 @@ def traverse(node: marko.block.BlockElement):
                 self.close_table(doc)
                 self.in_table = False
                 # most likely just inline text
-                self.inline_text_buffer += str(
-                    element.children
-                )  # do not strip an inline text, as it may contain important spaces
+                self.inline_texts.append(str(element.children))

         elif isinstance(element, marko.inline.CodeSpan):
             self.close_table(doc)
@@ -292,7 +289,6 @@ def traverse(node: marko.block.BlockElement):
             doc.add_code(parent=parent_element, text=snippet_text)

         elif isinstance(element, marko.inline.LineBreak):
-            self.process_inline_text(parent_element, doc)
             if self.in_table:
                 _log.debug("Line break in a table")
                 self.md_table_buffer.append("")
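A minimal sketch of what the change above appears to do for a paragraph that contains single newlines. It does not use docling or marko: the token list, the `LINE_BREAK` sentinel, and both helper names are hypothetical stand-ins, and the sketch assumes the collected fragments are flushed once at the end of the enclosing block.

```python
# Simplified stand-in for the backend's inline-text handling.
# Nothing here is a docling or marko API; all names are illustrative.
LINE_BREAK = object()  # plays the role of a marko.inline.LineBreak element

tokens = [
    "After we had a good day of swimming in the lake,",
    LINE_BREAK,
    "it's important to eat",
    LINE_BREAK,
    "something nice",
]


def paragraphs_before(tokens):
    """Old behaviour: append to a string buffer and flush on every line break."""
    paragraphs, buffer = [], ""
    for tok in tokens:
        if tok is LINE_BREAK:
            if buffer.strip():
                paragraphs.append(buffer.strip())
            buffer = ""
        else:
            buffer += tok
    if buffer.strip():
        paragraphs.append(buffer.strip())
    return paragraphs


def paragraphs_after(tokens):
    """New behaviour: collect fragments and join them with a space on flush."""
    inline_texts = [tok for tok in tokens if tok is not LINE_BREAK]
    txt = " ".join(inline_texts)
    return [txt] if txt else []


print(len(paragraphs_before(tokens)))  # one paragraph per source line
print(paragraphs_after(tokens))        # a single joined paragraph
```

Under these assumptions the old path yields one paragraph per source line, while the new path yields a single paragraph, which matches the `duck.md` input and its ground truth in this commit.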
52 changes: 52 additions & 0 deletions tests/data/groundtruth/docling_v2/duck.md.md
@@ -0,0 +1,52 @@
Summer activities

# Swimming in the lake

Duck

Figure 1: This is a cute duckling

## Let’s swim!

To get started with swimming, first lay down in a water and try not to drown:

- You can relax and look around
- Paddle about
- Enjoy summer warmth

Also, don’t forget:

- Wear sunglasses
- Don’t forget to drink water
- Use sun cream

Hmm, what else…

## Let’s eat

After we had a good day of swimming in the lake, it’s important to eat something nice

I like to eat leaves

Here are some interesting things a respectful duck could eat:

| | Food | Calories per portion |
|---------|----------------------------------|------------------------|
| Leaves | Ash, Elm, Maple | 50 |
| Berries | Blueberry, Strawberry, Cranberry | 150 |
| Grain | Corn, Buckwheat, Barley | 200 |

And let’s add another list in the end:

- Leaves
- Berries
- Grain

And here my listing in code:

```
Leaves
Berries
Grain
```
23 changes: 23 additions & 0 deletions tests/data/groundtruth/docling_v2/wiki.md.md
@@ -0,0 +1,23 @@
# IBM

International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.

It is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Average.

IBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, having held the record for most annual U.S. patents generated by a business for 29 consecutive years from 1993 to 2021.

IBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed "International Business Machines" in 1924 and soon became the leading manufacturer of punch-card tabulating systems. During the 1960s and 1970s, the IBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and 70 percent of computers worldwide.[11]

IBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, — its DOS software provided by Microsoft, — which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad. Since the 1990s, IBM has concentrated on computer services, software, supercomputers, and scientific research; it sold its microcomputer division to Lenovo in 2005. IBM continues to develop mainframes, and its supercomputers have consistently ranked among the most powerful in the world in the 21st century.

As one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database, the SQL programming language, and the UPC barcode. The company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure.[13][14][15] IBM employees and alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing Awards.[16]

## 1910s–1950s

IBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889);[19] and Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16, 1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the Computing-Tabulating-Recording Company (CTR) based in Endicott, New York.[1][21] The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York; Dayton, Ohio; Detroit, Michigan; Washington, D.C.; and Toronto, Canada.[22]

Collectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson, called on Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later, was made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business practices, Watson proceeded to put the stamp of NCR onto CTR's companies.[23]: 105  He implemented sales conventions, "generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker".[25][26] His favorite slogan, "THINK", became a mantra for each company's employees.[25] During Watson's first four years, revenues reached $9 million ($158 million today) and the company's operations expanded to Europe, South America, Asia and Australia.[25] Watson never liked the clumsy hyphenated name "Computing-Tabulating-Recording Company" and chose to replace it with the more expansive title "International Business Machines" which had previously been used as the name of CTR's Canadian Division;[27] the name was changed on February 14, 1924.[28] By 1933, most of the subsidiaries had been merged into one company, IBM.

## 1960s–1980s

In 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.
56 changes: 56 additions & 0 deletions tests/data/md/duck.md
@@ -0,0 +1,56 @@
Summer activities

# Swimming in the lake

Duck


Figure 1: This is a cute duckling

## Let’s swim!

To get started with swimming, first lay down in a water and try not to drown:

- You can relax and look around
- Paddle about
- Enjoy summer warmth

Also, don’t forget:

- Wear sunglasses
- Don’t forget to drink water
- Use sun cream

Hmm, what else…

## Let’s eat

After we had a good day of swimming in the lake,
it’s important to eat
something nice

I like to eat leaves


Here are some interesting things a respectful duck could eat:

| | Food | Calories per portion |
|---------|----------------------------------|------------------------|
| Leaves | Ash, Elm, Maple | 50 |
| Berries | Blueberry, Strawberry, Cranberry | 150 |
| Grain | Corn, Buckwheat, Barley | 200 |

And let’s add another list in the end:

- Leaves
- Berries
- Grain

And here my listing in code:

```
Leaves
Berries
Grain
```
35 changes: 35 additions & 0 deletions tests/test_backend_markdown.py
@@ -0,0 +1,35 @@
from pathlib import Path

from docling.backend.md_backend import MarkdownDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.document import InputDocument


def test_convert_valid():
    fmt = InputFormat.MD
    cls = MarkdownDocumentBackend

    test_data_path = Path("tests") / "data"
    relevant_paths = sorted((test_data_path / "md").rglob("*.md"))
    assert len(relevant_paths) > 0

    for in_path in relevant_paths:
        gt_path = test_data_path / "groundtruth" / "docling_v2" / f"{in_path.name}.md"

        in_doc = InputDocument(
            path_or_stream=in_path,
            format=fmt,
            backend=cls,
        )
        backend = cls(
            in_doc=in_doc,
            path_or_stream=in_path,
        )
        assert backend.is_valid()

        act_doc = backend.convert()
        act_data = act_doc.export_to_markdown()

        with open(gt_path, "r", encoding="utf-8") as f:
            exp_data = f.read().rstrip()
        assert act_data == exp_data
