Using Docling Library to create Training Dataset #807

mehfuzh · 2025-01-25T15:59:51Z

Hello,

First of all, good work on the library. I have migrated our existing document parser to the IBM Docling library so that a uniform data structure handles DOCX, PDF, and other file types. I use this information to build a dataset that is then used both for RAG and for building an SLM.

You can take a look at the project here:

Web:
https://smartloop.ai

Command Line Interface:
https://github.com/smartloop-ai/smartloop

Happy to show a demo. I also have a question: in addition to items such as SectionHeaderItem and TableItem, I want to parse images and either embed them in the processed output or save them to disk or blob storage, and then reference them as metadata in the vector DB and the training dataset. Is there a best practice I should follow?

Here is a code snippet for the DocX parser:

    from typing import List

    import numpy as np
    from docling_core.types.doc import DoclingDocument, SectionHeaderItem, TableItem, TextItem
    from tabulate import tabulate

    # Method of the parser class; Paragraph and self.get_text_items are project helpers not shown here.
    def process_items(self, document: DoclingDocument, items: List[TextItem]) -> List[Paragraph]:
        paragraphs = []

        for item in items:
            if item is None:
                continue

            if isinstance(item, SectionHeaderItem):
                page_no = item.prov[0].page_no if item.prov else 0
                page_ref = f"document:{document.origin.filename}:page_no:{page_no}"

                # Markdown-style header, e.g. "## Title" for level 2
                header = f"{'#' * item.level} {item.text}"
                paragraphs.append(Paragraph(f"{page_ref}\n{header}", True))

                # Child paragraphs follow their header
                if item.children:
                    children = self.get_text_items(document, item.children)
                    paragraphs.extend(self.process_items(document, children))
            elif isinstance(item, TableItem):
                cells = item.data.table_cells
                # Reshape the flat cell list into rows x cols (assumes no merged cells)
                arr = np.array([cell.text for cell in cells])
                arr.resize(item.data.num_rows, item.data.num_cols)
                headers = 'firstrow' if cells and cells[0].column_header else ()
                paragraphs.append(Paragraph(tabulate(arr, headers=headers), True))
            elif isinstance(item, TextItem):
                if item.children:
                    children = self.get_text_items(document, item.children)
                    # process_items already returns Paragraph objects, so do not wrap them again
                    paragraphs.extend(self.process_items(document, children))
                elif item.text:
                    paragraphs.append(Paragraph(item.text, False))

        return paragraphs

Regards
Mehfuz Hossain
Co-founder | smartloop.ai

We (L) open source

dolfim-ibm · 2025-01-27T09:19:11Z

Maintainer

I don't fully understand the format you are trying to achieve, but here are a few pointers which could help.

Iterating the document
We usually use doc.iterate_items(), which takes care of the hierarchy.
https://ds4sd.github.io/docling/reference/docling_document/#docling_core.types.doc.DoclingDocument.iterate_items

The export-to-Markdown and export-to-HTML methods could be useful examples.
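
As a rough sketch of that traversal: iterate_items() yields (item, level) pairs in reading order, so the recursion over children can collapse into one flat loop. The _Stub* classes and the to_records helper below are hypothetical stand-ins so the sketch runs without Docling installed; with a real DoclingDocument the loop body should work the same way, assuming items expose text and prov as in the snippet from the question.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Tuple


@dataclass
class _StubProv:
    """Hypothetical stand-in for a provenance entry (only page_no is used)."""
    page_no: int


@dataclass
class _StubItem:
    """Hypothetical stand-in for a text-bearing document item."""
    text: str
    prov: List[_StubProv] = field(default_factory=list)


@dataclass
class _StubDoc:
    """Hypothetical stand-in exposing iterate_items() like DoclingDocument."""
    entries: List[Tuple[_StubItem, int]]

    def iterate_items(self) -> Iterator[Tuple[_StubItem, int]]:
        yield from self.entries


def to_records(doc: Any, filename: str) -> List[Dict[str, Any]]:
    """Flatten a document into one record per non-empty text item."""
    records = []
    for item, level in doc.iterate_items():
        text = getattr(item, "text", "")
        if not text:
            continue  # skip items that carry no text
        prov = getattr(item, "prov", [])
        page_no = prov[0].page_no if prov else 0
        records.append({
            "source": f"document:{filename}:page_no:{page_no}",
            "depth": level,  # nesting depth in the document tree
            "text": text,
        })
    return records


doc = _StubDoc([
    (_StubItem("Introduction", [_StubProv(1)]), 1),
    (_StubItem("Docling parses many formats.", [_StubProv(1)]), 2),
    (_StubItem("", []), 2),
])
records = to_records(doc, "sample.pdf")
```

Each record keeps the page reference next to the text, so it can be written to a vector DB or a training dataset without a second pass.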

Tables
For processing tables, I would suggest:

item.export_to_dataframe() which generates a pandas dataframe
item.data.grid which is a 2d grid of the table.
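
To make the second option concrete, here is a minimal sketch that turns a 2D grid of cells into a Markdown table using only each cell's text. _StubCell and grid_to_markdown are hypothetical names used for illustration; with Docling the grid argument would be item.data.grid, and treating the first row as the header is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class _StubCell:
    """Hypothetical stand-in for a table cell (only text is used)."""
    text: str


def grid_to_markdown(grid: Sequence[Sequence[_StubCell]]) -> str:
    """Render a row-major grid of cells as a Markdown table."""
    # Escape pipes so cell content cannot break the table layout
    rows = [[cell.text.replace("|", "\\|").strip() for cell in row] for row in grid]
    if not rows:
        return ""
    header, body = rows[0], rows[1:]  # assumption: first row is the header
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


grid = [
    [_StubCell("Name"), _StubCell("Pages")],
    [_StubCell("report.pdf"), _StubCell("12")],
]
markdown = grid_to_markdown(grid)
print(markdown)
```

Because the grid is already organised as rows and columns, no reshaping of a flat cell list is needed.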

Image of components
We are currently building around the idea that the output DoclingDocument contains the images of the pages, from which the image of any document item can be cropped.

For example with the following code snippet (from the figures export example)

element.get_image(conv_res.document)
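
One possible way to apply this to the metadata question, offered as a sketch rather than an established best practice: save each cropped image to disk and keep only its path in the vector DB or dataset record. save_item_image, _StubImage, and _StubPicture are hypothetical; the only assumptions about the real API are that get_image(doc) returns a PIL-style image with a save() method, or None when no image is available.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional


class _StubImage:
    """Hypothetical stand-in for a PIL image; writes placeholder bytes."""

    def save(self, path, format=None):
        Path(path).write_bytes(b"not-a-real-png")


class _StubPicture:
    """Hypothetical stand-in for a document item that can render itself."""

    def __init__(self, image: Optional[_StubImage]):
        self._image = image

    def get_image(self, doc: Any) -> Optional[_StubImage]:
        return self._image


def save_item_image(item: Any, doc: Any, out_dir: Path, name: str) -> Optional[Dict[str, str]]:
    """Save the item's image as PNG and return metadata pointing at the file."""
    image = item.get_image(doc)
    if image is None:
        return None  # nothing to store for this item
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}.png"
    image.save(path, "PNG")
    return {"type": "image", "image_path": str(path)}


out_dir = Path(tempfile.mkdtemp()) / "images"
saved = save_item_image(_StubPicture(_StubImage()), None, out_dir, "picture-1")
skipped = save_item_image(_StubPicture(None), None, out_dir, "picture-2")
```

Storing a path (or blob URL) instead of the image bytes keeps the records small. The None check matters because an item may have no image if image generation was not enabled during conversion.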

Creating training data
We actually have some workstreams on creating training data, also in collaboration with Hugging Face, where we try to leverage the DoclingDocument output format more and more.
