Using Docling Library to create Training Dataset #807
Replies: 1 comment
-
I don't understand exactly what format you are looking to achieve, but here are a few pointers which could help you.
Iterating the document
The export to markdown or html methods could be a useful example.
Tables
Image of components
For example, with the following code snippet (from the figures export example):
element.get_image(conv_res.document)
Creating training data
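As a rough sketch of how the pointers above could be combined into training records (this is not taken from the Docling examples): the helper below walks the document with iterate_items(), saves an image for picture and table items via get_image(), and stores the file path as metadata next to any text. The function name collect_records, the record fields, and the file naming are hypothetical choices for illustration. It also assumes the conversion was configured to keep images, otherwise get_image() may return None.

```python
from pathlib import Path


def collect_records(document, out_dir, image_types=("PictureItem", "TableItem")):
    """Build one dict per document item; save images for selected item types.

    `document` is assumed to behave like a DoclingDocument: it exposes
    iterate_items(), which yields (element, level) pairs, and its picture and
    table items expose get_image(document) returning a PIL-style image or None.
    Everything else here (field names, file layout) is a hypothetical design.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    records = []

    for idx, (element, level) in enumerate(document.iterate_items()):
        kind = type(element).__name__
        record = {"index": idx, "level": level, "type": kind}

        # Text-bearing items carry a `text` attribute; skip it when absent or empty.
        text = getattr(element, "text", None)
        if text:
            record["text"] = text

        # Only save images for the item types we care about.
        if kind in image_types and callable(getattr(element, "get_image", None)):
            image = element.get_image(document)
            if image is not None:
                path = out_dir / f"item-{idx}.png"
                image.save(path, "PNG")
                # The path can later be stored as metadata in a vector DB
                # or referenced from a training example.
                record["image_path"] = str(path)

        records.append(record)

    return records
```

With a real conversion result this would be called as collect_records(conv_res.document, "output/images"). Keeping only the path in each record, rather than the image bytes, keeps the records small and lets the same file be referenced from both the vector DB metadata and the training dataset.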
-
Hello-
First of all, good work on the library. I have migrated our existing document parser over to the IBM Docling library to use a uniform data structure for processing Docx, PDF and other file types. I use this information to build a dataset that is then used both for RAG and for building out an SLM model.
You can take a look at the project here
Web:
https://smartloop.ai
Command Line Interface:
https://github.com/smartloop-ai/smartloop
Happy to show a demo. However, here is my other question: like SectionItem, TableItem, etc., I want to parse and embed images in the processed output, or save them to disk or a blob store and then use them as metadata in the vector DB and training dataset. Is there a best practice that I should follow? Here is a code snippet for the DocX parser:
Regards
Mehfuz Hossain
Co-founder | smartloop.ai
We ♥ open source