Post processing improvements #28

mllife · 2024-09-18T11:22:07Z

Thanks for all your work.
In some of the tables, I see that not all tokens are assigned to cell text. I think this can be handled in post-processing by making sure that every token within the table bounding box is assigned to some cell (row/column). Also, the cells in a row are sometimes not aligned. I think this can be fixed by checking the Xmin of each cell within the row, basically to keep everything parallel. Can you please look into these cases? I think the first one should be an obvious one.

Sorry, I would have shared examples to check, but the documents I have are sensitive. I will try to find similar examples and share them if possible.

maxmnemonic · 2024-09-18T14:53:05Z

Hey @mllife! Thanks for the note. As you can see here: matching_post_processor.py
we already have quite involved post-processing with cell matching / massaging / orphan-picking etc.

In the output, each cell bounding box encompasses the content located in that cell, so coordinate-wise, cells in the same line might not always match pixel-wise.

If you could provide some open examples so we can understand the problem, that would help. Alternatively, feel free to modify the code and make a PR, so we can also run it on the vast collection of tables that we have.

Thanks again!

mllife · 2024-09-19T07:21:44Z

I will try to create some similar artificial examples and share them with you. I guess I also need to improve the preprocessing steps on my side, as I wrote my own backbone parser with pymupdf to integrate with your code and am using low-resolution images as input. Can you share whether increasing the input page image resolution can help?

maxmnemonic · 2024-09-19T07:39:46Z

Hey @mllife, increasing resolution certainly can help. We noticed a bump in accuracy when increasing resolution from 72 dpi to 150 dpi, but anything above that doesn't help.

mllife · 2024-09-19T10:15:49Z

Thanks for the help. I will try it out and update here.

mllife · 2024-10-02T05:00:08Z

I updated my code to receive high-dpi input and mapped the tokens accordingly, and I see some improvement. I still see the model randomly struggling when tables have big cells with a lot of content in a single cell. Hopefully, you will add some checks that all the tokens inside a table have been assigned to some cell.
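For anyone doing the same, the token mapping is roughly the following. This is only a sketch under my own assumptions: the token boxes start out in PDF points (72 per inch), the page image is rendered at 150 dpi, and `scale_bbox` is a hypothetical helper name, not something from the library.

```python
def scale_bbox(bbox, from_dpi, to_dpi):
    """Rescale an (x0, y0, x1, y1) box from one resolution to another.

    Multiplying before dividing keeps common cases such as 72 -> 150 exact.
    """
    return tuple(v * to_dpi / from_dpi for v in bbox)


# A word box in PDF points (72 units per inch), assumed input.
word_box_pt = (72.0, 36.0, 144.0, 72.0)

# The same box in pixel coordinates of a page image rendered at 150 dpi.
word_box_px = scale_bbox(word_box_pt, from_dpi=72, to_dpi=150)
```

The important part is that the tokens and the page image passed to the model use the same coordinate system; if only the image is upscaled, the token boxes no longer line up with the predicted cells.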

mllife · 2024-11-18T07:27:57Z

This is the issue I am facing: DS4SD/docling#278 (missing text assignment for long cells).
