Post processing improvements #28

Open
mllife opened this issue Sep 18, 2024 · 6 comments

Comments

@mllife

mllife commented Sep 18, 2024

Thanks for all your work.
I see that in some of the tables, not all the tokens are assigned to cell text. I think this can be handled in post-processing by making sure that every token inside the table bounding box is assigned to some cell (row/column). Also, sometimes the cells in a row are not aligned. I think this can be fixed by checking the Xmin of each cell within the row, basically to keep everything parallel. Can you please look into these cases? I think the first one should be an obvious one.

Sorry, I would have shared examples to check but the documents I have are sensitive. I will try to find any similar examples and share if possible.
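To make the first request concrete, here is a minimal sketch of the kind of check meant above. It is not taken from the project's code: the function names, the `(x0, y0, x1, y1)` box layout, and the nearest-cell fallback are all assumptions chosen for illustration. Every token whose center lies inside the table box goes to the cell that contains it, or to the closest cell if none contains it.

```python
from typing import Dict, List, Tuple

# Hypothetical box layout: (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
BBox = Tuple[float, float, float, float]


def center(b: BBox) -> Tuple[float, float]:
    """Return the center point of a box."""
    return ((b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0)


def contains(b: BBox, x: float, y: float) -> bool:
    """True if the point (x, y) lies inside or on the border of the box."""
    return b[0] <= x <= b[2] and b[1] <= y <= b[3]


def distance_to_box(b: BBox, x: float, y: float) -> float:
    """Euclidean distance from a point to a box (0.0 if the point is inside)."""
    dx = max(b[0] - x, 0.0, x - b[2])
    dy = max(b[1] - y, 0.0, y - b[3])
    return (dx * dx + dy * dy) ** 0.5


def assign_tokens(
    table: BBox, cells: Dict[str, BBox], tokens: Dict[str, BBox]
) -> Dict[str, List[str]]:
    """Assign every token centered inside the table to exactly one cell."""
    assigned: Dict[str, List[str]] = {cid: [] for cid in cells}
    if not cells:
        return assigned
    for tid, tbox in tokens.items():
        x, y = center(tbox)
        if not contains(table, x, y):
            continue  # token lies outside the table, leave it alone
        # A containing cell has distance 0.0, so it wins over any other cell.
        best = min(cells, key=lambda cid: distance_to_box(cells[cid], x, y))
        assigned[best].append(tid)
    return assigned
```

With this rule no token inside the table stays unassigned, at the cost of occasionally attaching a stray token to a neighbouring cell.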

@maxmnemonic
Contributor

Hey @mllife! Thanks for the note. As you can see in matching_post_processor.py, we already have quite involved post-processing with cell matching / massaging / orphan-picking etc.

In the output, each cell bounding box encompasses only the content located in that cell, so coordinate-wise, cells in the same line might not always match pixel-wise.
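For readers who need aligned boxes downstream, one possible workaround is to snap every cell in a column to that column's widest horizontal extent. The sketch below is not part of the library: the `col` and `bbox` keys are a hypothetical cell representation, and it assumes no cell spans several columns.

```python
from typing import Dict, List, Tuple

# Hypothetical box layout: (x0, y0, x1, y1).
BBox = Tuple[float, float, float, float]


def align_columns(cells: List[dict]) -> List[dict]:
    """Give all cells of a column the same x0/x1, keeping y0/y1 untouched.

    Each cell is a dict with an integer 'col' index and a 'bbox' tuple.
    """
    # First pass: collect the leftmost x0 and rightmost x1 per column.
    extents: Dict[int, Tuple[float, float]] = {}
    for cell in cells:
        x0, _, x1, _ = cell["bbox"]
        lo, hi = extents.get(cell["col"], (x0, x1))
        extents[cell["col"]] = (min(lo, x0), max(hi, x1))

    # Second pass: rebuild each cell with the shared column extent.
    aligned = []
    for cell in cells:
        _, y0, _, y1 = cell["bbox"]
        lo, hi = extents[cell["col"]]
        aligned.append({**cell, "bbox": (lo, y0, hi, y1)})
    return aligned
```

The input list is left unmodified, so the content-tight boxes remain available alongside the aligned ones.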

If you could provide some open examples, that would help us understand the problem. Alternatively, feel free to modify the code and make a PR, so we can also run it on the vast collection of tables that we have.

Thanks again!

@mllife
Author

mllife commented Sep 19, 2024

I will try to create some similar artificial examples and share them with you. I guess I also need to improve the preprocessing steps on my side, as I wrote my own backbone parser with pymupdf to integrate with your code and I am using low-resolution images as input. Can you tell me whether increasing the input page image resolution can help?

@maxmnemonic
Contributor

Hey @mllife, increasing resolution certainly can help. We noticed a bump in accuracy when we increased the resolution from 72 dpi to 150 dpi, but anything above that doesn't help.

@mllife
Author

mllife commented Sep 19, 2024

Thanks for the help. I will try it out and update here.

@mllife
Author

mllife commented Oct 2, 2024

I updated my code to accept high-dpi input and mapped the tokens accordingly, and I see some improvement. I still see the model struggling, seemingly at random, when a table has big cells with a lot of content in a single cell. Hopefully you will add some checks that all the tokens inside a table have been assigned to some cell.
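For anyone doing the same mapping, a minimal sketch of the coordinate scaling is below. It assumes the token boxes come from the PDF in points (72 per inch) with the same origin and axis direction as the rendered image; the helper names are hypothetical and not part of pymupdf or this project.

```python
from typing import Tuple

# Hypothetical box layout: (x0, y0, x1, y1).
BBox = Tuple[float, float, float, float]

PDF_POINTS_PER_INCH = 72.0


def zoom_for_dpi(dpi: float) -> float:
    """Zoom factor that renders a page at the requested dpi.

    With PyMuPDF this factor is typically passed as
    page.get_pixmap(matrix=fitz.Matrix(zoom, zoom)).
    """
    return dpi / PDF_POINTS_PER_INCH


def scale_bbox(bbox: BBox, dpi: float) -> BBox:
    """Map a token box from PDF points to pixel coordinates at `dpi`."""
    x0, y0, x1, y1 = bbox
    # Multiply before dividing so that typical inputs stay exact.
    return (
        x0 * dpi / PDF_POINTS_PER_INCH,
        y0 * dpi / PDF_POINTS_PER_INCH,
        x1 * dpi / PDF_POINTS_PER_INCH,
        y1 * dpi / PDF_POINTS_PER_INCH,
    )
```

At 150 dpi a token at (72, 36, 144, 72) points lands at (150, 75, 300, 150) pixels, so the page image and the token boxes stay in the same coordinate system.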

@mllife
Author

mllife commented Nov 18, 2024

This is the issue I am facing: DS4SD/docling#278 (missing text assignment for long cells).
