Ignoring Images When Converting from PDF to MD #794

Open
sallahbaksh opened this issue Jan 23, 2025 · 3 comments
Labels
PDF parsing, question (Further information is requested)

Comments

@sallahbaksh

Question

Is there a way to ignore images when converting from PDF to Markdown? If a PDF contains many images, the conversion process becomes very slow, sometimes taking over an hour. Any guidance on optimizing this or skipping images would be greatly appreciated.

sallahbaksh added the question (Further information is requested) label on Jan 23, 2025
@PeterStaar-IBM
Contributor

Can you give us an example?

@sallahbaksh
Author

sallahbaksh commented Jan 27, 2025

I've attached a PDF that takes over an hour to convert to Markdown:
Whitestown-UDO-Adopted-2020-06-12_Amended-November-2023 1.pdf

@PeterStaar-IBM
Contributor

@sallahbaksh Thanks a lot. Let me investigate, but at first glance it looks like the model gets confused by the page furniture (left and right) and starts to interpret everything as a table, which is what makes the conversion slow.

I think we can use this example to make the layout model more robust. Let us work on that!
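
One way to check a diagnosis like this is to time the conversion page by page and see whether the slow pages are the ones with heavy furniture on the left and right. The following is a minimal sketch under stated assumptions, not part of any converter's API: `convert_page` is a hypothetical callable that you would supply yourself to convert a single page to Markdown with your tool of choice, and `time_pages` and `slowest_pages` are illustrative helper names.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple


def time_pages(
    convert_page: Callable[[int], str],
    page_numbers: Iterable[int],
) -> Dict[int, float]:
    """Return the seconds spent converting each page, keyed by page number."""
    timings: Dict[int, float] = {}
    for page in page_numbers:
        start = time.perf_counter()
        convert_page(page)  # hypothetical: converts one page to Markdown
        timings[page] = time.perf_counter() - start
    return timings


def slowest_pages(
    timings: Dict[int, float], top: int = 5
) -> List[Tuple[int, float]]:
    """Return up to `top` (page, seconds) pairs, slowest first."""
    ranked = sorted(timings.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]


if __name__ == "__main__":
    # Stand-in converter for demonstration only: page 3 is artificially slow.
    def fake_convert(page: int) -> str:
        time.sleep(0.05 if page == 3 else 0.0)
        return f"# Page {page}\n"

    for page, seconds in slowest_pages(time_pages(fake_convert, range(1, 6)), top=3):
        print(f"page {page}: {seconds:.3f}s")
```

If a handful of pages dominate the total time, attaching just those pages to the issue may make the problem easier for maintainers to reproduce.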

2 participants