Ignoring Images When Converting from PDF to MD #794

sallahbaksh · 2025-01-23T18:50:19Z

Question

Is there a way to ignore images when converting from PDF to Markdown? If a PDF contains many images, the conversion process becomes very slow, sometimes taking over an hour. Any guidance on optimizing this or skipping images would be greatly appreciated.

PeterStaar-IBM · 2025-01-26T07:31:43Z

Can you give us an example?

sallahbaksh · 2025-01-27T16:35:12Z

I've attached a PDF that takes over an hour to convert to Markdown:
Whitestown-UDO-Adopted-2020-06-12_Amended-November-2023 1.pdf

PeterStaar-IBM · 2025-01-28T06:42:47Z

@sallahbaksh Thanks a lot, let me do some investigation. At first glance, it looks like the model gets confused by the page furniture (left and right) and starts to interpret everything as a table, which makes the conversion slow.

I think that with this example, we can make the layout model more robust. Let us work on that!
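
In the meantime, if the goal is simply image-free Markdown, one possible workaround is to post-process the converted output. The following is a minimal sketch under stated assumptions: `strip_images` is a hypothetical helper, not part of any converter's API, and the `<!-- image -->` placeholder pattern is an assumption about what the converter emits in place of a picture. It only cleans the output and does not shorten the conversion itself, since per the diagnosis above the time appears to be spent on layout and table analysis rather than on the images.

```python
import re

# Inline Markdown image: ![alt](target), including base64 data URIs.
_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
# Assumption: the converter may leave an HTML comment where a picture was.
_PLACEHOLDER = re.compile(r"<!--\s*image\s*-->")


def strip_images(markdown: str) -> str:
    """Return the Markdown text with image references and placeholders removed."""
    text = _IMAGE.sub("", markdown)
    text = _PLACEHOLDER.sub("", text)
    # Collapse the runs of blank lines left behind by the removed images.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"


if __name__ == "__main__":
    sample = "# Title\n\n![fig](a.png)\n\nBody text.\n\n<!-- image -->\n\nEnd.\n"
    print(strip_images(sample))
```

Running the function on the text of an already converted `.md` file leaves headings, paragraphs, and tables untouched and drops only the image lines.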

sallahbaksh added the question label Jan 23, 2025

PeterStaar-IBM added PDF parsing and removed PDF parsing labels Jan 28, 2025
