Support of MDX or NON-PDF formats in EXPORT only .PDF -> MDX #654

Open
aerojeyenth opened this issue Feb 20, 2025 · 8 comments
Labels
enhancement New feature or request Low priority

Comments

@aerojeyenth

Is your feature request related to a problem?

Would it be possible to support other export formats? That would be very useful in data extraction pipelines.

Describe the solution you'd like

No response

Additional context

No response

@aerojeyenth aerojeyenth added the enhancement New feature or request label Feb 20, 2025
@awwaawwa
Collaborator

This is a very long-term goal; please be patient.

@awwaawwa
Collaborator

The internal representation we currently use contains many PDF implementation details and is highly unstable, so for now it is not suitable for other data-analysis scenarios.

@aerojeyenth aerojeyenth changed the title from "Support of MDX or NON-PDF formats" to "Support of MDX or NON-PDF formats in EXPORT only .PDF -> MDX" Feb 20, 2025
@aerojeyenth
Author

> This is a very long-term goal; please be patient.

Just to be clear, I am asking only about the export format: can we export PDF to other formats such as MDX or HTML?

@awwaawwa
Collaborator

> > This is a very long-term goal; please be patient.
>
> Just to be clear, I am asking only about the export format: can we export PDF to other formats such as MDX or HTML?

No

@awwaawwa
Collaborator

Exporting PDF to other formats requires a lot of work and is not that simple.

@awwaawwa
Collaborator

For this type of task, I suggest you consider other projects. There should be many such projects available now.

The core focus of this project at the current stage is to maintain the layout while translating PDFs, rather than converting PDFs to other formats.

@awwaawwa
Collaborator

A PDF records only low-level drawing instructions of the form "draw glyphs XX using font XX at coordinates XX". It does not record high-level relationships such as which glyphs form a paragraph. Converting a PDF to other formats therefore requires layout OCR, reading-order recognition, and a lot of other work.
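
To illustrate the point, here is a minimal Python sketch of why glyph positions alone are not enough. The `Glyph` record, `place`, and `naive_text` are hypothetical names made up for this example; they are not this project's internal representation or API. The sketch rebuilds text with the simplest possible heuristic (group glyphs by y coordinate, then sort by x) and shows that it interleaves the columns of a two-column page.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record of one drawn glyph: character, font, and position."""
    char: str
    font: str
    x: float
    y: float  # PDF convention: y grows upward from the bottom of the page


def place(text, x, y, font="F1", advance=6.0):
    """Emit one glyph per character, starting at (x, y) and moving right."""
    return [Glyph(c, font, x + i * advance, y) for i, c in enumerate(text)]


def naive_text(glyphs, line_tol=2.0):
    """Group glyphs into lines by y, then read each line left to right."""
    lines = []  # list of (y, [glyphs on that line])
    for g in sorted(glyphs, key=lambda g: -g.y):  # top of the page first
        if lines and abs(lines[-1][0] - g.y) <= line_tol:
            lines[-1][1].append(g)
        else:
            lines.append((g.y, [g]))
    return "\n".join(
        "".join(g.char for g in sorted(row, key=lambda g: g.x))
        for _, row in lines
    )


# Two-column page: left column reads "ab" then "cd", right column "AB" then "CD".
page = (
    place("ab", 0, 700) + place("cd", 0, 688)
    + place("AB", 300, 700) + place("CD", 300, 688)
)

print(naive_text(page))
# abAB
# cdCD
```

The intended reading order is "ab", "cd", "AB", "CD", but the heuristic merges each row across both columns. Getting this right in general means detecting columns, tables, captions, and so on, which is the layout and reading-order work described above.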

@awwaawwa
Collaborator

This is not "just in export"; it is a massive undertaking.

2 participants