pdf broken encoding reader #522

sinkudo · 2025-04-03T14:55:39Z

Reader to extract code from PDF with complex background using information from PDF

added my code from project needed to extract text:

reader
config
h5 models
text post processing
fontforge wrapper
example pdf

added script to scripts dir to extract text using reader

edited index.html, api_args, manager_config, gitignore and requirements

NastyBoget · 2025-04-07T11:14:48Z

requirements.txt

@@ -40,3 +39,6 @@ wget==3.2
 xgbfir>=0.3.1,<1.0
 xgboost>=1.6.0,<2.0  # lower versions aren't compatible with pandas>2
 xlrd>=1.2.0,<2.0
+nltk==3.9.1
+tensorflow==2.13.0


tensorflow import should be optional similarly to torch (https://github.com/ispras/dedoc/blob/master/pyproject.toml)
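
One possible way to keep tensorflow optional is to defer the import until a model is actually loaded and to fail with a clear message otherwise. The sketch below is a generic pattern, not dedoc code: `require_optional`, `load_keras_model`, and the hint text are hypothetical names chosen for illustration.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, hint: str) -> ModuleType:
    """Import an optional dependency on demand, or raise with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as error:
        raise ImportError(f"'{module_name}' is required for this feature. {hint}") from error


def load_keras_model(path: str):
    # Hypothetical helper: tensorflow is imported only when a model is loaded,
    # so importing the surrounding package works without tensorflow installed.
    tf = require_optional("tensorflow", "Install tensorflow to use the broken-encoding reader.")
    return tf.keras.models.load_model(path)
```

With this shape, tensorflow can be left out of requirements.txt and listed as an optional extra instead; users who never select the new reader are not affected.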

NastyBoget · 2025-04-07T11:17:02Z

scripts/get_text_broken_pdf.py

+
+from dedoc.readers.pdf_reader.pdf_txtlayer_reader.pdf_broken_encoding_reader.pdf_broken_encoding_reader import PdfBrokenEncodingReader
+
+if __name__ == "__main__":


This script can be converted into a test (e.g. here: https://github.com/ispras/dedoc/blob/master/tests/unit_tests/test_format_pdf_reader.py)

NastyBoget · 2025-04-07T11:20:03Z

dedoc/api/api_args.py

@@ -24,7 +24,7 @@ class QueryParameters:
    table_type: str = Form("", description="Pipeline mode for table recognition")

    # pdf handling
-    pdf_with_text_layer: str = Form("auto_tabby", enum=["true", "false", "auto", "auto_tabby", "tabby"],
+    pdf_with_text_layer: str = Form("auto_tabby", enum=["true", "false", "auto", "auto_tabby", "tabby","bad_encoding_reader"],


An API test should be written for the new parameter: you can create a file test_pdf_bad_encoding_reader.py similar to test_api_format_pdf_with_text.py

NastyBoget · 2025-04-07T11:24:03Z

.gitignore

@@ -26,6 +26,7 @@ var/
 *.egg-info/
 .installed.cfg
 *.egg
+dedoc/readers/pdf_reader/pdf_txtlayer_reader/pdf_broken_encoding_reader/data/pdfdata


The directory should be placed in the cache ("resources_path" of the dedoc config) and downloaded if needed, as is done for other data (datasets, models) in download_models.py. That script is used in Docker to download data in advance.

For the PyPI library, readers download their data when needed, e.g. PdfAutoReader does so in pdf_auto_reader/txtlayer_classifier.py#L27
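
A minimal sketch of the download-if-needed idea, assuming a plain HTTP download. The helper name `ensure_resource` and the URL used with it are hypothetical; dedoc's own helpers in download_models.py may be organized differently.

```python
import os
import urllib.request
from typing import Callable


def ensure_resource(resources_path: str, file_name: str, url: str,
                    download: Callable[[str, str], object] = urllib.request.urlretrieve) -> str:
    """Return the local path of a resource, downloading it on first use."""
    os.makedirs(resources_path, exist_ok=True)
    local_path = os.path.join(resources_path, file_name)
    if not os.path.isfile(local_path):
        tmp_path = local_path + ".part"
        download(url, tmp_path)           # fetch into a temporary file first
        os.replace(tmp_path, local_path)  # then move it into place in one step
    return local_path
```

A reader could call such a helper in its constructor with the "resources_path" value from the config, so the data directory no longer has to live inside the package or be listed in .gitignore.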

NastyBoget · 2025-04-07T11:31:32Z

...ders/pdf_reader/pdf_txtlayer_reader/pdf_broken_encoding_reader/pdf_broken_encoding_reader.py

+
+class PdfBrokenEncodingReader(PdfBaseReader):
+    """
+    This class allows to extract content (text, tables, attachments) from the .pdf documents with a textual layer (copyable documents).


Wrong docstring (it documents another reader)

NastyBoget · 2025-04-07T11:33:15Z

...ders/pdf_reader/pdf_txtlayer_reader/pdf_broken_encoding_reader/pdf_broken_encoding_reader.py

+        self.extractor_layer = PdfminerExtractor(config=self.config)
+        self.__pdf_txtlayer_reader = PdfTxtlayerReader(config=config)
+
+    def can_read(self, file_path: Optional[str] = None, mime: Optional[str] = None, extension: Optional[str] = None,


Wrong docstring

NastyBoget · 2025-04-07T11:33:29Z

...ders/pdf_reader/pdf_txtlayer_reader/pdf_broken_encoding_reader/pdf_broken_encoding_reader.py

+        return super().can_read(file_path=file_path, mime=mime, extension=extension) and get_param_pdf_with_txt_layer(
+            parameters) == "bad_encoding_reader"
+
+    def read(self, file_path: str, parameters: Optional[dict] = None) -> UnstructuredDocument:


Docstring is needed

NastyBoget · 2025-04-07T11:35:30Z

dedoc/api/web/index.html

@@ -110,6 +110,7 @@ <h4>PDF handling</h4>
                            <option value="auto">auto</option>
                            <option value="auto_tabby" selected>auto_tabby</option>
                            <option value="tabby">tabby</option>
+                            <option value="bad_encoding_reader">bad_encoding_reader</option>


The new parameter should be added to the docs:

https://github.com/ispras/dedoc/blob/master/docs/source/dedoc_api_usage/api.rst

https://github.com/ispras/dedoc/blob/master/docs/source/parameters/pdf_handling.rst

NastyBoget · 2025-04-07T11:37:01Z

...ders/pdf_reader/pdf_txtlayer_reader/pdf_broken_encoding_reader/pdf_broken_encoding_reader.py

+WordObj = namedtuple("Word", ["start", "end", "value"])
+
+
+class PdfBrokenEncodingReader(PdfBaseReader):


The new reader should be added to the docs: https://github.com/ispras/dedoc/blob/master/docs/source/modules/readers.rst

NastyBoget · 2025-04-07T11:39:26Z

Please look at the logs of the test pipelines: they should all pass before merge

oksidgy · 2025-04-08T11:17:56Z

.../pdf_reader/pdf_txtlayer_reader/pdf_broken_encoding_reader/data/models/default_models/eng.h5

All binary files should be downloaded as needed at the reader initialization stage.

You can see how it is done in dedoc/readers/pdf_reader/pdf_image_reader/pdf_image_reader.py, in the line:

checkpoint_path=os.path.join(get_config()["resources_path"], "scan_orientation_efficient_net_b0.pth"), ...

Binary files should be downloaded to the "resources_path" directory of the config file.

For an example of initializing and downloading weights, see dedoc/readers/pdf_reader/pdf_image_reader/columns_orientation_classifier/columns_orientation_classifier.py (function _load_weights)

oksidgy · 2025-04-08T11:28:07Z

scripts/get_text_broken_pdf.py

+    args = parser.parse_args()
+    reader = PdfBrokenEncodingReader()
+    document = reader.read(args.pdf_path)
+    print(document.get_text())


You should add more than one test:

add API tests to tests/api_tests to check that your reader works correctly via the API

add a unit test to tests/unit_tests to check that your reader works correctly
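
A unit test along these lines might look like the sketch below. The import path and the "bad_encoding_reader" value come from this PR; the sample file location `EXAMPLE_PDF` is hypothetical, and the test skips itself when dedoc, the reader, or the file is unavailable.

```python
import importlib.util
import os
import unittest

# Hypothetical location of a sample file; adjust to the repository's test data layout
EXAMPLE_PDF = os.path.join("tests", "data", "pdf_with_text_layer", "broken_encoding.pdf")


@unittest.skipUnless(importlib.util.find_spec("dedoc"), "dedoc is not installed")
class TestPdfBrokenEncodingReader(unittest.TestCase):

    def test_read_returns_text(self) -> None:
        try:
            from dedoc.readers.pdf_reader.pdf_txtlayer_reader.pdf_broken_encoding_reader.pdf_broken_encoding_reader import (
                PdfBrokenEncodingReader,
            )
        except ImportError:
            self.skipTest("PdfBrokenEncodingReader is not available")
        if not os.path.isfile(EXAMPLE_PDF):
            self.skipTest("example PDF not found")

        reader = PdfBrokenEncodingReader()
        document = reader.read(EXAMPLE_PDF, parameters={"pdf_with_text_layer": "bad_encoding_reader"})
        # The document's lines should contain some non-whitespace text
        text = "".join(line.line for line in document.lines)
        self.assertTrue(text.strip())
```

A stronger test would compare the extracted text with a reference .txt file stored next to the sample PDF, which also guards against regressions in the correction step.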

oksidgy · 2025-04-08T11:32:09Z

...ders/pdf_reader/pdf_txtlayer_reader/pdf_broken_encoding_reader/pdf_broken_encoding_reader.py

+
+
+class PdfBrokenEncodingReader(PdfBaseReader):
+    """


Please add a detailed description of the functionality of your reader

oksidgy · 2025-04-08T11:40:35Z

...ders/pdf_reader/pdf_txtlayer_reader/pdf_broken_encoding_reader/pdf_broken_encoding_reader.py

+        except Exception as e:
+            raise Exception(f"some problem occured: {e}")
+        pages, layouts = reader.get_correct_layout(file_path)
+        tables = []


'tables' unused

oksidgy · 2025-04-08T11:45:13Z

...ders/pdf_reader/pdf_txtlayer_reader/pdf_broken_encoding_reader/pdf_broken_encoding_reader.py

+            image_page = cv2.cvtColor(image_page, cv2.COLOR_GRAY2BGR)
+        return image_page
+
+    def __debug_extract_layout(self, image_src: np.ndarray, layout: LTContainer, page_num: int, k_w: float, k_h: float,


This duplicates code from dedoc/readers/pdf_reader/pdf_txtlayer_reader/pdfminer_reader/pdfminer_extractor.py

oksidgy · 2025-04-08T11:45:27Z

...ders/pdf_reader/pdf_txtlayer_reader/pdf_broken_encoding_reader/pdf_broken_encoding_reader.py

+        cv2.imwrite(os.path.join(tmp_dir, f"img_page_{page_num}.png"), image_src)
+        file_text.close()
+
+    def __extract_image(self,


This duplicates code from dedoc/readers/pdf_reader/pdf_txtlayer_reader/pdfminer_reader/pdfminer_extractor.py

oksidgy · 2025-04-08T11:45:32Z

...ders/pdf_reader/pdf_txtlayer_reader/pdf_broken_encoding_reader/pdf_broken_encoding_reader.py

+
+        return text_with_bbox
+
+    def __get_line_annotations(self, lobj: LTTextLineHorizontal, height: int, width: int) -> Tuple[


This duplicates code from dedoc/readers/pdf_reader/pdf_txtlayer_reader/pdfminer_reader/pdfminer_extractor.py

oksidgy · 2025-04-08T11:45:37Z

...ders/pdf_reader/pdf_txtlayer_reader/pdf_broken_encoding_reader/pdf_broken_encoding_reader.py

+    def _get_new_weight(self) -> str:
+        return binascii.hexlify(os.urandom(8)).decode("ascii")
+
+    def __parse_style_string(self, chars_with_meta: str, begin: int, end: int) -> List[Annotation]:


This duplicates code from dedoc/readers/pdf_reader/pdf_txtlayer_reader/pdfminer_reader/pdfminer_extractor.py

oksidgy · 2025-04-08T11:51:38Z

...ders/pdf_reader/pdf_txtlayer_reader/pdf_broken_encoding_reader/pdf_broken_encoding_reader.py

+                                                                                      call_classifier=False))
+        return lines, tables, page.attachments, []
+
+    def __handle_page(self, page: PDFPage, page_number: int, path: str,


This duplicates code from dedoc/readers/pdf_reader/pdf_txtlayer_reader/pdfminer_reader/pdfminer_extractor.py, with the only difference that you pass your own layout. Try to reuse the code of pdfminer_extractor.py with small changes to it.

For example, in dedoc/readers/pdf_reader/pdf_txtlayer_reader/pdfminer_reader/pdfminer_extractor.py:

make the function public: def __handle_page(...) -> def handle_page(...)

add a small change to the code:

def handle_page(self, page: PDFPage, page_number: int, path: str, parameters: ParametersForParseDoc, layout: Optional[LTPage] = None) -> PageWithBBox:
    ...
    if not layout:
        layout = device.get_result()
    ...

With these small changes you will get rid of 300 lines of duplicate code!
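
The suggestion can be illustrated with a reduced sketch. All classes here are stand-ins written for this example (`Layout` for pdfminer's LTPage, `Extractor` for PdfminerExtractor, `BrokenEncodingReader` for the new reader); only the optional `layout` parameter mirrors the proposal above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Layout:
    """Stand-in for pdfminer's LTPage: just a list of text lines."""
    lines: List[str]


class Extractor:
    """Stand-in for PdfminerExtractor, reduced to the part relevant here."""

    def __init__(self, compute_layout: Callable[[int], Layout]) -> None:
        self._compute_layout = compute_layout

    def handle_page(self, page_number: int, layout: Optional[Layout] = None) -> List[str]:
        # Fall back to the extractor's own layout only when the caller did not pass one
        if layout is None:
            layout = self._compute_layout(page_number)
        return [line.strip() for line in layout.lines]


class BrokenEncodingReader:
    """Reuses the extractor instead of copying its page-handling code."""

    def __init__(self, extractor: Extractor) -> None:
        self._extractor = extractor

    def read_page(self, page_number: int, corrected: Layout) -> List[str]:
        return self._extractor.handle_page(page_number, layout=corrected)
```

Renaming the method matters because a name with two leading underscores is mangled to include the class name, which makes it awkward to call from another class. The sketch checks `layout is None` rather than `not layout`, so a layout object that happens to evaluate as false is still respected.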

oksidgy · 2025-04-08T11:56:35Z

...s/pdf_reader/pdf_txtlayer_reader/pdf_broken_encoding_reader/pdf_worker/pdf_text_correcter.py

+onlyRus = ['я', 'й', 'ц', 'б', 'ж', 'з', 'д', 'л', 'ф', 'ш', 'щ', "ч", "ъ", "ь", "э", "ю", 'г']
+onlyEng = ['q', 'w', 'f', 'i', 'j', 'l', 'z', 's', 'v', 'g']
+
+from nltk.corpus import words


Move from nltk.corpus import words inside the function substitute_chars_by_dict

oksidgy · 2025-04-08T11:59:38Z

dedoc/readers/pdf_reader/pdf_txtlayer_reader/pdf_broken_encoding_reader/model.py

+from dedoc.readers.pdf_reader.pdf_txtlayer_reader.pdf_broken_encoding_reader.functions import get_project_root
+
+
+class Model:


Docstring is needed

oksidgy · 2025-04-08T12:01:59Z

dedoc/readers/pdf_reader/pdf_txtlayer_reader/pdf_broken_encoding_reader/model.py

+import tensorflow as tf
+from keras import layers
+from keras.callbacks import TensorBoard
+from keras.models import load_model


Move all external imports inside the functions where they are used

oksidgy · 2025-04-08T12:02:22Z

dedoc/readers/pdf_reader/pdf_txtlayer_reader/pdf_broken_encoding_reader/functions.py

+
+import PIL.ImageOps
+from PIL import Image
+from pdfminer.high_level import extract_text


Move all external imports inside the functions where they are used

oksidgy · 2025-04-08T12:02:32Z

...ers/pdf_reader/pdf_txtlayer_reader/pdf_broken_encoding_reader/ffwrapper/fontforge_wrapper.py

+import warnings
+from pathlib import Path
+
+import fontforge


Move all external imports inside the functions where they are used

oksidgy · 2025-04-08T12:02:43Z

dedoc/readers/pdf_reader/pdf_txtlayer_reader/pdf_broken_encoding_reader/config.py

+import os
+from pathlib import Path
+
+from keras.models import load_model


Move all external imports inside the functions where they are used

sinkudo added 7 commits March 26, 2025 00:51

integrated it; still need to sort out imports, etc.

29454d6

changed script to extract text, but problem with get_text() remains (takes too much time), changed imports

d3ef3cf

adding reader to manager, cleaning comments

62ec1d1

added reader to api

d0179da

is pdf valid check, (cid:xxx) instead of chars fix

a3b51e7

reduntant funcs

89a320e

imports

db17824

NastyBoget self-requested a review April 7, 2025 11:10

NastyBoget assigned sinkudo Apr 7, 2025

NastyBoget added the enhancement New feature or request label Apr 7, 2025

NastyBoget reviewed Apr 7, 2025

oksidgy reviewed Apr 8, 2025

sinkudo added 8 commits April 23, 2025 03:27

tf optional import(soon will remove and replace with torch)

fad75fc

unit test

f9a877b

txt for test and remove script

78271db

reader and parameters into docs

c072785

added docstrings, removed reduntant tables var, removed dublicate of pdfminer_extractor, added function handle_page(...) to pdfminer_extractor to use in my reader

9738962

now download model to resources_path, saving pdfdata needed for extraction to tempfile

7ece585

moved external functions into functions

25d0adc

model imports

a7659dd

