Enhance Conversion Result by Specifying Default OCR Option (EasyOCR)? #792

haoshan98 · 2025-01-23T08:32:10Z

According to the documentation, EasyOCR is the default for ocr_options, so a custom conversion pipeline should work without specifying ocr_options explicitly.
However, when parsing a Chinese-language PDF, the result is not acceptable. After I tried explicitly setting ocr_options to EasyOcrOptions(), the result suddenly became very good. Is this the intended design?

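# Import paths as of early 2025; newer docling releases may expose the
# accelerator classes from docling.datamodel.accelerator_options instead.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    EasyOcrOptions,
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption


def parse_document_content(file_path: str) -> str:
    """Convert a PDF with OCR and table extraction; return it as Markdown."""
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.ocr_options.force_full_page_ocr = True
    pipeline_options.accelerator_options = AcceleratorOptions(
        num_threads=8, device=AcceleratorDevice.CUDA
    )

    # Explicitly assigning ocr_options yields a better result.
    # If this line is commented out, the result does not meet expectations.
    pipeline_options.ocr_options = EasyOcrOptions()

    doc_converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )
    conv_result = doc_converter.convert(file_path)
    return conv_result.document.export_to_markdown()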

haoshan98 · 2025-01-23T09:16:58Z

The order of the assignments made the difference. There is no need to set force_full_page_ocr = True for Chinese-language documents. Assigning pipeline_options.ocr_options = EasyOcrOptions() after setting pipeline_options.ocr_options.force_full_page_ocr = True replaced the options object, which reset force_full_page_ocr to False, and that yielded the better result.
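
To illustrate the overwrite described above, here is a minimal sketch that does not use docling at all. OcrOptionsSketch and PipelineOptionsSketch are hypothetical stand-ins for EasyOcrOptions and PdfPipelineOptions, and the False default is an assumption taken from the behavior reported in this thread. It only shows why reassigning the options object discards a flag set earlier.

```python
from dataclasses import dataclass, field


@dataclass
class OcrOptionsSketch:
    """Hypothetical stand-in for an OCR options object such as EasyOcrOptions."""

    # Assumed default, matching the behavior reported in this thread.
    force_full_page_ocr: bool = False


@dataclass
class PipelineOptionsSketch:
    """Hypothetical stand-in for PdfPipelineOptions."""

    ocr_options: OcrOptionsSketch = field(default_factory=OcrOptionsSketch)


def flag_after_overwrite() -> bool:
    """Set the flag first, then replace the options object (the order in the question)."""
    opts = PipelineOptionsSketch()
    opts.ocr_options.force_full_page_ocr = True  # mutates the current options object
    opts.ocr_options = OcrOptionsSketch()        # replaces it, so the flag is back at its default
    return opts.ocr_options.force_full_page_ocr


def flag_set_last() -> bool:
    """Replace the options object first, then set the flag so the setting survives."""
    opts = PipelineOptionsSketch()
    opts.ocr_options = OcrOptionsSketch()
    opts.ocr_options.force_full_page_ocr = True
    return opts.ocr_options.force_full_page_ocr


print(flag_after_overwrite(), flag_set_last())
```

If full-page OCR is actually wanted together with an explicit options object, the same reasoning suggests assigning ocr_options first and setting force_full_page_ocr afterwards.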

1 reply

AdityaMannu1709 Jan 28, 2025

Can you tell me more about pipeline_options.ocr_options.force_full_page_ocr = True, especially for English?
