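Based on the documentation, EasyOCR is the default OCR engine, but explicitly setting `ocr_options` gives a better result:

    # Import paths as of Docling 2.x; newer releases may expose the accelerator
    # classes from docling.datamodel.accelerator_options instead.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        AcceleratorDevice,
        AcceleratorOptions,
        EasyOcrOptions,
        PdfPipelineOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption


    def parse_document_content(file_path: str) -> str:
        pipeline_options = PdfPipelineOptions()
        pipeline_options.do_ocr = True
        pipeline_options.do_table_structure = True
        pipeline_options.table_structure_options.do_cell_matching = True
        pipeline_options.ocr_options.force_full_page_ocr = True
        pipeline_options.accelerator_options = AcceleratorOptions(
            num_threads=8, device=AcceleratorDevice.CUDA
        )
        # Explicitly specifying ocr_options yields a better result.
        # If this line is commented out, the result doesn't meet expectations.
        pipeline_options.ocr_options = EasyOcrOptions()
        doc_converter = DocumentConverter(
            format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
        )
        conv_result = doc_converter.convert(file_path)
        # export_to_markdown() returns the Markdown text, so the function returns str.
        return conv_result.document.export_to_markdown()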
The order of the settings made a difference. There is no need to set `force_full_page_ocr = True` for Chinese. Setting `pipeline_options.ocr_options = EasyOcrOptions()` after `pipeline_options.ocr_options.force_full_page_ocr = True` overwrote that setting and left `force_full_page_ocr = False`, which yielded a better result.
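
The overwrite described above can be reproduced without Docling. The following is a minimal sketch, assuming only what the thread reports: a freshly created options object has `force_full_page_ocr` set to `False`. The classes `OcrOptionsSketch` and `PipelineOptionsSketch` are hypothetical stand-ins for illustration, not Docling's actual classes.

```python
from dataclasses import dataclass, field


@dataclass
class OcrOptionsSketch:
    """Hypothetical stand-in for an OCR options object (not Docling's class)."""
    # Assumed default, matching the behaviour reported in the thread.
    force_full_page_ocr: bool = False


@dataclass
class PipelineOptionsSketch:
    """Hypothetical stand-in for pipeline options that hold an OCR options object."""
    ocr_options: OcrOptionsSketch = field(default_factory=OcrOptionsSketch)


def flag_then_replace() -> bool:
    """Set the flag first, then replace the whole options object."""
    opts = PipelineOptionsSketch()
    opts.ocr_options.force_full_page_ocr = True
    # Assigning a fresh object discards the flag set on the old one.
    opts.ocr_options = OcrOptionsSketch()
    return opts.ocr_options.force_full_page_ocr


def replace_then_flag() -> bool:
    """Replace the options object first, then set the flag on the new object."""
    opts = PipelineOptionsSketch()
    opts.ocr_options = OcrOptionsSketch()
    opts.ocr_options.force_full_page_ocr = True
    return opts.ocr_options.force_full_page_ocr


if __name__ == "__main__":
    print("flag then replace:", flag_then_replace())
    print("replace then flag:", replace_then_flag())
```

If full-page OCR is actually wanted, the flag has to be set after the options object is assigned, as in `replace_then_flag`. In the original snippet the order was the other way around, so the flag ended up disabled.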