ocr-benchmarking

Testing methods for digitizing print bibliography as structured data

🚀 Installation

See src/workflow/pipeline.ipynb for prerequisites.

🔧 Usage

Before use, ensure that the data directory contains the ground truth text and JSON files and the page images, named as described under "File naming scheme" below.

Then run each cell in src/workflow/pipeline.ipynb, following the instructions in the notebook, to generate and benchmark LLM outputs.

To obtain visualizations, move benchmark results from benchmarking-results to benchmarking-results-for-visualizations and then run src/workflow/visualizations.ipynb.
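The move step can also be scripted. The following is a minimal sketch, not part of the repository: the function name move_benchmark_results is hypothetical, and it assumes the results are CSV files sitting directly inside benchmarking-results and that the script runs from the repository root.

```python
import shutil
from pathlib import Path


def move_benchmark_results(src_dir="benchmarking-results",
                           dst_dir="benchmarking-results-for-visualizations"):
    """Move every CSV file found directly in src_dir into dst_dir.

    Hypothetical helper: the directory names come from this README, but
    the flat layout of CSV files inside src_dir is an assumption.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)  # create the target if missing
    moved = []
    for csv_path in sorted(src.glob("*.csv")):
        target = dst / csv_path.name
        shutil.move(str(csv_path), str(target))
        moved.append(target)
    return moved


if __name__ == "__main__":
    for path in move_benchmark_results():
        print(f"moved {path}")
```

Files with other extensions are left where they are, so notes or partial outputs in benchmarking-results are not touched.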

You may view the visualizations used for our experiments in the output of src/workflow/visualizations.ipynb.

Directory overview

benchmarking-results: Benchmarking results CSV files generated from the pipeline.
benchmarking-results-for-visualizations: Benchmarking results used for visualizations.
config: Setup files.
data: Ground truth text and JSON files, as well as images.
project-notes: Workflow descriptions, scratch work, and other notes.
results: LLM and OCR output. Created automatically by the pipeline.
src: Source code.

File naming scheme

XYZ is a three-digit page number, zero-padded as necessary (for example, page 7 is 007).
{A,B} means either A or B.
<NAME> is a placeholder to be replaced with the actual name.

The base file name is kbaa-pXYZ followed by the file extension. Depending on the directory, it is preceded by the prefixes described below:

data

data/ground-truth/{txt,json}/gt_kbaa-pXYZ.{txt,json}
data/{pngs,tiffs}/kbaa-pXYZ.{png,tif}
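As an illustration of the data naming scheme, the sketch below builds these paths from a page number. The helper names (page_stem, ground_truth_path, image_path) are hypothetical and not taken from the repository, and the pairing of the pngs folder with .png files and the tiffs folder with .tif files is an assumption read off the pattern above.

```python
from pathlib import Path


def page_stem(page):
    """Return the base name kbaa-pXYZ for a page number (zero-padded)."""
    if not 0 <= page <= 999:
        raise ValueError("page must fit in three digits")
    return f"kbaa-p{page:03d}"


def ground_truth_path(page, fmt, root="data"):
    """Path of a ground truth file; fmt is 'txt' or 'json'."""
    if fmt not in ("txt", "json"):
        raise ValueError("fmt must be 'txt' or 'json'")
    return Path(root) / "ground-truth" / fmt / f"gt_{page_stem(page)}.{fmt}"


def image_path(page, fmt, root="data"):
    """Path of a page image; fmt is 'png' or 'tif'."""
    # Assumption: pngs/ holds .png files and tiffs/ holds .tif files.
    folders = {"png": "pngs", "tif": "tiffs"}
    if fmt not in folders:
        raise ValueError("fmt must be 'png' or 'tif'")
    return Path(root) / folders[fmt] / f"{page_stem(page)}.{fmt}"


print(ground_truth_path(7, "json").as_posix())
# data/ground-truth/json/gt_kbaa-p007.json
print(image_path(123, "tif").as_posix())
# data/tiffs/kbaa-p123.tif
```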

results

results/{llm,ocr}{img,txt}2{txt,json}/
  <MODEL_NAME>/<MODEL_NAME>_{txt,img}_kbaa-pXYZ.{txt,json}
                            ^ input format      ^ output format
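To make the result file names easier to work with, a name can be split back into its parts with a regular expression. This is a sketch under assumptions, not the pipeline's own code: parse_result_name is a hypothetical helper, "example-model" is a made-up model name, and only the file name (not the enclosing directory) is parsed.

```python
import re

# <MODEL_NAME>_{txt,img}_kbaa-pXYZ.{txt,json}
RESULT_NAME = re.compile(
    r"^(?P<model>.+)_(?P<input>txt|img)_kbaa-p(?P<page>\d{3})"
    r"\.(?P<output>txt|json)$"
)


def parse_result_name(filename):
    """Split a result file name into model, input format, page, and output format.

    Returns None when the name does not follow the scheme.
    """
    match = RESULT_NAME.match(filename)
    if match is None:
        return None
    info = match.groupdict()
    info["page"] = int(info["page"])  # "042" -> 42
    return info


print(parse_result_name("example-model_img_kbaa-p042.json"))
# {'model': 'example-model', 'input': 'img', 'page': 42, 'output': 'json'}
```

Because the model group is greedy, model names that themselves contain underscores are still captured whole.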

Other information

Other information about our project can be found in the project-notes directory.

📋 Credits

Greif et al., Multimodal LLMs for OCR, OCR-Post-Correction, and Named Entity Recognition in Historical Documents
