ocr_post_correction

This repository accompanies our TPDL 2023 paper. It is organized as follows:

Files

explore_visualize_dataset.ipynb explores character- and word-level matches between synthetic ground truth (SGT) and OCR in several ways, including interactive Altair visualizations.
explore_aligned_dataset.ipynb explores the aligned sentences used for training the models.

Data & Model Weights

520k (500k in training, 10k in validation, 10k in test) randomized, aligned sentences and model weights are found on the Zenodo page for this work.
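As a rough illustration of the 500k/10k/10k proportions, the sketch below shuffles a list of aligned (OCR, SGT) sentence pairs and cuts it into three parts. The function name `split_sentences`, the fixed seed, and the toy data are assumptions made for this example; the procedure actually used to build the released splits is not described here and may differ.

```python
import random


def split_sentences(pairs, n_val, n_test, seed=0):
    """Shuffle aligned pairs and split them into train/validation/test lists.

    Hypothetical helper: the released splits may have been built differently.
    """
    shuffled = list(pairs)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


# Toy data with the same 50:1:1 ratio as 500k/10k/10k.
pairs = [(f"ocr sentence {i}", f"sgt sentence {i}") for i in range(52)]
train, val, test = split_sentences(pairs, n_val=1, n_test=1)
print(len(train), len(val), len(test))
```

Seeding a local `random.Random` instance keeps the split repeatable without touching the global random state.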

Folders

data/ - storage location of all data
- all_time_plot.csv - data showing the arXiv Bulk Downloads and our dataset over our time range (1991-2011)
- letters.pickle - Python dictionary with each SGT character as key, and all OCR-matched characters as values
- words.pickle - Python dictionary with each SGT word as key, and all OCR-matched words as values
- words_cleaned.pickle - Python dictionary with each SGT word as key, and all OCR-matched words as values.
  Here, each SGT word has punctuation and capitalization removed (this dictionary is smaller than the one in words.pickle)
models/
- byt5/ contains all of the files needed to run and evaluate the byt5 model
- windowed/ contains all of the files needed to run and evaluate the windowed model
- mBART/ contains initial setup files for the mBART model, which was not used for the paper
example_alignment/ is an example of our alignment routine "in action". Per our agreement with the arXiv, we cannot release our full dataset as of yet, but we hope this acts as an example of our methods.
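The pickled dictionaries in data/ can be read with the standard pickle module. The sketch below is a minimal example that assumes each value is a list of OCR-matched strings with repeats; the helper names `load_matches` and `most_common_ocr` and the toy dictionary are hypothetical, and the real value type in letters.pickle and words.pickle may differ (for instance, a set or a count table).

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path


def load_matches(path):
    """Load a pickled {SGT token: OCR matches} dictionary."""
    with open(path, "rb") as f:
        return pickle.load(f)


def most_common_ocr(matches, token, n=3):
    """Return the n most frequent OCR readings for one SGT token."""
    return Counter(matches.get(token, [])).most_common(n)


# Toy stand-in for data/letters.pickle (contents are invented for illustration).
toy = {"e": ["e", "e", "c", "e", "o"], "l": ["l", "1", "l"]}

with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "letters_toy.pickle"
    with open(toy_path, "wb") as f:
        pickle.dump(toy, f)
    letters = load_matches(toy_path)

top_e = most_common_ocr(letters, "e")
print(top_e)
```

For the real files, the call would be something like `load_matches("data/letters.pickle")`. Only unpickle files from a source you trust, since pickle can execute arbitrary code on load.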

------------------------

TODO

add in HuggingFace links
