GitHub - yashvoladoddi37/movie-title-ocr-corrector

A fully documented notebook that loads and prepares the dataset used to fine-tune t5-base, a small text-to-text language model, so that it corrects erroneous movie titles found in text paragraphs.

1. Download title.basics.tsv from https://datasets.imdbws.com/, the official IMDb site for non-commercial datasets.
2. Filter the titles to remove all NSFW titles. This step is important.
3. Save the filtered titles to filtered_titles.tsv.
4. Run fill_imdb.py to build the dataset for training the t5-base model. The resulting dataset has two columns: the OCR-generated title (containing errors) and the movie title (the correct title).
5. Open the Jupyter notebook and run each cell in order to fine-tune the t5-base model.
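
The filtering and dataset-building steps above can be sketched roughly as follows. This is only an illustration, not the contents of fill_imdb.py: it assumes NSFW titles are identified by the isAdult column of title.basics.tsv, that only rows with titleType "movie" are kept, and that OCR errors are simulated by swapping look-alike characters. The names filter_titles, add_ocr_noise, build_pairs, and OCR_CONFUSIONS are hypothetical, and the sample rows are invented.

```python
import csv
import io
import random

# Assumed look-alike swaps that imitate OCR mistakes (illustrative, not exhaustive).
OCR_CONFUSIONS = {"l": "1", "I": "l", "O": "0", "o": "0", "S": "5", "B": "8", "m": "rn"}

# Invented rows using the column layout of IMDb's title.basics.tsv.
SAMPLE_TSV = (
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\t"
    "startYear\tendYear\truntimeMinutes\tgenres\n"
    "tt9000001\tmovie\tThe Silent Orchard\tThe Silent Orchard\t0\t1999\t\\N\t101\tDrama\n"
    "tt9000002\tmovie\tSample Adult Title\tSample Adult Title\t1\t2001\t\\N\t80\tAdult\n"
    "tt9000003\ttvEpisode\tPilot\tPilot\t0\t2005\t\\N\t42\tComedy\n"
)


def filter_titles(rows):
    """Keep movie rows that are not flagged as adult (assumed NSFW criterion)."""
    return [r for r in rows if r["titleType"] == "movie" and r["isAdult"] == "0"]


def add_ocr_noise(title, rng, error_rate=0.15):
    """Swap some characters for look-alikes with probability error_rate each."""
    noisy = []
    for ch in title:
        if ch in OCR_CONFUSIONS and rng.random() < error_rate:
            noisy.append(OCR_CONFUSIONS[ch])
        else:
            noisy.append(ch)
    return "".join(noisy)


def build_pairs(titles, seed=0):
    """Return (noisy_title, correct_title) pairs, reproducible for a given seed."""
    rng = random.Random(seed)
    return [(add_ocr_noise(t, rng), t) for t in titles]


# QUOTE_NONE matters: IMDb titles may contain literal double quotes.
reader = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t", quoting=csv.QUOTE_NONE)
clean = filter_titles(list(reader))
pairs = build_pairs([r["primaryTitle"] for r in clean])

for noisy, correct in pairs:
    print(f"{noisy}\t{correct}")
```

To run this on the real file, replace the io.StringIO(SAMPLE_TSV) source with an open file handle for title.basics.tsv (opened with encoding="utf-8" and newline=""). The two-element pairs mirror the two dataset columns described in step 4.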

Files:
README.md
fill_imdb.py
ocr_text_correction_model.ipynb
