A fully documented notebook for loading and preparing the dataset used to fine-tune the t5-base text-to-text language model, so that it corrects erroneous movie titles found in text paragraphs.
- Download title.basics.tsv (distributed as title.basics.tsv.gz) from https://datasets.imdbws.com/, the official IMDb site for non-commercial datasets
- Filter the titles by removing all NSFW titles (IMPORTANT!)
- Save the filtered titles to filtered_titles.tsv
- Run fill_imdb.py to build the dataset required for training the t5-base model. The resulting dataset contains two columns: OCR generated title (the title with errors) and Movie Title (the correct title)
- Open the Jupyter notebook and run each cell in order to fine-tune the t5-base model
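
The filtering step above could be sketched roughly as follows. This is a minimal illustration, not the project's actual filtering code: it assumes that "NSFW" is identified by the `isAdult` column of title.basics.tsv (0 for non-adult, 1 for adult), and the function names `filter_adult_titles` and `write_filtered` are hypothetical. A stricter filter might also drop rows whose `genres` column contains `Adult`.

```python
def filter_adult_titles(lines):
    """Yield the header line and every row whose isAdult column is "0".

    `lines` is any iterable of tab-separated text lines whose first
    line is a header containing an `isAdult` column (an assumption
    based on the layout of IMDb's title.basics.tsv).
    """
    lines = iter(lines)
    header = next(lines)
    # Locate the isAdult column by name instead of hard-coding its position.
    adult_idx = header.rstrip("\n").split("\t").index("isAdult")
    yield header
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Keep only rows explicitly flagged as non-adult; rows that are
        # too short or carry any other value are dropped.
        if len(fields) > adult_idx and fields[adult_idx] == "0":
            yield line


def write_filtered(src_path="title.basics.tsv", dst_path="filtered_titles.tsv"):
    """Stream the decompressed IMDb file through the filter and save the result."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(filter_adult_titles(src))
```

Calling `write_filtered()` from the directory that holds the decompressed title.basics.tsv would produce filtered_titles.tsv, which fill_imdb.py can then consume. The file is processed line by line, so the full dataset never has to fit in memory.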