This repository contains a fine-tuned transformer-based model for restoring punctuation in automatic speech recognition (ASR) outputs and spoken transcripts. The model adds missing punctuation such as `.`, `,`, `?`, `!`, `:` and `;` to improve readability and downstream NLP performance.
- Fine-tuned on diverse spoken text data (Wikipedia corpus, Hugging Face datasets, podcast transcripts, and manually created YouTube captions from TED Talks and interviews)
- Supports `; : ! ? , .`, including less common punctuation such as `;` and `:`
- Built on top of google/bert_uncased_L-4_H-256_A-4
- Easy to plug into any transcript-cleaning pipeline
- Does not support automatic capitalisation, and works only on clean transcripts that contain none of the `; : ! ? , .` punctuation marks
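Because the model expects input that contains none of the supported marks, a transcript that is already partly punctuated needs cleaning first. The following is a minimal sketch of such a step, not part of this repository: the helper name `strip_punctuation` is hypothetical, and replacing each mark with a space is an assumption chosen so that adjacent words do not fuse together.

```python
import re

# The six marks listed above as supported; input should contain none of them.
PUNCTUATION = ";:!?,."


def strip_punctuation(text: str) -> str:
    """Remove the supported punctuation marks and collapse whitespace.

    Hypothetical helper for illustration; it is not shipped with the model.
    """
    # Replace each mark with a space so "word,word" stays two words.
    without_marks = re.sub(f"[{re.escape(PUNCTUATION)}]", " ", text)
    # Collapse the runs of whitespace left behind by the substitution.
    return " ".join(without_marks.split())


print(strip_punctuation("Hello, world! How are you?"))
# → Hello world How are you
```

Note that this simple approach also splits decimals and abbreviations (for example `3.5` becomes `3 5`), so transcripts containing them may need more careful handling.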
Follow these steps to install and run the punctuation restoration model locally.
git clone https://github.com/yyihaoc/punctuate-bert-mini.git
cd punctuate-bert-mini
python3 -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
pip install -r requirements.txt

Open test_result_bert_mini.py and replace the example input with your own text. Then run
python test_result_bert_mini.py
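For orientation, punctuation restoration is commonly framed as predicting, for each word, which mark (if any) follows it, and then stitching the marks back into the text. The sketch below illustrates only that final stitching step. It assumes a hypothetical label scheme in which `"O"` means "no punctuation" and any other label is the mark itself; the scheme actually used by test_result_bert_mini.py may differ, and `apply_labels` is not a function from this repository.

```python
from typing import List


def apply_labels(words: List[str], labels: List[str]) -> str:
    """Attach one predicted punctuation label to each word.

    Hypothetical label scheme: "O" means no mark follows the word;
    any other label is the punctuation mark to append.
    """
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    pieces = [
        word if label == "O" else word + label
        for word, label in zip(words, labels)
    ]
    return " ".join(pieces)


print(apply_labels(["hello", "how", "are", "you"], [",", "O", "O", "?"]))
# → hello, how are you?
```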