A curated list of awesome disfluency detection publications, along with their released code (if available) and bibliography. The papers are also listed in chronological order here.
Please feel free to send me pull requests or email me to add a new resource.
Studies on disfluency detection are categorized as follows (some papers belong to more than one category):
The main idea behind a noisy channel model of speech disfluency is that we assume there is a fluent source utterance x to which some noise has been added, resulting in a disfluent utterance y. Given y, the goal is to find the most likely fluent source sentence x, that is, the x for which p(x|y) is maximized.
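By Bayes' rule, maximizing p(x|y) over x is equivalent to maximizing p(y|x) p(x), so a noisy channel system can combine a channel model score with a language model score. The sketch below illustrates only that selection step. The candidate list and all log-probability values are hypothetical toy numbers, and `best_source` is an illustrative helper, not code from any of the papers listed here.

```python
def best_source(candidates, channel_logprob, lm_logprob):
    """Return the candidate x that maximizes log p(y|x) + log p(x)."""
    return max(candidates, key=lambda x: channel_logprob[x] + lm_logprob[x])


# Disfluent input y (hypothetical example).
y = "i want a flight to boston to denver"

# Hypothetical fluent source candidates x for y.
candidates = [
    "i want a flight to boston to denver",  # assume nothing was disfluent
    "i want a flight to denver",            # assume "to boston" was a repair
    "i want a flight to boston",            # assume "to denver" was noise
]

# Toy channel model scores log p(y|x): made-up values for illustration.
channel_logprob = {
    candidates[0]: -0.5,
    candidates[1]: -3.0,
    candidates[2]: -6.0,
}

# Toy language model scores log p(x): made-up values for illustration.
lm_logprob = {
    candidates[0]: -30.0,
    candidates[1]: -18.0,
    candidates[2]: -18.5,
}

x_hat = best_source(candidates, channel_logprob, lm_logprob)
print(x_hat)
```

With these toy scores, the language model penalizes the unedited string heavily enough that the shorter candidate wins, even though the channel model prefers leaving the input unchanged.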
Disfluency detection using a noisy channel model and a deep neural language model. Jamshid Lou et al. ACL 2017. [bib]
The impact of language models and loss functions on repair disfluency detection. Zwarts et al. ACL 2011. [bib]
An improved model for recognizing disfluencies in conversational speech. Johnson et al. Rich Transcription Workshop 2004. [bib]
A TAG-based noisy channel model of speech repair. Johnson et al. ACL 2004. [bib]
The task of disfluency detection is framed as a word token classification problem, where each word token is labeled either as disfluent/fluent or with a tag from a begin-inside-outside (BIO) tagging scheme.
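The two labeling schemes mentioned above can be related with a small sketch. It converts binary disfluent/fluent labels into BIO tags, where B marks the first token of a disfluent span, I marks a continuation, and O marks a fluent token. The function name `to_bio` and the example sentence are assumptions made for illustration; individual papers may use richer tag sets.

```python
def to_bio(is_disfluent):
    """Map per-token binary labels (True = disfluent) to BIO tags."""
    tags = []
    prev = False
    for flag in is_disfluent:
        if not flag:
            tags.append("O")   # fluent token, outside any disfluent span
        elif prev:
            tags.append("I")   # continues the current disfluent span
        else:
            tags.append("B")   # starts a new disfluent span
        prev = flag
    return tags


tokens = "i want a flight to boston to denver".split()
# Hypothetical gold labels: the span "to boston" is the disfluent part.
binary = [False, False, False, False, True, True, False, False]

bio = to_bio(binary)
for token, tag in zip(tokens, bio):
    print(f"{token}\t{tag}")
```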
Joint prediction of punctuation and disfluency in speech transcripts. Lin et al. INTERSPEECH 2020. [bib]
Giving attention to the unexpected: using prosody innovations in disfluency detection. Zayats et al. NAACL 2019. [bib] [code]
Disfluency detection based on speech-aware token-by-token sequence labeling with BLSTM-CRFs and attention mechanisms. Tanaka et al. APSIPA 2019. [bib]
Noisy BiLSTM-based models for disfluency detection. Bach et al. INTERSPEECH 2019. [bib]
Disfluency detection using auto-correlational neural networks. Jamshid Lou et al. EMNLP 2018. [bib] [code]
Robust cross-domain disfluency detection with pattern match networks. Zayats et al. arXiv 2018. [bib] [code]
Disfluency detection using a bidirectional LSTM. Zayats et al. INTERSPEECH 2016. [bib]
Multi-domain disfluency and repair detection. Zayats et al. INTERSPEECH 2014. [bib]
A sequential repetition model for improved disfluency detection. Ostendorf et al. INTERSPEECH 2013. [bib]
The role of disfluencies in topic classification of human-human conversations. Liu et al. IEEE Transactions on Speech & Audio Processing 2006. [bib]
Automatic disfluency identification in conversational speech using multiple knowledge sources. Liu et al. Eurospeech 2003. [bib]
Automatic punctuation and disfluency detection in multi-party meetings using prosodic and lexical cues. Baron et al. ICSLP 2002. [bib]
Translation-based approaches for disfluency detection are commonly formulated as encoder-decoder systems, where the encoder learns a representation of the input sentence containing disfluencies and the decoder learns to generate the underlying fluent version of the input.
Adapting translation models for transcript disfluency detection. Dong et al. AAAI 2019. [bib]
Semi-supervised disfluency detection. Wang et al. COLING 2018. [bib]
A neural attention model for disfluency detection. Wang et al. COLING 2016. [bib]
Parsing-based approaches detect disfluencies while simultaneously identifying the syntactic or semantic structure of the sentence. Training a parsing-based model requires large annotated treebanks that contain both disfluencies and syntactic/semantic structures.
Semantic parsing of disfluent speech. Sen et al. EACL 2021.
Improving disfluency detection by self-training a self-attentive model. Jamshid Lou et al. ACL 2020. [bib] [code]
Neural constituency parsing of speech transcripts. Jamshid Lou et al. NAACL 2019. [bib] [code]
On the role of style in parsing speech with neural models. Tran et al. INTERSPEECH 2019. [bib] [code]
Parsing speech: a neural approach to integrating lexical and acoustic-prosodic information. Tran et al. NAACL 2018. [bib] [code]
Transition-based disfluency detection using LSTMs. Wang et al. EMNLP 2017. [bib] [code]
Joint transition-based dependency parsing and disfluency detection for automatic speech recognition texts. Yoshikawa et al. EMNLP 2016. [bib]
Joint incremental disfluency detection and dependency parsing. Honnibal et al. TACL 2014. [bib]
Joint parsing and disfluency detection in linear time. Rasooli et al. EMNLP 2013. [bib]
Edit detection and parsing for transcribed speech. Charniak et al. NAACL 2001. [bib]
The speech signal carries information beyond the words themselves, which can provide useful cues for disfluency detection models. Some studies have explored integrating acoustic/prosodic cues with lexical features to detect disfluencies.
On the role of style in parsing speech with neural models. Tran et al. INTERSPEECH 2019. [bib] [code]
Disfluency detection based on speech-aware token-by-token sequence labeling with BLSTM-CRFs and attention mechanisms. Tanaka et al. APSIPA 2019. [bib]
Giving attention to the unexpected: using prosody innovations in disfluency detection. Zayats et al. NAACL 2019. [bib] [code]
Parsing speech: a neural approach to integrating lexical and acoustic-prosodic information. Tran et al. NAACL 2018. [bib] [code]
Automatic disfluency identification in conversational speech using multiple knowledge sources. Liu et al. Eurospeech 2003. [bib]
Automatic punctuation and disfluency detection in multi-party meetings using prosodic and lexical cues. Baron et al. ICSLP 2002. [bib]
Disfluency detection models are usually trained and evaluated on the Switchboard corpus. Switchboard is the largest disfluency-annotated dataset; however, only about 6% of its words are disfluent. Some studies have suggested new data augmentation techniques to mitigate the scarcity of gold disfluency-labeled data.
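As a rough illustration of what synthetic augmentation can look like, the sketch below turns a fluent sentence into a labeled disfluent one by duplicating a short random span and marking the first copy as disfluent. This is a deliberately simple stand-in, not the method of any paper listed here; the function name `add_repetition` and its `max_len` parameter are hypothetical.

```python
import random


def add_repetition(tokens, rng, max_len=2):
    """Insert a repeated span into fluent tokens.

    Returns (noisy_tokens, labels), where label 1 marks the inserted
    (disfluent) copy and label 0 marks the original fluent tokens.
    """
    n = rng.randint(1, min(max_len, len(tokens)))   # span length
    start = rng.randint(0, len(tokens) - n)         # span start index
    span = tokens[start:start + n]
    noisy = tokens[:start] + span + tokens[start:]
    labels = [0] * start + [1] * n + [0] * (len(tokens) - start)
    return noisy, labels


rng = random.Random(0)  # fixed seed so the example is repeatable
fluent = "i want a flight to denver".split()
noisy, labels = add_repetition(fluent, rng)

for token, label in zip(noisy, labels):
    print(f"{token}\t{label}")
```

Deleting every token labeled 1 recovers the original fluent sentence, which is the property a repetition-style augmentation needs in order to produce consistent training labels.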
Disfluency detection with unlabeled data and small BERT models. Rocholl et al. Submitted to INTERSPEECH 2021.
Planning and generating natural and diverse disfluent texts as augmentation for disfluency detection. Yang et al. EMNLP 2020. [bib] [code]
Combining self-training and self-supervised learning for unsupervised disfluency detection. Wang et al. EMNLP 2020. [bib] [code]
Improving disfluency detection by self-training a self-attentive model. Jamshid Lou et al. ACL 2020. [bib] [code] [data]
Auxiliary sequence labeling tasks for disfluency detection. Lee et al. arXiv 2020.
Multi-task self-supervised learning for disfluency detection. Wang et al. AAAI 2020. [bib]
Noisy BiLSTM-based models for disfluency detection. Bach et al. INTERSPEECH 2019. [bib]
Semi-supervised disfluency detection. Wang et al. COLING 2018. [bib]
Most disfluency detection models are developed under the assumption that the full sequence context and rich transcriptions, including pre-segmentation information, are available. These assumptions do not hold in real-time scenarios, where the input to the disfluency detector is a live transcript generated by a streaming ASR model. In such cases, a disfluency detector is expected to label the input transcript incrementally as it receives the data token by token. Some studies have proposed new incremental disfluency detectors.
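The token-by-token setting can be sketched with a toy detector that sees only the left context and may revise an earlier label when a new token arrives. The exact-repetition rule used here is a hypothetical stand-in for a learned model, and the class name `IncrementalRepetitionDetector` is an assumption made for illustration, not an interface from any of the listed papers.

```python
class IncrementalRepetitionDetector:
    """Toy incremental labeler that uses only tokens received so far."""

    def __init__(self):
        self.tokens = []
        self.labels = []

    def add_token(self, token):
        """Consume one token and return the current labels for the prefix."""
        # If the new token repeats the previous one, revise the previous
        # token's label: it is now treated as the disfluent copy.
        if self.tokens and self.tokens[-1] == token:
            self.labels[-1] = "disfluent"
        self.tokens.append(token)
        self.labels.append("fluent")  # tentative label for the new token
        return list(self.labels)


detector = IncrementalRepetitionDetector()
history = []
for tok in "i i want want a flight".split():
    labels = detector.add_token(tok)
    history.append(labels)
    print(tok, labels)
```

Each call returns labels for the whole prefix, so a consumer can see both the tentative label for the newest token and any revision to the previous one.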
Re-framing incremental deep language models for dialogue processing with multi-task learning. Rohanian et al. COLING 2020. [bib] [code]
Recurrent neural networks for incremental disfluency detection. Hough et al. INTERSPEECH 2015. [bib]
Joint incremental disfluency detection and dependency parsing. Honnibal et al. TACL 2014. [bib]
Most disfluency detectors are applied as an intermediate step between a speech recognition system and a downstream task. Unlike these conventional pipeline models, some studies have explored end-to-end speech recognition and disfluency removal.
Improved robustness to disfluencies in RNN-Transducer based speech recognition. Mendelev et al. arXiv 2020. [bib]
End-to-end speech recognition and disfluency removal. Jamshid Lou et al. EMNLP Findings 2020. [bib] [code]
While most end-to-end speech translation studies have explored translating read speech, a few examine end-to-end conversational speech translation, where the task is to directly translate disfluent source speech into fluent target text.
NAIST’s machine translation systems for IWSLT 2020 conversational speech translation task. Fukuda et al. IWSLT 2020. [bib]
Generating fluent translations from disfluent text without access to fluent references: IIT Bombay@IWSLT2020. Saini et al. IWSLT 2020. [bib]
Fluent translations from disfluent speech in end-to-end speech translation. Salesky et al. NAACL 2019. [bib] [data]
Segmentation and disfluency removal for conversational speech translation. Hassan et al. INTERSPEECH 2014. [bib]
Analysis of disfluency in children’s speech. Tran et al. INTERSPEECH 2020. [bib]
Speech disfluencies occur at higher perplexities. Sen. Cognitive Aspects of the Lexicon Workshop 2020. [bib]
Controllable time-delay transformer for real-time punctuation prediction and disfluency detection. Chen et al. ICASSP 2020. [bib]
Expectation and locality effects in the prediction of disfluent fillers and repairs in English speech. Dammalapati et al. NAACL Student Research Workshop 2019. [bib]
Disfluencies and human speech transcription errors. Zayats et al. INTERSPEECH 2019. [bib] [data]
Unediting: detecting disfluencies without careful transcripts. Zayats et al. NAACL 2015. [bib]
The role of disfluencies in topic classification of human-human conversations. Boulis et al. AAAI Workshop 2005.
- Preliminaries to a theory of speech disfluencies. Shriberg. PhD Thesis 1994. [bib]
- Disfluent Speech Segments Detection and Remediation. Arbajian. PhD Thesis 2019.
Paria Jamshid Lou