Assignment 3: Hidden Markov Models
In this assignment, we'll be revisiting word recognition, this time using hidden Markov models.
As with assignment 1 (part 4), we'll be using the free spoken digit dataset.
We will be using the hmmlearn library (which depends on numpy).
Please use `hmms.py` for the whole assignment. When submitting your work, please do not include the dataset.
As you can learn from the tutorial, hmmlearn provides us with the base implementation of hidden Markov models; we'll be using `hmm.GaussianHMM`, which implements HMMs with a single Gaussian emission probability per state.
To start, build a basic isolated word recognizer that uses a separate model for each digit.
- Compute the MFCC features for the complete dataset (3000 recordings; use `n_mfcc=13`).
- Implement a 6-fold cross-validation (c/v) loop to (later) figure out which test speaker performs best/worst.
- Inside the c/v loop, train an individual HMM with linear topology for each digit (see the sketch after this list).
  - The `fit` method expects features to be sequential in a single array; see `np.concatenate(..., axis=0)`.
  - How many states (`n_components`) do you choose, and why?
  - How can you enforce a linear topology?
- You might find that certain digits perform particularly badly; what could be a reason, and how could you mitigate it?
- Compute a confusion matrix for each speaker and for the overall dataset.
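A minimal sketch of what the per-digit training could look like, assuming the features for the current fold's training speakers are already grouped by digit; the variable names, the six states, and the 0.5 self-loop probability are illustrative assumptions, not part of the assignment:

```python
import numpy as np
from hmmlearn import hmm

def make_linear_hmm(n_states):
    # init_params="mc": fit only (re-)initializes means and covariances;
    # we set the start and transition probabilities ourselves. EM preserves
    # zeros in the transition matrix, so the linear topology is kept.
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=20, init_params="mc")
    transmat = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        transmat[i, i] = 0.5       # stay in the current state...
        transmat[i, i + 1] = 0.5   # ...or advance to the next one
    transmat[-1, -1] = 1.0         # the last state loops until the recording ends
    startprob = np.zeros(n_states)
    startprob[0] = 1.0             # always start in the first state
    model.startprob_ = startprob
    model.transmat_ = transmat
    return model

# features_by_digit (hypothetical name): dict mapping each digit to a list
# of (T_i, 13) MFCC arrays from the training speakers of the current fold.
models = {}
for digit, seqs in features_by_digit.items():
    X = np.concatenate(seqs, axis=0)   # fit wants one stacked array...
    lengths = [len(s) for s in seqs]   # ...plus the per-sequence lengths
    models[digit] = make_linear_hmm(n_states=6).fit(X, lengths)

def classify(seq):
    # Isolated-word recognition: pick the digit whose HMM scores highest.
    return max(models, key=lambda d: models[d].score(seq))
```

Running `classify` over the held-out speaker's recordings gives you the entries of that speaker's confusion matrix (e.g. via `sklearn.metrics.confusion_matrix` if you have scikit-learn at hand).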
The recognizer above can't handle sequences of spoken digits.
In this part of the assignment, you'll build a basic decoder that is able to decode arbitrary sequences of digits (without a prior, though).
The `decode` method in hmmlearn only works for a single HMM.
There are two ways to solve this part of the assignment:
- Construct a "meta" HMM from the previously trained digit HMMs, by allowing state transitions from one digit to another; the resulting HMM can be decoded using the existing
decodemethod (don't forget to re-map the state ids to the originating digit). - (Optional) Implement a real (time-synchronous) decoder using beam search. The straight-forward way is to maintain a (sorted) list of active hypotheses (ie. state history and current log-likelihood) that is first expanded and then pruned in each time step. The tricky part is at the "end" of a model: do you loop or expand new words?
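A possible sketch of the first approach, reusing the `models` dict from the training sketch above; the uniform word-entry probability and the 0.5 exit mass are assumptions you may want to tune:

```python
import numpy as np
from hmmlearn import hmm

digits = sorted(models)
sizes = [models[d].n_components for d in digits]
offsets = np.cumsum([0] + sizes[:-1])
N = sum(sizes)

transmat = np.zeros((N, N))
startprob = np.zeros(N)
state2digit = np.empty(N, dtype=int)

for d, off, n in zip(digits, offsets, sizes):
    transmat[off:off + n, off:off + n] = models[d].transmat_
    startprob[off] = 1.0 / len(digits)   # any word may start the sequence
    state2digit[off:off + n] = d
    # Move mass off the exit state's self-loop and spread it over all entry
    # states, turning the isolated word models into one circular graph.
    last = off + n - 1
    transmat[last, :] = 0.0
    transmat[last, last] = 0.5
    transmat[last, offsets] += 0.5 / len(digits)

meta = hmm.GaussianHMM(n_components=N, covariance_type="diag")
meta.startprob_ = startprob
meta.transmat_ = transmat
meta.means_ = np.concatenate([models[d].means_ for d in digits], axis=0)
# covars_ is exposed as full matrices, but the "diag" setter expects
# only the diagonals, so we extract them.
full = np.concatenate([models[d].covars_ for d in digits], axis=0)
meta.covars_ = full.diagonal(axis1=1, axis2=2)

def states_to_digits(states):
    # A new word begins whenever the path enters an entry state coming from
    # a different state (self-loops on the entry state don't count).
    entries = set(offsets.tolist())
    out, prev = [], -1
    for s in states:
        if s in entries and s != prev:
            out.append(int(state2digit[s]))
        prev = s
    return out

logprob, states = meta.decode(X_test)   # X_test: MFCCs of one digit sequence
print(states_to_digits(states))
```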
Now for this assignment:
- Generate a few test sequences of random length between 3 and 6 digits; use `numpy.random.randint`, and be sure to also retain the digit sequence, since we need to compute the edit distance between reference and hypothesis later (see the sketch at the end of this list).
- Combine the previously trained HMMs into a single "meta" HMM, altering the transition probabilities to make a circular graph that allows each word to follow another.
- Implement a method that converts a state sequence of the meta HMM into a sequence of actual digits (cf. `states_to_digits` in the sketch above).
- Decode your test sequences and compute the word error rate (WER).
- Compute an overall (cross-validated) WER.
- (Optional) Implement a basic time-synchronous beam search; how do the results compare to the above Viterbi decoding in terms of accuracy and time?
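For generating the test sequences and scoring them, a sketch along these lines might help; `test_recordings` (held-out MFCC arrays per digit) is a hypothetical name:

```python
import numpy as np

def make_test_sequence(test_recordings):
    # Random length between 3 and 6 digits (randint's upper bound is exclusive).
    n = np.random.randint(3, 7)
    ref = list(np.random.randint(0, 10, size=n))
    feats = [test_recordings[d][np.random.randint(len(test_recordings[d]))]
             for d in ref]
    # Keep the reference digits around; we need them for the edit distance.
    return ref, np.concatenate(feats, axis=0)

def edit_distance(ref, hyp):
    # Plain Levenshtein distance via dynamic programming.
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i, j] = min(d[i - 1, j] + 1,                         # deletion
                          d[i, j - 1] + 1,                         # insertion
                          d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[-1, -1]

def wer(ref, hyp):
    # Word error rate: edit operations per reference word.
    return edit_distance(ref, hyp) / len(ref)
```

With the meta HMM from above, `wer(ref, states_to_digits(meta.decode(X)[1]))` scores a single sequence; averaging over all test sequences of all folds gives the cross-validated WER.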