This is a quick guide for using x-vector utility tool (recipe/local/xvector_utils.py
).
- Kaldi (for extracting x-vectors)
- Python 3.5+
For Python package dependencies, see requirements.txt
.
$ python xvector_utils.py -h
Usage:
xvector_utils.py [options] make-xvectors <wav.scp> <labels.stm> <xvectors.ark>
xvector_utils.py [options] visualize <labels.stm> <xvectors.ark> <output_dir>
xvector_utils.py [options] train <labels.stm> <xvectors.ark> <model.pkl>
xvector_utils.py [options] tune <labels.stm> <xvectors.ark> <model.pkl>
xvector_utils.py [options] evaluate <labels.stm> <xvectors.ark> <model.pkl>
xvector_utils.py [options] predict <xvectors.ark> <model.pkl>
xvector_utils.py [-h]
A utility tool to:
- extract x-vectors
- visualize x-vectors
- train, tune and evaluate an x-vector classifier
Visualize options:
--interactive If present, save plot as an interactive html file.
--plot-all If present, plot all data points. --sample-size option
will be ingored.
--cmap STR Specity matplotlib colormap for static plots, ignored
when generating interactive plots with --interactive.
Qualitative colormaps are recommended. [default: tab20]
--num-components INT Dimention of embedded space using t-SNE. [default: 2]
--perc-variance FLOAT Fraction of the total variance that needs to be
explained by the generated components using PCA.
[default: 0.95]
--sample-size INT Sample size. [default: 10_000]
--title STR Custom title of the plot.
Train options:
--params STR String of LinearSVC parameters (see scikitlearn doc
for list of params) to train classifier on. e.g.,
'{"C": 0.001, "loss": "hinge"}'
[default: {"C": 0.01, "dual": False}]
Tune options:
--param-grid STR String of LinearSVC parameters (see scikitlearn doc
for list of params) to perform grid search on. e.g.,
'{"C": [0.1, 1], "loss": ["hinge", "squared_hinge"]}'
[default: {"C": [0.001, 0.01, 0.1, 1, 10, 100]}]
--score STR Scoring parameter (see scikitlearn documentation for
list of scoring params). [default: accuracy]
--test-size FLOAT Fraction of test sample size. [default: 0.20]
Evaluate options:
--print-error-keys Print xvector key names for classification errors.
--save-roc PATH Path to output CSV file containing ROC metrics.
Output CSV headers:
- fpr: False positive rate
- tpr: True positive rate
- thr: Probability thresholds
--thr FLOAT Probability threshold for binary classification.
Common options:
-h --help Show this screen.
--version Show version.
--num-cpus INT Max number of cpus to use. If not specified it uses all
processors.
Common arguments:
<wav.scp> PATH SCP file mapping from audio file IDs to absolute paths.
<labels.stm> PATH STM file containing class lablel(s) in the 6th column.
<xvectors.ark> PATH ARK file to read / write xvectors to.
<model.pkl> PATH Pickle file to read / write model to.
<output_dir> PATH Path to output directory.
The system expects audio files in 16KHz 16-bit mono wav format - use tools such as ffmpeg
if your audio files don't conformt to this specification.
e.g.:
$ ffmpeg -i input.mp3 -vn -ac 1 -ar 16000 output.wav
It's a binary Kaldi ARK file containing 512-dimensional vectors (x-vectors) extracted from audio files. X-vectors can be extracted from audio files using the xvector_utils.py make-xvectors
command and by providing appropriate wav.scp
and labels.stm
.
A Kaldi style wav.scp
file. It is a text file which provides mapping between file ids and the absolute path
of corresponding audio files. The file IDs should be the basename of the wav file minus the extension.
e.g.:
audio_file_1 absolute/path/to/audio_file_1.wav
audio_file_2 absolute/path/to/audio_file_2.wav
audio_file_3 absolute/path/to/audio_file_3.wav
The STM file is defined as part of the SCLITE toolkit and can be found here.
Definition is as follows:
STM :== <F> <C> <S> <BT> <ET> [ <LABEL> ] transcript . . .
Where:
<F> The waveform filename. NOTE: no pathnames or extensions are expected.
<C> The waveform channel. This value is ignored.
<S> The speaker id, no restrictions apply to this name.
<BT> The begin time (seconds) of the segment.
<ET> The end time (seconds) of the segment.
<LABEL> A comma separated list of subset identifiers enclosed in angle brackets.
transcript (Optional) A whitespace separated list of words or _ to denote empty transcripts.
For this tool the transcript will be ignored.
Use 6th column to provide lables for each segment.
e.g.:
audio_file_1 0 audio_file_1_Spk1 0.0 37.1779439252 <noise> _
audio_file_1 0 audio_file_1_Spk1 37.1779439252 39.5 <speech> EXSAMPLE TRANSCRIPT ONE
audio_file_1 0 audio_file_1_Spk1 39.5 41.31 <speech> EXSAMPLE TRANSCRIPT TWO
audio_file_1 0 audio_file_1_Spk1 41.31 47.38 <noise> _
audio_file_1 0 audio_file_1_Spk1 47.38 53.5610280374 <speech> _
If a segment belongs to more than one class, separate each class with comma.
e.g.:
audio_file_1 0 audio_file_1_Spk1 0.0 37.1779439252 <noise,laughter> _
audio_file_1 0 audio_file_1_Spk1 37.1779439252 39.5 <speech> EXSAMPLE TRANSCRIPT ONE
audio_file_1 0 audio_file_1_Spk1 39.5 41.31 <speech,applause> EXSAMPLE TRANSCRIPT TWO
audio_file_1 0 audio_file_1_Spk1 41.31 47.38 <noise,music> _
audio_file_1 0 audio_file_1_Spk1 47.38 53.5610280374 <speech> _