5. Segmentation
Segmentation is an important processing stage for most audio analysis applications. The goal is to split an uninterrupted audio signal into homogeneous segments. Segmentation can either be:
- supervised: some type of supervised knowledge is used to classify and segment the input signal. This is achieved either by applying a classifier to successive fixed-size segments, assigning each to one of a set of predefined classes, or by using an HMM approach to achieve joint segmentation-classification.
- unsupervised: a supervised model is not available and the detected segments are clustered (example: speaker diarization).
The goal of this functionality is to apply an existing classifier to fixed-size segments of an audio recording, yielding a sequence of class labels that characterizes the whole signal.
Towards this end, function mtFileClassification() from audioSegmentation.py can be used. This function:
- splits an audio signal into successive mid-term segments and extracts mid-term feature statistics from each of these segments, using mtFeatureExtraction() from audioFeatureExtraction.py
- classifies each segment using a pre-trained supervised model
- merges successive fixed-size segments that share the same class label into larger segments
- visualizes statistics regarding the results of the segmentation-classification process.
Example:
from pyAudioAnalysis import audioSegmentation as aS
[flagsInd, classesAll, acc] = aS.mtFileClassification("data/scottish.wav", "data/svmSM", "svm", True, 'data/scottish.segments')
Note that the last argument of this function is a .segments file. This is used as ground-truth (if available) in order to estimate the overall performance of the classification-segmentation method. If this file does not exist, the performance measure is not calculated. These files are simple comma-separated files of the format: <segment start (seconds)>,<segment end (seconds)>,<segment label>. For example:
0.01,9.90,speech
9.90,10.70,silence
10.70,23.50,speech
23.50,184.30,music
184.30,185.10,silence
185.10,200.75,speech
...
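Should you need to load such a ground-truth file yourself, the following minimal sketch can be used (the read_segments_file() helper is hypothetical and not part of the library's API):
def read_segments_file(path):  # hypothetical helper, not part of pyAudioAnalysis
    starts, ends, labels = [], [], []
    with open(path, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            seg_start, seg_end, seg_label = line.split(',')
            starts.append(float(seg_start))
            ends.append(float(seg_end))
            labels.append(seg_label)
    return starts, ends, labels
starts, ends, labels = read_segments_file('data/scottish.segments')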
Function plotSegmentationResults() is used both to plot the resulting segmentation-classification and to evaluate its performance (if a ground-truth file is available).
Command-line use:
python audioAnalysis.py -segmentClassifyFile <method(svm or knn)> <modelName> <fileName>
Example:
python audioAnalysis.py -segmentClassifyFile svm data/svmSM data/scottish.wav
The last command-line execution generates the following figure (in the above example data/scottish.segments is found and automatically used as ground-truth for calculating the overall accuracy):
Note: In the next section (HMM-based segmentation-classification) we present how one can evaluate both the fixed-size approach and the HMM approach.
pyAudioAnalysis provides the ability to train and apply Hidden Markov Models (HMMs) in order to achieve joint classification-segmentation.
This is a supervised approach; therefore, the required transition and prior matrices of the HMM need to be trained.
Towards this end, function trainHMM_fromFile() can be used to train an HMM model for segmentation-classification using a single annotated audio file. A list of files stored in a particular folder can also be used, by calling function trainHMM_fromDir(). In both cases, an annotation file is needed for each audio recording (WAV file). HMM annotation files have a .segments extension (see above).
As soon as the model is trained, function hmmSegmentation() can be used to apply the HMM and estimate the most probable sequence of class labels, given the respective sequence of mid-term feature vectors. Note that hmmSegmentation() uses plotSegmentationResults() to plot the results and evaluate the performance, as with the fixed-size segmentation-classification methodology described above.
Code example:
from pyAudioAnalysis import audioSegmentation as aS
aS.trainHMM_fromFile('radioFinal/train/bbc4A.wav', 'radioFinal/train/bbc4A.segments', 'hmmTemp1', 1.0, 1.0) # train using a single file
aS.trainHMM_fromDir('radioFinal/small/', 'hmmTemp2', 1.0, 1.0) # train using a set of files in a folder
aS.hmmSegmentation('data/scottish.wav', 'hmmTemp1', True, 'data/scottish.segments') # test 1
aS.hmmSegmentation('data/scottish.wav', 'hmmTemp2', True, 'data/scottish.segments') # test 2
Command-line use examples:
python audioAnalysis.py -trainHMMsegmenter_fromdir radioFinal/small/ hmmtemp2 1.0 1.0 (train)
python audioAnalysis.py -segmentClassifyFileHMM hmmtemp2 data/scottish.wav (test)
Note that for shorter-term events (e.g. spoken words), shorter mid-term windows need to be used when training the model. Try, for example, the following:
python audioAnalysis.py -trainHMMsegmenter_fromfile data/count.wav data/count.segments hmmcount 0.1 0.1 (train)
python audioAnalysis.py -segmentClassifyFileHMM hmmcount data/count2.wav (test)
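The same training and testing procedure from Python (a sketch based on the files used above; the ground-truth argument of hmmSegmentation() is omitted here, assuming it is optional when no annotation is available for the test file):
from pyAudioAnalysis import audioSegmentation as aS
aS.trainHMM_fromFile('data/count.wav', 'data/count.segments', 'hmmcount', 0.1, 0.1)  # 0.1-second mid-term window and step for short events
aS.hmmSegmentation('data/count2.wav', 'hmmcount', True)  # apply the trained model and plot the result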
Note 1: File hmmRadioSM contains a trained HMM model for speech-music discrimination.
It can be directly used for joint classification-segmentation of audio recordings, e.g.:
python audioAnalysis.py -segmentClassifyFileHMM data/hmmRadioSM data/scottish.wav
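The equivalent call from Python, reusing hmmSegmentation() as in the code example above:
from pyAudioAnalysis import audioSegmentation as aS
aS.hmmSegmentation('data/scottish.wav', 'data/hmmRadioSM', True, 'data/scottish.segments')  # apply the pre-trained speech-music HMM; the .segments file is used as ground-truth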
Note 2: Function evaluateSegmentationClassificationDir() evaluates the performance of either a fixed-size method or an HMM model on the segmentation-classification task.
TODO!
Function silenceRemoval() from audioSegmentation.py takes an uninterrupted audio recording as input and returns segment endpoints that correspond to individual audio events. In this way, all "silent" areas of the signal are removed.
This is achieved through a semi-supervised approach: first, an SVM model is trained to distinguish between high-energy and low-energy short-term frames. Towards this end, the 10% highest-energy frames along with the 10% lowest-energy frames are used. Then, the SVM is applied (with a probabilistic output) to the whole recording, and dynamic thresholding is used to detect the active segments.
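To illustrate the idea, here is a simplified sketch of the semi-supervised scheme described above (not the library's actual implementation, which operates on full short-term feature vectors rather than energy alone), assuming numpy and scikit-learn are available:
import numpy as np
from sklearn.svm import SVC
def energy_based_vad_sketch(frame_energies, weight=0.3):
    # frame_energies: 1-D numpy array of short-term frame energies
    n = len(frame_energies)
    n_train = max(1, int(0.1 * n))
    order = np.argsort(frame_energies)
    low_idx, high_idx = order[:n_train], order[-n_train:]  # 10% lowest / 10% highest energy frames
    X = frame_energies[np.concatenate([low_idx, high_idx])].reshape(-1, 1)
    y = np.concatenate([np.zeros(n_train), np.ones(n_train)])  # 0 = silent, 1 = active
    clf = SVC(probability=True).fit(X, y)
    p_active = clf.predict_proba(frame_energies.reshape(-1, 1))[:, 1]  # probability of "active" per frame
    # simple dynamic threshold between mean and max probability, controlled by weight (stricter as it approaches 1)
    thr = (1 - weight) * p_active.mean() + weight * p_active.max()
    return p_active > thr  # boolean mask of "active" short-term frames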
silenceRemoval() takes the following arguments: the signal, the sampling frequency, the short-term window size and step, the window (in seconds) used to smooth the SVM probabilistic sequence, a factor between 0 and 1 that specifies how "strict" the thresholding is, and finally a boolean that controls the plotting of the results. Usage example:
from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS
[Fs, x] = aIO.readAudioFile("data/recording1.wav")
segments = aS.silenceRemoval(x, Fs, 0.020, 0.020, smoothWindow = 1.0, Weight = 0.3, plot = True)
In this example, segments is a list of segment endpoints, i.e. each element is a list of two elements: the segment beginning and the segment end (in seconds).
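For instance, a minimal sketch (assuming scipy is installed) that writes each detected segment of the example above to its own WAV file:
from scipy.io import wavfile
for i, (seg_start, seg_end) in enumerate(segments):
    seg_samples = x[int(seg_start * Fs):int(seg_end * Fs)]  # slice the signal by sample index
    wavfile.write("data/recording1_segment{0:d}.wav".format(i), Fs, seg_samples)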
A command-line call can be used to also generate WAV files with the detected segments. In the following example, a classification of all resulting segments using a predefined classifier (in our case speech vs music) is also performed:
python audioAnalysis.py -silenceRemoval data/recording3.wav 1.0 0.3
python audioAnalysis.py -classifyFolder svm data/svmSM data/recording3_ 1
The following figure shows the output of silence removal on file data/recording3.wav
(vertical lines correspond to detected segment limits):
Depending on the nature of the recordings under study, different smoothing window lengths and probability weights must be used. The aforementioned examples were applied on two rather sparse recordings (i.e. the silent periods were quite long). For a continuous speech recording (e.g. data/count2.wav), a shorter smoothing window and a stricter probability threshold must be used, e.g.:
python audioAnalysis.py -silenceRemoval data/count2.wav 0.1 0.6
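The same call from Python, using the silenceRemoval() arguments shown earlier:
from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS
[Fs, x] = aIO.readAudioFile("data/count2.wav")
segments = aS.silenceRemoval(x, Fs, 0.020, 0.020, smoothWindow = 0.1, Weight = 0.6, plot = True)  # shorter smoothing window, stricter threshold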
TODO
Audio thumbnailing is an important application of music information retrieval that focuses on detecting instances of the most representative part of a music recording. In pyAudioAnalysis this has been implemented in the musicThumbnailing(x, Fs, shortTermSize=1.0, shortTermStep=0.5, thumbnailSize=10.0) function from audioSegmentation.py.
The function uses the given WAV file as an input music track and generates two thumbnails of <thumbnailDuration> length. Its results are written to two WAV files, <wavFileName>_thumb1.wav and <wavFileName>_thumb2.wav.
It uses selfSimilarityMatrix(), which calculates the self-similarity matrix of an audio signal (also located in audioSegmentation.py).
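A usage sketch from Python (the input file name is hypothetical, and it is assumed here that, besides the self-similarity matrix, the function returns the start and end times, in seconds, of the two detected thumbnails):
from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS
[Fs, x] = aIO.readAudioFile("data/mySong.wav")  # hypothetical input file
[t1_start, t1_end, t2_start, t2_end, S] = aS.musicThumbnailing(x, Fs, 1.0, 0.5, 20.0)  # assumed return values
print("Thumbnail 1: %.1f - %.1f sec" % (t1_start, t1_end))
print("Thumbnail 2: %.1f - %.1f sec" % (t2_start, t2_end))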
Command-line use:
python audioAnalysis.py -thumbnail <wavFileName> <thumbnailDuration>
For example, the following thumbnail analysis is the result of applying the method on the famous song "Charmless Man" by Blur, using a 20-second thumbnail length. The automatically annotated diagonal segment represents the area where the self-similarity is maximized, leading to the definition of the "most common segments" in the audio stream.