This repository contains the source code for our ICASSP 2022 paper *Pseudo strong labels for large scale weakly supervised audio tagging*.
Highlights:
- State-of-the-art on the balanced Audioset subset.
- A simple MobileNetV2 model that does not require an expensive GPU to run.
- Quick training: only the ~60 h balanced Audioset subset is required.
- Achieves an mAP of 35.48 (give or take run-to-run variance), usable for many real-world applications.
This work shows that adding fixed-scale automatic supervision from a machine annotator (a teacher) to a student model yields performance gains on Audioset.
Specifically, our method outperforms other approaches in the literature on the balanced subset of Audioset, while using a rather simple MobileNetV2 architecture.
| Method | Label | mAP | d′ |
|---|---|---|---|
| Baseline (Weak) | Weak | 17.69 | 1.994 |
| PSL-10s (Proposed) | PSL-10s | 31.13 | 2.454 |
| PSL-5s (Proposed) | PSL-5s | 34.11 | 2.549 |
| PSL-2s (Proposed) | PSL-2s | 35.48 | 2.588 |
| CNN14 [@Kong2020d] | Weak | 27.80 | 1.850 |
| EfficientNet-B0 [@gong2021psla] | Weak | 33.50 | - |
| EfficientNet-B2 [@gong2021psla] | Weak | 34.06 | - |
| ResNet-50 [@gong2021psla] | Weak | 31.80 | - |
| AST [@gong21b_interspeech] | Weak | 34.70 | - |
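To make the idea concrete, below is a minimal PyTorch sketch of PSL training. Here `teacher`, `student`, and `chunk_size` are placeholders, not the actual interfaces of this repo; both models are assumed to output per-chunk sigmoid probabilities over the 527 Audioset classes.

```python
import torch

def psl_targets(teacher, waveform, chunk_size):
    """Split a batch of waveforms (batch, samples) into fixed-length chunks
    and let the teacher predict tag probabilities for each chunk."""
    chunks = waveform.unfold(-1, chunk_size, chunk_size)  # (batch, n_chunks, chunk_size)
    flat = chunks.reshape(-1, chunk_size)                 # (batch * n_chunks, chunk_size)
    with torch.no_grad():
        targets = teacher(flat)                           # pseudo strong labels
    return flat, targets

def psl_step(student, optimizer, flat_chunks, targets):
    """One student update against the teacher's chunk-level (pseudo strong)
    labels instead of the weak clip-level labels."""
    preds = student(flat_chunks)
    loss = torch.nn.functional.binary_cross_entropy(preds, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Shorter chunks give the student finer-grained supervision, which is consistent with PSL-2s performing best in the table above.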
Preprocessing requires GNU parallel, which can be installed using conda:

```bash
conda install parallel
```
If you have root privileges, you can instead install it system-wide:

```bash
# On Arch-based distros
sudo pacman -S parallel
# On Debian-based distros
sudo apt install parallel
```
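GNU parallel fans a list of jobs out over multiple processes; the download and preprocessing scripts rely on this pattern, roughly like the following (hypothetical file name, for illustration only):

```bash
# Run up to 8 jobs at once; {} is replaced by one line of urls.txt per job.
cat urls.txt | parallel -j 8 wget -q {}
```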
Further, the download script `scripts/1_download_audioset.sh` uses proxychains to download the data. If you do not use a proxy, you can disable proxychains by simply removing the corresponding line from the script; otherwise, configure your own proxy in your proxychains configuration.
The scripts have been tested with `python=3.8` on CentOS 5 and Manjaro.
To install the Python dependencies, run:

```bash
python3 -m pip install -r requirements.txt
```
The structure of this repo is as follows:

```
.
├── configs
├── data
│   ├── audio
│   │   ├── balanced
│   │   └── eval
│   ├── csvs
│   └── logs
├── figures
├── scripts
│   └── utils
```
If you have already downloaded Audioset, put the raw data of the balanced and eval subsets into `data/audio/balanced` and `data/audio/eval`, respectively. Then put `balanced_train_segments.csv`, `eval_segments.csv`, and `class_labels_indices.csv` into `data/csvs`.
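For example, assuming an existing download lives under `~/audioset` (a hypothetical layout):

```bash
mkdir -p data/audio/balanced data/audio/eval data/csvs
cp ~/audioset/balanced/* data/audio/balanced/
cp ~/audioset/eval/* data/audio/eval/
cp ~/audioset/{balanced_train_segments,eval_segments,class_labels_indices}.csv data/csvs/
```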
First, you need the balanced and evaluation subsets of Audioset. They can be downloaded using the following script:

```bash
./scripts/1_download_audioset.sh
```
To speed up I/O, we pack the data into HDF5 files. This can be done by:

```bash
./scripts/2_prepare_data.sh
```
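The script takes care of this for you; conceptually, the packing boils down to something like the following sketch using `h5py` and `soundfile` (the dataset names and layout used by `scripts/2_prepare_data.sh` may differ):

```python
import h5py
import soundfile as sf
from pathlib import Path

def pack_audio(audio_dir: str, output_h5: str) -> None:
    """Pack raw audio clips into one HDF5 file so training reads a single
    large file instead of thousands of small ones."""
    Path(output_h5).parent.mkdir(parents=True, exist_ok=True)
    with h5py.File(output_h5, "w") as store:
        for path in sorted(Path(audio_dir).glob("*.wav")):
            waveform, _sr = sf.read(path)  # float array, (samples,) or (samples, channels)
            # One dataset per clip, keyed by file name (hypothetical layout).
            store.create_dataset(path.stem, data=waveform, compression="gzip")

pack_audio("data/audio/balanced", "data/hdf5/balanced.h5")  # hypothetical output path
```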
For the experiments in Table 2, run:

```bash
# 10 s PSL training
./train_psl.sh configs/psl_balanced_chunk_10sec.yaml
# 5 s PSL training
./train_psl.sh configs/psl_balanced_chunk_5sec.yaml
# 2 s PSL training
./train_psl.sh configs/psl_balanced_chunk_2sec.yaml
```
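The three configs differ mainly in the chunk length used for the pseudo strong labels. A purely hypothetical sketch of such a config, just to illustrate the idea (the actual keys live in `configs/psl_balanced_chunk_2sec.yaml` and may differ):

```yaml
# Hypothetical illustration only; consult the real files under configs/.
model: MobileNetV2
chunk_length_sec: 2        # length of each pseudo strong label chunk
train_data: data/hdf5/balanced.h5
eval_data: data/hdf5/eval.h5
```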
For the experiments in Table 3, run:

```bash
# 10 s teacher-student training
./train_psl.sh configs/teacher_student_chunk_10sec.yaml
# 5 s teacher-student training
./train_psl.sh configs/teacher_student_chunk_5sec.yaml
# 2 s teacher-student training
./train_psl.sh configs/teacher_student_chunk_2sec.yaml
```
Note that this repo can be easily extended to run the experiments in Table 4, i.e., using the full Audioset dataset.
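For reference, the mAP values reported in the tables above are the macro-average of per-class average precision over Audioset's 527 classes. A minimal, self-contained sketch with scikit-learn (random toy data, not this repo's evaluation code):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# y_true: (n_clips, n_classes) binary labels; y_score: predicted probabilities.
# Random toy data, for illustration only.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 527))
y_score = rng.random((100, 527))

# mAP = mean over the per-class average precision scores.
mAP = average_precision_score(y_true, y_score, average="macro")
print(f"mAP: {mAP:.4f}")
```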