[ICASSP 2026] Official code for UNMIXX: Untangling Highly Correlated Singing Voices Mixtures
UNMIXX is a novel framework for multiple singing voice separation (MSVS). While similar to speech separation, MSVS presents unique challenges, namely data scarcity and the highly correlated nature of singing voices. To address these issues, we propose three key components: (1) a musically informed mixing strategy that constructs highly correlated training mixtures, (2) a reverse attention module that drives the two outputs apart using cross-attention, and (3) a magnitude penalty loss that penalizes energy erroneously assigned to the other output. Experiments show that UNMIXX achieves substantial improvements, with more than 2.2 dB SDRi gain over prior methods on the MedleyVox evaluation set. Audio samples are available on our demo page.
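To illustrate the idea behind component (3), here is a minimal sketch of a magnitude penalty. The function name `magnitude_penalty` and the dominance-mask formulation are assumptions for illustration, not the paper's exact loss: it punishes energy an estimate places in time-frequency bins where the *other* reference magnitude dominates.

```python
import numpy as np

def magnitude_penalty(mag_est1, mag_est2, mag_ref1, mag_ref2):
    """Illustrative penalty on cross-source leakage (not the paper's exact loss).

    All inputs are spectrogram magnitudes of shape (freq, time).
    """
    # Bins where reference 2 dominates reference 1, and vice versa.
    dom2 = (mag_ref2 > mag_ref1).astype(float)
    dom1 = (mag_ref1 > mag_ref2).astype(float)
    # Estimate 1 should carry little energy in source-2-dominant bins,
    # and symmetrically for estimate 2.
    leak1 = np.mean(mag_est1 * dom2)
    leak2 = np.mean(mag_est2 * dom1)
    return leak1 + leak2
```

With perfect estimates the penalty is zero; any energy leaking into the other source's dominant bins raises it.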
We use a total of 400 hours of singing datasets for training.
Download the datasets and follow the MedleyVox preprocessing steps.
- Children’s song dataset (CSD) — 4.9 hours
- NUS — 1.9 hours
- VocalSet — 8.8 hours
- Jsut-song — 0.4 hours
- Jvs_music — 2.3 hours
- Musdb-hq (train subset) — 2.0 hours
  - Single singing regions extracted using the musdb-lyrics extension
- OpenSinger — 51.9 hours
- K_multisinger — 169.6 hours
- K_multitimbre — 150.8 hours
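Once the single-voice clips above are prepared, training mixtures are built by summing pairs of clips. A minimal sketch of mixing two clips at a target SNR is shown below; the function name `mix_at_snr` is illustrative, and the paper's musically informed strategy additionally aligns the pair musically (e.g. in key and timing), which is omitted here.

```python
import numpy as np

def mix_at_snr(voice_a, voice_b, snr_db):
    """Scale voice_b so that voice_a sits at snr_db relative to it,
    then sum. Returns (mixture, voice_a, scaled voice_b)."""
    pow_a = np.mean(voice_a ** 2) + 1e-12
    pow_b = np.mean(voice_b ** 2) + 1e-12
    scale = np.sqrt(pow_a / (pow_b * 10 ** (snr_db / 10)))
    voice_b = voice_b * scale
    return voice_a + voice_b, voice_a, voice_b
```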
Train:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python audio_train.py --conf_dir configs/unmixx.yml
```

Run inference:

```bash
python inference.py \
    --conf_path ckpt/conf.yml \
    --ckpt_path ckpt/best.ckpt \
    --audio_path sample_music/free_mixture.wav \
    --output_dir separated_audio
```

Outputs will be saved to `separated_audio/`.
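To reproduce the SDRi numbers reported above, the separated outputs can be scored against references. A minimal sketch of SDR improvement (SDRi) on single-channel NumPy signals, using the simple non-scale-invariant SDR definition (the paper's evaluation toolkit may differ):

```python
import numpy as np

def sdr(est, ref, eps=1e-12):
    # Signal-to-distortion ratio in dB (simple, non-scale-invariant form).
    return 10 * np.log10(np.sum(ref ** 2) / (np.sum((ref - est) ** 2) + eps))

def sdri(est, mix, ref):
    # SDR improvement: gain of the estimate over using the raw mixture.
    return sdr(est, ref) - sdr(mix, ref)
```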