Reproducing results for multispeaker model #40

Closed · AndreevP opened this issue Nov 1, 2021 · 7 comments

AndreevP commented Nov 1, 2021

Hi!

I am trying to reproduce the results of the paper.

For data preparation, I use the following commands:

python ../prep_vctk.py \
  --file-list  train-files.txt \
  --in-dir ./ \
  --out vctk-multispeaker-train.4.16000.8192.8192.h5 \
  --scale 4 \
  --sr 16000 \
  --dimension 8192 \
  --stride 8192 \
  --interpolate \
  --batch-size 64 \
  --sam 0.25

python ../prep_vctk.py \
  --file-list val-files.txt \
  --in-dir ./ \
  --out vctk-multispeaker-val.4.16000.8192.8192.h5.tmp \
  --scale 4 \
  --sr 16000 \
  --dimension 8192 \
  --stride 8192 \
  --interpolate \
  --batch-size 64


python ../prep_vctk.py \
  --file-list  train-files.txt \
  --in-dir ./ \
  --out vctk-multispeaker-train.4.16000.-1.8192.h5 \
  --scale 4 \
  --sr 16000 \
  --dimension -1 \
  --stride 8192 \
  --interpolate \
  --batch-size 64 \
  --sam 0.25

python ../prep_vctk.py \
  --file-list val-files.txt \
  --in-dir ./ \
  --out vctk-multispeaker-val.4.16000.-1.8192.h5.tmp \
  --scale 4 \
  --sr 16000 \
  --dimension -1 \
  --stride 8192 \
  --interpolate \
  --batch-size 64
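
As a sanity check on the prepared data, the contents of the generated HDF5 files can be listed with a minimal h5py sketch (it simply enumerates whatever datasets prep_vctk.py wrote, so no dataset names are assumed):

import h5py

def show(name, obj):
    # Print each dataset's path, shape, and dtype.
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype)

with h5py.File("vctk-multispeaker-train.4.16000.8192.8192.h5", "r") as f:
    f.visititems(show)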

For training, I use the following command:

python3 run.py train \
  --train ../data/vctk/multispeaker/vctk-multispeaker-train.4.16000.8192.8192.h5 \
  --val ../data/vctk/multispeaker/vctk-multispeaker-val.4.16000.8192.8192.h5.tmp \
  -e 50 \
  --batch-size 64 \
  --lr 3e-4 \
  --logname multispeaker \
  --model audiotfilm \
  --r 4 \
  --layers 4 \
  --piano false \
  --speaker multi \
  --pool_size 2 \
  --strides 2 \
  --full true

After training, I use the following command to run inference:

python run.py eval \
  --logname ./model.ckpt-53351 \
  --out-label mul-out \
  --wav-file-list ./test_files.txt \
  --r 4 \
  --pool_size 2 \
  --strides 2 \
  --model audiotfilm \
  --speaker multi

I got poor performance from your model; the spectrograms look like this (predicted on top, ground truth on the bottom):

[Image: spectrogram comparison, predicted (top) vs. ground truth (bottom)]
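
For completeness, such a side-by-side spectrogram figure can be produced with a short librosa/matplotlib sketch (the file names below are hypothetical placeholders for one predicted output and its ground-truth reference):

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Hypothetical placeholders: one model output and its reference.
pred, sr = librosa.load("predicted.wav", sr=16000)
gt, _ = librosa.load("ground_truth.wav", sr=16000)

fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
for ax, (y, title) in zip(axes, [(pred, "predicted"), (gt, "ground truth")]):
    # Log-magnitude STFT, shown on a dB scale.
    D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    librosa.display.specshow(D, sr=sr, x_axis="time", y_axis="hz", ax=ax)
    ax.set_title(title)
plt.tight_layout()
plt.show()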

I doubt this behavior is expected. Could you give me a hint as to where I may have gone wrong, or better yet, provide the checkpoint for the multi-speaker model?

Thank you in advance!

AndreevP commented Nov 5, 2021

The article states that you set the pool_size and stride parameters to 2. Is that a mistake, and should they actually be equal to 8?

Sawyerb commented Nov 6, 2021

What SNR and LSD are you seeing? And the pool_size and stride should be 2. Where do you see 8?

AndreevP commented Nov 7, 2021

> What SNR and LSD are you seeing?

Here are logs from the last few epochs:

Epoch 46 of 50 took 509437.803s (1067 minibatches)
training l2_loss/segsnr/LSD: 0.006147 15.546462 2.536153
validation l2_loss/segsnr/LSD: 0.004268 15.627095 2.583452
Full SNR: 38456.16613149643
Epoch 47 of 50 took 522063.082s (1067 minibatches)
training l2_loss/segsnr/LSD: 0.006152 15.538509 2.629113
validation l2_loss/segsnr/LSD: 0.004270 15.636318 2.687247
Full SNR: 38445.54378128052
Epoch 48 of 50 took 534709.013s (1067 minibatches)
training l2_loss/segsnr/LSD: 0.006091 15.602586 2.587814
validation l2_loss/segsnr/LSD: 0.004269 15.650008 2.648711
Full SNR: 38460.77331352234
Epoch 49 of 50 took 547406.473s (1067 minibatches)
training l2_loss/segsnr/LSD: 0.006117 15.582755 2.603747
validation l2_loss/segsnr/LSD: 0.004257 15.657111 2.661661
Full SNR: 38513.77068042755
Epoch 50 of 50 took 560018.804s (1067 minibatches)
training l2_loss/segsnr/LSD: 0.006142 15.550923 2.602868
validation l2_loss/segsnr/LSD: 0.004290 15.606818 2.671877
Full SNR: 38329.45089030266

> And the pool_size and stride should be 2. Where do you see 8?

I saw it in this issue: #27 (comment).

Sawyerb commented Nov 7, 2021

Looks like you are reproducing the paper's results. We reported an SNR of 15.0 and an LSD of 2.7 on the multispeaker task with an upsampling ratio of 4. Multispeaker with r=4 is a hard task.

And that's a typo on my part. Thanks for flagging it. I've updated my answer.

AndreevP commented Nov 8, 2021

I also thought maybe there is a problem with my inference command. Could you check it?

I am afraid of reporting lower results for your method than the actual ones, because the samples presented on this page, https://anonymousqwerty.github.io/audio-sr/, seem much better than what I get. In your experience, do you think your model may perform as I showed above on some samples?

Sawyerb commented Nov 9, 2021

I don't see anything wrong with your inference command. And performance can vary a lot depending on the sample.

AndreevP commented Nov 9, 2021

Ok, thank you very much for the help!

AndreevP closed this as completed Nov 9, 2021