Reproducing results for multispeaker model #40

Closed · AndreevP opened this issue Nov 1, 2021 · 7 comments

AndreevP commented Nov 1, 2021

Hi!

I am trying to reproduce the results of the paper.

For data preparation, I use the following commands:

python ../prep_vctk.py \
  --file-list  train-files.txt \
  --in-dir ./ \
  --out vctk-multispeaker-train.4.16000.8192.8192.h5 \
  --scale 4 \
  --sr 16000 \
  --dimension 8192 \
  --stride 8192 \
  --interpolate \
  --batch-size 64 \
  --sam 0.25

python ../prep_vctk.py \
  --file-list val-files.txt \
  --in-dir ./ \
  --out vctk-multispeaker-val.4.16000.8192.8192.h5.tmp \
  --scale 4 \
  --sr 16000 \
  --dimension 8192 \
  --stride 8192 \
  --interpolate \
  --batch-size 64


python ../prep_vctk.py \
  --file-list  train-files.txt \
  --in-dir ./ \
  --out vctk-multispeaker-train.4.16000.-1.8192.h5 \
  --scale 4 \
  --sr 16000 \
  --dimension -1 \
  --stride 8192 \
  --interpolate \
  --batch-size 64 \
  --sam 0.25

python ../prep_vctk.py \
  --file-list val-files.txt \
  --in-dir ./ \
  --out vctk-multispeaker-val.4.16000.-1.8192.h5.tmp \
  --scale 4 \
  --sr 16000 \
  --dimension -1 \
  --stride 8192 \
  --interpolate \
  --batch-size 64
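
As a sanity check on the prepared data, the contents of the generated HDF5 files can be listed with a minimal h5py sketch (it simply enumerates whatever datasets prep_vctk.py wrote, so no dataset names are assumed):

import h5py

def show(name, obj):
    # Print each dataset's path, shape, and dtype.
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype)

with h5py.File("vctk-multispeaker-train.4.16000.8192.8192.h5", "r") as f:
    f.visititems(show)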

For training, I use the following command:

python3 run.py train \
  --train ../data/vctk/multispeaker/vctk-multispeaker-train.4.16000.8192.8192.h5 \
  --val ../data/vctk/multispeaker/vctk-multispeaker-val.4.16000.8192.8192.h5.tmp \
  -e 50 \
  --batch-size 64 \
  --lr 3e-4 \
  --logname multispeaker \
  --model audiotfilm \
  --r 4 \
  --layers 4 \
  --piano false \
  --speaker multi \
  --pool_size 2 \
  --strides 2 \
  --full true

After training, I use the following command to run inference:

python run.py eval \
  --logname ./model.ckpt-53351 \
  --out-label mul-out \
  --wav-file-list ./test_files.txt \
  --r 4 \
  --pool_size 2 \
  --strides 2 \
  --model audiotfilm \
  --speaker multi

I got poor performance from your model; the spectrograms look like this (predicted on top, ground truth on the bottom):

[Image: spectrogram comparison, predicted (top) vs. ground truth (bottom)]
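
For completeness, such a side-by-side spectrogram figure can be produced with a short librosa/matplotlib sketch (the file names below are hypothetical placeholders for one predicted output and its ground-truth reference):

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Hypothetical placeholders: one model output and its reference.
pred, sr = librosa.load("predicted.wav", sr=16000)
gt, _ = librosa.load("ground_truth.wav", sr=16000)

fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
for ax, (y, title) in zip(axes, [(pred, "predicted"), (gt, "ground truth")]):
    # Log-magnitude STFT, shown on a dB scale.
    D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    librosa.display.specshow(D, sr=sr, x_axis="time", y_axis="hz", ax=ax)
    ax.set_title(title)
plt.tight_layout()
plt.show()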

I doubt this behavior is expected. Could you give me a hint as to where I may have gone wrong, or better yet, provide the checkpoint for the multi-speaker model?

Thank you in advance!

AndreevP commented Nov 5, 2021

The article states that you set the pool_size and stride parameters to 2. Is that a mistake, and should they actually be equal to 8?

Sawyerb commented Nov 6, 2021

What SNR and LSD are you seeing? And the pool_size and stride should be 2. Where do you see 8?

AndreevP commented Nov 7, 2021

> What SNR and LSD are you seeing?

Here are logs from the last few epochs:

Epoch 46 of 50 took 509437.803s (1067 minibatches)
training l2_loss/segsnr/LSD: 0.006147 15.546462 2.536153
validation l2_loss/segsnr/LSD: 0.004268 15.627095 2.583452
Full SNR: 38456.16613149643
Epoch 47 of 50 took 522063.082s (1067 minibatches)
training l2_loss/segsnr/LSD: 0.006152 15.538509 2.629113
validation l2_loss/segsnr/LSD: 0.004270 15.636318 2.687247
Full SNR: 38445.54378128052
Epoch 48 of 50 took 534709.013s (1067 minibatches)
training l2_loss/segsnr/LSD: 0.006091 15.602586 2.587814
validation l2_loss/segsnr/LSD: 0.004269 15.650008 2.648711
Full SNR: 38460.77331352234
Epoch 49 of 50 took 547406.473s (1067 minibatches)
training l2_loss/segsnr/LSD: 0.006117 15.582755 2.603747
validation l2_loss/segsnr/LSD: 0.004257 15.657111 2.661661
Full SNR: 38513.77068042755
Epoch 50 of 50 took 560018.804s (1067 minibatches)
training l2_loss/segsnr/LSD: 0.006142 15.550923 2.602868
validation l2_loss/segsnr/LSD: 0.004290 15.606818 2.671877
Full SNR: 38329.45089030266

> And the pool_size and stride should be 2. Where do you see 8?

I saw it in this issue: #27 (comment).

Sawyerb commented Nov 7, 2021

Looks like you are reproducing the paper's results. We reported an SNR of 15.0 and an LSD of 2.7 on the multispeaker task with an upsampling ratio of 4. Multispeaker with r=4 is a hard task.

And that's a typo on my part. Thanks for flagging it. I've updated my answer.

AndreevP commented Nov 8, 2021

I also thought maybe there is a problem with my inference command. Could you check it?

I am afraid of reporting lower results for your method than the actual ones, because the samples presented on this page, https://anonymousqwerty.github.io/audio-sr/, seem much better than what I get. In your experience, do you think your model may perform as I showed above on some samples?

Sawyerb commented Nov 9, 2021

I don't see anything wrong with your inference command. And performance can vary a lot depending on the sample.

AndreevP commented Nov 9, 2021

Ok, thank you very much for the help!

AndreevP closed this as completed Nov 9, 2021