FSQ training collapse #226
-
I was wondering if there is a way to improve on the straight-through estimator by using the same tricks as in https://github.com/necla-ml/Diff-JPEG/blob/main/diff_jpeg/rounding.py, i.e. either soft rounding in the backward pass, or what they call polynomial rounding.
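For reference, a minimal sketch of the soft-rounding idea (hard round in the forward pass, gradient of a smooth surrogate in the backward pass). The surrogate and the `alpha` sharpness are illustrative assumptions, not the Diff-JPEG implementation:

```python
import math
import torch

def soft_round(x: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    # Smooth approximation of round(); approaches hard rounding as alpha grows.
    m = torch.floor(x) + 0.5
    return m + 0.5 * torch.tanh(alpha * (x - m)) / math.tanh(alpha / 2)

def round_soft_ste(x: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    # Forward pass returns the exact round; backward pass uses the gradient
    # of the soft surrogate instead of the straight-through identity.
    s = soft_round(x, alpha)
    return s + (torch.round(x) - s).detach()
```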
-
Right, even with the suggested weight initialization, I find that I get pretty poor codebook utilization, despite what the paper says.
I'm training on the LibriTTS dataset. Once trained, I plot a histogram of codebook usage over the validation set and get the following:
[histogram of codebook usage on the validation set]
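A sketch of how such a usage histogram can be computed; `plot_code_usage`, the flat tensor of FSQ indices `codes`, and `codebook_size` (the product of the FSQ levels) are assumed names, not part of the library:

```python
import torch
import matplotlib.pyplot as plt

def plot_code_usage(codes: torch.Tensor, codebook_size: int) -> float:
    # Count how often each code index appears over the validation set and
    # report the fraction of codes used at least once (the "utilization").
    counts = torch.bincount(codes.flatten(), minlength=codebook_size).float()
    utilization = (counts > 0).float().mean().item()
    plt.bar(range(codebook_size), counts.cpu().numpy())
    plt.xlabel("code index")
    plt.ylabel("count")
    plt.title(f"codebook utilization: {utilization:.1%}")
    plt.show()
    return utilization
```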
-
When training a neural audio codec model, e.g. SoundStream, EnCodec, TS3-Codec, etc., with (Grouped)(Residual)FSQ, training can completely collapse without careful weight initialization in the final projection layer of the encoder.
In all my tests, the loss gets stuck at some value and stays there for 500 epochs, even with learning-rate warmup and all the standard tricks. Looking at the quantized output of FSQ, all values are +/- 1, meaning the encoder outputs were too large and, once tanh-ed, yielded +/- 1. In this situation the gradients go nowhere, so everything gets pushed to the outer boundaries of the FSQ hypercube. The only way I get around that is if the projection layer in FSQ has a very small weight initialization and a zero-initialized bias. That keeps the encoder outputs small early on and allows gradients to flow.
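A tiny repro of that failure mode, with an assumed "too large" activation scale of +/- 10:

```python
import torch

# +/- 10 stands in for oversized encoder outputs; it is an assumed value,
# not one taken from an actual training run.
sign = torch.randint(0, 2, (4, 8)).float() * 2 - 1
z = (sign * 10.0).requires_grad_()                            # saturated pre-tanh activations
bounded = torch.tanh(z)                                       # FSQ-style bound to (-1, 1)
codes = bounded + (torch.round(bounded) - bounded).detach()   # straight-through round
codes.sum().backward()
print(codes.unique())       # tensor([-1., 1.]): every code sits on a corner
print(z.grad.abs().max())   # ~1e-8, because tanh'(z) = 1 - tanh(z)^2 has saturated
```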
So basically for this layer:
`vector-quantize-pytorch/vector_quantize_pytorch/finite_scalar_quantization.py`, line 109 at commit 2b367e5
I'm having to add something like:
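Roughly this kind of re-initialization; the std of 1e-2 is illustrative, and `project_in` as the name of the FSQ input projection is an assumption:

```python
import torch.nn as nn

def small_init_fsq_projection(proj: nn.Linear, std: float = 1e-2) -> None:
    # Small-gain weights and a zero bias keep the pre-tanh activations near
    # zero at the start of training, so codes are not pinned to the corners.
    nn.init.normal_(proj.weight, mean=0.0, std=std)
    if proj.bias is not None:
        nn.init.zeros_(proj.bias)

# Usage (assuming the FSQ input projection is an nn.Linear called `project_in`):
# small_init_fsq_projection(fsq.project_in)
```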
Has anybody observed behaviour like this?