Vikhr Salt is a multimodal model built on a pre-trained large language model and extended with new audio tokens to handle both TTS (text-to-speech) and ASR (automatic speech recognition) tasks. The model supports two audio-encoding variants, Encodec and SpeechTokenizer, and achieves stable training by carefully tuning numerical-precision settings. This approach lets Vikhr Salt leverage pre-existing LLM knowledge while effectively generating and understanding speech, marking a step forward in multimodal learning.
Ksenia Sycheva, Konstantin Korolev, Aleksandr Nikolic
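Conceptually, extending a text-only LLM with discrete audio tokens amounts to enlarging the tokenizer vocabulary by one token per audio code and resizing the model's embedding matrix to match. The sketch below illustrates that pattern with Hugging Face `transformers`; the base model name, codebook size, and token format are illustrative assumptions, not the repository's actual choices.

```python
# Minimal sketch of adding discrete audio tokens to a pre-trained LLM.
# BASE_MODEL, NUM_AUDIO_CODES, and the "<audio_i>" format are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "meta-llama/Llama-2-7b-hf"  # hypothetical base LLM
NUM_AUDIO_CODES = 1024                   # assumed size of the audio codebook

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# One new token per discrete audio code: "<audio_0>" ... "<audio_1023>".
audio_tokens = [f"<audio_{i}>" for i in range(NUM_AUDIO_CODES)]
tokenizer.add_tokens(audio_tokens, special_tokens=True)

# Grow the input embeddings and output head to cover the enlarged
# vocabulary; the new rows are learned during multimodal fine-tuning.
model.resize_token_embeddings(len(tokenizer))
```

With this setup, a TTS example can be serialized as text tokens followed by audio tokens, and an ASR example as audio tokens followed by text, so both tasks reduce to ordinary next-token prediction.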
To tokenize the data, run `prepare_data.py`. Configs for the different tokenizers (SpeechTokenizer, WavTokenizer, FishTokenizer) are available in `configs/quantization/`.
```bash
python prepare_data.py --config configs/quantization/<your-tokenizer-config>.yaml
```
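The exact schema of these configs is repository-specific; as a rough sketch, a tokenizer config along the lines below would name the quantizer and its settings (all keys here are assumed for illustration):

```yaml
# configs/quantization/<your-tokenizer-config>.yaml (hypothetical sketch;
# key names are illustrative, not the repository's actual schema)
quantizer: speechtokenizer   # one of: speechtokenizer, wavtokenizer, fishtokenizer
n_codebooks: 4               # SpeechTokenizer accepts 1-8; WavTokenizer is fixed at 1
input_dir: data/raw          # assumed location of raw audio
output_dir: data/tokenized   # assumed destination for token sequences
```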
Tokenization can be configured differently for TTS and ASR:
- a different number of tokens
- different tokenizers

To do that, specify the quantizer type and the number of codebooks for each task (a hypothetical sketch follows the notes below). Example configs can be found in `configs/quantization/`. Notes:
- music and other non-speech data are only supported by this version of WavTokenizer
- WavTokenizer has a fixed number of codebooks (1); for SpeechTokenizer, any value from 1 to 8 can be chosen
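As a hypothetical sketch of per-task settings (key names assumed, not the repository's actual schema), a config that tokenizes the two tasks differently might look like:

```yaml
# Hypothetical per-task quantization section; names are illustrative only.
tts:
  quantizer: speechtokenizer
  n_codebooks: 8        # more codebooks -> finer acoustic detail for generation
asr:
  quantizer: wavtokenizer
  n_codebooks: 1        # WavTokenizer's codebook count is fixed at 1
```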
For a single GPU:

```bash
source scripts/run_me.sh
```

For multiple GPUs with DeepSpeed (ds2):

```bash
source scripts/run_me_ds2.sh
```
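Assuming `ds2` refers to DeepSpeed ZeRO stage 2, the multi-GPU script would typically point the launcher at a DeepSpeed JSON config along these lines (batch sizes and precision here are placeholder values, not the repository's):

```json
{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```

ZeRO stage 2 shards optimizer states and gradients across GPUs, which cuts per-device memory without changing the model's math.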