[FAQ] Alternatives to Finetuning Kokoro

#19
by hexgrad - opened

A very frequently asked question is how to finetune Kokoro, i.e. continue training the checkpoint uploaded in this repository. This is currently not feasible: for a number of reasons, the additional models required to facilitate this have not yet been open-sourced.

However, there are a few alternatives available to you in the meantime (staying in the realm of open source speech models). Please be aware that there are varying degrees of difficulty to each option, and some could be unsuitable depending on your technical abilities and/or requirements.

1. Use a Speech-to-Speech model like RVC

First, generate the speech using Kokoro. Then pipe the TTS output into RVC-Project/Retrieval-based-Voice-Conversion-WebUI or a similar speech-to-speech model (Beatrice v2, see also w-okada/voice-changer).

Pros: There are many pre-trained RVC models readily available (search "rvc models"), and you can also train your own RVC model.
Cons: You have to run a separate model after TTS, which impacts latency and increases inference-time compute footprint. Results will also probably fall short of a true base model finetune.
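The two-stage flow above can be sketched as follows. This is a minimal orchestration sketch only: `tts` and `convert_voice` are hypothetical placeholders standing in for the real Kokoro and RVC inference calls, not actual APIs from either project.

```python
import numpy as np

SAMPLE_RATE = 24000  # Kokoro outputs 24 kHz mono audio

def tts(text: str) -> np.ndarray:
    """Placeholder for Kokoro TTS: returns one second of silence
    in place of real synthesized speech."""
    return np.zeros(SAMPLE_RATE, dtype=np.float32)

def convert_voice(audio: np.ndarray) -> np.ndarray:
    """Placeholder for an RVC (or similar speech-to-speech) model:
    a real implementation would resample/convert the waveform here."""
    return audio

# Stage 1: text-to-speech, Stage 2: voice conversion.
# Running two models back to back is where the extra latency comes from.
speech = tts("Hello world")
converted = convert_voice(speech)
```

In practice you would save the Kokoro output as a WAV file and feed it to the RVC WebUI (or its inference scripts); the point of the sketch is just the two sequential model calls, which is why this approach costs more inference-time compute than a single finetuned TTS model would.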

2. Train your own StyleTTS 2 Model

Kokoro v0.19 was trained on relatively little data and transparently uses a StyleTTS 2 architecture, so if you have proficiency in training models and the required compute, you can train your own from the public checkpoints. Here are some resources:

It is also possible to train StyleTTS 2 models in other languages, although this can be more difficult than English because of tokenization & g2p and/or data procurement:

Pros: Full customizability and ownership over your trained model.
Cons: Requires compute, data, and technical skills.

3. Zero-shot or train a different TTS architecture

In no particular order, here are some links to other open-source TTS models (although not all are permissive):

Pros: Training may not be required at all (zero-shot), and base models trained on more data may need less finetuning.
Cons: Licenses, parameter counts, and resulting output quality vary.

"Hi, StyleTTS 2 requires diffusion, while Kokoro does not. Does this indicate that there are some differences in their architectures?"

@MonolithFoundation Yes, Kokoro quite transparently omits the style diffusion component of StyleTTS 2, as I personally do not believe it is worth the ~25M additional parameters, but I could be wrong about that.

regardless of whether the additional models required to facilitate finetuning have not yet been open-sourced - please share the process, so we can make the decision for ourselves about the legal applicability / fair use of the model licenses in our individual jurisdictions, rather than making your own legal judgment on our behalf.

@erichartford Kokoro's data provenance is already addressed in https://hf.co/hexgrad/Kokoro-82M#training-details. Due to the reasoning in https://hf.co/hexgrad/Kokoro-82M/discussions/21#67814dc92af1d47cdd6ac407 I likely cannot be more specific than that. I am not a lawyer, and I do not make legal judgments on your behalf. I simply provide the facts as they are, to the extent that I can. The model is also licensed under Apache 2.0, and it does not really get more permissive than that. You are always free to not use the model if you have any reservations.
