[FAQ] Alternatives to Finetuning Kokoro

#19
by hexgrad - opened

A very frequently asked question is how to finetune Kokoro, i.e. continue training the checkpoint uploaded in this repository. This is currently not feasible: for a number of reasons, the additional models required to facilitate this have not yet been open-sourced.

However, there are a few alternatives available to you in the meantime (staying in the realm of open source speech models). Please be aware that there are varying degrees of difficulty to each option, and some could be unsuitable depending on your technical abilities and/or requirements.

1. Use a Speech-to-Speech model like RVC

First, generate the speech using Kokoro. Then pipe the TTS output into RVC-Project/Retrieval-based-Voice-Conversion-WebUI or a similar speech-to-speech model (Beatrice v2, see also w-okada/voice-changer).

Pros: There are many pre-trained RVC models readily available (search "rvc models"), and you can also train your own RVC model.
Cons: You have to run a separate model after TTS, which impacts latency and increases inference-time compute footprint. Results will also probably fall short of a true base model finetune.
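The two-stage flow above can be sketched as follows. This is a minimal orchestration sketch only: `tts` and `convert_voice` are hypothetical placeholders standing in for the real Kokoro and RVC inference calls, not actual APIs from either project.

```python
import numpy as np

SAMPLE_RATE = 24000  # Kokoro outputs 24 kHz mono audio

def tts(text: str) -> np.ndarray:
    """Placeholder for Kokoro TTS: returns one second of silence
    in place of real synthesized speech."""
    return np.zeros(SAMPLE_RATE, dtype=np.float32)

def convert_voice(audio: np.ndarray) -> np.ndarray:
    """Placeholder for an RVC (or similar speech-to-speech) model:
    a real implementation would resample/convert the waveform here."""
    return audio

# Stage 1: text-to-speech, Stage 2: voice conversion.
# Running two models back to back is where the extra latency comes from.
speech = tts("Hello world")
converted = convert_voice(speech)
```

In practice you would save the Kokoro output as a WAV file and feed it to the RVC WebUI (or its inference scripts); the point of the sketch is just the two sequential model calls, which is why this approach costs more inference-time compute than a single finetuned TTS model would.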

2. Train your own StyleTTS 2 Model

Kokoro v0.19 was trained on relatively little data and transparently uses a StyleTTS 2 architecture, so if you have proficiency in training models and the required compute, you can train your own from the public checkpoints. Here are some resources:

It is also possible to train StyleTTS 2 models in other languages, although this can be more difficult than English because of tokenization & g2p and/or data procurement:

Pros: Full customizability and ownership over your trained model.
Cons: Requires compute, data, and technical skills.

3. Zero-shot or train a different TTS architecture

In no particular order, here are some links to other open-source TTS models (although not all are permissive):

Pros: Training may not be required at all (zero-shot), and base models trained on more data may need less finetuning.
Cons: Licenses, parameter counts, and resulting output quality vary.

"Hi, StyleTTS 2 requires diffusion, while Kokoro does not. Does this indicate that there are some differences in their architectures?"

@MonolithFoundation Yes, Kokoro quite transparently omits the style diffusion component of StyleTTS 2, as I personally do not believe it is worth the ~25M additional parameters, but I could be wrong about that.

regardless of whether the additional models required to facilitate finetuning have not yet been open-sourced - please share the process, so we can make the decision for ourselves about the legal applicability / fair use of the model licenses in our individual jurisdictions, rather than making your own legal judgment on our behalf.

@erichartford Kokoro's data provenance is already addressed in https://hf.co/hexgrad/Kokoro-82M#training-details. Due to the reasoning in https://hf.co/hexgrad/Kokoro-82M/discussions/21#67814dc92af1d47cdd6ac407 I likely cannot be more specific than that. I am not a lawyer, and I do not make legal judgments on your behalf. I simply provide the facts as they are, to the extent that I can. The model is also licensed under Apache 2.0, and it does not really get more permissive than that. You are always free to not use the model if you have any reservations.
