hexgrad's activity
@to-be
There are more details at https://hf.co/hexgrad/Kokoro-82M/discussions/21 and my Discord DMs are open if you have more questions, but essentially I am looking for segmented text-audio pairs: likely .txt and .wav pairs, with each .txt being ~500 characters or less (it needs to fit inside the 512-token context hard limit) and the .wav matching the text.
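As a rough sketch of what those segmented pairs look like in practice, a contributor could sanity-check their data before submitting it. The ~500-character cap and .txt/.wav pairing are from the post above; the helper name and the specific checks are my own illustration, not an official validator:

```python
import wave
from pathlib import Path

MAX_CHARS = 500  # rough proxy for the 512-token context hard limit


def validate_pair(txt_path: Path, wav_path: Path) -> list[str]:
    """Return a list of problems with a .txt/.wav pair (empty list = looks OK)."""
    problems = []
    if not wav_path.exists():
        problems.append(f"missing audio for {txt_path.name}")
        return problems
    text = txt_path.read_text(encoding="utf-8").strip()
    if not text:
        problems.append(f"{txt_path.name}: empty transcript")
    if len(text) > MAX_CHARS:
        problems.append(f"{txt_path.name}: {len(text)} chars exceeds ~{MAX_CHARS}")
    # Basic audio sanity check: the file opens as PCM WAV and has nonzero length.
    with wave.open(str(wav_path), "rb") as w:
        duration = w.getnframes() / w.getframerate()
    if duration <= 0:
        problems.append(f"{wav_path.name}: zero-length audio")
    return problems
```

A dataset of `foo.txt`/`foo.wav` pairs could then be swept with one call per stem before anything is DM'd over.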
It's simple: what you put in is what you get out. German support in the future depends mostly on how much German data (synthetic audio + text labels) is contributed.
TLDR: 🚨 Trade Offer 🚨
I receive: Synthetic Audio w/ Text Labels
You receive: Trained Voicepacks for an 82M Apache TTS model
Join https://discord.gg/QuGxSWBfQy to discuss
If your data exceeds quantity & quality thresholds and is approved into the next hexgrad/Kokoro-82M training mix, and you permissively DM me the data under an effective Apache license, then I will DM back the corresponding voicepacks for YOUR data if/when the next Apache-licensed Kokoro base model drops.
What does this mean? If you've been calling closed-source TTS or audio API endpoints to:
- Build voice agents
- Make long-form audio, like audiobooks or podcasts
- Handle customer support, etc.
Then YOU can contribute to the training mix and get useful artifacts in return. ❤️
More details at hexgrad/Kokoro-82M#21
Archive of that article available at https://archive.ph/kX4kp
bf_emma, bf_isabella, bm_george, bm_lewis
Feedback appreciated, whether positive or negative. Non-English languages haven't been validated by the model creator(s), so if you're a native speaker, criticize away!
"Kokoro TTS can now speak Chinese, Korean, and French, in addition to English and Japanese."
Wav converted to mp4 using FFmpeg, since audio attachments aren't allowed in Posts. You may have to unmute the video.
The voice quality actually sounds close to ElevenLabs.
I might've mentioned this elsewhere, but if you plug Kokoro outputs for named ElevenLabs voices into https://elevenlabs.io/ai-speech-classifier you should get very reliable positives (98% confident generated by ElevenLabs).
By ear, I think Kokoro is indeed close to ElevenLabs, especially on certain voices. For Nicole, they are indistinguishable to me. Michael is pretty close; Adam is still somewhat weak.
But StyleTTS usually is not very emotional.
I agree. Kokoro also has 2 specific issues in this area: (1) little to no emotional audio seen during training, and (2) even if there was, the stock voices are average style vectors over 10-100 samples, creating an average/neutral style anyway.
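The averaging effect in point (2) can be illustrated numerically. The vector shapes and the extraction procedure below are hypothetical (Kokoro's actual style pipeline isn't spelled out here); the point is only that a mean over many per-utterance style vectors sits much closer to a neutral speaker mean than any individual sample does:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-utterance style vectors: 50 samples, 128-dim each.
# Each sample deviates from the speaker's mean style (e.g. emotional variation).
speaker_mean = rng.normal(size=128)
per_sample_styles = speaker_mean + rng.normal(scale=1.0, size=(50, 128))

# A stock voicepack as described above: the average style vector over the samples.
voicepack = per_sample_styles.mean(axis=0)

# Averaging cancels sample-to-sample variation, pulling the result toward the
# speaker mean; this is why the stock voice sounds neutral rather than expressive.
deviation_per_sample = np.linalg.norm(per_sample_styles - speaker_mean, axis=1).mean()
deviation_voicepack = np.linalg.norm(voicepack - speaker_mean)
print(deviation_voicepack < deviation_per_sample)  # averaged style is far more neutral
```

Averaging over N independent samples shrinks the expected deviation by roughly a factor of sqrt(N), so even a modestly expressive dataset yields a flat-sounding mean vector.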
self.brag():
Kokoro finally got 300 votes in Pendrokar/TTS-Spaces-Arena after @Pendrokar was kind enough to add it 3 weeks ago. Discounting the small sample size of votes, I think it is safe to say that hexgrad/Kokoro-TTS is currently a top 3 model among the contenders in that Arena. This is notable because:
- At 82M params, Kokoro is one of the smaller models in the Arena
- MeloTTS has 52M params
- F5 TTS has 330M params
- XTTSv2 has 467M params
I used ffmpeg to make the video:
ffmpeg -i input.wav -r 25 -filter_complex "[0:a]compand,showwaves=size=400x400:colors=#ffd700:draw=full:mode=line,format=yuv420p[vout]" -map "[vout]" -map 0:a -c:v libx264 -c:a aac output.mp4
It's expressive, punches way above its weight class, and supports voice cloning. Go check it out!
(Unmute the audio sample below after hitting play)
What tool are you using to generate that video?
No voice cloning yet, but an 80M model I trained makes this:
If the voice sounds familiar, it is, and the classifier seems to agree.
At 500M parameters, it's efficient enough to run on basic hardware but powerful enough for professional use.
This could transform how we produce audio content for news: think instant translated interviews keeping original voices, or scaled-up audio article production!
Demo and Model on the Hub: OuteAI/OuteTTS-0.2-500M h/t @reach-vb
This is conjecture, but it's possible the voice sample for XTTS is in-distribution, i.e. seen during training, and if so you'd expect it to perform better than F5 given the same reference. No knock on XTTS btw, Kokoro is equally guilty for thisβthe voice used in the Arena is also in-distribution.
It would not be surprising to me if voice cloning is simply "looking up" the most similar speaker, or an interpolation of speakers, seen in training. François Chollet has discussed this phenomenon many times w.r.t. LLMs, and I highly recommend listening to his talks.
https://hf.co/spaces/hexgrad/Kokoro-TTS/discussions/3#6744bdea8c689a7071742134
Read more and listen to before/after audio samples at https://hf.co/blog/hexgrad/kokoro-short-burst-upgrade
(Probably would have made that Article a Post instead, if audio could be embedded into Posts.)