hexgrad's activity
@to-be
There are more details at https://hf.co/hexgrad/Kokoro-82M/discussions/21 and my Discord DMs are open if you have more questions, but essentially I am looking for segmented text-audio pairs: likely .txt and .wav pairs, with each .txt being ~500 characters or less (it needs to fit inside the 512-token context hard limit) and the .wav matching the text.
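As a rough sketch of what those segmented pairs look like in practice, a contributor could sanity-check their data before submitting it. The ~500-character cap and .txt/.wav pairing are from the post above; the helper name and the specific checks are my own illustration, not an official validator:

```python
import wave
from pathlib import Path

MAX_CHARS = 500  # rough proxy for the 512-token context hard limit


def validate_pair(txt_path: Path, wav_path: Path) -> list[str]:
    """Return a list of problems with a .txt/.wav pair (empty list = looks OK)."""
    problems = []
    if not wav_path.exists():
        problems.append(f"missing audio for {txt_path.name}")
        return problems
    text = txt_path.read_text(encoding="utf-8").strip()
    if not text:
        problems.append(f"{txt_path.name}: empty transcript")
    if len(text) > MAX_CHARS:
        problems.append(f"{txt_path.name}: {len(text)} chars exceeds ~{MAX_CHARS}")
    # Basic audio sanity check: the file opens as PCM WAV and has nonzero length.
    with wave.open(str(wav_path), "rb") as w:
        duration = w.getnframes() / w.getframerate()
    if duration <= 0:
        problems.append(f"{wav_path.name}: zero-length audio")
    return problems
```

A dataset of `foo.txt`/`foo.wav` pairs could then be swept with one call per stem before anything is DM'd over.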
It's simple: what you put in is what you get out. German support in the future depends mostly on how much German data (synthetic audio + text labels) is contributed.
TLDR: 🚨 Trade Offer 🚨
I receive: Synthetic Audio w/ Text Labels
You receive: Trained Voicepacks for an 82M Apache TTS model
Join https://discord.gg/QuGxSWBfQy to discuss
If your data exceeds quantity & quality thresholds and is approved into the next hexgrad/Kokoro-82M training mix, and you permissively DM me the data under an effective Apache license, then I will DM back the corresponding voicepacks for YOUR data if/when the next Apache-licensed Kokoro base model drops.
What does this mean? If you've been calling closed-source TTS or audio API endpoints to:
- Build voice agents
- Make long-form audio, like audiobooks or podcasts
- Handle customer support, etc.
Then YOU can contribute to the training mix and get useful artifacts in return. ❤️
More details at hexgrad/Kokoro-82M#21
Archive of that article available at https://archive.ph/kX4kp
bf_emma, bf_isabella, bm_george, bm_lewis
Feedback appreciated, whether positive or negative. Non-English languages haven't been validated by the model creator(s), so if you're a native speaker, criticize away!
"Kokoro TTS can now speak Chinese, Korean, and French, in addition to English and Japanese."
Wav converted to mp4 using FFmpeg, since audio attachments aren't allowed in Posts. You may have to unmute the video.
The voice quality actually sounds close to ElevenLabs.
I might've mentioned this elsewhere, but if you plug Kokoro outputs for named ElevenLabs voices into https://elevenlabs.io/ai-speech-classifier you should get very reliable positives (98% confident generated by ElevenLabs).
By ear, I think Kokoro is indeed close to ElevenLabs, especially on certain voices. For Nicole, they are indistinguishable to me. Michael is pretty close; Adam is still somewhat weak.
But StyleTTS usually is not very emotional.
I agree. Kokoro also has 2 specific issues in this area: (1) little to no emotional audio seen during training, and (2) even if there was, the stock voices are average style vectors over 10-100 samples, creating an average/neutral style anyway.
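The averaging effect in point (2) can be illustrated numerically. The vector shapes and the extraction procedure below are hypothetical (Kokoro's actual style pipeline isn't spelled out here); the point is only that a mean over many per-utterance style vectors sits much closer to a neutral speaker mean than any individual sample does:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-utterance style vectors: 50 samples, 128-dim each.
# Each sample deviates from the speaker's mean style (e.g. emotional variation).
speaker_mean = rng.normal(size=128)
per_sample_styles = speaker_mean + rng.normal(scale=1.0, size=(50, 128))

# A stock voicepack as described above: the average style vector over the samples.
voicepack = per_sample_styles.mean(axis=0)

# Averaging cancels sample-to-sample variation, pulling the result toward the
# speaker mean; this is why the stock voice sounds neutral rather than expressive.
deviation_per_sample = np.linalg.norm(per_sample_styles - speaker_mean, axis=1).mean()
deviation_voicepack = np.linalg.norm(voicepack - speaker_mean)
print(deviation_voicepack < deviation_per_sample)  # averaged style is far more neutral
```

Averaging over N independent samples shrinks the expected deviation by roughly a factor of sqrt(N), so even a modestly expressive dataset yields a flat-sounding mean vector.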
self.brag():
Kokoro finally got 300 votes in Pendrokar/TTS-Spaces-Arena after @Pendrokar was kind enough to add it 3 weeks ago. Discounting the small sample size of votes, I think it is safe to say that hexgrad/Kokoro-TTS is currently a top 3 model among the contenders in that Arena. This is notable because:
- At 82M params, Kokoro is one of the smaller models in the Arena
- MeloTTS has 52M params
- F5 TTS has 330M params
- XTTSv2 has 467M params
I used ffmpeg to make the video:
ffmpeg -i input.wav -r 25 -filter_complex "[0:a]compand,showwaves=size=400x400:colors=#ffd700:draw=full:mode=line,format=yuv420p[vout]" -map "[vout]" -map 0:a -c:v libx264 -c:a aac output.mp4
It's expressive, punches way above its weight class, and supports voice cloning. Go check it out!
(Unmute the audio sample below after hitting play)
What tool are you using to generate that video?
No voice cloning yet, but an 80M model I trained makes this:
If the voice sounds familiar, it is, and the classifier seems to agree.
At 500M parameters, it's efficient enough to run on basic hardware but powerful enough for professional use.
This could transform how we produce audio content for news: think instant translated interviews keeping original voices, or scaled-up audio article production!
Demo and Model on the Hub: OuteAI/OuteTTS-0.2-500M h/t @reach-vb
This is conjecture, but it's possible the voice sample for XTTS is in-distribution, i.e. seen during training, and if so you'd expect it to perform better than F5 given the same reference. No knock on XTTS btw, Kokoro is equally guilty for thisβthe voice used in the Arena is also in-distribution.
It would not be surprising to me if voice cloning is simply "looking up" the most similar speaker, or an interpolation of speakers, seen in training. François Chollet has discussed this phenomenon many times w.r.t. LLMs, and I highly recommend listening to his talks.
https://hf.co/spaces/hexgrad/Kokoro-TTS/discussions/3#6744bdea8c689a7071742134
Read more and listen to before/after audio samples at https://hf.co/blog/hexgrad/kokoro-short-burst-upgrade
(Probably would have made that Article a Post instead, if audio could be embedded into Posts.)