[DATA] Synthetic Data Trade Offer

#21
by hexgrad - opened

Pinned discussion for parties interested in the trade offer proposed here: https://hf.co/posts/hexgrad/418806998707773

Discord is the best place to discuss this — https://discord.gg/QuGxSWBfQy — but for those not able or willing to use Discord, here is the next best option.

I am seeking synthetic audio for (multiple speakers & languages):

  1. OpenAI: GPT-4o AVM, Realtime API, HD TTS
  2. Gemini 2.0 Flash: Native Audio
  3. ElevenLabs: Full, not Flash or Turbo

EDIT: 2025 Jan 10. In general, 24 kHz WAV (e.g. pcm_24000.wav) is likely the most preferred audio file format. Higher sample rates are also fine, but they will be downsampled to 24 kHz anyway. I estimate that 24 kHz WAV might be preferred over even higher-sample-rate MP3 (e.g. 44.1 kHz), but I may defer to your judgement if this is your native language and/or you have a keen ear favoring the MP3.
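
For illustration, here is a minimal sketch of pulling a single 24 kHz utterance from OpenAI's HD TTS over plain HTTPS. The endpoint, parameter names, and the claim that the WAV output is 24 kHz are my reading of the public docs, so verify them yourself; the voice name, input text, and file name are placeholders.

import os
import requests

# Minimal sketch: request one utterance from OpenAI HD TTS as WAV.
# The /v1/audio/speech route, parameter names, and "wav" response format
# are assumptions to verify against the current API docs.
resp = requests.post(
    "https://api.openai.com/v1/audio/speech",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "tts-1-hd",
        "voice": "alloy",  # placeholder voice
        "input": "The quick brown fox jumps over the lazy dog.",  # placeholder text
        "response_format": "wav",  # reportedly 24 kHz output; downsample later if not
    },
    timeout=60,
)
resp.raise_for_status()
with open("voice-a-000.wav", "wb") as f:
    f.write(resp.content)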

To qualify for Voicepack(s):

  • I need to approve both the audio & text data before it enters the training mix. I may refuse some or all of your data for quality reasons. Please describe the quantity/quality/taxonomy of your data first; if that clears, send some samples; only if those also check out should you send the rest or the whole payload.
  • At least 1 approved hour per voice/tone. If you have speaker X whispering for 40 minutes and another 20 minutes of them shouting, that is not enough. This threshold may go up later, but a previously approved contribution will still get Voicepacks delivered in return.
  • Audio should be clean with minimal artifacts. Text labels are expected to be perfect or near perfect, since you're calling the API and should know what text you put in.
  • Text should be aligned to the corresponding audio segments obtained over the API. Don't concatenate the segments into a giant multi-hour file and then dump the entire transcript.
  • You send the data directly to me under an Apache license. Likewise, you will directly receive a corresponding Apache-licensed Voicepack in return, if/when the model finishes training.

Other providers / small quantity / lower quality / unlabeled / unsegmented data can still be contributed, but the likelihood of inclusion in the training mix plummets, and I cannot promise any delivered voicepacks resulting from those.

In addition to the above, here is a disclaimer (adapted from the OpenAI investment disclaimer):

IMPORTANT
Contributing data to the Kokoro training mix is not a guaranteed investment
Contributors could deliver their data and not see any return
It would be wise to view any contribution in the spirit of a donation, with the understanding that there are risk factors that would delay or prevent the return of a trained voicepack, including but not limited to:

  • GPU access could be disrupted
  • The model could fail to converge
  • The model trainer could be given the Boeing whistleblower treatment

Notwithstanding the above, the model trainer will, to the best of his ability, deliver the promised artifacts.

Potential contributors should understand that they are free to pursue the following options instead:

  • Keep the data to themselves
  • Train their own models
  • Continue using vendors

This post may be edited later, but approved contributions will stand unless otherwise notified.

EDIT: 2025 Jan 9

FAQ: If you (the contributor) get a voicepack.pt from me, will that voicepack file also be uploaded and Apache open-sourced to the public?

Not exactly. When you contribute synthetic data and I compile a voicepack.pt from it, I do not intend to release your voicepack.pt as-is, unless you explicitly give me permission to do so. Note that if voicepack demand increases, those who grant that broad permission might have higher priority for entering the training mix.

However, for all contributors in general, I intend to follow a "rule of three" with at least 3-way minority mixes. Let's say Alice gives me 4 hours of VOICE_A. Bob and Charlie each give me 3 hours of VOICE_A as well (so 10 hours total). I may compile & upload a voicepack drawing minority (<50%) slices from contributors Alice, Bob, and Charlie, e.g. 2h each from Alice/Bob/Charlie is a valid configuration. This sort of "mixing" can also happen across different voices, such as VOICE_A, VOICE_B, and VOICE_C, as long as no slice of the pie becomes a majority.
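
As a rough illustration of that constraint (hypothetical numbers, not a tool I actually run), a mix is valid when no single contributor's slice reaches 50% of the voicepack's training hours:

# "Rule of three" sketch with the Alice/Bob/Charlie example above.
# Hours per contributor going into one hypothetical voicepack:
slices = {"Alice": 2.0, "Bob": 2.0, "Charlie": 2.0}

total = sum(slices.values())
largest_share = max(slices.values()) / total
print(f"largest share: {largest_share:.0%}")  # 33%
assert largest_share < 0.5, "no contributor may supply a majority slice"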

You should also understand that contributing data broadly makes the base Kokoro model better, even if the style vectors from your voicepacks are never released or used at scale by people other than yourself.

Any changes will be communicated in further edits. Thanks for the interest!

hexgrad pinned discussion

I do not understand the concept (why synthetic rather than human-pronounced?), but I am willing to help.

Gemini 2.0 Flash: Native Audio, Czech language sample:

Zde je několik pěkných částí z R.U.R. Karla Čapka, které by se ti mohly líbit:

  1. Popis Heleny a její idealistické představy:

HELENA: „Já bych chtěla… víte… aby… aby roboti měli duši.“

DOMIN: „Duši? Pro Boha živého, k čemu by robotům byla duše?“

HELENA: „Aby… aby nebyli jenom stroje… aby… aby cítili… radost… smutek… aby…“

DOMIN: „Aby se bouřili? Aby stávkovali? Aby nám dělali revoluci? Děkuju pěkně, slečno Gloryová, my nepotřebujeme stroje s duší.“

Tato pasáž ukazuje Heleninu naivní, ale lidskou touhu po tom, aby roboti nebyli pouhými nástroji, ale bytostmi s citem. Kontrastuje s Dominovým pragmatickým pohledem, který vnímá roboty pouze jako pracovní sílu.

  2. Rozhovor Heleny s robotem Radiem, který ukazuje na probouzející se city robotů:

HELENA: „Máš mě rád?“

RÁDIO: „Mám rád lidi.“

HELENA: „A mne?“

RÁDIO: „Ty jsi člověk.“

HELENA: „To vím, ale… máš mě rád víc než ostatní lidi?“

RÁDIO: „Nevím.“

Tento úryvek naznačuje, že v robotech se začíná probouzet něco víc než jen naprogramované reakce. Rádio nedokáže definovat své city, ale jeho váhání a nejistota naznačují, že k Heleně cítí něco zvláštního.

  3. Alquistova úvaha o smyslu lidské práce:

ALQUIST: „Člověk pracuje… aby žil. Ale má taky pracovat, aby měl co dělat. Práce… to není jenom živobytí, to je taky… no… to je taky… radost.“

Alquist, stavitel, je v dramatu jediný, kdo si zachovává zdravý rozum a úctu k lidské práci. Jeho slova zdůrazňují, že práce není jen nutností, ale i zdrojem uspokojení a smyslu života.

  4. Závěrečná scéna, kdy se rodí nová naděje:

PRIMUS: „My… my jsme… jiní.“

HELENA (robotka): „My… my se máme rádi.“

V závěru hry, po zkáze lidstva, se objevuje nová naděje v podobě dvou robotů, kteří k sobě cítí lásku. Tento moment symbolizuje nový začátek a možná i naději na lepší budoucnost, založenou na citech a vzájemné úctě.

Která z těchto částí tě nejvíce zaujala? Pokud máš nějaké další otázky ohledně R.U.R., neváhej se zeptat.

Video: https://cdn-uploads.huggingface.co/production/uploads/63c69e74f3f24996049c3b20/x3H3vzXaaQ9v9sh191On-.webm

We're still waiting for an encoder release for fine-tuning/training!

@AISafety I only know English (and definitely don't know Czech) so it is difficult for me to evaluate the text you provided, and I also cannot access the audio from that link (possibly broken). Recommend you join the Discord server here https://discord.gg/QuGxSWBfQy and/or DM me to discuss.

Keep in mind I need to evaluate the audio/text data for quality, and I reserve the right to decline data (audio/text pairs) if training on it could make the model worse. To answer your question here:

I do not understand the concept (why synthetic rather than human-pronounced?), but I am willing to help.

The best human data is great (possibly better than synthetic data can ever be, barring AGI) but also very costly to obtain, label, and/or clean. You can get permissive human data for free or cheap, but too often it is low quality, so training on it likely lowers the quality of the resulting model.

Synthetic data:

  • Can be obtained in relatively large quantities for zero/low marginal cost by existing API consumers. Zero marginal cost assumes they would have called such endpoints anyway, to deliver an existing product.
  • Can have really clean labels, often much cleaner than data gathered in the wild. The best providers have good text-audio alignment and minimal audio artifacts.
  • Can produce frontier, or at least compelling, models. As of this post, the top 3 trending models on Hugging Face are microsoft/phi-4, deepseek-ai/DeepSeek-V3, and this model: Kokoro-82M. All 3 involve training on synthetic data to some degree.

Since you are asking for Apache licenses from those who give you the samples, why not publish a public dataset of all the contributed samples, in order to have a synthetic multimodal language dataset released on HF?
I think many OSS projects would love to have open multilingual datasets.

I'm happy to give you my share of samples too

P.S.
I'd note that some subtle rules may apply to the transcription texts:

  • do you want them with punctuation?
  • should numbers be spelled out in text (e.g. 100 = one hundred)? Otherwise multilingual issues may arise (e.g. 100 in Italian = cento); see the sketch after this list
  • strip away any symbols except punctuation?
  • etc.
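
For instance, per-language number spelling can be illustrated with the num2words package (just a sketch of the issue, not an agreed pipeline step):

from num2words import num2words  # pip install num2words

# Spelling out digits avoids language-dependent readings of "100".
print(num2words(100, lang="en"))  # one hundred
print(num2words(100, lang="it"))  # cento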

Since you are asking for Apache licenses from those who give you the samples, why not publish a public dataset of all the contributed samples, in order to have a synthetic multimodal language dataset released on HF?
I think many OSS projects would love to have open multilingual datasets.

If publishing on HF is what the original contributor intends or desires, I welcome and encourage that choice. Keep in mind I am asking for an Apache license, not an exclusive license. However, I definitely will not be the one to upload their raw data.

Uploading audio & text could make it trivial for the original provider to identify and ban the original contributor of the data, which would be entirely counterproductive to what I'm trying to accomplish.

For the same reason, journalists don't name their anonymous sources. No one will ever confide privately in a journalist who names their anonymous sources. But if someone wants to step forward and say something publicly, that is well within their rights.

First of all, thank you very much for this great work you are doing. I'm really quite impressed with kokoro's output, considering how resource-efficient the system is and how quickly it generates the audio output.

I would like to contribute with high-quality data (openAI and/or Elevenlabs), but I haven't quite understood how much data is needed to train a new voice? For example, I would like to have a German voice: how many hours of high-quality audio material do you need for this?

@phi0112358 As stated in the OP, the minimum entry point is currently set at 1 hour per voice; below that tends to yield bad results. In past experiments I have generally observed smooth quality improvements up to the 10 hour mark. Beyond that, it probably still improves, but it is more difficult to tell, at least in the current training configuration.

However, you can never have enough data, and even if you already have, say, 100 hours of data for a single voice, I might just ask for the best 10 hours for now and come back later for more if needed or if overfitting becomes an issue.

New languages: For new languages, such as German, it is recommended the sum total of German audio (across multiple speakers if relevant) exceeds 10 hours if possible.

Emotions: If you manipulate a TTS voice to shout for X hours and then whisper for Y hours, those are different "voices" for the purposes of counting/voicepacks, and we want both X > 1 hour and Y > 1 hour. The same goes for emotions (happy, sad, sarcastic, etc.). Most TTS voices sound relatively flat, so this is likely not an issue unless you are using OpenAI's Advanced Voice Mode and maybe Gemini 2.0 Flash. It is assumed you label the emotions/tones accordingly; you will receive voicepacks in the same way you labeled them. E.g. if you hand me 4 folders titled VOICE_A_WHISPERING, VOICE_A_NEUTRAL, VOICE_B_SHOUTING, VOICE_B_NEUTRAL, you will get back 4 voicepacks matching that breakdown.
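
To sanity-check that per-voice/tone threshold before sending anything, a contributor could sum durations per folder. A minimal sketch, assuming WAV files arranged in folders named like VOICE_A_WHISPERING and the soundfile package:

from pathlib import Path

import soundfile as sf  # pip install soundfile

# Sum audio duration per voice/tone folder, e.g. VOICE_A_WHISPERING/*.wav.
root = Path("contribution")  # hypothetical root folder holding the voice/tone folders
for folder in sorted(p for p in root.iterdir() if p.is_dir()):
    seconds = sum(sf.info(str(wav)).duration for wav in folder.glob("*.wav"))
    hours = seconds / 3600
    status = "ok" if hours >= 1.0 else "below the 1 hour threshold"
    print(f"{folder.name}: {hours:.2f} h ({status})")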

Okay, thanks, that was helpful. And what should the labeling look like? Let's assume I have a folder VOICE_A_NEUTRAL with:

labels.json
voice-a-000.mp3
voice-a-001.mp3
voice-a-002.mp3
...
etc

should the labels.json for example look like this:

{
  "samples": [
    {
      "audio_file": "voice-a-000.mp3",
      "text": "Hallo, ich heiße Martin.",
      "duration": 2.5,
      "speaker_id": "a"
    },
    {
      "audio_file": "voice-a-001.mp3",
      "text": "Kannst du mir beim Tragen helfen?",
      "duration": 3.1,
      "speaker_id": "a"
    },
    {
      "audio_file": "voice-a-002.mp3",
      "text": "Nein, so meinte ich das nicht.",
      "duration": 2.8,
      "speaker_id": "a"
    }
  ]
}

would this be enough or do you need some other metadata?

@phi0112358 I think that is mostly fine. I edited the OP to include the following:

EDIT: 2025 Jan 10. In general, 24 kHz WAV (e.g. pcm_24000.wav) is likely the most preferred audio file format. Higher sample rates are also fine, but they will be downsampled to 24 kHz anyway. I estimate that 24 kHz WAV might be preferred over even higher-sample-rate MP3 (e.g. 44.1 kHz), but I may defer to your judgement if this is your native language and/or you have a keen ear favoring the MP3.

If you have the option to choose >=24khz wav, go with that. If you have an attuned ear, and you think the wav sounds worse than mp3, you can choose to switch back to mp3 instead.

Generally I recommend that you use Discord DMs for finer-grained questions and ultimately pass the data to me over there, for your sake to avoid retribution, unless you intend to publish the dataset on HF (which is also fine).

If you called the API individually for each utterance, there should be no further steps.

But on the other hand, if you split larger chunks into sentences yourself—especially for turn-based chat—I would ask that you try to keep the splits in-order in groups. For example:

User: Give me 5 things I should bring on a camping trip.

Assistant:
Sure! Here are 5 things you should bring on a camping trip:
1. ...
2. ...
3. ...
4. ...
5. ...
Let me know if you need anything else!

User: ...

In this example turn, the optimal training split is at least 7 ways, one for each line of the Assistant reply: the intro line, the five list items 1-5, and the outro line. For the 7 splits, the metadata would look like this instead:

{
  "camping_trip": [
    {
      "text": "Sure! Here are 5 things you should bring on a camping trip:",
      "audio_file": ...
    },
    {
      "text": "1. ...",
      "audio_file": ...
    },
    ...
    {
      "text": "Let me know if you need anything else!",
      "audio_file": ...
    }
  ],
  "turn_B": [
    {...},
    {...},
    ...
  ],
  "turn_C": ...
}

Note that the array structure for camping_trip keeps the sentences in order. You can also communicate this information via the folder structure, as long as the ordering information is present. This is an advanced & optional technique, however, and only applicable to splits on larger chunks of audio.
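
As an optional illustration (my own sketch, not a required script), here is how that ordered metadata could be read back with each turn's sentences kept in order:

import json

# Load the turn-level metadata shown above: each turn maps to an ordered
# list of {"text": ..., "audio_file": ...} segments.
with open("labels.json", encoding="utf-8") as f:
    turns = json.load(f)

for turn_name, segments in turns.items():
    print(f"== {turn_name} ==")
    for i, seg in enumerate(segments):
        # The array index carries the ordering information that must be preserved.
        print(f"{i:03d} {seg['audio_file']}: {seg['text']}")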

Thank you so much for the detailed explanation. I just have one more question about the sampling rate. When I try to export to wav format with Audacity, there is a choice of 22050 Hz, 32000 Hz, 44100 Hz etc. I can't find 24000 Hz here. Instead, I find “Signed 24-bit PCM” under the Encoding option. Is this perhaps what you mean by 24 kHz, or should I export wav files with ffmpeg and force 24 kHz?

Edit: And Oh, sorry for writing here again, but unfortunately I don't have a Discord account - but I'm trying to create one right now, so if I'm successful I'll probably send messages there in the future :))

When I try to export to wav format with Audacity, there is a choice of 22050 Hz, 32000 Hz, 44100 Hz etc. I can't find 24000 Hz here.

In that case, 44100 Hz wav is totally fine. (Better to downsample later instead of upsample.)
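
(As an aside, "Signed 24-bit PCM" in Audacity is a bit depth, not a sample rate.) If you do want to resample to 24 kHz yourself, here is a minimal sketch assuming the librosa and soundfile packages; file names are placeholders:

import librosa  # pip install librosa
import soundfile as sf  # pip install soundfile

# Resample a 44.1 kHz export down to 24 kHz, 16-bit PCM WAV.
audio, sr = librosa.load("voice-a-000_44k.wav", sr=24000, mono=True)
sf.write("voice-a-000_24k.wav", audio, 24000, subtype="PCM_16")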

This is a very nice open dataset already hosted here: https://huggingface.co/datasets/SPRINGLab/IndicTTS-Hindi
This is a Hindi dataset with both male and female voices. It has high-quality audio along with Indic text, already split into segments of fewer than 500 characters.
Please see if this can be piped in to generate a new voice and add a new language.

Dataset Details
Language: Hindi
Total Duration: ~10.33 hours (Male: 5.16 hours, Female: 5.18 hours)
Audio Format: WAV
Sampling Rate: 48000Hz
Speakers: 2 (1 male, 1 female native Hindi speakers)
Content Type: Monolingual Hindi utterances
Recording Quality: Studio-quality recordings
Transcription: Available for all audio files

@jslinuxta I'm already aware of the IndicTTS-Hindi dataset and have actually had my eye on it for a while; it's been sitting on my disk for more than a month. The problem is that I do not know Hindi, and I don't know people who know Hindi. A few hours ago I opened a channel #multlingual-g2p in the Kokoro Discord server:

G2P is graphemes to phonemes: the process of taking text (e.g. text => tˈɛkst) and converting it into symbols that are easier for a machine to read. Just like LLMs tokenize text, I view G2P as tokenization for TTS models, maybe even more important. See https://hf.co/spaces/hexgrad/Misaki-G2P for an example of how this might work in English.
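
For a quick feel of what Hindi G2P output could look like, one starting point is the phonemizer package with the espeak-ng backend (an assumption on my part, not a statement that Kokoro will use it; espeak-ng must be installed separately, and the output would still need review by a Hindi speaker):

from phonemizer import phonemize  # pip install phonemizer; requires espeak-ng on the system

# Rough Hindi grapheme-to-phoneme pass via espeak-ng.
text = "नमस्ते, आप कैसे हैं?"
print(phonemize(text, language="hi", backend="espeak", strip=True))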

Me, the author: I only speak English. With guesswork and help from others, such as @Respair for JP, I kind of know how to do Chinese, Japanese, and Korean.

If you want to add another language that is not yet supported in Kokoro, I would love your input on how to do this properly.

If you know how to do Hindi G2P, I welcome your opinion. Thanks!
