# docs: add training.py usage instructions to README #102

Status: Open · wants to merge 13 commits into `main`
68 changes: 68 additions & 0 deletions `README.md`

@@ -75,6 +75,74 @@

This should produce a `sample.wav` file in your project root directory.

_For repeated sampling we highly recommend using the gradio interface instead, as the minimal example needs to load the model every time it is run._


## Fine-Tuning / Training

We provide a script, `training.py`, that demonstrates how to fine-tune or adapt Zonos on your own data, or on a public dataset such as Mozilla Common Voice.


### Requirements

- A GPU with sufficient VRAM (6 GB+ recommended). CPU training is possible but extremely slow.
- PyTorch and torchaudio
- The Hugging Face datasets library
- Enough disk space to download your chosen dataset (e.g., Common Voice can be quite large depending on language).
- If you plan to train the “hybrid” version, you must install CUDA-specific requirements (see Installation).

### Usage

#### 1. Clone or download the Zonos repository:
```bash
git clone https://github.com/Zyphra/Zonos.git
cd Zonos
```
#### 2. Install dependencies (e.g., in a virtual environment):
```bash
uv sync
uv sync --extra compile # optional but needed to run the hybrid
uv pip install -e .
```
#### 3. Edit training parameters if needed. By default, `training.py` uses:

- **Model:** `Zyphra/Zonos-v0.1-transformer`
- **Dataset:** `mozilla-foundation/common_voice_17_0`
- **Language:** `uz`
- **Hyperparameters:** `num_epochs=10`, `batch_size=8`, `learning_rate=1e-4`
- **Output directory:** `checkpoints`
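The defaults above map naturally onto a command-line interface. As a rough sketch (the flag names here mirror the options described in this PR, but the actual argument parsing in `training.py` may differ):

```python
import argparse

# Illustrative sketch of the documented defaults; check training.py for the
# authoritative flag names and defaults.
parser = argparse.ArgumentParser(description="Fine-tune Zonos (sketch)")
parser.add_argument("--model_path", default="Zyphra/Zonos-v0.1-transformer")
parser.add_argument("--dataset_name", default="mozilla-foundation/common_voice_17_0")
parser.add_argument("--language", default="uz")
parser.add_argument("--num_epochs", type=int, default=10)
parser.add_argument("--batch_size", type=int, default=8)
parser.add_argument("--learning_rate", type=float, default=1e-4)
parser.add_argument("--output_dir", default="checkpoints")

# Parsing an empty argument list yields the defaults listed in step 3.
args = parser.parse_args([])
```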

#### 4. Run the training script:
```bash
uv run training.py
```

#### The script will:

- Load the specified dataset and the Zonos model
- Resample audio to 16 kHz (adjust in code if necessary)
- Compute speaker embeddings for each sample
- Prepare text/language conditioning
- Forward through Zonos to predict codes
- Calculate the cross-entropy loss
- Save periodic checkpoints to the `checkpoints` folder
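The cross-entropy step above scores the model's predicted code distributions against the target codes. A toy, dependency-free illustration of that computation (the real script uses PyTorch's loss on logits, not this hand-rolled version):

```python
import math

def cross_entropy(pred_probs, target_codes):
    """Mean negative log-likelihood of the target codes under the
    predicted per-sample probability distributions."""
    total = 0.0
    for probs, code in zip(pred_probs, target_codes):
        total += -math.log(probs[code])
    return total / len(target_codes)

# Two samples, each with a toy vocabulary of 4 codes.
probs = [
    [0.1, 0.7, 0.1, 0.1],      # confident about code 1
    [0.25, 0.25, 0.25, 0.25],  # uniform: maximally uncertain
]
targets = [1, 2]
loss = cross_entropy(probs, targets)  # lower when predictions match targets
```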

#### 5. Customizing training:
- **Change model:** Pass `--model_path` to specify another pretrained checkpoint (e.g., the hybrid).
- **Change dataset:** Pass `--dataset_name` with a Hugging Face dataset or your custom dataset.
- **Modify hyperparameters:** e.g., `--learning_rate 1e-5`, `--num_epochs 5`, etc.
- **Output directory:** Use `--output_dir` to choose a different folder for checkpoints.

#### Example command:
```bash
uv run training.py \
  --model_path Zyphra/Zonos-v0.1-hybrid \
  --dataset_name mozilla-foundation/common_voice_17_0 \
  --language uz \
  --output_dir checkpoints \
  --num_epochs 10 \
  --batch_size 8 \
  --learning_rate 1e-4
```

## Features

- Zero-shot TTS with voice cloning: Input desired text and a 10–30 s speaker sample to generate high-quality TTS output
13 changes: 13 additions & 0 deletions `emotion_labels.json`

```json
{
  "0": [0.3077, 0.0256, 0.0256, 0.0256, 0.0256, 0.0256, 0.2564, 0.3077],
  "1": [0.1000, 0.5000, 0.0500, 0.0500, 0.0500, 0.0500, 0.1000, 0.1000],
  "2": [0.4000, 0.0500, 0.0500, 0.0500, 0.0500, 0.0500, 0.1500, 0.2000],
  "3": [0.1000, 0.0500, 0.5000, 0.0500, 0.0500, 0.0500, 0.1000, 0.1000],
  "4": [0.1000, 0.4000, 0.1000, 0.1000, 0.0500, 0.0500, 0.1000, 0.1000],
  "5": [0.3500, 0.0500, 0.0500, 0.0500, 0.0500, 0.0500, 0.1500, 0.2500],
  "6": [0.1000, 0.0500, 0.0500, 0.4000, 0.1000, 0.1000, 0.1000, 0.1000],
  "7": [0.1000, 0.0500, 0.0500, 0.0500, 0.0500, 0.0500, 0.5000, 0.1500],
  "8": [0.3000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000],
  "9": [0.1000, 0.4500, 0.0500, 0.0500, 0.0500, 0.0500, 0.1000, 0.1500],
  "10": [0.1500, 0.0500, 0.4000, 0.0500, 0.1000, 0.0500, 0.1000, 0.1000]
}
```
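Each entry maps a label id to an 8-dimensional emotion weight vector. A minimal sketch of consuming the file (two entries inlined here for illustration; in practice you would `json.load` the file itself):

```python
import json

# Hypothetical illustration: the same id -> 8-dim vector structure as
# emotion_labels.json, with two entries inlined so the snippet is self-contained.
raw = """{
  "0": [0.3077, 0.0256, 0.0256, 0.0256, 0.0256, 0.0256, 0.2564, 0.3077],
  "1": [0.1000, 0.5000, 0.0500, 0.0500, 0.0500, 0.0500, 0.1000, 0.1000]
}"""
emotion_labels = json.loads(raw)

# Every vector has 8 components (one weight per emotion dimension).
assert all(len(vec) == 8 for vec in emotion_labels.values())
vec = emotion_labels["1"]
```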