docs: add training.py usage instructions to README #102
base: main
Conversation
sardorb3k commented on Feb 16, 2025
- Add detailed section on training prerequisites and setup
- Document configuration options and parameters
- Explain checkpoint saving and training process
- Include example commands for custom training
It's a nice approach, however it doesn't take any emotion handling / data prep into account whatsoever. The original training script is very likely far more complicated than that. Is that OAI code?
- Fix device mismatch in attention mechanism
  - Ensure KV cache is on the same device as the query tensor
  - Add device check and transfer in the _update_kv_cache function
- Improve training pipeline
  - Add proper emotion handling with 8-dimensional vectors
  - Update mel spectrogram calculation and normalization
  - Fix speaker embedding tensor shape handling
  - Streamline data processing in the training loop
- Code cleanup and optimizations
  - Remove unnecessary emotion embedding layer
  - Simplify emotion vector handling
  - Improve tensor stacking operations
  - Add proper error handling for device mismatches
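The commit names a device check in `_update_kv_cache`; the actual implementation isn't shown in this thread, but a minimal sketch of that kind of fix, assuming a simple concatenating KV cache, could look like this:

```python
import torch

def _update_kv_cache(self, k: torch.Tensor, v: torch.Tensor):
    """Hypothetical sketch: keep the cached keys/values on the same device
    as the incoming tensors (and hence the query) before concatenating."""
    if self.k_cache is not None and self.k_cache.device != k.device:
        # Device check and transfer described in the commit message.
        self.k_cache = self.k_cache.to(k.device)
        self.v_cache = self.v_cache.to(v.device)

    if self.k_cache is None:
        self.k_cache, self.v_cache = k, v
    else:
        # Append new keys/values along the sequence dimension.
        self.k_cache = torch.cat([self.k_cache, k], dim=2)
        self.v_cache = torch.cat([self.v_cache, v], dim=2)
    return self.k_cache, self.v_cache
```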
You're right - this is just a basic example to demonstrate the concept. Real TTS training would be much more complex, especially for emotion handling and data preparation. And no, this isn't OpenAI code - it's a simplified training example.
I talked to Gabriel and he might be swayed to give out the code, even if it's hackish, if there are optimisation efforts going on. What you have here is a good baseline and a tech demonstrator. Ideally, I think most people (maybe just me) would want to see the original stuff so we can train at max fidelity. Have you tested it with some language, or do you have preliminary results? W&B logs / samples would be good.
- Fix speaker embedding dimension handling
  - Remove extra channel dimension before speaker model
  - Ensure proper audio shape (time) for speaker embedding
  - Add batch dimension only for autoencoder input
- Improve audio preprocessing
  - Add consistent audio length handling (30 seconds)
  - Proper mono conversion for stereo inputs
  - Better normalization and augmentation
  - Add SNR-based noise addition (20-30 dB range)
- Optimize training process
  - Reduce batch size to 4 for better stability
  - Reduce worker count to 2
  - Simplify loss calculation
  - Add gradient clipping
- Code cleanup
  - Add descriptive comments for tensor shapes
  - Remove redundant tensor operations
  - Improve code organization and readability
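For reference, a minimal sketch of the preprocessing steps described in this commit (speaker-model input shape, mono conversion, and SNR-based noise in the 20-30 dB range); the helper names are hypothetical and not taken from training.py:

```python
import torch

def to_speaker_input(wav: torch.Tensor) -> torch.Tensor:
    """Drop the extra channel dimension so the speaker model sees raw audio
    of shape (time,); the autoencoder adds its own batch dimension later."""
    if wav.dim() == 2:
        wav = wav.mean(dim=0) if wav.size(0) > 1 else wav.squeeze(0)
    return wav

def add_noise_snr(wav: torch.Tensor, snr_db_low: float = 20.0,
                  snr_db_high: float = 30.0) -> torch.Tensor:
    """Add white noise at a random SNR drawn from the 20-30 dB range."""
    snr_db = torch.empty(1).uniform_(snr_db_low, snr_db_high)
    signal_power = wav.pow(2).mean()
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return wav + torch.randn_like(wav) * noise_power.sqrt()
```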
- Add language code mapping for espeak compatibility
  - Map 'uz' to 'uzb' for proper Uzbek phonemization
  - Maintain ISO 639-1 compatibility in the public API
- Improve error handling in phonemization
  - Add graceful fallback to English when needed
  - Add informative warning messages for debugging
  - Implement proper exception handling for phonemization failures
- Enhance documentation
  - Add function docstrings explaining language handling
  - Document language code mappings
  - Clarify fallback behavior

Technical note: espeak requires the 'uzb' code for Uzbek while the public API uses 'uz'.
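A minimal sketch of the mapping and fallback described above, assuming the phonemizer package's espeak backend (the actual helper in the PR is not shown here):

```python
from phonemizer import phonemize

# Public API keeps ISO 639-1 codes; espeak expects 'uzb' for Uzbek.
ESPEAK_LANG_MAP = {"uz": "uzb"}

def phonemize_text(text: str, language: str = "en-us") -> str:
    """Phonemize text, mapping language codes for espeak and falling back
    to English if the requested language fails."""
    espeak_lang = ESPEAK_LANG_MAP.get(language, language)
    try:
        return phonemize(text, language=espeak_lang, backend="espeak")
    except Exception as exc:
        # Informative warning, then graceful fallback to English.
        print(f"Warning: phonemization failed for '{espeak_lang}' ({exc}); falling back to English")
        return phonemize(text, language="en-us", backend="espeak")
```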
- Audio Processing:
  - Ensure proper resampling to 44.1 kHz for DAC model compatibility
  - Implement consistent 30-second audio length handling (1,323,000 samples at 44.1 kHz)
  - Add proper padding/truncation for audio inputs
- Dimension Handling:
  - Fix tensor shape mismatches in autoencoder input
  - Add proper batch and channel dimensions for the DAC model
  - Ensure correct audio tensor shapes throughout the pipeline
- Language Support:
  - Add language code mapping for Uzbek (uz -> uzb) in espeak
  - Implement graceful fallback to English for unsupported languages
  - Improve error handling in the phonemization process
- Code Improvements:
  - Enhance documentation of tensor shapes and processing steps
  - Add validation checks for audio dimensions
  - Optimize the audio preprocessing pipeline

Technical note: the DAC model expects 44.1 kHz audio with shape [batch_size, channels, samples], where samples = 44100 * 30 for 30-second clips.
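A minimal sketch of that shape handling, assuming torchaudio for resampling; the function name is illustrative rather than taken from the PR:

```python
import torch
import torchaudio

TARGET_SR = 44100            # DAC model sample rate
TARGET_LEN = TARGET_SR * 30  # 30-second clips -> 1,323,000 samples

def prepare_for_dac(wav: torch.Tensor, orig_sr: int) -> torch.Tensor:
    """Return audio shaped [batch_size, channels, samples] for the DAC autoencoder."""
    if orig_sr != TARGET_SR:
        wav = torchaudio.functional.resample(wav, orig_sr, TARGET_SR)
    wav = wav.flatten()

    # Pad or truncate to exactly 30 seconds.
    if wav.numel() < TARGET_LEN:
        wav = torch.nn.functional.pad(wav, (0, TARGET_LEN - wav.numel()))
    else:
        wav = wav[:TARGET_LEN]

    # [samples] -> [1, 1, samples]: add batch and channel dimensions.
    return wav.unsqueeze(0).unsqueeze(0)
```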
…ass implementation for model inference
…h in backbone forward pass and cache setup
…placement for freqs_cis and cache tensors
…moves throughout backbone - Ensure consistent device usage in cache and computation
…n cross entropy loss - Ensure target_codes are on correct device
…een output and target codes - Add proper permutation and reshaping for cross entropy loss
Fixed cross entropy loss calculation by properly reshaping tensors:
- Removed extra batch dimension from output tensor
- Corrected tensor permutation for sequence and codebook dimensions
- Ensured matching dimensions between input and target tensors
- Updated shape comments for better code clarity

This fixes the ValueError where input batch_size (900) did not match target batch_size (450).
Fixed tensor reshaping logic for the 4D output tensor:
- Added .contiguous() before the view() operation
- Added proper target_codes transposition
- Fixed permutation dimensions to match the tensor shape
- Added detailed shape comments for better traceability

Resolves the RuntimeError in the permute operation where input.dim() = 4 did not match len(dims) = 3.
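These two commits describe the same reshape problem from different angles. A sketch of the final shape handling, under hypothetical shape conventions (logits [batch, seq_len, n_codebooks, vocab] and targets [batch, n_codebooks, seq_len]), might look like:

```python
import torch
import torch.nn.functional as F

def codebook_cross_entropy(logits: torch.Tensor, target_codes: torch.Tensor) -> torch.Tensor:
    """Flatten logits and targets consistently so their batch sizes match."""
    b, t, n_cb, vocab = logits.shape

    # Align the target layout with the logits: [b, n_cb, t] -> [b, t, n_cb].
    target_codes = target_codes.transpose(1, 2)

    # .contiguous() is required before view() once tensors have been permuted.
    logits = logits.contiguous().view(b * t * n_cb, vocab)
    target_codes = target_codes.contiguous().view(b * t * n_cb)

    # Both now share the same flattened batch size, avoiding the
    # "input batch_size (900) did not match target batch_size (450)" error.
    return F.cross_entropy(logits, target_codes)
```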
I've seen that you made massive updates incorporating emotion tags. @spaghettiSystems has https://github.com/spaghettiSystems/serval and/or https://github.com/spaghettiSystems/emotion_whisper, which are used in the data prep; maybe worth looking at.
I'll rest my preliminary 2-minute review at that.