docs: add training.py usage instructions to README #102
base: main
Conversation
sardorb3k commented on Feb 16, 2025
- Add detailed section on training prerequisites and setup
- Document configuration options and parameters
- Explain checkpoint saving and training process
- Include example commands for custom training
It's a nice approach, however it doesn't take any emotion handling / data prep into account whatsoever. The original training script is very likely far more complicated than that. Is that OAI code?
- Fix device mismatch in attention mechanism
  - Ensure KV cache is on the same device as the query tensor
  - Add device check and transfer in the _update_kv_cache function
- Improve training pipeline
  - Add proper emotion handling with 8-dimensional vectors
  - Update mel spectrogram calculation and normalization
  - Fix speaker embedding tensor shape handling
  - Streamline data processing in the training loop
- Code cleanup and optimizations
  - Remove unnecessary emotion embedding layer
  - Simplify emotion vector handling
  - Improve tensor stacking operations
  - Add proper error handling for device mismatches
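The commit names a device check in `_update_kv_cache`; the actual implementation isn't shown in this thread, but a minimal sketch of that kind of fix, assuming a simple concatenating KV cache, could look like this:

```python
import torch

def _update_kv_cache(self, k: torch.Tensor, v: torch.Tensor):
    """Hypothetical sketch: keep the cached keys/values on the same device
    as the incoming tensors (and hence the query) before concatenating."""
    if self.k_cache is not None and self.k_cache.device != k.device:
        # Device check and transfer described in the commit message.
        self.k_cache = self.k_cache.to(k.device)
        self.v_cache = self.v_cache.to(v.device)

    if self.k_cache is None:
        self.k_cache, self.v_cache = k, v
    else:
        # Append new keys/values along the sequence dimension.
        self.k_cache = torch.cat([self.k_cache, k], dim=2)
        self.v_cache = torch.cat([self.v_cache, v], dim=2)
    return self.k_cache, self.v_cache
```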
You're right - this is just a basic example to demonstrate the concept. Real TTS training would be much more complex, especially for emotion handling and data preparation. And no, this isn't OpenAI code - it's a simplified training example.
I talked to Gabriel and he might be swayed to give out the code, even if it's hackish, if there are optimisation efforts going on. What you have here is a good baseline and a tech demonstrator. Ideally, I think most people (maybe just me) would want to see the original stuff so we can train at max fidelity. Have you tested it with some language, or do you have preliminary results? W&B logs / samples would be good.
- Fix speaker embedding dimension handling
  - Remove extra channel dimension before speaker model
  - Ensure proper audio shape (time) for speaker embedding
  - Add batch dimension only for autoencoder input
- Improve audio preprocessing
  - Add consistent audio length handling (30 seconds)
  - Proper mono conversion for stereo inputs
  - Better normalization and augmentation
  - Add SNR-based noise addition (20-30 dB range)
- Optimize training process
  - Reduce batch size to 4 for better stability
  - Reduce worker count to 2
  - Simplify loss calculation
  - Add gradient clipping
- Code cleanup
  - Add descriptive comments for tensor shapes
  - Remove redundant tensor operations
  - Improve code organization and readability
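For reference, a minimal sketch of the preprocessing steps described in this commit (speaker-model input shape, mono conversion, and SNR-based noise in the 20-30 dB range); the helper names are hypothetical and not taken from training.py:

```python
import torch

def to_speaker_input(wav: torch.Tensor) -> torch.Tensor:
    """Drop the extra channel dimension so the speaker model sees raw audio
    of shape (time,); the autoencoder adds its own batch dimension later."""
    if wav.dim() == 2:
        wav = wav.mean(dim=0) if wav.size(0) > 1 else wav.squeeze(0)
    return wav

def add_noise_snr(wav: torch.Tensor, snr_db_low: float = 20.0,
                  snr_db_high: float = 30.0) -> torch.Tensor:
    """Add white noise at a random SNR drawn from the 20-30 dB range."""
    snr_db = torch.empty(1).uniform_(snr_db_low, snr_db_high)
    signal_power = wav.pow(2).mean()
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return wav + torch.randn_like(wav) * noise_power.sqrt()
```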
- Add language code mapping for espeak compatibility
  - Map 'uz' to 'uzb' for proper Uzbek phonemization
  - Maintain ISO 639-1 compatibility in the public API
- Improve error handling in phonemization
  - Add graceful fallback to English when needed
  - Add informative warning messages for debugging
  - Implement proper exception handling for phonemization failures
- Enhance documentation
  - Add function docstrings explaining language handling
  - Document language code mappings
  - Clarify fallback behavior

Technical note: espeak requires the 'uzb' code for Uzbek while the public API uses 'uz'.
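A minimal sketch of the mapping and fallback described above, assuming the phonemizer package's espeak backend (the actual helper in the PR is not shown here):

```python
from phonemizer import phonemize

# Public API keeps ISO 639-1 codes; espeak expects 'uzb' for Uzbek.
ESPEAK_LANG_MAP = {"uz": "uzb"}

def phonemize_text(text: str, language: str = "en-us") -> str:
    """Phonemize text, mapping language codes for espeak and falling back
    to English if the requested language fails."""
    espeak_lang = ESPEAK_LANG_MAP.get(language, language)
    try:
        return phonemize(text, language=espeak_lang, backend="espeak")
    except Exception as exc:
        # Informative warning, then graceful fallback to English.
        print(f"Warning: phonemization failed for '{espeak_lang}' ({exc}); falling back to English")
        return phonemize(text, language="en-us", backend="espeak")
```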
- Audio Processing:
  - Ensure proper resampling to 44.1 kHz for DAC model compatibility
  - Implement consistent 30-second audio length handling (1,323,000 samples at 44.1 kHz)
  - Add proper padding/truncation for audio inputs
- Dimension Handling:
  - Fix tensor shape mismatches in autoencoder input
  - Add proper batch and channel dimensions for the DAC model
  - Ensure correct audio tensor shapes throughout the pipeline
- Language Support:
  - Add language code mapping for Uzbek (uz -> uzb) in espeak
  - Implement graceful fallback to English for unsupported languages
  - Improve error handling in the phonemization process
- Code Improvements:
  - Enhance documentation of tensor shapes and processing steps
  - Add validation checks for audio dimensions
  - Optimize the audio preprocessing pipeline

Technical note: the DAC model expects 44.1 kHz audio with shape [batch_size, channels, samples], where samples = 44100 * 30 for 30-second clips.
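A minimal sketch of that shape handling, assuming torchaudio for resampling; the function name is illustrative rather than taken from the PR:

```python
import torch
import torchaudio

TARGET_SR = 44100            # DAC model sample rate
TARGET_LEN = TARGET_SR * 30  # 30-second clips -> 1,323,000 samples

def prepare_for_dac(wav: torch.Tensor, orig_sr: int) -> torch.Tensor:
    """Return audio shaped [batch_size, channels, samples] for the DAC autoencoder."""
    if orig_sr != TARGET_SR:
        wav = torchaudio.functional.resample(wav, orig_sr, TARGET_SR)
    wav = wav.flatten()

    # Pad or truncate to exactly 30 seconds.
    if wav.numel() < TARGET_LEN:
        wav = torch.nn.functional.pad(wav, (0, TARGET_LEN - wav.numel()))
    else:
        wav = wav[:TARGET_LEN]

    # [samples] -> [1, 1, samples]: add batch and channel dimensions.
    return wav.unsqueeze(0).unsqueeze(0)
```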
…ass implementation for model inference
…h in backbone forward pass and cache setup
…placement for freqs_cis and cache tensors
…moves throughout backbone - Ensure consistent device usage in cache and computation
…n cross entropy loss - Ensure target_codes are on correct device
…een output and target codes - Add proper permutation and reshaping for cross entropy loss
Fixed cross entropy loss calculation by properly reshaping tensors:
- Removed extra batch dimension from output tensor
- Corrected tensor permutation for sequence and codebook dimensions
- Ensured matching dimensions between input and target tensors
- Updated shape comments for better code clarity

This fixes the ValueError where input batch_size (900) did not match target batch_size (450).
Fixed tensor reshaping logic for the 4D output tensor:
- Added .contiguous() before the view() operation
- Added proper target_codes transposition
- Fixed permutation dimensions to match the tensor shape
- Added detailed shape comments for better traceability

Resolves the RuntimeError in the permute operation where input.dim() = 4 did not match len(dims) = 3.
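These two commits describe the same reshape problem from different angles. A sketch of the final shape handling, under hypothetical shape conventions (logits [batch, seq_len, n_codebooks, vocab] and targets [batch, n_codebooks, seq_len]), might look like:

```python
import torch
import torch.nn.functional as F

def codebook_cross_entropy(logits: torch.Tensor, target_codes: torch.Tensor) -> torch.Tensor:
    """Flatten logits and targets consistently so their batch sizes match."""
    b, t, n_cb, vocab = logits.shape

    # Align the target layout with the logits: [b, n_cb, t] -> [b, t, n_cb].
    target_codes = target_codes.transpose(1, 2)

    # .contiguous() is required before view() once tensors have been permuted.
    logits = logits.contiguous().view(b * t * n_cb, vocab)
    target_codes = target_codes.contiguous().view(b * t * n_cb)

    # Both now share the same flattened batch size, avoiding the
    # "input batch_size (900) did not match target batch_size (450)" error.
    return F.cross_entropy(logits, target_codes)
```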
I've seen that you made massive updates incorporating emotion tags. @spaghettiSystems has https://github.com/spaghettiSystems/serval and/or https://github.com/spaghettiSystems/emotion_whisper, which are used in the data prep; maybe worth looking at.
I'll rest my preliminary 2-minute review at that.