
docs: add training.py usage instructions to README #102

Open

sardorb3k wants to merge 13 commits into main
Conversation

sardorb3k

  • Add detailed section on training prerequisites and setup
  • Document configuration options and parameters
  • Explain checkpoint saving and training process
  • Include example commands for custom training

@sardorb3k mentioned this pull request Feb 16, 2025
darkacorn (Contributor) commented Feb 16, 2025

It's a nice approach, however it doesn't take any emotion / data prep into account whatsoever .. the original training script is highly likely way more complicated than that ..

Is that OAI code?

- Fix device mismatch in attention mechanism
  - Ensure KV cache is on same device as query tensor
  - Add device check and transfer in _update_kv_cache function

- Improve training pipeline
  - Add proper emotion handling with 8-dimensional vectors
  - Update mel spectrogram calculation and normalization
  - Fix speaker embedding tensor shape handling
  - Streamline data processing in training loop

- Code cleanup and optimizations
  - Remove unnecessary emotion embedding layer
  - Simplify emotion vector handling
  - Improve tensor stacking operations
  - Add proper error handling for device mismatches
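A minimal sketch of the KV-cache device fix described above. The function name `_update_kv_cache` comes from the commit, but the signature and cache layout here are assumptions, not the repo's actual code:

```python
import torch

def _update_kv_cache(cache_k: torch.Tensor, cache_v: torch.Tensor,
                     k_new: torch.Tensor, v_new: torch.Tensor):
    """Append new key/value states to the cache, moving the cache to the
    device of the incoming tensors first so a CPU-initialized cache does
    not crash attention running on GPU."""
    if cache_k.device != k_new.device:
        cache_k = cache_k.to(k_new.device)
        cache_v = cache_v.to(v_new.device)
    # Concatenate along the sequence dimension: [batch, heads, seq, head_dim]
    cache_k = torch.cat([cache_k, k_new], dim=-2)
    cache_v = torch.cat([cache_v, v_new], dim=-2)
    return cache_k, cache_v
```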
sardorb3k (Author)

> It's a nice approach, however it doesn't take any emotion / data prep into account whatsoever .. the original training script is highly likely way more complicated than that ..
>
> Is that OAI code?

You're right - this is just a basic example to demonstrate the concept. Real TTS training would be much more complex, especially for emotion handling and data preparation. And no, this isn't OpenAI code - it's a simplified training example.

darkacorn (Contributor) commented Feb 16, 2025

I talked to Gabriel and he may be swayed to give out the code, even if it's hackish, if there are optimisation efforts going on .. what you have here is a good baseline and a tech demonstrator. Ideally, I think most (maybe just me) would want to see the original stuff so we can train at max fidelity.

Have you tested it with some language, or got any preliminary results? W&B logs / samples would be good.

- Fix speaker embedding dimension handling
  - Remove extra channel dimension before speaker model
  - Ensure proper audio shape (time) for speaker embedding
  - Add batch dimension only for autoencoder input

- Improve audio preprocessing
  - Add consistent audio length handling (30 seconds)
  - Proper mono conversion for stereo inputs
  - Better normalization and augmentation
  - Add SNR-based noise addition (20-30dB range)

- Optimize training process
  - Reduce batch size to 4 for better stability
  - Reduce worker count to 2
  - Simplify loss calculation
  - Add gradient clipping

- Code cleanup
  - Add descriptive comments for tensor shapes
  - Remove redundant tensor operations
  - Improve code organization and readability
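As a rough illustration of the SNR-based noise augmentation mentioned in the preprocessing changes above, a sketch assuming mono float waveforms (the actual helper in the PR may differ):

```python
import torch

def add_noise_snr(audio: torch.Tensor, snr_db_range=(20.0, 30.0)) -> torch.Tensor:
    """Add Gaussian noise at a random SNR drawn from snr_db_range.
    `audio` is a mono waveform tensor of shape [time]."""
    snr_db = torch.empty(1).uniform_(*snr_db_range)
    signal_power = audio.pow(2).mean()
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = torch.randn_like(audio) * noise_power.sqrt()
    return audio + noise
```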
- Add language code mapping for espeak compatibility
  - Map 'uz' to 'uzb' for proper Uzbek phonemization
  - Maintain ISO 639-1 compatibility in public API

- Improve error handling in phonemization
  - Add graceful fallback to English when needed
  - Add informative warning messages for debugging
  - Implement proper exception handling for phonemization failures

- Enhance documentation
  - Add function docstrings explaining language handling
  - Document language code mappings
  - Clarify fallback behavior

Technical note: espeak requires 'uzb' code for Uzbek while the API uses 'uz'
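A hedged sketch of the language-mapping and fallback logic this commit describes, using the phonemizer package's espeak backend (the mapping table and function name are illustrative, not necessarily the PR's own):

```python
from phonemizer import phonemize

# Public API uses ISO 639-1 codes; espeak wants its own codes for some
# languages. Only the Uzbek mapping from the commit is shown here.
ESPEAK_LANG_MAP = {"uz": "uzb"}

def phonemize_text(text: str, language: str = "en-us") -> str:
    """Phonemize text, translating the language code for espeak and
    falling back to English if phonemization fails."""
    espeak_lang = ESPEAK_LANG_MAP.get(language, language)
    try:
        return phonemize(text, language=espeak_lang, backend="espeak")
    except Exception as e:
        print(f"Warning: phonemization failed for '{espeak_lang}' ({e}); "
              f"falling back to English.")
        return phonemize(text, language="en-us", backend="espeak")
```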
- Audio Processing:
  - Ensure proper resampling to 44.1kHz for DAC model compatibility
  - Implement consistent 30-second audio length handling (1,323,000 samples at 44.1kHz)
  - Add proper padding/truncation for audio inputs

- Dimension Handling:
  - Fix tensor shape mismatches in autoencoder input
  - Add proper batch and channel dimensions for DAC model
  - Ensure correct audio tensor shapes throughout the pipeline

- Language Support:
  - Add language code mapping for Uzbek (uz -> uzb) in espeak
  - Implement graceful fallback to English for unsupported languages
  - Improve error handling in phonemization process

- Code Improvements:
  - Enhance documentation of tensor shapes and processing steps
  - Add validation checks for audio dimensions
  - Optimize audio preprocessing pipeline

Technical note: DAC model expects 44.1kHz audio with shape [batch_size, channels, samples],
where samples = 44100 * 30 for 30-second clips.
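Putting the shape requirements from the technical note into code, a sketch of the resample/pad/reshape path (assuming torchaudio and a mono or [channels, time] input waveform):

```python
import torch
import torchaudio

TARGET_SR = 44_100
TARGET_SAMPLES = TARGET_SR * 30  # 30-second clips -> 1,323,000 samples

def prepare_for_dac(audio: torch.Tensor, sr: int) -> torch.Tensor:
    """Resample, mono-mix, and pad/truncate a waveform to the
    [batch=1, channels=1, samples] shape the DAC autoencoder expects."""
    if sr != TARGET_SR:
        audio = torchaudio.functional.resample(audio, sr, TARGET_SR)
    if audio.dim() == 2:                      # [channels, time] -> mono
        audio = audio.mean(dim=0)
    if audio.shape[-1] < TARGET_SAMPLES:      # pad short clips with zeros
        audio = torch.nn.functional.pad(audio, (0, TARGET_SAMPLES - audio.shape[-1]))
    else:                                     # truncate long clips
        audio = audio[..., :TARGET_SAMPLES]
    return audio.view(1, 1, TARGET_SAMPLES)   # [batch, channels, samples]
```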
…moves throughout backbone - Ensure consistent device usage in cache and computation
…n cross entropy loss - Ensure target_codes are on correct device
…een output and target codes - Add proper permutation and reshaping for cross entropy loss
Fixed cross entropy loss calculation by properly reshaping tensors:
- Removed extra batch dimension from output tensor
- Corrected tensor permutation for sequence and codebook dimensions
- Ensured matching dimensions between input and target tensors
- Updated shape comments for better code clarity

This fixes the ValueError where input batch_size (900) did not match target batch_size (450).

Fixed tensor reshaping logic for 4D output tensor:
- Added .contiguous() before view() operation
- Added proper target_codes transposition
- Fixed permutation dimensions to match tensor shape
- Added detailed shape comments for better traceability

Resolves RuntimeError in permute operation where input.dim() = 4 did not match len(dims) = 3.
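The two loss-reshaping commits above boil down to flattening a 4D logits tensor and a matching target layout before cross entropy. A sketch with assumed [batch, seq, codebooks, vocab] shapes (the model's actual layout may differ):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: logits [B, T, C, V] (batch, sequence, codebooks, vocab)
# and target_codes [B, C, T].
B, T, C, V = 2, 225, 9, 1024
logits = torch.randn(B, T, C, V)
target_codes = torch.randint(0, V, (B, C, T))

# Align the target layout with the logits: [B, C, T] -> [B, T, C]
targets = target_codes.transpose(1, 2)

# Flatten everything except the vocab dimension. .contiguous() is needed
# because transpose/permute produce non-contiguous views that .view() rejects.
loss = F.cross_entropy(
    logits.contiguous().view(B * T * C, V),
    targets.contiguous().view(B * T * C),
)
```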
darkacorn (Contributor)

Seen that you made massive updates incorporating emotion tags .. @spaghettiSystems has https://github.com/spaghettiSystems/serval and/or https://github.com/spaghettiSystems/emotion_whisper which are used in the data prep .. maybe worth a look.

darkacorn (Contributor)

I rest my preliminary 2-minute review of that.
