# SpeechT5 Voice Conversion - TTNN Implementation

Pre-hardware exploration of porting Microsoft's SpeechT5 Voice Conversion model to Tenstorrent's TTNN API.

## Status

Pre-hardware. Not validated on device.

I have two Tenstorrent Blackhole p150a cards on order. While waiting for them to arrive, I'm getting a head start on learning the TTNN development workflow by porting the SpeechT5 voice conversion pipeline. This work is motivated by a planned local voice assistant project (Jarvis-style STT -> LLM -> TTS, all on-device).

I saw there's an open bounty for this model bringup, which inspired the investigation. This repo is my independent exploration - not a bounty claim.

Once the cards arrive and are installed, I'll validate on hardware and update with real results.

## What's Here

### Implemented: Speech Encoder Prenet (`src/ttnn_speech_encoder_prenet.py`)

The one genuinely new TTNN component needed for voice conversion. The shared encoder, decoder, and postnet already exist in tt-metal's SpeechT5 TTS implementation.

The speech encoder prenet converts raw 16kHz audio into encoder-ready hidden states:

```
Raw waveform [B, 16000]  (1 second of audio)
    |
    v  7x Conv1d (wav2vec2-style, total stride=320)
[B, 512, 49]  (49 frames at 20ms each)
    |
    v  LayerNorm + Linear projection
[B, 49, 768]
    |
    v  Positional conv embedding (grouped Conv1d, k=128, g=16)
    v  + Sinusoidal positional encoding
[B, 49, 768]  -> ready for shared encoder
```
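The 320x total stride and the 49-frame output both fall out of the seven conv layers. A quick pure-Python check, using the standard wav2vec2 base feature-encoder config (which SpeechT5 reuses) as the assumed kernel/stride list:

```python
# wav2vec2-style feature encoder: (kernel, stride) per layer.
# Assumption: the standard wav2vec2 base config, reused by SpeechT5.
LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def output_frames(num_samples: int) -> int:
    """Frame count after the 7-layer conv stack (no padding, floor division)."""
    length = num_samples
    for kernel, stride in LAYERS:
        length = (length - kernel) // stride + 1
    return length

total_stride = 1
for _, stride in LAYERS:
    total_stride *= stride

print(total_stride)          # 320
print(output_frames(16000))  # 49 frames for 1 s of 16 kHz audio
```

The floor division is why frame counts aren't exactly `samples / 320`; test 6 below checks the same arithmetic across five audio durations.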

TTNN ops used:

- `ttnn.conv1d` - strided convolutions (stride 2 and 5), grouped convolution (groups=16)
- `ttnn.group_norm` - instance normalization (groups=num_channels) on the first conv layer
- `ttnn.gelu` - activation function
- `ttnn.layer_norm` - feature projection normalization
- `ttnn.linear` - 512 -> 768 projection
- `ttnn.add` - residual positional embeddings

Key implementation details:

- Weight normalization on the positional conv is fused at load time (`_fuse_weight_norm`). PyTorch's parametrization API uses `dim=2` for this layer (normalizing over dims 0 and 1, preserving the kernel dim), not the more common `dim=0`; the fusion code auto-detects this from the parametrization object.
- Conv1d stride=2 is covered by tt-metal's sweep tests. Stride=5 (first layer) is untested, but the API passes it through to conv2d; it needs hardware validation.
- Groups=16 for the positional conv; tt-metal tests group counts up to 512.
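The weight-norm fusion above amounts to collapsing the `(g, v)` parametrization back into a single weight tensor, `w = g * v / ||v||`. A minimal numpy sketch of the `dim=2` case; this is illustrative only, since the repo's `_fuse_weight_norm` works on the actual PyTorch parametrization objects:

```python
import numpy as np

def fuse_weight_norm(v: np.ndarray, g: np.ndarray, dim: int = 2) -> np.ndarray:
    """Fold weight_norm (w = g * v / ||v||) into a plain weight tensor.

    For a Conv1d weight of shape [out_ch, in_ch, kernel], dim=2 means the
    norm is taken over dims (0, 1), preserving the kernel dimension.
    """
    reduce_dims = tuple(d for d in range(v.ndim) if d != dim)
    norm = np.sqrt((v ** 2).sum(axis=reduce_dims, keepdims=True))
    return g * v / norm

# Toy shapes modeled on the positional conv (illustrative, not the real weights).
rng = np.random.default_rng(0)
v = rng.standard_normal((768, 48, 128))   # [out_ch, in_ch/groups, kernel]
g = rng.standard_normal((1, 1, 128))      # one gain per kernel position (dim=2)
w = fuse_weight_norm(v, g, dim=2)
print(w.shape)  # (768, 48, 128)
```

A sanity check on the fused weight: its norm over dims (0, 1) should equal `|g|` at every kernel position, which is what the PCC test below verifies against the PyTorch reference.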

### Implemented: VC Model Integration (`src/ttnn_speecht5_vc_model.py`)

Wires the speech encoder prenet to tt-metal's existing encoder/decoder/postnet. This file requires tt-metal to be installed (it imports from `models.experimental.speecht5_tts.tt`).

Key design decisions:

- Speaker embeddings are required: the decoder preprocessor normalizes them and bakes them into the weights at init time, so they are not passed at runtime.
- The shared encoder's `__call__` is bypassed (it performs a text embedding lookup). Instead, `_run_encoder_from_hidden_states()` feeds the speech prenet output directly into the pre-encoder LayerNorm and encoder blocks.
- Encoder parameters are preprocessed with placeholders for the text-specific weights (`embed_tokens`, positional encoding) that the encoder init expects but the VC path never uses.
- The HiFi-GAN vocoder runs on CPU/PyTorch, matching the existing TTS demo pattern: ConvTranspose1d (stride=4 upsampling) doesn't run on TTNN, and this split is the accepted approach in the ecosystem.
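Baking a fixed speaker embedding into weights at init time is possible because a constant input to a linear layer can be folded into its bias. A hypothetical numpy sketch of the idea; the function name and shapes are illustrative, not the tt-metal API:

```python
import numpy as np

def bake_speaker_embedding(W: np.ndarray, b: np.ndarray, speaker: np.ndarray):
    """Fold a fixed speaker embedding into a linear layer's bias.

    The decoder prenet computes y = W @ concat([h, s]) + b, where s is the
    L2-normalized speaker embedding. Since s is constant at runtime, split
    W = [W_h | W_s] and precompute b' = b + W_s @ s_norm, so y = W_h @ h + b'.
    """
    d_hidden = W.shape[1] - speaker.shape[0]
    W_h, W_s = W[:, :d_hidden], W[:, d_hidden:]
    s_norm = speaker / np.linalg.norm(speaker)
    return W_h, b + W_s @ s_norm

# Equivalence check against the unbaked computation (toy sizes).
rng = np.random.default_rng(1)
W = rng.standard_normal((256, 256 + 512))
b = rng.standard_normal(256)
h = rng.standard_normal(256)
s = rng.standard_normal(512)
W_h, b_baked = bake_speaker_embedding(W, b, s)
y_ref = W @ np.concatenate([h, s / np.linalg.norm(s)]) + b
assert np.allclose(W_h @ h + b_baked, y_ref)
```

The trade-off is that changing the target speaker requires re-initializing the model, which is why the embeddings cannot be passed at runtime.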

### Tests (`tests/test_speech_encoder_prenet.py`)

8 local validation tests (run without TT hardware):

  1. Feature encoder output shapes through all 7 layers
  2. Feature projection shape (512 -> 768)
  3. Positional conv embedding shape preservation
  4. Weight normalization fusion PCC = 0.99999994
  5. No bias in feature encoder convs (architecture verification)
  6. Output frame counts for 5 audio durations
  7. Sinusoidal embedding buffer sanity
  8. Full prenet end-to-end output shape
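The PCC figure in test 4 is a standard Pearson correlation coefficient between the fused and reference weight tensors, flattened. It can be reproduced in a few lines of numpy (the demo inputs below are synthetic, not the repo's actual weights):

```python
import numpy as np

def pcc(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation coefficient between two flattened tensors."""
    a, b = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    a = a - a.mean()
    b = b - b.mean()
    return float((a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical tensors give PCC = 1; a tiny perturbation stays extremely close.
x = np.linspace(0.0, 1.0, 1000)
noisy = x + 1e-6 * np.sin(50 * x)
print(round(pcc(x, x), 6))  # 1.0
print(pcc(x, noisy) > 0.999999)  # True
```

A value like 0.99999994 therefore indicates the fused weights match the PyTorch reference to within floating-point noise.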

## What's NOT Here (Yet)

- Hardware validation (need the cards)
- End-to-end audio demo (WAV in -> conversion -> WAV out)
- Performance measurements (tok/s, RTF)
- Quality metrics (speaker similarity, WER, token accuracy)
- Stage 2/3 optimizations (sharding, bfloat8_b, trace)

## Architecture

SpeechT5 voice conversion reuses the same shared encoder/decoder as TTS. Only the input path differs:

| Component | TTS | VC | TTNN Status |
|---|---|---|---|
| Input prenet | Text embeddings | Speech Conv1d stack | New (this repo) |
| Shared encoder | 12 transformer layers | Same | Existing in tt-metal |
| Shared decoder | 6 layers + cross-attn + speaker conditioning | Same | Existing in tt-metal |
| Postnet | 5x Conv1d + BatchNorm | Same | Existing in tt-metal |
| Vocoder | HiFi-GAN | Same | CPU/PyTorch |

Total model: 153.7M params (SpeechT5) + 12.7M params (HiFi-GAN) = 166.4M params, ~0.67 GB. Fits easily on a single Wormhole N150 (12 GB) or Blackhole p150a (32 GB).
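The ~0.67 GB figure follows from storing 166.4M parameters at 4 bytes each (assuming fp32 weights); bfloat16 would halve it:

```python
params = 153.7e6 + 12.7e6      # SpeechT5 + HiFi-GAN parameter counts
fp32_gb = params * 4 / 1e9     # 4 bytes per fp32 parameter
bf16_gb = params * 2 / 1e9     # 2 bytes per bf16 parameter

print(round(params / 1e6, 1))  # 166.4
print(round(fp32_gb, 2))       # 0.67
print(round(bf16_gb, 2))       # 0.33
```

Either way the model is a small fraction of the 12 GB (N150) or 32 GB (p150a) on-device memory, so capacity is not the bottleneck; activation layout and op coverage are.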

## Requirements

```bash
# For local tests (no TT hardware needed)
conda create -n tenstorrent python=3.10
conda activate tenstorrent
pip install torch transformers datasets soundfile scipy pytest

# For on-device execution (requires tt-metal installed)
# See https://docs.tenstorrent.com/getting-started/README.html
```

## Running Tests

```bash
# Local architecture validation (no hardware needed)
python tests/test_speech_encoder_prenet.py
```

## License

Apache 2.0 - matching Tenstorrent's tt-metal licensing.
