Pre-hardware exploration of porting Microsoft's SpeechT5 Voice Conversion model to Tenstorrent's TTNN API.
Pre-hardware. Not validated on device.
I have two Tenstorrent Blackhole p150a cards on order. While waiting for them to arrive, I'm getting a head start on learning the TTNN development workflow by porting the SpeechT5 voice conversion pipeline. This work is motivated by a planned local voice assistant project (Jarvis-style STT -> LLM -> TTS, all on-device).
I saw there's an open bounty for this model bringup, which inspired the investigation. This repo is my independent exploration - not a bounty claim.
Once the cards arrive and are installed, I'll validate on hardware and update with real results.
The speech encoder prenet is the one genuinely new TTNN component needed for voice conversion; the shared encoder, decoder, and postnet already exist in tt-metal's SpeechT5 TTS implementation.
The speech encoder prenet converts raw 16kHz audio into encoder-ready hidden states:
Raw waveform [B, 16000]        (1 second of audio)
      |
      v  7x Conv1d (wav2vec2-style, total stride=320)
[B, 512, 49]                   (49 frames at 20 ms each)
      |
      v  LayerNorm + Linear projection
[B, 49, 768]
      |
      v  Positional conv embedding (grouped Conv1d, k=128, g=16)
         + sinusoidal positional encoding
[B, 49, 768] -> ready for shared encoder
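The 49-frame figure falls straight out of the conv-stack arithmetic. A quick pure-Python check, using the standard wav2vec2 feature-encoder config (kernel sizes and strides), which SpeechT5's speech prenet reuses:

```python
# Standard wav2vec2 feature-encoder config: (kernel, stride) per layer.
# Total stride = 5 * 2**6 = 320, so 16 kHz audio -> one frame per 20 ms.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def output_frames(num_samples: int) -> int:
    """Apply the unpadded Conv1d length formula floor((L - k) / s) + 1 per layer."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        length = (length - kernel) // stride + 1
    return length

if __name__ == "__main__":
    # 1 second of 16 kHz audio -> 49 frames
    print(output_frames(16000))
```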
TTNN ops used:
- `ttnn.conv1d` - strided convolutions (stride 2 and 5), grouped convolution (groups=16)
- `ttnn.group_norm` - instance normalization (groups=num_channels) on the first conv layer
- `ttnn.gelu` - activation function
- `ttnn.layer_norm` - feature projection normalization
- `ttnn.linear` - 512->768 projection
- `ttnn.add` - residual positional embeddings
Key implementation details:
- Weight normalization on the positional conv is fused at load time (`_fuse_weight_norm`). PyTorch's parametrization API uses `dim=2` for this layer (normalize over dims 0 and 1, preserving the kernel dim), not the more common `dim=0`. The dim is auto-detected from the parametrization object.
- Conv1d stride=2 is covered by tt-metal's sweep tests. Stride=5 (first layer) is untested, but the API passes it through to conv2d - needs hardware validation.
- Groups=16 for the positional conv. Groups up to 512 are tested in tt-metal.
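To make the fusion concrete, here is a minimal numpy sketch of what folding weight normalization computes: `w = g * v / ||v||`, where with `dim=2` the norm is taken over dims (0, 1) and only the kernel dim is preserved. The shapes and the function name are illustrative, not the actual `_fuse_weight_norm` (which operates on the PyTorch parametrization object):

```python
import numpy as np

def fuse_weight_norm(weight_v: np.ndarray, weight_g: np.ndarray, dim: int = 2) -> np.ndarray:
    """Collapse weight normalization w = g * v / ||v|| into a plain weight tensor.

    For the positional conv, PyTorch keeps dim=2 (the kernel dim), so the
    norm is computed over the remaining dims (0, 1).
    """
    reduce_dims = tuple(d for d in range(weight_v.ndim) if d != dim)
    norm = np.sqrt((weight_v ** 2).sum(axis=reduce_dims, keepdims=True))
    return weight_g * weight_v / norm

# Toy positional-conv weight: [out_channels, in_channels // groups, kernel]
rng = np.random.default_rng(0)
v = rng.standard_normal((768, 48, 128))
g = rng.standard_normal((1, 1, 128))  # one gain per kernel position when dim=2
fused = fuse_weight_norm(v, g, dim=2)
```

After fusion, the per-kernel-position norm of the weight equals |g|, which is the property a PCC check against the unfused PyTorch forward pass would confirm.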
Wires the speech encoder prenet to tt-metal's existing encoder/decoder/postnet. This file requires tt-metal to be installed (it imports from models.experimental.speecht5_tts.tt).
Key design decisions:
- Speaker embeddings are required - the decoder preprocessor normalizes and bakes them into the weights at init time. They are not passed at runtime.
- The shared encoder's `__call__` is bypassed (it does text embedding lookup). Instead, `_run_encoder_from_hidden_states()` feeds speech prenet output directly into the pre-encoder LayerNorm + encoder blocks.
- Encoder parameters are preprocessed with placeholders for text-specific weights (embed_tokens, positional encoding) that the encoder init expects but the VC path never uses.
- HiFi-GAN vocoder runs on CPU/PyTorch, matching the existing TTS demo pattern. ConvTranspose1d (stride=4 upsampling) doesn't run on TTNN - this is the accepted approach in the ecosystem.
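On baking the speaker embedding into the weights: since the embedding is constant after init, its contribution to a linear layer that consumes `concat(hidden, speaker)` collapses into the bias. A hedged numpy sketch of the idea (shapes and the function name are illustrative, not the tt-metal implementation; the L2-normalization matches SpeechT5's handling of x-vectors):

```python
import numpy as np

def bake_speaker_embedding(W: np.ndarray, b: np.ndarray, spk: np.ndarray):
    """Fold a constant speaker embedding into a linear layer.

    The layer computes W @ concat(h, spk) + b. Split W into the part that
    multiplies the hidden state and the part that multiplies the fixed,
    L2-normalized speaker embedding; the latter collapses into the bias.
    """
    spk = spk / np.linalg.norm(spk)           # SpeechT5 L2-normalizes x-vectors
    hidden_dim = W.shape[1] - spk.shape[0]
    W_h, W_s = W[:, :hidden_dim], W[:, hidden_dim:]
    return W_h, b + W_s @ spk

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256 + 512))     # hidden 256 + x-vector 512 (illustrative)
b = rng.standard_normal(256)
spk = rng.standard_normal(512)
h = rng.standard_normal(256)

W_h, b_baked = bake_speaker_embedding(W, b, spk)
reference = W @ np.concatenate([h, spk / np.linalg.norm(spk)]) + b
assert np.allclose(W_h @ h + b_baked, reference)
```

The payoff is that no concat or speaker tensor exists at runtime - the decoder sees an ordinary linear layer.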
8 local validation tests (run without TT hardware):
- Feature encoder output shapes through all 7 layers
- Feature projection shape (512 -> 768)
- Positional conv embedding shape preservation
- Weight normalization fusion PCC = 0.99999994
- No bias in feature encoder convs (architecture verification)
- Output frame counts for 5 audio durations
- Sinusoidal embedding buffer sanity
- Full prenet end-to-end output shape
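For reference, the sinusoidal buffer exercised by the sanity test can be sketched in a few lines of numpy. This is the standard transformer sin/cos formulation; the exact SpeechT5 buffer layout may differ:

```python
import numpy as np

def sinusoidal_positions(num_positions: int, dim: int) -> np.ndarray:
    """Standard positional encoding: sin on even feature indices, cos on odd."""
    positions = np.arange(num_positions)[:, None]                    # [T, 1]
    div = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))    # [dim/2]
    table = np.zeros((num_positions, dim))
    table[:, 0::2] = np.sin(positions * div)
    table[:, 1::2] = np.cos(positions * div)
    return table

pe = sinusoidal_positions(49, 768)   # one row per 20 ms frame
```

The sanity checks are cheap: position 0 is all zeros on sin slots and all ones on cos slots, and every entry stays in [-1, 1].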
Still pending:
- Hardware validation (need the cards)
- End-to-end audio demo (WAV in -> conversion -> WAV out)
- Performance measurements (tok/s, RTF)
- Quality metrics (speaker similarity, WER, token accuracy)
- Stage 2/3 optimizations (sharding, bfloat8_b, trace)
SpeechT5 voice conversion reuses the same shared encoder/decoder as TTS. Only the input path differs:
| Component | TTS | VC | TTNN Status |
|---|---|---|---|
| Input prenet | Text embeddings | Speech Conv1d stack | New (this repo) |
| Shared encoder | 12 transformer layers | Same | Existing in tt-metal |
| Shared decoder | 6 layers + cross-attn + speaker conditioning | Same | Existing in tt-metal |
| Postnet | 5x Conv1d + BatchNorm | Same | Existing in tt-metal |
| Vocoder | HiFi-GAN | Same | CPU/PyTorch |
Total model: 153.7M params (SpeechT5) + 12.7M params (HiFi-GAN) = 166.4M params, ~0.67 GB. Fits easily on a single Wormhole N150 (12 GB) or Blackhole p150a (32 GB).
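The ~0.67 GB figure is just the fp32 footprint of the weights (params x 4 bytes):

```python
# fp32 weight footprint: params * 4 bytes per param
speecht5_params = 153.7e6
hifigan_params = 12.7e6
total_params = speecht5_params + hifigan_params   # 166.4M
gigabytes = total_params * 4 / 1e9                # ~0.67 GB
print(f"{total_params / 1e6:.1f}M params, {gigabytes:.2f} GB")
```

Dropping to bfloat8_b in the later optimization stages would shrink the on-device footprint further.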
# For local tests (no TT hardware needed)
conda create -n tenstorrent python=3.10
conda activate tenstorrent
pip install torch transformers datasets soundfile scipy pytest
# For on-device execution (requires tt-metal installed)
# See https://docs.tenstorrent.com/getting-started/README.html
# Local architecture validation (no hardware needed)
python tests/test_speech_encoder_prenet.py

References:
- SpeechT5 paper (ACL 2022)
- HuggingFace SpeechT5-VC model
- tt-metal SpeechT5 TTS implementation
- TTNN model bringup guide
- OpenVoice V2 PR (HiFi-GAN patterns)
Apache 2.0 - matching Tenstorrent's tt-metal licensing.