Pre-hardware exploration of porting Microsoft's SpeechT5 Voice Conversion model to Tenstorrent's TTNN API.
Pre-hardware. Not validated on device.
I have two Tenstorrent Blackhole p150a cards on order. While waiting for them to arrive, I'm getting a head start on learning the TTNN development workflow by porting the SpeechT5 voice conversion pipeline. This work is motivated by a planned local voice assistant project (Jarvis-style STT -> LLM -> TTS, all on-device).
I saw there's an open bounty for this model bringup, which inspired the investigation. This repo is my independent exploration - not a bounty claim.
Once the cards arrive and are installed, I'll validate on hardware and update with real results.
The speech encoder prenet is the one genuinely new TTNN component needed for voice conversion; the shared encoder, decoder, and postnet already exist in tt-metal's SpeechT5 TTS implementation.
The speech encoder prenet converts raw 16kHz audio into encoder-ready hidden states:
Raw waveform [B, 16000]        (1 second of audio)
      |
      v  7x Conv1d (wav2vec2-style, total stride=320)
[B, 512, 49]                   (49 frames at 20 ms each)
      |
      v  LayerNorm + Linear projection
[B, 49, 768]
      |
      v  Positional conv embedding (grouped Conv1d, k=128, g=16)
         + sinusoidal positional encoding
[B, 49, 768] -> ready for shared encoder
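The 49-frame figure falls straight out of the conv-stack arithmetic. A quick pure-Python check, using the standard wav2vec2 feature-encoder config (kernel sizes and strides), which SpeechT5's speech prenet reuses:

```python
# Standard wav2vec2 feature-encoder config: (kernel, stride) per layer.
# Total stride = 5 * 2**6 = 320, so 16 kHz audio -> one frame per 20 ms.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def output_frames(num_samples: int) -> int:
    """Apply the unpadded Conv1d length formula floor((L - k) / s) + 1 per layer."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        length = (length - kernel) // stride + 1
    return length

if __name__ == "__main__":
    # 1 second of 16 kHz audio -> 49 frames
    print(output_frames(16000))
```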
TTNN ops used:
- `ttnn.conv1d` - strided convolutions (stride 2 and 5), grouped convolution (groups=16)
- `ttnn.group_norm` - instance normalization (groups=num_channels) on the first conv layer
- `ttnn.gelu` - activation function
- `ttnn.layer_norm` - feature projection normalization
- `ttnn.linear` - 512->768 projection
- `ttnn.add` - residual positional embeddings
Key implementation details:
- Weight normalization on the positional conv is fused at load time (`_fuse_weight_norm`). PyTorch's parametrization API uses `dim=2` for this layer (normalize over dims 0 and 1, preserving the kernel dim), not the more common `dim=0`. The dim is auto-detected from the parametrization object.
- Conv1d stride=2 is covered by tt-metal's sweep tests. Stride=5 (first layer) is untested, but the API passes it through to conv2d - needs hardware validation.
- Groups=16 for the positional conv. Groups up to 512 are tested in tt-metal.
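To make the fusion concrete, here is a minimal numpy sketch of what folding weight normalization computes: `w = g * v / ||v||`, where with `dim=2` the norm is taken over dims (0, 1) and only the kernel dim is preserved. The shapes and the function name are illustrative, not the actual `_fuse_weight_norm` (which operates on the PyTorch parametrization object):

```python
import numpy as np

def fuse_weight_norm(weight_v: np.ndarray, weight_g: np.ndarray, dim: int = 2) -> np.ndarray:
    """Collapse weight normalization w = g * v / ||v|| into a plain weight tensor.

    For the positional conv, PyTorch keeps dim=2 (the kernel dim), so the
    norm is computed over the remaining dims (0, 1).
    """
    reduce_dims = tuple(d for d in range(weight_v.ndim) if d != dim)
    norm = np.sqrt((weight_v ** 2).sum(axis=reduce_dims, keepdims=True))
    return weight_g * weight_v / norm

# Toy positional-conv weight: [out_channels, in_channels // groups, kernel]
rng = np.random.default_rng(0)
v = rng.standard_normal((768, 48, 128))
g = rng.standard_normal((1, 1, 128))  # one gain per kernel position when dim=2
fused = fuse_weight_norm(v, g, dim=2)
```

After fusion, the per-kernel-position norm of the weight equals |g|, which is the property a PCC check against the unfused PyTorch forward pass would confirm.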
Wires the speech encoder prenet to tt-metal's existing encoder/decoder/postnet. This file requires tt-metal to be installed (it imports from models.experimental.speecht5_tts.tt).
Key design decisions:
- Speaker embeddings are required - the decoder preprocessor normalizes and bakes them into the weights at init time. They are not passed at runtime.
- The shared encoder's `__call__` is bypassed (it does text embedding lookup). Instead, `_run_encoder_from_hidden_states()` feeds speech prenet output directly into the pre-encoder LayerNorm + encoder blocks.
- Encoder parameters are preprocessed with placeholders for text-specific weights (embed_tokens, positional encoding) that the encoder init expects but the VC path never uses.
- HiFi-GAN vocoder runs on CPU/PyTorch, matching the existing TTS demo pattern. ConvTranspose1d (stride=4 upsampling) doesn't run on TTNN - this is the accepted approach in the ecosystem.
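On baking the speaker embedding into the weights: since the embedding is constant after init, its contribution to a linear layer that consumes `concat(hidden, speaker)` collapses into the bias. A hedged numpy sketch of the idea (shapes and the function name are illustrative, not the tt-metal implementation; the L2-normalization matches SpeechT5's handling of x-vectors):

```python
import numpy as np

def bake_speaker_embedding(W: np.ndarray, b: np.ndarray, spk: np.ndarray):
    """Fold a constant speaker embedding into a linear layer.

    The layer computes W @ concat(h, spk) + b. Split W into the part that
    multiplies the hidden state and the part that multiplies the fixed,
    L2-normalized speaker embedding; the latter collapses into the bias.
    """
    spk = spk / np.linalg.norm(spk)           # SpeechT5 L2-normalizes x-vectors
    hidden_dim = W.shape[1] - spk.shape[0]
    W_h, W_s = W[:, :hidden_dim], W[:, hidden_dim:]
    return W_h, b + W_s @ spk

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256 + 512))     # hidden 256 + x-vector 512 (illustrative)
b = rng.standard_normal(256)
spk = rng.standard_normal(512)
h = rng.standard_normal(256)

W_h, b_baked = bake_speaker_embedding(W, b, spk)
reference = W @ np.concatenate([h, spk / np.linalg.norm(spk)]) + b
assert np.allclose(W_h @ h + b_baked, reference)
```

The payoff is that no concat or speaker tensor exists at runtime - the decoder sees an ordinary linear layer.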
8 local validation tests (run without TT hardware):
- Feature encoder output shapes through all 7 layers
- Feature projection shape (512 -> 768)
- Positional conv embedding shape preservation
- Weight normalization fusion PCC = 0.99999994
- No bias in feature encoder convs (architecture verification)
- Output frame counts for 5 audio durations
- Sinusoidal embedding buffer sanity
- Full prenet end-to-end output shape
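For reference, the sinusoidal buffer exercised by the sanity test can be sketched in a few lines of numpy. This is the standard transformer sin/cos formulation; the exact SpeechT5 buffer layout may differ:

```python
import numpy as np

def sinusoidal_positions(num_positions: int, dim: int) -> np.ndarray:
    """Standard positional encoding: sin on even feature indices, cos on odd."""
    positions = np.arange(num_positions)[:, None]                    # [T, 1]
    div = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))    # [dim/2]
    table = np.zeros((num_positions, dim))
    table[:, 0::2] = np.sin(positions * div)
    table[:, 1::2] = np.cos(positions * div)
    return table

pe = sinusoidal_positions(49, 768)   # one row per 20 ms frame
```

The sanity checks are cheap: position 0 is all zeros on sin slots and all ones on cos slots, and every entry stays in [-1, 1].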
Still pending:
- Hardware validation (need the cards)
- End-to-end audio demo (WAV in -> conversion -> WAV out)
- Performance measurements (tok/s, RTF)
- Quality metrics (speaker similarity, WER, token accuracy)
- Stage 2/3 optimizations (sharding, bfloat8_b, trace)
SpeechT5 voice conversion reuses the same shared encoder/decoder as TTS. Only the input path differs:
| Component | TTS | VC | TTNN Status |
|---|---|---|---|
| Input prenet | Text embeddings | Speech Conv1d stack | New (this repo) |
| Shared encoder | 12 transformer layers | Same | Existing in tt-metal |
| Shared decoder | 6 layers + cross-attn + speaker conditioning | Same | Existing in tt-metal |
| Postnet | 5x Conv1d + BatchNorm | Same | Existing in tt-metal |
| Vocoder | HiFi-GAN | Same | CPU/PyTorch |
Total model: 153.7M params (SpeechT5) + 12.7M params (HiFi-GAN) = 166.4M params, ~0.67 GB. Fits easily on a single Wormhole N150 (12 GB) or Blackhole p150a (32 GB).
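The ~0.67 GB figure is just the fp32 footprint of the weights (params x 4 bytes):

```python
# fp32 weight footprint: params * 4 bytes per param
speecht5_params = 153.7e6
hifigan_params = 12.7e6
total_params = speecht5_params + hifigan_params   # 166.4M
gigabytes = total_params * 4 / 1e9                # ~0.67 GB
print(f"{total_params / 1e6:.1f}M params, {gigabytes:.2f} GB")
```

Dropping to bfloat8_b in the later optimization stages would shrink the on-device footprint further.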
# For local tests (no TT hardware needed)
conda create -n tenstorrent python=3.10
conda activate tenstorrent
pip install torch transformers datasets soundfile scipy pytest
# For on-device execution (requires tt-metal installed)
# See https://docs.tenstorrent.com/getting-started/README.html
# Local architecture validation (no hardware needed)
python tests/test_speech_encoder_prenet.py

References:
- SpeechT5 paper (ACL 2022)
- HuggingFace SpeechT5-VC model
- tt-metal SpeechT5 TTS implementation
- TTNN model bringup guide
- OpenVoice V2 PR (HiFi-GAN patterns)
Apache 2.0 - matching Tenstorrent's tt-metal licensing.