speecht5_tts-wolof

This model is a fine-tuned version of SpeechT5 for Text-to-Speech (TTS) on a Wolof dataset. It uses a custom tokenizer designed for Wolof and adjusts the baseline model's configuration to account for the new vocabulary introduced by the custom tokenizer. This version of SpeechT5 provides speech synthesis capabilities specifically tuned for the Wolof language.

Model description

This model is based on the SpeechT5 architecture, which unifies speech and text processing in a single encoder-decoder framework. It is fine-tuned for Text-to-Speech (TTS) with a custom-trained tokenizer and an adapted configuration that accounts for the vocabulary of the Wolof language. Fine-tuning was carried out on paired Wolof text and audio so that the synthesized speech captures the nuances of the language.
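
To see the effect of the custom tokenizer, you can compare its vocabulary size with the model configuration. This is a quick sanity check, not part of the original workflow:

from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech

processor = SpeechT5Processor.from_pretrained("bilalfaye/speecht5_tts-wolof")
model = SpeechT5ForTextToSpeech.from_pretrained("bilalfaye/speecht5_tts-wolof")

# The model's embedding layer was resized to match the custom Wolof tokenizer,
# so the two vocabulary sizes should agree.
print(len(processor.tokenizer), model.config.vocab_size)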


Installation

To install the necessary dependencies, run the following command:

!pip install transformers datasets torch

Model loading and speech generation

import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor
from transformers import SpeechT5HifiGan

def load_speech_model(checkpoint="bilalfaye/speecht5_tts-wolof", vocoder_checkpoint="microsoft/speecht5_hifigan"):
    """
    Load the SpeechT5 model, processor, and vocoder for text-to-speech.
    
    Args:
        checkpoint (str): The model checkpoint for SpeechT5 TTS.
        vocoder_checkpoint (str): The checkpoint for the HiFi-GAN vocoder.
    
    Returns:
        processor: The processor for the model.
        model: The loaded SpeechT5 model.
        vocoder: The loaded HiFi-GAN vocoder.
        device: The device (CPU or GPU) the model is loaded on.
    """
    # Check for GPU availability and set device accordingly
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    # Load the SpeechT5 processor and model
    processor = SpeechT5Processor.from_pretrained(checkpoint)
    model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint).to(device)  # Move model to the correct device

    # Load the HiFi-GAN vocoder
    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_checkpoint).to(device)  # Move vocoder to the correct device

    return processor, model, vocoder, device

# Example usage
processor, model, vocoder, device = load_speech_model()

# Verify the device being used
print(f"Model and vocoder loaded on device: {device}")

from datasets import load_dataset
# Load a speaker x-vector to condition the voice (index 7306 is the voice used in the official SpeechT5 examples)
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

from IPython.display import Audio, display

def generate_speech_from_text(text, 
                              speaker_embedding=speaker_embedding,
                              processor=processor,
                              model=model,
                              vocoder=vocoder):            
    """
    Generates speech from a given text using SpeechT5 and HiFi-GAN vocoder.

    Args:
        text (str): The input text to be converted to speech.
        speaker_embedding (torch.Tensor): The speaker embedding tensor.
        processor (SpeechT5Processor): The processor for the model.
        model (SpeechT5ForTextToSpeech): The loaded SpeechT5 model.
        vocoder (SpeechT5HifiGan): The loaded HiFi-GAN vocoder.

    Returns:
        None. The generated audio is played inline at 16 kHz.
    """
    # SpeechT5 caps the number of input text tokens; use the limit to truncate long inputs
    max_text_positions = model.config.max_text_positions

    # Tokenize the input text and move input tensor to the correct device
    inputs = processor(text=text, return_tensors="pt", padding=True, truncation=True, max_length=max_text_positions)
    inputs = {key: value.to(model.device) for key, value in inputs.items()}  # Move inputs to device

    # Generate speech. SpeechT5's generate() controls output length with a stop
    # threshold and length ratios; text-generation arguments such as num_beams
    # or temperature are silently ignored, so they are not passed here.
    speech = model.generate(
        inputs["input_ids"],
        speaker_embeddings=speaker_embedding.to(model.device),  # keep the embedding on the model's device
        vocoder=vocoder,
        threshold=0.5,     # stop once the predicted stop probability exceeds this value
        minlenratio=0.0,   # minimum output length relative to input length
        maxlenratio=20.0,  # maximum output length relative to input length
    )

    # Detach the speech from the computation graph and move it to CPU
    speech = speech.detach().cpu().numpy()

    # Play the generated speech using IPython Audio
    display(Audio(speech, rate=16000))


# Example usage
text = "ñu ne ñoom ñooy nattukaay satélite yi"
generate_speech_from_text(text)
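
To save the audio to a WAV file instead of playing it inline (useful outside notebooks), you can call the model directly. This sketch assumes the soundfile package is installed (pip install soundfile); "wolof_tts.wav" is an illustrative filename, and SpeechT5 outputs 16 kHz audio:

import soundfile as sf

inputs = processor(text=text, return_tensors="pt").to(model.device)
waveform = model.generate(
    inputs["input_ids"],
    speaker_embeddings=speaker_embedding.to(model.device),
    vocoder=vocoder,
)
sf.write("wolof_tts.wav", waveform.detach().cpu().numpy(), samplerate=16000)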

Intended uses & limitations

Intended uses:

  • Text-to-Speech Generation: This model can be used to convert Wolof text into natural-sounding speech. It can be integrated into applications that require voice interfaces, virtual assistants, or voice synthesis for Wolof-speaking communities.

Limitations:

  • Limited Scope: The model has been specifically fine-tuned for Wolof and may not perform well with other languages or accents.
  • Data Availability: While the model was fine-tuned on a Wolof dataset, the quality of the generated speech may vary depending on the complexity of the input text and the dataset used for training.
  • Vocabulary and Tokenizer Constraints: The tokenizer was trained specifically for Wolof, so it may not handle out-of-vocabulary words or unknown characters well; a quick check is sketched below.
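
A minimal sketch of such a check, reusing the processor loaded above (has_unknown_tokens is an illustrative helper, not part of the model):

def has_unknown_tokens(text, tokenizer):
    """Return True if any part of `text` maps to the unknown token."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return tokenizer.unk_token_id is not None and tokenizer.unk_token_id in ids

print(has_unknown_tokens("ñu ne ñoom", processor.tokenizer))  # expected False for well-formed Wolof text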

Training and evaluation data

The model was fine-tuned on a custom dataset of paired Wolof text and speech. This data was used to adapt the model so that the generated speech reflects the phonetic and syntactic properties of Wolof.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • Learning Rate: 1e-05
  • Training Batch Size: 8
  • Evaluation Batch Size: 2
  • Seed: 42
  • Gradient Accumulation Steps: 8
  • Total Train Batch Size: 64
  • Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
  • Learning Rate Scheduler Type: Linear
  • Warmup Steps: 500
  • Training Steps: 255000
  • Mixed Precision Training: Native AMP
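
For reference, a hedged sketch of how these values map onto transformers' Seq2SeqTrainingArguments (the output directory is a placeholder; the Adam betas and epsilon are the optimizer defaults):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_tts-wolof",      # placeholder path
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=2,
    seed=42,
    gradient_accumulation_steps=8,        # 8 x 8 = 64 effective train batch size
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=255000,
    fp16=True,                            # native AMP mixed precision
)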

Training results

Epoch   Training Loss   Validation Loss
26      0.3894          0.3687
27      0.3858          0.3712
28      0.3874          0.3669
29      0.3887          0.3685
30      0.3854          0.3670
32      0.3856          0.3697

The table reports only the final epochs of training.

Framework versions

  • Transformers: 4.41.2
  • PyTorch: 2.4.0+cu121
  • Datasets: 3.2.0
  • Tokenizers: 0.19.1
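
To confirm your environment matches these versions, a quick check:

import transformers, torch, datasets, tokenizers
print(transformers.__version__, torch.__version__, datasets.__version__, tokenizers.__version__)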

Author

  • Bilal FAYE