Malaysian Finetune Whisper Small V2

Finetune Whisper Small on Malaysian STT Whisper

WanDB at https://wandb.ai/huseinzol05/malaysian-whisper-small-v2, still on training

Improvement

  1. Distilled from Whisper Large V3 on Malaysian and Science context.
  2. Better translation for Malay, Manglish, Mandarin, Tamil and Science context.
  3. Word level timestamp, introduced <|transcribeprecise|> token, a new task!

how to

Load the model,

import torch
from transformers.models.whisper import tokenization_whisper

tokenization_whisper.TASK_IDS = ["translate", "transcribe", 'transcribeprecise']

from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained(
    'mesolitica/malaysian-whisper-small-v2'
)
tokenizer = processor.tokenizer
model = WhisperForConditionalGeneration.from_pretrained(
    'mesolitica/malaysian-whisper-small-v2', torch_dtype = torch.bfloat16
).cuda().eval()

Transcribe

from datasets import Audio
import requests

sr = 16000
audio = Audio(sampling_rate=sr)

r = requests.get('https://github.com/mesolitica/malaya-speech/raw/master/speech/assembly.mp3')
y = audio.decode_example(audio.encode_example(r.content))['array']

with torch.no_grad():
    p = processor([y], return_tensors='pt')
    p['input_features'] = p['input_features'].to(torch.bfloat16)
    r = model.generate(
        p['input_features'].cuda(),
        output_scores=True,
        return_dict_in_generate=True,
        language='ms',
        return_timestamps=True, task = 'transcribe')

tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(r.sequences[0]))
<|startoftranscript|><|ms|><|transcribe|><|0.02|> Assembly on Aging di Vienna, Australia<|3.78|><|3.78|> yang telah diadakan pada tahun 1982<|6.50|><|6.50|> dan berasaskan unjuran tersebut<|8.82|><|8.82|> maka Jabatan Perangkaan Malaysia<|10.40|><|10.40|> menganggarkan menjelang tahun 2035<|13.72|><|13.72|> sejumlah 15% penduduk kita adalah daripada kalangan warga emas.<|18.72|><|19.28|> Untuk makluman Tuan Yang Pertua dan juga Alia Mbahumat,<|22.12|><|22.26|> pembangunan sistem pendaftaran warga emas<|24.02|><|24.02|> ataupun kita sebutkan event<|25.38|><|25.38|> adalah usaha kerajaan ke arah merealisasikan<|28.40|><|endoftext|>

Transcribe word level timestamp

You must use transcribeprecise for the task, or <|transcribeprecise|> token,

from datasets import Audio
import requests

sr = 16000
audio = Audio(sampling_rate=sr)

r = requests.get('https://github.com/mesolitica/malaya-speech/raw/master/speech/assembly.mp3')
y = audio.decode_example(audio.encode_example(r.content))['array']

with torch.no_grad():
    p = processor([y], return_tensors='pt')
    p['input_features'] = p['input_features'].to(torch.bfloat16)
    r = model.generate(
        p['input_features'].cuda(),
        output_scores=True,
        return_dict_in_generate=True,
        language='ms',
        return_timestamps=True, task = 'transcribeprecise')

tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(r.sequences[0]))
<|startoftranscript|><|ms|><|transcribeprecise|><|0.02|> Assembly<|1.20|><|1.56|> on<|1.64|><|1.74|> Aging<|2.04|><|2.14|> di<|2.22|><|2.26|> Vienna<|2.50|><|2.72|> Australia<|3.12|><|4.26|> yang<|4.38|><|4.42|> telah<|4.58|><|4.62|> diadakan<|5.08|><|5.16|> pada<|5.30|><|5.36|> tahun<|5.60|><|5.62|> 1982<|6.92|><|7.12|> dan<|7.24|><|7.32|> berasaskan<|7.88|><|7.98|> unjuran<|8.36|><|8.42|> tersebut<|8.80|><|8.88|> maka<|9.06|><|9.12|> Jabatan<|9.48|><|9.56|> Perangkaan<|9.98|><|10.04|> Malaysia<|10.36|><|10.84|> menganggarkan<|11.56|><|11.98|> menjelang<|12.34|><|12.40|> tahun<|12.64|><|12.66|> 2035<|14.08|><|14.50|> sejumlah<|14.96|><|14.98|> 15%<|16.14|><|16.26|> penduduk<|16.62|><|16.68|> kita<|16.90|><|17.02|> adalah<|17.30|><|17.40|> daripada<|17.80|><|17.86|> kalangan<|18.16|><|18.22|> warga<|18.40|><|18.46|> emas.<|18.68|><|19.24|> Untuk<|19.40|><|19.46|> makluman<|19.86|><|20.64|> Tuan<|20.76|><|20.82|> Yang<|20.90|><|20.94|> Pertua<|21.14|><|21.20|> dan<|21.28|><|21.34|> juga<|21.50|><|21.58|> Alia<|21.70|><|21.76|> Mbah<|21.88|><|21.92|> Ahmad,<|22.08|><|22.22|> pembangunan<|22.66|><|22.72|> sistem<|23.00|><|23.06|> pendaftaran<|23.48|><|23.54|> warga<|23.72|><|23.78|> emas<|23.98|><|24.06|> ataupun<|24.36|><|24.42|> kita<|24.56|><|24.62|> sebutkan<|24.94|><|25.08|> event<|25.38|><|25.86|> adalah<|26.10|><|26.18|> usaha<|26.46|><|26.60|> kerajaan<|27.06|><|27.16|> kearah<|27.44|><|27.50|> merealisasikan<|28.36|><|28.86|> objektif<|29.36|><|29.42|> yang<|29.52|><|29.56|> telah<|29.72|><|29.76|> digarakan<|30.00|><|endoftext|>
Downloads last month
26
Safetensors
Model size
242M params
Tensor type
BF16
·
Inference API
Unable to determine this model's library. Check the docs .

Collection including mesolitica/malaysian-whisper-small-v2