Skip to content

daniEL2371/STT-engine

 
 

Repository files navigation

Amharic Speech-to-Text engine

Introduction

Speech recognition technology allows for hands-free control of smartphones, speakers, and even vehicles in a wide variety of languages. Companies have moved towards the goal of enabling machines to understand and respond to more and more of our verbalized commands. There are many matured speech recognition systems available, such as Google Assistant, Amazon Alexa, and Apple’s Siri. However, all of those voice assistants work for limited languages only.

The World Food Program wants to deploy an intelligent form that collects nutritional information of food bought and sold at markets in two different countries in Africa - Ethiopia and Kenya. The design of this intelligent form requires selected people to install an app on their mobile phone, and whenever they buy food, they use their voice to activate the app to register the list of items they just bought in their own language. The intelligent systems in the app are expected to live to transcribe the speech-to-text and organize the information in an easy-to-process way in a database.

Our responsibility was to build a deep learning model that is capable of transcribing a speech to text in the Amharic language. The model we produce will be accurate and is robust against background noise.

Code

The code of our analysis can be found in the notebooks folder. The data preprocessing and visualization, and model training parts can be found in the Amharic_STT_preprocessing.ipynb jupyter notebook. This notebook can be run in google colab. The Amharic_Speech_To_Text.ipynb contains a modularized version of the first notebook. The scripts folder contains the data loading and preprocessing functions. The trained models will be stored in the models folder.

Dependencies

To install the necessary dependencies, execute the command $ pip install -r requirements.txt"

About

In STT-engine project, a speech to text model is build for Amharic language. Feature extraction is implemented by generating Mel spectrogram images. CNN is used to learn the feature maps from spectrogram images and BI-RNN is applied to predict the transcription given a time series feature map.

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages

  • Jupyter Notebook 99.7%
  • Other 0.3%