This project combines Speech-to-Text (STT), Speech Emotion Recognition (SER), and Facial Emotion Recognition (FER) to evaluate your presentation skills. The SER model is built on a hybrid Convolutional Neural Network (CNN) architecture, and pitch detection uses an autocorrelation method.
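The project's actual estimator lives in pitch_detection.py; below is a minimal sketch of the autocorrelation approach, assuming mono frames normalized to [-1, 1] (the function name, parameters, and the 50–500 Hz search band are illustrative assumptions, not the project's code):

```python
import numpy as np

def estimate_pitch(frame, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) of a mono frame via
    autocorrelation; returns 0.0 when no clear voiced peak is found."""
    frame = frame - np.mean(frame)          # remove DC offset
    corr = np.correlate(frame, frame, mode="full")
    corr = corr[len(corr) // 2:]            # keep non-negative lags only

    # Restrict the peak search to lags that map into the voice range.
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(corr) - 1)
    if min_lag >= max_lag:
        return 0.0

    peak_lag = min_lag + int(np.argmax(corr[min_lag:max_lag]))
    if corr[peak_lag] <= 0:
        return 0.0                          # silent or unvoiced frame
    return sample_rate / peak_lag
```

The demo provides the following features: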
- Speech-to-Text (STT): Transcribes your speech.
- Speech Emotion Recognition (SER): Identifies the emotional tone in your speech.
- Facial Emotion Recognition (FER): Recognizes facial expressions.
- Pitch Detection: Monitors the pitch of your voice.
- Words per Minute (WPM) Monitoring: Calculates your speaking speed.
- Loudness Monitoring: Tracks how loud your voice is (both computations are sketched after this list).
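The README doesn't spell out these two computations, but both are standard; here is a minimal sketch, assuming the transcript comes from the STT step and the audio samples are normalized floats (helper names are hypothetical):

```python
import numpy as np

def words_per_minute(transcript: str, duration_seconds: float) -> float:
    """Speaking rate: whitespace-separated word count per elapsed minute."""
    if duration_seconds <= 0:
        return 0.0
    return len(transcript.split()) / (duration_seconds / 60.0)

def loudness_db(frame: np.ndarray) -> float:
    """RMS level of an audio frame in dBFS, assuming samples in [-1, 1]."""
    rms = np.sqrt(np.mean(np.square(frame)))
    return 20.0 * np.log10(max(rms, 1e-10))  # floor avoids log(0)
```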
Install the required packages:
pip install -r requirements.txt
To run the demo, use the following command:
python -m streamlit run app.py
- `speech_emotion_recognition.py`: Contains the code for the SER model and inference functions.
- `pitch_detection.py`: The code for pitch detection.
- `utils.py`: Utility functions for audio processing.
- `app.py`: Streamlit app for the demo.
- `models/ser/`: Folder containing pre-trained models.
- `views.py`: Streamlit UI structures.
For a more detailed understanding of the project, you can refer to the Blog Post.
This project is inspired by various research papers and open-source contributions in the area of audio signal processing and machine learning.