hse-scila/Aphasia
Aphasia Classification Based on Patient Speech

Repository Navigation

  • models — Python files containing model classes
  • notebooks — Jupyter notebooks with experiments
  • src — helper functions/classes and Streamlit web app

Problem Statement

Assistive systems are among the most in-demand areas of machine learning.
Already today, some doctors use artificial intelligence in their daily practice: it simplifies diagnosis
and enables personalized treatment for each patient.



Our work focuses on building a model that predicts the presence of aphasia in a patient. Aphasia is a language disorder
that impairs the production and comprehension of speech. It often occurs after a stroke, traumatic brain injury, or diseases
of the central nervous system. The condition can severely limit a person's ability to communicate,
especially in elderly individuals; however, if therapy starts early enough, recovery is possible.
A tool that can detect the first signs of aphasia is therefore valuable.

Dataset

The dataset was provided by the Center for Language and Brain, HSE University. It includes 253 participants with aphasia
and 101 without. Each participant has approximately two audio recordings, and the participants span several age groups.
The average age of aphasic participants is 58, with an approximately normal age distribution. The non-aphasic group's age
distribution is more uniform, containing both young and elderly subjects.

Participant-level demographic information (such as aphasia status and age) is collected in the pwa.demographic.info.csv file.
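As an illustration of how such a table can be loaded and summarized with pandas, here is a minimal sketch. The column names (`aphasia`, `age`, `participant_id`) are assumptions for illustration, not the actual schema of pwa.demographic.info.csv, and a toy in-memory frame stands in for the real file.

```python
import pandas as pd

# Toy stand-in for pwa.demographic.info.csv; real column names may differ.
demo = pd.DataFrame({
    "participant_id": ["p01", "p02", "p03", "p04"],  # hypothetical id column
    "aphasia": [1, 1, 0, 0],                         # hypothetical label column
    "age": [61, 55, 24, 70],                         # hypothetical age column
})

# Group-level age statistics, mirroring the summary in the text.
stats = demo.groupby("aphasia")["age"].agg(["count", "mean"])
print(stats)
```

With a real file, `demo = pd.read_csv(...)` would replace the constructed frame; the grouping logic stays the same.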

Methods

Classical ML

As a baseline, we chose classical machine learning, since in some cases it is sufficient.
We used FLAML because it automatically selects models and tunes their hyperparameters.
Feature sets included MFCC+ZCR, Prosody Features+ZCR, and a combination of several feature types
(MFCC, Chromagram, Spectral Features, Prosody Features, ZCR, Timestamps). Additionally, we used Optuna to tune CatBoost.
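The pattern here is a classical classifier over a fixed-length acoustic feature vector per recording. The actual pipeline uses FLAML and Optuna-tuned CatBoost; as a lighter-weight sketch of the same idea, here is a scikit-learn gradient boosting baseline on synthetic features standing in for MFCC+ZCR (the feature dimensionality and labels are illustrative, not from the dataset).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 recordings x 14 features (e.g. 13 MFCC means + ZCR).
n, d = 200, 14
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # separable toy labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Gradient boosting as a stand-in for the AutoML-selected model.
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"held-out accuracy: {acc:.2f}")
```

With FLAML, the `fit` call would instead search over several model families under a time budget, which is what makes it convenient as a baseline.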

MFCC

MFCC represents audio as a sequence of cepstral coefficients, one vector per short time frame:
the power spectrum of each frame is passed through a mel filterbank, log-compressed,
and decorrelated with a DCT. Physically, the mel scale approximates how human hearing
resolves frequencies (as in Mel-spectrograms). Since our data consists of speech recordings,
this representation captures relevant speech-related features.
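To make the pipeline concrete, here is a from-scratch NumPy/SciPy sketch of the standard MFCC computation (framing, mel filterbank, log, DCT). Libraries such as librosa provide this in one call; all parameter values below are illustrative, not the ones used in the repo.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_mfcc=13):
    # Frame the signal and take the power spectrum of each windowed frame.
    frames = np.lib.stride_tricks.sliding_window_view(signal, n_fft)[::hop]
    power = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2

    # Triangular mel filterbank spanning 0 .. sr/2.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log mel energies, then DCT to decorrelate -> cepstral coefficients.
    logmel = np.log(power @ fbank.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm="ortho")[:, :n_mfcc]

# One second of a 440 Hz tone as a toy "recording".
t = np.arange(16000) / 16000
coeffs = mfcc(np.sin(2 * np.pi * 440 * t))
print(coeffs.shape)  # (frames, n_mfcc)
```

The resulting frames-by-coefficients matrix is what gets averaged into a fixed vector for classical ML, or fed frame-wise into a 1D CNN.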


In the literature, both classical ML and 1D CNNs are commonly used with MFCCs.

Waveform

One straightforward idea is to feed raw audio directly into a transformer. Below is the Wav2Vec model scheme:

Wav2Vec architecture
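Wav2Vec-family models consume the raw waveform directly, typically as 16 kHz mono audio normalized to zero mean and unit variance. A minimal NumPy/SciPy preprocessing sketch is below; the repo presumably uses a library feature extractor for this, so treat the function and its defaults as illustrative assumptions.

```python
import numpy as np
from scipy.signal import resample_poly

def prepare_waveform(signal, orig_sr, target_sr=16000):
    """Resample to the model rate and normalize, Wav2Vec2-extractor style."""
    x = np.asarray(signal, dtype=np.float32)
    if x.ndim == 2:              # stereo -> mono by channel averaging
        x = x.mean(axis=1)
    if orig_sr != target_sr:     # polyphase resampling to 16 kHz
        x = resample_poly(x, target_sr, orig_sr)
    return (x - x.mean()) / (x.std() + 1e-7)

# 44.1 kHz toy clip -> 16 kHz normalized input for the transformer.
clip = np.sin(2 * np.pi * 220 * np.arange(44100) / 44100)
wave = prepare_waveform(clip, orig_sr=44100)
print(wave.shape)
```

The normalized 1D array is then passed to the model as-is; the transformer's convolutional front end handles the rest.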

Spectrograms

Spectrograms remain one of the most commonly used audio representations, so it was reasonable to test them as well.
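For a 2D CNN such as the MobileNet mentioned below, the input is a log-power spectrogram treated as an image. A minimal SciPy sketch (window sizes and the toy signal are illustrative, not the repo's settings):

```python
import numpy as np
from scipy.signal import spectrogram

sr = 16000
t = np.arange(2 * sr) / sr
audio = np.sin(2 * np.pi * 440 * t)          # toy 2-second recording

# Short-time power spectrogram: frequency bins x time frames.
freqs, times, S = spectrogram(audio, fs=sr, nperseg=512, noverlap=256)

# Log scale (dB): the usual "image" fed to a 2D CNN classifier.
S_db = 10 * np.log10(S + 1e-10)
print(S_db.shape)  # (frequency bins, time frames)
```

The 2D `S_db` array can then be resized/stacked to the input shape a pretrained image backbone expects.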

Conclusion and Future Work

Various methods were tested; for the final Streamlit application,
Wav2Vec was chosen for its accuracy, and MobileNet on MFCCs for its speed and good performance.


Although the classifier itself is complete, there is still room for exploration. For example,
severity prediction remains an open goal.

About

This is the repository for the paper "...".
