McMasterAI-Society/BOLLD

🧠 Body and 🗣️ Oral 📝 Language 📚 Learning 🧩 Decoder

BOLLD employs a multi-modal approach, integrating body language analysis, lip transcriptions, and reinforcement learning to detect threats in real time using computer vision and natural language processing.


🚀 Use Cases

🔒 Security Applications

  • Detecting potential threats or violent language when audio is corrupted or unavailable during meetings. 📞⚠️
  • Aimed at enhancing safety and providing an alternative threat detection system that doesn't rely on sound. 🎥🔍

🛡️ Violence Mitigation

  • Can be applied in public safety scenarios, such as campus surveillance, to alert authorities of potential threats in real time. 🎓🚨
  • Assistive technology: can be implemented in camera-equipped glasses to help people with disabilities, such as blindness, by notifying them of potential threats they might not visually perceive. 👓🤖👀

🚀 Tech Stack

Python NumPy Shape Predictor 68

🎥 Computer Vision

OpenCV Dlib Mediapipe

🤖 Machine Learning

TensorFlow Scikit-learn

📊 Visualization & Data Processing

Matplotlib Plotly Pandas CSV

🧠 Algorithms

Q-Learning

Run the app using the following command:

streamlit run app.py

Currently, app.py contains the body language training code (details in the body_lang_decoder folder) and the lip transcription component (details on the model in the lip_to_text folder). Each key word in the lip transcript is compared against a list of threatening words, and a threat level is calculated. That threat level determines the state of the system, which is passed into the Q-Learning table; reinforcement learning then uses the state to choose an action.
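The key-word comparison described above can be sketched as follows. The word list and weights here are invented for illustration; the project's actual dictionary lives in the lip_to_text folder.

```python
# Illustrative key-word threat scoring: each recognized word is checked
# against a dictionary of threatening words with associated weights.
# The words and weights below are made-up examples, not the real list.
THREAT_WORDS = {"gun": 1.0, "kill": 1.0, "fight": 0.7, "hurt": 0.5}

def threat_level(transcript: str) -> float:
    """Return a 0-1 threat level: the highest weight among matched key words."""
    words = transcript.lower().split()
    matched = [THREAT_WORDS[w] for w in words if w in THREAT_WORDS]
    return max(matched, default=0.0)
```

The maximum over matched weights is one simple choice; a sum or average of matches would be an equally plausible aggregation.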

🚀 High-Level Overview

(Process flow diagram)

1️⃣ First Stage

  • Use a trained body language model 🕺 and lip reading (via Mediapipe landmarks) 👄 to compute a numerical threat probability (0-1) for each.
  • Combine both values to get a combined threat score 🔒.
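Combining the two stage-1 probabilities could look like the sketch below; the weighted average and the default equal weighting are assumptions for illustration, not necessarily the combination the repository uses.

```python
def combined_threat(body_p: float, lip_p: float, body_weight: float = 0.5) -> float:
    """Combine the body-language and lip-reading threat probabilities.

    Uses a weighted average so the result stays in [0, 1]; the 0.5 default
    weight is an assumption, not the project's tuned value.
    """
    for p in (body_p, lip_p):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return body_weight * body_p + (1.0 - body_weight) * lip_p
```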

2️⃣ Second Stage

  • Based on the two inputs from the first stage, train a reinforcement learning model 🤖 to recognize sequences of actions and lip movements that suggest malicious behavior.
    • Output: 0 ➔ Non-malicious, 1 ➔ Malicious, plus a scale (0-1) representing the threat level of key words (0 = non-threatening, 1 = threatening).
  • The model then influences the environment state 🌍:
    • De-escalate 🕊️ if a threat is correctly identified.
    • Return to "All clear!" if the threat was incorrectly identified. 🚨
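The Q-Learning table mentioned earlier can be sketched as minimal tabular Q-learning over discretized threat states, with the responses above as actions. The state/action names, learning rate, and discount factor below are all illustrative assumptions, not the repository's actual environment.

```python
import random

# Hypothetical discretized states (from the combined threat score) and
# actions (the responses described above). All values are illustrative.
STATES = ["low", "medium", "high"]
ACTIONS = ["all_clear", "de_escalate", "alert"]
ALPHA, GAMMA = 0.1, 0.9  # learning rate and discount factor, chosen arbitrarily

# The Q-table maps (state, action) pairs to expected long-term reward.
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def choose_action(state: str, epsilon: float = 0.1) -> str:
    """Epsilon-greedy selection: explore randomly, otherwise take the best action."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state: str, action: str, reward: float, next_state: str) -> None:
    """Standard Q-learning update toward reward plus discounted best next value."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```

A positive reward when a threat is correctly identified (and a penalty for false alarms) would steer the table toward the de-escalation behavior described above.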

📚 Decisions + Documentation

🧠 Body Language Detection

  • Using the EMOLIPS model (CNN-LSTM) to detect emotions from lip movement based on face details. 👄😠
  • Negative emotions (e.g. anger, disgust) 🥴 can assist in identifying potential threats. ⚠️
  • Oct 27: Shifted to a facial emotion recognition model using DeepFace due to better performance. 🧑‍🎨
  • Integrating body language into a threat vs. non-threat classification using Mediapipe. 🧑‍💻 The model trains on coordinates from landmarks in frames with associated labels.
  • Jan 13: Decided to use one body language model (Mediapipe) after facing multiprocessing conflicts when running two models simultaneously (the initial goal was to average both). 🤖❌
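The landmark-coordinates training data mentioned above can be sketched as flattening each frame's landmarks into one labeled row. The landmark values and label below are dummies; Mediapipe exposes normalized x, y, z values per landmark, which is the shape assumed here.

```python
# Sketch of turning one frame's Mediapipe-style landmarks into a single
# labeled training row, suitable for writing to CSV and training a
# coordinate-based classifier. Values and labels here are dummies.
def landmarks_to_row(landmarks, label):
    """Flatten (x, y, z) landmark tuples into [label, x0, y0, z0, x1, ...]."""
    row = [label]
    for x, y, z in landmarks:
        row.extend([x, y, z])
    return row
```

Collecting many such rows under "threat" and "non-threat" labels yields the tabular dataset a Scikit-learn-style classifier can train on.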

👄 Lip Movement to Text

  • Closely following the methods of LipNet, as it's proven and well-documented. 📄
  • Methodology: uses Dlib for facial landmark detection and preprocesses the GRID dataset, followed by a CNN architecture with bidirectional GRUs; CTC loss is used for model optimization. 📊
  • Jan 13: Switching models, as the previous one couldn't handle live video streams. Transitioning to a more suitable approach (e.g., the Whisper model) to transcribe lip movement to text, then applying custom models to detect violence levels. 💻
  • Jan 21: Exploring a new technique using lip/mouth landmarks to detect phonemes and then identify key words stored in a dictionary with associated threat levels. 📖
  • Jan 27: Enhanced LipNet model to process live video streams 🎥 and detect the mouth region with Dlib + ShapePredictor68.
  • Jan 29: Added an algorithm to detect key words and produce a violence value. 🔑
  • Jan 31: Integrated into app.py. 🎉
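The Jan 27 mouth-region step can be sketched as a bounding box over landmark points 48-67, which cover the mouth in Dlib's 68-point scheme. The detector and predictor calls are omitted so the sketch stays self-contained; the padding value is an assumption.

```python
# Sketch of cropping the mouth region from 68-point facial landmarks
# (as produced by Dlib's shape predictor). In that scheme, indices
# 48-67 are the mouth. The 10-pixel padding is an arbitrary choice.
def mouth_bbox(points, pad=10):
    """Return (left, top, right, bottom) around mouth landmarks 48-67.

    `points` is a list of 68 (x, y) pixel coordinates for one face.
    """
    mouth = points[48:68]
    xs = [x for x, _ in mouth]
    ys = [y for _, y in mouth]
    return (min(xs) - pad, min(ys) - pad, max(xs) + pad, max(ys) + pad)
```

Cropping each video frame to this box gives the mouth-only input the LipNet-style model consumes.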

📅 Rough Milestone Timelines

Weeks 1-2:

  • 🚀 Project kickoff: set up environment and tools
  • 👥 Task assignment
  • 🎯 Define goals and objectives
  • 📊 Data exploration and preparation
  • 🌐 Create basic frontend & backend
  • 🎥 Set up OpenCV for video processing

Weeks 3-4:

  • 🔄 Split into lip reading and reinforcement learning (RL) stages
  • 🤔 Research different models and methods for both stages
  • 💻 Start implementation

Weeks 5-6:

  • ✅ Finish body language part of stage 1
  • 🌱 Set up RL environment
  • 📝 Finish preprocessing for lip-to-text part of stage 1
  • 🔄 Continue implementation of lip-to-text training

Weeks 7-8:

  • 🎓 Finish training lip-to-text part of stage 1
  • 🏁 Complete RL stage 2
  • 🎥 Create a demo video

Weeks 8-10:

  • 🔗 Connect stages 1 and 2
  • 🧠 Continue reinforcement learning model training

Weeks 11-13:

  • 🌐 Frontend & backend integration with ML scripts
  • ✅ Finalize body language model
  • ✅ Finalize lip-to-text model
  • 🧠 Continue working on RL

Weeks 13-14:

  • 🔧 Finish lip-to-text model
  • 🔌 Integrate lip-to-text into the main app.py

Week 15:

  • ✨ Final touches
  • ⚙️ Improve accuracy and fine-tuning
  • 🖥️ Test the model with webcam integration

For more details, please refer to the research document.
