
Personal ASL Webcam Recognition Project

A real-time American Sign Language (ASL) hand landmark detection and letter prediction system built with Python, OpenCV, MediaPipe, and PyTorch. This project uses your webcam to detect hand landmarks and predict ASL letters using a CNN trained on the Sign MNIST dataset.

Demo

Screenshot: real-time ASL webcam landmark detection (see demo_images/)


Features

  • Real-time webcam feed with hand landmark overlay
  • Detects up to 2 hands simultaneously
  • Draws 21 landmarks and color-coded skeletal connections per finger
  • Real-time ASL letter prediction with a confidence score overlay (single hand only)
  • Hand region cropping and preprocessing for model input
  • Custom landmark data collection tool
  • Mirrored display for natural interaction
  • Graceful exit via q key or closing the window

Requirements

  • Python 3.8+
  • JupyterLab or Jupyter Notebook
  • Webcam

Python Dependencies

opencv-python
mediapipe
torch
torchvision
pandas
matplotlib

Install them with:

pip install opencv-python mediapipe --user
pip install torch torchvision --user
pip install pandas matplotlib --user

Project Structure

ASL_live_detection/
├── README.md
├── hand_landmarker.task            # MediaPipe pre-trained hand landmark model (download separately)
├── utils.py                        # Model class definition (MyConvBlock, MyLinearBlock, landmarker_and_result, draw_landmarks_on_image)
├── main.ipynb                      # Main webcam notebook
├── collect_data.ipynb              # Landmark data collection notebook
├── landmark_aug.ipynb              # Model training notebook for 1D data 
├── pixel_aug.ipynb                 # Model training notebook for 2D data from Nvidia Workshop
├── asl_predictions.ipynb           # Model evaluation notebook for 2D data from Nvidia Workshop
├── demo_images/                    # Demo screenshots
├── models/
│   ├── README.md                   # Model descriptions and results
│   ├── asl_model1.pth              # Original grayscale model
│   ├── asl_model2.pth              # Color channel model 
│   ├── asl_model3.pth              # Original landmark model 
│   └── asl_model4.pth              # Landmark model with more data (current)
└── train_data/
    ├── sign_mnist_train.csv        # Sign MNIST training data
    ├── sign_mnist_valid.csv        # Sign MNIST validation data
    ├── asl_landmarks_train_1.csv   # landmark training data, 100 samples per letter
    ├── asl_landmarks_valid_1.csv   # landmark validation data, 100 samples per letter
    ├── asl_landmarks_train_2.csv   # landmark training data, 300 samples per letter
    └── asl_landmarks_valid_2.csv   # landmark validation data, 300 samples per letter

Setup

1. Clone or download the project

git clone https://github.com/samnbe/ASL_live_detection.git
cd ASL_live_detection

2. Install dependencies

pip install opencv-python mediapipe --user
pip install torch torchvision --user
pip install pandas matplotlib --user

3. Download the MediaPipe hand landmark model

Download hand_landmarker.task and place it in the project root folder:

https://storage.googleapis.com/mediapipe-models/hand_landmarker/hand_landmarker/float16/latest/hand_landmarker.task

4. Launch Jupyter

python -m jupyter lab

Usage

Collecting landmark data (optional)

Open collect_data.ipynb and run all cells. Press a letter key to start collecting samples for that letter, and press Space to stop. Repeat for all 24 letters. Collected data is saved to train_data/.
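The saved format can be sketched as one CSV row per sample: the letter label followed by the 63 flattened landmark coordinates. `sample_row` below is a hypothetical helper (the notebook's actual code may differ):

```python
def sample_row(letter, landmarks):
    """Build one CSV row: letter label followed by 63 flattened coordinates.

    `landmarks` is a sequence of 21 (x, y, z) tuples from MediaPipe.
    """
    row = [letter]
    for x, y, z in landmarks:
        row.extend([x, y, z])
    return row  # length 64: 1 label + 21 * 3 coordinates
```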

Running the webcam prediction

Open main.ipynb and run the cells in order:

Cell 1: Imports and webcam initialization
Cell 2: Model loading: loads the trained ASL CNN from asl_model2.pth
Cell 3: landmarker_and_result class, which loads the MediaPipe model
Cell 4: draw_landmarks_on_image drawing function
Cell 5: get_hand_crop function, which crops and preprocesses the hand region for prediction
Cell 6: Main webcam loop: detection, cropping, prediction, and display
Cell 7: Cleanup: releases the webcam and closes windows

Tip: If the loop crashes or you interrupt it, manually run Cell 7 to release the webcam.
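The cropping step in Cell 5 amounts to taking a padded bounding box around the detected landmarks. A minimal sketch, where `hand_bbox` is a hypothetical helper (the notebook's get_hand_crop additionally resizes the crop to 28x28 for the CNN):

```python
def hand_bbox(landmarks, width, height, pad=20):
    """Pixel bounding box around normalized (0-1) landmarks, clamped to the frame.

    The returned (x1, y1, x2, y2) box can be used as frame[y1:y2, x1:x2]
    before resizing the crop to the model's 28x28 input.
    """
    xs = [lm.x * width for lm in landmarks]
    ys = [lm.y * height for lm in landmarks]
    x1 = max(int(min(xs)) - pad, 0)
    y1 = max(int(min(ys)) - pad, 0)
    x2 = min(int(max(xs)) + pad, width)
    y2 = min(int(max(ys)) + pad, height)
    return x1, y1, x2, y2
```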

To quit the webcam feed:

  • Press q with the OpenCV window in focus, or
  • Click the X button on the webcam window

How Models 1 and 2 Work

  1. Each video frame is captured from the webcam and mirrored
  2. The frame is converted from BGR (OpenCV format) to RGB (MediaPipe format)
  3. The frame is passed asynchronously to the MediaPipe hand landmarker
  4. When landmarks are detected, 21 key points are mapped onto the hand
  5. Color-coded connections are drawn between landmarks per finger
  6. The hand region is cropped using the landmark bounding box and resized to 28x28 — the color (RGB) image is passed directly to the model without grayscale conversion
  7. The cropped image is passed through a CNN to predict the ASL letter
  8. The predicted letter and confidence score are overlaid on the frame
  9. The annotated frame is converted back to BGR and displayed
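The prediction and overlay steps (7-8) can be sketched in plain Python. Because Sign MNIST excludes J and Z (both require motion), the 24 output classes skip those letters; `predict_letter` is a hypothetical stand-in for the notebook's actual PyTorch code:

```python
import math
import string

# Sign MNIST omits J and Z (they require motion), leaving 24 static letters.
LETTERS = [c for c in string.ascii_uppercase if c not in ("J", "Z")]

def predict_letter(logits):
    """Softmax the CNN's 24 output logits and return (letter, confidence)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]  # shift for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LETTERS[best], probs[best]
```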

How Model 3 Works

  1. Each video frame is captured from the webcam and mirrored
  2. The frame is converted from BGR (OpenCV format) to RGB (MediaPipe format)
  3. The frame is passed asynchronously to the MediaPipe hand landmarker
  4. When landmarks are detected, 21 key points are mapped onto the hand
  5. Color-coded connections are drawn between landmarks per finger
  6. The 21 landmark coordinates (x, y, z) are flattened into a vector of 63 values
  7. The 63 values are passed through a fully connected neural network to predict the ASL letter
  8. The predicted letter and confidence score are overlaid on the frame
  9. The annotated frame is converted back to BGR and displayed
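Step 6 for model 3 reduces to flattening the landmarks into the 63-value input vector. A minimal sketch, assuming MediaPipe landmark objects with .x, .y, and .z attributes:

```python
def flatten_landmarks(landmarks):
    """Flatten 21 MediaPipe landmarks into the 63-value model input vector."""
    return [coord for lm in landmarks for coord in (lm.x, lm.y, lm.z)]
```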

Landmark Map

MediaPipe tracks 21 landmarks per hand:

Wrist:         0
Thumb:         1-4
Index finger:  5-8
Middle finger: 9-12
Ring finger:   13-16
Pinky:         17-20
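The same index ranges, expressed as a lookup table (useful for the per-finger coloring; the names here are illustrative, not MediaPipe constants):

```python
# MediaPipe hand landmark indices, grouped by finger.
HAND_LANDMARKS = {
    "wrist":  [0],
    "thumb":  list(range(1, 5)),
    "index":  list(range(5, 9)),
    "middle": list(range(9, 13)),
    "ring":   list(range(13, 17)),
    "pinky":  list(range(17, 21)),
}
```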

Roadmap

  • Real-time webcam hand landmark detection
  • Color-coded landmark skeleton per finger
  • ASL letter prediction with confidence score
  • Retrain model on color input (asl_model2)
  • Custom landmark data collection tool
  • Train landmark-based model using MediaPipe coordinates (asl_model3)
  • Improve prediction stability across frames
  • Collect more landmark data to improve accuracy
  • Word/phrase prediction from letter sequences

Known Issues

  • Prediction instability — the predicted letter can change rapidly even when the hand is held still, likely due to minor frame-to-frame variation in the cropped hand region
  • Limited letter accuracy — asl_model2 correctly predicts approximately 8 out of 24 letters reliably across most test conditions. Note that J and Z are excluded from the dataset as they require motion
  • Training vs webcam data mismatch — Sign MNIST images are tightly cropped, plain background, and taken under controlled conditions which are very different from a real webcam environment regardless of color
  • User ASL accuracy — predictions may also be affected by inexperience with ASL hand positions
  • On some systems, closing the OpenCV window with the X button may behave inconsistently — use q as a reliable fallback
  • The first few frames may not show landmarks due to the async nature of MediaPipe's live stream mode
  • mediapipe.solutions is deprecated in newer versions of MediaPipe (0.10+) — this project uses the newer mediapipe.tasks API
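One possible approach to the prediction-instability issue (and the "improve prediction stability across frames" roadmap item) is to smooth predictions with a rolling majority vote. This is only a sketch, not part of the current notebooks:

```python
from collections import Counter, deque

class PredictionSmoother:
    """Report the most common predicted letter over the last few frames."""

    def __init__(self, window=10):
        self.history = deque(maxlen=window)

    def update(self, letter):
        self.history.append(letter)
        # Majority vote over the window damps single-frame flicker.
        return Counter(self.history).most_common(1)[0][0]
```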
