A real-time American Sign Language (ASL) hand landmark detection and letter prediction system built with Python, OpenCV, MediaPipe, and PyTorch. This project uses your webcam to detect hand landmarks and predict ASL letters using a CNN trained on the Sign MNIST dataset.
- Real-time webcam feed with hand landmark overlay
- Detects up to 2 hands simultaneously
- Draws 21 landmarks and color-coded skeletal connections per finger
- Real-time ASL letter prediction with a confidence score overlay for one hand
- Hand region cropping and preprocessing for model input
- Custom landmark data collection tool
- Mirrored display for natural interaction
- Graceful exit via the `q` key or by closing the window
- Python 3.8+
- JupyterLab or Jupyter Notebook
- Webcam
```
opencv-python
mediapipe
torch
torchvision
pandas
matplotlib
```
Install them with:

```bash
pip install opencv-python mediapipe --user
pip install torch torchvision --user
pip install pandas matplotlib --user
```

```
ASL_live_detection/
├── README.md
├── hand_landmarker.task          # MediaPipe pre-trained hand landmark model (download separately)
├── utils.py                      # Model class definitions (MyConvBlock, MyLinearBlock, landmarker_and_result, draw_landmarks_on_image)
├── main.ipynb                    # Main webcam notebook
├── collect_data.ipynb            # Landmark data collection notebook
├── landmark_aug.ipynb            # Model training notebook for 1D landmark data
├── pixel_aug.ipynb               # Model training notebook for 2D pixel data from the NVIDIA workshop
├── asl_predictions.ipynb         # Model evaluation notebook for 2D pixel data from the NVIDIA workshop
├── demo_images/                  # Demo screenshots
├── models/
│   ├── README.md                 # Model descriptions and results
│   ├── asl_model1.pth            # Original grayscale model
│   ├── asl_model2.pth            # Color channel model
│   ├── asl_model3.pth            # Original landmark model
│   └── asl_model4.pth            # Landmark model with more data (current)
└── train_data/
    ├── sign_mnist_train.csv      # Sign MNIST training data
    ├── sign_mnist_valid.csv      # Sign MNIST validation data
    ├── asl_landmarks_train_1.csv # Landmark training data, 100 samples per letter
    ├── asl_landmarks_valid_1.csv # Landmark validation data, 100 samples per letter
    ├── asl_landmarks_train_2.csv # Landmark training data, 300 samples per letter
    └── asl_landmarks_valid_2.csv # Landmark validation data, 300 samples per letter
```
```bash
git clone https://github.com/samnbe/ASL_live_detection.git
cd ASL_live_detection
pip install opencv-python mediapipe --user
pip install torch torchvision --user
pip install pandas matplotlib --user
```

Download `hand_landmarker.task` and place it in the project root folder:

https://storage.googleapis.com/mediapipe-models/hand_landmarker/hand_landmarker/float16/latest/hand_landmarker.task
```bash
python -m jupyter lab
```

Open `collect_data.ipynb` and run all cells. Press a letter key to start collecting samples for that letter, and press Space to stop. Repeat for all 24 letters. Collected data is saved to `train_data/`.
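Each collected sample is one CSV row: a label followed by the flattened landmark coordinates. A minimal sketch of that flattening step, assuming a label-first row layout (the `landmarks_to_row` helper is illustrative, not the notebook's exact code):

```python
def landmarks_to_row(label, landmarks):
    """Flatten 21 (x, y, z) landmark tuples into a CSV-ready row.

    `label` is the letter index; `landmarks` is a list of 21 (x, y, z)
    tuples of normalized coordinates as returned by MediaPipe.
    """
    row = [label]
    for x, y, z in landmarks:
        row.extend([x, y, z])
    return row  # 1 label + 63 coordinate values

# Example: 21 dummy landmarks all at the origin
row = landmarks_to_row(0, [(0.0, 0.0, 0.0)] * 21)
```

Appending such rows with pandas and saving to `train_data/` yields the `asl_landmarks_*.csv` layout described above.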
Open main.ipynb and run the cells in order:
| Cell | Description |
|---|---|
| Cell 1 | Imports and opens the webcam |
| Cell 2 | Model loading — loads the trained ASL CNN from `asl_model2.pth` |
| Cell 3 | `landmarker_and_result` class — loads the MediaPipe model |
| Cell 4 | `draw_landmarks_on_image` drawing function |
| Cell 5 | `get_hand_crop` function — crops and preprocesses the hand region for prediction |
| Cell 6 | Main webcam loop — runs detection, cropping, prediction and displays the feed |
| Cell 7 | Cleanup — releases webcam and closes windows |
Tip: If the loop crashes or you interrupt it, manually run Cell 7 to release the webcam.
To quit the webcam feed:
- Press `q` with the OpenCV window in focus, or
- Click the `X` button on the webcam window
- Each video frame is captured from the webcam and mirrored
- The frame is converted from BGR (OpenCV format) to RGB (MediaPipe format)
- The frame is passed asynchronously to the MediaPipe hand landmarker
- When landmarks are detected, 21 key points are mapped onto the hand
- Color-coded connections are drawn between landmarks per finger
- The hand region is cropped using the landmark bounding box and resized to 28x28 — the color (RGB) image is passed directly to the model without grayscale conversion
- The cropped image is passed through a CNN to predict the ASL letter
- The predicted letter and confidence score are overlaid on the frame
- The annotated frame is converted back to BGR and displayed
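The cropping step above boils down to simple coordinate math: take the bounding box of the landmarks, pad it, and clamp it to the frame. A minimal sketch, not the notebook's actual `get_hand_crop`; the padding fraction and clamping behavior are assumptions:

```python
def landmark_bbox(landmarks, frame_w, frame_h, pad=0.2):
    """Compute a padded pixel bounding box from normalized landmarks.

    `landmarks` is a list of (x, y) pairs in [0, 1]; the box is expanded
    by `pad` of its size on each side and clamped to the frame.
    """
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    x0, x1 = min(xs) * frame_w, max(xs) * frame_w
    y0, y1 = min(ys) * frame_h, max(ys) * frame_h
    px, py = (x1 - x0) * pad, (y1 - y0) * pad
    left = max(0, int(x0 - px))
    top = max(0, int(y0 - py))
    right = min(frame_w, int(x1 + px))
    bottom = min(frame_h, int(y1 + py))
    return left, top, right, bottom

# The crop would then be frame[top:bottom, left:right],
# resized to 28x28 before being fed to the CNN.
box = landmark_bbox([(0.3, 0.4), (0.6, 0.8)], 640, 480)
```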
- Each video frame is captured from the webcam and mirrored
- The frame is converted from BGR (OpenCV format) to RGB (MediaPipe format)
- The frame is passed asynchronously to the MediaPipe hand landmarker
- When landmarks are detected, 21 key points are mapped onto the hand
- Color-coded connections are drawn between landmarks per finger
- The 21 landmark coordinates (x, y, z) are flattened into a vector of 63 values
- The 63 values are passed through a fully connected neural network to predict the ASL letter
- The predicted letter and confidence score are overlaid on the frame
- The annotated frame is converted back to BGR and displayed
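Turning the network's output index back into a letter has one subtlety: J and Z are absent from the label set because they require motion. A hedged sketch, assuming the model emits one class per static letter in alphabetical order (the exact label layout in this project may differ):

```python
# Static ASL alphabet: A-Y with J excluded (Z also absent), i.e. 24 classes.
STATIC_LETTERS = "ABCDEFGHIKLMNOPQRSTUVWXY"

def index_to_letter(class_index):
    """Map a model output index (0-23) to its ASL letter."""
    return STATIC_LETTERS[class_index]
```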
MediaPipe tracks 21 landmarks per hand:
- Wrist: 0
- Thumb: 1-4
- Index finger: 5-8
- Middle finger: 9-12
- Ring finger: 13-16
- Pinky: 17-20
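The color-coded skeleton follows directly from this numbering: each finger is a chain of consecutive indices. A simplified sketch of generating per-finger connection pairs (the real MediaPipe hand topology also includes palm connections between finger bases; `FINGER_RANGES` is an illustrative name):

```python
# Index ranges per finger, matching the landmark numbering above.
FINGER_RANGES = {
    "thumb": (1, 4),
    "index": (5, 8),
    "middle": (9, 12),
    "ring": (13, 16),
    "pinky": (17, 20),
}

def finger_connections(name):
    """Return (start, end) landmark index pairs for one finger's chain,
    anchored at the wrist (index 0)."""
    lo, hi = FINGER_RANGES[name]
    pairs = [(0, lo)]  # wrist to the finger's base joint
    pairs += [(i, i + 1) for i in range(lo, hi)]
    return pairs
```

Drawing each finger's pairs in its own color gives the per-finger skeleton described in the features list.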
- Real-time webcam hand landmark detection
- Color-coded landmark skeleton per finger
- ASL letter prediction with confidence score
- Retrain model on color input (asl_model2)
- Custom landmark data collection tool
- Train landmark-based model using MediaPipe coordinates (asl_model3)
- Improve prediction stability across frames
- Collect more landmark data to improve accuracy
- Word/phrase prediction from letter sequences
- Prediction instability — the predicted letter can change rapidly even when the hand is held still, likely due to minor frame-to-frame variation in the cropped hand region
- Limited letter accuracy — asl_model2 correctly predicts approximately 8 out of 24 letters reliably across most test conditions. Note that J and Z are excluded from the dataset as they require motion
- Training vs webcam data mismatch — Sign MNIST images are tightly cropped, plain background, and taken under controlled conditions which are very different from a real webcam environment regardless of color
- User ASL accuracy — predictions may also be affected by inexperience with ASL hand positions
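A common mitigation for the frame-to-frame instability described above is to smooth predictions over a short window and display only the majority letter. This is a sketch of that idea under assumed window sizes; it is not something the notebooks currently implement:

```python
from collections import Counter, deque

class PredictionSmoother:
    """Majority-vote smoother over the last `window` per-frame predictions."""

    def __init__(self, window=10):
        self.history = deque(maxlen=window)

    def update(self, letter):
        """Record this frame's raw prediction; return the stable letter,
        or None if no letter yet dominates the window."""
        self.history.append(letter)
        top, count = Counter(self.history).most_common(1)[0]
        if count >= len(self.history) // 2 + 1:  # strict majority
            return top
        return None

smoother = PredictionSmoother(window=3)
```

Calling `smoother.update(...)` once per frame and overlaying only its non-`None` result would trade a few frames of latency for a steadier display.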
- On some systems, closing the OpenCV window with the `X` button may behave inconsistently; use `q` as a reliable fallback
- The first few frames may not show landmarks due to the async nature of MediaPipe's live stream mode
- `mediapipe.solutions` is deprecated in newer versions of MediaPipe (0.10+); this project uses the newer `mediapipe.tasks` API
- MediaPipe Hand Landmarker Documentation
- MediaPipe Python Tasks API
- OpenCV Documentation
- Sign MNIST Dataset
- NVIDIA Deep Learning Institute — Fundamentals of Deep Learning
- Finger Counting in Real-Time Video with OpenCV and MediaPipe — Medium
- Learn the ASL Alphabet Fast | American Sign Language ABCs
- American Sign Language Alphabet for Adults | ASL ABCs for Grownups
