A real-time American Sign Language (ASL) hand landmark detection and letter prediction system built with Python, OpenCV, MediaPipe, and PyTorch. This project uses your webcam to detect hand landmarks and predict ASL letters using a CNN trained on the Sign MNIST dataset.
- Real-time webcam feed with hand landmark overlay
- Detects up to 2 hands simultaneously
- Draws 21 landmarks and color-coded skeletal connections per finger
- Real-time ASL letter prediction with a confidence score overlay for one hand
- Hand region cropping and preprocessing for model input
- Custom landmark data collection tool
- Mirrored display for natural interaction
- Graceful exit via the `q` key or by closing the window
- Python 3.8+
- JupyterLab or Jupyter Notebook
- Webcam
```
opencv-python
mediapipe
torch
torchvision
pandas
matplotlib
```
Install them with:

```bash
pip install opencv-python mediapipe --user
pip install torch torchvision --user
pip install pandas matplotlib --user
```

```
ASL_live_detection/
├── README.md
├── hand_landmarker.task          # MediaPipe pre-trained hand landmark model (download separately)
├── utils.py                      # Model class definitions (MyConvBlock, MyLinearBlock, landmarker_and_result, draw_landmarks_on_image)
├── main.ipynb                    # Main webcam notebook
├── collect_data.ipynb            # Landmark data collection notebook
├── landmark_aug.ipynb            # Model training notebook for 1D landmark data
├── pixel_aug.ipynb               # Model training notebook for 2D pixel data from the NVIDIA workshop
├── asl_predictions.ipynb         # Model evaluation notebook for 2D pixel data from the NVIDIA workshop
├── demo_images/                  # Demo screenshots
├── models/
│   ├── README.md                 # Model descriptions and results
│   ├── asl_model1.pth            # Original grayscale model
│   ├── asl_model2.pth            # Color channel model
│   ├── asl_model3.pth            # Original landmark model
│   └── asl_model4.pth            # Landmark model with more data (current)
└── train_data/
    ├── sign_mnist_train.csv      # Sign MNIST training data
    ├── sign_mnist_valid.csv      # Sign MNIST validation data
    ├── asl_landmarks_train_1.csv # Landmark training data, 100 samples per letter
    ├── asl_landmarks_valid_1.csv # Landmark validation data, 100 samples per letter
    ├── asl_landmarks_train_2.csv # Landmark training data, 300 samples per letter
    └── asl_landmarks_valid_2.csv # Landmark validation data, 300 samples per letter
```
```bash
git clone https://github.com/samnbe/ASL_live_detection.git
cd ASL_live_detection
pip install opencv-python mediapipe --user
pip install torch torchvision --user
pip install pandas matplotlib --user
```

Download `hand_landmarker.task` and place it in the project root folder:

https://storage.googleapis.com/mediapipe-models/hand_landmarker/hand_landmarker/float16/latest/hand_landmarker.task
```bash
python -m jupyter lab
```

Open `collect_data.ipynb` and run all cells. Press a letter key to start collecting samples for that letter, and press Space to stop. Repeat for all 24 letters. Collected data is saved to `train_data/`.
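Each collected sample is one CSV row: a label followed by the flattened landmark coordinates. A minimal sketch of that flattening step, assuming a label-first row layout (the `landmarks_to_row` helper is illustrative, not the notebook's exact code):

```python
def landmarks_to_row(label, landmarks):
    """Flatten 21 (x, y, z) landmark tuples into a CSV-ready row.

    `label` is the letter index; `landmarks` is a list of 21 (x, y, z)
    tuples of normalized coordinates as returned by MediaPipe.
    """
    row = [label]
    for x, y, z in landmarks:
        row.extend([x, y, z])
    return row  # 1 label + 63 coordinate values

# Example: 21 dummy landmarks all at the origin
row = landmarks_to_row(0, [(0.0, 0.0, 0.0)] * 21)
```

Appending such rows with pandas and saving to `train_data/` yields the `asl_landmarks_*.csv` layout described above.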
Open main.ipynb and run the cells in order:
| Cell | Description |
|---|---|
| Cell 1 | Imports and opens the webcam |
| Cell 2 | Model loading — loads the trained ASL CNN from `asl_model2.pth` |
| Cell 3 | `landmarker_and_result` class — loads the MediaPipe model |
| Cell 4 | `draw_landmarks_on_image` drawing function |
| Cell 5 | `get_hand_crop` function — crops and preprocesses the hand region for prediction |
| Cell 6 | Main webcam loop — runs detection, cropping, prediction and displays the feed |
| Cell 7 | Cleanup — releases webcam and closes windows |
Tip: If the loop crashes or you interrupt it, manually run Cell 7 to release the webcam.
To quit the webcam feed:
- Press `q` with the OpenCV window in focus, or
- Click the `X` button on the webcam window
- Each video frame is captured from the webcam and mirrored
- The frame is converted from BGR (OpenCV format) to RGB (MediaPipe format)
- The frame is passed asynchronously to the MediaPipe hand landmarker
- When landmarks are detected, 21 key points are mapped onto the hand
- Color-coded connections are drawn between landmarks per finger
- The hand region is cropped using the landmark bounding box and resized to 28x28 — the color (RGB) image is passed directly to the model without grayscale conversion
- The cropped image is passed through a CNN to predict the ASL letter
- The predicted letter and confidence score are overlaid on the frame
- The annotated frame is converted back to BGR and displayed
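The cropping step above boils down to simple coordinate math: take the bounding box of the landmarks, pad it, and clamp it to the frame. A minimal sketch, not the notebook's actual `get_hand_crop`; the padding fraction and clamping behavior are assumptions:

```python
def landmark_bbox(landmarks, frame_w, frame_h, pad=0.2):
    """Compute a padded pixel bounding box from normalized landmarks.

    `landmarks` is a list of (x, y) pairs in [0, 1]; the box is expanded
    by `pad` of its size on each side and clamped to the frame.
    """
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    x0, x1 = min(xs) * frame_w, max(xs) * frame_w
    y0, y1 = min(ys) * frame_h, max(ys) * frame_h
    px, py = (x1 - x0) * pad, (y1 - y0) * pad
    left = max(0, int(x0 - px))
    top = max(0, int(y0 - py))
    right = min(frame_w, int(x1 + px))
    bottom = min(frame_h, int(y1 + py))
    return left, top, right, bottom

# The crop would then be frame[top:bottom, left:right],
# resized to 28x28 before being fed to the CNN.
box = landmark_bbox([(0.3, 0.4), (0.6, 0.8)], 640, 480)
```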
- Each video frame is captured from the webcam and mirrored
- The frame is converted from BGR (OpenCV format) to RGB (MediaPipe format)
- The frame is passed asynchronously to the MediaPipe hand landmarker
- When landmarks are detected, 21 key points are mapped onto the hand
- Color-coded connections are drawn between landmarks per finger
- The 21 landmark coordinates (x, y, z) are flattened into a vector of 63 values
- The 63 values are passed through a fully connected neural network to predict the ASL letter
- The predicted letter and confidence score are overlaid on the frame
- The annotated frame is converted back to BGR and displayed
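Turning the network's output index back into a letter has one subtlety: J and Z are absent from the label set because they require motion. A hedged sketch, assuming the model emits one class per static letter in alphabetical order (the exact label layout in this project may differ):

```python
# Static ASL alphabet: A-Y with J excluded (Z also absent), i.e. 24 classes.
STATIC_LETTERS = "ABCDEFGHIKLMNOPQRSTUVWXY"

def index_to_letter(class_index):
    """Map a model output index (0-23) to its ASL letter."""
    return STATIC_LETTERS[class_index]
```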
MediaPipe tracks 21 landmarks per hand:
- Wrist: 0
- Thumb: 1-4
- Index finger: 5-8
- Middle finger: 9-12
- Ring finger: 13-16
- Pinky: 17-20
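The color-coded skeleton follows directly from this numbering: each finger is a chain of consecutive indices. A simplified sketch of generating per-finger connection pairs (the real MediaPipe hand topology also includes palm connections between finger bases; `FINGER_RANGES` is an illustrative name):

```python
# Index ranges per finger, matching the landmark numbering above.
FINGER_RANGES = {
    "thumb": (1, 4),
    "index": (5, 8),
    "middle": (9, 12),
    "ring": (13, 16),
    "pinky": (17, 20),
}

def finger_connections(name):
    """Return (start, end) landmark index pairs for one finger's chain,
    anchored at the wrist (index 0)."""
    lo, hi = FINGER_RANGES[name]
    pairs = [(0, lo)]  # wrist to the finger's base joint
    pairs += [(i, i + 1) for i in range(lo, hi)]
    return pairs
```

Drawing each finger's pairs in its own color gives the per-finger skeleton described in the features list.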
- Real-time webcam hand landmark detection
- Color-coded landmark skeleton per finger
- ASL letter prediction with confidence score
- Retrain model on color input (asl_model2)
- Custom landmark data collection tool
- Train landmark-based model using MediaPipe coordinates (asl_model3)
- Improve prediction stability across frames
- Collect more landmark data to improve accuracy
- Word/phrase prediction from letter sequences
- Prediction instability — the predicted letter can change rapidly even when the hand is held still, likely due to minor frame-to-frame variation in the cropped hand region
- Limited letter accuracy — asl_model2 correctly predicts approximately 8 out of 24 letters reliably across most test conditions. Note that J and Z are excluded from the dataset as they require motion
- Training vs webcam data mismatch — Sign MNIST images are tightly cropped, plain background, and taken under controlled conditions which are very different from a real webcam environment regardless of color
- User ASL accuracy — predictions may also be affected by inexperience with ASL hand positions
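A common mitigation for the frame-to-frame instability described above is to smooth predictions over a short window and display only the majority letter. This is a sketch of that idea under assumed window sizes; it is not something the notebooks currently implement:

```python
from collections import Counter, deque

class PredictionSmoother:
    """Majority-vote smoother over the last `window` per-frame predictions."""

    def __init__(self, window=10):
        self.history = deque(maxlen=window)

    def update(self, letter):
        """Record this frame's raw prediction; return the stable letter,
        or None if no letter yet dominates the window."""
        self.history.append(letter)
        top, count = Counter(self.history).most_common(1)[0]
        if count >= len(self.history) // 2 + 1:  # strict majority
            return top
        return None

smoother = PredictionSmoother(window=3)
```

Calling `smoother.update(...)` once per frame and overlaying only its non-`None` result would trade a few frames of latency for a steadier display.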
- On some systems, closing the OpenCV window with the `X` button may behave inconsistently; use `q` as a reliable fallback
- The first few frames may not show landmarks due to the async nature of MediaPipe's live stream mode
- `mediapipe.solutions` is deprecated in newer versions of MediaPipe (0.10+); this project uses the newer `mediapipe.tasks` API
- MediaPipe Hand Landmarker Documentation
- MediaPipe Python Tasks API
- OpenCV Documentation
- Sign MNIST Dataset
- NVIDIA Deep Learning Institute — Fundamentals of Deep Learning
- Finger Counting in Real-Time Video with OpenCV and MediaPipe — Medium
- Learn the ASL Alphabet Fast | American Sign Language ABCs
- American Sign Language Alphabet for Adults | ASL ABCs for Grownups
