Real-time American Sign Language (ASL) recognition system using PyTorch and MediaPipe. Recognizes 25 common ASL gestures with 76.05% accuracy, optimized for RTX4070 GPUs. Features live webcam recognition, hybrid TCN+LSTM+Transformer architecture, and comprehensive training pipeline.

🤟 Google ASL Recognition

A comprehensive American Sign Language (ASL) recognition system based on the winning solution from Google's ASL Signs competition, adapted for PyTorch and optimized for RTX4070 GPUs.

🎯 Project Overview

This project implements a state-of-the-art ASL recognition model that can:

  • Train on the Google ASL Signs dataset (25 most common gestures)
  • Capture real-time gestures using MediaPipe
  • Recognize ASL signs from video input with 76.05% accuracy
  • Compare captured gestures with dataset samples

The model architecture combines TCN, LSTM, and Transformers with adaptive regularization, achieving competitive accuracy while being optimized for modern GPUs.

📊 Dataset Source

This project uses a subset of the Google ASL Signs dataset released for the Kaggle competition of the same name.

🎯 Recognized Gestures (25 ASL Signs)

Greetings & Courtesy: hello, please, thankyou, bye
Family: mom, dad, boy, girl, man, child
Actions: drink, sleep, go
Emotions: happy, sad, hungry, thirsty, sick, bad
Colors: red, blue, green, yellow, black, white

📈 Project Evolution

Initial Vision vs. Reality

The project started with an ambitious 7-day roadmap (see PREPARE.md) targeting 80-85% accuracy using Vision Transformers and CNN ensembles. However, reality led to a more focused and practical approach:

Original Plan:

  • Vision Transformer + CNN ensemble
  • 250 ASL gestures
  • 80-85% target accuracy
  • 7-day intensive development

Actual Implementation:

  • Hybrid TCN + LSTM + Transformer architecture
  • 25 most common ASL gestures
  • 76.05% achieved accuracy
  • Focused, production-ready solution

This evolution resulted in a more practical, maintainable system optimized for real-world deployment rather than competition-level performance.

๐Ÿ—๏ธ Architecture

Model Components

  • TCN Blocks: 3 layers with dilations (1,2,4) for local temporal patterns
  • Bidirectional LSTM: 2 layers for long-term dependencies
  • Temporal Attention: 8 heads for focusing on important frames
  • Conv1D + Transformer: Hybrid processing for final classification
  • Adaptive Regularization: Dropout, LateDropout, AWP for overfitting prevention

Model Structure

Input (seq_len, landmarks, 3)
    ↓
Preprocessing (normalization, motion features)
    ↓
Stem (Linear + BatchNorm + AdaptiveDropout)
    ↓
TCN Blocks (3 layers, dilations 1,2,4)
    ↓
Bidirectional LSTM (2 layers)
    ↓
Temporal Attention (8 heads)
    ↓
Conv1D Block × 3 + Transformer Block
    ↓
Multi-scale Pooling (avg + max + attention)
    ↓
Classifier (25 classes)
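
For orientation, here is a minimal PyTorch sketch of this pipeline. The layer widths, residual wiring, and pooling details are illustrative assumptions, and the Conv1D + Transformer stage is omitted for brevity; the real model is defined in manual/step3_prepare_train.py.

import torch
import torch.nn as nn

class ASLHybridSketch(nn.Module):
    def __init__(self, num_landmarks=543, hidden=192, num_classes=25):
        super().__init__()
        self.stem = nn.Linear(num_landmarks * 3, hidden)  # flatten (x, y, z) per frame
        self.norm = nn.BatchNorm1d(hidden)
        self.drop = nn.Dropout(0.2)
        # TCN: dilated temporal convolutions with dilations 1, 2, 4
        self.tcn = nn.ModuleList(
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=d, dilation=d)
            for d in (1, 2, 4)
        )
        self.lstm = nn.LSTM(hidden, hidden // 2, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.head = nn.Linear(hidden * 2, num_classes)  # avg + max pooled features

    def forward(self, x):                       # x: (batch, seq_len, landmarks, 3)
        b, t = x.shape[:2]
        x = self.stem(x.reshape(b, t, -1))                      # (b, t, hidden)
        x = self.drop(self.norm(x.transpose(1, 2)).transpose(1, 2))
        h = x.transpose(1, 2)                                   # (b, hidden, t) for Conv1d
        for conv in self.tcn:
            h = torch.relu(conv(h)) + h                         # residual dilated block
        x, _ = self.lstm(h.transpose(1, 2))                     # long-range dependencies
        x, _ = self.attn(x, x, x)                               # 8-head temporal attention
        pooled = torch.cat([x.mean(dim=1), x.amax(dim=1)], -1)  # multi-scale pooling
        return self.head(pooled)                                # (b, num_classes)

model = ASLHybridSketch()
print(model(torch.randn(2, 32, 543, 3)).shape)  # torch.Size([2, 25])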

📊 Results & Visualizations

Training Progress

Training History

The training history shows stable convergence with the model achieving 76.05% validation accuracy. The loss curves demonstrate effective regularization preventing overfitting while maintaining learning capacity.

Live Recognition Demo

Live Recognition

Real-time ASL recognition in action, demonstrating the system's ability to capture and classify gestures with MediaPipe landmarks and visual feedback.

🚀 Quick Start

Prerequisites

  • Python 3.10+
  • CUDA-compatible GPU (RTX4070 recommended)
  • Webcam for gesture capture

Installation

  1. Clone the repository
git clone <repository-url>
cd google_asl_recognition
  2. Create and activate virtual environment
# Create virtual environment
python -m venv .venv

# Activate on Windows
.venv\Scripts\activate

# Activate on Linux/Mac
source .venv/bin/activate
  3. Install PyTorch with CUDA 12.1
# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
  4. Install all dependencies
# Install all required packages
pip install -r requirements.txt
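
Before training, it is worth confirming that the CUDA build is active:

# Verify that PyTorch was installed with CUDA support and can see the GPU
import torch

print(torch.__version__)                  # expect a version ending in "+cu121"
print(torch.cuda.is_available())          # True when a usable GPU is present
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 4070"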

๐Ÿ“ Project Structure

google_asl_recognition/
├── data/                          # Dataset storage
│   └── google_asl_signs/          # Original dataset
├── manual/                        # Main project directory
│   ├── dataset25/                 # Processed dataset (25 signs)
│   ├── dataset25_split/           # Train/test split
│   ├── models/                    # Trained models and manifests
│   ├── utils/                     # Utility scripts
│   ├── step1_extract_words.py     # Data extraction
│   ├── step1.2_split_train_test.py # Dataset splitting
│   ├── step2_prepare_dataset.py   # Dataset preparation
│   ├── step3_prepare_train.py     # Training script
│   ├── step4_capture_mediapipe.py # Gesture capture
│   └── step5_live_recognition.py  # Live recognition
├── docs/                          # Documentation
│   ├── README.md                  # Documentation overview
│   ├── training.md                # Training guide
│   ├── data-preparation.md        # Data preparation guide
│   ├── live-recognition.md        # Live recognition guide
│   ├── manifest-system.md         # Manifest system guide
│   └── pictures/                  # Screenshots and images
├── demo/                          # Demo scripts
├── PREPARE.md                     # Original project roadmap
└── requirements.txt               # Dependencies

🎮 Usage

1. Data Preparation

cd manual

# Extract and prepare the dataset
python step1_extract_words.py
python step1.2_split_train_test.py
python step2_prepare_dataset.py

2. Model Training

# Start training
python step3_prepare_train.py

Training Configuration for RTX4070:

  • Epochs: 300 (full training) or 50 (testing)
  • Batch Size: 32 (optimized for RTX4070)
  • Expected Time: ~1.75 hours for full training
  • Expected Accuracy: ~76.05% validation score

3. Live Recognition

# Test camera
python test_camera.py

# Run live ASL recognition
python step5_live_recognition.py

🔧 RTX4070 Optimizations

The project includes specific optimizations for RTX4070, with a setup sketch after this list:

  • TF32: Enabled for Ampere architecture acceleration
  • Mixed Precision: Automatic FP16 usage
  • cuDNN Benchmark: Optimized convolution algorithms
  • Memory Optimization: Automatic batch size tuning
  • Adaptive Dropout: Gradual regularization activation
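
For reference, this is how those switches are typically enabled in PyTorch. The model, optimizer, and dummy loss below are stand-ins for illustration; the exact settings used by step3_prepare_train.py may differ.

import torch
import torch.nn as nn

torch.backends.cuda.matmul.allow_tf32 = True   # TF32 matmuls on Ampere+ GPUs
torch.backends.cudnn.allow_tf32 = True         # TF32 inside cuDNN convolutions
torch.backends.cudnn.benchmark = True          # autotune conv algorithms for fixed shapes

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(1629, 25).to(device)         # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters())
scaler = torch.cuda.amp.GradScaler(enabled=device == "cuda")

x = torch.randn(32, 1629, device=device)
with torch.autocast(device_type="cuda", dtype=torch.float16, enabled=device == "cuda"):
    loss = model(x).square().mean()            # forward pass runs in FP16 where safe
scaler.scale(loss).backward()                  # loss scaling avoids FP16 underflow
scaler.step(optimizer)
scaler.update()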

📊 Latest Results

Model Performance

  • Model: asl_model_v20250720_080209
  • Architecture: Enhanced TCN + LSTM + Transformer v2
  • Validation Accuracy: 76.05%
  • Parameters: 1.97M
  • Training Time: ~1.75 hours
  • Classes: 25 ASL gestures

Training Configuration

  • Loss: CrossEntropyLoss with label smoothing (0.05)
  • Optimizer: AdamW with weight decay (0.005)
  • Scheduler: CosineAnnealingWarmRestarts
  • Regularization: Adaptive Dropout, TCN Dropout, Attention Dropout
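
The same configuration expressed in PyTorch. The learning rate, the restart period T_0, and the stand-in model are illustrative assumptions:

import torch
import torch.nn as nn

model = nn.Linear(1629, 25)                                # stand-in for the real network
criterion = nn.CrossEntropyLoss(label_smoothing=0.05)      # label smoothing 0.05
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, # lr is an assumption
                              weight_decay=0.005)          # weight decay 0.005
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)

for epoch in range(300):    # 300 epochs for a full run, 50 for a quick test
    ...                     # one training epoch goes here
    scheduler.step()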

🎥 Live Recognition Features

MediaPipe Integration

  • Face Landmarks: 468 points
  • Pose Landmarks: 33 points
  • Hand Landmarks: 21 points per hand
  • Total: 543 landmarks per frame
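
Below is a minimal sketch of assembling a 543-point frame vector with MediaPipe's Holistic solution. The part ordering and the NaN fill for undetected parts are assumptions made for illustration; the actual capture logic lives in step4_capture_mediapipe.py.

import cv2
import mediapipe as mp
import numpy as np

holistic = mp.solutions.holistic.Holistic()

def frame_to_landmarks(frame_bgr):
    """Return a (543, 3) array of (x, y, z) coordinates for one video frame."""
    results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    parts = [
        (results.face_landmarks, 468),        # face mesh
        (results.pose_landmarks, 33),         # body pose
        (results.left_hand_landmarks, 21),    # left hand
        (results.right_hand_landmarks, 21),   # right hand
    ]
    rows = []
    for landmarks, count in parts:
        if landmarks is None:                 # part not detected in this frame
            rows.append(np.full((count, 3), np.nan))
        else:
            rows.append(np.array([(p.x, p.y, p.z) for p in landmarks.landmark]))
    return np.concatenate(rows)               # shape: (543, 3)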

Recognition Controls

  • 'q': Stop recognition
  • 's': Save current screenshot
  • 'r': Reset frame buffer
  • 'h': Show/hide help
  • Real-time preview: Visual feedback during recognition
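
These hotkeys map onto a standard OpenCV event loop. A simplified sketch of the control handling in step5_live_recognition.py follows; the real script also runs the model on the buffered frames:

import cv2

cap = cv2.VideoCapture(0)          # default webcam
frame_buffer, show_help = [], True
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    frame_buffer.append(frame)     # frames accumulated for recognition
    cv2.imshow("ASL Recognition", frame)
    key = cv2.waitKey(1) & 0xFF
    if key == ord("q"):            # stop recognition
        break
    elif key == ord("s"):          # save current screenshot
        cv2.imwrite("screenshot.png", frame)
    elif key == ord("r"):          # reset frame buffer
        frame_buffer.clear()
    elif key == ord("h"):          # show/hide help
        show_help = not show_help
cap.release()
cv2.destroyAllWindows()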

📈 Model Versioning

The project includes a comprehensive manifest system for tracking model versions:

  • Date-based naming: asl_model_vYYYYMMDD_HHMMSS.pth
  • Training manifests: Detailed training history and parameters
  • Final manifests: Best model selection with performance metrics
  • Complete tracking: All experiments and improvements documented
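
Producing a name in this scheme takes one strftime call; the manifest fields below are illustrative assumptions, not the project's exact schema:

from datetime import datetime
import json

stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
model_path = f"asl_model_v{stamp}.pth"   # e.g. asl_model_v20250720_080209.pth
manifest = {"model": model_path, "val_accuracy": 0.7605, "classes": 25}
with open(f"asl_model_v{stamp}_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)     # training details tracked alongside weights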

📚 Documentation

Comprehensive documentation is available in the docs/ directory, including guides for training (training.md), data preparation (data-preparation.md), live recognition (live-recognition.md), and the manifest system (manifest-system.md).

๐Ÿค Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

📄 License

This project is based on the winning solution from Google's ASL Signs competition. Please refer to the original competition terms and conditions.

๐Ÿ™ Acknowledgments

  • Kaggle Competition: organizers and participants of the ASL Signs (American Sign Language recognition) competition
  • Dataset: Google ASL Signs dataset provided by the competition
  • MediaPipe team for hand tracking and landmark extraction
  • PyTorch community for the deep learning framework
  • Original competition winners for the model architecture inspiration

📞 Support

For issues and questions:

  1. Check the documentation in the docs/ directory
  2. Review existing issues
  3. Create a new issue with detailed information

Note: This project requires a CUDA-compatible GPU for optimal performance. CPU-only mode is available but significantly slower.

Ready to recognize ASL gestures! 🤟
