A comprehensive American Sign Language (ASL) recognition system based on the winning solution from Google's ASL Signs competition, adapted for PyTorch and optimized for RTX4070 GPUs.
This project implements a state-of-the-art ASL recognition model that can:
- Train on the Google ASL Signs dataset (25 most common gestures)
- Capture real-time gestures using MediaPipe
- Recognize ASL signs from video input with 76.05% validation accuracy
- Compare captured gestures with dataset samples
The model architecture combines TCN, LSTM, and Transformers with adaptive regularization, achieving competitive accuracy while being optimized for modern GPUs.
This project uses a subset of the Google ASL Signs dataset from the Kaggle competition:
- Competition: ASL Signs - American Sign Language Recognition
- Original Dataset: Contains 250 ASL signs with hand landmark data
- Project Dataset: Extracted 25 most common ASL signs for focused training
- Citation: Kaggle Competition Citation
- Format: Parquet files with MediaPipe landmarks (543 points per frame)
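Each sample can be loaded and reshaped into a (frames, 543, 3) array before further processing. A minimal sketch, assuming the standard Kaggle column layout (frame, landmark_index, x, y, z) with 543 rows per frame:

```python
import numpy as np
import pandas as pd

def load_landmarks(parquet_path):
    """Load one sign sample and reshape it to (frames, 543, 3).

    Assumes rows are ordered frame-major with exactly 543 landmarks per
    frame, as in the Kaggle ASL Signs parquet files.
    """
    df = pd.read_parquet(parquet_path)
    n_frames = df["frame"].nunique()
    # x/y/z are NaN wherever MediaPipe did not detect a landmark
    coords = df[["x", "y", "z"]].to_numpy(dtype=np.float32)
    return coords.reshape(n_frames, 543, 3)
```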
Greetings & Courtesy: hello, please, thankyou, bye
Family: mom, dad, boy, girl, man, child
Actions: drink, sleep, go
Emotions: happy, sad, hungry, thirsty, sick, bad
Colors: red, blue, green, yellow, black, white
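For reference, a sketch of the resulting class list and label mapping (the index order shown here is illustrative; the authoritative mapping is the one stored alongside the trained model):

```python
# The 25 sign classes used by this project; keep the order in sync with
# the mapping saved in the training manifest.
SIGNS = [
    "hello", "please", "thankyou", "bye",
    "mom", "dad", "boy", "girl", "man", "child",
    "drink", "sleep", "go",
    "happy", "sad", "hungry", "thirsty", "sick", "bad",
    "red", "blue", "green", "yellow", "black", "white",
]
SIGN_TO_IDX = {sign: i for i, sign in enumerate(SIGNS)}
```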
The project started with an ambitious 7-day roadmap (see PREPARE.md) targeting 80-85% accuracy using Vision Transformers and CNN ensembles. However, practical constraints led to a more focused approach:
Original Plan:
- Vision Transformer + CNN ensemble
- 250 ASL gestures
- 80-85% target accuracy
- 7-day intensive development
Actual Implementation:
- Hybrid TCN + LSTM + Transformer architecture
- 25 most common ASL gestures
- 76.05% achieved accuracy
- Focused, production-ready solution
This evolution resulted in a more practical, maintainable system optimized for real-world deployment rather than competition-level performance.
- TCN Blocks: 3 layers with dilations (1,2,4) for local temporal patterns
- Bidirectional LSTM: 2 layers for long-term dependencies
- Temporal Attention: 8 heads for focusing on important frames
- Conv1D + Transformer: Hybrid processing for final classification
- Adaptive Regularization: Dropout, LateDropout, AWP for overfitting prevention
Input (seq_len, landmarks, 3)
↓
Preprocessing (normalization, motion features)
↓
Stem (Linear + BatchNorm + AdaptiveDropout)
↓
TCN Blocks (3 layers, dilations 1,2,4)
↓
Bidirectional LSTM (2 layers)
↓
Temporal Attention (8 heads)
↓
Conv1D Block × 3 + Transformer Block
↓
Multi-scale Pooling (avg + max + attention)
↓
Classifier (25 classes)
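A minimal PyTorch sketch of this flow is shown below. Dimensions, residual connections, and the final pooling are simplified assumptions; the actual model in manual/step3_prepare_train.py additionally uses adaptive dropout, AWP, the Conv1D + Transformer head, and multi-scale pooling:

```python
import torch
import torch.nn as nn

class HybridASLSketch(nn.Module):
    """Simplified TCN -> BiLSTM -> attention -> classifier pipeline."""

    def __init__(self, in_dim=543 * 3, hidden=256, num_classes=25):
        super().__init__()
        self.stem_fc = nn.Linear(in_dim, hidden)
        self.stem_bn = nn.BatchNorm1d(hidden)
        # Dilated Conv1d layers capture local temporal patterns
        self.tcn = nn.ModuleList([
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=d, padding=d)
            for d in (1, 2, 4)
        ])
        # Bidirectional LSTM for long-range dependencies
        self.lstm = nn.LSTM(hidden, hidden // 2, num_layers=2,
                            batch_first=True, bidirectional=True)
        # Temporal attention over frames (8 heads)
        self.attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x):                      # x: (batch, seq_len, 543, 3), NaNs filled
        x = self.stem_fc(x.flatten(2))         # (batch, seq_len, hidden)
        x = self.stem_bn(x.transpose(1, 2))    # (batch, hidden, seq_len) for Conv1d
        for conv in self.tcn:
            x = torch.relu(conv(x)) + x        # residual dilated convolutions
        x = x.transpose(1, 2)                  # back to (batch, seq_len, hidden)
        x, _ = self.lstm(x)                    # long-range temporal context
        x, _ = self.attn(x, x, x)              # attend to the important frames
        return self.classifier(x.mean(dim=1))  # mean-pool over frames, classify
```

For example, `HybridASLSketch()(torch.zeros(2, 64, 543, 3))` returns logits of shape `(2, 25)`.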
The training history shows stable convergence with the model achieving 76.05% validation accuracy. The loss curves demonstrate effective regularization preventing overfitting while maintaining learning capacity.
Real-time ASL recognition in action, demonstrating the system's ability to capture and classify gestures with MediaPipe landmarks and visual feedback.
- Python 3.10+
- CUDA-compatible GPU (RTX4070 recommended)
- Webcam for gesture capture
- Clone the repository
git clone <repository-url>
cd google_asl_recognition
- Create and activate virtual environment
# Create virtual environment
python -m venv .venv
# Activate on Windows
.venv\Scripts\activate
# Activate on Linux/Mac
source .venv/bin/activate
- Install PyTorch with CUDA 12.1
# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
- Install all dependencies
# Install all required packages
pip install -r requirements.txt
google_asl_recognition/
├── data/                            # Dataset storage
│   └── google_asl_signs/            # Original dataset
├── manual/                          # Main project directory
│   ├── dataset25/                   # Processed dataset (25 signs)
│   ├── dataset25_split/             # Train/test split
│   ├── models/                      # Trained models and manifests
│   ├── utils/                       # Utility scripts
│   ├── step1_extract_words.py       # Data extraction
│   ├── step1.2_split_train_test.py  # Dataset splitting
│   ├── step2_prepare_dataset.py     # Dataset preparation
│   ├── step3_prepare_train.py       # Training script
│   ├── step4_capture_mediapipe.py   # Gesture capture
│   └── step5_live_recognition.py    # Live recognition
├── docs/                            # Documentation
│   ├── README.md                    # Documentation overview
│   ├── training.md                  # Training guide
│   ├── data-preparation.md          # Data preparation guide
│   ├── live-recognition.md          # Live recognition guide
│   ├── manifest-system.md           # Manifest system guide
│   └── pictures/                    # Screenshots and images
├── demo/                            # Demo scripts
├── PREPARE.md                       # Original project roadmap
└── requirements.txt                 # Dependencies
cd manual
# Extract and prepare the dataset
python step1_extract_words.py
python step1.2_split_train_test.py
python step2_prepare_dataset.py
# Start training
python step3_prepare_train.py
Training Configuration for RTX4070:
- Epochs: 300 (full training) or 50 (testing)
- Batch Size: 32 (optimized for RTX4070)
- Expected Time: ~1.75 hours for full training
- Expected Accuracy: ~76.05% validation score
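These settings can be summarized as a small configuration block (illustrative only; the values actually used live in step3_prepare_train.py):

```python
# Illustrative training configuration; reduce epochs for a quick smoke test.
CONFIG = {
    "epochs": 300,       # or 50 for a shorter test run
    "batch_size": 32,    # sized for the RTX4070's 12 GB of VRAM
    "num_classes": 25,
}
```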
# Test camera
python test_camera.py
# Run live ASL recognition
python step5_live_recognition.py
The project includes specific optimizations for RTX4070:
- TF32: Enabled for Ampere architecture acceleration
- Mixed Precision: Automatic FP16 usage
- cuDNN Benchmark: Optimized convolution algorithms
- Memory Optimization: Automatic batch size tuning
- Adaptive Dropout: Gradual regularization activation
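The first three of these map to standard PyTorch switches. A minimal sketch (not the project's exact setup) of enabling TF32, cuDNN autotuning, and mixed-precision training:

```python
import torch

# TF32 matmuls/convolutions on Ampere-class GPUs and cuDNN autotuning
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
torch.backends.cudnn.benchmark = True

scaler = torch.cuda.amp.GradScaler()  # loss scaling for FP16 training

def train_step(model, batch, labels, criterion, optimizer):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():    # automatic FP16/FP32 casting
        loss = criterion(model(batch), labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()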
- Model: asl_model_v20250720_080209
- Architecture: Enhanced TCN + LSTM + Transformer v2
- Validation Accuracy: 76.05%
- Parameters: 1.97M
- Training Time: ~1.75 hours
- Classes: 25 ASL gestures
- Loss: CrossEntropyLoss with label smoothing (0.05)
- Optimizer: AdamW with weight decay (0.005)
- Scheduler: CosineAnnealingWarmRestarts
- Regularization: Adaptive Dropout, TCN Dropout, Attention Dropout
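A sketch of that setup in PyTorch; the learning rate and restart period (lr, T_0) are assumptions, so check step3_prepare_train.py for the values actually used:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(543 * 3, 25)  # placeholder; substitute the ASL model here
criterion = nn.CrossEntropyLoss(label_smoothing=0.05)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.005)
scheduler = optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)
```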
- Face Landmarks: 468 points
- Pose Landmarks: 33 points
- Hand Landmarks: 21 points per hand
- Total: 543 landmarks per frame
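Concatenating the four MediaPipe Holistic outputs yields the 543-point frame representation. A sketch, assuming the Kaggle dataset's landmark ordering (face, left hand, pose, right hand); verify the order against step4_capture_mediapipe.py:

```python
import numpy as np

# `results` comes from mp.solutions.holistic.Holistic().process(rgb_frame).
def frame_to_landmarks(results):
    """Flatten one Holistic result into a (543, 3) array of x/y/z values."""
    def block(landmark_list, n_points):
        if landmark_list is None:                      # body part not detected
            return np.full((n_points, 3), np.nan, dtype=np.float32)
        return np.array([[p.x, p.y, p.z] for p in landmark_list.landmark],
                        dtype=np.float32)

    return np.concatenate([
        block(results.face_landmarks, 468),
        block(results.left_hand_landmarks, 21),
        block(results.pose_landmarks, 33),
        block(results.right_hand_landmarks, 21),
    ])
```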
- 'q': Stop recognition
- 's': Save current screenshot
- 'r': Reset frame buffer
- 'h': Show/hide help
- Real-time preview: Visual feedback during recognition
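A sketch of how such a key loop typically looks with OpenCV (buffer size and window name are placeholders; the real loop in step5_live_recognition.py also runs MediaPipe and the classifier):

```python
import cv2
from collections import deque

cap = cv2.VideoCapture(0)
frame_buffer = deque(maxlen=64)   # rolling window of recent frames
show_help = True

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    frame_buffer.append(frame)
    cv2.imshow("ASL Recognition", frame)
    key = cv2.waitKey(1) & 0xFF
    if key == ord("q"):           # stop recognition
        break
    elif key == ord("s"):         # save current screenshot
        cv2.imwrite("screenshot.png", frame)
    elif key == ord("r"):         # reset frame buffer
        frame_buffer.clear()
    elif key == ord("h"):         # show/hide help overlay
        show_help = not show_help

cap.release()
cv2.destroyAllWindows()
```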
The project includes a comprehensive manifest system for tracking model versions:
- Date-based naming: asl_model_vYYYYMMDD_HHMMSS.pth
- Training manifests: Detailed training history and parameters
- Final manifests: Best model selection with performance metrics
- Complete tracking: All experiments and improvements documented
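A sketch of how a date-based checkpoint name and a minimal JSON manifest could be produced (field names here are illustrative; the real manifests contain the full training history and metrics):

```python
import json
from datetime import datetime
from pathlib import Path

import torch

def save_checkpoint(model, val_accuracy, out_dir="models"):
    """Save weights plus a small manifest under a timestamped name."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    name = f"asl_model_v{stamp}"
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    torch.save(model.state_dict(), out / f"{name}.pth")
    manifest = {"model": name, "val_accuracy": val_accuracy,
                "num_classes": 25, "created": stamp}
    with open(out / f"{name}_manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)
    return name
```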
Comprehensive documentation is available in the docs/ directory:
- Documentation Overview - Complete documentation guide
- Training Guide - How to train the model
- Data Preparation - Dataset preparation process
- Live Recognition - Real-time gesture recognition
- Manifest System - Model versioning and tracking
- RTX4070 Optimizations - GPU-specific optimizations
- Model Architecture - Detailed model architecture
- Model Results - Performance analysis and results
- Quick Start - Quick start guide
- Installation - Detailed installation instructions
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is based on the winning solution from Google's ASL Signs competition. Please refer to the original competition terms and conditions.
- Kaggle Competition: ASL Signs - American Sign Language Recognition organizers and participants
- Dataset: Google ASL Signs dataset provided by the competition
- MediaPipe team for hand tracking and landmark extraction
- PyTorch community for the deep learning framework
- Original competition winners for the model architecture inspiration
For issues and questions:
- Check the documentation in the docs/ directory
- Review existing issues
- Create a new issue with detailed information
Note: This project requires a CUDA-compatible GPU for optimal performance. CPU-only mode is available but significantly slower.
Ready to recognize ASL gestures!

