A web scraper and REST API for retrieving ASL (American Sign Language) videos from SignASL.org.
Part of GestureGPT - This scraper populates the video repository for the GestureGPT ASL learning platform.
This project scrapes ASL sign language videos from SignASL.org and provides a FastAPI-based REST API to:
- Check if a word/sign exists on SignASL.org
- Retrieve video URLs without downloading
- Download and cache videos locally
- Batch download multiple videos
- Manage the local video cache
- 🔍 Word Lookup - Search for ASL signs by word
- 📥 Video Download - Scrape and download ASL videos
- 🚀 REST API - Simple API interface for video retrieval
- 💾 Local Cache - Cache downloaded videos to avoid re-scraping
- 🎯 URL Extraction - Get direct video URLs from SignASL.org
signaslAPI/
├── scraper/
│ ├── __init__.py
│ ├── signasl_scraper.py # Core scraping logic
│ └── video_downloader.py # Video download utilities
├── api/
│ ├── __init__.py
│ └── main.py # FastAPI application
├── cache/ # Downloaded videos cache
├── venv/ # Virtual environment
├── requirements.txt
├── Dockerfile # Docker image configuration
├── docker-compose.yml # Docker Compose for production
├── docker-compose.dev.yml # Docker Compose for development
├── .dockerignore # Docker ignore patterns
├── test_scraper.py # Test HTML inspection
├── test_full_scraper.py # Test scraper functions
├── test_api.py # Test API endpoints
└── README.md
Prerequisites:
- Docker 20.10+
- Docker Compose v2.0+
Using Pre-built Image from GHCR:
# Pull the latest image from GitHub Container Registry
docker pull ghcr.io/notyusheng/signasl-api:latest
# Run the container
docker run -d \
--name signasl-api \
-p 8000:8000 \
-v "$(pwd)/cache:/app/cache" \
ghcr.io/notyusheng/signasl-api:latest
# Or use a specific version
docker pull ghcr.io/notyusheng/signasl-api:v0.0.0-abc1234
Building from Source:
# Clone or navigate to the project
cd Desktop/signaslAPI
# Build and start the container
docker compose up -d
# View logs
docker compose logs -f
# Stop the container
docker compose down
# Stop and remove volumes (clears cache)
docker compose down -v
The API will be available at http://localhost:8000
Development Mode with Hot Reload:
# Use the development compose file
docker compose -f docker-compose.dev.yml up -d
# Code changes will automatically reload the server
Prerequisites:
- Python 3.11+
- pip
Setup:
# Clone or navigate to the project
cd Desktop/signaslAPI
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
from scraper.signasl_scraper import SignASLScraper
from scraper.video_downloader import VideoDownloader
scraper = SignASLScraper()
downloader = VideoDownloader()
# Check if word exists
exists = scraper.word_exists("hello")
print(f"Word exists: {exists}")
# Get all video URLs for a word
video_urls = scraper.get_video_urls("hello")
print(f"Found {len(video_urls)} videos")
if video_urls:
    print(f"First URL: {video_urls[0]}")
# Get primary video URL
primary_url = scraper.get_primary_video_url("hello")
print(f"Primary URL: {primary_url}")
# Get detailed video information
details = scraper.get_video_details("hello")
for detail in details:
    print(f"Video ID: {detail['id']}, URL: {detail['url']}")
# Download all videos for a word
cached_paths = downloader.download_all_videos("hello", video_urls)
print(f"Downloaded to: {cached_paths}")
# Check if a video is cached (only when at least one URL was found)
if video_urls:
    is_cached = downloader.is_cached("hello", video_urls[0])
    print(f"Cached: {is_cached}")
Using Docker (Recommended):
# Start the API with Docker Compose
docker compose up -d
# The API will be available at http://localhost:8000
# Test the endpoints:
curl http://localhost:8000/api/check/hello
Using Python Virtual Environment:
# Start the API server
./venv/bin/uvicorn api.main:app --host 0.0.0.0 --port 8000
# Or use Python directly
./venv/bin/python -m uvicorn api.main:app --host 0.0.0.0 --port 8000
# In a new terminal, test the endpoints:
# Check if word exists
curl http://localhost:8000/api/check/hello
# Get video URLs
curl http://localhost:8000/api/video-url/hello
# Download videos
curl http://localhost:8000/api/download/hello
# Batch download
curl -X POST http://localhost:8000/api/batch/download \
-H "Content-Type: application/json" \
-d '{"words": ["hello", "world", "thank-you"], "force": false}'
# List cached videos
curl http://localhost:8000/api/cache/list
# Clear cache for a specific word
curl -X DELETE "http://localhost:8000/api/cache/clear?word=hello"
# Clear entire cache
curl -X DELETE http://localhost:8000/api/cache/clear
# Test HTML inspection
./venv/bin/python test_scraper.py
# Test scraper functions
./venv/bin/python test_full_scraper.py
# Test API endpoints (requires API server running)
./venv/bin/python test_api.py
Endpoint: GET /api/check/{word}
Description: Check if a word/sign exists on SignASL.org
Path Parameters:
word (string, required) - The word to search for (case-insensitive)
Success Response (200 OK):
{
"word": "hello",
"exists": true,
"video_count": 7
}
Word Not Found (200 OK):
{
"word": "nonexistentword",
"exists": false,
"video_count": 0
}
Error Response (500 Internal Server Error):
{
"detail": "Error checking word: Connection timeout"
}
Endpoint: GET /api/video-url/{word}
Description: Get the direct video URLs for a word without downloading them
Path Parameters:
word (string, required) - The word to get video URLs for
Success Response (200 OK):
{
"word": "hello",
"video_urls": [
"https://media.signbsl.com/videos/asl/startasl/mp4/hello.mp4",
"https://media.signbsl.com/videos/asl/elementalaslconcepts/mp4/hello.mp4",
"https://media.signbsl.com/videos/asl/youtube/mp4/6kvCOzxP9_A.mp4"
]
}
Not Found Response (404 Not Found):
{
"detail": "No videos found for word: nonexistentword"
}
Error Response (500 Internal Server Error):
{
"detail": "Error fetching video URLs: Connection timeout"
}
Endpoint: GET /api/download/{word}
Description: Download all videos for a word and save them to the local cache
Path Parameters:
word (string, required) - The word to download videos for
Query Parameters:
force (boolean, optional, default=false) - Force re-download even if cached
Success Response (200 OK):
{
"word": "hello",
"success": true,
"cached_videos": [
"cache/hello_d6975bb6.mp4",
"cache/hello_8a3f2c1e.mp4",
"cache/hello_4b9e7d2a.mp4"
],
"message": "Successfully downloaded 3 video(s)"
}
Not Found Response (404 Not Found):
{
"detail": "No videos found for word: nonexistentword"
}
Error Response (500 Internal Server Error):
{
"detail": "Error downloading video: Connection timeout"
}
Endpoint: POST /api/batch/download
Description: Download videos for multiple words in a single request
Request Body:
{
"words": ["hello", "world", "nonexistent"],
"force": false
}
Request Schema:
- words (array of strings, required) - List of words to download
- force (boolean, optional, default=false) - Force re-download cached videos
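A client might assemble the request body as follows. This is a minimal sketch using only the standard library; `build_batch_request` is a hypothetical helper that is not part of this project, and its checks are inferred from the schema above rather than taken from the API's own validation.

```python
import json


def build_batch_request(words, force=False):
    """Return a JSON string matching the batch download request schema.

    Hypothetical helper: the checks mirror the documented schema
    (words: array of strings, required; force: boolean, default false).
    """
    # A bare string is iterable too, so reject it explicitly
    if isinstance(words, str) or not all(isinstance(w, str) for w in words):
        raise TypeError("words must be a list of strings")
    if not isinstance(force, bool):
        raise TypeError("force must be a boolean")
    return json.dumps({"words": list(words), "force": force})


if __name__ == "__main__":
    print(build_batch_request(["hello", "world", "thank-you"]))
```

The resulting string can be sent as the body of the POST request with a `Content-Type: application/json` header.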
Success Response (200 OK):
{
"total_words": 3,
"successful": 2,
"failed": 1,
"results": [
{
"word": "hello",
"success": true,
"video_count": 7,
"cached_videos": [
"cache/hello_d6975bb6.mp4",
"cache/hello_8a3f2c1e.mp4"
]
},
{
"word": "world",
"success": true,
"video_count": 27,
"cached_videos": [
"cache/world_a1b2c3d4.mp4"
]
},
{
"word": "nonexistent",
"success": false,
"error": "No videos found"
}
]
}
Endpoint: GET /api/cache/list
Description: List all videos in local cache
Success Response (200 OK):
{
"total_videos": 58,
"cache_size_bytes": 6205440,
"cache_size_mb": 5.92,
"videos": [
"cache/hello_d6975bb6.mp4",
"cache/world_a1b2c3d4.mp4",
"cache/computer_5e8f9a2b.mp4"
]
}
Endpoint: DELETE /api/cache/clear
Description: Clear cached videos (all or for a specific word)
Query Parameters:
word (string, optional) - If provided, only clear videos for this word
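When calling this endpoint from code, the optional query parameter is best added with a URL-encoding helper rather than string concatenation. A minimal sketch, assuming the API runs at the local address used elsewhere in this README; `cache_clear_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # assumed local deployment from this README


def cache_clear_url(word=None):
    """Build the DELETE /api/cache/clear URL, optionally scoped to one word."""
    url = f"{BASE_URL}/api/cache/clear"
    if word:
        # urlencode escapes characters that are unsafe in a query string
        url += "?" + urlencode({"word": word})
    return url


if __name__ == "__main__":
    print(cache_clear_url())
    print(cache_clear_url("hello"))
```

The URL can then be sent with any HTTP client that supports the DELETE method, for example `urllib.request.Request(url, method="DELETE")`.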
Success Response (200 OK) - Clear all:
{
"deleted_count": 58,
"message": "Cleared all 58 cached video(s)"
}
Success Response (200 OK) - Clear specific word:
{
"deleted_count": 7,
"message": "Cleared 7 video(s) for word: hello"
}
beautifulsoup4==4.12.3 # HTML parsing
requests==2.31.0 # HTTP requests
fastapi==0.109.0 # API framework
uvicorn[standard]==0.27.0 # ASGI server
aiofiles==23.2.1 # Async file operations
lxml==5.1.0 # XML/HTML parser
# Build the image
docker compose build
# Start the container
docker compose up -d
# View logs
docker compose logs -f
# Check container status
docker compose ps
# Stop the container
docker compose down
# Rebuild and restart
docker compose up -d --build
# Access container shell
docker compose exec signasl-api sh
# Remove everything including volumes
docker compose down -v
Production (docker-compose.yml):
- Runs on port 8000
- Persistent cache volume
- Auto-restart enabled
- Health checks configured
Development (docker-compose.dev.yml):
- Hot reload enabled
- Source code mounted as volumes
- Code changes take effect immediately
The Docker setup uses a volume to persist the video cache:
# View cache contents
docker compose exec signasl-api ls -lah /app/cache
# Backup cache
docker compose exec signasl-api tar -czf /tmp/cache-backup.tar.gz /app/cache
docker compose cp signasl-api:/tmp/cache-backup.tar.gz ./cache-backup.tar.gz
# Clear cache via API
curl -X DELETE http://localhost:8000/api/cache/clear
- Word Lookup: The scraper constructs a URL to SignASL.org using the word
- Page Fetch: Retrieves the HTML page using requests with proper headers
- Video Extraction: Parses HTML with BeautifulSoup to find all <video> and <source> tags
- Multiple Videos: SignASL.org typically provides 5-10+ videos per word from different sources
- Video Download: Downloads video files to local cache with unique filenames (word + URL hash)
- Caching: Checks cache before downloading to avoid redundant requests
- Response: Returns video URLs or local file paths via REST API
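The extraction and caching steps above can be sketched offline. This is not the project's implementation: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without dependencies, and the names `VideoSrcParser`, `extract_video_urls`, and `cache_path` are hypothetical. This README only says that filenames combine the word with a URL hash; taking the first 8 hex characters of a SHA-256 digest is an assumption chosen to resemble names like `hello_d6975bb6.mp4`.

```python
import hashlib
from html.parser import HTMLParser
from pathlib import Path


class VideoSrcParser(HTMLParser):
    """Collect src attributes from <video> and <source> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag in ("video", "source"):
            src = dict(attrs).get("src")
            # Skip tags without a src and avoid listing the same URL twice
            if src and src not in self.urls:
                self.urls.append(src)


def extract_video_urls(html):
    """Return video URLs in document order."""
    parser = VideoSrcParser()
    parser.feed(html)
    return parser.urls


def cache_path(word, url, cache_dir="cache"):
    """Build a cache filename from the word plus a short hash of the URL."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]  # assumed scheme
    return Path(cache_dir) / f"{word}_{digest}.mp4"


if __name__ == "__main__":
    sample = """
    <video controls><source src="https://media.signbsl.com/videos/asl/startasl/mp4/hello.mp4" type="video/mp4"></video>
    <video src="https://media.signbsl.com/videos/asl/youtube/mp4/6kvCOzxP9_A.mp4"></video>
    """
    for url in extract_video_urls(sample):
        path = cache_path("hello", url)
        # Caching step: only download when the file is not already present
        action = "cached" if path.exists() else "would download"
        print(f"{action}: {url} -> {path}")
```

Hashing the URL keeps filenames unique when one word has several videos, and makes the cache lookup a plain file-existence check.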
SignASL.org URLs follow the pattern:
https://www.signasl.org/sign/{word}
Examples:
- https://www.signasl.org/sign/hello (9 videos found)
- https://www.signasl.org/sign/world (27 videos found)
- https://www.signasl.org/sign/thank-you
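A URL builder for this pattern could look like the sketch below. The normalization rules (lowercasing, trimming, and joining multi-word phrases with hyphens) are assumptions inferred from the `thank-you` example, and `build_sign_url` is a hypothetical name, not necessarily what the scraper uses.

```python
from urllib.parse import quote

BASE_URL = "https://www.signasl.org/sign/"


def build_sign_url(word):
    """Map a word or phrase to its SignASL.org page URL."""
    # Assumption: lowercase, trim, and join multi-word phrases with hyphens,
    # as the "thank-you" example suggests.
    slug = "-".join(word.strip().lower().split())
    # Percent-encode anything that is not safe inside a URL path segment
    return BASE_URL + quote(slug, safe="-")


if __name__ == "__main__":
    for term in ("hello", "World", "thank you"):
        print(build_sign_url(term))
```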
Video Sources: SignASL.org aggregates videos from multiple sources:
- media.signbsl.com/videos/asl/startasl/
- media.signbsl.com/videos/asl/elementalaslconcepts/
- media.signbsl.com/videos/asl/youtube/
- media.signbsl.com/videos/asl/aslsignbank/
- player.vimeo.com/external/ (for ASL Study videos)
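Because the source is encoded in the URL, a scraped URL can be labeled by its origin, for example to prefer one source over another. A minimal sketch under that assumption; `video_source` and the labels `"vimeo"` and `"unknown"` are hypothetical and not part of the project.

```python
from urllib.parse import urlparse


def video_source(url):
    """Return a short source label for a video URL, or "unknown"."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    # media.signbsl.com URLs look like /videos/asl/<source>/...
    if (
        parsed.netloc == "media.signbsl.com"
        and parts[:2] == ["videos", "asl"]
        and len(parts) > 2
    ):
        return parts[2]  # e.g. "startasl", "youtube", "aslsignbank"
    if parsed.netloc == "player.vimeo.com":
        return "vimeo"
    return "unknown"


if __name__ == "__main__":
    print(video_source("https://media.signbsl.com/videos/asl/startasl/mp4/hello.mp4"))
```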
This scraper is designed to populate the video repository for GestureGPT.
Workflow:
- Use this scraper to download ASL videos
- Update GestureGPT's data/video_index.json with video URLs
- GestureGPT API serves these videos to clients
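The format of `data/video_index.json` is not described in this README, so the following is only an illustration under an assumed layout: a JSON object that maps each word to a list of video URLs. Check the real GestureGPT schema before using anything like it; `update_video_index` is a hypothetical helper.

```python
import json
from pathlib import Path


def update_video_index(index_path, word, video_urls):
    """Merge video URLs for a word into a JSON index file.

    Assumed layout (not confirmed by GestureGPT): {"word": ["url", ...]}.
    """
    path = Path(index_path)
    index = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    known = index.setdefault(word, [])
    for url in video_urls:
        if url not in known:  # keep existing entries, skip duplicates
            known.append(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(index, indent=2, sort_keys=True), encoding="utf-8")
    return index
```

Merging instead of overwriting keeps entries for other words intact when the scraper runs for one word at a time.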
- Rate limiting: Be respectful of SignASL.org's servers
- Video availability: Not all words may have videos
- Network dependency: Requires internet connection
- Copyright: Videos belong to SignASL.org - respect their terms of use
- Respect SignASL.org's robots.txt
- Implement rate limiting
- Cache videos to minimize requests
- Credit SignASL.org as the source
- Do not redistribute videos commercially
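Two of these practices, honoring robots.txt and rate limiting, can be sketched with the standard library alone. This is an illustration, not the project's code: `allowed_by_robots` and `RateLimiter` are hypothetical names, the one-second default interval is arbitrary, and the sample rules are made up rather than copied from SignASL.org.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Check a URL against robots.txt content that was already fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds; 1.0 is an arbitrary default
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    # Illustrative rules only; these are not SignASL.org's actual robots.txt.
    sample = "User-agent: *\nDisallow: /private/\n"
    print(allowed_by_robots(sample, "https://www.signasl.org/sign/hello"))
    limiter = RateLimiter(min_interval=0.1)
    for word in ("hello", "world"):
        limiter.wait()  # call before each request
        print("fetching", word)
```

Passing the clock and sleep functions in as parameters keeps the limiter testable without real delays.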
The SignASL API Docker image is automatically published to GitHub Container Registry (GHCR) on every push to main and on every release.
- Push to main: ghcr.io/notyusheng/signasl-api:v0.0.0-{short-sha}
- Release: ghcr.io/notyusheng/signasl-api:v1.0.0 (and tagged as latest)
# Pull latest version
docker pull ghcr.io/notyusheng/signasl-api:latest
# Run directly
docker run -d -p 8000:8000 \
-v "$(pwd)/cache:/app/cache" \
ghcr.io/notyusheng/signasl-api:latest
# Use in docker-compose.yml (uncomment the image line)
# See docker-compose.yml for details
- Add robots.txt parser
- Make rate limiting configurable
- Support for other sign language websites (WLASL, ASL-LEX)
- Video quality selection
- Metadata extraction (poster images, video IDs, sources)
- Progress tracking for batch downloads
- Database integration for video metadata
- Video format conversion
- Async download support for better performance
MIT License - See LICENSE file for details
- SignASL.org - ASL video source
- Built to support the GestureGPT project