SignASL Scraper API

A web scraper and REST API for retrieving ASL (American Sign Language) videos from SignASL.org.

Part of GestureGPT - This scraper populates the video repository for the GestureGPT ASL learning platform.

Overview

This project scrapes ASL videos from SignASL.org and provides a FastAPI-based REST API to:

Check if a word/sign exists on SignASL.org
Retrieve video URLs without downloading
Download and cache videos locally
Batch download multiple videos
Manage the local video cache

Features

🔍 Word Lookup - Search for ASL signs by word
📥 Video Download - Scrape and download ASL videos
🚀 REST API - Simple API interface for video retrieval
💾 Local Cache - Cache downloaded videos to avoid re-scraping
🎯 URL Extraction - Get direct video URLs from SignASL.org

Project Structure

signaslAPI/
├── scraper/
│   ├── __init__.py
│   ├── signasl_scraper.py    # Core scraping logic
│   └── video_downloader.py   # Video download utilities
├── api/
│   ├── __init__.py
│   └── main.py               # FastAPI application
├── cache/                    # Downloaded videos cache
├── venv/                     # Virtual environment
├── requirements.txt
├── Dockerfile               # Docker image configuration
├── docker-compose.yml       # Docker Compose for production
├── docker-compose.dev.yml   # Docker Compose for development
├── .dockerignore            # Docker ignore patterns
├── test_scraper.py          # Test HTML inspection
├── test_full_scraper.py     # Test scraper functions
├── test_api.py              # Test API endpoints
└── README.md

Installation

Option 1: Docker (Recommended)

Prerequisites:

Docker 20.10+
Docker Compose v2.0+

Using Pre-built Image from GHCR:

# Pull the latest image from GitHub Container Registry
docker pull ghcr.io/notyusheng/signasl-api:latest

# Run the container
docker run -d \
  --name signasl-api \
  -p 8000:8000 \
  -v $(pwd)/cache:/app/cache \
  ghcr.io/notyusheng/signasl-api:latest

# Or use a specific version
docker pull ghcr.io/notyusheng/signasl-api:v0.0.0-abc1234

Building from Source:

# Clone or navigate to the project (adjust the path to your checkout)
cd signaslAPI

# Build and start the container
docker compose up -d

# View logs
docker compose logs -f

# Stop the container
docker compose down

# Stop and remove volumes (clears cache)
docker compose down -v

The API will be available at http://localhost:8000

Development Mode with Hot Reload:

# Use the development compose file
docker compose -f docker-compose.dev.yml up -d

# Code changes will automatically reload the server

Option 2: Python Virtual Environment

Prerequisites:

Python 3.11+
pip

Setup:

# Clone or navigate to the project (adjust the path to your checkout)
cd signaslAPI

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Usage

As a Python Module

from scraper.signasl_scraper import SignASLScraper
from scraper.video_downloader import VideoDownloader

scraper = SignASLScraper()
downloader = VideoDownloader()

# Check if word exists
exists = scraper.word_exists("hello")
print(f"Word exists: {exists}")

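# Get all video URLs for a word
video_urls = scraper.get_video_urls("hello")
print(f"Found {len(video_urls)} videos")
if not video_urls:
    # Stop here: the indexing below would fail on an empty list
    raise SystemExit("No videos found for 'hello'")
print(f"First URL: {video_urls[0]}")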

# Get primary video URL
primary_url = scraper.get_primary_video_url("hello")
print(f"Primary URL: {primary_url}")

# Get detailed video information
details = scraper.get_video_details("hello")
for detail in details:
    print(f"Video ID: {detail['id']}, URL: {detail['url']}")

# Download all videos for a word
cached_paths = downloader.download_all_videos("hello", video_urls)
print(f"Downloaded to: {cached_paths}")

# Check if a video is cached
is_cached = downloader.is_cached("hello", video_urls[0])
print(f"Cached: {is_cached}")

As a REST API

Using Docker (Recommended):

# Start the API with Docker Compose
docker compose up -d

# The API will be available at http://localhost:8000

# Test the endpoints:
curl http://localhost:8000/api/check/hello

Using Python Virtual Environment:

# Start the API server
./venv/bin/uvicorn api.main:app --host 0.0.0.0 --port 8000

# Or use Python directly
./venv/bin/python -m uvicorn api.main:app --host 0.0.0.0 --port 8000

# In a new terminal, test the endpoints:

# Check if word exists
curl http://localhost:8000/api/check/hello

# Get video URLs
curl http://localhost:8000/api/video-url/hello

# Download videos
curl http://localhost:8000/api/download/hello

# Batch download
curl -X POST http://localhost:8000/api/batch/download \
  -H "Content-Type: application/json" \
  -d '{"words": ["hello", "world", "thank-you"], "force": false}'

# List cached videos
curl http://localhost:8000/api/cache/list

# Clear cache for a specific word
curl -X DELETE "http://localhost:8000/api/cache/clear?word=hello"

# Clear entire cache
curl -X DELETE http://localhost:8000/api/cache/clear
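
The same calls can be made from Python. The following is a minimal sketch using only the standard library; it builds the requests without sending them. The helper names and the BASE_URL constant are illustrative assumptions and are not part of this project's code.

```python
import json
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import Request

# Assumption: the API is reachable at the default address used in this README.
BASE_URL = "http://localhost:8000"


def check_request(word: str) -> Request:
    """Build GET /api/check/{word}."""
    return Request(f"{BASE_URL}/api/check/{quote(word, safe='')}", method="GET")


def download_request(word: str, force: bool = False) -> Request:
    """Build GET /api/download/{word} with the optional force flag."""
    query = urlencode({"force": str(force).lower()})
    url = f"{BASE_URL}/api/download/{quote(word, safe='')}?{query}"
    return Request(url, method="GET")


def batch_download_request(words: list[str], force: bool = False) -> Request:
    """Build POST /api/batch/download with a JSON body."""
    body = json.dumps({"words": words, "force": force}).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/batch/download",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def clear_cache_request(word: Optional[str] = None) -> Request:
    """Build DELETE /api/cache/clear, optionally limited to one word."""
    url = f"{BASE_URL}/api/cache/clear"
    if word is not None:
        url += "?" + urlencode({"word": word})
    return Request(url, method="DELETE")
```

With the server running, passing any of these objects to urllib.request.urlopen sends the request, for example urlopen(check_request("hello")).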

Running Tests

# Test HTML inspection
./venv/bin/python test_scraper.py

# Test scraper functions
./venv/bin/python test_full_scraper.py

# Test API endpoints (requires API server running)
./venv/bin/python test_api.py

API Endpoints

1. Check Word Existence

Endpoint: GET /api/check/{word}

Description: Check if a word/sign exists on SignASL.org

Path Parameters:

word (string, required) - The word to search for (case-insensitive)

Success Response (200 OK):

{
  "word": "hello",
  "exists": true,
  "video_count": 7
}

Word Not Found (200 OK):

{
  "word": "nonexistentword",
  "exists": false,
  "video_count": 0
}

Error Response (500 Internal Server Error):

{
  "detail": "Error checking word: Connection timeout"
}

2. Get Video URL

Endpoint: GET /api/video-url/{word}

Description: Get the direct video URLs for a word without downloading them

Path Parameters:

word (string, required) - The word to get video URLs for

Success Response (200 OK):

{
  "word": "hello",
  "video_urls": [
    "https://media.signbsl.com/videos/asl/startasl/mp4/hello.mp4",
    "https://media.signbsl.com/videos/asl/elementalaslconcepts/mp4/hello.mp4",
    "https://media.signbsl.com/videos/asl/youtube/mp4/6kvCOzxP9_A.mp4"
  ]
}

Not Found Response (404 Not Found):

{
  "detail": "No videos found for word: nonexistentword"
}

Error Response (500 Internal Server Error):

{
  "detail": "Error fetching video URLs: Connection timeout"
}
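
Note that this endpoint reports a missing word with 404, whereas /api/check/{word} reports it with 200 and "exists": false. A client might fold the three documented outcomes into one function, roughly as sketched below. The function and exception names are hypothetical and do not come from this project.

```python
import json


class SignASLAPIError(RuntimeError):
    """Raised for unexpected statuses (hypothetical name, not from the project)."""


def video_urls_from_response(status: int, body: str) -> list[str]:
    """Map a /api/video-url/{word} response to a list of URLs.

    200 -> the "video_urls" list
    404 -> empty list (word has no videos)
    anything else -> SignASLAPIError carrying the "detail" message
    """
    payload = json.loads(body)
    if status == 200:
        return payload["video_urls"]
    if status == 404:
        return []
    raise SignASLAPIError(payload.get("detail", f"HTTP {status}"))
```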

3. Download Video

Endpoint: GET /api/download/{word}

Description: Download all videos for a word and save them to the local cache

Path Parameters:

word (string, required) - The word to download videos for

Query Parameters:

force (boolean, optional, default=false) - Force re-download even if cached

Success Response (200 OK):

{
  "word": "hello",
  "success": true,
  "cached_videos": [
    "cache/hello_d6975bb6.mp4",
    "cache/hello_8a3f2c1e.mp4",
    "cache/hello_4b9e7d2a.mp4"
  ],
  "message": "Successfully downloaded 3 video(s)"
}

Not Found Response (404 Not Found):

{
  "detail": "No videos found for word: nonexistentword"
}

Error Response (500 Internal Server Error):

{
  "detail": "Error downloading video: Connection timeout"
}
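
The cached filenames in the success response follow a word-plus-hash pattern such as hello_d6975bb6.mp4. The sketch below shows one way such a name could be derived. The hash algorithm is an assumption (this README only says "word + URL hash"), so the digests it produces will not necessarily match the ones the project generates.

```python
import hashlib
from pathlib import Path


def cache_path(word: str, video_url: str, cache_dir: str = "cache") -> Path:
    """Return a cache path of the form <cache_dir>/<word>_<8 hex chars>.mp4.

    Assumption: the suffix is the first 8 hex characters of a SHA-256 digest
    of the video URL. The real project may use a different hash.
    """
    digest = hashlib.sha256(video_url.encode("utf-8")).hexdigest()[:8]
    return Path(cache_dir) / f"{word}_{digest}.mp4"
```

Because the name depends only on the word and the URL, the same video always maps to the same file, which is what makes a cache check before download possible.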

4. Batch Download

Endpoint: POST /api/batch/download

Description: Download videos for multiple words in a single request

Request Body:

{
  "words": ["hello", "world", "nonexistent"],
  "force": false
}

Request Schema:

words (array of strings, required) - List of words to download
force (boolean, optional, default=false) - Force re-download cached videos

Success Response (200 OK):

{
  "total_words": 3,
  "successful": 2,
  "failed": 1,
  "results": [
    {
      "word": "hello",
      "success": true,
      "video_count": 7,
      "cached_videos": [
        "cache/hello_d6975bb6.mp4",
        "cache/hello_8a3f2c1e.mp4"
      ]
    },
    {
      "word": "world",
      "success": true,
      "video_count": 27,
      "cached_videos": [
        "cache/world_a1b2c3d4.mp4"
      ]
    },
    {
      "word": "nonexistent",
      "success": false,
      "error": "No videos found"
    }
  ]
}
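
The top-level counters in this response can be derived from the per-word results. A minimal sketch, assuming each result carries a boolean "success" field as in the example above (the function name is illustrative):

```python
def summarize_batch(results: list[dict]) -> dict:
    """Build the batch summary from a list of per-word result dicts."""
    successful = sum(1 for result in results if result.get("success"))
    return {
        "total_words": len(results),
        "successful": successful,
        "failed": len(results) - successful,
        "results": results,
    }


def failed_words(results: list[dict]) -> list[str]:
    """List the words whose download did not succeed, e.g. to retry them."""
    return [result["word"] for result in results if not result.get("success")]
```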

5. List Cached Videos

Endpoint: GET /api/cache/list

Description: List all videos in local cache

Success Response (200 OK):

{
  "total_videos": 58,
  "cache_size_bytes": 6205440,
  "cache_size_mb": 5.92,
  "videos": [
    "cache/hello_d6975bb6.mp4",
    "cache/world_a1b2c3d4.mp4",
    "cache/computer_5e8f9a2b.mp4"
  ]
}
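
The sample values are consistent with cache_size_mb being the byte count divided by 1024 × 1024 and rounded to two decimals (6205440 bytes gives 5.92). A sketch of how such statistics could be computed from a cache directory follows; the function name and the *.mp4 filter are assumptions.

```python
from pathlib import Path


def cache_stats(cache_dir: str = "cache") -> dict:
    """Collect file count and total size for the .mp4 files in cache_dir."""
    files = sorted(Path(cache_dir).glob("*.mp4"))
    size_bytes = sum(f.stat().st_size for f in files)
    return {
        "total_videos": len(files),
        "cache_size_bytes": size_bytes,
        # Assumption: "MB" here means 1024 * 1024 bytes.
        "cache_size_mb": round(size_bytes / (1024 * 1024), 2),
        "videos": [f.as_posix() for f in files],
    }
```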

6. Clear Cache

Endpoint: DELETE /api/cache/clear

Description: Clear cached videos (all or for a specific word)

Query Parameters:

word (string, optional) - If provided, only clear videos for this word

Success Response (200 OK) - Clear all:

{
  "deleted_count": 58,
  "message": "Cleared all 58 cached video(s)"
}

Success Response (200 OK) - Clear specific word:

{
  "deleted_count": 7,
  "message": "Cleared 7 video(s) for word: hello"
}

Dependencies

beautifulsoup4==4.12.3     # HTML parsing
requests==2.31.0           # HTTP requests
fastapi==0.109.0           # API framework
uvicorn[standard]==0.27.0  # ASGI server
aiofiles==23.2.1           # Async file operations
lxml==5.1.0                # XML/HTML parser

Docker Deployment

Docker Commands

# Build the image
docker compose build

# Start the container
docker compose up -d

# View logs
docker compose logs -f

# Check container status
docker compose ps

# Stop the container
docker compose down

# Rebuild and restart
docker compose up -d --build

# Access container shell
docker compose exec signasl-api sh

# Remove everything including volumes
docker compose down -v

Docker Configuration

Production (docker-compose.yml):

Runs on port 8000
Persistent cache volume
Auto-restart enabled
Health checks configured

Development (docker-compose.dev.yml):

Hot reload enabled
Source code mounted as volumes
Code changes take effect immediately

Volume Management

The Docker setup uses a volume to persist the video cache:

# View cache contents
docker compose exec signasl-api ls -lah /app/cache

# Backup cache
docker compose exec signasl-api tar -czf /tmp/cache-backup.tar.gz /app/cache
docker compose cp signasl-api:/tmp/cache-backup.tar.gz ./cache-backup.tar.gz

# Clear cache via API
curl -X DELETE http://localhost:8000/api/cache/clear

How It Works

Word Lookup: The scraper constructs a URL to SignASL.org using the word
Page Fetch: Retrieves the HTML page using requests with proper headers
Video Extraction: Parses HTML with BeautifulSoup to find all <video> and <source> tags
Multiple Videos: SignASL.org typically provides 5-10+ videos per word from different sources
Video Download: Downloads video files to local cache with unique filenames (word + URL hash)
Caching: Checks cache before downloading to avoid redundant requests
Response: Returns video URLs or local file paths via REST API
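
The video extraction step can be illustrated with the standard library's html.parser, used here as a dependency-free stand-in for BeautifulSoup. This is a sketch under assumptions: the class and function names are illustrative, and the actual markup on SignASL.org may require additional handling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class VideoSourceParser(HTMLParser):
    """Collect src attributes from <video> and <source> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("video", "source"):
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page URL and skip duplicates.
            url = urljoin(self.base_url, src)
            if url not in self.urls:
                self.urls.append(url)


def extract_video_urls(html: str, base_url: str) -> list[str]:
    """Return video URLs in the order they appear in the page."""
    parser = VideoSourceParser(base_url)
    parser.feed(html)
    return parser.urls
```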

SignASL.org Structure

SignASL.org URLs follow the pattern:

https://www.signasl.org/sign/{word}

Examples (video counts vary as the site is updated):

https://www.signasl.org/sign/hello (9 videos found)
https://www.signasl.org/sign/world (27 videos found)
https://www.signasl.org/sign/thank-you
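
Building the page URL from a word might look like the sketch below. The normalization (lowercasing and joining multi-word phrases with hyphens) is an assumption inferred from the thank-you example and from the API treating words as case-insensitive; the scraper's actual rules may differ.

```python
import re
from urllib.parse import quote

SIGN_URL = "https://www.signasl.org/sign/{word}"


def sign_url(word: str) -> str:
    """Return the SignASL.org page URL for a word or phrase."""
    # Assumption: lowercase, with whitespace runs replaced by single hyphens.
    slug = re.sub(r"\s+", "-", word.strip().lower())
    return SIGN_URL.format(word=quote(slug, safe=""))
```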

Video Sources: SignASL.org aggregates videos from multiple sources:

media.signbsl.com/videos/asl/startasl/
media.signbsl.com/videos/asl/elementalaslconcepts/
media.signbsl.com/videos/asl/youtube/
media.signbsl.com/videos/asl/aslsignbank/
player.vimeo.com/external/ (for ASL Study videos)

Integration with GestureGPT

This scraper is designed to populate the video repository for GestureGPT.

Workflow:

Use this scraper to download ASL videos
Update GestureGPT's data/video_index.json with video URLs
GestureGPT API serves these videos to clients
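
The index update step could be scripted along the following lines. The schema of data/video_index.json is not documented here, so this sketch assumes a simple mapping from word to a list of URLs; check GestureGPT's actual format before using anything like it.

```python
import json
from pathlib import Path


def update_video_index(index_path: str, word: str, video_urls: list[str]) -> dict:
    """Merge video URLs for a word into a JSON index file.

    Assumed (hypothetical) schema: {"<word>": ["<url>", ...], ...}
    """
    path = Path(index_path)
    index = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # Keep existing entries first and drop duplicates while preserving order.
    index[word] = list(dict.fromkeys(index.get(word, []) + list(video_urls)))
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(index, indent=2, sort_keys=True), encoding="utf-8")
    return index
```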

Limitations

Rate limiting: Be respectful of SignASL.org's servers
Video availability: Not all words may have videos
Network dependency: Requires internet connection
Copyright: Videos belong to SignASL.org - respect their terms of use

Ethical Considerations

⚠️ Important: This scraper is for educational and accessibility purposes only.

Respect SignASL.org's robots.txt
Implement rate limiting
Cache videos to minimize requests
Credit SignASL.org as the source
Do not redistribute videos commercially
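
The first two points can be sketched with the standard library. The example below checks a URL against robots.txt rules with urllib.robotparser and enforces a minimum delay between requests. The names and the interval are illustrative assumptions; this README does not state that the scraper currently implements either.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class MinIntervalLimiter:
    """Ensure at least min_interval seconds pass between consecutive requests."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if the previous request was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling limiter.wait() before each page fetch keeps requests spaced out; the clock and sleep parameters exist only so the behaviour can be tested without real delays.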

GitHub Container Registry (GHCR)

The SignASL API is automatically published to GitHub Container Registry on every push to main and on releases.

Image Naming Convention

Push to main: ghcr.io/notyusheng/signasl-api:v0.0.0-{short-sha}
Release: ghcr.io/notyusheng/signasl-api:v1.0.0 (and tagged as latest)
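
As an illustration of the convention, the helper below builds the image reference for either case. It is a hypothetical sketch: the 7-character short SHA is an assumption based on the abc1234 example, and the publishing workflow itself is the authority on the exact format.

```python
from typing import Optional

IMAGE = "ghcr.io/notyusheng/signasl-api"


def image_reference(
    commit_sha: str,
    release_version: Optional[str] = None,
    short_len: int = 7,  # assumption: short SHA length used by the workflow
) -> str:
    """Return the image reference for a release or for a push to main."""
    if release_version:
        return f"{IMAGE}:{release_version}"
    return f"{IMAGE}:v0.0.0-{commit_sha[:short_len]}"
```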

Using GHCR Images

# Pull latest version
docker pull ghcr.io/notyusheng/signasl-api:latest

# Run directly
docker run -d -p 8000:8000 \
  -v "$(pwd)/cache:/app/cache" \
  ghcr.io/notyusheng/signasl-api:latest

# Use in docker-compose.yml (uncomment the image line)
# See docker-compose.yml for details

Future Enhancements

Add robots.txt parser
Make rate limiting configurable
Support for other sign language websites (WLASL, ASL-LEX)
Video quality selection
Metadata extraction (poster images, video IDs, sources)
Progress tracking for batch downloads
Database integration for video metadata
Video format conversion
Async download support for better performance

License

MIT License - See LICENSE file for details

Acknowledgments

SignASL.org - ASL video source
Built to support the GestureGPT project
