Advanced speech-to-text transcription service with web interface
Azpidatzi is a modern, containerized audio transcription platform that converts speech to text with high accuracy. Built with FastAPI and Astro, it supports multiple audio formats, YouTube integration, and provides word-level timestamps with optimized subtitle generation.
Upload an audio file or download a video from YouTube.
Download or edit the transcript directly in the browser.
More screenshots here.
- Multi-format Support: Upload WAV, MP3, M4A, FLAC, and other audio formats
- YouTube Video Download: Direct transcription from YouTube video URLs
- Advanced Transcription: WhisperX-powered with word-level alignment
- Multi-language: Support for custom Whisper models and alignment models
- GPU Acceleration: Automatic GPU detection and utilization
- Voice Activity Detection: Silero VAD for improved accuracy
- Speaker Diarization: Identify and label speakers with WhisperX
- Subtitle Generation: Multiple SRT formats (optimized, basic, unaligned)
- Modern Web Interface: Clean, responsive UI built with Astro and TailwindCSS
- Transcript Editor: Built-in web editor for reviewing and correcting transcripts
- Docker and Docker Compose v2
- (Optional) NVIDIA drivers and NVIDIA Container Toolkit for GPU acceleration. Recommended for better performance and larger models.
# Clone the repository
git clone <repository-url>
cd azpidatzi
# Start all services
docker compose up --build
# Access the application
# Frontend: http://localhost:4321
# Backend API: http://localhost:8000
# API Documentation: http://localhost:8000/docs
Note: Docker Compose is the supported and recommended way to run Azpidatzi. All development, testing, and deployment should be done using Docker containers.
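Once the stack is up, you can confirm that both services respond with a quick check against the ports listed above (a convenience sketch only; opening the URLs in a browser works just as well):
# smoke_check.py - verify that the frontend and backend answer on their default ports
import urllib.request
for name, url in [("frontend", "http://localhost:4321"), ("backend docs", "http://localhost:8000/docs")]:
    with urllib.request.urlopen(url) as resp:
        print(f"{name}: HTTP {resp.status}")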
A comprehensive Makefile is provided for common operations:
# Start all services
make up
# View logs
make logs
# Run tests
make test
# Stop all services
make down
# Show all available commands
make help
Available Commands:
- make build - Build all Docker images
- make up - Start all services
- make down - Stop all services
- make logs - Show logs from all services
- make logs-backend - Show backend logs only
- make logs-frontend - Show frontend logs only
- make test - Run backend tests
- make test-verbose - Run tests with verbose output
- make rebuild - Stop, rebuild and restart services
- make clean - Remove all containers and images
- FastAPI: Modern, fast web framework
- WhisperX: Advanced speech recognition with alignment (see the pipeline sketch below)
- FFmpeg: Audio processing and format conversion
- GPU Support: CUDA, MPS, and CPU fallback
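For context, WhisperX produces word-level timestamps in two stages: a Whisper transcription pass followed by a forced-alignment pass. The sketch below shows that pipeline in isolation; the model names and options are illustrative, not necessarily what Azpidatzi uses internally:
import whisperx
device = "cuda"  # or "cpu" / "mps" depending on what is available
audio = whisperx.load_audio("sample.wav")
# Stage 1: transcription with a Whisper model
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)
# Stage 2: word-level alignment with a language-specific alignment model
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)
print(aligned["word_segments"][:5])  # per-word start/end timestamps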
- Astro: Static site generator for optimal performance
- TailwindCSS: Utility-first CSS framework
- Vanilla JavaScript: No framework dependencies
- API Documentation: Complete REST API reference
- Frontend Documentation: Frontend development guide
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
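The interactive docs are the authoritative reference for route names and parameters. As a rough illustration of how a client could talk to the backend from Python, here is a hedged sketch using the requests library; the upload path below is hypothetical, so check the Swagger UI or backend/API.md for the real endpoints:
import requests
# NOTE: "/files" is a placeholder route; see http://localhost:8000/docs for the actual API
with open("sample.mp3", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/files",
        files={"file": ("sample.mp3", f, "audio/mpeg")},
    )
resp.raise_for_status()
print(resp.json())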
All testing should be done using Docker Compose to ensure consistency:
# Run all tests
make test
# Run tests with verbose output
make test-verbose
# Run specific test
docker compose run --rm backend pytest -q -k test_name
# Run with output capture disabled
docker compose run --rm backend pytest -q -s
The recommended approach is to use Docker Compose for all development tasks:
# Start development environment
docker compose up --build
# Rebuild after code changes
docker compose build
# View logs during development
docker compose logs -f backend
docker compose logs -f frontend
# Run tests
docker compose run --rm backend pytest -q
Note: Local development requires additional setup and is not the supported environment.
For backend local development (requires Python 3.8+ and FFmpeg):
cd backend
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000
For frontend local development (requires Node.js 18+):
cd frontend
npm install
npm run dev
Important: The frontend is configured to use http://backend:8000 in Docker and http://localhost:8000 for local development. Always ensure the backend is running when developing locally.
The application automatically detects and uses available GPUs:
# For NVIDIA GPUs
docker compose -f docker-compose.gpu.yml up --build
# Or with explicit GPU configuration
docker compose up --build --gpus all
- HUGGING_FACE_HUB_TOKEN: Required for Silero VAD functionality
- CUDA_VISIBLE_DEVICES: Control GPU usage (optional)
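As a general pattern (a sketch, not Azpidatzi's exact code), device selection with PyTorch and the Hugging Face token lookup typically look like this; CUDA_VISIBLE_DEVICES is honored automatically by torch.cuda:
import os
import torch
# Pick the best available device: CUDA GPU, Apple Silicon (MPS), or CPU fallback
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
# Token for gated Hugging Face models used for VAD / diarization
hf_token = os.environ.get("HUGGING_FACE_HUB_TOKEN")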
azpidatzi/
├── backend/                  # FastAPI backend service
│   ├── app/                  # Application code
│   │   ├── main.py           # FastAPI application
│   │   ├── routers/          # API endpoints
│   │   └── services/         # Business logic
│   ├── tests/                # Test suite
│   ├── data/                 # Runtime data storage
│   ├── API.md                # API documentation
│   └── requirements.txt      # Python dependencies
├── frontend/                 # Astro frontend
│   ├── src/                  # Source code
│   └── README.md             # Frontend documentation
├── docker-compose.yml        # Docker Compose configuration
├── docker-compose.gpu.yml    # GPU-enabled configuration
├── Makefile                  # Development commands
└── README.md                 # This file
- FastAPI: Modern, fast web framework for building APIs
- WhisperX: Advanced speech recognition with alignment capabilities
- FFmpeg: Audio processing and format conversion
- Pydantic: Data validation and serialization (see the sketch below)
- Uvicorn: ASGI server for production deployment
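To illustrate the Pydantic piece, a response model for a word-aligned segment could look like the sketch below; the class and field names are hypothetical, not taken from Azpidatzi's actual schema:
from typing import List
from pydantic import BaseModel

class Word(BaseModel):
    word: str
    start: float  # seconds
    end: float

class Segment(BaseModel):
    text: str
    start: float
    end: float
    words: List[Word]  # word-level timestamps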
Files, transcripts, and subtitles are stored under backend/data/:
- data/files/ - Uploaded audio files
- data/transcripts/ - Generated transcripts (JSON)
- data/subtitles/ - Generated subtitle files (SRT)
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality (see the example after this list)
- Run the test suite
- Submit a pull request
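For step 4, a minimal new backend test could look like the following sketch (the file name is hypothetical; the existing suite under backend/tests/ is the best reference for conventions):
# backend/tests/test_smoke.py (hypothetical file name)
from fastapi.testclient import TestClient
from app.main import app

client = TestClient(app)

def test_docs_are_served():
    # The interactive API docs are exposed at /docs
    response = client.get("/docs")
    assert response.status_code == 200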
This project is licensed under the GNU General Public License v3.0 (GPL-3.0).
See the LICENSE file for the full text.
For questions, issues, or contributions:
- Check the API Documentation for technical details
- Review this README for setup and development guidance
- Open an issue on the repository for bugs or feature requests
- WhisperX for the speech recognition model
- yt-dlp for the YouTube video downloader
- wscribe-editor for the subtitle editor, which has been forked and modified to support loading media and subtitle files via URL parameters, among other improvements
- FFmpeg for the audio processing


