Real-Time Speech Translation

What this is

This repository contains a minimal real-time speech translation pipeline built using open-source ASR and machine translation models. The focus is on streaming behavior, latency, and incremental decoding, rather than offline batch translation.

The application is implemented as a Gradio app and is intended to be run from a Google Colab notebook for ease of reproduction.

System overview

At a high level, the system works as follows:

Audio is captured from the microphone in short chunks.
Incoming audio is buffered and incrementally transcribed using a streaming ASR setup.
Partial transcripts are translated on the fly into a target language.
Transcription and translation outputs are updated continuously in the UI.

This mirrors the constraints of real-world speech translation systems, where both ASR and MT operate on incomplete context.
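
The flow described above can be sketched as a small generator. This is a minimal illustration, not the code in this repository: `stream_translate`, `fake_transcribe`, and `fake_translate` are hypothetical names, and the two stand-in functions replace the real ASR and MT models so the example runs without downloading anything.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Chunk = List[float]  # one short block of mono audio samples


def stream_translate(
    chunks: Iterable[Chunk],
    transcribe: Callable[[List[float]], str],
    translate: Callable[[str], str],
) -> Iterator[Tuple[str, str]]:
    """Yield (partial_transcript, partial_translation) after every chunk."""
    buffer: List[float] = []
    for chunk in chunks:
        buffer.extend(chunk)                 # buffer the newly captured audio
        transcript = transcribe(buffer)      # transcribe everything heard so far
        translation = translate(transcript)  # translate the partial transcript
        yield transcript, translation        # caller refreshes the UI with both


# Stand-in models (assumptions for illustration only).
def fake_transcribe(samples: List[float]) -> str:
    words = ["hello", "how", "are", "you"]
    return " ".join(words[: len(samples) // 2])


def fake_translate(text: str) -> str:
    return text.upper()


updates = list(stream_translate([[0.0, 0.0]] * 3, fake_transcribe, fake_translate))
for transcript, translation in updates:
    print(transcript, "|", translation)
```

Re-transcribing the whole buffer on every chunk keeps the sketch simple but grows more expensive as the buffer grows; a real streaming setup would normally bound or segment the buffer.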

Language support

The current setup assumes English speech input and supports translation into the following languages:

Bengali
Hindi
Kannada
Malayalam
Marathi
Tamil
Telugu

Language selection only affects the translation stage.
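
One way to wire the language selection into the translation stage is a lookup from the displayed name to a language code. The sketch below assumes the translator accepts ISO 639-1 codes; the actual codes depend on the MT model in use, and `TARGET_LANGUAGES` and `resolve_target` are hypothetical names.

```python
# ISO 639-1 codes for the supported target languages (assumed code scheme).
TARGET_LANGUAGES = {
    "Bengali": "bn",
    "Hindi": "hi",
    "Kannada": "kn",
    "Malayalam": "ml",
    "Marathi": "mr",
    "Tamil": "ta",
    "Telugu": "te",
}


def resolve_target(language: str) -> str:
    """Map a UI language name to its code, rejecting unsupported choices."""
    key = language.strip().title()
    if key not in TARGET_LANGUAGES:
        supported = ", ".join(sorted(TARGET_LANGUAGES))
        raise ValueError(f"Unsupported target language {language!r}; choose one of: {supported}")
    return TARGET_LANGUAGES[key]


print(resolve_target("Hindi"))
```

The ASR stage never sees this value, which matches the note above that the selection only affects translation.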

ASR scope and limitations

Only English ASR is enabled by default.
While Indic-language speech recognition can be integrated using IndicConformer models, doing so requires NVIDIA NeMo, which significantly increases installation time and complexity, particularly in Colab environments. For this reason, Indic ASR is intentionally excluded from this demo.

Demo

▶ Watch Demo Video

Known limitations

Input language: The system expects English speech. Non-English input will produce unreliable transcriptions and translations.
Speech style sensitivity: Translation quality is noticeably better for structured, fluent speech (e.g., prepared talks) than for spontaneous conversational speech. This is a known limitation of real-time MT systems and not specific to this implementation.
Latency: End-to-end latency depends on network conditions and model inference time. Since audio is streamed over the network, unstable connections will directly impact responsiveness.
Context fragmentation: Because transcription and translation operate on partial audio segments, sentence boundaries are not always preserved, which can affect translation coherence.
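
A common way to reduce context fragmentation is to translate only sentences that look finished and hold back the trailing fragment until more audio arrives. The sketch below illustrates that idea; it is not what this repository does. It assumes the ASR output includes sentence-final punctuation, and `split_committed` is a hypothetical helper.

```python
import re
from typing import List, Tuple

# Split on whitespace that follows sentence-final punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_committed(transcript: str) -> Tuple[List[str], str]:
    """Return (finished sentences, trailing unfinished fragment)."""
    parts = _SENTENCE_END.split(transcript.strip())
    if parts and re.search(r"[.!?]$", parts[-1]):
        return parts, ""          # the last piece also ends a sentence
    return parts[:-1], parts[-1]  # hold the last piece back for later


done, pending = split_committed("I am here. Where are")
print(done, "|", pending)
```

This heuristic trades latency for coherence: the held-back fragment is not translated until its sentence ends. It also misfires on abbreviations such as "Dr." and on ASR output without punctuation.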

Intended use

This project is meant as:

A reference for building streaming ASR + MT pipelines
A testbed for analyzing latency vs. quality trade-offs
A practical demonstration of real-time speech translation constraints
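
For the latency analysis mentioned above, a small timing wrapper around each stage is often enough to start with. The following is a sketch under stated assumptions: `StageTimer` is a hypothetical helper, and it measures only the time spent inside the wrapped call, not network delay between the browser and the server.

```python
import time
from statistics import mean
from typing import Any, Callable, Dict, List


class StageTimer:
    """Collect wall-clock durations per pipeline stage."""

    def __init__(self) -> None:
        self.samples: Dict[str, List[float]] = {}

    def timed(self, stage: str, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        """Run fn, record how long it took under the given stage name."""
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.samples.setdefault(stage, []).append(time.perf_counter() - start)
        return result

    def summary(self) -> Dict[str, float]:
        """Mean duration in seconds for each stage seen so far."""
        return {stage: mean(values) for stage, values in self.samples.items()}


timer = StageTimer()
# str.upper stands in for a real ASR or MT call.
result = timer.timed("mt", str.upper, "hello")
print(result, timer.summary())
```

Recording per-stage means alongside a translation quality score for the same inputs gives a simple basis for comparing chunk sizes or models.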

CrazyCyberbug/Realtime-translation
