---
layout: default
title: "Chapter 6: Language & Translation"
nav_order: 6
has_children: false
parent: Whisper.cpp Tutorial
---
Welcome to Chapter 6: Language & Translation. In this part of Whisper.cpp Tutorial: High-Performance Speech Recognition in C/C++, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs. This chapter covers multi-language support, translation mode, language detection, and speaker diarization with Whisper.cpp.
By the end of this chapter, you'll understand:
- How Whisper.cpp supports 99+ languages
- Automatic language detection and its API
- Translation mode (any language to English)
- Token-level language mechanics
- Speaker diarization techniques
- Building multilingual transcription applications
OpenAI trained Whisper on 680,000 hours of multilingual audio spanning 99 languages. The model learns a shared representation across all languages, which means a single model file can transcribe and translate dozens of languages without any per-language fine-tuning.
flowchart TD
A[Audio Input<br>Any Language] --> B[Mel Spectrogram]
B --> C[Encoder]
C --> D{Task Token}
D -->|Transcribe| E[Same-Language Text]
D -->|Translate| F[English Text]
G[Language Token] --> D
H[SOT Token] --> G
classDef input fill:#e1f5fe,stroke:#01579b
classDef encoder fill:#fff3e0,stroke:#ef6c00
classDef task fill:#f3e5f5,stroke:#4a148c
classDef output fill:#e8f5e8,stroke:#1b5e20
class A,B input
class C encoder
class D,G,H task
class E,F output
| Model | Languages | Size | Accuracy (en) | Use Case |
|---|---|---|---|---|
| tiny.en | English only | 39 MB | Good | Fast English-only |
| tiny | 99 languages | 39 MB | Good | Quick multilingual |
| base.en | English only | 74 MB | Better | English applications |
| base | 99 languages | 74 MB | Better | General multilingual |
| small.en | English only | 244 MB | Very good | Balanced English |
| small | 99 languages | 244 MB | Very good | Balanced multilingual |
| medium.en | English only | 769 MB | Excellent | High-quality English |
| medium | 99 languages | 769 MB | Excellent | High-quality multilingual |
| large-v3 | 99 languages | 1550 MB | Best | Maximum accuracy |
English-only models (.en suffix) are slightly more accurate for English because their entire capacity is devoted to one language. Use the multilingual models when you need to handle non-English audio or when you do not know the language in advance.
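To make the tradeoff concrete, here is a minimal sketch of a model-selection helper. The helper name, model list, and file paths are assumptions for illustration; the sizes mirror the table above.

```python
# Hypothetical helper: pick the largest ggml model that fits a size budget.
# Model names and sizes mirror the table above; file paths are assumptions.
def pick_model(english_only: bool, max_size_mb: int) -> str:
    # (name, size_mb) pairs from smallest to largest
    multilingual = [("tiny", 39), ("base", 74), ("small", 244),
                    ("medium", 769), ("large-v3", 1550)]
    english = [("tiny.en", 39), ("base.en", 74),
               ("small.en", 244), ("medium.en", 769)]
    candidates = english if english_only else multilingual
    chosen = None
    for name, size in candidates:
        if size <= max_size_mb:
            chosen = name  # keep upgrading while it still fits
    if chosen is None:
        raise ValueError("no model fits the size budget")
    return f"models/ggml-{chosen}.bin"

print(pick_model(english_only=False, max_size_mb=300))  # models/ggml-small.bin
```

The rule of thumb encoded here: prefer the largest model your memory budget allows, and switch to an `.en` variant only when every input is guaranteed to be English.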
Whisper detects language by examining the first 30 seconds of audio. The encoder produces audio features, and the decoder predicts language probabilities over all supported language tokens.
#include "whisper.h"
#include <cstdio>
#include <vector>
#include <algorithm>
#include <string>
struct LanguageProb {
std::string lang;
float probability;
};
// Detect language of an audio file
std::vector<LanguageProb> detect_language(
const char * model_path,
const float * audio,
int n_samples,
int top_k = 5
) {
struct whisper_context * ctx = whisper_init_from_file(model_path);
if (!ctx) return {};
// Process the first 30 seconds (or less)
const int max_samples = 16000 * 30;
int process_samples = std::min(n_samples, max_samples);
// Compute mel spectrogram
whisper_pcm_to_mel(ctx, audio, process_samples, 1);
// Run encoder
whisper_encode(ctx, 0, 4);
// Get language probabilities
std::vector<LanguageProb> results;
const int n_lang = whisper_lang_max_id() + 1;
// Allocate probability array
std::vector<float> probs(n_lang, 0.0f);
whisper_lang_auto_detect(ctx, 0, 4, probs.data());
for (int i = 0; i < n_lang; ++i) {
if (probs[i] > 0.001f) { // Filter very low probabilities
results.push_back({whisper_lang_str(i), probs[i]});
}
}
// Sort by probability (descending)
std::sort(results.begin(), results.end(),
[](const LanguageProb & a, const LanguageProb & b) {
return a.probability > b.probability;
});
// Keep top-k
if ((int)results.size() > top_k) {
results.resize(top_k);
}
whisper_free(ctx);
return results;
}
// Usage
void print_detected_languages(const char * model, const float * audio, int n) {
auto langs = detect_language(model, audio, n);
printf("Detected languages:\n");
for (const auto & lp : langs) {
printf(" %-10s %.1f%%\n", lp.lang.c_str(), lp.probability * 100.0f);
}
}

import numpy as np
def detect_language_python(model_path, audio_path):
"""Detect the spoken language using whisper-cpp-python."""
from whisper_cpp_python import Whisper
import librosa
# Load audio
audio, sr = librosa.load(audio_path, sr=16000)
# Initialize Whisper
whisper = Whisper(model_path)
# Detect language
result = whisper.detect_language(audio)
print("Language detection results:")
for lang, prob in sorted(result.items(), key=lambda x: -x[1])[:5]:
print(f" {lang:10s} {prob * 100:5.1f}%")
    return result

| Code | Language | Code | Language | Code | Language |
|---|---|---|---|---|---|
| en | English | zh | Chinese | de | German |
| es | Spanish | ru | Russian | ko | Korean |
| fr | French | ja | Japanese | pt | Portuguese |
| it | Italian | ar | Arabic | nl | Dutch |
| hi | Hindi | tr | Turkish | pl | Polish |
| uk | Ukrainian | sv | Swedish | cs | Czech |
| vi | Vietnamese | th | Thai | el | Greek |
| id | Indonesian | he | Hebrew | da | Danish |
| fi | Finnish | no | Norwegian | ro | Romanian |
Whisper supports 99 languages total. The table above shows the most commonly used ones.
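When language hints come from users or config files, it helps to normalize them to ISO codes before passing them to the API. A minimal sketch, using a subset of the table above (the helper name and the code-to-name map are assumptions):

```python
# Normalize user-supplied language hints to ISO codes (subset of the table).
LANG_CODES = {
    "en": "English", "zh": "Chinese", "de": "German",
    "es": "Spanish", "fr": "French",  "ja": "Japanese",
}
NAME_TO_CODE = {name.lower(): code for code, name in LANG_CODES.items()}

def normalize_lang(hint: str) -> str:
    """Accept either a code ('fr') or a name ('French'); return the code."""
    h = hint.strip().lower()
    if h in LANG_CODES:
        return h
    if h in NAME_TO_CODE:
        return NAME_TO_CODE[h]
    raise ValueError(f"unsupported language hint: {hint!r}")

print(normalize_lang("French"))  # fr
```

Validating hints up front gives a clear error at the boundary instead of a silent fallback inside the transcription call.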
Whisper has a built-in translation mode that converts speech in any supported language directly to English text. This is not a two-step process (transcribe then translate) -- the model generates English tokens directly from non-English audio features.
// Translate non-English audio to English
int translate_to_english(
const char * model_path,
const char * audio_file,
const char * source_lang // nullptr for auto-detect
) {
struct whisper_context * ctx = whisper_init_from_file(model_path);
if (!ctx) return 1;
std::vector<float> pcmf32;
if (!read_wav_file(audio_file, pcmf32)) {
whisper_free(ctx);
return 1;
}
// Configure for translation
struct whisper_full_params wparams =
whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
wparams.translate = true; // Enable translation to English
wparams.language = source_lang; // Source language (nullptr = auto)
wparams.n_threads = 4;
// Run translation
if (whisper_full(ctx, wparams, pcmf32.data(), pcmf32.size()) != 0) {
fprintf(stderr, "Translation failed\n");
whisper_free(ctx);
return 1;
}
// Print translated text with timestamps
const int n_segments = whisper_full_n_segments(ctx);
for (int i = 0; i < n_segments; ++i) {
const int64_t t0 = whisper_full_get_segment_t0(ctx, i);
const int64_t t1 = whisper_full_get_segment_t1(ctx, i);
const char * text = whisper_full_get_segment_text(ctx, i);
printf("[%02d:%02d.%03d --> %02d:%02d.%03d] %s\n",
(int)(t0 / 100) / 60, (int)(t0 / 100) % 60, (int)(t0 % 100) * 10,
(int)(t1 / 100) / 60, (int)(t1 / 100) % 60, (int)(t1 % 100) * 10,
text);
}
whisper_free(ctx);
return 0;
}

# Translate Spanish audio to English
./main -m models/ggml-medium.bin -f spanish_audio.wav --translate --language es
# Auto-detect language and translate
./main -m models/ggml-medium.bin -f unknown_audio.wav --translate
# Translate with SRT subtitle output
./main -m models/ggml-medium.bin -f french_audio.wav --translate -osrt

flowchart TD
A[Audio: French Speech] --> B[Encoder]
B --> C{Mode}
C -->|transcribe| D["Tokens: <fr> <transcribe>"]
D --> E["Output: Bonjour le monde"]
C -->|translate| F["Tokens: <fr> <translate>"]
F --> G["Output: Hello world"]
classDef input fill:#e1f5fe,stroke:#01579b
classDef process fill:#fff3e0,stroke:#ef6c00
classDef fr fill:#f3e5f5,stroke:#4a148c
classDef en fill:#e8f5e8,stroke:#1b5e20
class A input
class B,C process
class D,E fr
class F,G en
Whisper uses special tokens to control language and task behavior. Understanding these tokens is important for advanced usage.
// Important special tokens
void print_special_tokens(struct whisper_context * ctx) {
printf("Token IDs:\n");
printf(" SOT (start of transcript): %d\n", whisper_token_sot(ctx));
printf(" EOT (end of transcript): %d\n", whisper_token_eot(ctx));
printf(" SOT prev: %d\n", whisper_token_prev(ctx));
printf(" NOT (no timestamps): %d\n", whisper_token_not(ctx));
printf(" BEG (begin): %d\n", whisper_token_beg(ctx));
// Language tokens
printf("\nLanguage token IDs:\n");
const char * langs[] = {"en", "es", "fr", "de", "ja", "zh"};
for (const char * lang : langs) {
int id = whisper_lang_id(lang);
printf(" %-10s id=%d\n", lang, id);
}
}

Transcription: <SOT> <lang:fr> <transcribe> <notimestamps> Bonjour le monde <EOT>
Translation: <SOT> <lang:fr> <translate> <notimestamps> Hello world <EOT>
The language token tells the model what language the audio is in, and the task token tells it whether to transcribe (same language) or translate (to English).
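The decoder prompt can be sketched as a short list of special tokens. The token names below mirror the sequences shown above; in whisper.cpp the real integer IDs come from `whisper_token_sot()`, `whisper_lang_id()`, and friends, so this helper is purely illustrative.

```python
def build_prompt(lang: str, task: str, timestamps: bool = False) -> list:
    """Sketch of the conditioning prompt the Whisper decoder sees.

    Token *names* only -- whisper.cpp works with integer IDs obtained from
    whisper_token_sot(), whisper_lang_id(), etc.
    """
    assert task in ("transcribe", "translate")
    prompt = ["<SOT>", f"<lang:{lang}>", f"<{task}>"]
    if not timestamps:
        # Suppress timestamp tokens, matching <notimestamps> above
        prompt.append("<notimestamps>")
    return prompt

print(build_prompt("fr", "translate"))
# → ['<SOT>', '<lang:fr>', '<translate>', '<notimestamps>']
```

Swapping the single task token is the entire difference between transcription and translation; the audio features and the rest of the prompt are identical.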
#include "whisper.h"
#include <cstdio>
#include <stdexcept>
#include <vector>
#include <string>
#include <map>
struct TranscriptionResult {
std::string filename;
std::string detected_language;
float language_confidence;
std::string text;
std::string translation; // English translation (if applicable)
bool success;
};
class MultilingualTranscriber {
private:
struct whisper_context * ctx;
public:
MultilingualTranscriber(const char * model_path) {
ctx = whisper_init_from_file(model_path);
if (!ctx) throw std::runtime_error("Failed to load model");
}
~MultilingualTranscriber() {
if (ctx) whisper_free(ctx);
}
TranscriptionResult process(const std::string & audio_file) {
TranscriptionResult result;
result.filename = audio_file;
result.success = false;
// Load audio
std::vector<float> pcmf32;
if (!read_wav_file(audio_file.c_str(), pcmf32)) {
return result;
}
// Step 1: Detect language
whisper_pcm_to_mel(ctx, pcmf32.data(), pcmf32.size(), 1);
whisper_encode(ctx, 0, 4);
std::vector<float> probs(whisper_lang_max_id() + 1, 0.0f);
whisper_lang_auto_detect(ctx, 0, 4, probs.data());
int best_lang = 0;
for (int i = 1; i < (int)probs.size(); ++i) {
if (probs[i] > probs[best_lang]) best_lang = i;
}
result.detected_language = whisper_lang_str(best_lang);
result.language_confidence = probs[best_lang];
// Step 2: Transcribe in original language
struct whisper_full_params wparams =
whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
wparams.language = result.detected_language.c_str();
wparams.translate = false;
wparams.n_threads = 4;
if (whisper_full(ctx, wparams, pcmf32.data(), pcmf32.size()) == 0) {
const int n_seg = whisper_full_n_segments(ctx);
for (int i = 0; i < n_seg; ++i) {
result.text += whisper_full_get_segment_text(ctx, i);
}
}
// Step 3: Translate to English (if not already English)
if (result.detected_language != "en") {
wparams.translate = true;
if (whisper_full(ctx, wparams, pcmf32.data(), pcmf32.size()) == 0) {
const int n_seg = whisper_full_n_segments(ctx);
for (int i = 0; i < n_seg; ++i) {
result.translation += whisper_full_get_segment_text(ctx, i);
}
}
}
result.success = true;
return result;
}
// Process a batch of files
std::vector<TranscriptionResult> process_batch(
const std::vector<std::string> & files
) {
std::vector<TranscriptionResult> results;
for (const auto & f : files) {
printf("Processing: %s\n", f.c_str());
results.push_back(process(f));
}
return results;
}
};

import os
import json
from dataclasses import dataclass, asdict
from typing import List, Optional
@dataclass
class TranscriptionResult:
filename: str
detected_language: str
language_confidence: float
text: str
translation: Optional[str]
success: bool
class MultilingualPipeline:
"""Full multilingual transcription and translation pipeline."""
def __init__(self, model_path):
from whisper_cpp_python import Whisper
self.whisper = Whisper(model_path)
def process_file(self, audio_path) -> TranscriptionResult:
import librosa
audio, sr = librosa.load(audio_path, sr=16000)
# Detect language
lang_probs = self.whisper.detect_language(audio)
detected = max(lang_probs, key=lang_probs.get)
confidence = lang_probs[detected]
# Transcribe in original language
result = self.whisper.transcribe(
audio,
language=detected,
translate=False,
)
text = result.get("text", "").strip()
# Translate to English if needed
translation = None
if detected != "en":
trans_result = self.whisper.transcribe(
audio,
language=detected,
translate=True,
)
translation = trans_result.get("text", "").strip()
return TranscriptionResult(
filename=os.path.basename(audio_path),
detected_language=detected,
language_confidence=confidence,
text=text,
translation=translation,
success=True,
)
def process_directory(self, directory) -> List[TranscriptionResult]:
results = []
extensions = {".wav", ".mp3", ".m4a", ".flac", ".ogg"}
for fname in sorted(os.listdir(directory)):
if os.path.splitext(fname)[1].lower() in extensions:
path = os.path.join(directory, fname)
print(f"Processing: {fname}")
try:
results.append(self.process_file(path))
except Exception as e:
print(f" Error: {e}")
results.append(TranscriptionResult(
filename=fname,
detected_language="",
language_confidence=0.0,
text="",
translation=None,
success=False,
))
return results
def save_results(self, results, output_path):
data = [asdict(r) for r in results]
with open(output_path, "w", encoding="utf-8") as f:
json.dump(data, f, indent=2, ensure_ascii=False)
print(f"Results saved to {output_path}")
if __name__ == "__main__":
pipeline = MultilingualPipeline("models/ggml-medium.bin")
results = pipeline.process_directory("audio_files/")
pipeline.save_results(results, "transcriptions.json")
for r in results:
print(f"\n--- {r.filename} ---")
print(f"Language: {r.detected_language} ({r.language_confidence:.0%})")
print(f"Text: {r.text}")
if r.translation:
            print(f"English: {r.translation}")

Whisper itself does not perform speaker diarization (identifying who is speaking), but its per-segment timestamps and token-level timing make it straightforward to combine Whisper output with an external diarization model.
flowchart LR
A[Audio] --> B[Whisper<br>Transcription]
A --> C[Speaker<br>Embedding Model]
B --> D[Timestamped<br>Segments]
C --> E[Speaker<br>Clusters]
D --> F[Merge]
E --> F
F --> G[Speaker-Labeled<br>Transcript]
classDef whisper fill:#e1f5fe,stroke:#01579b
classDef diarize fill:#fff3e0,stroke:#ef6c00
classDef output fill:#e8f5e8,stroke:#1b5e20
class A,B,D whisper
class C,E diarize
class F,G output
import numpy as np
from sklearn.cluster import AgglomerativeClustering
class WhisperDiarizer:
"""Combine Whisper transcription with speaker diarization."""
def __init__(self, whisper_model, embedding_model_path=None):
from whisper_cpp_python import Whisper
self.whisper = Whisper(whisper_model)
# Load a speaker embedding model (e.g., speechbrain or resemblyzer)
self.embedding_model = self._load_embedding_model(embedding_model_path)
def _load_embedding_model(self, path):
"""Load a speaker embedding model for computing voice embeddings."""
try:
from resemblyzer import VoiceEncoder
return VoiceEncoder()
except ImportError:
print("Warning: resemblyzer not installed. "
"Install with: pip install resemblyzer")
return None
def process(self, audio_path, n_speakers=None, max_speakers=8):
import librosa
audio, sr = librosa.load(audio_path, sr=16000)
# Step 1: Transcribe with timestamps
result = self.whisper.transcribe(audio)
segments = result.get("segments", [])
if not segments:
return []
# Step 2: Extract speaker embeddings for each segment
embeddings = []
for seg in segments:
start_sample = int(seg["start"] * sr)
end_sample = int(seg["end"] * sr)
segment_audio = audio[start_sample:end_sample]
if len(segment_audio) < sr * 0.5:
# Pad short segments
segment_audio = np.pad(
segment_audio, (0, int(sr * 0.5) - len(segment_audio))
)
if self.embedding_model:
emb = self.embedding_model.embed_utterance(segment_audio)
else:
# Fallback: use simple MFCC-based features
emb = self._simple_embedding(segment_audio, sr)
embeddings.append(emb)
embeddings = np.array(embeddings)
# Step 3: Cluster speakers
if n_speakers is None:
# Try to estimate number of speakers
n_speakers = self._estimate_n_speakers(embeddings, max_speakers)
clustering = AgglomerativeClustering(
n_clusters=n_speakers
)
labels = clustering.fit_predict(embeddings)
# Step 4: Combine results
diarized = []
for i, seg in enumerate(segments):
diarized.append({
"speaker": f"SPEAKER_{labels[i]}",
"start": seg["start"],
"end": seg["end"],
"text": seg["text"],
})
return diarized
def _simple_embedding(self, audio, sr):
"""Simple MFCC-based embedding as a fallback."""
import librosa
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)
return np.mean(mfcc, axis=1)
def _estimate_n_speakers(self, embeddings, max_speakers):
"""Estimate number of speakers using silhouette score."""
from sklearn.metrics import silhouette_score
if len(embeddings) < 3:
return 1
best_score = -1
best_k = 2
for k in range(2, min(max_speakers + 1, len(embeddings))):
clustering = AgglomerativeClustering(n_clusters=k)
labels = clustering.fit_predict(embeddings)
score = silhouette_score(embeddings, labels)
if score > best_score:
best_score = score
best_k = k
return best_k
# Usage
if __name__ == "__main__":
diarizer = WhisperDiarizer("models/ggml-medium.bin")
results = diarizer.process("meeting.wav", n_speakers=3)
for entry in results:
print(f"[{entry['start']:.1f}s - {entry['end']:.1f}s] "
          f"{entry['speaker']}: {entry['text']}")

| Language | Model | WER (%) | Notes |
|---|---|---|---|
| English | large-v3 | 2.7 | Best performance |
| Spanish | large-v3 | 3.5 | Excellent |
| French | large-v3 | 4.1 | Excellent |
| German | large-v3 | 4.8 | Very good |
| Japanese | large-v3 | 5.2 | Character-level |
| Chinese | large-v3 | 6.1 | Character-level |
| Russian | large-v3 | 5.5 | Very good |
| Arabic | large-v3 | 8.3 | Good, RTL text |
| Hindi | large-v3 | 9.1 | Good |
| Korean | large-v3 | 6.8 | Character-level |
WER measurements vary by dataset. Numbers above are approximate and based on common benchmarks.
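WER itself is straightforward to compute: it is the word-level edit distance (substitutions + deletions + insertions) between reference and hypothesis, divided by the number of reference words. A minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / ref words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution (world -> word) and one deletion (are) over 5 words
print(wer("hello world how are you", "hello word how you"))  # 0.4
```

Note that word-level WER is misleading for Japanese, Chinese, and Korean, which is why the table above marks them "character-level": those numbers are character error rates computed the same way over characters instead of words.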
#include <cstring>

// Optimize parameters for non-English languages
struct whisper_full_params get_multilingual_params(const char * lang) {
struct whisper_full_params wparams =
whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
wparams.language = lang;
wparams.n_threads = 4;
// Beam search works better for complex languages
wparams.beam_search.beam_size = 5;
wparams.beam_search.patience = 1.0f;
// Higher temperature for languages with more ambiguity
if (strcmp(lang, "ja") == 0 || strcmp(lang, "zh") == 0 ||
strcmp(lang, "ko") == 0) {
// CJK languages benefit from slightly higher temperature
wparams.temperature = 0.2f;
wparams.max_initial_ts = 1.0f;
}
// Use longer segments for tonal languages
if (strcmp(lang, "zh") == 0 || strcmp(lang, "th") == 0 ||
strcmp(lang, "vi") == 0) {
wparams.max_len = 0; // No segment length limit
}
return wparams;
}

# Generate SRT subtitles in original language
./main -m models/ggml-medium.bin -f audio.wav --language fr -osrt
# Generate WebVTT subtitles with translation
./main -m models/ggml-medium.bin -f audio.wav --language fr --translate -ovtt
# Generate both original and translated subtitles
./main -m models/ggml-medium.bin -f audio.wav --language fr -osrt -otxt
./main -m models/ggml-medium.bin -f audio.wav --language fr --translate -osrt

// Generate SRT subtitle content
std::string generate_srt(struct whisper_context * ctx) {
std::string srt;
const int n_segments = whisper_full_n_segments(ctx);
for (int i = 0; i < n_segments; ++i) {
const int64_t t0 = whisper_full_get_segment_t0(ctx, i);
const int64_t t1 = whisper_full_get_segment_t1(ctx, i);
const char * text = whisper_full_get_segment_text(ctx, i);
// Subtitle index
srt += std::to_string(i + 1) + "\n";
// Timestamps in SRT format: HH:MM:SS,mmm
char ts0[32], ts1[32];
snprintf(ts0, sizeof(ts0), "%02d:%02d:%02d,%03d",
(int)(t0 / 100) / 3600,
((int)(t0 / 100) % 3600) / 60,
(int)(t0 / 100) % 60,
(int)(t0 % 100) * 10);
snprintf(ts1, sizeof(ts1), "%02d:%02d:%02d,%03d",
(int)(t1 / 100) / 3600,
((int)(t1 / 100) % 3600) / 60,
(int)(t1 / 100) % 60,
(int)(t1 % 100) * 10);
srt += std::string(ts0) + " --> " + std::string(ts1) + "\n";
srt += std::string(text) + "\n\n";
}
return srt;
}

Whisper.cpp's multilingual capabilities stem directly from the Whisper model's training on 680,000 hours of diverse audio. A single model file can detect, transcribe, and translate 99+ languages. The translation mode is particularly powerful because it produces English text directly from foreign-language audio in a single pass. Speaker diarization is not built in, but Whisper's precise timestamps make it straightforward to combine with external speaker-embedding models.
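One detail worth double-checking when writing exporters like `generate_srt` above: `whisper_full_get_segment_t0/t1` report times in 10 ms ticks, not milliseconds. The conversion can be sketched and sanity-checked in Python:

```python
def cs_to_srt(t: int) -> str:
    """Convert a whisper segment time (10 ms ticks) to SRT HH:MM:SS,mmm."""
    ms = t * 10                       # 10 ms ticks -> milliseconds
    h, rem = divmod(ms, 3_600_000)    # hours
    m, rem = divmod(rem, 60_000)      # minutes
    s, ms = divmod(rem, 1_000)        # seconds, leftover milliseconds
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(cs_to_srt(123456))  # 1234.56 s -> 00:20:34,560
```

The C++ version does the same arithmetic with `/ 100` and `% 100 * 10`; getting the tick unit wrong shifts every subtitle by a factor of ten.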
- Single Model, Many Languages: Multilingual Whisper models handle 99+ languages without per-language configuration
- Language Detection: Use `whisper_lang_auto_detect` to identify the spoken language before transcription
- Direct Translation: The `translate` flag produces English output from any supported language in one pass
- English-Only Models: The `.en` suffix models are slightly more accurate for English-only workloads
- Speaker Diarization: Combine Whisper timestamps with speaker embedding models for who-said-what output
- Beam Search for Multilingual: Beam search decoding generally outperforms greedy decoding for non-English languages
With multilingual transcription and translation covered, let's explore how to deploy Whisper.cpp across different platforms and programming languages. Continue to Chapter 7: Platform Integration.
Built with insights from the whisper.cpp project.
Most teams struggle here because the hard part is not writing more code, but drawing clear ownership boundaries for audio buffers, result objects, and shared context state so behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without clear rollback or observability strategy
After working through this chapter, you should be able to reason about Chapter 6: Language & Translation as an operating subsystem inside Whisper.cpp Tutorial: High-Performance Speech Recognition in C/C++, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes around `language`, `wparams`, and `result` as your checklist when adapting these patterns to your own repository.
Under the hood, Chapter 6: Language & Translation usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for `audio`.
- Input normalization: shape incoming data so `results` receives stable contracts.
- Core execution: run the main logic branch and propagate intermediate state through `self`.
self. - Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit logs/metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
Use the whisper.cpp repository on GitHub as the upstream source for verifying implementation details while reading this chapter.
Suggested trace strategy:
- search upstream code for `audio` and `results` to map concrete implementation paths
- compare docs claims against actual runtime/config code before reusing patterns in production