Natq is an open-source Arabic Text-to-Speech (TTS) system that uses deep learning models to convert Arabic text into natural-sounding speech.
We designed and trained multiple architectures on public datasets with a focus on Arabic language intricacies, including diacritization and phoneme modeling.
Graduation Project – Faculty of Computing and Artificial Intelligence, Helwan University
Department of Artificial Intelligence
Supervised by: Dr. Yasser Hifny & Dr. Ahmed Hesham
- Create a modular and efficient Arabic TTS system using public datasets.
- Handle Arabic-specific challenges such as diacritics and phoneme mapping.
- Compare performance of different spectrogram generators and vocoders.
- Support live speech synthesis via a web app (React + FastAPI).
The system pipeline is composed of:
- Text Preprocessing
- Diacritization: CATT Transformer
- Grapheme-to-Phoneme (G2P): Nawar Halabi’s Phonetiser
- Spectrogram Generation: FastPitch, FastSpeech2, Mixer-TTS, or Spark-TTS
- Vocoder: HiFi-GAN for waveform synthesis
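The stages above can be sketched as a simple chain of functions. This is a minimal illustration only: the function names, the toy phoneme table, and the frame/sample sizes are hypothetical placeholders, not the project's actual API or models.

```python
# Minimal sketch of the Natq pipeline stages. Every name and value here
# is an illustrative stand-in, not the real CATT / G2P / TTS models.

def diacritize(text: str) -> str:
    """Stand-in for the CATT diacritization model (identity here)."""
    return text

def graphemes_to_phonemes(text: str) -> list[str]:
    """Stand-in for the G2P step: map each character to a phoneme symbol."""
    toy_table = {"b": "b", "a": "a", "s": "s", "m": "m", " ": "sil"}
    return [toy_table.get(ch, "unk") for ch in text]

def phonemes_to_spectrogram(phonemes: list[str]) -> list[list[float]]:
    """Stand-in for FastPitch/FastSpeech2/Mixer-TTS: one dummy frame per phoneme."""
    return [[0.0] * 80 for _ in phonemes]  # 80 mel bins is a common choice

def vocode(spectrogram: list[list[float]]) -> list[float]:
    """Stand-in for HiFi-GAN: upsample each frame to 256 waveform samples."""
    return [0.0 for _frame in spectrogram for _ in range(256)]

def synthesize(text: str) -> list[float]:
    return vocode(phonemes_to_spectrogram(graphemes_to_phonemes(diacritize(text))))

audio = synthesize("basm")
print(len(audio))  # 4 phonemes * 256 samples per frame = 1024
```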
| Dataset | Hours | Diacritized | Accent | Notes |
|---|---|---|---|---|
| Arabic_Speech_Corpus | 4.1 | ✅ | Levantine | Open-source |
| ClArTTS | 12 | ✅ | Classical MSA | Based on a LibriVox audiobook |
- FastPitch – Parallel and pitch-controllable TTS
- FastSpeech2 – Variance-aware and efficient
- Mixer-TTS – MLP-Mixer based parallel synthesis
- Spark-TTS – End-to-end LLM-based TTS with zero-shot speaker cloning
- HiFi-GAN – Fast, high-fidelity waveform generation
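The spectrogram generators above predict a time–frequency representation that the vocoder turns back into a waveform. As a rough illustration of that intermediate representation, a magnitude spectrogram can be computed with a plain windowed FFT (numpy only; the frame and hop sizes are arbitrary example values, not the project's training configuration):

```python
import numpy as np

def magnitude_spectrogram(wave, n_fft=512, hop=128):
    """Frame the signal, apply a Hann window, and take |FFT| per frame."""
    frames = []
    for start in range(0, len(wave) - n_fft + 1, hop):
        frame = wave[start:start + n_fft] * np.hanning(n_fft)
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.stack(frames)  # shape: (num_frames, n_fft // 2 + 1)

# One second of a 440 Hz tone at 16 kHz: energy concentrates in one bin.
sr = 16000
t = np.arange(sr) / sr
spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
peak_bin = int(spec.mean(axis=0).argmax())
print(peak_bin)  # expected near 440 * n_fft / sr ≈ 14
```

Real TTS systems apply a mel filterbank and log compression on top of this, but the frame/FFT structure is the same.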
- Spectrogram quality
- Inference time
- Model size
- Mean Opinion Score (MOS) from human listeners
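MOS is the mean of 1–5 listener ratings, usually reported with a confidence interval; a minimal computation using only the standard library (the ratings below are made-up example numbers, not the project's results):

```python
import statistics

def mos(ratings):
    """Mean Opinion Score with an approximate 95% confidence interval."""
    mean = statistics.mean(ratings)
    # 1.96 * standard error: a normal approximation, fine for larger samples.
    ci = 1.96 * statistics.stdev(ratings) / len(ratings) ** 0.5
    return mean, ci

# Hypothetical ratings for one synthesized sample (1 = bad, 5 = excellent).
score, ci = mos([4, 5, 3, 4, 4, 5, 4, 3])
print(f"MOS = {score:.2f} ± {ci:.2f}")  # MOS = 4.00 ± 0.52
```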
| Model | Dataset | Sample |
|---|---|---|
| FastPitch | ClArTTS | |
| Mixer-TTS | ASC | |
| Spark-TTS | ClArTTS | |
| VITS_facebook | — | |
| T5_MBZUAI | — | |
| FastSpeech2 | — | |
d:/coding/Natq/
├── .gitignore
├── FastPitch/
│ ├── .vscode/
│   │   └── settings.json
│ ├── arabic_phoneme_tokenizer.py
│ ├── catt/
│ ├── custom_arabic_to_phones.py
│ └── ...
├── FastSpeech2/
│ ├── arabic_phoneme_tokenizer.py
│ ├── catt/
│ └── ...
├── Media/
│ ├── audio_samples/
│ │ ├── FastPitch-TTS_MOS/
│ │ │ ├── bism_fp.mp4
│ │ │ ├── bism_fp.wav
│ │ │ └── ...
│ │ ├── Mixer-TTS_MOS/
│ │ │ ├── bism_mix.mp4
│ │ │ ├── bism_mix.wav
│ │ │ └── ...
│ │ └── Spark-TTS_MOS/
│ │ ├── bism_spark.mp4
│ │ ├── bism_spark.wav
│ │ └── ...
│ └── images/
│ ├── Demo.jpg
│ └── Fully_Logo_White.png
├── Natq-Frontend/
│ ├── .gitignore
│ ├── eslint.config.js
│ └── ...
├── README.md
└── Spark/
├── catt/
├── download_model_files.ipynb
└── ...
We built a simple TTS web interface with a FastAPI backend and a React frontend.
🧪 Test it locally:

    # Backend (FastAPI)
    cd backend
    uvicorn main:app --reload

    # Frontend (React)
    cd ../app
    npm install
    npm start




