FastSpeech 2 - PyTorch Implementation

This is a modified PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. It aims to expose a FastSpeech 2 model trained on AISHELL-3 through an API. This project is based on xcmyz's implementation of FastSpeech and ming024's implementation of FastSpeech 2. Feel free to use or modify the code.

There are several versions of FastSpeech 2. This implementation is more similar to version 1, which uses F0 values as the pitch features. On the other hand, pitch spectrograms extracted by continuous wavelet transform are used as the pitch features in the later versions.

Updates

2024/1/12: Implement APIWrapper
2021/7/8: Release the checkpoint and audio samples of a multi-speaker English TTS model trained on LibriTTS
2021/2/26: Support English and Mandarin TTS
2021/2/26: Support multi-speaker TTS (AISHELL-3 and LibriTTS)
2021/2/26: Support MelGAN and HiFi-GAN vocoder

Audio Samples

Audio samples generated by this implementation can be found here.

Quickstart

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Inference

You have to download the pretrained models and put them in output/ckpt/AISHELL3.

For Mandarin multi-speaker TTS, try

python3 synthesize.py --text "大家好" --speaker_id SPEAKER_ID --restore_step 600000 --mode single -p config/AISHELL3/preprocess.yaml -m config/AISHELL3/model.yaml -t config/AISHELL3/train.yaml

The generated utterances will be put in output/result/.
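If you would rather drive synthesis from Python than from the shell, the command above can be assembled programmatically. The following is a minimal sketch and not part of this repository: the helper build_synthesize_cmd is hypothetical, the speaker ID 0 is only a placeholder, and the sketch uses nothing beyond the flags shown in the command above.

```python
import shlex


def build_synthesize_cmd(text, speaker_id, restore_step, dataset="AISHELL3"):
    """Return the argv list for single-utterance synthesis (hypothetical helper)."""
    cfg = f"config/{dataset}"
    return [
        "python3", "synthesize.py",
        "--text", text,
        "--speaker_id", str(speaker_id),
        "--restore_step", str(restore_step),
        "--mode", "single",
        "-p", f"{cfg}/preprocess.yaml",
        "-m", f"{cfg}/model.yaml",
        "-t", f"{cfg}/train.yaml",
    ]


if __name__ == "__main__":
    # speaker_id=0 is a placeholder; use a speaker that exists in your data.
    cmd = build_synthesize_cmd("大家好", speaker_id=0, restore_step=600000)
    print(shlex.join(cmd))
    # To actually run it from the repository root:
    # import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell-quoting problems with non-ASCII text.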

Batch Inference

Batch inference is also supported. Try

python3 synthesize.py --source preprocessed_data/AISHELL3/val.txt --restore_step 900000 --mode batch -p config/AISHELL3/preprocess.yaml -m config/AISHELL3/model.yaml -t config/AISHELL3/train.yaml

to synthesize all utterances in preprocessed_data/AISHELL3/val.txt.

Controllability

The pitch, volume, and speaking rate of the synthesized utterances can be controlled by specifying the desired pitch, energy, and duration ratios. For example, one can shorten the phoneme durations by 20% (that is, speak faster) and decrease the volume by 20% with

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single -p config/AISHELL3/preprocess.yaml -m config/AISHELL3/model.yaml -t config/AISHELL3/train.yaml --duration_control 0.8 --energy_control 0.8
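To make the effect of the duration ratio concrete, the sketch below shows how such a ratio might be applied to per-phoneme durations before phoneme-level features are expanded to frame level. It is a simplified, pure-Python illustration under assumptions: the function names, the made-up durations, and the rounding and clamping choices are hypothetical and may differ from what this repository's model actually does.

```python
def scale_durations(durations, ratio):
    """Scale per-phoneme frame counts by `ratio`, rounding and clamping at 0."""
    return [max(0, round(d * ratio)) for d in durations]


def length_regulate(phoneme_feats, durations):
    """Repeat each phoneme-level item `d` times to get a frame-level sequence."""
    frames = []
    for feat, d in zip(phoneme_feats, durations):
        frames.extend([feat] * d)
    return frames


if __name__ == "__main__":
    phonemes = ["d", "a4", "j", "ia1", "h", "ao3"]  # illustrative labels only
    durations = [5, 10, 5, 10, 5, 15]               # made-up frame counts
    fast = scale_durations(durations, 0.8)
    print(len(length_regulate(phonemes, durations)))  # 50
    print(len(length_regulate(phonemes, fast)))       # 40
```

With a ratio of 0.8 the utterance is 20% shorter in frames, which corresponds to speaking about 1.25 times as fast.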

Training

Datasets

The supported dataset is

AISHELL-3: a Mandarin TTS dataset with 218 male and female speakers, roughly 85 hours in total.

Preprocessing

First, run

python3 prepare_align.py config/AISHELL3/preprocess.yaml

for some preparations.

As described in the paper, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Alignments of the supported datasets are provided here. You have to unzip the files into preprocessed_data/AISHELL3/TextGrid/.

After that, run the preprocessing script by

python3 preprocess.py config/AISHELL3/preprocess.yaml

Alternatively, you can align the corpus yourself. Download the official MFA package and run

./montreal-forced-aligner/bin/mfa_align raw_data/AISHELL3/ lexicon/librispeech-lexicon.txt english preprocessed_data/AISHELL3

or

./montreal-forced-aligner/bin/mfa_train_and_align raw_data/AISHELL3/ lexicon/librispeech-lexicon.txt preprocessed_data/AISHELL3

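to align the corpus, and then run the preprocessing script. Note that the commands above reference an English lexicon (and, in the first command, the english acoustic model); for the Mandarin AISHELL-3 corpus, a Mandarin lexicon and acoustic model would presumably be needed instead.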

python3 preprocess.py config/AISHELL3/preprocess.yaml

Training

Train your model with

python3 train.py -p config/AISHELL3/preprocess.yaml -m config/AISHELL3/model.yaml -t config/AISHELL3/train.yaml

The model takes fewer than 10k steps (less than 1 hour on my GTX 1080 Ti GPU) of training to generate audio samples of acceptable quality, which is much more efficient than autoregressive models such as Tacotron 2.

Implementation Issues

Following xcmyz's implementation, I use an additional Tacotron-2-styled Post-Net after the decoder, which is not used in the original FastSpeech 2.
Gradient clipping is used in the training.
In my experience, using phoneme-level pitch and energy prediction instead of frame-level prediction results in much better prosody, and normalizing the pitch and energy features also helps. Please refer to config/README.md for more details.
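The phoneme-level features and normalization mentioned above can be illustrated as follows. This is a minimal sketch under assumptions, not the repository's preprocessing code: the function names and the F0 values are made up, and real pipelines typically also handle unvoiced frames and compute the normalization statistics over the whole dataset rather than over a single utterance.

```python
from statistics import mean, pstdev


def phoneme_level_average(frame_values, durations):
    """Average a frame-level contour (e.g. F0) over each phoneme's frames."""
    out, pos = [], 0
    for d in durations:
        segment = frame_values[pos:pos + d]
        out.append(mean(segment) if segment else 0.0)
        pos += d
    return out


def normalize(values):
    """Z-score normalize values to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


if __name__ == "__main__":
    f0 = [100.0, 110.0, 120.0, 200.0, 210.0]  # made-up frame-level F0 in Hz
    durations = [3, 2]                        # frames per phoneme
    per_phoneme = phoneme_level_average(f0, durations)
    print(per_phoneme)             # [110.0, 205.0]
    print(normalize(per_phoneme))  # [-1.0, 1.0]
```

Averaging over each phoneme gives the predictor one target per input token instead of one per frame, which is one plausible reason the phoneme-level variant is easier to learn.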

Please inform me if you find any mistakes in this repo, or any useful tips to train the FastSpeech 2 model.
