- Clone the repository
git clone https://github.com/natlamir/MeloTTS-Windows.git
cd MeloTTS-Windows
- Create conda environment and install dependencies
conda env create -f environment.yml
conda activate melotts-win
pip install -e .
python -m unidic download
If you have trouble downloading with `python -m unidic download`, you can try this instead:
- Download the zip from: https://cotonoha-dic.s3-ap-northeast-1.amazonaws.com/unidic-3.1.0.zip
- Place it in: C:\Users\YOUR_USER_ID\miniconda3\envs\melotts-win\Lib\site-packages\unidic
- Rename it to unidic.zip
- Replace the `download.py` file in this same directory with the one from https://github.com/natlamir/ProjectFiles/blob/main/melotts/download.py
- Now re-run `python -m unidic download`. This info was originally found in: myshell-ai#62 (comment)
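If you prefer to script that manual workaround, here is a minimal sketch; the site-packages path below is the example path from the step above, so adjust it to your own environment:
```python
import urllib.request
from pathlib import Path

# Example path from the steps above -- adjust YOUR_USER_ID / env name to your setup.
unidic_dir = Path(r"C:\Users\YOUR_USER_ID\miniconda3\envs\melotts-win\Lib\site-packages\unidic")

# Download the dictionary archive and save it as unidic.zip, which the replacement
# download.py presumably uses instead of re-downloading.
url = "https://cotonoha-dic.s3-ap-northeast-1.amazonaws.com/unidic-3.1.0.zip"
urllib.request.urlretrieve(url, unidic_dir / "unidic.zip")
```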
- Install PyTorch
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
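After installing, a quick way to confirm that the CUDA build of PyTorch is active:
```python
import torch

# Prints True plus the GPU name if the CUDA-enabled build installed correctly.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```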
- Prepare faster-whisper (optional, for fast transcription of audio files):
- Download the cuBLAS/cuDNN libraries here: https://github.com/Purfview/whisper-standalone-win/releases/download/libs/cuBLAS.and.cuDNN_CUDA11_win_v2.7z, extract the archive, and place the 5 DLL files directly into the `MeloTTS-Windows/melo/` folder.
- To install faster-whisper (and prevent conflicts with it), run this from the conda window:
pip install faster-whisper==0.9.0
pip install transformers==4.30.2 huggingface_hub==0.16.4
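To sanity-check the faster-whisper install (it is what `transcript_fast.bat` uses later for transcription), here is a minimal sketch; the model size and audio path are only examples:
```python
from faster_whisper import WhisperModel

# "large-v2" and the audio path are example values; any supported model size works.
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

segments, info = model.transcribe("data/example/audio/sample.wav")
print("Detected language:", info.language)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```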
- Run using:
melo-ui
- In the `melo/data/example` folder, delete the example `metadata.list` file.
- MeloTTS expects wav audio files (with a sample rate of 44100Hz). If you need to convert audio to wav format (with a 44100Hz sample rate), create a folder called `audio` in the example folder and copy all your audio files into the `audio` folder.
- With a conda window (with the environment activated) open in the `melo` folder, run `ConvertAudiotoWav.bat` from the conda prompt. This will create a folder `data/example/wavs` with all of the converted wav files.
- Create a transcript file by running `transcript_fast.bat`, which will create a `data/example/metadata.list` file using faster-whisper. Alternatively, you can run `python transcript.py` to use the original whisper.
- Run `python preprocess_text.py --metadata data/example/metadata.list` to create `train.list`, `config.json`, and other files in the `data/example` folder.
- Modify `config.json` to change the batch size, epochs, learning rate, etc.
⚠️ Important, if you plan to resume training later:
- The `eval_interval` setting determines how frequently your model is saved during training.
- For example, if `eval_interval=1000`, the model saves only once every 1000 steps.
- If you stop training between save points, any progress since the last save will be lost.
- For safer training sessions that you may need to resume later, use a smaller `eval_interval` value.
- You can also adjust `n_ckpts_to_keep` to limit the maximum number of models kept (if `n_ckpts_to_keep=5`, it will delete the oldest models when there are more than 5 saved models).
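If you would rather script these tweaks than edit `config.json` by hand, here is a small sketch; the key names follow the notes above, but check your generated `config.json` for the exact layout, since these settings are often nested under a `train` section in VITS-style configs:
```python
import json
from pathlib import Path

cfg_path = Path("data/example/config.json")  # created by preprocess_text.py above
cfg = json.loads(cfg_path.read_text(encoding="utf-8"))

# Fall back to the top level if your file has no "train" section.
train = cfg.get("train", cfg)
train["eval_interval"] = 100     # example: save checkpoints more often
train["n_ckpts_to_keep"] = 5     # example: keep at most 5 checkpoints

cfg_path.write_text(json.dumps(cfg, indent=2, ensure_ascii=False), encoding="utf-8")
```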
- From the conda prompt, run `train.bat` to start the training.
- Files will be created within the `data/example/config` folder with the checkpoints and other logging information.
- To test out a checkpoint, run `python infer.py --text "this is a test" -m "C:\ai\MeloTTS-Windows\melo\data\example\config\G_0.pth" -o output`, changing the `G_0` to the checkpoint you want to test: `G_1000`, `G_2000`, etc.
- When you want to use a checkpoint from the UI, create a `melo/custom` folder and copy the .pth and `config.json` files over from `data/example/config`, rename the .pth to a user-friendly name, and launch the UI to see it in the custom voice dropdown.
- To see the tensorboard, install it with `pip install tensorflow`.
- Run `tensorboard --logdir=data\example\config`.
- This will give you the local URL to view the tensorboard.
- From the conda prompt, run `train.bat` again to resume the training. The training will resume from the newest `G_XXXX.pth` file.
You can trim your model to get a much smaller file size (which will make it load faster during the model loading process). When testing, this made the model file size about 66% smaller. Note that the trimmed model is for inference only (using the model just to generate audio from text) and you won't be able to train it further.
- Open the `trim_models.bat` file in a text editor to change the directory to your `G_XXXX.pth` files and the save location, save the changes, then run `trim_models.bat` to create a trimmed model for inference only.
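For a rough idea of what such trimming typically does (this is not the exact logic of `trim_models.bat`), here is a sketch that assumes a VITS-style checkpoint dict with a `model` state_dict plus training-only optimizer state, and keeps just the weights needed for inference:
```python
import torch

src = "data/example/config/G_1000.pth"       # example checkpoint path
dst = "data/example/config/G_1000_slim.pth"  # example output path

ckpt = torch.load(src, map_location="cpu")

# Assumption: the checkpoint stores the weights under "model" and the rest
# (optimizer state, learning-rate schedule, ...) is only needed to continue
# training. Keeping just the weights shrinks the file but makes it inference-only.
slim = {"model": ckpt["model"], "iteration": ckpt.get("iteration", 0)}
torch.save(slim, dst)

print(f"Saved inference-only checkpoint to {dst}")
```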
MeloTTS is a high-quality multi-lingual text-to-speech library by MIT and MyShell.ai. Supported languages include:
| Language | Example |
|---|---|
| English (American) | Link |
| English (British) | Link |
| English (Indian) | Link |
| English (Australian) | Link |
| English (Default) | Link |
| Spanish | Link |
| French | Link |
| Chinese (mix EN) | Link |
| Japanese | Link |
| Korean | Link |
Some other features include:
- The Chinese speaker supports mixed Chinese and English.
- Fast enough for CPU real-time inference.
The Python API and model cards can be found in this repo or on HuggingFace.
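For reference, here is a minimal sketch of the Python API based on the upstream MeloTTS examples; the speaker key and output path are examples, so check the repo's API docs for the exact options:
```python
from melo.api import TTS

# Example values; see the upstream API documentation for all options.
speed = 1.0
device = "auto"  # picks the GPU automatically when one is available
text = "MeloTTS can synthesize this sentence in several English accents."

model = TTS(language="EN", device=device)
speaker_ids = model.hps.data.spk2id  # e.g. EN-US, EN-BR, EN-AU, EN-Default

model.tts_to_file(text, speaker_ids["EN-US"], "en-us.wav", speed=speed)
```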
Discord
Join our Discord community and select the Developer role upon joining to gain exclusive access to our developer-only channel! Don't miss out on valuable discussions and collaboration opportunities.
Contributing
If you find this work useful, please consider contributing to this repo.
- Many thanks to @fakerybakery for adding the Web UI and CLI part.
- Wenliang Zhao at Tsinghua University
- Xumin Yu at Tsinghua University
- Zengyi Qin at MIT and MyShell
Citation
@software{zhao2024melo,
author={Zhao, Wenliang and Yu, Xumin and Qin, Zengyi},
title = {MeloTTS: High-quality Multi-lingual Multi-accent Text-to-Speech},
url = {https://github.com/myshell-ai/MeloTTS},
year = {2023}
}
This library is under MIT License, which means it is free for both commercial and non-commercial use.
This implementation is based on TTS, VITS, VITS2 and Bert-VITS2. We appreciate their awesome work.
