Training a speech recogniser using data from speech synthesis
Domain: Audio
Dr. Prasanta Kumar Ghosh (prasantg@iisc.ac.in)
Short Description:
Generally, training automatic speech recognition (ASR) systems requires paired speech and text data. In this problem statement, you will train an ASR system using only text. This is done by using a pre-trained multi-speaker text-to-speech (TTS) model to synthesise speech from the text. The generated speech, together with the corresponding text, is then used to train the ASR system. The trained ASR will be evaluated on unseen sentences, for speakers both seen and unseen by the multi-speaker TTS system.
Pretrained multispeaker TTS using coqui-ai:
https://huggingface.co/projecte-aina/tts-ca-coqui-vits-multispeaker
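As a starting point, the synthetic training set can be generated with the Coqui TTS Python API. The sketch below is only a minimal illustration, not a prescribed recipe: the checkpoint and config filenames are assumptions about the Hugging Face repo layout, the Catalan sentences are placeholders for the text-only corpus, and it assumes the loaded model exposes its speaker list via tts.speakers.

# Minimal sketch: synthesise paired (speech, text) data with the pretrained
# multi-speaker Coqui VITS model and write a manifest for ASR training.
# NOTE: "best_model.pth" and "config.json" are assumed filenames in the
# Hugging Face repo; adjust them to the files actually published there.
import csv
import soundfile as sf
from huggingface_hub import hf_hub_download
from TTS.api import TTS

REPO = "projecte-aina/tts-ca-coqui-vits-multispeaker"
model_path = hf_hub_download(REPO, "best_model.pth")    # assumed filename
config_path = hf_hub_download(REPO, "config.json")      # assumed filename
tts = TTS(model_path=model_path, config_path=config_path, progress_bar=False)

# Placeholder sentences; in practice these come from the text-only corpus.
sentences = ["Bon dia, com estàs?", "Avui fa un dia molt assolellat."]

with open("synthetic_manifest.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "duration", "wav", "words", "speaker"])
    for spk in tts.speakers:          # speakers known to the multi-speaker TTS
        for i, text in enumerate(sentences):
            wav_path = f"synth_{spk}_{i}.wav"
            tts.tts_to_file(text=text, speaker=spk, file_path=wav_path)
            writer.writerow([f"{spk}_{i}", sf.info(wav_path).duration,
                             wav_path, text, spk])

Holding out some speakers and some sentences at this stage gives the seen/unseen evaluation splits mentioned above.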
Training ASR using speechbrain: https://colab.research.google.com/drive/1aFgzrUv3udM_gNJNUoLaHIm78QHtxdIz?usp=sharing
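Once the synthetic manifest exists, a character-level CTC recogniser can be trained with SpeechBrain. The sketch below is a minimal example under assumed choices (a toy feed-forward acoustic model over filterbank features, an assumed Catalan character inventory, an assumed 22.05 kHz TTS output rate, and the synthetic_manifest.csv produced above); the linked Colab notebook describes the intended full recipe.

# Minimal sketch of character-level CTC training with SpeechBrain on the
# synthetic data; model size, character set and hyperparameters are
# illustrative assumptions, not taken from the linked notebook.
import torch
import speechbrain as sb

chars = "abcdefghijklmnopqrstuvwxyzàèéíïòóúüç' "        # assumed character set
char2idx = {c: i + 1 for i, c in enumerate(chars)}      # index 0 = CTC blank

dataset = sb.dataio.dataset.DynamicItemDataset.from_csv("synthetic_manifest.csv")

@sb.utils.data_pipeline.takes("wav")
@sb.utils.data_pipeline.provides("sig")
def audio_pipeline(wav):
    return sb.dataio.dataio.read_audio(wav)

@sb.utils.data_pipeline.takes("words")
@sb.utils.data_pipeline.provides("tokens")
def text_pipeline(words):
    return torch.tensor([char2idx[c] for c in words.lower() if c in char2idx])

sb.dataio.dataset.add_dynamic_item([dataset], audio_pipeline)
sb.dataio.dataset.add_dynamic_item([dataset], text_pipeline)
sb.dataio.dataset.set_output_keys([dataset], ["id", "sig", "tokens"])

class SynthASR(sb.Brain):
    def compute_forward(self, batch, stage):
        batch = batch.to(self.device)
        wavs, wav_lens = batch.sig                  # padded wavs + rel. lengths
        feats = self.modules.compute_features(wavs) # log-Mel filterbanks
        logits = self.modules.model(feats)
        return torch.log_softmax(logits, dim=-1), wav_lens

    def compute_objectives(self, predictions, batch, stage):
        log_probs, wav_lens = predictions
        tokens, token_lens = batch.tokens
        return sb.nnet.losses.ctc_loss(
            log_probs, tokens, wav_lens, token_lens, blank_index=0
        )

modules = {
    # sample_rate is an assumption about the TTS output; adjust if needed.
    "compute_features": sb.lobes.features.Fbank(n_mels=40, sample_rate=22050),
    "model": torch.nn.Sequential(                   # toy acoustic model
        torch.nn.Linear(40, 256), torch.nn.LeakyReLU(),
        torch.nn.Linear(256, len(chars) + 1),
    ),
}
brain = SynthASR(modules=modules,
                 opt_class=lambda params: torch.optim.Adam(params, lr=1e-3))
brain.fit(range(10), dataset, train_loader_kwargs={"batch_size": 4})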
Reference(s):
1. https://arxiv.org/pdf/2306.00998.pdf