SLAM-ASR is an unofficial PyTorch Lightning implementation that combines the Whisper speech encoder with large language models (LLMs) for automatic speech recognition (ASR). It leverages pre-trained models to transcribe speech into text.
- Utilizes the Whisper speech encoder for speech feature extraction
- Supports various LLMs for text generation (e.g., Meta-Llama, Vicuna)
- Trains the model with PyTorch Lightning for scalability and ease of use
- Provides an inference script for transcribing audio files
- Clone the repository:

  ```bash
  git clone https://github.com/WithourAI/slam-asr.git
  cd slam-asr
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

To train the SLAM-ASR model, change the parameters and run the following command:

```bash
bash train.sh
```

This script will start the training process using the specified configuration, including the trainer settings, model hyperparameters, and callbacks.
In testing, the model can be trained on a single GPU with 24 GB of memory at a batch_size of 1; with 40 GB of memory, the batch_size can be increased to 6.
To transcribe an audio file using the trained SLAM-ASR model, use the inference.py script:
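For example, an invocation might look like the following; the flag names here are illustrative, not necessarily the script's actual arguments, so check inference.py for the options it accepts:

```bash
python inference.py \
    --checkpoint checkpoints/slam-asr.ckpt \
    --audio path/to/audio.wav
```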
The SLAM-ASR model consists of the following components (a minimal sketch of the projector follows the list):
- Whisper speech encoder: Extracts features from the input speech
- Projector: Transforms speech embeddings to the LLM embedding space
- LLM (e.g., Meta-Llama, Vicuna): Generates text based on the projected speech embeddings
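A minimal sketch of the projector, assuming the downsample-then-MLP design described in the SLAM-ASR paper; the class name, layer sizes, and downsampling factor here are illustrative rather than taken from slam_asr.py:

```python
import torch.nn as nn

class Projector(nn.Module):
    """Maps Whisper encoder frames into the LLM embedding space.

    Illustrative dimensions: Whisper-large encoder frames are 1280-d,
    Llama-scale LLM embeddings are 4096-d.
    """

    def __init__(self, encoder_dim=1280, llm_dim=4096, k=5, hidden_dim=2048):
        super().__init__()
        self.k = k  # downsampling factor: stack k consecutive frames
        self.mlp = nn.Sequential(
            nn.Linear(encoder_dim * k, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, x):
        # x: (batch, frames, encoder_dim)
        b, t, d = x.shape
        t = t - t % self.k  # drop trailing frames so t divides by k
        x = x[:, :t, :].reshape(b, t // self.k, d * self.k)
        return self.mlp(x)  # (batch, frames // k, llm_dim)
```

In the paper's setup, the projected frames are inserted into the LLM's input sequence alongside the prompt's token embeddings, and the projector is the main trainable component while the encoder and LLM stay frozen.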
The model can be trained on various speech datasets, such as LibriSpeech or custom datasets. Modify the setup method in slam_asr.py to load your desired dataset.
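As one example, a setup method that loads LibriSpeech via torchaudio could look like this (the attribute names and data root are illustrative; adapt them to the actual structure of slam_asr.py):

```python
import torchaudio

def setup(self, stage=None):
    # Each LIBRISPEECH item is a tuple:
    # (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id)
    if stage in (None, "fit"):
        self.train_set = torchaudio.datasets.LIBRISPEECH(
            "data", url="train-clean-100", download=True
        )
        self.val_set = torchaudio.datasets.LIBRISPEECH(
            "data", url="dev-clean", download=True
        )
```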
The training configuration can be customized by modifying the arguments in the train.sh script. Adjust the trainer settings, model hyperparameters, and callbacks according to your requirements.
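As a rough illustration, the kinds of settings exposed through train.sh typically map onto a PyTorch Lightning Trainer like this (the values and names below are examples, not the script's actual defaults):

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    max_epochs=10,
    precision="bf16-mixed",     # mixed precision to fit larger batches
    accumulate_grad_batches=4,  # effective batch = batch_size * 4
    callbacks=[ModelCheckpoint(monitor="val_loss", save_top_k=1)],
)
```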
This project is licensed under the MIT License.
- OpenAI Whisper
- Meta-Llama
- Vicuna
- PyTorch Lightning
- The SLAM-ASR implementation is inspired by the paper "An Embarrassingly Simple Approach for LLM with Strong ASR Capacity" by Ziyang Ma et al.