AST: Audio Spectrogram Transformer (Yuan Gong, Yu-An Chung, James Glass). The main idea is to apply a vision transformer to the spectrogram of a given audio signal in order to extract features for classification. The model was pretrained on the AudioSet dataset, which contains a variety of labeled YouTube audio clips (labels include music, bark, engine, etc.). [https://huggingface.co/docs/transformers/en/model_doc/audio-spectrogram-transformer]

"GTZAN is a dataset for musical genre classification of audio signals. The dataset consists of 1,000 audio tracks, each of 30 seconds long. It contains 10 genres, each represented by 100 tracks. The tracks are all 22,050Hz Mono 16-bit audio files in WAV format. The genres are: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, and rock."[https://huggingface.co/datasets/marsyas/gtzan]
The following steps are used to adapt AST to the new task during training:
- Freezing the pretrained AST model layers.
- Replacing the last layer with a 2-layer MLP with dropout, and adding an optional DoRA wrapper on the last layer of each encoder block's feed-forward network.
- Adding sound augmentation to diversify the small dataset (slow/fast, lowpass/highpass, echo, and mixes of these).
- Optuna hyperparameter search.
80% of the GTZAN samples were used as the training set, and the rest were split equally into validation and test sets (10% each).
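The echo and slow/fast augmentations listed above can be sketched in plain numpy (the repo may implement these with a dedicated audio-augmentation library such as audiomentations; the delay, decay, and rate values here are illustrative):

```python
import numpy as np

def add_echo(x, sr, delay_s=0.25, decay=0.4):
    # Mix a delayed, attenuated copy of the signal back into itself.
    d = int(delay_s * sr)
    y = np.copy(x).astype(np.float32)
    y[d:] += decay * x[:-d]
    return y / max(1.0, np.abs(y).max())  # avoid clipping

def change_speed(x, rate=1.1):
    # Naive slow/fast: resample by linear interpolation
    # (rate > 1 speeds the clip up and shortens it).
    idx = np.arange(0, len(x), rate)
    return np.interp(idx, np.arange(len(x)), x).astype(np.float32)
```

Applying a random mix of such transforms to each 30-second clip effectively multiplies the variety of the small GTZAN training set.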
Using the method above for training and validation, the classification accuracy achieved on the test set was 83-87%. Test set confusion matrix:

Cross entropy loss vs training iterations:

Validation accuracy vs training iterations (best model validation accuracy is 87-89%):

Use git to clone the repository with the following command:
git clone https://github.com/taldatech/ee046211-deep-learning.git
If an ece046211 virtual environment is already installed on your machine, activate it (conda activate deep_learn), skip to the transformers package installation in the table below, and continue from there.
Else:
- Get Anaconda with Python 3, follow the instructions according to your OS (Windows/Mac/Linux) at: https://www.anaconda.com/download
- Create a new environment for the course and install packages from scratch:
  - In Windows open Anaconda Prompt from the start menu; in Mac/Linux open the terminal. Run conda create --name deep_learn python=3.9. Full guide at https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands
  - To activate the environment, open the terminal (or Anaconda Prompt in Windows) and run conda activate deep_learn
  - Install the required libraries according to the table below (to search for a specific library and the corresponding command you can also look at https://anaconda.org/)
| Library | Command to Run |
|---|---|
| Jupyter Notebook | conda install -c conda-forge notebook |
| numpy | conda install -c conda-forge numpy |
| matplotlib | conda install -c conda-forge matplotlib |
| pandas | conda install -c conda-forge pandas |
| scipy | conda install -c anaconda scipy |
| scikit-learn | conda install -c conda-forge scikit-learn |
| seaborn | conda install -c conda-forge seaborn |
| tqdm | conda install -c conda-forge tqdm |
| opencv | conda install -c conda-forge opencv |
| optuna | pip install optuna |
| pytorch (cpu) | conda install pytorch torchvision torchaudio cpuonly -c pytorch (get command from PyTorch.org) |
| pytorch (gpu) | conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia (get command from PyTorch.org) |
| torchtext | conda install -c pytorch torchtext |
| torchdata | conda install -c pytorch torchdata + pip install portalocker |
| transformers | conda install -c conda-forge transformers |
| accelerate | conda install -c conda-forge accelerate |
| datasets | conda install -c conda-forge datasets |
| evaluate | conda install -c conda-forge evaluate |
| pydub | conda install -c conda-forge pydub |
| audiomentations | pip install audiomentations |
| librosa | conda install -c conda-forge librosa |
| tensorboardX | conda install -c conda-forge tensorboardX |
There are two Jupyter notebooks in the repository:
- train_test_gtzan.ipynb - Trains the AST model on the GTZAN dataset using the suggested fine-tuning method, saves the model that is most accurate on the validation set, and shows its test set results.
- test_best_model.ipynb - Tests the currently saved best music genre classification model on your own music files.

To open a notebook, open Anaconda Navigator or run jupyter notebook in the terminal (or Anaconda Prompt in Windows) while the deep_learn environment is activated.
- Yuan Gong, Yu-An Chung, and James Glass, "AST: Audio Spectrogram Transformer", Proc. Interspeech 2021, pp. 571-575, 2021.
- https://huggingface.co/docs/transformers/en/model_doc/audio-spectrogram-transformer
- https://huggingface.co/learn/audio-course/en/chapter4/fine-tuning#conclusion
- https://huggingface.co/datasets/marsyas/gtzan
