This repository contains the code used for the publication below:
Taejun Kim, Jongpil Lee, and Juhan Nam, "Comparison and Analysis of SampleCNN Architectures for Audio Classification" in IEEE Journal of Selected Topics in Signal Processing (JSTSP), 2019.
Contents:

- Install Dependencies
- Building Datasets
  - Music auto-tagging: MagnaTagATune
  - Keyword spotting: Speech Commands
  - Acoustic scene tagging: DCASE 2017 Task 4
- Training a SampleCNN
## Install Dependencies

NOTE: The code in this repository is written and tested with Python 3.6. The following packages are required:

- tensorflow 1.10.x (1.10.x is strongly recommended because of version compatibility)
- librosa
- ffmpeg
- pandas
- numpy
- scikit-learn
- h5py
To install the required Python packages using conda, run the commands below:

```sh
conda install tensorflow-gpu=1.10.0 ffmpeg pandas numpy scikit-learn h5py
conda install -c conda-forge librosa
```

## Building Datasets

Download and preprocess the dataset that you want to train a model on.
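Before building any dataset, you can quickly sanity-check that the dependencies above are importable. This is a minimal sketch, not part of the repository; note that import names can differ from conda package names (scikit-learn imports as `sklearn`, tensorflow-gpu as `tensorflow`), and ffmpeg is a command-line tool rather than a Python package, so it is not checked here.

```python
# Quick sanity check that the required Python packages are importable.
# Import names can differ from conda package names
# (e.g. scikit-learn imports as sklearn, tensorflow-gpu as tensorflow).
from importlib import util

REQUIRED = ['tensorflow', 'librosa', 'pandas', 'numpy', 'sklearn', 'h5py']

def missing_packages(names=REQUIRED):
    """Return the packages from `names` that the import system cannot find."""
    return [name for name in names if util.find_spec(name) is None]

if __name__ == '__main__':
    missing = missing_packages()
    print('Missing packages:', ', '.join(missing) if missing else 'none')
```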
### Music auto-tagging: MagnaTagATune
Edith Law, Kris West, Michael Mandel, Mert Bay and J. Stephen Downie (2009). Evaluation of algorithms using games: the case of music annotation. In Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR).
Create a directory for the dataset and download the required .csv file and three .zip files into the directory data/mtt/raw:

```sh
mkdir -p data/mtt/raw
cd data/mtt/raw
wget http://mi.soi.city.ac.uk/datasets/magnatagatune/annotations_final.csv
wget http://mi.soi.city.ac.uk/datasets/magnatagatune/mp3.zip.001
wget http://mi.soi.city.ac.uk/datasets/magnatagatune/mp3.zip.002
wget http://mi.soi.city.ac.uk/datasets/magnatagatune/mp3.zip.003
```

After downloading the files, merge and extract the three .zip files:
```sh
cat mp3.zip.* > mp3_all.zip
unzip mp3_all.zip -d mp3
```

Your directory structure should look like this:

```
data
└── mtt
    └── raw
        ├── annotations_final.csv
        └── mp3
            ├── 0
            ├── ...
            └── f
```

Finally, segment and convert the audio to TFRecords using the following command:

```sh
python build_dataset.py mtt
```

### Keyword spotting: Speech Commands
Pete Warden (2018). Speech commands: A dataset for limited-vocabulary speech recognition. arXiv:1804.03209.
After creating a directory for the dataset, download and extract the dataset into the directory data/scd/raw:

```sh
mkdir -p data/scd/raw
cd data/scd/raw
wget http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz
tar zxvf speech_commands_v0.02.tar.gz
```

Finally, segment and convert the audio to TFRecords using the following command:

```sh
python build_dataset.py scd
```

### Acoustic scene tagging: DCASE 2017 Task 4
Annamaria Mesaros, Toni Heittola, Aleksandr Diment, Benjamin Elizalde, Ankit Shah, Emmanuel Vincent, Bhiksha Raj and Tuomas Virtanen (2017). DCASE 2017 challenge setup: tasks, datasets and baseline system. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017).
Create a directory for the dataset, then download and extract the audio archives and their ground-truth annotations in the directory data/dcs/raw:

```sh
mkdir -p data/dcs/raw
cd data/dcs/raw
wget --no-check-certificate -r 'https://docs.google.com/uc?export=download&id=1HOQaUHbTgCRsS6Sr9I9uE6uCjiNPC3d3' -O Task_4_DCASE_2017_training_set.zip
wget --no-check-certificate -r 'https://docs.google.com/uc?export=download&id=1GfP5JATSmCqD8p3CBIkk1J90mfJuPI-k' -O Task_4_DCASE_2017_testing_set.zip
wget https://dl.dropboxusercontent.com/s/bbgqfd47cudwe9y/DCASE_2017_evaluation_set_audio_files.zip
unzip -P DCASE_2017_training_set Task_4_DCASE_2017_training_set.zip
unzip -P DCASE_2017_testing_set Task_4_DCASE_2017_testing_set.zip
unzip -P DCASE_2017_evaluation_set DCASE_2017_evaluation_set_audio_files.zip
wget https://github.com/ankitshah009/Task-4-Large-scale-weakly-supervised-sound-event-detection-for-smart-cars/raw/master/groundtruth_release/groundtruth_weak_label_training_set.csv
wget https://github.com/ankitshah009/Task-4-Large-scale-weakly-supervised-sound-event-detection-for-smart-cars/raw/master/groundtruth_release/groundtruth_weak_label_testing_set.csv
wget https://github.com/ankitshah009/Task-4-Large-scale-weakly-supervised-sound-event-detection-for-smart-cars/raw/master/groundtruth_release/groundtruth_weak_label_evaluation_set.csv
```

Finally, segment and convert the audio to TFRecords using the following command:

```sh
python build_dataset.py dcs
```

## Training a SampleCNN

You can train a SampleCNN with any block type on any of the datasets. Here are several examples of running training:
```sh
# Train a SampleCNN with SE block (default) on the MagnaTagATune dataset (music auto-tagging)
python train.py mtt

# Train a SampleCNN with ReSE-2 block on the Speech Commands dataset (keyword spotting)
python train.py scd --block rese2

# Train a SampleCNN with basic block on the DCASE 2017 Task 4 dataset (acoustic scene tagging)
python train.py dcs --block basic
```

Trained models are saved under the log directory, named with the datetime when you started the run. Here is an example of a saved model:

```
log/
└── 20190424_213449-scd-se/
    └── final-auc_0.XXXXXX-acc_0.XXXXXX-f1_0.XXXXXX.h5
```

You can see the available options for training using the command below:
```
$ python train.py -h
usage: train.py [-h] [--data-dir PATH] [--log-dir PATH]
                [--block {basic,se,res1,res2,rese1,rese2}]
                [--amplifying-ratio N] [--multi] [--batch-size N]
                [--momentum M] [--lr LR] [--lr-decay DC] [--dropout DO]
                [--weight-decay WD] [--num-stages N] [--patience N]
                [--num-readers N]
                DATASET [NAME]

Train a SampleCNN.

positional arguments:
  DATASET               Dataset for training: {mtt|scd|dcs}
  NAME                  Name of log directory.

optional arguments:
  -h, --help            show this help message and exit
  --data-dir PATH
  --log-dir PATH        Directory where to write event logs and models.
  --block {basic,se,res1,res2,rese1,rese2}
                        Convolutional block to build a model (default: se,
                        options: basic/se/res1/res2/rese1/rese2).
  --amplifying-ratio N
  --multi               Use multi-level feature aggregation.
  --batch-size N        Mini-batch size.
  --momentum M          Momentum for SGD.
  --lr LR               Learning rate.
  --lr-decay DC         Learning rate decay rate.
  --dropout DO          Dropout rate.
  --weight-decay WD     Weight decay.
  --num-stages N        Number of stages to train.
  --patience N          Stop training stage after #patiences.
  --num-readers N       Number of TFRecord readers.
```
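For reference, the option list above maps onto an `argparse` parser along these lines. This is a sketch reconstructed from the help text, not the repository's actual code; only a subset of the options is shown, and hyperparameter defaults are omitted because the help text does not state them (the `--block` default of `se` is the one default it does mention).

```python
import argparse

def build_parser():
    """Sketch of train.py's CLI, reconstructed from its --help output."""
    parser = argparse.ArgumentParser(description='Train a SampleCNN.')
    # Positional arguments; NAME is optional, hence nargs='?'.
    parser.add_argument('dataset', metavar='DATASET', choices=['mtt', 'scd', 'dcs'],
                        help='Dataset for training: {mtt|scd|dcs}')
    parser.add_argument('name', metavar='NAME', nargs='?',
                        help='Name of log directory.')
    # A subset of the optional arguments from the help text.
    parser.add_argument('--data-dir', metavar='PATH')
    parser.add_argument('--log-dir', metavar='PATH',
                        help='Directory where to write event logs and models.')
    parser.add_argument('--block', default='se',
                        choices=['basic', 'se', 'res1', 'res2', 'rese1', 'rese2'],
                        help='Convolutional block to build a model (default: se).')
    parser.add_argument('--multi', action='store_true',
                        help='Use multi-level feature aggregation.')
    parser.add_argument('--batch-size', metavar='N', type=int, help='Mini-batch size.')
    parser.add_argument('--lr', metavar='LR', type=float, help='Learning rate.')
    return parser
```

For example, `build_parser().parse_args(['scd', '--block', 'rese2'])` yields a namespace with `dataset='scd'` and `block='rese2'`, matching the second training example above.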