
Forced-Alignment

Forced alignment is the process of taking the text transcription of an audio speech segment and determining where in time particular words occur in that segment. This project was developed under Red Hen Lab during Google Summer of Code 2016.


## Target Functionality

Red Hen records more than a hundred hours of television news each day. The video files we capture have embedded closed captions (teletext in the DVB standard), like this excerpt from the file 2016-08-03_2000_US_MSNBC_MSNBC_Live.txt:

20160803200006.067|20160803200006.568|CC1|WE'VE BEEN HUMILIATED BY
20160803200006.701|20160803200008.536|CC1|PRESIDENT OBAMA AND HIS
20160803200008.703|20160803200008.937|CC1|POLICIES.
20160803200009.070|20160803200011.372|CC1|WE'VE BEEN HUMILIATED BY THE
20160803200011.506|20160803200013.708|CC1|IRAN DEAL TO START OFF WITH
20160803200013.842|20160803200017.745|CC1|WHERE THEY GET BACK 50
20160803200017.879|20160803200018.079|CC1|BILLION.
20160803200018.213|20160803200021.216|CC1|WE'VE BEEN HUMILIATED AS A
20160803200021.349|20160803200023.885|CC1|COUNTRY WHEN THEY TOOK OUR
20160803200024.018|20160803200024.219|CC1|SAILORS.
20160803200024.352|20160803200026.087|CC1|THEY FORCED THEM TO THEIR KNEES,
20160803200026.221|20160803200028.523|CC1|AND THE ONLY REASON WE GOT THEM
20160803200028.656|20160803200031.392|CC1|BACK IS BECAUSE WE HADN'T PAID
20160803200031.526|20160803200032.560|CC1|THE MONEY YET.
20160803200032.694|20160803200035.063|CC1|AND THAT'S THE ONLY REASON WE
20160803200035.196|20160803200036.364|CC1|GOT THEM BACK, OTHERWISE THEY
20160803200036.497|20160803200038.232|CC1|WOULD HAVE HAD TO WAIT UNTIL I
20160803200038.366|20160803200039.767|CC1|BECAME PRESIDENT.
20160803200039.901|20160803200042.403|CC1|BELIEVE ME, THEY WOULD HAVE COME
20160803200042.537|20160803200043.037|CC1|BACK FAST.
20160803200043.171|20160803200044.105|CC1|THEY WOULD HAVE COME BACK VERY
20160803200044.238|20160803200046.174|CC1|FAST.
20160803200046.341|20160803200048.343|CC1|YOU LOOK AT OUR NUMBERS, JUST
20160803200048.476|20160803200051.546|CC1|TAKE A LOOK AT THE NUMBERS.
20160803200051.679|20160803200052.814|CC1|HOMEOWNERSHIP, THE LOWEST
20160803200052.947|20160803200054.749|CC1|NUMBER, THE WORST NUMBER THAT
20160803200054.882|20160803200058.119|CC1|IT'S BEEN IN 50 YEARS.
20160803200058.252|20160803200058.619|CC1|HOMEOWNERSHIP.

This is a transcript of Donald Trump speaking, provided as a service by the broadcast station (MSNBC in this case) to people who are hard of hearing, as mandated by the Federal Communications Commission in the US and by similar legislation in other countries.

This transcript is created live by professional captioners -- they type in the text as he speaks. Even though they are excellent at this, there is of course a slight gap between the moment something is said and the moment the captioner has typed the text and it appears on the screen. This means that all the timestamps on the left (the start time and the end time for each line) are slightly wrong -- typically by between three and ten seconds.

Forced alignment brings the text into synchronization with the video and audio. Concretely, this means the forced alignment process needs to modify the time stamps so that they are exactly correct instead of delayed.

So for instance the lines that now read:

20160803200043.171|20160803200044.105|CC1|THEY WOULD HAVE COME BACK VERY
20160803200044.238|20160803200046.174|CC1|FAST.

Would be changed to read:

20160803200040.151|20160803200042.305|CC1|THEY WOULD HAVE COME BACK VERY
20160803200042.338|20160803200043.084|CC1|FAST.

The only change is in the timing. This is the task requirement.
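
To make the format concrete, here is a minimal Python sketch of such a timestamp rewrite, assuming the caption format shown above (start|end|channel|text, with times as YYYYMMDDHHMMSS.mmm). The uniform offset is only an illustration -- a real forced aligner assigns each line (and each word) its own exact time -- and the function names here are hypothetical, not part of the pipeline.

```python
from datetime import datetime, timedelta

FMT = "%Y%m%d%H%M%S.%f"  # caption timestamps: YYYYMMDDHHMMSS.mmm

def parse_caption_line(line):
    """Split one caption line into (start, end, channel, text)."""
    start, end, channel, text = line.rstrip("\n").split("|", 3)
    return datetime.strptime(start, FMT), datetime.strptime(end, FMT), channel, text

def shift_caption_line(line, offset_seconds):
    """Rewrite a caption line with both timestamps moved by offset_seconds
    (negative values move the line earlier, as alignment usually requires)."""
    start, end, channel, text = parse_caption_line(line)
    delta = timedelta(seconds=offset_seconds)
    stamp = lambda t: (t + delta).strftime(FMT)[:-3]  # trim to milliseconds
    return "|".join([stamp(start), stamp(end), channel, text])

line = "20160803200043.171|20160803200044.105|CC1|THEY WOULD HAVE COME BACK VERY"
print(shift_caption_line(line, -3.02))  # hypothetical offset; start becomes ...40.151
```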

## Overview

The process takes a video file and its text transcription as input. The audio is extracted from the video and segmented into smaller files. These segments are then fed into acoustic and language models to produce phone-level and word-level transcriptions, which are compared with the input text transcription and aligned according to their respective time stamps.

The following files carry out these operations (a sketch of how they chain together appears after the list):

  • 001_audio_from_video.py

    • Function: Extracts audio from the given video file.
    • I/P: Path to directory containing Video files and Path to directory where the Audio file should be stored.
    • O/P: Extracted Audio files are stored in the given directory.
  • run.sh

    • Function: Initializes the required directories and calls the scripts "chunk_big_file.sh" and "run_kaldi.sh".
    • I/P: Original audio file.
    • O/P: The forced-aligned TextGrid.
  • chunk_big_file.sh

    • Function: Segments the original audio file into smaller chunks.
    • I/P: Original audio file.
    • O/P: Segmented audio files.
  • run_kaldi.sh

    • Function:
      1. Prepares data
      2. Prepares LM
      3. Extracts MFCC features for audiobook data
      4. Decodes audiobook data using acoustic model trained on 100 hrs of clean Librispeech data
      5. Adapts acoustic model to the audio book data and decodes again (OPTIONAL)
    • I/P: segmented audio files.
    • O/P: Text file, word level and phone level transcription.
  • path.sh

    • Function: It initializes the path variables.
  • output_lab_and_confidence_info.sh

    • Function: Provides confidence scores at the word and phone level, which are used for alignment.
    • I/P: Word lattices generated by the decoder.
    • O/P: Two lab directories (lab_wd_level/ and lab_phn_level/), containing files with lab info and confidence scores at the word and phone level respectively.
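
To make the flow of control concrete, here is a short Python sketch that chains these scripts with subprocess. The directory layout and the .wav extension are assumptions; only the script names and their argument order come from the list above.

```python
import subprocess
from pathlib import Path

VIDEO_DIR = Path("tv")     # assumed location of the downloaded video files
AUDIO_DIR = Path("audio")  # assumed location for the extracted audio

def run_pipeline():
    AUDIO_DIR.mkdir(exist_ok=True)
    # 1. Extract audio from every video file in VIDEO_DIR.
    subprocess.run(["python", "001_audio_from_video.py",
                    str(VIDEO_DIR), str(AUDIO_DIR)], check=True)
    # 2. Chunk and decode each audio file; run.sh calls
    #    chunk_big_file.sh and run_kaldi.sh internally.
    for wav in sorted(AUDIO_DIR.glob("*.wav")):
        subprocess.run(["sh", "run.sh", str(wav)], check=True)

if __name__ == "__main__":
    run_pipeline()
```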

## Dependencies

The following software is required to run this project; all of it is present on the Case HPC.

  • Kaldi
  • SRILM
  • IRSTLM
  • ATLAS
  • SoX
  • Python
  • Edinburgh Speech Tools
  • ffmpeg

### Installation of Dependencies

  • Kaldi:

    • git clone https://github.com/kaldi-asr/kaldi.git
    • cd kaldi/tools && make -j <num_free_CPUs>
    • cd ../src
    • ./configure && make -j <num_free_CPUs>
      Add .../kaldi-trunk/src/*bin,
      .../kaldi-trunk/tools/openfst-1.3.4/src/bin,
      .../kaldi-trunk/tools/irstlm/bin, and
      .../kaldi-trunk/tools/srilm/bin to $PATH in the ~/.bashrc file.
      The above installation procedure can also be found in Kald_Install
  • SRILM:

    • There is no proper guide for installation on the HPC, so I used the 'scp' command to transfer the tar file from my local machine to the remote machine.
    • Download the setup from the official site: SRILM
    • Run from your local machine: scp path/to/file/ username@hpc1:/path/to/file/
    • Untar it using tar -xvzf file.tar.gz
  • IRSTLM:

    • wget <url for download>
    • Untar it using tar -xvzf file.tar.gz
  • ATLAS:

    • This dependency is available as a supporting package bundled with Kaldi.
  • SOX:

    • Request your admin to install it using yum install sox.
  • Python:

    • First check whether it is available using module avail python.
    • Then load it using module load python.
  • Edinburgh Speech tools:

    • Download the source from http://www.cstr.ed.ac.uk/projects/speech_tools/
    • Untar it and follow the INSTALL file.
  • ffmpeg:

    • First check for the module using module spider ffmpeg.
    • Then load it using module load ffmpeg.

All of these dependencies can be found on the HPC under ~/Pipeline/kaldi-trunks/tools/.
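
Before running the pipeline it is worth checking that the tools above are actually on $PATH. A minimal sketch, assuming the usual executable names shipped by each package:

```python
import shutil

# One representative executable per dependency; adjust to your installation.
REQUIRED = {
    "Kaldi":  "compute-mfcc-feats",
    "SRILM":  "ngram-count",
    "IRSTLM": "build-lm.sh",
    "SoX":    "sox",
    "ffmpeg": "ffmpeg",
}

for name, binary in REQUIRED.items():
    path = shutil.which(binary)
    print(f"{name:8s}", "OK:" if path else "MISSING:", path or binary)
```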

## Steps Involved

Training the Models:

  • The acoustic model is trained with 100 hours of clean speech data from LibriSpeech.

            du -h data/tinyshakespeare/input.txt
            th train.lua -help
            du -h data/philosopher/input.txt
            th train.lua -data_dir data/philosopher/ -rnn_size 300 -num_layers 3 -dropout 0.5
            th train.lua -data_dir data/philosopher/ -rnn_size 300 -num_layers 3 -dropout 0.5 -init_from cv/lm_lstm_epoch6.48_1.0080.t7
    

Obtaining the video and text files:

  • Red Hen records more than a hundred hours of television news every day. The files can be obtained using:

    ssh redalpha
    rsync redalpha:~/falign/2006/2006-01/2006-01-02 ~/tv/ -avn (gets 8-hour files if you remove the final n)
    rsync redalpha:~/falign/2016/2016-07/2016-07-15 ~/tv/ -avn (gets regular mainly one-hour files if you remove the final n)

Extracting Audio from the video files:

  • Run the Python script as shown below:
    python 001_audio_from_video.py <path/containing/video_file> <path/to/audio_file>
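
A minimal sketch of what such a script can do, using ffmpeg to pull 16 kHz mono WAV out of each video (a common input format for Kaldi; the exact flags and file extensions used by 001_audio_from_video.py may differ):

```python
import subprocess
import sys
from pathlib import Path

def extract_audio(video_dir, audio_dir):
    """Extract 16 kHz mono WAV from every .mp4 in video_dir with ffmpeg."""
    audio_dir = Path(audio_dir)
    audio_dir.mkdir(parents=True, exist_ok=True)
    for video in sorted(Path(video_dir).glob("*.mp4")):
        wav = audio_dir / (video.stem + ".wav")
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(video),   # overwrite output, read video
             "-vn", "-ac", "1", "-ar", "16000",  # drop video, mono, 16 kHz
             str(wav)],
            check=True)

if __name__ == "__main__":
    extract_audio(sys.argv[1], sys.argv[2])
```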

Decoding the Audio file:

  • Running the following command produces the text, the word-level transcription, and the phone-level transcription.
    sh run.sh <path/to/audio_file>
    The outputs can be found in data_for_TTS/etc, data_for_TTS/wd_level, data_for_TTS/phn_level.

Alignment of the Textgrid:

  • Compares the word-level transcription with the provided caption text and aligns it (one possible approach is sketched below).
    Currently, this work is still in progress and will be updated soon.
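
One way to carry out that comparison, sketched here as an illustration rather than the project's final method: match the decoder's word sequence against the caption words with difflib, then copy the decoder's times onto each matched caption word. The (word, start, end) tuple format for the decoder output is an assumption.

```python
from difflib import SequenceMatcher

def align_words(caption_words, decoded):
    """caption_words: list of words from the closed captions.
    decoded: list of (word, start_sec, end_sec) from the Kaldi decoder.
    Returns (caption_word, start, end) for every word the two sequences share."""
    decoded_words = [w for w, _, _ in decoded]
    matcher = SequenceMatcher(None, caption_words, decoded_words, autojunk=False)
    aligned = []
    for a, b, size in matcher.get_matching_blocks():
        for k in range(size):
            word, start, end = decoded[b + k]
            aligned.append((caption_words[a + k], start, end))
    return aligned

captions = "THEY WOULD HAVE COME BACK VERY FAST".split()
decoded = [("THEY", 40.15, 40.31), ("WOULD", 40.31, 40.52),
           ("HAVE", 40.52, 40.70), ("COME", 40.70, 40.95),
           ("BACK", 40.95, 41.30), ("VERY", 41.30, 41.70),
           ("FAST", 42.33, 43.08)]
print(align_words(captions, decoded))  # all seven words match here
```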