Leveraging advanced ASR systems and BERT embeddings to preprocess lecture videos for downstream indexing and retrieval

Alexiuszz/E2E-Video-Processing-system

Evaluating Automatic Speech Recognition Models for Semantic Topic Segmentation in Educational Video Retrieval

This project explores the trade-off between accuracy and processing time in ASR models and develops a robust topic segmentation algorithm. It also includes a web-based MVP built with FastAPI.

ASR Evaluation

The first phase of this project evaluates five ASR models on 34 videos collected from MIT OpenCourseWare. The dataset includes manually corrected transcripts, which serve as the ground truth and can be provided on request. The evaluation also explores the effect of audio enhancement techniques on Word Error Rate (WER) and Real-Time Factor (RTF).
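
For readers unfamiliar with the two metrics, the following is a minimal sketch of how WER and RTF are commonly computed. It is not taken from the repository's evaluation scripts: the function names are hypothetical, and lowercasing plus whitespace tokenization is an assumed normalization that may differ from what the project applies.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds
```

For example, a hypothesis that drops one word from a six-word reference has a WER of 1/6, and transcribing 60 seconds of audio in 30 seconds gives an RTF of 0.5.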

To get started, clone the repository:

git clone https://github.com/Alexiuszz/E2E-Video-Processing-system.git

Data Processing and Audio Enhancement

  1. Enter the dataset processing directory:
cd DatasetProcessing
  2. Create a virtual environment and install dependencies:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
  3. Create the CSV metadata and convert the videos to audio:
python helper/create_csv_from_dir.py
python video2audio.py
  4. Apply audio enhancement:
python audio_enhancement.py
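
As an illustration of the video-to-audio step, here is a minimal sketch that builds an ffmpeg command for extracting a mono 16 kHz WAV track, a common input format for ASR models. It is an assumption, not the content of video2audio.py: the function name, the output naming scheme, and the choice of ffmpeg itself are hypothetical.

```python
from pathlib import Path


def build_ffmpeg_command(video_path: str, audio_dir: str, sample_rate: int = 16000) -> list:
    """Return the ffmpeg argument list that extracts a mono WAV from a video file."""
    video = Path(video_path)
    # Hypothetical naming scheme: same file stem, .wav extension, inside audio_dir.
    output = Path(audio_dir) / (video.stem + ".wav")
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it already exists
        "-i", str(video),        # input video
        "-vn",                   # drop the video stream
        "-ac", "1",              # downmix to a single (mono) channel
        "-ar", str(sample_rate), # resample to the target rate in Hz
        str(output),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where ffmpeg is installed.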

Batch Transcription

  1. Navigate to the ASR batch scripts directory:
cd ../ASR/batch_scripts
  2. Create a virtual environment and install dependencies:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
  3. Set environment variables in a .env file:
OPENAI_API_KEY=your_openai_key
BASE_DIR=/path/to/your/log/output
  4. Run any of the batch transcription scripts:
python whisper_batch.py
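
The .env settings above could be read along the following lines. This is a sketch using only the standard library; the batch scripts may well use a package such as python-dotenv instead, and both helper names here are hypothetical.

```python
import os
from pathlib import Path

REQUIRED_KEYS = ("OPENAI_API_KEY", "BASE_DIR")


def load_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_settings(path: str = ".env") -> dict:
    """Return the required settings; real environment variables override the file."""
    merged = {**load_env_file(path), **os.environ}
    missing = [key for key in REQUIRED_KEYS if not merged.get(key)]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return {key: merged[key] for key in REQUIRED_KEYS}
```

Failing early on a missing key avoids starting a long batch run that would only error out at the first API call or log write.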

Get WER

To evaluate Word Error Rate:

  • For local models:
python WER_hpc.py
  • For OpenAI API:
python WER_hpc.py

Topic Segmentation

The second phase of the project focuses on developing a robust topic segmentation algorithm, evaluated on three benchmark datasets: YTSeg, AMI, and ICSI.

Baseline comparison models include:

  • Random segmentation
  • Even segmentation
  • Solbiati et al. (unsupervised segmentation)
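
To make the two simplest baselines concrete, here is a minimal sketch of even and random segmentation over a transcript of sentences. It assumes the number of segments is given and represents a segmentation as the list of sentence indices where a new segment starts. Both choices, and the function names, are assumptions and may not match the repository's implementation.

```python
import random


def even_segmentation(num_sentences: int, num_segments: int) -> list[int]:
    """Place boundaries so that segments are as close to equal length as possible."""
    if not 1 <= num_segments <= num_sentences:
        raise ValueError("num_segments must be between 1 and num_sentences")
    # Integer division keeps boundaries strictly increasing and inside 1..num_sentences-1.
    return [(k * num_sentences) // num_segments for k in range(1, num_segments)]


def random_segmentation(num_sentences: int, num_segments: int, seed=None) -> list[int]:
    """Draw distinct boundary positions uniformly at random."""
    if not 1 <= num_segments <= num_sentences:
        raise ValueError("num_segments must be between 1 and num_sentences")
    rng = random.Random(seed)  # a fixed seed makes the baseline reproducible
    # A boundary at index i means sentence i starts a new segment, so 0 is excluded.
    return sorted(rng.sample(range(1, num_sentences), num_segments - 1))
```

For a 10-sentence transcript split into 4 segments, `even_segmentation(10, 4)` returns `[2, 5, 7]`, that is, segments of 2, 3, 2, and 3 sentences.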

The topic segmentation algorithm is implemented in topic_segment.py. To begin:

cd Segmentation

Dataset Processing

The YTSeg dataset must be preprocessed before running the evaluation:

python3 datasets_/ytseg_data_preparation.py --input_dir "/path/to/raw/dataset/directory" --output_dir "/path/to/clean/dataset/directory"

Run the Evaluation Script (main.py)

python main.py --model <model_name> --dataset <dataset_name> [--test_size <num_samples>]
  • --model: Segmentation model. Options: random, bertseg, default, simple, even
  • --dataset: Dataset to use. Options: ytseg, ami, icsi
  • --test_size: (Optional) Limit number of samples

Example:

python main.py --model bertseg --dataset ytseg --test_size 10
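
The command-line interface described above could be declared with argparse roughly as follows. This is a sketch based only on the options listed here, not the actual contents of main.py. Whether `--model` and `--dataset` are required, and the `build_parser` name itself, are assumptions.

```python
import argparse

# Option values as listed in this README.
MODELS = ("random", "bertseg", "default", "simple", "even")
DATASETS = ("ytseg", "ami", "icsi")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Evaluate a topic segmentation model.")
    parser.add_argument("--model", required=True, choices=MODELS,
                        help="Segmentation model to evaluate")
    parser.add_argument("--dataset", required=True, choices=DATASETS,
                        help="Benchmark dataset to evaluate on")
    parser.add_argument("--test_size", type=int, default=None,
                        help="Optional limit on the number of samples")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"model={args.model} dataset={args.dataset} test_size={args.test_size}")
```

Restricting values with `choices` makes argparse reject a misspelled model or dataset name before any data is loaded.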

FastAPI MVP

Instructions for running the FastAPI MVP can be found in the README.md of the FastAPI directory.
