Evaluating Automatic Speech Recognition Models for Semantic Topic Segmentation in Educational Video Retrieval
This project explores the trade-off between accuracy and processing time in ASR models and develops a robust topic segmentation algorithm. It also includes a FastAPI web-based MVP.
The first phase of this project evaluates five ASR models on 34 videos collected from MIT OpenCourseWare. The dataset includes manually corrected transcripts, which serve as the ground truth and can be provided on request. The evaluation also explores the effect of audio enhancement techniques on Word Error Rate (WER) and Real-Time Factor (RTF).
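As a rough illustration of the two metrics named above, the sketch below computes WER as the word-level edit distance divided by the number of reference words, and RTF as processing time divided by audio duration. The function names and the text normalization (lowercasing and whitespace tokenization) are assumptions made for this example; the evaluation scripts in this repository may normalize transcripts differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds
```

For example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words, and `real_time_factor(30, 120)` describes a model that transcribes two minutes of audio in thirty seconds.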
- Clone the repository:
git clone https://github.com/Alexiuszz/E2E-Video-Processing-system.git
- Enter the dataset processing directory:
cd DatasetProcessing
- Create a virtual environment and install dependencies:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
- Convert video to audio and create CSV metadata:
python helper/create_csv_from_dir.py
python video2audio.py
- Apply audio enhancement:
python audio_enhancement.py
- Navigate to the ASR batch scripts directory:
cd ../ASR/batch_scripts
- Create a virtual environment and install dependencies:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
- Set environment variables in a .env file:
OPENAI_API_KEY=your_openai_key
BASE_DIR=/path/to/your/log/output
- Run any of the batch transcription scripts:
python whisper_batch.py

To evaluate Word Error Rate:
- For local models:
python WER_hpc.py
- For OpenAI API:
python WER_hpc.py

The second phase of the project focuses on developing a robust topic segmentation algorithm, evaluated against three benchmark datasets:
- YTSeg
- AMI
- ICSI
Baseline comparison models include:
- Random segmentation
- Even segmentation
- Solbiati et al. (unsupervised segmentation)
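The two simplest baselines above can be sketched as follows. This is a minimal illustration, not the repository's implementation: it assumes a transcript is a sequence of sentences, that a segmentation is a sorted list of boundary indices (a boundary `b` means a new segment starts at sentence `b`), and that the number of segments is given. The function names and the `seed` parameter are hypothetical.

```python
import random
from typing import List, Optional


def _check(num_sentences: int, num_segments: int) -> None:
    # Each segment needs at least one sentence.
    if not 1 <= num_segments <= num_sentences:
        raise ValueError("need 1 <= num_segments <= num_sentences")


def even_segmentation(num_sentences: int, num_segments: int) -> List[int]:
    """Place boundaries so that segments are as close to equal length as possible."""
    _check(num_sentences, num_segments)
    return [(k * num_sentences) // num_segments for k in range(1, num_segments)]


def random_segmentation(num_sentences: int, num_segments: int,
                        seed: Optional[int] = None) -> List[int]:
    """Draw distinct boundary positions uniformly at random."""
    _check(num_sentences, num_segments)
    rng = random.Random(seed)  # a fixed seed makes the baseline reproducible
    return sorted(rng.sample(range(1, num_sentences), num_segments - 1))
```

With ten sentences and three segments, `even_segmentation(10, 3)` yields boundaries at sentences 3 and 6, while `random_segmentation(10, 3, seed=0)` yields two distinct boundaries somewhere between sentence 1 and sentence 9.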
The topic segmentation algorithm can be found at: topic_segment.py

To begin:
cd Segmentation

The YTSeg dataset must be preprocessed before running the evaluation:
python3 datasets_/ytseg_data_preparation.py --input_dir "/path/to/raw/dataset/directory" --output_dir "/path/to/clean/dataset/directory"

Run the evaluation:
python main.py --model <model_name> --dataset <dataset_name> [--test_size <num_samples>]
- --model: Segmentation model. Options: random, bertseg, default, simple, even
- --dataset: Dataset to use. Options: ytseg, ami, icsi
- --test_size: (Optional) Limit number of samples
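For readers who want to see how the options above fit together, here is a small `argparse` sketch that accepts the same flags and choices. It is an illustration only: whether `--model` and `--dataset` are required, the help strings, and the `build_parser` name are assumptions, and the actual main.py may be organized differently.

```python
import argparse

# Option values taken from the list above.
MODELS = ["random", "bertseg", "default", "simple", "even"]
DATASETS = ["ytseg", "ami", "icsi"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Evaluate a topic segmentation model on a benchmark dataset."
    )
    # Assumption: both flags are mandatory, since the usage line shows them unbracketed.
    parser.add_argument("--model", required=True, choices=MODELS,
                        help="Segmentation model to evaluate")
    parser.add_argument("--dataset", required=True, choices=DATASETS,
                        help="Benchmark dataset to evaluate on")
    # Optional cap on the number of samples; None means use the full dataset.
    parser.add_argument("--test_size", type=int, default=None,
                        help="Limit the number of samples")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"model={args.model} dataset={args.dataset} test_size={args.test_size}")
```

Restricting `--model` and `--dataset` with `choices` makes an unsupported value fail immediately with a usage message instead of partway through an evaluation run.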
Example:
python main.py --model bertseg --dataset ytseg --test_size 10

Instructions for running the FastAPI MVP can be found in the README.md of the FastAPI directory.