Evaluating Automatic Speech Recognition Models for Semantic Topic Segmentation in Educational Video Retrieval

This project explores the trade-off between accuracy and processing time in automatic speech recognition (ASR) models and develops a robust topic segmentation algorithm. It also includes a web-based MVP built with FastAPI.

ASR Evaluation

The first phase of this project evaluates five ASR models on 34 videos collected from MIT OpenCourseWare. The dataset includes manually corrected transcripts, which serve as the ground truth and can be provided on request. The evaluation also explores the effect of audio enhancement techniques on Word Error Rate (WER) and Real-Time Factor (RTF).
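
As a reference for the two metrics named above, the following is a minimal sketch of their standard definitions: WER is the word-level edit distance divided by the number of reference words, and RTF is processing time divided by audio duration. The function names `word_error_rate` and `real_time_factor` are illustrative and not taken from the repository, and the lowercasing and whitespace tokenization are assumptions; the project's own scripts may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumption: lowercase and split on whitespace; real pipelines often
    # apply heavier normalization (punctuation, numerals, etc.).
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds
```

An RTF below 1 means the model transcribes faster than real time; for example, 30 seconds of processing for a 120-second clip gives an RTF of 0.25.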

To get started, clone the repo:

git clone https://github.com/Alexiuszz/E2E-Video-Processing-system.git

Data Processing and Audio Enhancement

Enter the dataset processing directory:

cd DatasetProcessing

Create a virtual environment and install dependencies:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Create the CSV metadata, then convert video to audio:

python helper/create_csv_from_dir.py
python video2audio.py
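
For readers who want to see what a video-to-audio step typically involves, here is a minimal sketch that builds an ffmpeg command for extracting a mono WAV track. It is not the contents of `video2audio.py`: the use of ffmpeg, the 16 kHz sample rate, the WAV output format, and the helper name `build_ffmpeg_command` are all assumptions made for illustration.

```python
from pathlib import Path


def build_ffmpeg_command(video_path, audio_dir, sample_rate=16000):
    """Return an ffmpeg argument list that extracts mono audio from a video.

    Hypothetical helper: assumes ffmpeg is installed and on the PATH.
    """
    video = Path(video_path)
    audio = Path(audio_dir) / (video.stem + ".wav")
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(video),         # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to a single (mono) channel
        "-ar", str(sample_rate),  # resample to the target rate in Hz
        str(audio),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to perform the conversion.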

Apply audio enhancement:

python audio_enhancement.py

Batch Transcription

Navigate to the ASR batch scripts directory:

cd ../ASR/batch_scripts

Create a virtual environment and install dependencies:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Set environment variables in a .env file:

OPENAI_API_KEY=your_openai_key
BASE_DIR=/path/to/your/log/output
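
To check that the .env file is complete before launching a long batch run, a small standard-library sketch such as the one below may help. The helper names `load_env_file` and `require_settings` are hypothetical, and how the batch scripts actually read these variables is not shown here.

```python
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_settings(values, names=("OPENAI_API_KEY", "BASE_DIR")):
    """Return the required settings or raise KeyError listing what is missing."""
    missing = [name for name in names if not values.get(name)]
    if missing:
        raise KeyError("missing settings: " + ", ".join(missing))
    return {name: values[name] for name in names}
```

Calling `require_settings(load_env_file())` from the batch scripts directory fails early with a clear message if either variable is absent.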

Run any of the batch transcription scripts:

python whisper_batch.py

Get WER

To evaluate Word Error Rate:

For local models:

python WER_hpc.py

For OpenAI API:

python WER_hpc.py

Topic Segmentation

The second phase of the project focuses on developing a robust topic segmentation algorithm, evaluated on three benchmark datasets: YTSeg, AMI, and ICSI.

Baseline comparison models include:

Random segmentation
Even segmentation
Solbiati et al. (unsupervised segmentation)
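
The two simplest baselines can be sketched as follows. This is an illustrative version, not the repository's implementation: it assumes a transcript is a sequence of sentences, that a segmentation is a sorted list of boundary indices (a boundary at index i means a new segment starts at sentence i), and that the number of segments is given. The function names are hypothetical.

```python
import random


def even_segmentation(num_sentences, num_segments):
    """Place boundaries so that segments have near-equal length."""
    if not 1 <= num_segments <= num_sentences:
        raise ValueError("need 1 <= num_segments <= num_sentences")
    # Integer division keeps boundaries strictly increasing.
    return [(k * num_sentences) // num_segments for k in range(1, num_segments)]


def random_segmentation(num_sentences, num_segments, seed=None):
    """Pick distinct boundary positions uniformly at random."""
    if not 1 <= num_segments <= num_sentences:
        raise ValueError("need 1 <= num_segments <= num_sentences")
    rng = random.Random(seed)
    # Boundaries fall strictly inside the transcript, never at index 0.
    return sorted(rng.sample(range(1, num_sentences), num_segments - 1))
```

Both functions return `num_segments - 1` boundaries, so their output can be scored with the same metric as a learned model. Other variants of the random baseline exist, such as drawing each boundary independently with a fixed probability.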

The topic segmentation algorithm can be found in topic_segment.py.

To begin:

cd Segmentation

Dataset Processing

The YTSeg dataset must be preprocessed before running the evaluation:

python3 datasets_/ytseg_data_preparation.py --input_dir "/path/to/raw/dataset/directory" --output_dir "/path/to/clean/dataset/directory"

Run Evaluation Script in `main.py`

python main.py --model <model_name> --dataset <dataset_name> [--test_size <num_samples>]

--model: Segmentation model. Options: random, bertseg, default, simple, even
--dataset: Dataset to use. Options: ytseg, ami, icsi
--test_size: (Optional) Limit number of samples

Example:

python main.py --model bertseg --dataset ytseg --test_size 10
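
The command-line interface described above could be declared with `argparse` roughly as follows. This is a sketch reconstructed from the documented options only; whether `main.py` marks `--model` and `--dataset` as required, and what default it uses for `--test_size`, are assumptions.

```python
import argparse

# Option values as listed in this README.
MODELS = ["random", "bertseg", "default", "simple", "even"]
DATASETS = ["ytseg", "ami", "icsi"]


def build_parser():
    """Build a parser mirroring the documented main.py options."""
    parser = argparse.ArgumentParser(
        description="Evaluate a topic segmentation model on a dataset."
    )
    parser.add_argument("--model", required=True, choices=MODELS,
                        help="Segmentation model to evaluate")
    parser.add_argument("--dataset", required=True, choices=DATASETS,
                        help="Benchmark dataset to use")
    # Assumption: None means "use every sample in the dataset".
    parser.add_argument("--test_size", type=int, default=None,
                        help="Optional limit on the number of samples")
    return parser
```

Using `choices` makes the parser reject unknown model or dataset names with a usage message instead of failing later during evaluation.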

FastAPI MVP

Instructions for running the FastAPI MVP can be found in the README.md of the FastAPI directory.
