juliaxchen/Generating-Audio-Descriptions


Abstract

Audio descriptions are a form of narration that provides blind and low-vision individuals with information about the key visual elements of a video. This project develops a machine learning model that generates audio descriptions to improve the accessibility of videos. The model quantifies the complexity of each video frame by its JPEG-compressed file size, then performs hierarchical clustering on those sizes to identify the most representative frames. The selected frames are passed to the Contrastive Language-Image Pre-training (CLIP) Interrogator, which generates a description for each frame; the descriptions are then added to the video as both text and audio. The model's main limitations are lengthy processing time and inaccurate descriptions produced by the CLIP Interrogator.
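The frame-selection stage described above can be sketched in a small, self-contained way. This is a hedged illustration, not the repository's code: it uses zlib-compressed byte length as a stand-in for the project's JPEG file-size metric, synthetic byte buffers as stand-in "frames", and a minimal agglomerative (hierarchical) merge on the scalar complexity values rather than a full linkage implementation.

```python
import zlib
import random


def frame_complexity(frame_bytes: bytes) -> int:
    # Proxy for the project's JPEG-size metric: length of the
    # zlib-compressed frame (assumption; the repo uses JPG file size).
    return len(zlib.compress(frame_bytes))


def cluster_1d(sizes, k):
    # Minimal agglomerative clustering on scalar complexity values:
    # repeatedly merge the two clusters with the closest means
    # until only k clusters remain.
    clusters = [[i] for i in range(len(sizes))]

    def mean(c):
        return sum(sizes[i] for i in c) / len(c)

    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = abs(mean(clusters[a]) - mean(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def representative_frames(frames, k=3):
    # Score every frame, cluster the scores, and return one
    # frame index per cluster (the one nearest the cluster mean).
    sizes = [frame_complexity(f) for f in frames]
    reps = []
    for cluster in cluster_1d(sizes, k):
        m = sum(sizes[i] for i in cluster) / len(cluster)
        reps.append(min(cluster, key=lambda i: abs(sizes[i] - m)))
    return sorted(reps)


if __name__ == "__main__":
    random.seed(0)
    # Synthetic "frames": low-, medium-, and high-entropy buffers,
    # mimicking flat, moderately detailed, and busy scenes.
    frames = (
        [bytes(1000) for _ in range(5)]
        + [bytes(random.randrange(4) for _ in range(1000)) for _ in range(5)]
        + [bytes(random.randrange(256) for _ in range(1000)) for _ in range(5)]
    )
    reps = representative_frames(frames, k=3)
    print(reps)  # one frame index per complexity cluster
```

In the full pipeline, each selected frame would then be captioned by the CLIP Interrogator and the caption rendered as text and synthesized audio; those stages are omitted here because they depend on heavyweight model downloads.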

About

Generating audio descriptions for videos using Contrastive Language-Image Pre-training (CLIP) interrogation on frames selected through hierarchical clustering
