Audio descriptions are a form of narration that gives blind and low-vision individuals information about the key visual elements of a video. This project aims to develop a machine learning model that generates audio descriptions to improve video accessibility. The model quantifies the complexity of each video frame using its JPG image size, then performs hierarchical clustering on those sizes to identify the most representative frames. The selected frames are passed to the Contrastive Language-Image Pre-training (CLIP) Interrogator, which generates a description for each one. The descriptions are then added to the video as both text and audio. Known limitations include lengthy processing time and inaccurate descriptions from the CLIP Interrogator model.
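The frame-selection step can be illustrated with a minimal, standard-library-only sketch. It assumes the JPG size of every frame (in bytes) has already been measured, and it is not the project's actual code: the function names `cluster_sizes` and `representative_frames`, the centroid-distance merge rule, the fixed number of clusters, and the "closest to the cluster mean" selection rule are all assumptions made for illustration.

```python
from statistics import mean


def cluster_sizes(sizes, n_clusters):
    """Bottom-up (agglomerative) clustering of 1-D frame sizes.

    Hypothetical helper: repeatedly merges the two clusters whose mean
    sizes are closest until n_clusters remain. Returns a list of
    clusters, each a list of frame indices.
    """
    if not 1 <= n_clusters <= len(sizes):
        raise ValueError("n_clusters must be between 1 and len(sizes)")
    # Start with every frame in its own cluster.
    clusters = [[i] for i in range(len(sizes))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                centre_a = mean(sizes[i] for i in clusters[a])
                centre_b = mean(sizes[i] for i in clusters[b])
                distance = abs(centre_a - centre_b)
                if best is None or distance < best[0]:
                    best = (distance, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def representative_frames(sizes, n_clusters):
    """Pick one frame per cluster: the one closest to the cluster mean.

    Assumed selection rule; the source does not say how a representative
    frame is chosen within a cluster.
    """
    picks = []
    for members in cluster_sizes(sizes, n_clusters):
        centre = mean(sizes[i] for i in members)
        picks.append(min(members, key=lambda i: abs(sizes[i] - centre)))
    return sorted(picks)


# Made-up JPG sizes in bytes for nine frames, forming three visual groups.
frame_sizes = [100, 102, 101, 500, 505, 498, 900, 910, 904]
print(representative_frames(frame_sizes, n_clusters=3))
```

The pairwise search is cubic in the number of frames, so this sketch only suits short clips or sampled frames; a library implementation of hierarchical clustering would be the practical choice for full videos. The indices it returns identify the frames that would then be handed to the CLIP Interrogator.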
juliaxchen/Generating-Audio-Descriptions