Video-Temporal-Consistency-Analysis

A comprehensive framework for evaluating temporal coherence in multimodal foundation models, featuring novel metrics like CLIPGain and BERTScore, tested on benchmarks such as TOMATO and MSR-VTT.

Prerequisites

To set up the Python virtual environment, run the following commands (for Linux):

# Go to the root directory
python3 -m venv <env_name>
source <env_name>/bin/activate
pip install -r requirements.txt

Due to dependency issues after installing the packages from requirements.txt, also install the following packages:

pip install decord
pip install numpy==1.26.4
pip install wheel
pip install flash-attn
pip install git+https://github.com/huggingface/transformers
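
To confirm the pinned dependencies resolved correctly, here is a quick import check (a minimal sketch; it only verifies that the manually installed packages import and report the expected versions):

# illustrative sanity check, not part of the project
import numpy, decord, flash_attn, transformers

print("numpy:", numpy.__version__)          # the pin above expects 1.26.4
print("decord:", decord.__version__)
print("flash-attn:", flash_attn.__version__)
print("transformers:", transformers.__version__)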

You can also create a conda virtual environment:

conda env create -f environment.yml
conda activate video_vl_env

Clone this repository

git clone https://github.com/MayankD409/Video-Temporal-Consistency-Analysis.git
cd Video-Temporal-Consistency-Analysis

Download the videos and unzip them into the Video-Temporal-Consistency-Analysis directory.

After downloading the videos, your file structure should look like this:

.
├── data/
├── src/
├── videos/
│   ├── human/
│   ├── object/
│   ├── simulated/

Create a .env file in the root directory with the following format:

OPENAI_API_KEY="your_openai_api_key"
GEMINI_API_KEY="your_gemini_api_key"
REKA_API_KEY="your_reka_api_key"
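
These keys are read from the environment at runtime. As an illustration of how they might be loaded (assuming python-dotenv, a common choice; the project's own loading code in src/evaluate.py may differ):

# illustrative only -- checks that the three keys from .env are visible
import os
from dotenv import load_dotenv   # assumes python-dotenv is installed

load_dotenv()  # reads .env from the current working directory
for name in ("OPENAI_API_KEY", "GEMINI_API_KEY", "REKA_API_KEY"):
    print(f"{name}: {'set' if os.getenv(name) else 'MISSING'}")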

Create a pretrained folder to hold the downloaded pretrained models:

mkdir pretrained

Commands for running

Below are the commands to set up and run each model:

Standard command:

python src/evaluate.py --model $model_name --reasoning_type ALL --demonstration_type ALL --total_frames $total_frames
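
The --total_frames flag controls how many frames are sampled from each video before they are handed to the model. As an illustration of uniform sampling with decord (a sketch only; the actual sampling logic lives in src/evaluate.py and may differ):

# illustrative uniform frame sampling with decord
import numpy as np
from decord import VideoReader, cpu

def sample_frames(video_path, total_frames=8):
    """Return `total_frames` frames spaced evenly across the video."""
    vr = VideoReader(video_path, ctx=cpu(0))
    indices = np.linspace(0, len(vr) - 1, total_frames).astype(int)
    return vr.get_batch(indices).asnumpy()   # shape: (total_frames, H, W, 3)

frames = sample_frames("videos/human/example.mp4", total_frames=8)  # hypothetical path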

InternVL2 (working on a laptop)

First download the pretrained model:

cd pretrained
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-1B --local-dir InternVL2-1B
cd ..
# For InternVL2-1B
python src/evaluate.py --model InternVL2-1B --reasoning_type ALL --demonstration_type ALL --total_frames 8
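
For reference, InternVL2-1B checkpoints are normally loaded with trust_remote_code enabled, roughly as below (a sketch based on the model card's loading pattern; src/evaluate.py may wrap this differently):

# illustrative loading of the local InternVL2-1B checkpoint
import torch
from transformers import AutoModel, AutoTokenizer

path = "pretrained/InternVL2-1B"
model = AutoModel.from_pretrained(path, torch_dtype=torch.bfloat16,
                                  trust_remote_code=True).eval()   # custom modeling code ships with the checkpoint
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)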

Gemini (working on a laptop)

# For gemini-1.5-flash
python src/evaluate.py --model gemini-1.5-flash --reasoning_type ALL
# For gemini-1.5-pro
python src/evaluate.py --model gemini-1.5-pro --reasoning_type ALL
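
The Gemini runs send the sampled frames to the API using the key from .env. A minimal sketch with the google-generativeai client (illustrative; the project's request code may differ):

# illustrative Gemini request with frames as images
import os
import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

frames = [Image.open(f"frame_{i}.jpg") for i in range(8)]   # hypothetical frame files
response = model.generate_content(frames + ["What happens between the first and last frame?"])
print(response.text)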

Video-CCAM (working, but requires Nexus)

Make sure you have cloned the GitHub repo into the generate_lib folder. If not, here is the command:

git clone git@github.com:QQ-MM/Video-CCAM.git      

Download the pretrained model:

cd pretrained
# 4B
huggingface-cli download --resume-download --local-dir-use-symlinks False JaronTHU/Video-CCAM-4B-v1.1 --local-dir Video-CCAM-4B-v1.1
# Phi-3-mini
huggingface-cli download --resume-download --local-dir-use-symlinks False microsoft/Phi-3-mini-4k-instruct --local-dir Phi-3-mini-4k-instruct
# vision encoder
huggingface-cli download --resume-download --local-dir-use-symlinks False google/siglip-so400m-patch14-384 --local-dir siglip-so400m-patch14-384

Run the evaluation script:

python src/evaluate.py --model Video-CCAM-4B-v1.1 --reasoning_type ALL --total_frames 8

Qwen2-VL (working, but requires Nexus)

The Qwen2-VL code is included in the latest Hugging Face transformers release; we advise building transformers from source with:

pip install git+https://github.com/huggingface/transformers

First download the pretrained model:

cd pretrained
huggingface-cli download --resume-download --local-dir-use-symlinks False Qwen/Qwen2-VL-2B-Instruct --local-dir Qwen2-VL-2B-Instruct
cd ..
# For Qwen2-VL-2B-Instruct
python src/evaluate.py --model Qwen2-VL-2B-Instruct --reasoning_type ALL --demonstration_type ALL --total_frames 8
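
For reference, the local Qwen2-VL checkpoint loads with the standard transformers classes from the source build above (a sketch; src/evaluate.py may differ):

# illustrative loading of the local Qwen2-VL-2B-Instruct checkpoint
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

path = "pretrained/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(path, torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(path)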

VideoLLaMA2 (working, but requires Nexus)

Make sure you have cloned the GitHub repo into the generate_lib folder. If not, here is the command:

git clone git@github.com:DAMO-NLP-SG/VideoLLaMA2.git

Download the pretrained model:

cd pretrained
# video LLaMA 2 7B
huggingface-cli download --resume-download --local-dir-use-symlinks False DAMO-NLP-SG/VideoLLaMA2-7B --local-dir VideoLLaMA2-7B

Run the evaluation script:

python src/evaluate.py --model VideoLLaMA2-7B --reasoning_type ALL --total_frames 16

GPT (working, but requires paid API credits)

# For gpt-4-turbo-preview
python src/evaluate.py --model gpt-4-turbo-preview --reasoning_type ALL --total_frames 8
# For gpt-4o
python src/evaluate.py --model gpt-4o --reasoning_type ALL --total_frames 8
# For gpt-4o-mini
python src/evaluate.py --model gpt-4o-mini --reasoning_type ALL --total_frames 8
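
The GPT runs pass the sampled frames as base64-encoded images in a single chat request. A minimal sketch with the openai client (illustrative; the prompt and frame handling in src/evaluate.py may differ):

# illustrative multi-image request to gpt-4o
import base64, os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def encode(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

frame_paths = [f"frame_{i}.jpg" for i in range(8)]   # hypothetical frame files
content = [{"type": "text", "text": "Describe the order of events across these frames."}]
content += [{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode(p)}"}}
            for p in frame_paths]

response = client.chat.completions.create(model="gpt-4o",
                                           messages=[{"role": "user", "content": content}])
print(response.choices[0].message.content)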

Reka (working, but requires paid API credits)

Make sure you have added the Reka API key to .env.

# For reka-core-20240501
python src/evaluate.py --model reka-core-20240501 --reasoning_type ALL
# For reka-flash-20240226
python src/evaluate.py --model reka-flash-20240226 --reasoning_type ALL
# For reka-edge-20240208
python src/evaluate.py --model reka-edge-20240208 --reasoning_type ALL

Video-LLaVA (not working yet, but fixable)

Download the pretrained model:

cd pretrained
huggingface-cli download --resume-download --local-dir-use-symlinks False LanguageBind/Video-LLaVA-7B-hf --local-dir Video-LLaVA-7B-hf

Run the evaluation script:

python src/evaluate.py --model Video-LLaVA-7B-hf --reasoning_type ALL 
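
Since this path is marked fixable, one candidate route is the Hugging Face VideoLlava classes that the -hf checkpoint targets, roughly as below (an assumed loading path, not the project's current code):

# illustrative loading of the Video-LLaVA-7B-hf checkpoint
import torch
from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

path = "pretrained/Video-LLaVA-7B-hf"
model = VideoLlavaForConditionalGeneration.from_pretrained(path, torch_dtype=torch.float16, device_map="auto")
processor = VideoLlavaProcessor.from_pretrained(path)
# Inference would pass a prompt containing "<video>" plus a list of frames, e.g.:
# inputs = processor(text="USER: <video>\nWhat changes over time? ASSISTANT:",
#                    videos=list(frames), return_tensors="pt").to(model.device)
# output_ids = model.generate(**inputs, max_new_tokens=128)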

VILA (not working yet)

Make sure you have cloned the GitHub repo into the generate_lib folder. If not, here is the command:

git clone git@github.com:NVlabs/VILA.git

Download the pretrained model:

cd pretrained
# VILA1.5-13B
huggingface-cli download --resume-download --local-dir-use-symlinks False Efficient-Large-Model/VILA1.5-13b --local-dir VILA1.5-13b

Run the evaluation script:

python src/evaluate.py --model VILA1.5-13B --reasoning_type ALL --total_frames 8
