I’m a PhD researcher in Computer Science at the University of Houston, working on computer vision, multimodal AI, and video understanding.
Lately, I’ve been exploring how multimodal systems can make better sense of long videos, connect language with visuals, generate more useful summaries, and understand videos at the chapter level.
This GitHub profile is where I share projects and tools across multimodal summarization, vision-language systems, object detection, and practical ML workflows.
If any of this sounds interesting, feel free to reach out: dipayan1109033@gmail.com
Multimodal AI · Computer Vision · Video Understanding · Vision-Language Models · Multimodal Summarization · Visual Grounding
Python · SQL · PyTorch · Lightning · Transformers · OpenCV · Hydra · MLflow · DeepEval · Git · AWS