- [2025.01.15] We are excited to share that our evaluation datasets, Charades-CON and ActivityNet-CON, are now available on Hugging Face! 🎉 Additionally, the training annotations for VTune have also been released.
- [2025.01.14] We have released our four checkpoints using VTune: VideoLLaMA-7B-Charades-VTune, VideoLLaMA-7B-ActivityNet-VTune, TimeChat-7B-Charades-VTune, TimeChat-7B-ActivityNet-VTune. Additionally, checkpoints with naive fine-tuning have been released: VideoLLaMA-7B-Charades-FT, VideoLLaMA-7B-ActivityNet-FT, TimeChat-7B-ActivityNet-FT.
- [2024.11.20] Our paper has been released on arXiv.
- We study the consistency of Video-LLMs in temporal comprehension by assessing whether their responses align with their initial grounding, using dedicated probes and datasets. We specifically focus on video temporal grounding, where the task is to identify the timestamps in a video that correspond to a language query.
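As a quick illustration, below is a minimal sketch of loading one of the released evaluation sets from Hugging Face and inspecting a grounding record. The repository ID and field names are hypothetical placeholders, not the actual ones; please check the dataset cards for the real identifiers and schema.

```python
# Minimal sketch (assumptions): the Hugging Face repo ID and field names below are
# hypothetical placeholders -- see the actual dataset cards for the real ones.
from datasets import load_dataset

# Load the Charades-CON consistency evaluation split (repo ID assumed).
charades_con = load_dataset("your-org/Charades-CON", split="test")

# Each example is assumed to pair a video, a language query, and its ground-truth
# temporal segment; a consistency probe rephrases or perturbs the query and checks
# whether the predicted (start, end) timestamps stay aligned with the original grounding.
example = charades_con[0]
print(example["video_id"], example["query"], example["start"], example["end"])
```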
If you find our work useful, please consider citing our paper:
@article{jung2024consistency,
  title={On the Consistency of Video Large Language Models in Temporal Comprehension},
  author={Jung, Minjoon and Xiao, Junbin and Zhang, Byoung-Tak and Yao, Angela},
  journal={arXiv preprint arXiv:2411.12951},
  year={2024}
}
We are grateful for the following awesome Video-LLMs: