VidIL: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners (NeurIPS 2022)
- Download the dataset annotations zip from box or google drive, then unzip it under shared_datasets/. The resulting shared_datasets folder structure is expected to be:
shared_datasets
├── README.md
├── MSRVTT_caption
├── MSRVTT_qa
...
Then, please refer to Dataset Instruction for downloading and processing raw videos.
- Download BLIP checkpoints:
bash download_blip_checkpoints.sh
- Download the Input & Output Examples zip from box or google drive. Unzip the folders under output_example/. The resulting output_example/ folder structure is expected to be:
output_example
├── msrvtt
├── msvd_test
├── vlep_test
└── README.md
- [Update 6/17] GPT-3 results for Video Captioning, Video Question Answering and VLEP can be downloaded here.
- Launch the docker environment:
  - (1) Set the variables "CKPT" and "DATASETS" as described in the comments in run_docker_vidil.sh.
  - (2) Run the docker image:
    bash run_docker_vidil.sh
- Set up GPU devices: within the docker image, set the following environment variables to configure the GPU devices:
export N_GPU=<num of gpus>
export CUDA_VISIBLE_DEVICES=<0,1,2...>
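The two variables presumably need to agree with each other. The Python sketch below checks this before a long run starts. It assumes that N_GPU is meant to equal the number of device ids listed in CUDA_VISIBLE_DEVICES; that reading and the helper name check_gpu_env are illustrative assumptions, not part of the VidIL code.

```python
import os


def check_gpu_env(env=None):
    """Return the visible device ids if N_GPU matches their count.

    Assumption: N_GPU should equal the number of ids listed in
    CUDA_VISIBLE_DEVICES. This helper is illustrative, not part of VidIL.
    """
    env = os.environ if env is None else env
    n_gpu = int(env["N_GPU"])
    # Split "0,1,2" into ["0", "1", "2"], ignoring stray spaces and empty items.
    devices = [d.strip() for d in env["CUDA_VISIBLE_DEVICES"].split(",") if d.strip()]
    if n_gpu != len(devices):
        raise ValueError(
            f"N_GPU={n_gpu} but CUDA_VISIBLE_DEVICES lists {len(devices)} device(s)"
        )
    return devices


if __name__ == "__main__":
    # Explicit values are used here instead of the real environment.
    print(check_gpu_env({"N_GPU": "2", "CUDA_VISIBLE_DEVICES": "0,1"}))
```

Calling check_gpu_env() with no argument reads the real environment inside the docker image.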
- [Update 6/15] Quick Start with generated video representation: Frame captions and visual tokens for five datasets can be downloaded here if you don't want to go through the entire pipeline. You can copy the json files following the data structure as mentioned below.
The following scripts run the entire pipeline, which (1) generates frame captions; (2) generates visual tokens; and (3) generates few-shot prompts ready for GPT-3. The output folder has the following structure:
{dataset_split}
├── frame_caption
│ ├── config.yaml # config for frame captioning
│ ├── video_text_Cap.json # frame captions w/o filtering
│ ├── video_text_CapFilt.json # frame captions w/ filtering
├── input_prompts
│ ├── {output_name}.jsonl # few-shot prompts for GPT-3
│ ├── {output_name}__idx_2_videoid.json # line idx to video id
│ ├── {output_name}__chosen_samples.json # chosen examples in the support set
│ ...
├── visual_tokenization_{encoder_name}
│ ├── config.yaml # config for visual tokenization
│ └── visual_tokens.json # raw visual tokens of each frame
└──
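After a pipeline run, it can help to confirm that the expected files exist before building prompts. The following is a minimal Python sketch of such a check. The function name missing_outputs is hypothetical, and the default encoder name "clip" is taken from the example directory output_example/msrvtt/visual_tokenization_clip. The {output_name} prompt files are not checked because their names vary per run.

```python
from pathlib import Path


def missing_outputs(split_dir, encoder_name="clip"):
    """List expected pipeline outputs that are absent under split_dir.

    Only entries with fixed names in the tree above are checked.
    Returned paths are relative to split_dir and use "/" separators.
    """
    split_dir = Path(split_dir)
    tokens_dir = Path(f"visual_tokenization_{encoder_name}")
    expected = [
        Path("frame_caption") / "config.yaml",
        Path("frame_caption") / "video_text_Cap.json",
        Path("frame_caption") / "video_text_CapFilt.json",
        Path("input_prompts"),
        tokens_dir / "config.yaml",
        tokens_dir / "visual_tokens.json",
    ]
    return [p.as_posix() for p in expected if not (split_dir / p).exists()]


if __name__ == "__main__":
    # "outputs/msrvtt_test" is a placeholder for a real {dataset_split} folder.
    for path in missing_outputs("outputs/msrvtt_test"):
        print("missing:", path)
```

An empty list means every fixed-name entry of the tree was found.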
All scripts should be run from the /src directory, namely the root directory after launching the docker image. The following are examples of running the pipeline with in-context example selection for some datasets. For additional notes on running the pipeline scripts, please refer to Pipeline Instruction.
Because the few-shot support set is sampled from the training set, the first time you run the pipeline for a dataset you need to perform frame captioning and visual tokenization on its training set.
For <dataset> in ["msrvtt","youcook2","vatex","msvd","vlep"]:
bash pipeline/scripts/run_frame_captioning_and_visual_tokenization.sh <dataset> train <output_root>
Examples of the frame caption and visual token directories can be found at output_example/msrvtt/frame_caption and output_example/msrvtt/visual_tokenization_clip.
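To prepare the training sets of several datasets in one go, the per-dataset command above can be generated in a loop. The Python sketch below only builds the command strings and does not run them. The function name train_commands and the placeholder "outputs" for <output_root> are assumptions made for illustration.

```python
import shlex

DATASETS = ["msrvtt", "youcook2", "vatex", "msvd", "vlep"]
SCRIPT = "pipeline/scripts/run_frame_captioning_and_visual_tokenization.sh"


def train_commands(output_root, datasets=DATASETS):
    """Build one training-split command per dataset, without running it."""
    return [
        shlex.join(["bash", SCRIPT, dataset, "train", output_root])
        for dataset in datasets
    ]


if __name__ == "__main__":
    # "outputs" is a placeholder for <output_root>.
    for command in train_commands("outputs"):
        print(command)
```

The printed lines can be pasted into a shell at /src, or saved to a script and run one dataset at a time.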
For <dataset> in ["msrvtt","youcook2","vatex"]:
- (1) Run the Standalone Frame Captioning and Visual Tokenization pipeline for the chosen <dataset>.
- (2) Run the pipeline for generating video captioning prompts for <dataset>, with <split> in ["train","val","test"]:
  - w/o ASR:
    bash pipeline/scripts/generate_gpt3_query_pipeline_caption_with_in_context_selection.sh <dataset> <split> <output_root> 10 42 5 caption
  - w/ ASR:
    bash pipeline/scripts/generate_gpt3_query_pipeline_caption_with_in_context_selection_with_asr.sh <dataset> <split> <output_root> 10 42 5 caption_asr
An example of the output prompt jsonl can be found at output_example/msrvtt/input_prompts/temp_0.0_msrvtt_caption_with_in_context_selection_clip_shot_10_seed_42_N_5.jsonl.
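The two captioning commands differ only in the script name and the final argument, so they can be assembled from a few named parameters. The Python sketch below does this. The parameter names shot, seed and n are an inference from the example file name ending in shot_10_seed_42_N_5.jsonl and are not confirmed by the text, so treat the scripts themselves as the authority. The function name caption_prompt_command is hypothetical.

```python
CAPTION_DATASETS = {"msrvtt", "youcook2", "vatex"}
SPLITS = {"train", "val", "test"}
SCRIPT_DIR = "pipeline/scripts"


def caption_prompt_command(dataset, split, output_root,
                           shot=10, seed=42, n=5, with_asr=False):
    """Return the argument list for the captioning prompt script.

    Assumption: the three numeric arguments are the shot count, the random
    seed and N, as suggested by the example output file name.
    """
    if dataset not in CAPTION_DATASETS:
        raise ValueError(f"unsupported captioning dataset: {dataset}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split}")
    # The ASR variant uses a different script and a different last argument.
    suffix = "_with_asr" if with_asr else ""
    script = (
        f"{SCRIPT_DIR}/generate_gpt3_query_pipeline_caption"
        f"_with_in_context_selection{suffix}.sh"
    )
    mode = "caption_asr" if with_asr else "caption"
    return ["bash", script, dataset, split, output_root,
            str(shot), str(seed), str(n), mode]


if __name__ == "__main__":
    # "outputs" is a placeholder for <output_root>.
    print(" ".join(caption_prompt_command("msrvtt", "test", "outputs")))
```

The returned list can be passed to subprocess.run(..., check=True) from /src. Validating the dataset and split up front catches typos before the pipeline starts.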
For <dataset> in ["msrvtt","msvd"]:
- (1) Run the Standalone Frame Captioning and Visual Tokenization pipeline for the chosen <dataset>.
- (2) Run the pipeline for generating video question answering prompts for <dataset>, with <split> in ["train","val","test"]:
  bash pipeline/scripts/generate_gpt3_query_pipeline_qa_with_in_context_selection.sh <dataset> <split> <output_root> 5 42 5 question
An example of the output prompt jsonl can be found at output_example/msvd_test/input_prompts/temp_0.0_gpt3_queries_msvd_qa_clip_shot_5_seed_42.jsonl.
- (1) Run the Standalone Frame Captioning and Visual Tokenization pipeline for vlep.
- (2) Run the pipeline for generating vlep prompts:
  bash pipeline/scripts/generate_gpt3_query_pipeline_vlep_with_random_context_asr_multichoice.sh <dataset> <split> <output_root> 10 42
An example of the output prompt jsonl can be found at output_example/vlep_test/input_prompts/temp_0.0_vlep_test_clip_shot_10_seed_42_multichoice.jsonl.
For the semi-supervised setting, we first generate pseudo labels on the training set and then train BLIP on the pseudo-labeled dataset for retrieval.
- (1) Generate the pseudo-labeled training set annotation json: suppose the raw GPT-3 responses are stored at <gpt3_response_dir> and the input prompts are at <input_prompts_dir>; run:
python utils_gpt3/process_gpt3_response.py --gpt3_response_dir <gpt3_response_dir> --input_prompts_dir <input_prompts_dir> --output_dir <processed_response_dir>
python utils_gpt3/gpt3_response_to_jsonl.py --dataset <dataset_name> --gpt3_processed_dir <processed_response_dir> --output_dir <pseudo_label_ann_output_dir>
Examples of <gpt3_response_dir>, <input_prompts_dir>, <processed_response_dir> and <pseudo_label_ann_output_dir> can be found at output_example/msrvtt/gpt3_response, output_example/msrvtt/input_prompts, output_example/msrvtt/processed_response_dir and output_example/msrvtt/pseudo_label_ann.
- (2) Finetune pretrained BLIP on the pseudo-labeled data: for <dataset> in ["msrvtt","vatex"], set the field train_ann_jsonl in configs/train_blip_video_retrieval_<dataset>_pseudo.yaml to the path of the output jsonl from step (1) in <pseudo_label_ann_output_dir>. Then run:
bash scripts/train_caption_video.sh train_blip_video_retrieval_<dataset>_pseudo
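Setting the train_ann_jsonl field by hand is easy to get wrong when switching between datasets, so the edit can also be scripted. The Python sketch below rewrites the field in the YAML text with a regular expression, which avoids a YAML library dependency. It assumes the field sits on a single line of the form "train_ann_jsonl: <value>" and that the new path needs no YAML quoting. The function name set_train_ann_jsonl is hypothetical.

```python
import re

# Matches a whole "train_ann_jsonl: ..." line and captures the key part.
FIELD = re.compile(r"^([ \t]*train_ann_jsonl[ \t]*:).*$", flags=re.MULTILINE)


def set_train_ann_jsonl(yaml_text, jsonl_path):
    """Return yaml_text with the train_ann_jsonl value replaced.

    Assumptions: the field occupies exactly one line, and jsonl_path is a
    plain path that is valid as an unquoted YAML scalar.
    """
    # A function is used as the replacement so that backslashes in the
    # path are inserted literally instead of being read as escapes.
    new_text, count = FIELD.subn(lambda m: f"{m.group(1)} {jsonl_path}", yaml_text)
    if count != 1:
        raise ValueError(f"expected exactly one train_ann_jsonl field, found {count}")
    return new_text
```

A typical use is to read the config file, pass its text through set_train_ann_jsonl, and write the result back before launching scripts/train_caption_video.sh. All other lines, including comments, are left unchanged.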
Scripts for evaluating generation results from GPT-3:
- Video Captioning: please refer to the example written in the script for more details about the required inputs
bash scripts/evaluation/eval_caption_from_gpt3_response.sh
- Question Answering: please refer to the example written in the script for more details about the required inputs
bash scripts/evaluation/eval_qa_from_gpt3_response.sh
- VLEP:
  - (1) Get the processed GPT-3 response. Examples of <gpt3_response_dir>, <input_prompts_dir> and <processed_response_dir> can be found at output_example/vlep_test/gpt3_response, output_example/vlep_test/input_prompts and output_example/vlep_test/gpt3_response_processed.
    python utils_gpt3/process_gpt3_response.py --gpt3_response_dir <gpt3_response_dir> --input_prompts_dir <input_prompts_dir> --output_dir <processed_response_dir>
  - (2) Run the following script to generate the output in the official format for CodaLab submission. An example of the output jsonl can be found at output_example/vlep_test/evaluation/temp_0.0_vlep_test_clip_shot_10_seed_42_multichoice_eval.jsonl.
    python eval_vlep.py --gpt3_processed_response <processed_response_json> --output_path <output_jsonl_path>
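The two VLEP steps must run in order, since the second consumes the output of the first. The Python sketch below chains them, using only the script names and flags shown above. The function names vlep_eval_commands and run_all are hypothetical. How <processed_response_json> relates to <processed_response_dir> is not stated here, so both are passed explicitly.

```python
import subprocess


def vlep_eval_commands(gpt3_response_dir, input_prompts_dir,
                       processed_response_dir, processed_response_json,
                       output_jsonl_path):
    """Return the two VLEP evaluation commands in the order they must run."""
    process = [
        "python", "utils_gpt3/process_gpt3_response.py",
        "--gpt3_response_dir", gpt3_response_dir,
        "--input_prompts_dir", input_prompts_dir,
        "--output_dir", processed_response_dir,
    ]
    evaluate = [
        "python", "eval_vlep.py",
        "--gpt3_processed_response", processed_response_json,
        "--output_path", output_jsonl_path,
    ]
    return [process, evaluate]


def run_all(commands, dry_run=True):
    """Run commands in order; check=True stops at the first failure."""
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
```

With dry_run=True the commands are only printed, which is a cheap way to review the paths before spending time on a real run from /src.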
-
@article{wang2022language,
title={Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners},
author={Wang, Zhenhailong and Li, Manling and Xu, Ruochen and Zhou, Luowei and Lei, Jie and Lin, Xudong and Wang, Shuohang and Yang, Ziyi and Zhu, Chenguang and Hoiem, Derek and others},
journal={arXiv preprint arXiv:2205.10747},
year={2022}
}
The implementation of VidIL relies on resources from BLIP, ALPRO, and transformers. We thank the original authors for their open-sourced code and encourage users to cite their works when applicable.