For the VIST dataset, extracted image regions can be obtained from RoViST-VG.
For AESOP, VWP, and other custom datasets, image regions can be extracted using the Faster R-CNN model (with a ResNet-101 backbone) trained on Visual Genome data (code).
To evaluate the sequence(s) of interest, the metric needs a mapping between the corresponding image ids and the extracted image region bounding boxes. For the three visual storytelling datasets, the mapping is available at the respective links:
- VIST: mapping info file
- AESOP (test set only): mapping info file
- VWP: mapping info file
For new/custom datasets, a similar mapping file can be created from the information produced during the image region extraction step.
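
As a rough illustration of creating such a mapping for a custom dataset, the sketch below collects bounding boxes per image id and writes them to a JSON file. The input shape (pairs of image id and a list of `[x1, y1, x2, y2]` boxes), the function names, and the JSON layout are all assumptions made for this example; check the provided VIST/AESOP/VWP mapping files for the layout the metric actually expects and adapt accordingly.

```python
import json
from pathlib import Path


def build_region_mapping(detections):
    """Build an image-id -> bounding-boxes dict.

    `detections` is assumed (hypothetically) to be an iterable of
    (image_id, boxes) pairs, where each box is [x1, y1, x2, y2].
    """
    mapping = {}
    for image_id, boxes in detections:
        # Use string keys so the dict serializes cleanly to JSON.
        mapping[str(image_id)] = [[float(v) for v in box] for box in boxes]
    return mapping


def save_mapping(mapping, path):
    """Write the mapping to `path` as indented JSON."""
    Path(path).write_text(json.dumps(mapping, indent=2))


# Hypothetical detector output for two images.
example = [
    ("img_001", [[0, 0, 10, 20], [5, 5, 15, 25]]),
    ("img_002", [[3, 4, 30, 40]]),
]
region_mapping = build_region_mapping(example)
```

Calling `save_mapping(region_mapping, "custom_mapping.json")` would then write the file; the file name here is only a placeholder.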
To connect sequences to their images, the metric needs a mapping between story/scene ids and the respective image ids. For the VIST and VWP datasets, the mapping is available at the respective links:
- VIST: story id to image ids
- AESOP: not required, since all sequences are made up of 3 images and all image ids follow a defined namespace.
- VWP: story id to image ids
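
For a custom dataset, a story id to image ids mapping could be assembled along the following lines. This is a minimal sketch: the record format (story id, position in the sequence, image id) and the function name are assumptions for illustration, and the output layout should be matched against the provided VIST or VWP files before use.

```python
from collections import defaultdict


def build_story_mapping(records):
    """Group image ids by story id, ordered by position in the story.

    `records` is assumed (hypothetically) to be an iterable of
    (story_id, position, image_id) tuples.
    """
    grouped = defaultdict(list)
    for story_id, position, image_id in records:
        grouped[str(story_id)].append((position, str(image_id)))
    # Sort each story's images by position and drop the position field.
    return {
        story_id: [image_id for _, image_id in sorted(items)]
        for story_id, items in grouped.items()
    }


# Hypothetical records, deliberately out of order.
records = [
    ("story_a", 1, "img_002"),
    ("story_a", 0, "img_001"),
    ("story_b", 0, "img_010"),
]
story_mapping = build_story_mapping(records)
```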
After obtaining the data needed for I, II, and III, make the necessary changes to the configuration file.