Adapting Image-based Vision-Language Models to Videos
We adapt an image-based vision-language model (e.g. PaLI-3) to the video domain in two stages.
Stage (1): visual adaptation, where we freeze the language component while fine-tuning the visual part on a relatively large video dataset with short captions (e.g. Spoken Moments in Time); the freezing schedule for both stages is sketched below.
Stage (2): language adaptation, where we freeze the visual part while instruction-tuning the language component on a smaller video dataset with detailed captions. In our experiments, we generated instruction-answer pairs from Video Localized Narratives by prompting PaLM-2.
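The two stages differ only in which submodule is trainable. Below is a minimal PyTorch-style sketch of this freezing schedule; the submodule names `vision_encoder` and `language_model`, the optimizer choice, and the learning rate are illustrative assumptions, not PaLI-3's actual API or our exact hyperparameters.

```python
# Sketch of the two-stage freezing schedule (PyTorch-style).
# `model.vision_encoder` and `model.language_model` are illustrative
# names for the visual and language components.
import torch


def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze all parameters of a submodule."""
    for p in module.parameters():
        p.requires_grad = trainable


def configure_stage(model: torch.nn.Module, stage: int) -> torch.optim.Optimizer:
    if stage == 1:
        # Stage 1: visual adaptation — train the visual part, freeze the language part.
        set_trainable(model.vision_encoder, True)
        set_trainable(model.language_model, False)
    elif stage == 2:
        # Stage 2: language adaptation — freeze the visual part, tune the language part.
        set_trainable(model.vision_encoder, False)
        set_trainable(model.language_model, True)
    else:
        raise ValueError(f"unknown stage: {stage}")

    # Only the unfrozen parameters are handed to the optimizer.
    trainable_params = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable_params, lr=1e-5)  # lr is an assumed placeholder
```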
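For Stage 2 data, the instruction-answer pairs are obtained by prompting an LLM with the detailed captions. The template below is a hypothetical illustration of how such a prompt could be constructed from a Video Localized Narratives caption; the exact wording we used and the PaLM-2 API call are not shown.

```python
# Hypothetical prompt construction for generating instruction-answer pairs
# from a detailed video caption. The template wording is an assumption;
# the LLM call (PaLM-2 in our setup) is omitted.
PROMPT_TEMPLATE = """You are given a detailed description of a video.
Write {n} instruction-answer pairs that can be answered from the
description alone. Format each pair as:
Q: <instruction>
A: <answer>

Description:
{caption}
"""


def build_prompt(caption: str, n: int = 3) -> str:
    """Fill the template with a caption and the desired number of pairs."""
    return PROMPT_TEMPLATE.format(n=n, caption=caption)
```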