Distilling Vision-Language Models on Millions of Videos

CVPR 2024

Google Research · UT Austin

Abstract

The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video-instruction-tuned model (VIIT) is then used to auto-label millions of videos to generate high-quality captions.

We show that the adapted video-language model performs well on a wide range of video-language benchmarks. For instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. In addition, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%.

Overview

Our video-language model takes a video along with any form of instruction as input and generates text according to the instruction. It generates textual descriptions with multiple granularities, including static appearance, general action, and detailed body movements.
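Concretely, the interaction can be pictured as a single generation call conditioned on sampled video frames and an instruction string. The sketch below is purely illustrative; vlm, load_video_frames, and the prompts are hypothetical placeholders rather than a released interface.

# Illustrative only: vlm and load_video_frames are hypothetical placeholders.
frames = load_video_frames("example.mp4", num_frames=8)

# The same model answers instructions at several levels of granularity.
appearance = vlm.generate(frames, "Describe the static appearance of the scene.")
action     = vlm.generate(frames, "What general action is happening in the video?")
movements  = vlm.generate(frames, "Describe the person's body movements in detail.")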

[Figure: method overview]

Adapting Image-based Vision-Language Models to Videos

We adapt an image-based vision-language model (e.g. PaLI-3) to the video domain in two stages.

Stage (1): visual adaptation, where we freeze the language component and fine-tune the visual part on a relatively large video dataset with short captions (e.g. Spoken Moments in Time);

Stage (2): language adaptation, where we instruction-tune the language component while freezing the visual part on a smaller video dataset with detailed captions. In our experiments, we generate instruction-answer pairs from Video Localized Narratives by prompting PaLM-2.
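A minimal PyTorch-style sketch of this two-stage schedule is shown below. VideoLanguageModel, smit_loader, and vidln_instruction_loader are hypothetical placeholders standing in for the actual model and data pipelines; the sketch only illustrates which component is frozen at each stage, not the exact training recipe.

import torch

# Hypothetical wrapper exposing separate visual and language components.
model = VideoLanguageModel.from_image_vlm("pali-3")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_stage(loader, train_visual, train_language):
    # Freeze or unfreeze each component for the current stage.
    for p in model.visual_encoder.parameters():
        p.requires_grad = train_visual
    for p in model.language_model.parameters():
        p.requires_grad = train_language
    for videos, texts in loader:
        loss = model(videos, texts).loss  # captioning / instruction-following loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Stage 1: visual adaptation on short-caption videos (e.g. Spoken Moments in Time).
train_stage(smit_loader, train_visual=True, train_language=False)

# Stage 2: language adaptation on instruction-answer pairs derived from
# Video Localized Narratives by prompting PaLM-2.
train_stage(vidln_instruction_loader, train_visual=False, train_language=True)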

[Figure: two-stage training]

Evaluating the Video-Language Model

We evaluate the adapted video-language model against the PaLI-3 baseline and prior state-of-the-art methods. We focus on zero-shot performance, applying the model to the test splits of downstream tasks without any further tuning.

[Figure: PaLI-3 zero-shot comparison]

Harnessing the Distilled Pseudo-Captions

We demonstrate the quality of the distilled pseudo-captions by pre-training a video-language dual-encoder on them and measuring its zero-shot video understanding performance, which serves as a solid indicator of the captions' quality.
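As a rough picture of this step, the dual-encoder can be pre-trained with a standard symmetric (CLIP-style) contrastive loss over video/pseudo-caption pairs. The video_encoder, text_encoder, and pseudo_caption_loader names below are hypothetical placeholders, and the fixed temperature is an assumption rather than the paper's exact setting.

import torch
import torch.nn.functional as F

temperature = 0.07  # assumed constant; a learned temperature is also common

params = list(video_encoder.parameters()) + list(text_encoder.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4)

for videos, captions in pseudo_caption_loader:  # captions auto-generated by the VLM
    # Embed both modalities and L2-normalize.
    v = F.normalize(video_encoder(videos), dim=-1)   # (B, D)
    t = F.normalize(text_encoder(captions), dim=-1)  # (B, D)
    logits = v @ t.T / temperature                   # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)
    # Symmetric InfoNCE: match each video to its pseudo-caption and vice versa.
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()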

[Figures: dual-encoder results on VideoCC and InternVid]

BibTeX

@article{zhao2024viit,
  author    = {Zhao, Yue and Zhao, Long and Zhou, Xingyi and Wu, Jialin and Chu, Chun-Te and Miao, Hui and Schroff, Florian and Adam, Hartwig and Liu, Ting and Gong, Boqing and Krähenbühl, Philipp and Yuan, Liangzhe},
  title     = {Distilling Vision-Language Models on Millions of Videos},
  journal   = {arXiv preprint arXiv:2401.06129},
  year      = {2024},
}