Published On: Sat, Aug 6th, 2022

Expanding Language-Image Pretrained Models for General Video Recognition


Video recognition is used for numerous vision applications, such as micro-video recommendation, sports video analysis, or autonomous driving. Language-image pretraining has shown great potential in addressing this task. However, directly training a language-video model requires large-scale video-text pretraining data.


A recent paper on arXiv.org proposes a new architecture for video temporal modeling, built around a novel cross-frame communication attention. The attention module is lightweight and efficient, and it can be plugged seamlessly into existing language-image pretrained models.
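
To make the idea of exchanging information across frames more concrete, here is a minimal sketch in plain Python. It is not the paper's module. It assumes each frame is summarized by a single vector (called a "message" here, a hypothetical name) and mixes those vectors over time with scaled dot-product self-attention using identity projections, where a real model would learn the projections. All function names and toy values are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_frame_attention(frame_messages):
    """Mix per-frame summary vectors with scaled dot-product self-attention.

    frame_messages: list of T vectors (one per frame), each of length D.
    Assumption: queries, keys and values are the messages themselves
    (identity projections); a trained model would use learned projections.
    Returns the attention weights (T x T) and the updated messages (T x D).
    """
    dim = len(frame_messages[0])
    scale = 1.0 / math.sqrt(dim)
    weights, updated = [], []
    for query in frame_messages:
        # How strongly this frame attends to every frame in the clip.
        row = softmax([dot(query, key) * scale for key in frame_messages])
        weights.append(row)
        # Weighted average of all frame messages (the cross-frame context).
        mixed = [
            sum(w * value[d] for w, value in zip(row, frame_messages))
            for d in range(dim)
        ]
        # Residual connection: keep the frame's own content, add the context.
        updated.append([q + m for q, m in zip(query, mixed)])
    return weights, updated


# Toy clip: 3 frames with 4-dimensional messages (made-up values).
messages = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
weights, updated = cross_frame_attention(messages)
```

In this toy clip the first two frames are similar, so the first frame attends more strongly to the second frame than to the third. Because the sketch only touches one vector per frame, its cost grows with the number of frames rather than with the number of image patches, which is one plausible reason such a module can stay lightweight.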

The researchers also design a video-specific prompting technique that automatically yields an instance-level textual representation for each video. Experiments demonstrate the superiority and good generalization ability of the proposed method under various learning configurations.
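
One way to picture video-specific prompting is to let the text embedding of a class name look at the video's frame features and absorb a small amount of that content. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function name `video_conditioned_prompt`, the mixing weight `alpha`, and the toy embeddings are all hypothetical, and the attention uses no learned parameters.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def video_conditioned_prompt(text_embedding, frame_features, alpha=0.1):
    """Return a text embedding adjusted by the content of one video.

    text_embedding: vector of length D for a class name.
    frame_features: list of T vectors of length D, one per frame.
    alpha: hypothetical mixing weight for the video-derived context.
    Assumption: the text vector is the query and the frame features act
    as both keys and values, with no learned projections.
    """
    dim = len(text_embedding)
    scale = 1.0 / math.sqrt(dim)
    scores = [
        sum(t * f for t, f in zip(text_embedding, frame)) * scale
        for frame in frame_features
    ]
    attention = softmax(scores)
    # Video context: attention-weighted average of the frame features.
    context = [
        sum(w * frame[d] for w, frame in zip(attention, frame_features))
        for d in range(dim)
    ]
    # Residual update: the class text stays dominant, the video nudges it.
    return [t + alpha * c for t, c in zip(text_embedding, context)]


# Same class text, two different toy videos (made-up values).
class_text = [1.0, 0.0, 0.0]
video_a = [[0.8, 0.2, 0.0], [0.7, 0.3, 0.0]]
video_b = [[0.1, 0.0, 0.9], [0.0, 0.1, 0.8]]
prompt_a = video_conditioned_prompt(class_text, video_a)
prompt_b = video_conditioned_prompt(class_text, video_b)
```

The same class name produces a different vector for each video, which is the sense in which the textual representation becomes instance-level. Setting `alpha` to zero recovers the original, video-independent text embedding.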

Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data, demonstrating remarkable “zero-shot” generalization ability for various image tasks. However, how to effectively expand such new language-image pretraining methods to video domains is still an open problem. In this work, we present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly, instead of pretraining a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. Such a module is lightweight and can be plugged into pretrained language-image models seamlessly. Moreover, we propose a video-specific prompting scheme, which leverages video content information for generating discriminative textual prompts. Extensive experiments demonstrate that our approach is effective and can be generalized to different video recognition scenarios. In particular, under fully-supervised settings, our approach achieves a top-1 accuracy of 87.1% on Kinetics-400, while using 12 times fewer FLOPs compared with Swin-L and ViViT-H. In zero-shot experiments, our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols. In few-shot scenarios, our approach outperforms previous best methods by +32.1% and +23.1% when the labeled data is extremely limited. Code and models are available via the link on the paper's arXiv page.
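
For readers unfamiliar with zero-shot recognition in contrastively trained language-image models, the usual recipe is to embed the candidate class names as text, embed the visual input, and pick the class whose text embedding has the highest cosine similarity. The sketch below shows that recipe for a video in plain Python. The embeddings are made-up toy values, the class names are arbitrary examples, and averaging frame embeddings into one video embedding is a simplifying assumption rather than the paper's temporal model.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def classify(video_embedding, class_text_embeddings):
    """Return the best-matching class name and all similarity scores."""
    scores = {
        name: cosine_similarity(video_embedding, embedding)
        for name, embedding in class_text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy per-frame embeddings for one clip (made-up values).
frame_embeddings = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
# Assumption: average over frames to get a single video embedding.
video_embedding = [
    sum(column) / len(frame_embeddings) for column in zip(*frame_embeddings)
]

# Toy text embeddings for two candidate class names (made-up values).
class_text_embeddings = {
    "playing guitar": [1.0, 0.0, 0.0],
    "riding a bike": [0.0, 1.0, 0.0],
}
label, scores = classify(video_embedding, class_text_embeddings)
```

No classifier is trained for the candidate classes here. Adding a new category only requires a text embedding for its name, which is why this setup can be evaluated on classes that were never labeled during training.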

Research article: Ni, B., et al., “Expanding Language-Image Pretrained Models for General Video Recognition”, 2022. Link: https://arxiv.org/abs/2208.02816



