
Improving the generalization of ViTs for action understanding with VLM pre-training

Research output: Contribution to journal › Review article › peer-review

Abstract

Owing to their ability to extract powerful video embeddings, Vision Transformers (ViTs) are currently the best-performing models in video action understanding. However, when these models are frozen and applied to downstream tasks, their performance drops significantly, revealing limited generalization. In this paper, we describe the Four-Tiered Prompts (FTP) framework, which introduces feature processors to transform the ViT's output. In a pre-training stage, each feature processor is trained using contrastive learning to align the ViT's visual embeddings with a vision language model's (VLM) textual embeddings. We use four feature processors, each linked to the output of a VLM prompt that reflects one of the fundamental aspects of human action: category, components, description, and context. With the FTP framework, we increase the ViT's generalization ability by forcing the visual encoder to incorporate relevant semantic information. Importantly, we employ the VLM only during training, so inference incurs a limited computational cost. For video action recognition and detection, employing the FTP framework consistently yields state-of-the-art performance after fine-tuning. Extensive experiments demonstrate how different design choices contribute to the overall increase in performance.
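
The pre-training objective described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact loss, the architecture of the feature processors, the temperature, or how the four per-tier losses are combined, so everything below is an assumption made for illustration: a symmetric InfoNCE loss, one plain linear map per tier, a temperature of 0.07, and an unweighted sum over tiers. The names `TIERS`, `linear`, `info_nce`, and `ftp_pretraining_loss` are hypothetical and do not come from the paper.

```python
import math

# The four prompt tiers named in the abstract.
TIERS = ("category", "components", "description", "context")


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def linear(weights, v):
    """Hypothetical feature processor: a single linear map (assumption)."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def info_nce(visual, text, temperature=0.07):
    """Symmetric InfoNCE: item i of `visual` is the positive for item i of `text`.

    The temperature value is an assumption, not a value from the paper.
    """
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in text]
    n = len(v)
    # Cosine similarity of every visual/text pair, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


def ftp_pretraining_loss(vit_embeddings, processors, vlm_text_embeddings):
    """Sum the contrastive loss over the four tiers (unweighted sum is an assumption).

    vit_embeddings: list of ViT output vectors, one per video.
    processors: dict mapping tier name to a weight matrix (list of rows).
    vlm_text_embeddings: dict mapping tier name to a list of text vectors,
        one per video, produced offline by the VLM prompt for that tier.
    """
    total = 0.0
    for tier in TIERS:
        projected = [linear(processors[tier], e) for e in vit_embeddings]
        total += info_nce(projected, vlm_text_embeddings[tier])
    return total
```

In this sketch the VLM appears only through the precomputed `vlm_text_embeddings`, which mirrors the abstract's point that the VLM is needed during training but not at inference. With identity processors and text embeddings that match the visual ones, the loss is close to zero; pairing each video with another video's text raises it sharply.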

Original language: English
Article number: 113794
Journal: Pattern Recognition
Volume: 179
Publication status: Published - Nov 2026

Bibliographical note

Publisher Copyright:
© 2026 The Authors

Keywords

  • Action understanding
  • Parameter-efficient fine-tuning
  • Pre-training
  • Vision transformer
