Abstract
Owing to their ability to extract powerful video embeddings, Vision Transformers (ViTs) are currently the best-performing models in video action understanding. However, when these models are frozen and applied to downstream tasks, their performance drops significantly, revealing limited generalization. In this paper, we describe the Four-Tiered Prompts (FTP) framework, which introduces feature processors to transform the ViT's output. In a pre-training stage, each feature processor is trained using contrastive learning to align the ViT's visual embeddings with the textual embeddings of a vision-language model (VLM). We use four feature processors, each linked to the output of a VLM prompt that reflects one of the fundamental aspects of human action: category, components, description, and context. With the FTP framework, we increase the ViT's generalization ability by forcing the visual encoder to incorporate relevant semantic information. Importantly, we employ the VLM only during training. Consequently, inference incurs a limited computational cost. For video action recognition and detection, employing the FTP framework consistently yields state-of-the-art performance after fine-tuning. Extensive experiments demonstrate how different design choices contribute to the overall increase in performance.
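The pre-training objective described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact loss or the architecture of the feature processors, so the following rests on assumptions: each feature processor is modeled as a single linear map, the contrastive objective is a symmetric InfoNCE loss with temperature 0.07, and the four per-aspect losses are averaged. The names `ftp_pretraining_loss`, `info_nce`, and `processors` are hypothetical and are not taken from the paper.

```python
import math

# The four prompt aspects named in the abstract.
ASPECTS = ("category", "components", "description", "context")


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def linear(weights, vec):
    """Apply a weight matrix (list of rows) to a vector.

    Assumption: a feature processor is a plain linear map.
    """
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce(visual, textual, temperature=0.07):
    """Symmetric InfoNCE: (visual[i], textual[i]) are the positive pairs."""
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in textual]
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


def ftp_pretraining_loss(vit_embeddings, text_embeddings, processors):
    """Average the contrastive loss over the four aspect-specific processors.

    vit_embeddings: one visual embedding per video in the batch.
    text_embeddings: dict mapping aspect -> one VLM text embedding per video.
    processors: dict mapping aspect -> weight matrix (hypothetical linear processor).
    """
    per_aspect = {}
    for aspect in ASPECTS:
        projected = [linear(processors[aspect], e) for e in vit_embeddings]
        per_aspect[aspect] = info_nce(projected, text_embeddings[aspect])
    return sum(per_aspect.values()) / len(ASPECTS), per_aspect
```

In this sketch the text embeddings appear only inside the training loss, which mirrors the abstract's point that the VLM is not needed at inference time.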
| Original language | English |
|---|---|
| Article number | 113794 |
| Journal | Pattern Recognition |
| Volume | 179 |
| DOIs | |
| Publication status | Published - Nov 2026 |
Bibliographical note
Publisher Copyright: © 2026 The Authors
Keywords
- Action understanding
- Parameter-efficient fine-tuning
- Pre-training
- Vision transformer
Fingerprint
Dive into the research topics of 'Improving the generalization of ViTs for action understanding with VLM pre-training'. Together they form a unique fingerprint.