Efficiently moving forward in video-based human action recognition

Abstract
This PhD thesis addresses the challenge of improving the efficiency and scalability of video-based human action recognition, an essential task in areas such as surveillance, healthcare, and human-computer interaction. While modern transformer-based models deliver high accuracy, their substantial computational demands limit practical deployment.
To tackle this, the thesis makes three key contributions. First, it introduces the Local Attention Layer (LA-layer), a convolution-style attention mechanism with a deformable kernel and a constraint rule. This design captures local spatio-temporal patterns effectively while reducing computational cost. Second, it proposes the Trajectory-Correlation (TC) block, a hybrid spatio-temporal module that improves recognition of fine-grained and complex actions, including continuous sign language.
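The abstract does not give implementation details of the LA-layer. As a rough, generic illustration of why restricting attention to a local neighbourhood cuts cost, the following sketch computes windowed 1-D attention in O(N·window) rather than the O(N²) of full self-attention; the function name, the fixed window, and the 1-D setting are all assumptions for illustration, not the thesis's actual deformable-kernel design.

```python
import numpy as np

def local_attention(x, window=2):
    """Toy 1-D local attention: each position attends only to a small
    neighbourhood instead of the full sequence, so the cost grows as
    O(N * window) rather than O(N^2). Purely illustrative; not the
    LA-layer from the thesis."""
    n, d = x.shape
    out = np.zeros_like(x)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        keys = x[lo:hi]                        # local neighbourhood only
        scores = keys @ x[i] / np.sqrt(d)      # scaled dot-product scores
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()               # softmax over the window
        out[i] = weights @ keys                # convex combination of keys
    return out
```

A deformable variant would additionally learn per-position offsets for the window, which is what lets a convolution-style kernel adapt its receptive field to the motion in the clip.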
Third, the thesis focuses on enhancing transformer efficiency. It presents VideoMambaPro, a compact and fast architecture based on the Mamba state-space model, which achieves competitive accuracy with significantly fewer resources than traditional Vision Transformers. Additionally, the Four-Tiered Prompts (FTP) framework is proposed to leverage external knowledge from Visual Language Models (VLMs), improving generalization across datasets and tasks without the need for task-specific fine-tuning.
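For intuition on why a Mamba-style state-space backbone scales better than attention: a state-space model processes a sequence with a fixed-size recurrent state, so runtime is linear in sequence length. The minimal scan below shows the basic discrete recurrence x_t = A·x_{t-1} + B·u_t, y_t = C·x_t; it is a textbook sketch under assumed names, not VideoMambaPro's selective, input-dependent variant.

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Minimal discrete state-space scan: x_t = A x_{t-1} + B u_t,
    y_t = C x_t. One pass over the sequence -> O(N) in length,
    unlike the O(N^2) pairwise interactions of self-attention."""
    d = A.shape[0]
    x = np.zeros(d)                 # fixed-size hidden state
    y = np.zeros(len(u))
    for t, u_t in enumerate(u):
        x = A @ x + B * u_t         # state update
        y[t] = C @ x                # readout
    return y
```

Mamba makes A, B, and C functions of the input (selectivity) and computes the scan with a hardware-efficient parallel algorithm, which is what the abstract's efficiency claims build on.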
The effectiveness of these methods is validated on multiple benchmark datasets, including Kinetics-400, Something-Something V2, and PHOENIX14. Results show state-of-the-art performance with reduced memory and computation requirements.
This thesis contributes to the development of efficient, generalizable, and scalable action recognition systems, advancing the practical deployment of video understanding technologies.
| Original language | English |
|---|---|
| Qualification | Doctor of Philosophy |
| Awarding Institution | |
| Supervisors/Advisors | |
| Award date | 7 Jul 2025 |
| Publisher | |
| DOIs | |
| Publication status | Published - 7 Jul 2025 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 3: Good Health and Well-being
Keywords
- Human Action Recognition
- Video Understanding
- Efficient Vision Transformer Models
- Visual Language Models (VLMs)
- Computational Efficiency
- Computer Vision