
Efficiently moving forward in video-based human action recognition

  • Hui Lu

Research output: Thesis, Doctoral thesis 1 (Research UU / Graduation UU)

Abstract

This PhD thesis addresses the challenge of improving the efficiency and scalability of video-based human action recognition—an essential task in areas such as surveillance, healthcare, and human-computer interaction. While modern transformer-based models deliver high accuracy, their high computational demands limit practical deployment. To tackle this, the thesis proposes three key contributions. First, it introduces the Local Attention Layer (LA-layer), a convolution-style attention mechanism with a deformable kernel and constraint rule. This design captures local spatio-temporal patterns effectively while reducing computational costs. Second, the Trajectory-Correlation (TC) block is proposed, a hybrid spatio-temporal module that enhances recognition of fine-grained and complex actions, including continuous sign language. Third, the thesis focuses on enhancing transformer efficiency. It presents VideoMambaPro, a compact and fast architecture based on the Mamba state-space model, which achieves competitive accuracy with significantly fewer resources than traditional Vision Transformers. Additionally, the Four-Tiered Prompts (FTP) framework is proposed to leverage external knowledge from Visual Language Models (VLMs), improving generalization across datasets and tasks without the need for task-specific fine-tuning. The effectiveness of these methods is validated on multiple benchmark datasets, including Kinetics-400, Something-Something V2, and PHOENIX14. Results show state-of-the-art performance with reduced memory and computation requirements. This thesis contributes to the development of efficient, generalizable, and scalable action recognition systems, advancing the practical deployment of video understanding technologies.
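To illustrate the general idea behind restricting attention to local neighbourhoods (the principle underlying the LA-layer, though not the thesis's actual design, which uses a deformable kernel and constraint rule), the sketch below implements a minimal fixed-window local attention in NumPy. Each position attends only to tokens within a radius `r`, reducing the cost from O(T²·d) to O(T·r·d). All names and shapes here are illustrative assumptions, not the thesis's code.

```python
import numpy as np

def local_attention(x, wq, wk, wv, radius=1):
    """Single-head attention where each position attends only to
    neighbours within `radius` steps, instead of the full sequence.
    x: (T, d) sequence of token features; wq/wk/wv: (d, d) projections."""
    q, k, v = x @ wq, x @ wk, x @ wv
    d = q.shape[-1]
    out = np.zeros_like(v)
    for t in range(len(x)):
        lo, hi = max(0, t - radius), min(len(x), t + radius + 1)
        scores = q[t] @ k[lo:hi].T / np.sqrt(d)   # (window,)
        w = np.exp(scores - scores.max())
        w /= w.sum()                              # softmax over the local window
        out[t] = w @ v[lo:hi]
    return out

# Tiny example: 6 tokens of dimension 4, attending one step in each direction.
rng = np.random.default_rng(0)
T, d = 6, 4
x = rng.standard_normal((T, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
y = local_attention(x, wq, wk, wv, radius=1)
```

With `radius >= T - 1` the window covers the whole sequence and the result coincides with ordinary dense attention; shrinking the radius trades global context for linear cost in sequence length, which is the efficiency lever local-attention designs exploit.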
Original language: English
Qualification: Doctor of Philosophy
Awarding Institution:
  • Utrecht University
Supervisors/Advisors:
  • Salah, Albert, Supervisor
  • Poppe, Ronald, Co-supervisor
Award date: 7 Jul 2025
Publication status: Published - 7 Jul 2025

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being

Keywords

  • Human Action Recognition
  • Video Understanding
  • Efficient Vision Transformer Models
  • Visual Language Models (VLMs)
  • Computational Efficiency
  • Computer Vision
