A Modular Approach for the Detection and Interconnection of Objects, Hands, Locations, and Actions for Egocentric Video Understanding


Research output: Thesis › Doctoral thesis 1 (Research UU / Graduation UU)


The topic of this dissertation is the structured, automatic analysis and understanding of egocentric (first-person) videos with respect to the actions performed by the camera wearer. Perhaps the most distinctive characteristic of the egocentric perspective is that it provides an information-rich view of the scene experienced by the person holding the camera. The resulting scenes are often indicative of the person's location and the activities they undertake. Recognition is based on high-level information, such as the hands of the camera wearer and the objects being manipulated, as well as low-level features made available through data-driven learning methods. In this thesis, we use deep convolutional neural networks trained on egocentric images, video segments, and/or (a)synchronously acquired high-level features of the scene as the backbone of action classification models. We demonstrate that the training process and architecture of the models are critical to their success; a topic investigated largely through the application of multitask learning, measuring the effect of a variety of learnable outputs on the final action recognition result. We additionally pursue the simultaneous combination of video data from a variety of sources. In the context of this thesis, this is called multi-dataset multitask learning and refers to a novel way of combining related and unrelated data sources to improve egocentric action recognition quality.
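The multitask setup described in the abstract — a shared feature extractor feeding several task-specific outputs such as actions and locations — can be sketched minimally as below. This is an illustrative NumPy sketch under assumed shapes and task names (10 action classes, 5 location classes), not the architecture used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_backbone(frames):
    """Stand-in for a CNN backbone: map flattened frame features
    to a 128-dimensional shared representation (ReLU-activated)."""
    w = rng.standard_normal((frames.shape[1], 128)) * 0.01
    return np.maximum(frames @ w, 0.0)

def task_head(features, n_classes):
    """A linear classification head for one task, with softmax output."""
    w = rng.standard_normal((features.shape[1], n_classes)) * 0.01
    logits = features @ w
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical batch: 4 clips, each represented by a 2048-d feature vector.
frames = rng.standard_normal((4, 2048))
feats = shared_backbone(frames)          # shared representation
action_probs = task_head(feats, 10)      # e.g. 10 action classes
location_probs = task_head(feats, 5)     # e.g. 5 location classes
```

In training, the per-task losses would be summed so that gradients from every auxiliary task (location, hand detection, etc.) shape the shared backbone — the mechanism by which multitask learning can improve the main action recognition result.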
Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
  • Utrecht University
Supervisors
  • Veltkamp, Remco, Primary supervisor
  • Poppe, Ronald, Co-supervisor
Award date: 9 Jun 2021
Place of Publication: Utrecht
Print ISBNs: 978-94-6423-273-8
Publication status: Published - 9 Jun 2021


Keywords
  • Egocentric Vision
  • Machine Learning
  • Deep Learning
  • Location Recognition
  • Action Recognition
  • Activity Recognition
  • Hand Detection
  • Tracking
  • Multitask Learning
  • Multi-dataset Learning

