Abstract
The topic of this dissertation is the analysis and understanding of egocentric (first-person)
videos with respect to the actions performed by the camera wearer, in a structured
and automatic manner. Perhaps the most distinctive characteristic of the egocentric
perspective is that it provides an information-rich view of the scene that the person
holding the camera experiences. The recorded scenes are often indicative of the
camera wearer's location and the activities they undertake. Recognition is based on
high-level information, such as the hands of the camera wearer and the objects being
manipulated, as well as on low-level features made available through data-driven
learning methods. In this thesis, we use deep convolutional neural networks trained
on egocentric images, video segments, and/or (a)synchronously acquired high-level
features of the scene as the backbone of action classification models. We demonstrate
that the training process and the architecture of the models are crucial to their
success, a topic we investigate largely through multitask learning, measuring the
effect of a variety of learnable outputs on the final action recognition result. We
additionally pursue the simultaneous combination of video data from a variety of
sources. In the context of this thesis, this approach is called multi-dataset multitask
learning and refers to a novel way of combining related and unrelated data sources
to improve the quality of egocentric action recognition.
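The following minimal sketch (PyTorch-style; the thesis does not prescribe a particular framework) illustrates the multitask idea described above: a shared convolutional backbone feeding several task-specific classification heads that are trained with a joint loss. The backbone, the head names (verb, noun, location), the class counts, and the equal loss weighting are hypothetical placeholders, not the models developed in the thesis.

```python
import torch
import torch.nn as nn


class MultiTaskEgoNet(nn.Module):
    """Shared backbone with one classification head per learnable output.

    All sizes and task names are illustrative placeholders.
    """

    def __init__(self, num_verbs=100, num_nouns=300, num_locations=10):
        super().__init__()
        # Tiny stand-in for a deep convolutional backbone.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # One linear head per task (the multitask outputs).
        self.verb_head = nn.Linear(32, num_verbs)
        self.noun_head = nn.Linear(32, num_nouns)
        self.location_head = nn.Linear(32, num_locations)

    def forward(self, frames):
        features = self.backbone(frames)
        return {
            "verb": self.verb_head(features),
            "noun": self.noun_head(features),
            "location": self.location_head(features),
        }


# Joint training step with an unweighted sum of per-task losses.
model = MultiTaskEgoNet()
criterion = nn.CrossEntropyLoss()
frames = torch.randn(4, 3, 224, 224)  # dummy batch of egocentric frames
targets = {
    "verb": torch.randint(0, 100, (4,)),
    "noun": torch.randint(0, 300, (4,)),
    "location": torch.randint(0, 10, (4,)),
}
outputs = model(frames)
loss = sum(criterion(outputs[task], targets[task]) for task in outputs)
loss.backward()
```

In a multi-dataset setting, batches drawn from different datasets would supply labels for different subsets of these heads, with the loss computed only over the heads each batch annotates; the sketch above shows only the single-dataset multitask case.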
| Field | Value |
|---|---|
| Original language | English |
| Qualification | Doctor of Philosophy |
| Awarding Institution | |
| Supervisors/Advisors | |
| Award date | 9 Jun 2021 |
| Place of Publication | Utrecht |
| Publisher | |
| Print ISBNs | 978-94-6423-273-8 |
| DOIs | |
| Publication status | Published - 9 Jun 2021 |
Keywords
- Egocentric Vision
- Machine Learning
- Deep Learning
- Location Recognition
- Action Recognition
- Activity Recognition
- Hand Detection
- Tracking
- Multitask Learning
- Multi-dataset Learning