Modeling Dyadic Human Interactions: A study of methods for training pose+motion models of fine-grained face-to-face interactions in unsegmented videos

Cornelis Johannes van Gemeren

Research output: Thesis, Doctoral thesis 1 (Research UU / Graduation UU)

Abstract

A dyadic interaction is a behavioral exchange between two people. This thesis presents a computational framework that localizes and classifies fine-grained dyadic interactions, such as a handshake, a hug, or passing an object from one person to another. In artificial intelligence, tasks like these are commonly referred to as human interaction recognition. Our method is trained on videos with accompanying metadata describing the poses of the individuals involved, and then recognizes dyadic interactions automatically. Instead of focusing on interactions that are visually very different, such as kicking and punching, we look at visually similar interactions, such as shaking hands and passing an object, and give special attention to the fine-grained differences between them. The interactions we consider usually involve physical contact, but our method is not limited to such interactions. Localizing and classifying fine-grained dyadic interactions is a challenging task, and solving it matters because of its many potential applications. Human interaction recognition plays an important role in surveillance, video search, and automated video captioning. Beyond these applications, successful interaction recognition models can also contribute to human behavior understanding and the study of social development. For many years, automatically finding and labeling interactions has been beyond the capabilities of computer systems and artificial intelligence. In this thesis we introduce two data sets specifically designed for the task of dyadic human interaction detection, and we describe a spatio-temporal model that contains pose and motion features in a graph of deformable body parts.
We set up our model by first finding the moment during an interaction that is most representative of its class. We call this frame the epitome. The model is built around the epitome and describes the temporal build-up towards it and the wind-down of the interaction afterwards. Over this time span, we describe the poses of the different limbs using Histograms of Oriented Gradients (HOG) and the accompanying motions using Histograms of Optical Flow (HOF) and Motion Boundary Histograms (MBH). We show that our model can be trained from relatively few examples. We test its robustness when the amount of available training data is extremely limited, and we examine the use of auxiliary images to support training in these cases. When tested on unsegmented videos, our framework returns labeled spatio-temporal tubes that cover an interaction precisely. Our experiments show that our models generalize well to different environments. In addition to performing well, our formulation is flexible enough to incorporate different features and part configurations, so other interaction classes can be easily trained. Our research also shows that there is still room for improvement. Most importantly, the temporal extent of an interaction is difficult to estimate precisely with our method, because we train models on the epitome of the interaction, which covers only a small part of it.
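
To illustrate the three descriptor families named above, the following is a minimal sketch of how HOG, HOF, and MBH histograms could be computed for a single cell. It is not the implementation used in the thesis. The function names, the bin counts, the central-difference gradient, and the L2 normalization are all assumptions made for illustration. The optical flow field is taken as given, and block normalization and part layout are omitted.

```python
import math


def orientation_histogram(vx, vy, n_bins, signed):
    """Magnitude-weighted, L2-normalized histogram of vector orientations.

    vx, vy: equally sized 2D lists holding the x and y vector components.
    signed: bin over 0..2*pi if True, else over 0..pi (direction sign ignored).
    """
    period = 2 * math.pi if signed else math.pi
    hist = [0.0] * n_bins
    for row_x, row_y in zip(vx, vy):
        for x, y in zip(row_x, row_y):
            mag = math.hypot(x, y)
            if mag == 0.0:
                continue  # zero vectors carry no orientation
            angle = math.atan2(y, x) % period
            # The trailing modulo guards against angle == period after rounding.
            hist[int(angle / period * n_bins) % n_bins] += mag
    norm = math.sqrt(sum(h * h for h in hist))
    return [h / norm for h in hist] if norm > 0 else hist


def gradients(field):
    """Central-difference gradients of a 2D list, with clamped borders."""
    h, w = len(field), len(field[0])
    gx = [[field[r][min(c + 1, w - 1)] - field[r][max(c - 1, 0)]
           for c in range(w)] for r in range(h)]
    gy = [[field[min(r + 1, h - 1)][c] - field[max(r - 1, 0)][c]
           for c in range(w)] for r in range(h)]
    return gx, gy


def hog_cell(gray, n_bins=9):
    """HOG-style cell: unsigned histogram of image gradient orientations."""
    gx, gy = gradients(gray)
    return orientation_histogram(gx, gy, n_bins, signed=False)


def hof_cell(flow_u, flow_v, n_bins=8):
    """HOF-style cell: signed histogram of optical flow directions."""
    return orientation_histogram(flow_u, flow_v, n_bins, signed=True)


def mbh_cell(flow_u, flow_v, n_bins=8):
    """MBH-style cell: one gradient histogram per flow component."""
    ux, uy = gradients(flow_u)
    vx, vy = gradients(flow_v)
    return (orientation_histogram(ux, uy, n_bins, signed=True),
            orientation_histogram(vx, vy, n_bins, signed=True))


# Toy 4x4 cell containing a vertical edge, and a uniform flow field.
edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
flow_u = [[-1.0] * 4 for _ in range(4)]
flow_v = [[0.5] * 4 for _ in range(4)]

hog = hog_cell(edge)
hof = hof_cell(flow_u, flow_v)
mbh_x, mbh_y = mbh_cell(flow_u, flow_v)
```

In this toy example the uniform flow field produces an all-zero MBH descriptor, because MBH is built from the spatial gradients of the flow and therefore ignores constant motion, while HOF still records the dominant flow direction.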
Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
  • Utrecht University
Supervisors/Advisors
  • Veltkamp, Remco, Primary supervisor
  • Poppe, Ronald, Co-supervisor
Award date: 5 Jun 2019
Place of Publication: Utrecht
Publisher
Print ISBNs: 978-94-6332-511-0
Publication status: Published - 5 Jun 2019

Keywords

  • artificial intelligence
  • machine learning
  • computer vision
  • interaction recognition
  • video
