TY - JOUR
T1 - Video-based emotion recognition in the wild using deep transfer learning and score fusion
AU - Kaya, Heysem
AU - Gürpınar, Furkan
AU - Salah, Albert Ali
PY - 2017/9/1
Y1 - 2017/9/1
AB - Multimodal recognition of affective states is a difficult problem, unless the recording conditions are carefully controlled. For recognition “in the wild”, large variances in face pose and illumination, cluttered backgrounds, occlusions, audio and video noise, as well as issues with subtle cues of expression are some of the issues to target. In this paper, we describe a multimodal approach for video-based emotion recognition in the wild. We propose using summarizing functionals of complementary visual descriptors for video modeling. These features include deep convolutional neural network (CNN) based features obtained via transfer learning, for which we illustrate the importance of flexible registration and fine-tuning. Our approach combines audio and visual features with least squares regression based classifiers and weighted score level fusion. We report state-of-the-art results on the EmotiW Challenge for “in the wild” facial expression recognition. Our approach scales to other problems, and ranked top in the ChaLearn-LAP First Impressions Challenge 2016 from video clips collected in the wild.
KW - Convolutional neural networks
KW - Emotion recognition in the wild
KW - EmotiW
KW - Kernel extreme learning machine
KW - Multimodal fusion
KW - Partial least squares
UR - http://www.scopus.com/inward/record.url?scp=85012924090&partnerID=8YFLogxK
U2 - 10.1016/j.imavis.2017.01.012
DO - 10.1016/j.imavis.2017.01.012
M3 - Article
AN - SCOPUS:85012924090
SN - 0262-8856
VL - 65
SP - 66
EP - 75
JO - Image and Vision Computing
JF - Image and Vision Computing
ER -