Spatio-Temporal FAST 3D Convolutions for Human Action Recognition

A.G. Stergiou, R.W. Poppe

Research output: Contribution to conference > Paper > Academic

Abstract

Effective processing of video input is essential for the recognition of temporally varying events such as human actions. Motivated by the often distinctive temporal characteristics of actions in either the horizontal or the vertical direction, we introduce a novel convolution block for CNN architectures with video input. Our proposed Fractioned Adjacent Spatial and Temporal (FAST) 3D convolutions are a natural decomposition of a regular 3D convolution. Each convolution block consists of three sequential convolution operations: a 2D spatial convolution followed by spatio-temporal convolutions in the horizontal and vertical directions, respectively. Additionally, we introduce a FAST variant that treats horizontal and vertical motion in parallel. Experiments on the benchmark action recognition datasets UCF-101 and HMDB-51 with ResNet architectures demonstrate consistently higher performance of FAST 3D convolution blocks over traditional 3D convolutions. The lower validation loss indicates better generalization, especially for deeper networks. We also evaluate the performance of CNN architectures with similar memory requirements, based either on Two-stream networks or on 3D convolution blocks. DenseNet-121 with FAST 3D convolutions was shown to perform best, giving further evidence of the merits of the decoupled spatio-temporal convolutions.
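The decomposition described in the abstract can be illustrated with a small arithmetic sketch. The abstract does not state kernel sizes, so the (time, height, width) shapes below are assumptions chosen for illustration: a 1×k×k spatial kernel, a k×1×k kernel for horizontal motion, and a k×k×1 kernel for vertical motion. The helper names (`conv3d_weights`, `receptive_field`, `fast_kernels`) are hypothetical and do not come from the paper. The sketch only counts weights and computes the receptive field of the stacked steps; it does not reproduce the authors' implementation.

```python
# Illustrative sketch only. Kernel shapes are ASSUMED, not taken from the abstract.
# All kernels are written as (time, height, width).

def conv3d_weights(c_in, c_out, kernel):
    """Number of weights (bias excluded) in a 3D convolution."""
    t, h, w = kernel
    return c_in * c_out * t * h * w

def receptive_field(kernels):
    """Receptive field per axis of stride-1 convolutions applied in sequence."""
    return tuple(1 + sum(k[axis] - 1 for k in kernels) for axis in range(3))

def fast_kernels(k=3):
    """Assumed kernels for the three sequential steps of one block."""
    return [
        (1, k, k),  # 2D spatial convolution (no temporal extent)
        (k, 1, k),  # spatio-temporal, horizontal direction (time x width)
        (k, k, 1),  # spatio-temporal, vertical direction (time x height)
    ]

channels = 64  # hypothetical channel width, kept constant across all steps
k = 3

regular_weights = conv3d_weights(channels, channels, (k, k, k))
fast_weights = sum(conv3d_weights(channels, channels, ker) for ker in fast_kernels(k))

regular_rf = receptive_field([(k, k, k)])
fast_rf = receptive_field(fast_kernels(k))

print("regular 3D conv:", regular_weights, "weights, receptive field", regular_rf)
print("decomposed block:", fast_weights, "weights, receptive field", fast_rf)
```

Under these assumptions, with k = 3 and equal channel widths, the three steps together hold the same number of weights as a single 3×3×3 kernel while covering a 5×5×5 receptive field. The sketch says nothing about accuracy; it only shows how the factorization separates spatial, horizontal, and vertical processing.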
Original language: English
Number of pages: 8
Publication status: Published - 2019
Event: IEEE International Conference on Machine Learning and Applications 2019 - Boca Raton, United States
Duration: 16 Dec 2019 to 19 Dec 2019
https://www.icmla-conference.org/icmla19/

Conference

Conference: IEEE International Conference on Machine Learning and Applications 2019
Abbreviated title: ICMLA
Country/Territory: United States
City: Boca Raton
Period: 16/12/19 to 19/12/19

Keywords

  • 3D-Convolutions
  • space-time
  • action recognition
  • decoupled
