Abstract
In this paper, we propose a novel and efficient system for large-scale action recognition from realistic video clips. Our approach combines several recent advances in this area. We use improved dense trajectory features in combination with Fisher vector encoding, and perform learning and classification with extreme learning machine classifiers. The resulting system is a fast and accurate alternative to more traditional action classification approaches such as bag of words and support vector machines. Additionally, we use mid-level features that encode information about the presence of humans in the videos, as well as color distributions. We extensively evaluate each step of our pipeline in a comparative manner, and report results on the recently published THUMOS 2014 benchmark, which was introduced as a challenge dataset with temporally untrimmed videos and 101 action classes. We achieve 63.37% mean average precision under the challenge protocol (i.e., sequestered test labels and limited system submissions), ranking third among eleven participants. The results show that extreme learning machines can reach high accuracy efficiently, without the extensively trained and computationally heavy deep neural networks that the top-performing systems of the challenge incorporated.
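The classification step named in the abstract can be illustrated with a minimal sketch of a basic extreme learning machine: a hidden layer with random, untrained weights, followed by output weights solved in closed form by ridge regression. This is not the authors' implementation. The function names (`train_elm`, `predict_elm`), the `tanh` activation, the hidden-layer size, the regularization value, and the toy data standing in for Fisher vector features are all assumptions made for illustration.

```python
import numpy as np


def train_elm(X, y, n_hidden=64, reg=1e-3, seed=0):
    """Fit a basic ELM (illustrative; hyperparameters are assumptions)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    n_classes = int(y.max()) + 1
    # Hidden-layer weights are drawn once at random and never trained.
    W = rng.normal(size=(n_features, n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)
    # Targets: +1 for the true class, -1 for every other class.
    T = -np.ones((n_samples, n_classes))
    T[np.arange(n_samples), y] = 1.0
    # Closed-form ridge solution: beta = (H^T H + reg * I)^{-1} H^T T
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ T)
    return W, b, beta


def predict_elm(model, X):
    """Return the class with the highest output score for each row of X."""
    W, b, beta = model
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)


# Toy stand-in for encoded video features: two well-separated clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.5, (50, 4)), rng.normal(2.0, 0.5, (50, 4))])
y = np.array([0] * 50 + [1] * 50)

model = train_elm(X, y)
train_accuracy = float(np.mean(predict_elm(model, X) == y))
```

Because only the output weights are fitted, and by a single linear solve, training avoids the iterative optimization that support vector machines or deep networks require, which is the efficiency argument the abstract makes.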
Original language | English |
---|---|
Pages (from-to) | 8274-8282 |
Number of pages | 9 |
Journal | Expert Systems with Applications |
Volume | 42 |
Issue number | 21 |
Publication status | Published - 27 Jul 2015 |
Externally published | Yes |
Keywords
- Action recognition
- Extreme learning machine
- Fisher vector
- Multimedia mining