A long short-term memory convolutional neural network for first-person vision activity recognition

Abstract

Temporal information is the main source of discriminating characteristics for the recognition of proprioceptive activities in first-person vision (FPV). In this paper, we propose a motion representation that uses stacked spectrograms. These spectrograms are generated over temporal windows from mean grid optical-flow vectors and the displacement vectors of the intensity centroid. The stacked representation enables us to use 2D convolutions to learn and extract global motion features. Moreover, we employ a long short-term memory (LSTM) network to recursively encode the temporal dependency among consecutive samples. Experimental results show that the proposed approach achieves state-of-the-art performance on the largest public dataset for FPV activity recognition.
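
The sketch below shows, for illustration only, one way the two-stage pipeline outlined above could be organised: a 2D CNN that extracts a global motion feature from each stacked-spectrogram window, followed by an LSTM that encodes the temporal dependency across consecutive windows. It is written in PyTorch; the number of spectrogram channels, layer sizes, window count and class count are illustrative assumptions, not the values used in the paper.

# A minimal PyTorch sketch, under assumed dimensions, of a CNN-LSTM pipeline
# of the kind outlined in the abstract: 2D convolutions over stacked motion
# spectrograms, followed by an LSTM over consecutive temporal windows.
import torch
import torch.nn as nn

class SpectrogramCnnLstm(nn.Module):
    # in_channels, hidden_size and num_classes are illustrative assumptions.
    def __init__(self, in_channels=10, hidden_size=128, num_classes=8):
        super().__init__()
        # 2D CNN: treats each stack of spectrograms as a multi-channel image
        # and produces one global motion feature vector per temporal window.
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # LSTM: encodes the temporal dependency among consecutive windows.
        self.lstm = nn.LSTM(input_size=64, hidden_size=hidden_size,
                            batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, windows, channels, freq_bins, time_bins)
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).flatten(1)  # (b * t, 64)
        out, _ = self.lstm(feats.view(b, t, -1))      # (b, t, hidden_size)
        return self.classifier(out[:, -1])            # activity scores

# Usage on random data with assumed sizes: 2 sequences of 5 windows,
# each window a 10-channel 32x32 stacked spectrogram.
model = SpectrogramCnnLstm()
scores = model(torch.randn(2, 5, 10, 32, 32))
print(scores.shape)  # torch.Size([2, 8])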

 

 

Reference

G. Abebe and A. Cavallaro, "A long short-term memory convolutional neural network for first-person vision activity recognition," Proc. of ICCV Workshop on Assistive Computer Vision and Robotics (ACVR), Venice, Italy, October 28, 2017. [pdf]

Code

Source code for the method presented in the paper [Code]

Extracted features used in the paper [Data]