Multi-modal ego-centric data from inertial measurement units (IMUs) and first-person videos (FPV) can be effectively fused to recognise proprioceptive activities. Existing IMU-based approaches mostly employ cascades of handcrafted triaxial motion features or deep frameworks trained on limited data, while FPV approaches generally encode scene dynamics with motion and pooled appearance features. In this paper, we propose a multi-modal ego-centric proprioceptive activity recognition framework that uses a convolutional neural network (CNN) followed by a long short-term memory (LSTM) network, transfer learning, and a merit-based fusion of the IMU and/or FPV streams. The CNN encodes the short-term temporal dynamics of the ego-motion, while the LSTM exploits the long-term temporal dependencies among activities. The merit of a stream is evaluated with a sparsity measure of its initial classification output. We validate the proposed framework on multiple visual and inertial datasets.
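The merit-based fusion described above can be sketched as follows. The abstract does not specify which sparsity measure is used, so this illustration assumes the Hoyer sparsity (L1/L2 ratio), which is 1 for a one-hot class-score vector and 0 for a uniform one; the function names (`sparsity`, `merit_fusion`) are hypothetical and not taken from the paper's code.

```python
import math

def sparsity(scores):
    """Hoyer sparsity of a class-score vector: 1.0 for a one-hot
    (confident) output, 0.0 for a uniform (ambiguous) output.
    Illustrative choice; the paper's exact measure may differ."""
    n = len(scores)
    l1 = sum(abs(s) for s in scores)
    l2 = math.sqrt(sum(s * s for s in scores))
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)

def merit_fusion(stream_scores):
    """Fuse per-stream class scores (e.g. IMU and FPV), weighting each
    stream by the sparsity (merit) of its initial classification output."""
    weights = [sparsity(s) for s in stream_scores]
    total = sum(weights) or 1.0  # avoid division by zero if all uniform
    weights = [w / total for w in weights]
    n_classes = len(stream_scores[0])
    return [sum(w * s[c] for w, s in zip(weights, stream_scores))
            for c in range(n_classes)]
```

For example, a confident IMU output `[0.9, 0.05, 0.05]` receives a larger fusion weight than an ambiguous FPV output `[0.4, 0.35, 0.25]`, so the fused prediction follows the more reliable stream.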
G. Abebe, A. Cavallaro, "Inertial-Vision: cross-domain knowledge transfer for wearable sensors", Proc. of ICCV Workshop on Assistive Computer Vision and Robotics (ACVR), Venice, October 28, 2017 [pdf]
Source code for the method presented in the paper [Code]
Extracted features used in the paper [Data]