The research aims at the tracking and recognition of facial expressions, body poses and gestures, and human actions in image sequences. It is driven by applications in Multimodal Human Computer Interaction, body games, and Multimedia Indexing and Retrieval.
Members: Ioannis Patras, Irene Kotsia, Weiwei Guo, Sander Koelstra, Vijay Kumar, Antonis Oikonomopoulos (Imperial College), Ognjen Rudinovic (Imperial College), Stefanos Vrochidis (partially in ITI-CERTH)
This work aims at developing methods for the recognition and localisation of human and animal action categories in image sequences. Once trained, the methods should be able to detect and localise, in an unseen image sequence, all the actions that belong to one of the known categories. The methodologies will allow training the models on image sequences in which there is significant background clutter, that is, in the presence of multiple objects/actions in the scene and moving cameras. No prior knowledge of the anatomy of the human body is assumed, and therefore the models will be able to identify a large class of action categories, including facial/hand/body actions, animal motion, as well as interactions between humans and objects in their environment (such as drinking a glass of water).
This work applies computational intelligence and image processing techniques to analyse facial information in images and video. This includes tracking facial features, gaze tracking, recognition of the activation of facial muscles (Facial Action Units) and recognition of facial expressions such as those associated with the six basic emotions (anger, disgust, fear, happiness, sadness and surprise). Research is conducted not only in controlled environments, but also under challenging conditions, such as occlusion, varying lighting/pose and spontaneous facial expressions.
This line of work focuses on methods for recovering the 3D body pose from a single 2D image. The research focuses on learning direct mappings from image observations to the parameters that describe the 3D body pose. We first developed a hierarchical approach to this problem, learning piecewise mappings from observations to human poses. To achieve this we employed Support Vector Machines and multi-valued Relevance Vector Machine (RVM) regressors. Moreover, we developed a tensor regression framework employing two empirical risk functions, formulated using either the Frobenius norm or the group sparsity norm. By using the group sparsity norm we also achieved automatic selection of the rank during the learning process, by favouring a low-rank decomposition of the tensorial weights.
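The piecewise-mapping idea can be illustrated with a minimal sketch: the data, cluster assignment and regressor below are all hypothetical stand-ins (ridge regression replaces the RVM, and an oracle assignment replaces the SVM-based selection), but the structure — one regressor per cluster of the multi-valued observation-to-pose mapping — follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 20-D "image descriptors" X mapped to 3-D "poses" Y
# by two different linear maps (two pose clusters), mimicking the multi-valued
# nature of the observation-to-pose mapping.
X = rng.normal(size=(400, 20))
A1, A2 = rng.normal(size=(20, 3)), rng.normal(size=(20, 3))
labels = (X[:, 0] > 0).astype(int)            # hypothetical cluster assignment
Y = np.where(labels[:, None] == 0, X @ A1, X @ A2)

def fit_ridge(X, Y, lam=1e-6):
    """Closed-form ridge regression; a simple stand-in for an RVM regressor."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Piecewise mapping: one regressor per cluster, selected at test time by the
# same (here: oracle) cluster assignment a classifier would provide.
W = {c: fit_ridge(X[labels == c], Y[labels == c]) for c in (0, 1)}

x_test = X[:5]
y_pred = np.stack([x @ W[c] for x, c in zip(x_test, labels[:5])])
print(np.abs(y_pred - Y[:5]).max() < 1e-3)  # the piecewise map recovers the poses
```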
The research aims at the multimodal analysis of user behaviour when interacting with multimedia content. This includes the analysis of traditional modes of interaction (e.g. mouse and keyboard input), but mainly of novel means of interaction such as EEG (electroencephalography) signals, facial expressions and gaze patterns. The research is driven by applications in Multimedia Indexing and Retrieval as well as in Multimodal Human Computer Interaction.
In this work, we aim to analyze neuro-physiological user reactions to the presentation of multimedia, for indexing and retrieval. An advantage of using the EEG modality is that it can facilitate implicit tagging, that is, it can occur while the user passively watches multimedia content. We first analyze EEG signals in order to validate tags attached to video content. Subjects are shown a video and a tag, and we aim to determine whether the shown tag was congruent with the presented video by detecting the occurrence of an N400 event-related potential. Tag validation could be used in conjunction with a vision-based recognition system as a feedback mechanism to improve classification accuracy for multimedia indexing and retrieval. Independent Component Analysis and repeated-measures ANOVA are used for the analysis. Our experimental results show a clear occurrence of the N400 and a significant difference in N400 activation between matching and non-matching tags. The dataset we collected is now available; see here for details.
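The core of the N400 analysis — averaging epochs per condition and comparing mean amplitude in a window around 400 ms — can be sketched as follows. The data here are synthetic (the sampling rate, window and amplitudes are illustrative assumptions, not the study's actual parameters), and the statistical test is reduced to a simple amplitude comparison.

```python
import numpy as np

rng = np.random.default_rng(1)
fs = 256                        # hypothetical sampling rate in Hz
t = np.arange(0, 0.8, 1 / fs)   # 800 ms epochs, stimulus onset at t = 0

def make_epochs(n_trials, n400_amp):
    """Synthetic single-channel epochs: noise plus a negative deflection
    around 400 ms whose amplitude stands in for the N400 component."""
    noise = rng.normal(0, 2.0, size=(n_trials, t.size))
    erp = -n400_amp * np.exp(-((t - 0.4) ** 2) / (2 * 0.05 ** 2))
    return noise + erp

match = make_epochs(40, n400_amp=0.5)      # congruent tag: weak N400
mismatch = make_epochs(40, n400_amp=4.0)   # incongruent tag: strong N400

# Grand-average ERP per condition, then mean amplitude in a 300-500 ms window
win = (t >= 0.3) & (t <= 0.5)
amp_match = match.mean(axis=0)[win].mean()
amp_mismatch = mismatch.mean(axis=0)[win].mean()

print(amp_mismatch < amp_match)  # mismatching tags show the larger negativity
```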
This line of research focuses on utilising implicit indicators of user interaction with multimedia content via a user-computer interface. As such, we consider the user actions during a video retrieval task, including gaze, mouse movements and clicks, and keyboard input. The objective of this work is to exploit these implicit indicators in order to improve video retrieval performance.
In this context, an interactive video retrieval engine has been implemented, which is capable of retrieving video in different modalities (i.e. textual, visual and temporal search) as well as capturing user interaction. Video analysis was performed by employing state-of-the-art techniques, while implicit feedback analysis was conducted by introducing new implicit video indicators and subsequently constructing an action graph that describes the user's navigation during the search process. To validate the approach, the system was tested in experiments with real users, and its performance was evaluated with the widely used metrics of precision and recall. The evaluation shows a significant improvement in recall and precision when past user-computer interaction is exploited.
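One plausible reading of the action graph is a directed graph whose nodes are action types and whose edge weights count how often one action directly follows another within a session. The sketch below uses an invented interaction log and action names purely for illustration.

```python
from collections import defaultdict

# Hypothetical interaction log: (session, action) pairs in temporal order.
log = [
    ("s1", "text_query"), ("s1", "view_result"), ("s1", "play_video"),
    ("s1", "view_result"), ("s1", "play_video"), ("s1", "bookmark"),
    ("s2", "text_query"), ("s2", "view_result"), ("s2", "visual_query"),
]

def build_action_graph(log):
    """Action graph: nodes are action types, edge weights count how often
    one action directly follows another within the same session."""
    graph = defaultdict(int)
    prev = {}
    for session, action in log:
        if session in prev:
            graph[(prev[session], action)] += 1
        prev[session] = action
    return dict(graph)

graph = build_action_graph(log)
print(graph[("view_result", "play_video")])  # → 2
```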
In this work we propose a dynamic-texture-based approach to the recognition of facial Action Units (AUs, atomic facial gestures) and their temporal models (i.e., sequences of temporal segments: neutral, onset, apex, and offset) in near-frontal-view face videos. We introduce a novel approach to modelling the dynamics and the appearance of the face region of an input video, based on non-rigid registration using Free-Form Deformations (FFDs). The extracted motion representation is used to derive motion orientation histogram descriptors in both the spatial and temporal domain, which in turn form the input to a set of AU classifiers. Per AU, a combination of ensemble learners and Hidden Markov Models detects the presence of the AU in question and its temporal segments in an input image sequence. When tested on the recognition of all 27 lower and upper face AUs, occurring alone or in combination in 264 sequences from the MMI facial expression database, the method achieved an average event recognition accuracy of 89.2% with a Motion History Image (MHI) representation and of 94.3% with the FFD representation. The generalisation performance of the FFD method has been tested using the Cohn-Kanade database. Finally, we also explored the performance on spontaneous expressions in the Sensitive Artificial Listener dataset.
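A motion orientation histogram of the kind mentioned above can be sketched as follows: 2-D motion vectors are quantised by orientation and accumulated with magnitude weighting. The bin count, the toy flow field and the normalisation are illustrative choices, not the paper's exact descriptor.

```python
import math

def orientation_histogram(flow, n_bins=8):
    """Quantise 2-D motion vectors (dx, dy) into an orientation histogram,
    weighting each vector by its magnitude - a simplified version of the
    motion orientation histogram descriptors described above."""
    hist = [0.0] * n_bins
    for dx, dy in flow:
        mag = math.hypot(dx, dy)
        if mag == 0:
            continue
        angle = math.atan2(dy, dx) % (2 * math.pi)
        hist[int(angle / (2 * math.pi) * n_bins) % n_bins] += mag
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Toy flow: motion concentrated to the right, plus one upward vector
flow = [(1.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
hist = orientation_histogram(flow)
print(hist[0])  # → 0.75  (3.0 of 4.0 total magnitude points right)
```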
This line of work focuses on methods for the robust tracking of objects in image sequences, addressing the issues of (partial) occlusions, changes in the object's appearance (e.g. due to illumination changes) and structure (e.g. deformations), background clutter, and tracking multiple interacting targets. We have developed methods for general object tracking, where learning needs to be performed on the fly, as well as methods for domain-specific tracking, such as facial feature tracking, where the appearance, structure and dynamics of the target(s) can be learned offline.
Members: Ioannis Patras
This paper addresses the problem of robust template tracking in image sequences. Our work falls within the discriminative framework in which the observations at each frame yield direct probabilistic predictions of the state of the target. Our primary contribution is that we explicitly address the problem that the prediction accuracy for different observations varies, and in some cases can be very low. To this end, we couple the predictor to a probabilistic classifier which, when trained, can determine the probability that a new observation can accurately predict the state of the target (that is, determine the relevance or reliability of the observation in question). In the particle filtering framework, we derive a recursive scheme for maintaining an approximation of the posterior probability of the state in which multiple observations can be used and their predictions moderated by their corresponding relevance. In this way the predictions of the relevant observations are emphasized, while the predictions of the irrelevant observations are suppressed. We apply the algorithm to the problem of 2D template tracking and demonstrate that the proposed scheme outperforms classical methods for discriminative tracking both in the case of motions which are large in magnitude and also for partial occlusions. See here for details.
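A one-dimensional toy version of relevance-moderated fusion conveys the idea: each observation yields a direct prediction of the state plus a relevance probability, and particles are weighted by a relevance-weighted mixture of the observation likelihoods. All numbers below (state, predictions, relevances, noise models) are invented for illustration; the actual method learns the relevance classifier and operates on 2D templates.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setting: the third observation is an occluded/garbage measurement,
# and the relevance classifier (here: hand-set values) has flagged it.
true_state = 5.0
predictions = np.array([5.1, 4.9, -3.0])   # direct state predictions per observation
relevance = np.array([0.9, 0.8, 0.05])     # classifier output: reliability of each

def propagate(particles, sigma=0.5):
    """Dynamics step of the particle filter (random-walk model)."""
    return particles + rng.normal(0, sigma, size=particles.shape)

def reweight(particles, predictions, relevance, obs_sigma=0.3):
    """Weight particles by a relevance-moderated mixture of observation
    likelihoods: predictions of irrelevant observations are suppressed."""
    w = np.zeros_like(particles)
    for pred, rel in zip(predictions, relevance):
        w += rel * np.exp(-0.5 * ((particles - pred) / obs_sigma) ** 2)
    return w / w.sum()

particles = propagate(rng.normal(true_state, 2.0, size=500))
weights = reweight(particles, predictions, relevance)
estimate = np.sum(weights * particles)
print(abs(estimate - true_state) < 0.5)  # the garbage prediction barely matters
```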
The work focuses on methods for tracking multiple targets whose states (e.g. relative positions) are correlated. It is applied to the problem of tracking facial features, in which anatomical constraints are learned from annotated data; the method has been used extensively for tracking facial features in facial expression analysis. It is also applied to the problem of stereo tracking, where stereoscopic constraints are used to robustly track facial features and the iris in a stereoscopic image sequence. The latter is used for gaze tracking with a pair of webcams.
The research aims at the localisation, in images and image sequences, of instances of objects belonging to certain semantic categories. We model object structure and appearance, as well as the context in which objects appear, using probabilistic graphical models and/or classification schemes. We utilise strongly annotated datasets (i.e. data for which the ground-truth segmentation is available), weakly annotated datasets (e.g. where the presence but not the location of an object in an image is given), as well as datasets from social sites, where ambiguities in the labelling of the training data are typical.
This work studies models capable of analysing the structure of an image in terms of relationships among its building parts, or patches. The process aims to identify the relevant cues that allow specific low-level patch appearances to be paired with high-level semantic "concepts". In this context, the reasoning can benefit dramatically from the availability of structural data, i.e. information associated with the co-presence and relative location of patches. The main challenge is how to take this information into account while avoiding the complexity explosion associated with the intrinsic high dimensionality of the problem. Graphical models provide a theoretical framework for building a learning paradigm able to efficiently infer the relevant cues and use them to ultimately derive the class of the objects depicted in a collection of images.
This work aims at combining the benefits of supervised and unsupervised learning by allowing supervised methods to learn from training samples found in collaborative tagging environments, after some preprocessing. Specifically, drawing from a large pool of weakly annotated samples, our goal is to collect a set of strongly annotated samples suitable for training an object classifier in a supervised manner. We do this by correlating the most populated tag-word with the most populated visual-word in a set of weakly annotated images. Tag-words correspond to clusters of terms that are provided by social users to describe an image and are grouped based on their semantic relatedness. Visual-words correspond to clusters of image regions that are identified by an automatic segmentation algorithm and are grouped based on their visual similarity. The most populated tag-word provides information about the object that the classifier is trained to identify, while the most populated visual-word provides the set of strongly annotated samples for training the classifier in a supervised manner. Our method relies on the fact that, due to the common background that most users share, the majority of them tend to contribute similar tags when faced with similar types of visual content. Given this assumption, it is expected that as the pool of weakly annotated images grows, the most populated words in the tag and visual information spaces will converge onto the same object.
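The selection step can be sketched in a few lines: count cluster memberships across the weakly annotated pool, take the most populated tag-word as the target concept, and take images containing the most populated visual-word as the strongly annotated positives. The image records and cluster names below are entirely hypothetical.

```python
from collections import Counter

# Hypothetical weakly annotated pool: per image, the tag clusters its user
# tags fall into and the visual clusters its segmented regions fall into.
images = [
    {"tag_words": ["dog", "outdoor"], "visual_words": ["v_dog", "v_grass"]},
    {"tag_words": ["dog"],            "visual_words": ["v_dog"]},
    {"tag_words": ["dog", "park"],    "visual_words": ["v_dog", "v_tree"]},
    {"tag_words": ["cat"],            "visual_words": ["v_cat"]},
]

def most_populated(images, key):
    """Return the cluster with the most members across the pool."""
    counts = Counter(w for img in images for w in img[key])
    return counts.most_common(1)[0][0]

# The most populated tag-word names the target concept; images in the most
# populated visual-word become the strongly annotated positive training set.
concept = most_populated(images, "tag_words")
top_visual = most_populated(images, "visual_words")
positives = [img for img in images if top_visual in img["visual_words"]]
print(concept, len(positives))  # → dog 3
```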
One of the most informative measures for feature extraction is Mutual Information (MI). In terms of Mutual Information, the optimal feature extraction creates new features that jointly have the largest dependency on the target class. However, obtaining an accurate estimate of a high-dimensional MI as well as optimizing with respect to it is not always easy, especially when only small training sets are available. In this work, we proposed an efficient tree-based method for feature extraction in which at each step a new feature is created by selecting and linearly combining two features such that the MI between the new feature and the class is maximized. Both the selection of the features to be combined and the estimation of the coefficients of the linear transform rely on estimating two-dimensional MIs. The estimation of the latter is computationally very efficient and robust. The effectiveness of our method has been evaluated on several real-world data sets.
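One step of the tree-growing procedure can be sketched as follows: combine two features linearly, choosing the mixing coefficient (parameterised by an angle, so only one degree of freedom needs searching) that maximises a two-dimensional plug-in MI estimate against the class. The grid search, bin count and toy data are simplifying assumptions, not the paper's exact estimator or optimiser.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in MI estimate between two discrete sequences (in nats)."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def discretize(vals, n_bins=4):
    """Equal-width binning of a continuous feature."""
    lo, hi = min(vals), max(vals)
    return [min(int((v - lo) / (hi - lo + 1e-12) * n_bins), n_bins - 1)
            for v in vals]

def combine_step(f1, f2, labels, angles=16):
    """One tree-growing step: linearly combine two features, choosing the
    mixing angle that maximises the estimated MI with the class."""
    best = None
    for k in range(angles):
        a = math.pi * k / angles
        new = [math.cos(a) * x1 + math.sin(a) * x2 for x1, x2 in zip(f1, f2)]
        mi = mutual_information(discretize(new), labels)
        if best is None or mi > best[0]:
            best = (mi, new)
    return best

# Toy data: the class depends on f1 + f2, so the best angle mixes both.
f1 = [0, 1, 2, 3, 0, 1, 2, 3]
f2 = [0, 0, 1, 1, 3, 3, 2, 2]
labels = [int(a + b >= 3) for a, b in zip(f1, f2)]
mi, feature = combine_step(f1, f2, labels)
print(mi > mutual_information(discretize(f1), labels))  # combined feature wins
```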
This work aims at addressing classification and regression problems within a tensorial framework. We exploit the advantages offered by tensorial representations and propose several tensor learning models. We employ tensors in order to better retain and utilise information about the structure of the high-dimensional space the data lie in, for example the spatial arrangement of pixel-based features in a 2D image. We formulate our algorithms considering that the weight parameters are expressed as a tensor of multiple modes, and employ well-known tensor decompositions. In this way, the weight tensor in the resulting models can allow simultaneous projections to more than one direction along each mode, or can be written as the multiplication of a core tensor with a matrix along each mode. The proposed classification algorithms deal with badly scaled data and are able to achieve compression. We also exploit the information provided by the total or the within-class covariance matrix and whiten the data, thus providing invariance to affine transformations in the feature space. Regarding regression, we approach the problem by employing two empirical risk functions, both formulated using the Frobenius norm for regularisation. We also use the group sparsity norm for regularisation, favouring in this way a low-rank decomposition of the tensorial weight and achieving automatic selection of the rank during the learning process.
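A minimal instance of low-rank tensor regression is the matrix case with a rank-one weight, W = u vᵀ: fixing one factor makes the problem linear in the other, so the factors can be fitted by alternating least squares. This is an illustrative sketch on synthetic noiseless data, not the paper's actual risk functions or solver.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy tensor (here: matrix) regression: inputs are 2-D "images" X_i and the
# weight matrix is constrained to rank one, W = u v^T, mirroring the low-rank
# decomposition of the tensorial weights described above.
m, p, q = 200, 6, 5
u_true, v_true = rng.normal(size=p), rng.normal(size=q)
X = rng.normal(size=(m, p, q))
y = np.einsum('ipq,p,q->i', X, u_true, v_true)   # y_i = <X_i, u v^T>

u, v = rng.normal(size=p), rng.normal(size=q)
for _ in range(30):
    # With v fixed, y_i = (X_i v) . u is linear in u (and vice versa),
    # so each half-step is an ordinary least-squares solve.
    Zu = np.einsum('ipq,q->ip', X, v)
    u = np.linalg.lstsq(Zu, y, rcond=None)[0]
    Zv = np.einsum('ipq,p->iq', X, u)
    v = np.linalg.lstsq(Zv, y, rcond=None)[0]

W = np.outer(u, v)
resid = np.abs(np.einsum('ipq,pq->i', X, W) - y).max()
print(resid)  # should be tiny: the rank-one weight is recovered
```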
In this work, we propose a maximum-margin framework for classification using Non-negative Matrix Factorization (NMF). In contrast to previous approaches, in which the classification and the matrix factorization stages are independent, we incorporate the maximum-margin constraints within the NMF formulation, i.e. we solve for a basis matrix that maximizes the margin of the classifier in the low-dimensional feature space. This results in a non-convex constrained optimization problem with respect to the bases, the projection coefficients and the separating hyperplane, which we propose to solve iteratively: at each iteration we solve a set of convex sub-problems with respect to subsets of the unknown variables. By doing so, we obtain a basis matrix by which we extract features that maximize the margin of the resulting classifier. The performance of the proposed algorithm is evaluated on several publicly available datasets.