Matching face images across different modalities is a challenging open problem, owing chiefly to feature heterogeneity and, in the case of sketch recognition, to abstraction, exaggeration and distortion. Existing studies have addressed this task by engineering invariant features or by learning a common subspace between the modalities. In this paper, we take a different approach: we learn a mid-level representation within each domain that allows faces in either modality to be compared in a domain-invariant way. In particular, we investigate sketch-photo face matching and go beyond the well-studied viewed sketches to tackle forensic sketches and caricatures, where representations are often symbolic. We approach this by learning a facial attribute model independently in each domain that describes faces in terms of semantic properties. This representation is thus more invariant to heterogeneity and distortion, and more robust to misalignment. The intermediate attribute representation is then integrated synergistically with the original low-level features using canonical correlation analysis (CCA). Our framework shows impressive results on cross-modal matching with forensic sketches, and on the even more challenging caricature sketches. Furthermore, we create a new dataset with 59,000 attribute annotations for evaluation and to facilitate future research.

Contribution Highlights

  • We release a dataset with 59,000 attribute annotations for the major caricature and forensic photo-sketch datasets.
  • We show how to automatically detect photo/sketch facial attributes as a modality-invariant semantic feature.
  • We show how to synergistically integrate attributes and low-level features for recognition.
  • We demonstrate the efficacy of our approach on challenging forensic sketch and caricature sketch based recognition.


  1. Cross-Modal Face Matching: Beyond Viewed Sketches
    Shuxin Ouyang, Timothy Hospedales, Yi-Zhe Song, Xueming Li
    in Proceedings of the 12th Asian Conference on Computer Vision, 2014 (ACCV)



We address the highlighted challenges in cross-modality matching of forensic sketches and caricatures to photos, by constructing a mid-level attribute representation of each facial modality. The idea is that this representation can be learned independently within each modality (thus completely avoiding any cross-modality challenge); but once learned, it is largely invariant to the cross-modal gap.
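To make the within-modality attribute learning concrete, the sketch below trains an independent bank of binary attribute classifiers for one modality. This is an illustrative toy example, not the paper's exact model: the linear SVMs, the synthetic descriptors, and the attribute labels are all stand-ins for whatever low-level features and annotations a given modality provides.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Hypothetical setup: low-level face descriptors (e.g. LBP/HOG vectors)
# and binary attribute labels ("glasses", "beard", ...) for ONE modality.
# One independent classifier per attribute forms the detector bank.
n_faces, feat_dim, n_attrs = 200, 64, 5
X = rng.standard_normal((n_faces, feat_dim))
Y = (X[:, :n_attrs] > 0).astype(int)   # synthetic attribute labels

bank = [LinearSVC(C=1.0).fit(X, Y[:, a]) for a in range(n_attrs)]

def attribute_scores(x):
    """Map one low-level descriptor to a vector of attribute scores."""
    return np.array([clf.decision_function(x[None, :])[0] for clf in bank])

probe = rng.standard_normal(feat_dim)
scores = attribute_scores(probe)       # shape: (n_attrs,)
```

Because an equivalent bank is trained separately on photos and on sketches, each face can be mapped to the same semantic score space without any cross-modal supervision.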

Overview of our approach:

We train a bank of facial attribute detectors to produce low-dimensional semantic representation within each modality. Although the attribute representation is invariant to the cross-modal gap, it does lose some detailed information encoded by the low-level features. We therefore develop a robust synergistic representation that encodes the best of both attributes and low-level features by learning a CCA subspace that correlates the two.


We build our attribute dataset by annotating the major caricature and forensic photo-sketch datasets.

Experimental results on the caricature and forensic databases

Comparison of recognition results across different methods on the forensic and caricature databases.



  • Coming soon...