Matching face images across different modalities is a challenging open problem, owing to feature heterogeneity and, particularly in the case of sketch recognition, abstraction, exaggeration and distortion. Existing studies have attempted to address this task by engineering invariant features or by learning a common subspace between the modalities. In this paper, we take a different approach and learn a mid-level representation within each domain that allows faces in each modality to be compared in a domain-invariant way. In particular, we investigate sketch-photo face matching and go beyond the well-studied viewed sketches to tackle forensic sketches and caricatures, where depictions are often symbolic. We approach this by learning a facial attribute model independently in each domain that represents faces in terms of semantic properties. This representation is more invariant to heterogeneity and distortion, and more robust to misalignment. Our intermediate attribute representation is then integrated synergistically with the original low-level features using canonical correlation analysis (CCA). Our framework achieves strong results on cross-modal matching of forensic sketches, and of the even more challenging caricature sketches. Furthermore, we create a new dataset with 59,000 attribute annotations for evaluation and to facilitate future research.
We address the highlighted challenges in cross-modality matching of forensic sketches and caricatures to photos by constructing a mid-level attribute representation of each facial modality. The idea is that this representation can be learned independently within each modality (thus entirely sidestepping the cross-modality challenge during training); but once learned, it is largely invariant to the cross-modal gap.
We train a bank of facial attribute detectors to produce a low-dimensional semantic representation within each modality. Although the attribute representation is invariant to the cross-modal gap, it does lose some of the detailed information encoded by the low-level features. We therefore develop a robust synergistic representation that encodes the best of both attributes and low-level features by learning a CCA subspace that correlates the two.