Alessio Brutti and Andrea Cavallaro
We propose an audio-visual
target identification approach for egocentric data with cross-modal model
adaptation. The proposed approach blindly and iteratively adapts the
time-dependent models of each modality to varying target appearance and
environmental conditions using the posterior of the other modality. The
adaptation is unsupervised and performed on-line, thus models can be improved
as new unlabelled data become available. In particular, an appropriate
selection of the adaptation parameters ensures that accurate models do not
deteriorate when a modality is underperforming. Importantly, unlike traditional
audio-visual integration methods, the proposed approach is also useful for
temporal intervals during which only one modality is available or when
different modalities are used for different tasks.
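The exact update equations are given in the paper; as an informal illustration only, the following Python sketch shows one cross-modal adaptation step for a model represented as a normalised feature vector (e.g. the colour histograms used below). The function name and the parameters alpha (learning rate) and tau (confidence gate) are hypothetical:

```python
import numpy as np

def cross_modal_update(model, observation, other_posterior,
                       alpha=0.1, tau=0.8):
    """One unsupervised adaptation step for one modality's target model.

    model           : current target model (a normalised histogram)
    observation     : features extracted from the new unlabelled data
    other_posterior : posterior of the target under the OTHER modality
    alpha, tau      : hypothetical learning rate and confidence threshold
    """
    # Gate the update on the cross-modal posterior so that an
    # underperforming modality cannot corrupt an accurate model.
    if other_posterior < tau:
        return model
    # Convex combination: the update weight grows with the other
    # modality's confidence in the target.
    w = alpha * other_posterior
    updated = (1.0 - w) * model + w * observation
    return updated / updated.sum()  # keep the model a valid distribution
```

In the proposed approach the adaptation is symmetric: each modality's model is updated using the posterior produced by the other, so a step like the one above would be applied in both directions as new data arrive.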
We evaluate the proposed method in an end-to-end multi-modal person
identification application where the proposed model adaptation is coupled with
a traditional late score combination. A block diagram of the implemented system
is shown in Figure 1. The image processing part is based on colour histograms,
while the audio processing part uses i-vectors.
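The following Python sketch illustrates one simple form of late score combination over per-target audio and video scores; the min-max normalisation, the weight w_audio and the example values are illustrative assumptions, not necessarily the configuration used in the paper:

```python
import numpy as np

def late_fusion(audio_scores, video_scores, w_audio=0.5):
    """Late score combination over a common set of target IDs.

    audio_scores, video_scores : dicts mapping target ID -> raw score
    w_audio : fusion weight (hypothetical value)
    """
    def minmax(scores):
        # Rescale each modality's scores to [0, 1] so they are comparable.
        v = np.array(list(scores.values()), dtype=float)
        lo, hi = v.min(), v.max()
        return {k: (s - lo) / (hi - lo + 1e-9) for k, s in scores.items()}

    a = minmax(audio_scores)
    v = minmax(video_scores)
    fused = {k: w_audio * a[k] + (1.0 - w_audio) * v[k] for k in a}
    return max(fused, key=fused.get)  # identity with the highest fused score

# Example with three enrolled targets: prints "001"
print(late_fusion({"001": 2.1, "002": 0.4, "003": 1.0},
                  {"001": 0.7, "002": 0.9, "003": 0.2}))
```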
Experiments are based on two challenging real-world datasets and show that the proposed approach successfully adapts models in the presence of mild mismatch. Details about the datasets are reported below.
The data used in the experiments are made publicly available to
ease comparison and promote further research on this topic. The two datasets
can be downloaded directly by following the links below.
Data are provided raw, without any editing or labelling.
Please cite the paper below if you use these data in your
publications.
The QM-GoPro dataset is an egocentric dataset that captures
interactions of 13 participants speaking, for approximately 1 minute, to a
person wearing a chest-mounted GoPro camera. Speakers are up to a few metres
from the microphones (distant-talking task), which are partially covered by the
plastic shield of the camera.
The dataset includes four conditions (C1-C4), illustrated in Figure 2.
The sessions are
recorded with a GoPro Hero4 camera. The video resolution is 1920x1080, at 25
frames per second. The audio stream is sampled at 48 kHz, 16 bits.
Data structure:
GoPro/
where the three-digit code is the target ID.
[link]
Figure 2: Samples of the four conditions (C1, C2, C3, C4) in the QM-GoPro dataset.
The QM-Seminar dataset consists of 16 participants giving the same
1-minute talk three times, recorded by a static camera. The presenters move
freely and generate considerable pose and appearance changes. Moreover, in some
sequences significant illumination changes occur.
The talks are recorded using a JVC GY-HM150E High Definition Camcorder with a
video resolution of 1920x1080, at 25 frames per second. The audio signals are
captured by a Sennheiser ew 100-ENG G3 E-Band Wireless System lapel microphone.
Audio is sampled at 48 kHz, 16 bits.
Data structure:
Seminars/
where the three-digit code is the target ID.
[link]
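As a starting point for working with either dataset, the sketch below enumerates recordings and extracts the three-digit target IDs from file names; the file extension and exact directory layout are assumptions to be adapted to the actual structure described above:

```python
from pathlib import Path

def list_recordings(root):
    """Yield (target ID, path) pairs for recordings under a dataset root.

    Assumes MP4 containers and file names embedding the three-digit
    target ID; adjust both to the released data structure.
    """
    for path in sorted(Path(root).rglob("*.mp4")):
        digits = "".join(c for c in path.stem if c.isdigit())
        if len(digits) >= 3:
            yield digits[:3], path

for target_id, clip in list_recordings("Seminars"):
    print(target_id, clip)
```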
Figure 3: Sample still image from the QM-Seminar dataset.
Figure 4: Samples of targets in the QM-Seminar dataset.
A. Brutti and A. Cavallaro, “On-line cross-modal adaptation for audio-visual identification with egocentric data”, IEEE Transactions on Human-Machine Systems, 2016.