On-line cross-modal adaptation for audio-visual person identification with wearable cameras

Alessio Brutti and Andrea Cavallaro

Proposed Approach

We propose an audio-visual target identification approach for egocentric data with cross-modal model adaptation. The approach blindly and iteratively adapts the time-dependent models of each modality to varying target appearance and environmental conditions, using the posterior of the other modality. The adaptation is unsupervised and performed on-line, so the models improve as new unlabelled data become available. In particular, an appropriate selection of the adaptation parameters ensures that accurate models do not deteriorate when the other modality is underperforming. Importantly, unlike traditional audio-visual integration methods, the proposed approach remains useful during temporal intervals in which only one modality is available, or when different modalities are used for different tasks.
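To make the adaptation mechanism more concrete, the following minimal sketch (in Python) illustrates a posterior-weighted model update of the kind described above; the simple interpolation rule, the adaptation rate alpha and the confidence threshold are illustrative assumptions, not the exact formulation used in the paper.

import numpy as np

def cross_modal_update(model, observation, other_posterior,
                       alpha=0.1, confidence=0.7):
    """Posterior-weighted update of one modality's target model.

    model           : current model parameters (e.g. a colour histogram)
    observation     : statistics extracted from the new, unlabelled data
    other_posterior : posterior of the target according to the other modality
    alpha, confidence : illustrative values, not the settings of the paper
    """
    model = np.asarray(model, dtype=float)
    observation = np.asarray(observation, dtype=float)
    if other_posterior < confidence:
        # The other modality is not confident enough: leave the model
        # untouched, so that an underperforming modality cannot corrupt it.
        return model
    w = alpha * other_posterior            # adaptation weight
    return (1.0 - w) * model + w * observation

In a full system this update runs in both directions: the audio posterior drives the adaptation of the visual model and the video posterior drives the adaptation of the audio model.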

We evaluate the proposed method in an end-to-end multi-modal person identification application, where the proposed model adaptation is coupled with a traditional late score combination. A block diagram of the implemented system is shown in the figure below. The image processing part is based on colour histograms, while the audio processing part uses i-vectors.
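For reference, a minimal late score combination is sketched below; the equal default weighting and the use of raw per-target scores are illustrative assumptions rather than the exact configuration of the evaluated system.

import numpy as np

def late_fusion(audio_scores, video_scores, audio_weight=0.5):
    """Weighted late combination of per-target scores from the two modalities.

    audio_scores, video_scores : per-target scores (e.g. log-likelihoods),
                                 one entry per enrolled identity
    audio_weight               : relative weight of the audio stream
                                 (0.5 is an illustrative default)
    """
    fused = (audio_weight * np.asarray(audio_scores, dtype=float)
             + (1.0 - audio_weight) * np.asarray(video_scores, dtype=float))
    return int(np.argmax(fused))   # index of the identified target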

Figure 1: The audio-visual system used in the experimental evaluation.

Experiments are based on two challenging real-world datasets and show that the proposed approach successfully adapts the models in the presence of mild mismatch. Details about the two datasets are reported below.

Experimental data

The data used in the experiments are publicly available to ease comparison and promote further research on this topic. The two datasets can be downloaded directly via the links below.
The data are provided raw, without any editing or labelling.
Please cite the reference below if you use these data in your publications.

QM-GoPro dataset

The QM-GoPro dataset is an egocentric dataset that captures interactions of 13 participants, each speaking for approximately 1 minute to a person wearing a chest-mounted GoPro camera. Speakers are up to a few metres from the microphones (distant-talking condition), which are partially covered by the plastic shield of the camera.
The dataset includes four conditions, two recorded indoors (C1 and C2) and two recorded outdoors (C3 and C4); samples are shown in Figure 2.

The sessions are recorded with a GoPro Hero4 camera. The video resolution is 1920x1080 at 25 frames per second. The audio stream is sampled at 48 kHz with 16-bit resolution.
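For reference, the sketch below shows one way to extract the audio track of a session file for separate audio processing, for either dataset; it assumes the ffmpeg command-line tool is installed, and the mono downmix and output naming are illustrative choices rather than part of the dataset specification.

import subprocess
from pathlib import Path

def extract_audio(video_path, out_dir="wav"):
    """Extract the audio track of a session as 48 kHz, 16-bit mono PCM (requires ffmpeg)."""
    video_path = Path(video_path)
    # Name the output after the target ID and session (e.g. 001_ses1.wav),
    # since every session file in a dataset shares the same file name.
    out_path = Path(out_dir) / ("_".join(video_path.parts[-3:-1]) + ".wav")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video_path),
         "-vn",                    # drop the video stream
         "-acodec", "pcm_s16le",   # 16-bit linear PCM
         "-ar", "48000",           # keep the native 48 kHz sampling rate
         "-ac", "1",               # downmix to mono (illustrative choice)
         str(out_path)],
        check=True)
    return out_path

# Example (hypothetical local path):
# extract_audio("GoPro/001/ses1/GOPRO.MP4")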

Data structure:

GoPro/
|-- 001/
|   |-- ses1/GOPRO.MP4
|   |-- ses2/GOPRO.MP4
|   `-- ses3/GOPRO.MP4
|-- ...
`-- 013/
    |-- ses1/GOPRO.MP4
    |-- ses2/GOPRO.MP4
    `-- ses3/GOPRO.MP4

 

where the three-digit code is the target ID.
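For convenience, the sketch below shows one way to enumerate the sessions of either dataset given the layout above; the root directory names are assumptions about where the local copies are unpacked.

from pathlib import Path

def list_sessions(root):
    """Return (target_id, session, path) triples for a dataset laid out as <root>/<ID>/ses<N>/<file>."""
    sessions = []
    for video in sorted(Path(root).glob("*/ses*/*")):
        if video.suffix.lower() != ".mp4":
            continue
        target_id = video.parts[-3]   # e.g. '001'
        session = video.parts[-2]     # e.g. 'ses1'
        sessions.append((target_id, session, video))
    return sessions

# Example (hypothetical local paths):
# gopro_sessions = list_sessions("GoPro")
# seminar_sessions = list_sessions("Seminars")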



[link]

Figure 2: Samples of the four conditions in the QM-GoPro dataset: C1 (GoPro inside 1), C2 (GoPro inside 2), C3 (GoPro outside 1) and C4 (GoPro outside 2).

QM-Seminars dataset

The QM-Seminars dataset consists of 16 participants giving the same 1-minute talk three times, recorded by a static camera. The presenters move freely and generate considerable pose and appearance changes. Moreover, significant illumination changes occur in some sequences.
The talks are recorded with a JVC GY-HM150E High Definition camcorder at a video resolution of 1920x1080 and 25 frames per second. The audio signals are captured by a Sennheiser ew 100-ENG G3 E-Band wireless lapel microphone and sampled at 48 kHz with 16-bit resolution.

Data structure:

Seminars/
|-- 001/
|   |-- ses1/seminars.mp4
|   |-- ses2/seminars.mp4
|   `-- ses3/seminars.mp4
|-- ...
`-- 016/
    |-- ses1/seminars.mp4
    |-- ses2/seminars.mp4
    `-- ses3/seminars.mp4

 

where the three-digit code is the target ID.



[link]

Figure 3: Sample still image from the QM-Seminars dataset

Figure 4: Samples of targets in the QM-Seminars dataset

Reference

A. Brutti and A. Cavallaro, "On-line cross-modal adaptation for audio-visual identification with egocentric data," IEEE Transactions on Human-Machine Systems, 2016. [pdf]