Multi-modal localization and enhancement of multiple sound sources from a micro aerial vehicle


The ego-noise generated by the motors and propellers of a micro aerial vehicle (MAV) masks the environmental sounds and considerably degrades the sound recording quality. Sound enhancement approaches generally require the directions of arrival of the target sound sources, which are difficult to estimate from the microphone signals due to the low signal-to-noise ratio (SNR) caused by the ego-noise and the interference between multiple sources. We propose a multi-modal analysis approach that jointly exploits the audio and visual modalities to enhance the sounds of multiple targets captured from an MAV equipped with a microphone array and a video camera. We first address the audio-visual calibration problems (including camera resectioning, audio-visual temporal alignment and geometrical alignment), so that the features in the independently generated audio and video streams can be jointly used. The spatial information from the video is used to assist sound enhancement by tracking multiple potential sound sources with a particle filter. We then infer the directions of arrival of the target sources from the video tracking results and extract the sound from the desired direction with a time-frequency spatial filter, which suppresses the ego-noise by exploiting its time-frequency sparsity. Experimental results with real outdoor data verify the robustness of the proposed multi-modal approach for multiple speakers in extremely low-SNR scenarios.
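To make the enhancement step concrete, the following is a minimal NumPy sketch of a DOA-steered time-frequency spatial filter: a per-bin delay-and-sum beamformer followed by a binary time-frequency mask. This is an illustration under simplifying assumptions (far-field source, linear array, crude energy-based mask), not the paper's implementation; the function names and the `noise_floor` threshold are hypothetical.

```python
import numpy as np

def steering_vector(theta, mic_positions, freq, c=343.0):
    """Far-field steering vector for a linear array toward angle theta (rad)."""
    delays = mic_positions * np.cos(theta) / c  # per-microphone time delays
    return np.exp(-2j * np.pi * freq * delays)

def tf_spatial_filter(stft, theta, mic_positions, freqs, noise_floor=1e-3):
    """Enhance the source at DOA `theta` from a multi-channel STFT.

    stft: complex array of shape (n_mics, n_freqs, n_frames).
    Applies a delay-and-sum beamformer per frequency bin, then a binary
    time-frequency mask that zeroes bins whose steered output is weak
    relative to the total multi-channel energy -- a crude stand-in for
    exploiting the time-frequency sparsity of the ego-noise.
    """
    n_mics, n_freqs, n_frames = stft.shape
    out = np.zeros((n_freqs, n_frames), dtype=complex)
    for k, f in enumerate(freqs):
        a = steering_vector(theta, mic_positions, f)   # (n_mics,)
        out[k] = (a.conj() @ stft[:, k, :]) / n_mics   # delay-and-sum
    # Binary TF mask: keep only bins where the steered output retains
    # a sufficient share of the average per-channel energy.
    total = np.mean(np.abs(stft) ** 2, axis=0)
    mask = (np.abs(out) ** 2) >= noise_floor * np.maximum(total, 1e-12)
    return out * mask
```

In the paper's pipeline, `theta` would come from the video tracking results rather than being known a priori; the binary mask here stands in for the ego-noise-aware masking described in the paper.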


Related material: [Paper]
@inproceedings{sanchezmatilla2017multimodal,
  author    = {R. Sanchez-Matilla and L. Wang and A. Cavallaro},
  title     = {Multi-modal localization and enhancement of multiple sound sources from a micro aerial vehicle},
  booktitle = {Proceedings of the 2017 ACM on Multimedia Conference},
  year      = {2017},
  publisher = {ACM},
  address   = {San Francisco, CA, USA},
  keywords  = {audio-visual sensing, ego-noise reduction, micro aerial vehicles, microphone array, multi-modal localization, enhancement of multiple sound sources, multiple object tracking},
}


The microphone array and the camera mounted on a drone record three people moving across several locations in the scene and talking concurrently. The proposed method tracks the motion and enhances the speech from each person.
[Speakers' scripts]
  • ACM Multimedia is the premier conference in multimedia, a research field that discusses emerging computing methods from a perspective in which each medium is a strong component of the complete, integrated exchange of information.
  • The multimedia community has a tradition of being able to handle big data, it has been a pioneer in large scale evaluations and dataset creations, and is uniquely angled towards novel applications and cutting edge industrial challenges.
  • As such the conference openly embraces new intellectual angles from both industry as well as academia and welcomes submissions from related fields, such as data science, HCI and signal processing.

Only compatible with Safari, iOS Safari and Microsoft Edge. If your browser is not supported, please download the videos from the above link.

Details of the graphs can be found in the paper which can be downloaded above.


  1. Play the video.
  2. Click the buttons below to select the audio track to be played ("Original" = raw recorded audio signal, "A" = enhanced audio for speaker A, "B" = enhanced audio for speaker B, "C" = enhanced audio for speaker C).

This page is maintained by Lin Wang
Created: 10/21/2017