Multi-modal localization and enhancement of multiple sound sources from a micro aerial vehicle


The ego-noise generated by the motors and propellers of a micro aerial vehicle (MAV) masks the environmental sounds and considerably degrades the sound recording quality. Sound enhancement approaches generally require the directions of arrival of the target sound sources, which are difficult to estimate from the microphone signals due to the low signal-to-noise ratio (SNR) caused by the ego-noise and the interference between multiple sources. We propose a multi-modal analysis approach that jointly exploits the audio and visual modalities to enhance the sounds of multiple targets captured from an MAV equipped with a microphone array and a video camera. We first address the audio-visual calibration problems (including camera resectioning, audio-visual temporal alignment and geometrical alignment), so that the features in the independently generated audio and video streams can be jointly used. The spatial information from the video is used to assist sound enhancement by tracking multiple potential sound sources with a particle filter. We then infer the directions of arrival of the target sources from the video tracking results and extract the sound from the desired direction with a time-frequency spatial filter, which suppresses the ego-noise by exploiting its time-frequency sparsity. Experimental results with real outdoor data verify the robustness of the proposed multi-modal approach for multiple speakers in extremely low-SNR scenarios.
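To make the enhancement step concrete, the following is a minimal NumPy sketch of a DOA-steered time-frequency spatial filter: a per-bin delay-and-sum beamformer followed by a binary time-frequency mask. This is an illustration under simplifying assumptions (far-field source, linear array, crude energy-based mask), not the paper's implementation; the function names and the `noise_floor` threshold are hypothetical.

```python
import numpy as np

def steering_vector(theta, mic_positions, freq, c=343.0):
    """Far-field steering vector for a linear array toward angle theta (rad)."""
    delays = mic_positions * np.cos(theta) / c  # per-microphone time delays
    return np.exp(-2j * np.pi * freq * delays)

def tf_spatial_filter(stft, theta, mic_positions, freqs, noise_floor=1e-3):
    """Enhance the source at DOA `theta` from a multi-channel STFT.

    stft: complex array of shape (n_mics, n_freqs, n_frames).
    Applies a delay-and-sum beamformer per frequency bin, then a binary
    time-frequency mask that zeroes bins whose steered output is weak
    relative to the total multi-channel energy -- a crude stand-in for
    exploiting the time-frequency sparsity of the ego-noise.
    """
    n_mics, n_freqs, n_frames = stft.shape
    out = np.zeros((n_freqs, n_frames), dtype=complex)
    for k, f in enumerate(freqs):
        a = steering_vector(theta, mic_positions, f)   # (n_mics,)
        out[k] = (a.conj() @ stft[:, k, :]) / n_mics   # delay-and-sum
    # Binary TF mask: keep only bins where the steered output retains
    # a sufficient share of the average per-channel energy.
    total = np.mean(np.abs(stft) ** 2, axis=0)
    mask = (np.abs(out) ** 2) >= noise_floor * np.maximum(total, 1e-12)
    return out * mask
```

In the paper's pipeline, `theta` would come from the video tracking results rather than being known a priori; the binary mask here stands in for the ego-noise-aware masking described in the paper.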


Related material: [Paper]
@inproceedings{sanchezmatilla2017multimodal,
  author    = {R. Sanchez-Matilla and L. Wang and A. Cavallaro},
  title     = {Multi-modal localization and enhancement of multiple sound sources from a micro aerial vehicle},
  booktitle = {Proceedings of the 2017 ACM on Multimedia Conference},
  year      = {2017},
  publisher = {ACM},
  address   = {San Francisco, CA, USA},
  keywords  = {audio-visual sensing, ego-noise reduction, micro aerial vehicles, microphone array, multi-modal localization, enhancement of multiple sound sources, multiple object tracking},
}


The microphone array and the camera mounted on a drone record three people moving across several locations in the scene and talking concurrently. The proposed method tracks the motion and enhances the speech from each person.
[Speakers' scripts]
  • ACM Multimedia is the premier conference in multimedia, a research field that discusses emerging computing methods from a perspective in which each medium is a strong component of the complete, integrated exchange of information.
  • The multimedia community has a tradition of being able to handle big data, it has been a pioneer in large scale evaluations and dataset creations, and is uniquely angled towards novel applications and cutting edge industrial challenges.
  • As such the conference openly embraces new intellectual angles from both industry as well as academia and welcomes submissions from related fields, such as data science, HCI and signal processing.

Only compatible with Safari, iOS Safari and Microsoft Edge. If your browser is not supported, please download the videos from the above link.

Details of the graphs can be found in the paper which can be downloaded above.


  1. Play the video.
  2. Click the buttons below to select the audio track to be played ("Original" = raw recorded audio signal, "A" = enhanced audio for speaker A, "B" = enhanced audio for speaker B, "C" = enhanced audio for speaker C).

This page is maintained by Lin Wang
Created: 10/21/2017