Edinburgh Research Explorer

Look who's talking: Detecting the dominant speaker in a cluttered scenario

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Original language: English
Title of host publication: IEEE International Conference on Acoustics, Speech and Signal Processing
Place of publication: Florence
Number of pages: 5
State: Published - 4 May 2014


In this work we propose a novel method to automatically detect and localise the dominant speaker in an enclosed scenario by means of audio and video cues. The underpinning idea is that gesturing means speaking, so observing motion means observing an audio signal. To the best of our knowledge, state-of-the-art algorithms focus on stationary motion scenarios and close-up scenes where only one audio source exists, whereas we extend the method to larger fields of view and cluttered scenarios that include multiple non-stationary moving speakers. In such contexts, moving objects that are not correlated with the dominant audio may exist, and their motion may incorrectly drive the audio-video (AV) correlation estimation. This suggests that extra localisation data may be fused at decision level to avoid detecting false positives. In this work, we learn Mel-frequency cepstral coefficients (MFCCs) and correlate them with the optical flow. We also exploit the audio and video signals to estimate the position of the actual speaker, narrowing down the visual search space and hence reducing the probability of a wrong voice-to-pixel region association. We compare our work with an existing state-of-the-art algorithm and show, on real datasets, a 36% precision improvement in localising a moving dominant speaker through occlusions and speech interference.
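The audio-video correlation idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: it assumes one scalar audio feature per video frame (a stand-in for an MFCC-derived value), one motion magnitude per frame for each image region (a stand-in for optical flow), and plain Pearson correlation as the association score. The names `pearson`, `best_region`, the region labels and the `allowed` position prior are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def best_region(audio, motion, allowed=None):
    """Pick the region whose motion correlates most strongly with the audio feature.

    audio   -- per-frame scalar audio feature (assumed, e.g. derived from MFCCs)
    motion  -- dict: region name -> per-frame motion magnitude (assumed, e.g. from optical flow)
    allowed -- optional set of region names kept by a position estimate (hypothetical prior)
    """
    scores = {
        region: pearson(audio, series)
        for region, series in motion.items()
        if allowed is None or region in allowed
    }
    return max(scores, key=scores.get), scores


# Synthetic data: the distractor happens to move in step with the audio.
audio = [0.1, 0.9, 0.2, 0.8, 0.1, 0.7]
motion = {
    "speaker": [0.3, 0.8, 0.1, 0.9, 0.2, 0.5],
    "distractor": [0.2, 1.8, 0.4, 1.6, 0.2, 1.4],
    "background": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
}

# Correlation alone is driven by the distractor's motion.
unconstrained, _ = best_region(audio, motion)

# Restricting the search to regions near the estimated speaker position
# removes the distractor from consideration.
constrained, _ = best_region(audio, motion, allowed={"speaker", "background"})
```

On this synthetic data the unconstrained search selects the distractor, while the search restricted by the position prior selects the speaker, which mirrors the abstract's motivation for fusing localisation data at decision level.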

    Research areas

  • audio signal processing; audio-visual systems; cepstral analysis; correlation theory; decision making; image fusion; image sequences; interference suppression; motion estimation; natural scenes; object tracking; speaker recognition; teleconferencing; video signal processing; MFCC; audio cues; audio source; audio-video correlation estimation; automatic dominant speaker detection; close-up scene; cluttered scenario; decision level; field of view; localisation data fusion; mel frequency cepstral coefficients; moving dominant speaker localisation; nonstationary moving speaker; occlusions; optical flow; position estimation; speech interference; stationary motion scenario; video cues; visual space; voice-to-pixel region association; Acceleration; Correlation; Mel frequency cepstral coefficient; Speech; Speech processing; Vectors; AV Tracking; Audio-Video Correlation; Multimodal tracking; Speaker Recognition; Speaker Tracking

ID: 18926697