Abstract
In this work we propose a novel method to automatically detect and localise the dominant speaker in an enclosed scenario by means of audio and video cues. The underpinning idea is that gesturing accompanies speaking, so observing motion amounts to observing an audio signal. To the best of our knowledge, state-of-the-art algorithms focus on stationary-motion scenarios and close-up scenes where only one audio source exists, whereas we extend the method to larger fields of view and cluttered scenarios including multiple non-stationary moving speakers. In such contexts, moving objects that are uncorrelated with the dominant audio may exist, and their motion may incorrectly drive the audio-video (AV) correlation estimation. This suggests that extra localisation data should be fused at the decision level to avoid detecting false positives. In this work, we learn Mel-frequency cepstral coefficients (MFCCs) and correlate them with the optical flow. We also exploit the audio and video signals to estimate the position of the actual speaker, narrowing down the visual search space and hence reducing the probability of incurring a wrong voice-to-pixel-region association. We compare our work with an existing state-of-the-art algorithm and show, on real datasets, a 36% precision improvement in localising a moving dominant speaker through occlusions and speech interference.
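The core idea of the abstract, correlating speech features with visual motion per image region, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses librosa for MFCCs and OpenCV's Farneback dense optical flow, substitutes a simple Pearson correlation for whatever AV-correlation measure the authors employ, and the file paths, grid size, and 25 fps frame-rate alignment are hypothetical.

```python
import numpy as np
import librosa
import cv2

def mfcc_features(wav_path, sr=16000, hop=640):
    # 640 samples at 16 kHz = 40 ms per hop, aligning one MFCC vector
    # with each frame of an assumed 25 fps video.
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    return mfcc.T  # shape: (frames, 13)

def flow_magnitudes(video_path, grid=(4, 4)):
    # Mean dense optical-flow magnitude per cell of a coarse spatial grid.
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    mags = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag = np.linalg.norm(flow, axis=2)
        h, w = mag.shape
        gh, gw = h // grid[0], w // grid[1]
        cells = [mag[r * gh:(r + 1) * gh, c * gw:(c + 1) * gw].mean()
                 for r in range(grid[0]) for c in range(grid[1])]
        mags.append(cells)
        prev_gray = gray
    cap.release()
    return np.array(mags)  # shape: (frames - 1, grid cells)

def av_correlation(mfcc, flow):
    # Pearson correlation between the first MFCC (a log-energy proxy)
    # and each cell's motion; the best-correlated cell is the naive
    # guess for the dominant speaker's location.
    n = min(len(mfcc), len(flow))
    audio = mfcc[:n, 0]
    corrs = [np.corrcoef(audio, flow[:n, c])[0, 1]
             for c in range(flow.shape[1])]
    return int(np.argmax(corrs)), corrs
```

A full system in the spirit of the paper would go further: as the abstract notes, an audio/video position estimate is fused at decision level so that cells whose motion is uncorrelated with the dominant audio (other moving people, background clutter) do not produce false positives.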
Original language | English
---|---
Title of host publication | IEEE International Conference on Acoustics, Speech and Signal Processing
Place of Publication | Florence
Publisher | Institute of Electrical and Electronics Engineers
Pages | 1532-1536
Number of pages | 5
DOIs |
Publication status | Published - 4 May 2014
Keywords
- audio signal processing;audio-visual systems;cepstral analysis;correlation theory;decision making;image fusion;image sequences;interference suppression;motion estimation;natural scenes;object tracking;speaker recognition;teleconferencing;video signal processing;MFCC;audio cues;audio source;audio-video correlation estimation;automatic dominant speaker detection;close-up scene;cluttered scenario;decision level;field of view;localisation data fusion;mel frequency cepstral coefficients;moving dominant speaker localisation;nonstationary moving speaker;occlusions;optical flow;position estimation;speech interference;stationary motion scenario;video cues;visual space;voice-to-pixel region association;Acceleration;Correlation;Speech;Speech processing;Vectors;AV Tracking;Audio-Video Correlation;Multimodal tracking;Speaker Recognition;Speaker Tracking
Projects
Signal Processing in the Networked Battlespace
Mulgrew, B., Davies, M., Hopgood, J. & Thompson, J.
1/04/13 → 30/06/18
Project: Research