Improving Response Time of Active Speaker Detection Using Visual Prosody Information Prior to Articulation

Fasih Haider, Saturnino Luz, Carl Vogel, Nick Campbell

Research output: Contribution to conference (Paper)


Natural multi-party interaction commonly involves turning one’s gaze towards the speaker who has the floor. Implementing virtual agents or robots that can engage in natural conversations with humans therefore requires enabling machines to exhibit this form of communicative behaviour. This task is called active speaker detection. In this paper, we propose a method for active speaker detection that uses visual prosody (lip and head movements) information before and after speech articulation to decrease the machine response time. We also demonstrate the discriminating power of visual prosody before and after speech articulation for active speaker detection. The results show that visual prosody information from the one second before articulation is helpful in detecting the active speaker. Lip movements provide better results than head movements, and fusing the two improves accuracy. We have also used visual prosody information from the first second of the speech utterance and found that it provides more accurate results than the one second before articulation. We conclude that fusing lip movements from both regions (the first second of speech and the one second before articulation) improves the accuracy of active speaker detection.
Original language: English
Number of pages: 5
Publication status: Published - 2018
Event: Interspeech 2018 - Hyderabad International Convention Centre, Hyderabad, India
Duration: 2 Sep 2018 to 6 Sep 2018


Conference: Interspeech 2018

