Natural multi-party interaction commonly involves turning one’s gaze towards the speaker who has the floor. Implementing virtual agents or robots that can engage in natural conversations with humans therefore requires enabling machines to exhibit this form of communicative behaviour. This task is called active speaker detection. In this paper, we propose a method for active speaker detection that uses visual prosody (lip and head movements) before and after speech articulation to decrease the machine response time, and we demonstrate the discriminating power of this information for active speaker detection. The results show that visual prosody from the one second before articulation is helpful in detecting the active speaker. Lip movements provide better results than head movements, and fusing the two improves accuracy. We also used visual prosody from the first second of the speech utterance and found that it provides more accurate results than the one second before articulation. We conclude that fusing lip movements from both regions (the first second of speech and the one second before articulation) improves the accuracy of active speaker detection.
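The two analysis regions named in the abstract (the one second before articulation and the first second of speech) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's implementation: the per-frame `lip` and `head` movement signals, the `onset_frame` index, the 25 fps default, zero-padding at the edges, and the use of simple concatenation as the fusion step are all hypothetical.

```python
from typing import Dict, List, Sequence


def window(signal: Sequence[float], start: int, end: int) -> List[float]:
    """Return the frames in [start, end), zero-padded where the index
    falls outside the signal, so the result always has end - start values."""
    return [
        float(signal[i]) if 0 <= i < len(signal) else 0.0
        for i in range(start, end)
    ]


def region_features(
    lip: Sequence[float],
    head: Sequence[float],
    onset_frame: int,
    fps: int = 25,  # assumed frame rate; one second == fps frames
) -> Dict[str, List[float]]:
    """Cut one-second windows around speech onset and fuse them.

    Fusion here is plain concatenation (an assumption); the abstract
    does not say which fusion scheme the authors used.
    """
    pre_lip = window(lip, onset_frame - fps, onset_frame)
    post_lip = window(lip, onset_frame, onset_frame + fps)
    pre_head = window(head, onset_frame - fps, onset_frame)
    return {
        "pre_lip": pre_lip,
        "post_lip": post_lip,
        "pre_head": pre_head,
        # lip + head movements from the second before articulation
        "pre_lip_head": pre_lip + pre_head,
        # lip movements from both regions
        "lip_both_regions": pre_lip + post_lip,
    }


if __name__ == "__main__":
    lip_signal = [float(i) for i in range(10)]
    head_signal = [0.5] * 10
    feats = region_features(lip_signal, head_signal, onset_frame=4, fps=2)
    print(feats["lip_both_regions"])
```

The resulting vectors would then be passed to whatever classifier is used to decide who the active speaker is; that stage is not shown because the abstract does not describe it.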
|Number of pages||5|
|Publication status||Published - 2018|
|Event||Interspeech 2018 - Hyderabad International Convention Centre, Hyderabad, India|
|Period||2 Sep 2018 → 6 Sep 2018|