Active speaker detection in human machine multiparty dialogue using visual prosody information

Fasih Haider, Nick Campbell, Saturnino Luz

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Real-time detection of a speaker and the speaker's location is a challenging task, which is usually addressed by processing acoustic/visual information. However, it is well known that when a person speaks, the lip and head movements can also be used to detect the speaker and location. This paper proposes a speaker detection system using visual prosody information (e.g. head and lip movements) in a human-machine multiparty interactive dialogue setting. The analysis is performed on a human-machine multiparty dialogue corpus. The paper reports results on head movement and on the fusion of head and lip movements for speaker and speech activity detection in three different machine learning model training settings (speaker dependent, speaker independent and hybrid). It also compares the lip movement results with those for head movements and for the fusion of head and lip movements. The results show that head movements contribute significantly towards detection and outperform lip movements except in speaker-independent settings, and that fusion of both improves performance.

Original language: English
Title of host publication: 2016 IEEE Global Conference on Signal and Information Processing
Pages: 1207-1211
Number of pages: 5
ISBN (Electronic): 978-1-5090-4545-7
Publication status: Published - 1 Dec 2016
Event: 2016 IEEE Global Conference on Signal and Information Processing, GlobalSIP 2016 - Washington, United States
Duration: 7 Dec 2016 - 9 Dec 2016

Conference

Conference: 2016 IEEE Global Conference on Signal and Information Processing, GlobalSIP 2016
Country/Territory: United States
City: Washington
Period: 7/12/16 - 9/12/16

Keywords / Materials (for Non-textual outputs)

  • Active Speaker Detection
  • Dialogue Systems
  • Human-Computer Interaction
  • Visual Prosody
