Abstract
Real-time detection of a speaker and the speaker's location is a challenging task, which is usually addressed by processing acoustic and/or visual information. However, it is well known that when a person speaks, their lip and head movements can also be used to detect the speaker and the speaker's location. This paper proposes a speaker detection system using visual prosody information (e.g. head and lip movements) in a human-machine multiparty interactive dialogue setting. The analysis is performed on a human-machine multiparty dialogue corpus. The paper reports results on head movements and on the fusion of head and lip movements for speaker and speech activity detection in three machine learning training settings (speaker dependent, speaker independent and hybrid), and compares them with results based on lip movements alone. The results show that head movements contribute significantly to detection and outperform lip movements except in the speaker independent setting, and that fusing both modalities improves performance.
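As a rough illustration of the pipeline the abstract describes, the sketch below shows feature-level fusion of head and lip movement descriptors and a speaker-independent (leave-one-speaker-out) evaluation of a binary speech-activity classifier. The feature dimensions, the SVM classifier and the synthetic data are assumptions made for demonstration only; they are not taken from the paper.

```python
# Illustrative sketch only: the paper does not publish code, so the feature
# names, classifier choice (SVM) and data layout below are assumptions.
# It demonstrates feature-level fusion of head and lip movement features
# and a leave-one-speaker-out (speaker independent) evaluation.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: per-frame head-movement features (e.g. rotation
# velocities) and lip-movement features (e.g. mouth-opening dynamics),
# with binary speech-activity labels and a speaker id per frame.
n_frames, n_speakers = 3000, 6
head_feats = rng.normal(size=(n_frames, 6))    # hypothetical 6-D head features
lip_feats = rng.normal(size=(n_frames, 4))     # hypothetical 4-D lip features
labels = rng.integers(0, 2, size=n_frames)     # 1 = speaking, 0 = silent
speakers = rng.integers(0, n_speakers, size=n_frames)

# Feature-level fusion: concatenate head and lip descriptors per frame.
fused = np.hstack([head_feats, lip_feats])

# Speaker independent evaluation: hold out all frames of one speaker per fold.
logo = LeaveOneGroupOut()
scores = []
for train_idx, test_idx in logo.split(fused, labels, groups=speakers):
    clf = SVC(kernel="rbf")
    clf.fit(fused[train_idx], labels[train_idx])
    pred = clf.predict(fused[test_idx])
    scores.append(accuracy_score(labels[test_idx], pred))

print(f"Speaker-independent accuracy (fused features): {np.mean(scores):.3f}")
```

A speaker dependent setting would instead split each speaker's own frames into train and test portions, and the hybrid setting would mix both; the same fusion step applies in all three cases.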
Original language | English |
---|---|
Title of host publication | 2016 IEEE Global Conference on Signal and Information Processing |
Pages | 1207-1211 |
Number of pages | 5 |
ISBN (Electronic) | 978-1-5090-4545-7 |
DOIs | |
Publication status | Published - 1 Dec 2016 |
Event | 2016 IEEE Global Conference on Signal and Information Processing, GlobalSIP 2016 - Washington, United States |
Duration | 7 Dec 2016 → 9 Dec 2016 |
Conference
Conference | 2016 IEEE Global Conference on Signal and Information Processing, GlobalSIP 2016 |
---|---|
Country/Territory | United States |
City | Washington |
Period | 7/12/16 → 9/12/16 |
Keywords
- Active Speaker Detection
- Dialogue Systems
- Human-Computer Interaction
- Visual Prosody