Automatic emotion recognition is vital for building natural and engaging human-computer interaction systems. Combining information from multiple modalities typically improves emotion recognition performance. In previous work, features from different modalities have generally been fused at the same level with two types of fusion strategies: Feature-Level fusion, which concatenates feature sets before recognition; and Decision-Level fusion, which makes the final decision based on outputs of the unimodal models. However, different features may describe data at different time scales or have different levels of abstraction. Cognitive Science research also indicates that when perceiving emotions, humans use information from different modalities at different cognitive levels and time steps. Therefore, we propose a Hierarchical fusion strategy for multimodal emotion recognition, which incorporates global or more abstract features at higher levels of its knowledge-inspired structure. We build multimodal emotion recognition models combining state-of-the-art acoustic and lexical features to study the performance of the proposed Hierarchical fusion. Experiments on two emotion databases of spoken dialogue show that this fusion strategy consistently outperforms both Feature-Level and Decision-Level fusion. The multimodal emotion recognition models using the Hierarchical fusion strategy achieved state-of-the-art performance on recognizing emotions in both spontaneous and acted dialogue.
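The three fusion strategies contrasted in the abstract can be illustrated with a minimal sketch. It is not the models used in the paper: the toy linear scorers, the function names (`linear_model`, `feature_level_fusion`, `decision_level_fusion`, `hierarchical_fusion`), the averaging rule for Decision-Level fusion, and the two-level layout are all assumptions made for illustration only.

```python
from typing import Callable, Sequence

# A "model" here is any function that maps a feature vector to a score.
Model = Callable[[Sequence[float]], float]


def linear_model(weights: Sequence[float], bias: float = 0.0) -> Model:
    """Return a toy scorer: weighted sum of the inputs plus a bias (hypothetical stand-in)."""
    def score(features: Sequence[float]) -> float:
        if len(features) != len(weights):
            raise ValueError("feature and weight vectors must have the same length")
        return sum(w * x for w, x in zip(weights, features)) + bias
    return score


def feature_level_fusion(model: Model,
                         acoustic: Sequence[float],
                         lexical: Sequence[float]) -> float:
    """Feature-Level fusion: concatenate the feature sets, then run one model."""
    return model(list(acoustic) + list(lexical))


def decision_level_fusion(acoustic_model: Model,
                          lexical_model: Model,
                          acoustic: Sequence[float],
                          lexical: Sequence[float]) -> float:
    """Decision-Level fusion: run one model per modality, then combine the outputs.

    Averaging is an assumed combination rule; others (voting, a meta-classifier) are possible.
    """
    return (acoustic_model(acoustic) + lexical_model(lexical)) / 2.0


def hierarchical_fusion(lower_layer: Model,
                        upper_model: Model,
                        low_level: Sequence[float],
                        high_level: Sequence[float]) -> float:
    """Hierarchical fusion: low-level features enter at the bottom, and
    global or more abstract features are added at a higher level."""
    hidden = lower_layer(low_level)                 # representation from the lower level
    return upper_model([hidden] + list(high_level))  # abstract features join higher up
```

For example, with a low-level vector `[0.5, 1.0]` and a more abstract vector `[2.0]`, `hierarchical_fusion(linear_model([1.0, 1.0]), linear_model([2.0, 1.0]), [0.5, 1.0], [2.0])` first reduces the low-level vector to one value and only then combines it with the abstract feature. Which real features belong at which level is a modelling decision that this sketch does not settle.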
|Title of host publication||2016 IEEE Workshop on Spoken Language Technology|
|Publisher||Institute of Electrical and Electronics Engineers (IEEE)|
|Number of pages||8|
|Publication status||Published - 9 Feb 2017|
|Event||2016 IEEE Workshop on Spoken Language Technology - San Diego, United States; Duration: 13 Dec 2016 → 16 Dec 2016|
|Conference||2016 IEEE Workshop on Spoken Language Technology|
|Abbreviated title||SLT 2016|
|Period||13/12/16 → 16/12/16|
Title: Recognising emotions in spoken dialogue with hierarchically fused acoustic and lexical features
- School of Philosophy, Psychology and Language Sciences - Lecturer in Speech and Language Processing
- Institute of Language, Cognition and Computation
- Centre for Speech Technology Research