Recognising emotions in spoken dialogue with hierarchically fused acoustic and lexical features

Leimin Tian, Johanna Moore, Catherine Lai

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution


Automatic emotion recognition is vital for building natural and engaging human-computer interaction systems. Combining information from multiple modalities typically improves emotion recognition performance. In previous work, features from different modalities have generally been fused at the same level with two types of fusion strategies: Feature-Level fusion, which concatenates feature sets before recognition; and Decision-Level fusion, which makes the final decision based on outputs of the unimodal models. However, different features may describe data at different time scales or have different levels of abstraction. Cognitive Science research also indicates that when perceiving emotions, humans use information from different modalities at different cognitive levels and time steps. Therefore, we propose a Hierarchical fusion strategy for multimodal emotion recognition, which incorporates global or more abstract features at higher levels of its knowledge-inspired structure. We build multimodal emotion recognition models combining state-of-the-art acoustic and lexical features to study the performance of the proposed Hierarchical fusion. Experiments on two emotion databases of spoken dialogue show that this fusion strategy consistently outperforms both Feature-Level and Decision-Level fusion. The multimodal emotion recognition models using the Hierarchical fusion strategy achieved state-of-the-art performance on recognizing emotions in both spontaneous and acted dialogue.
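The three fusion strategies named in the abstract can be contrasted with a small NumPy sketch. This is an illustration under stated assumptions, not the authors' models: the layer sizes, the random untrained weights, the averaging rule for Decision-Level fusion, and the choice to treat the acoustic features as the lower-level input and the lexical features as the more abstract input are all hypothetical. Only the general idea is taken from the abstract, namely that more abstract features enter at a higher level of the network.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    """One fully connected layer with tanh activation."""
    return np.tanh(x @ w + b)

def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def feature_level_fusion(acoustic, lexical):
    """Feature-Level fusion: concatenate feature sets before recognition."""
    return np.concatenate([acoustic, lexical], axis=-1)

def decision_level_fusion(p_acoustic, p_lexical):
    """Decision-Level fusion: combine unimodal outputs.
    Averaging is an assumed combination rule, chosen for simplicity."""
    return (p_acoustic + p_lexical) / 2.0

def hierarchical_fusion(low_level, high_level, params):
    """Hierarchical fusion: low-level features enter at the bottom layer,
    more abstract features are injected at a higher layer."""
    hidden = dense(low_level, params["w1"], params["b1"])
    hidden = np.concatenate([hidden, high_level], axis=-1)
    return softmax(hidden @ params["w2"] + params["b2"])

# Toy data: 4 utterances, 6 acoustic dims, 3 lexical dims, 4 emotion classes.
# All dimensions are hypothetical.
n_utt, n_ac, n_lex, n_hidden, n_classes = 4, 6, 3, 5, 4
acoustic = rng.normal(size=(n_utt, n_ac))
lexical = rng.normal(size=(n_utt, n_lex))

# Feature-Level: a single model would see the concatenated vector.
fused = feature_level_fusion(acoustic, lexical)

# Decision-Level: two unimodal (here untrained, linear) models, then averaging.
p_ac = softmax(acoustic @ rng.normal(size=(n_ac, n_classes)))
p_lex = softmax(lexical @ rng.normal(size=(n_lex, n_classes)))
p_decision = decision_level_fusion(p_ac, p_lex)

# Hierarchical: lexical features join after the first hidden layer.
params = {
    "w1": rng.normal(size=(n_ac, n_hidden)),
    "b1": np.zeros(n_hidden),
    "w2": rng.normal(size=(n_hidden + n_lex, n_classes)),
    "b2": np.zeros(n_classes),
}
p_hier = hierarchical_fusion(acoustic, lexical, params)
```

With untrained weights the outputs carry no meaning; the sketch only shows where each feature set enters the computation under each strategy.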
Original language: English
Title of host publication: 2016 IEEE Workshop on Spoken Language Technology
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Number of pages: 8
ISBN (Electronic): 978-1-5090-4903-5
ISBN (Print): 978-1-5090-4904-2
Publication status: Published - 9 Feb 2017
Event: 2016 IEEE Workshop on Spoken Language Technology - San Diego, United States
Duration: 13 Dec 2016 to 16 Dec 2016


Conference: 2016 IEEE Workshop on Spoken Language Technology
Abbreviated title: SLT 2016
Country: United States
City: San Diego

