ASR and Emotional Speech: A Word-Level Investigation of the Mutual Impact of Speech and Emotion Recognition

Yuanchao Li, Zeyu Zhao, Ondrej Klejch, Peter Bell, Catherine Lai

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract / Description of output

In Speech Emotion Recognition (SER), textual data is often used alongside audio signals to address their inherent variability. However, the reliance on human-annotated text in most research hinders the development of practical SER systems. To overcome this challenge, we investigate how Automatic Speech Recognition (ASR) performs on emotional speech by analyzing the ASR performance on emotion corpora and examining the distribution of word errors and confidence scores in ASR transcripts to gain insight into how emotion affects ASR. We utilize four ASR systems, namely Kaldi ASR, wav2vec, Conformer, and Whisper, and three corpora: IEMOCAP, MOSI, and MELD to ensure generalizability. Additionally, we conduct text-based SER on ASR transcripts with increasing word error rates to investigate how ASR affects SER. The objective of this study is to uncover the relationship and mutual impact of ASR and SER, in order to facilitate ASR adaptation to emotional speech and the use of SER in the real world.
Original language: English
Title of host publication: Proc. INTERSPEECH 2023
Publisher: International Speech Communication Association
Pages: 1449-1453
Number of pages: 5
Publication status: Published - 20 Aug 2023
Event: Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 to 24 Aug 2023
Conference number: 24
https://www.interspeech2023.org/

Publication series

Name: Interspeech
ISSN (Electronic): 1990-9772

Conference

Conference: Interspeech 2023
Country/Territory: Ireland
City: Dublin
Period: 20/08/23 to 24/08/23

Keywords / Materials (for Non-textual outputs)

  • speech recognition
  • speech emotion recognition
  • wav2vec2
  • Conformer
  • Whisper
  • confidence measure
