Abstract

We address quality assessment for neural network-based ASR by providing explanations that increase our understanding of the system and ultimately help build trust in it. Compared to simple classification labels, explaining transcriptions is more challenging: judging their correctness is not straightforward, and transcriptions, as variable-length sequences, are not handled by existing interpretable machine learning models. We provide an explanation for an ASR transcription as a subset of audio frames that is both a minimal and sufficient cause of the transcription. To do this, we adapt two existing explainable AI (XAI) techniques from image classification: (1) Statistical Fault Localisation (SFL) [1] and (2) causal explanations [2]. Additionally, we use a version of Local Interpretable Model-Agnostic Explanations (LIME) [3], adapted for ASR, as a baseline in our experiments. We evaluate the quality of the explanations generated by the proposed techniques over three different ASR systems – the Google API [4], the baseline Sphinx model [5], and DeepSpeech [6] – and 100 audio samples from the Common Voice dataset [7].
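To make the perturbation idea behind the LIME-style baseline concrete, here is a minimal sketch: mask random subsets of audio frames, score how well each masked signal preserves the original transcription, and fit a linear surrogate whose coefficients rank frame importance. The `transcribe` callable and all names and parameters are hypothetical placeholders standing in for any of the three ASR systems, not the paper's implementation.

    import numpy as np
    from sklearn.linear_model import Ridge

    def explain_transcription(audio, transcribe, n_frames=50, n_samples=200, seed=0):
        """LIME-style sketch: rank audio frames by importance to the transcription.

        `transcribe` is a hypothetical function mapping a waveform (1-D numpy
        array) to a text transcription.
        """
        rng = np.random.default_rng(seed)
        reference = transcribe(audio).split()
        frames = np.array_split(audio, n_frames)

        # Random binary masks: 1 = keep the frame, 0 = silence it.
        masks = rng.integers(0, 2, size=(n_samples, n_frames))
        scores = np.empty(n_samples)
        for i, mask in enumerate(masks):
            perturbed = np.concatenate(
                [f if keep else np.zeros_like(f) for f, keep in zip(frames, mask)]
            )
            # Crude fidelity score: fraction of reference words still produced.
            hyp = set(transcribe(perturbed).split())
            scores[i] = sum(w in hyp for w in reference) / max(len(reference), 1)

        # Interpretable linear surrogate; its coefficients rank the frames.
        surrogate = Ridge(alpha=1.0).fit(masks, scores)
        return np.argsort(surrogate.coef_)[::-1]  # most important frames first

The SFL and causal techniques described in the abstract differ in that they search for a minimal subset of frames that is sufficient to reproduce the transcription, rather than fitting a linear surrogate.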
Original language: English
Title of host publication: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Publisher: Institute of Electrical and Electronics Engineers
Number of pages: 5
ISBN (Electronic): 9781728163277
ISBN (Print): 9781728163284
Publication status: Published - 5 May 2023
Event: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing - Rhodes Island, Greece
Duration: 4 Jun 2023 – 10 Jun 2023
https://2023.ieeeicassp.org/

Publication series

Name: International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
Publisher: IEEE
ISSN (Print): 1520-6149
ISSN (Electronic): 2379-190X

Conference

Conference: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing
Abbreviated title: ICASSP
Country/Territory: Greece
City: Rhodes Island
Period: 4/06/23 – 10/06/23
