Confidence Intervals for ASR-based TTS Evaluation

Jason Taylor, Korin Richmond

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract / Description of output

Automatic speech recognition (ASR) is increasingly used to evaluate the intelligibility of text-to-speech synthesis (TTS). ASR is less costly than traditional listening tests, but questions remain about its reliability. We re-evaluate the Blizzard Challenge’s intelligibility tasks in English since 2011 using ASR. Re-analysing transcriptions collected by paid in-lab participants, online volunteers and Amazon Mechanical Turkers (the latter used only in 2011), we compare their word error rates (WERs) and statistically-significant system-groupings with those generated by an open-source, Transformer-based ASR model. This ASR model consistently decodes test stimuli with more reliable WERs than the Blizzard Challenge’s (mostly non-native) speech experts and online volunteers. The model also groups systems according to statistical significance similarly to the paid in-lab participants. Using surplus semantically unpredictable sentences (SUS) submitted every year to the challenge, we investigate how confidence intervals in ASR WERs change as the number of transcribed stimuli increases. We plot the Frobenius norm of pairwise significance matrices with increasing stimuli. We find that finer groupings of systems are detected as confidence intervals narrow. The number of stimuli at which p-values start to converge ranges from 400 to 800. We conclude that, with enough stimuli, ASR can be more reliable than humans.
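
As an illustration of the idea in the abstract, the following is a minimal sketch of how a confidence interval on WER might narrow as more stimuli are transcribed. It is not the authors' implementation: the percentile bootstrap, the function names (`word_errors`, `corpus_wer`, `bootstrap_ci`, `frobenius_norm`) and the synthetic error data are all assumptions made for this example, and the abstract does not say how the pairwise significance matrices are constructed, so only a generic Frobenius norm helper is shown.

```python
import random


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, reference length) for one stimulus."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs):
    """WER over a list of (errors, reference_length) pairs."""
    return sum(e for e, _ in pairs) / sum(n for _, n in pairs)


def bootstrap_ci(pairs, n_resamples=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for corpus WER (an assumed method)."""
    rng = random.Random(seed)
    stats = sorted(
        corpus_wer([rng.choice(pairs) for _ in pairs])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def frobenius_norm(matrix):
    """Frobenius norm of a matrix given as a list of rows."""
    return sum(x * x for row in matrix for x in row) ** 0.5


if __name__ == "__main__":
    # Synthetic stand-in for real transcriptions: 7-word sentences in which
    # each word is misrecognised with probability 0.2 (hypothetical values).
    rng = random.Random(1)
    pairs = [(sum(rng.random() < 0.2 for _ in range(7)), 7) for _ in range(800)]
    for n in (50, 200, 800):
        lower, upper = bootstrap_ci(pairs[:n])
        print(n, round(upper - lower, 4))  # interval width for n stimuli
```

With real data, `pairs` would be built by calling `word_errors` on each reference sentence and its ASR (or human) transcription. The interval width is expected to shrink roughly with the square root of the number of stimuli.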
Original language: English
Title of host publication: 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Publisher: International Speech Communication Association
Pages: 2791-2795
Number of pages: 5
ISBN (Electronic): 9781713836902
Publication status: Published - 3 Sept 2021
Event: Interspeech 2021: The 22nd Annual Conference of the International Speech Communication Association - Brno, Czech Republic
Duration: 30 Aug 2021 to 3 Sept 2021
Conference number: 22
https://www.interspeech2021.org

Publication series

Name: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
ISSN (Print): 2308-457X
ISSN (Electronic): 1990-9772

Conference

Conference: Interspeech 2021
Country/Territory: Czech Republic
City: Brno
Period: 30/08/21 to 3/09/21
