Detection and analysis of attention errors in sequence-to-sequence text-to-speech

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Sequence-to-sequence speech synthesis models are notorious for gross errors such as skipping and repetition, commonly associated with failures in the attention mechanism. While much has been done to improve attention and decrease errors, this paper focuses instead on automatic error detection and analysis. We evaluated three objective metrics against error detection scores collected by human listening. All metrics were derived from the synthesised attention matrix alone and do not require a reference signal, relying on the expectation that errors occur when attention is dispersed or insufficient. Using one of these metrics as an analysis tool, we observed that gross errors are more likely to occur in longer sentences and in sentences with punctuation marks that indicate a pause or break. We also found that mechanisms such as forcibly incremented attention have the potential to decrease gross errors, but to the detriment of naturalness. The results of the error detection evaluation revealed that two of the evaluated metrics were able to detect errors with a relatively high success rate, obtaining F-scores of up to 0.89 and 0.96.
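As a rough illustration of the reference-free idea described above (not the paper's actual metrics), dispersed attention can be scored directly from the alignment matrix, for example via the mean per-frame entropy of the attention weights. The function names and the threshold below are hypothetical and only sketch the general approach.

import numpy as np

def mean_attention_entropy(attn: np.ndarray) -> float:
    # attn: (decoder_steps, encoder_steps) attention matrix,
    # each row assumed to sum to 1.
    eps = 1e-12
    row_entropy = -np.sum(attn * np.log(attn + eps), axis=1)
    return float(np.mean(row_entropy))

def flag_possible_error(attn: np.ndarray, threshold: float = 1.5) -> bool:
    # Flag a possible skipping/repetition error when the average
    # entropy exceeds a tuned threshold (value here is illustrative):
    # sharp, near-monotonic attention gives low entropy, while
    # dispersed or insufficient attention gives high entropy.
    return mean_attention_entropy(attn) > threshold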
Original language: English
Title of host publication: Interspeech 2021
Publisher: ISCA
Pages: 2746-2750
Number of pages: 5
DOIs
Publication status: Published - 30 Aug 2021
Event: Interspeech 2021: The 22nd Annual Conference of the International Speech Communication Association - Brno, Czech Republic
Duration: 30 Aug 2021 – 3 Sept 2021
Conference number: 22
https://www.interspeech2021.org

Conference

Conference: Interspeech 2021
Country/Territory: Czech Republic
City: Brno
Period: 30/08/21 – 3/09/21
Internet address: https://www.interspeech2021.org

Keywords

  • Speech synthesis
  • attention
  • sequence-to-sequence modelling
