Abstract
Sequence-to-sequence speech synthesis models are notorious for gross errors such as skipping and repetition, commonly associated with failures in the attention mechanism. While much work has been done to improve attention and decrease errors, this paper focuses instead on automatic error detection and analysis. We evaluated three objective metrics against error detection scores collected through human listening. All metrics are derived from the synthesised attention matrix alone and do not require a reference signal, relying on the expectation that errors occur when attention is dispersed or insufficient. Using one of these metrics as an analysis tool, we observed that gross errors are more likely to occur in longer sentences and in sentences with punctuation marks that indicate a pause or break. We also found that mechanisms such as forcibly incremented attention have the potential to decrease gross errors, but to the detriment of naturalness. The results of the error detection evaluation revealed that two of the evaluated metrics were able to detect errors with a relatively high success rate, obtaining F-scores of up to 0.89 and 0.96.
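The abstract does not name the three metrics, so the following is only an illustrative sketch of what reference-free measures of "dispersed" and "insufficient" attention could look like. The function names `attention_dispersion` and `uncovered_input_ratio`, the `(decoder_steps, encoder_steps)` matrix layout, and the `threshold` value are all assumptions made for this example; they are not the paper's definitions.

```python
import numpy as np


def attention_dispersion(attn, eps=1e-12):
    """Mean row entropy of the attention matrix, normalised to [0, 1].

    Hypothetical metric: `attn` is assumed to have shape
    (decoder_steps, encoder_steps) with each row summing to 1.
    Values near 0 mean sharply focused attention; values near 1
    mean attention spread evenly over all input positions.
    """
    attn = np.asarray(attn, dtype=float)
    n_inputs = attn.shape[1]
    if n_inputs < 2:
        return 0.0  # entropy is undefined to normalise for a single input
    row_entropy = -np.sum(attn * np.log(attn + eps), axis=1)
    return float(np.mean(row_entropy) / np.log(n_inputs))


def uncovered_input_ratio(attn, threshold=0.5):
    """Fraction of input positions that never receive strong attention.

    Hypothetical metric: an input position counts as "uncovered" when
    its peak attention weight over all decoder steps stays below
    `threshold` (an arbitrary value chosen for illustration).
    """
    attn = np.asarray(attn, dtype=float)
    peak_per_input = attn.max(axis=0)
    return float(np.mean(peak_per_input < threshold))


# Toy attention matrices (rows: decoder steps, columns: input tokens).
sharp = np.eye(4)                      # clean monotonic alignment
diffuse = np.full((4, 4), 0.25)        # attention spread over every token
skipping = np.array([[1, 0, 0, 0],
                     [0, 1, 0, 0],
                     [0, 0, 0, 1]], dtype=float)  # third token never attended
```

On these toy inputs, the diagonal alignment scores low on both measures, the uniform matrix scores high on dispersion, and the matrix that jumps over the third token shows a non-zero uncovered ratio, which is the kind of pattern one would expect to accompany a skipping error.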
Original language | English |
---|---|
Title of host publication | Interspeech 2021 |
Publisher | ISCA |
Pages | 2746-2750 |
Number of pages | 5 |
Publication status | Published - 30 Aug 2021 |
Event | Interspeech 2021: The 22nd Annual Conference of the International Speech Communication Association, Brno, Czech Republic. Duration: 30 Aug 2021 → 3 Sept 2021. Conference number: 22. https://www.interspeech2021.org |
Conference
Conference | Interspeech 2021 |
---|---|
Country/Territory | Czech Republic |
City | Brno |
Period | 30/08/21 → 3/09/21 |
Keywords / Materials (for Non-textual outputs)
- Speech synthesis
- attention
- sequence-to-sequence modelling