Abstract / Description of output

Due to the data inefficiency and low speech quality of grapheme-based end-to-end text-to-speech (TTS), having a separate high-performance TTS linguistic frontend is still commonly regarded as necessary. However, a TTS frontend is itself difficult to build and maintain, since it requires abundant linguistic knowledge for its construction. In this paper, we start by bootstrapping an integrated sequence-to-sequence (Seq2Seq) TTS frontend using a pre-existing pipeline-based frontend and large amounts of unlabelled normalized text, achieving promising memorization and generalization abilities. To overcome the performance limitation imposed by the pipeline-based frontend, this work proposes a Forced Alignment (FA) method to decode the pronunciations from transcribed speech audio and then use them to update the Seq2Seq frontend. Our experiments demonstrate the effectiveness of our proposed FA method, which can significantly improve the word token accuracy from 52.6% to 91.2% for out-of-dictionary words. In addition, it can also correct the pronunciation of homographs from transcribed speech audio and potentially improve the homograph disambiguation performance of the Seq2Seq frontend.
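
The abstract reports results as word token accuracy but does not define the metric. A minimal sketch of one plausible reading follows, assuming a word token counts as correct only when its predicted phoneme sequence matches the reference exactly. The function name `word_token_accuracy` and the toy ARPAbet-style data are hypothetical illustrations, not taken from the paper.

```python
from typing import Sequence


def word_token_accuracy(
    predicted: Sequence[Sequence[str]],
    reference: Sequence[Sequence[str]],
) -> float:
    """Fraction of word tokens whose predicted phonemes exactly match the reference.

    Assumption: exact sequence match per word token; the paper may score differently.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must contain the same number of word tokens")
    if not reference:
        raise ValueError("at least one word token is required")
    correct = sum(1 for p, r in zip(predicted, reference) if list(p) == list(r))
    return correct / len(reference)


# Hypothetical toy data: four word tokens, one of them mispronounced.
reference = [
    ["HH", "AH", "L", "OW"],
    ["W", "ER", "L", "D"],
    ["R", "IY", "D"],
    ["L", "EH", "D"],
]
predicted = [
    ["HH", "AH", "L", "OW"],
    ["W", "ER", "L", "D"],
    ["R", "EH", "D"],  # wrong vowel, so the whole token counts as incorrect
    ["L", "EH", "D"],
]

accuracy = word_token_accuracy(predicted, reference)
print(f"{accuracy:.1%}")  # prints 75.0%
```

Under this reading, a single wrong phoneme makes the whole word token incorrect, which is why the metric is sensitive to out-of-dictionary words.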
Original language: English
Pages (from-to): 1940-1952
Journal: IEEE/ACM Transactions on Audio, Speech and Language Processing
Volume: 31
Early online date: 5 May 2023
Publication status: Published - 22 May 2023

Title: Improving Seq2Seq TTS Frontends with Transcribed Speech Audio