Speech Audio Corrector - using speech from non-target speakers for one-off correction of mispronunciations in grapheme-input text-to-speech

Jason Fong, Daniel Lyth, Gustav Eje Henter, Hao Tang, Simon King

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Correct pronunciation is essential for text-to-speech (TTS) systems in production. Most production systems rely on pronouncing dictionaries to perform grapheme-to-phoneme conversion. Unlike end-to-end TTS, this enables pronunciation correction by manually altering the phoneme sequence, but the necessary dictionaries are labour-intensive to create and only exist in a few high-resourced languages. This work demonstrates that accurate TTS pronunciation control can be achieved without a dictionary. Moreover, we show that such control can be performed without requiring any model retraining or fine-tuning, merely by supplying a single correctly-pronounced reading of a word in a different voice and accent at synthesis time. Experimental results show that our proposed system successfully enables one-off correction of mispronunciations in grapheme-based TTS with maintained synthesis quality. This opens the door to production-level TTS in languages and applications where pronunciation dictionaries are unavailable.
Original language: English
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Editors: Hanseok Ko, John H. L. Hansen
Publisher: International Speech Communication Association
Pages: 1213-1217
Number of pages: 5
Volume: 2022-September
DOIs
Publication status: Published - 18 Sept 2022
Event: 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Korea, Republic of
Duration: 18 Sept 2022 to 22 Sept 2022

Publication series

Name: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
ISSN (Print): 2308-457X

Conference

Conference: 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022
Country/Territory: Korea, Republic of
City: Incheon
Period: 18/09/22 to 22/09/22

Keywords / Materials (for Non-textual outputs)

  • pronunciation control
  • speech synthesis
