A Comparison between Letters and Phones as Input to Sequence-to-Sequence Models for Speech Synthesis

Jason Fong, Jason Taylor, Korin Richmond, Simon King

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Neural sequence-to-sequence (S2S) models for text-to-speech synthesis (TTS) may take letter or phone input sequences. Since for many languages phones have a more direct relationship to the acoustic signal, they lead to improved quality. But generating phone transcriptions from text requires an expensive dictionary and an error-prone grapheme-to-phoneme (G2P) model, and the relative improvement over using letters has yet to be quantified. In approaching this question, we presume that letter-input S2S models must implicitly learn an internal counterpart to G2P conversion and therefore inevitably make errors. Such a model may thus be viewed as a phone-input S2S model with inaccurate phone input. To quantify this inaccuracy, we compare in this paper a letter-input S2S system to several phone-input systems trained on data with varying levels of error in the phonetic transcription. Our findings show our letter-input system is equivalent in quality to the phone-input system in which 25% of word tokens in the training data have incorrect phonetic transcriptions. Furthermore, we find that for phone-input systems up to 15% of word tokens in the training data can have incorrect phonetic transcriptions without any significant difference in performance from a 0% error rate system. This suggests it is acceptable to use G2P to predict pronunciations for out-of-vocabulary words (OOVs) provided they make up less than around 15% of the training data, removing the need to manually add OOVs to the dictionary for every new training set.
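
The abstract describes training phone-input systems on data in which a controlled fraction of word tokens carry incorrect phonetic transcriptions. The paper's actual corruption procedure is not given here; the following is a minimal illustrative sketch, assuming a simple token-level substitution model in which a randomly chosen phone in a selected token's pronunciation is replaced. The function and variable names (`corrupt_phone_transcriptions`, `lexicon`, `phone_inventory`) are hypothetical, not from the paper.

```python
import random

def corrupt_phone_transcriptions(utterances, lexicon, phone_inventory,
                                 token_error_rate, seed=0):
    """Illustrative sketch only: give a fraction of word tokens an incorrect
    pronunciation by perturbing one phone, a crude stand-in for G2P errors.

    utterances: list of utterances, each a list of word strings
    lexicon: dict mapping word -> list of phone symbols
    phone_inventory: list of all phone symbols
    token_error_rate: fraction of word tokens to corrupt (e.g. 0.15 or 0.25)
    """
    rng = random.Random(seed)
    phone_inputs = []
    for words in utterances:
        phones = []
        for word in words:
            pron = list(lexicon[word])
            if rng.random() < token_error_rate:
                # Corrupt this token: substitute a random phone at one position.
                idx = rng.randrange(len(pron))
                pron[idx] = rng.choice(phone_inventory)
            phones.extend(pron)
        phone_inputs.append(phones)
    return phone_inputs

# Toy usage (purely illustrative data):
lexicon = {"the": ["dh", "ax"], "cat": ["k", "ae", "t"], "sat": ["s", "ae", "t"]}
utts = [["the", "cat", "sat"]]
noisy = corrupt_phone_transcriptions(utts, lexicon,
                                     phone_inventory=list({p for v in lexicon.values() for p in v}),
                                     token_error_rate=0.25)
```

The resulting phone sequences would then serve as input to the phone-input S2S system in place of the clean transcriptions, with the corruption rate swept across conditions (e.g. 0%, 15%, 25%) to match the comparisons reported above.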
Original language: English
Title of host publication: Proceedings of the 10th ISCA Speech Synthesis Workshop
Publisher: International Speech Communication Association
Pages: 223-227
Number of pages: 5
DOIs
Publication status: Published - 20 Sept 2019
Event: The 10th ISCA Speech Synthesis Workshop - Austrian Museum of Folk Life and Folk Art, Vienna, Austria
Duration: 20 Sept 2019 - 22 Sept 2019
Conference number: 10
http://ssw10.oeaw.ac.at/index.html

Publication series

Name
Publisher: ISCA
ISSN (Electronic): 2312-2846

Conference

Conference: The 10th ISCA Speech Synthesis Workshop
Abbreviated title: SSW 2019
Country/Territory: Austria
City: Vienna
Period: 20/09/19 - 22/09/19
Internet address
