Abstract
Previous work on cross-lingual transfer learning in text-to-speech has shown the effectiveness of fine-tuning phonemic representations on small amounts of target language data. In other contexts, phonological features (PFs) have been suggested as a more suitable input representation than phonemes for sharing acoustic information between languages, for example in multilingual model training or for code-switching synthesis where an utterance may contain words from multiple languages. Starting from a model trained on 14 hours of English, we find that cross-lingual fine-tuning with 15 minutes of German data can produce speech with subjective naturalness ratings comparable to a model trained from scratch on 4 hours of German, using either phonemes or PFs. We also find a modest but statistically significant improvement in naturalness ratings using PFs over phonemes when training from scratch on 4 hours of German.
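
The motivation for PFs as a shared input representation can be illustrated with a minimal sketch. The feature set and phoneme assignments below are simplified assumptions chosen for illustration, not the inventory used in the paper. The German vowel /y/ (as in "über") has no counterpart in an English phoneme inventory, so a one-hot phoneme encoding gives it no overlap with anything seen during English training, whereas each of its phonological features already occurs in some English phoneme.

```python
# Hypothetical, simplified binary feature set (not the paper's inventory).
FEATURES = ["high", "front", "back", "rounded"]

# Illustrative feature assignments for three vowels.
PHONEME_FEATURES = {
    "i": {"high", "front"},             # as in English "see"
    "u": {"high", "back", "rounded"},   # as in English "too"
    "y": {"high", "front", "rounded"},  # as in German "über", absent from English
}

# Toy source-language inventory, assumed for this example.
ENGLISH = ["i", "u"]


def pf_vector(phoneme):
    """Encode a phoneme as a binary vector over FEATURES."""
    active = PHONEME_FEATURES[phoneme]
    return [1 if feature in active else 0 for feature in FEATURES]


def one_hot(phoneme, inventory):
    """Encode a phoneme as a one-hot vector over a fixed phoneme inventory."""
    return [1 if p == phoneme else 0 for p in inventory]


def covered_by(phoneme, inventory):
    """Return True if every feature of `phoneme` occurs in some inventory phoneme."""
    seen = set()
    for p in inventory:
        seen |= PHONEME_FEATURES[p]
    return PHONEME_FEATURES[phoneme] <= seen


print(one_hot("y", ENGLISH))     # all zeros: /y/ is outside the English inventory
print(pf_vector("y"))            # non-zero: built from features English also uses
print(covered_by("y", ENGLISH))  # whether English training data covers every feature of /y/
```

In this toy setting, a model with PF inputs has already seen each input dimension that /y/ activates, while a phoneme-based model would need a new, untrained symbol for it.
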
Original language | English |
---|---|
Title of host publication | Proc. 11th ISCA Speech Synthesis Workshop (SSW 11) |
Publication status | Published - 28 Aug 2021 |
Event | The 11th ISCA Speech Synthesis Workshop (SSW11), Gárdony, Hungary; Duration: 26 Aug 2021 → 28 Aug 2021; Conference number: 11; https://ssw11.hte.hu |
Conference
Conference | The 11th ISCA Speech Synthesis Workshop (SSW11) |
---|---|
Abbreviated title | SSW11 |
Country/Territory | Hungary |
City | Gárdony |
Period | 26/08/21 → 28/08/21 |
Internet address | https://ssw11.hte.hu |
Keywords
- speech synthesis
- low-resource
- cross-lingual
- transfer learning