Investigating accuracy of pitch-accent annotations in neural network-based speech synthesis and denoising effects

Hieu-Thi Luong, Xin Wang, Junichi Yamagishi, Nobuyuki Nishizawa

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract / Description of output

We investigated the impact of noisy linguistic features on the performance of a neural-network-based Japanese speech synthesis system that uses a WaveNet vocoder. We compared an ideal system, which uses manually corrected linguistic features (including phoneme and prosodic information) in both the training and test sets, against a few other systems that use corrupted linguistic features. Both subjective and objective results demonstrate that corrupted linguistic features, especially those in the test set, affected the ideal system's performance to a statistically significant degree because of the mismatch between the training and test conditions. Interestingly, while an utterance-level Turing test showed that listeners had difficulty differentiating synthetic speech from natural speech, it also indicated that adding noise to the linguistic features in the training set can partially reduce the effect of the mismatch, regularize the model, and help the system perform better when the linguistic features of the test set are noisy.
Index Terms: speech synthesis, deep neural network, Japanese prosody, WaveNet
Original language: English
Title of host publication: Proc. Interspeech 2018
Place of Publication: Hyderabad, India
Publisher: International Speech Communication Association
Pages: 37-41
Number of pages: 5
Publication status: Published - 5 Sept 2018
Event: Interspeech 2018 - Hyderabad International Convention Centre, Hyderabad, India
Duration: 2 Sept 2018 to 6 Sept 2018
http://interspeech2018.org/

Publication series

Name: Interspeech
Publisher: ISCA
ISSN (Print): 1990-9772

Conference

Conference: Interspeech 2018
Country/Territory: India
City: Hyderabad
Period: 2/09/18 to 6/09/18
