Waveform Generation for Text-to-speech Synthesis Using Pitch-synchronous Multi-scale Generative Adversarial Networks

Lauri Juvela, Bajibabu Bollepalli, Junichi Yamagishi, Paavo Alku

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

The state-of-the-art in text-to-speech (TTS) synthesis has recently improved considerably due to novel neural waveform generation methods, such as WaveNet. However, these methods suffer from their slow sequential inference process, while their parallel versions are difficult to train and even more computationally expensive. Meanwhile, generative adversarial networks (GANs) have achieved impressive results in image generation and are making their way into audio applications; parallel inference is among their lucrative properties. By adopting recent advances in GAN training techniques, this investigation studies waveform generation for TTS in two domains (speech signal and glottal excitation). Listening test results show that while direct waveform generation with GAN is still far behind WaveNet, a GAN-based glottal excitation model can achieve quality and voice similarity on par with a WaveNet vocoder.
Original languageEnglish
Title of host publicationICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Place of PublicationBrighton, United Kingdom
PublisherInstitute of Electrical and Electronics Engineers (IEEE)
Pages6915-6919
Number of pages5
ISBN (Electronic)978-1-4799-8131-1
ISBN (Print)978-1-4799-8132-8
DOIs
Publication statusPublished - 17 May 2019
Event44th International Conference on Acoustics, Speech, and Signal Processing: Signal Processing: Empowering Science and Technology for Humankind - Brighton , United Kingdom
Duration: 12 May 201917 May 2019
Conference number: 44
https://2019.ieeeicassp.org/

Publication series

Name
PublisherIEEE
ISSN (Print)1520-6149
ISSN (Electronic)2379-190X

Conference

Conference44th International Conference on Acoustics, Speech, and Signal Processing
Abbreviated titleICASSP 2019
CountryUnited Kingdom
CityBrighton
Period12/05/1917/05/19
Internet address

Keywords

  • Neural vocoding
  • text-to-speech
  • GAN
  • glottal excitation model

Fingerprint Dive into the research topics of 'Waveform Generation for Text-to-speech Synthesis Using Pitch-synchronous Multi-scale Generative Adversarial Networks'. Together they form a unique fingerprint.

Cite this