A Vector Quantized Variational Autoencoder (VQ-VAE) Autoregressive Neural F0 Model for Statistical Parametric Speech Synthesis

Xin Wang, Shinji Takaki, Junichi Yamagishi, Simon King, Keiichi Tokuda

Research output: Contribution to journal › Article › peer-review

Abstract / Description of output

Recurrent neural networks (RNNs) can predict fundamental frequency (F0) for statistical parametric speech synthesis systems, given linguistic features as input. However, these models assume conditional independence between consecutive F0 values, given the RNN state. In a previous study, we proposed autoregressive (AR) neural F0 models to capture the causal dependency of successive F0 values. In subjective evaluations, a deep AR model (DAR) outperformed an RNN.
Here, we propose a Vector Quantized Variational Autoencoder (VQ-VAE) neural F0 model that is both more efficient and more interpretable than the DAR. This model has two stages: one uses the VQ-VAE framework to learn a latent code for the F0 contour of each linguistic unit, and the other learns to map from linguistic features to latent codes. In contrast to the DAR and RNN, which process the input linguistic features frame-by-frame, the new model converts one linguistic feature vector into one latent code for each linguistic unit. The new model achieves better objective scores than the DAR, has a smaller memory footprint, and is computationally faster. Visualization of the latent codes for phones and moras reveals that each latent code represents an F0 shape for a linguistic unit.
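The central operation of the first stage described above is vector quantization: an encoder output is replaced by its nearest entry in a learned codebook, so each linguistic unit's F0 contour is summarized by one discrete code. The following is a minimal NumPy sketch of that quantization step only; the codebook size, latent dimension, and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Hypothetical settings for illustration: 8 codebook entries, 4-dim latents.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))  # in a VQ-VAE, these entries are learned

def quantize(z):
    """Return (index, code) of the codebook entry nearest to encoder output z."""
    dists = np.sum((codebook - z) ** 2, axis=1)  # squared L2 distance to each code
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

z = rng.normal(size=4)      # stand-in for an encoded F0 contour of one unit
idx, code = quantize(z)     # the discrete index is what the second stage predicts
```

In the full model, the second stage would then learn to predict `idx` from the linguistic features of the unit, turning F0 generation into a per-unit classification-style mapping rather than frame-by-frame regression.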
Original language: English
Pages (from-to): 157-170
Number of pages: 13
Journal: IEEE/ACM Transactions on Audio, Speech and Language Processing
Volume: 28
Early online date: 28 Oct 2019
DOIs
Publication status: Published - 1 Jan 2020

Keywords / Materials (for Non-textual outputs)

  • fundamental frequency
  • speech synthesis
  • neural network
  • variational auto-encoder
