Edinburgh Research Explorer

A Vector Quantized Variational Autoencoder (VQ-VAE) Autoregressive Neural F0 Model for Statistical Parametric Speech Synthesis

Research output: Contribution to journal › Article


Original language: English
Pages (from-to): 157-170
Number of pages: 13
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Early online date: 28 Oct 2019
Publication status: Published - 1 Jan 2020


Recurrent neural networks (RNNs) can predict fundamental frequency (F0) for statistical parametric speech synthesis systems, given linguistic features as input. However, these models assume conditional independence between consecutive F0 values, given the RNN state. In a previous study, we proposed autoregressive (AR) neural F0 models to capture the causal dependency of successive F0 values. In subjective evaluations, a deep AR model (DAR) outperformed an RNN.
Here, we propose a Vector Quantized Variational Autoencoder (VQ-VAE) neural F0 model that is both more efficient and more interpretable than the DAR. This model has two stages: one uses the VQ-VAE framework to learn a latent code for the F0 contour of each linguistic unit, and the other learns to map from linguistic features to latent codes. In contrast to the DAR and RNN, which process the input linguistic features frame-by-frame, the new model converts one linguistic feature vector into one latent code per linguistic unit. The new model achieves better objective scores than the DAR, has a smaller memory footprint, and is computationally faster. Visualization of the latent codes for phones and moras reveals that each latent code represents an F0 shape for a linguistic unit.
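The vector-quantization step central to the VQ-VAE framework can be sketched as follows. This is a minimal illustration, not the paper's implementation: the codebook size, embedding dimension, and function names are hypothetical, and training (codebook updates, straight-through gradients) is omitted. Each encoder output — here standing in for an embedding of one linguistic unit's F0 contour — is replaced by its nearest codebook vector, yielding a discrete latent code per unit.

```python
import numpy as np

def quantize(z, codebook):
    """Map each row of z to the index and vector of its nearest code.

    z        : (n_units, dim) encoder outputs, one per linguistic unit
    codebook : (n_codes, dim) learned code vectors (F0-shape templates)
    """
    # Squared Euclidean distance from every embedding to every code,
    # computed via broadcasting: result has shape (n_units, n_codes).
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)        # one discrete latent code per unit
    return idx, codebook[idx]     # quantized embeddings

# Illustrative sizes only: 8 codes of dimension 4, 3 linguistic units.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))
z = rng.normal(size=(3, 4))
idx, zq = quantize(z, codebook)
print(idx.shape, zq.shape)        # (3,) (3, 4)
```

In the paper's two-stage setup, a second model would then learn to predict these discrete indices directly from per-unit linguistic features, replacing frame-by-frame F0 prediction with one code lookup per phone or mora.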

Research areas

• fundamental frequency, speech synthesis, neural network, variational auto-encoder


ID: 115877562