Soft context clustering for F0 modeling in HMM-based speech synthesis

Soheil Khorram*, Hossein Sameti, Simon King

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract / Description of output

This paper proposes the use of a new binary decision tree, which we call a soft decision tree, to improve generalization performance compared to the conventional ‘hard’ decision tree method that is used to cluster context-dependent model parameters in statistical parametric speech synthesis. We apply the method to improve the modeling of fundamental frequency, which is an important factor in synthesizing natural-sounding high-quality speech. Conventionally, hard decision tree-clustered hidden Markov models (HMMs) are used, in which each model parameter is assigned to a single leaf node. However, this ‘divide-and-conquer’ approach leads to data sparsity, with the consequence that it suffers from poor generalization, meaning that it is unable to accurately predict parameters for models of unseen contexts: the hard decision tree is a weak function approximator. To alleviate this, we propose the soft decision tree, which is a binary decision tree with soft decisions at the internal nodes. In this soft clustering method, internal nodes select both their children with certain membership degrees; therefore, each node can be viewed as a fuzzy set with a context-dependent membership function. The soft decision tree improves model generalization and provides a superior function approximator because it is able to assign each context to several overlapped leaves. In order to use such a soft decision tree to predict the parameters of the HMM output probability distribution, we derive the smoothest (maximum entropy) distribution which captures all partial first-order moments and a global second-order moment of the training samples. Employing such a soft decision tree architecture with maximum entropy distributions, a novel speech synthesis system is trained using maximum likelihood (ML) parameter re-estimation and synthesis is achieved via maximum output probability parameter generation. 
In addition, a soft decision tree construction algorithm optimizing a log-likelihood measure is developed. Both subjective and objective evaluations were conducted and indicate a considerable improvement over the conventional method.
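
The contrast between hard and soft clustering described above can be illustrated with a minimal sketch. It is not the authors' implementation: the sigmoid gate, the linear gate weights, the `Node` class, and the membership-weighted sum of leaf values are all assumptions chosen for illustration, since the abstract does not specify the membership function or the form of the maximum entropy distribution. The sketch only shows how an internal node can pass a context to both children with complementary membership degrees, so that every leaf contributes to the predicted parameter.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Node:
    # Leaf: `value` is a hypothetical leaf parameter (e.g. a log-F0 mean).
    value: Optional[float] = None
    # Internal node: hypothetical linear gate over binary context features.
    weights: Optional[Sequence[float]] = None
    bias: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def gate_score(node: Node, context: Sequence[float]) -> float:
    # Linear score of the context at an internal node (assumed form).
    return sum(w * x for w, x in zip(node.weights, context)) + node.bias


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def leaf_memberships(node: Node, context: Sequence[float],
                     degree: float = 1.0) -> List[Tuple[float, float]]:
    """Return (leaf value, membership degree) pairs for one context.

    Each internal node sends the context to BOTH children: the left child
    receives degree * g and the right child degree * (1 - g), so the leaf
    memberships always sum to 1.
    """
    if node.is_leaf():
        return [(node.value, degree)]
    g = sigmoid(gate_score(node, context))
    return (leaf_memberships(node.left, context, degree * g)
            + leaf_memberships(node.right, context, degree * (1.0 - g)))


def soft_predict(root: Node, context: Sequence[float]) -> float:
    # Membership-weighted combination of all leaf values (assumed rule).
    return sum(value * degree for value, degree in leaf_memberships(root, context))


def hard_predict(root: Node, context: Sequence[float]) -> float:
    # Conventional hard tree: follow exactly one branch at every node.
    node = root
    while not node.is_leaf():
        node = node.left if gate_score(node, context) >= 0.0 else node.right
    return node.value


if __name__ == "__main__":
    # Toy tree over two binary context features; all numbers are made up.
    tree = Node(
        weights=[2.0, -1.0], bias=0.0,
        left=Node(value=5.2),
        right=Node(weights=[0.0, 3.0], bias=-1.0,
                   left=Node(value=4.8), right=Node(value=5.6)),
    )
    context = [1.0, 0.0]
    print("hard:", hard_predict(tree, context))
    print("soft:", soft_predict(tree, context))
```

In this toy setup the hard tree returns the value of a single leaf, while the soft tree blends all leaves according to their membership degrees, which is the property the abstract credits for better generalization to unseen contexts.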
Original language: English
Article number: 2
Pages (from-to): 1-17
Journal: EURASIP Journal on Advances in Signal Processing
Issue number: 2
Early online date: 9 Jan 2015
Publication status: Published - 2015

Keywords / Materials (for Non-textual outputs)

  • context clustering
  • decision tree-based clustering
  • F0 modeling
  • Hidden Markov model
  • HMM
  • HMM-based speech synthesis
  • maximum entropy model
  • soft context clustering
  • soft decision tree
  • statistical parametric speech synthesis

