The HMM-based TTS can produce a highly intelligible and decent quality voice. However, HMM model degrades when feature vectors used in training are noisy. Among all noisy features, pitch tracking errors and corresponding flawed voiced/unvoiced (v/u) decisions are identified as two key factors in voice quality problems. In this paper, we propose a minimum v/u error approach to F0 generation. A prior knowledge of v/u is imposed in each Mandarin phone and accumulated v/u posterior probabilities are used to search for the optimal v/u switching point in each VU or UV segment in generation. Objectively the new approach is shown to improve v/u prediction performance, specifically on voiced to unvoiced swapping errors. They are reduced from 3.7% (baseline) down to 2.0% (new approach). The improvement is also subjectively confirmed by an AB preference test score, 72% (new approach) versus 22% (baseline).
|Title of host publication||INTERSPEECH 2009 10th Annual Conference of the International Speech Communication Association|
|Number of pages||4|
|Publication status||Published - 2009|