Building high quality synthesis systems with open domain vocabulary and a small audio database is a challenging problem, even when the targeted application is well constrained. Monophone unit concatenation (as opposed to diphone) is an approach that can compensate for the poor unit coverage that a small database implies. However, joining at phone boundaries is a delicate task that requires accurate targeting. In this paper, we present an automatically trained targeting system based on the parametric synthesiser HTS, and compare it to a concatenative monophone system and a baseline concatenative diphone system. We apply a novel evaluation methodology which includes a qualitative component, and allows for fast incremental development of synthesis systems. Preliminary results show that although the hybrid system performed significantly more poorly on out of database items, it is less affected by segmentation errors than the monophone system.
|Title of host publication||INTERSPEECH 2012, 13th Annual Conference of the International Speech Communication Association, Portland, Oregon, USA, September 9-13, 2012|
|Number of pages||4|
|Publication status||Published - 2012|