Towards an Unsupervised Speaking Style Voice Building Framework: Multi-Style Speaker Diarization

J Lorenzo, B Martinez, R Barra-Chicote, V Lopez-Ludena, J Ferreiros, J Yamagishi, J M Montero

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution


Current text-to-speech systems are developed using studio-recorded speech, either in a neutral style or based on acted emotions. However, the proliferation of media sharing sites makes it possible to develop a new generation of speech-based systems that can cope with spontaneous and styled speech. This paper proposes an architecture for dealing with realistic recordings and reports experiments on unsupervised speaker diarization. To maximize the speaker purity of the clusters while keeping speaker coverage high, the paper evaluates the F-measure of a diarization module. The module achieves high scores (>85%), especially when the clusters are longer than 30 seconds, even for the more spontaneous and expressive styles (such as talk shows or sports).
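As an illustration of the evaluation idea in the abstract, the sketch below combines cluster purity and speaker coverage into a single F-measure. It assumes the commonly used duration-weighted definitions of purity and coverage and takes F as their harmonic mean; the abstract does not state the exact formulas, so this may differ from the paper's own computation. The function name `purity_coverage_f` and the `(speaker, cluster, duration)` input layout are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def purity_coverage_f(segments):
    """Return (purity, coverage, F) for a diarization output.

    segments: iterable of (speaker, cluster, duration_seconds) tuples, where
    speaker is the reference label and cluster is the hypothesized label.
    Assumption: duration-weighted purity/coverage, F = harmonic mean.
    """
    # Total time each (reference speaker, hypothesized cluster) pair shares.
    overlap = defaultdict(float)
    for speaker, cluster, duration in segments:
        overlap[(speaker, cluster)] += duration

    total = sum(overlap.values())
    if total <= 0:
        raise ValueError("segments must contain a positive total duration")

    # Largest overlap found for each cluster and for each speaker.
    best_per_cluster = defaultdict(float)
    best_per_speaker = defaultdict(float)
    for (speaker, cluster), dur in overlap.items():
        best_per_cluster[cluster] = max(best_per_cluster[cluster], dur)
        best_per_speaker[speaker] = max(best_per_speaker[speaker], dur)

    # Purity: share of time covered by each cluster's dominant speaker.
    purity = sum(best_per_cluster.values()) / total
    # Coverage: share of time covered by each speaker's dominant cluster.
    coverage = sum(best_per_speaker.values()) / total

    f_measure = 2 * purity * coverage / (purity + coverage)
    return purity, coverage, f_measure


# Toy data (invented for illustration, not taken from the paper).
example = [
    ("A", "c1", 40.0),
    ("B", "c1", 10.0),
    ("B", "c2", 30.0),
    ("A", "c3", 20.0),
]
purity, coverage, f_measure = purity_coverage_f(example)
```

In this toy case, splitting speaker A across two clusters keeps purity high but lowers coverage, which is the trade-off the F-measure is meant to balance.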
Original language: English
Title of host publication: Proc. Interspeech 2012
Subtitle of host publication: 13th Annual Conference of the International Speech Communication Association, Portland, Oregon, USA, September 9-13, 2012
Publication status: Published - Sep 2012


  • expressive speech synthesis, speaker diarization, speaking styles, voice cloning

