Speech Enhancement of Noisy and Reverberant Speech for Text-to-Speech

Cassia Valentini Botinhao, Junichi Yamagishi

Research output: Contribution to journal › Article › peer-review

Abstract

Text-to-speech voices created from noisy and reverberant recordings are of lower quality than voices created from clean recordings. A simple way to improve them is to raise the quality of the recordings before text-to-speech training, using speech enhancement methods such as noise suppression and dereverberation. In this paper, we take this approach and perform the enhancement with a recurrent neural network. The network is trained on parallel data of clean and lower quality recordings of speech. The lower quality data were created artificially by adding recordings of environmental noise to studio quality recordings of speech and by convolving these clean recordings with room impulse responses. We trained separate networks on noise-only data, reverberation-only data, and data with both reverberation and additive noise. Voices trained with lower quality data that had been enhanced using these networks were of significantly higher quality in all cases. In the noise-only case, the enhanced synthetic voice ranked as high as the voice trained with clean data. In the most realistic and challenging scenario, with both noise and reverberation present, the improvements were more modest but still significant.
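
The data creation step described in the abstract (adding environmental noise to clean speech and convolving clean speech with a room impulse response) can be illustrated with a minimal NumPy sketch. This is not the authors' pipeline. The function names `add_noise`, `add_reverb` and `degrade`, the `snr_db` parameter, the choice to loop or trim the noise to the utterance length, and the order "reverberation first, then noise" in the combined case are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def add_noise(clean, noise, snr_db):
    """Mix noise into speech at a target signal-to-noise ratio in dB (assumed parameter)."""
    # Loop or trim the noise recording so it covers the whole utterance (assumption).
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)  # assumed non-zero
    # Scale the noise so that clean_power / scaled_noise_power equals 10^(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise


def add_reverb(clean, rir):
    """Convolve speech with a room impulse response, trimmed to the input length."""
    return np.convolve(clean, rir)[:len(clean)]


def degrade(clean, noise=None, rir=None, snr_db=10.0):
    """Build the lower quality half of a parallel (clean, degraded) training pair."""
    out = clean
    if rir is not None:
        out = add_reverb(out, rir)  # hypothetical ordering: reverberation first
    if noise is not None:
        out = add_noise(out, noise, snr_db)  # SNR is relative to the signal at this point
    return out


# Usage with synthetic stand-ins for real recordings (illustrative values only).
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                          # one second at 16 kHz
noise = rng.standard_normal(4000)                           # shorter noise clip, looped
rir = np.exp(-np.arange(800) / 100.0) * rng.standard_normal(800)  # decaying toy impulse response

noise_only = degrade(clean, noise=noise, snr_db=5.0)
reverb_only = degrade(clean, rir=rir)
noisy_reverberant = degrade(clean, noise=noise, rir=rir, snr_db=5.0)
```

Each degraded signal is paired with its `clean` source, which gives the three kinds of parallel data the abstract lists: noise only, reverberation only, and both.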
Original language: English
Pages (from-to): 1420-1433
Number of pages: 14
Journal: IEEE/ACM Transactions on Audio, Speech and Language Processing
Volume: 26
Issue number: 8
Early online date: 20 Apr 2018
Publication status: Published - 1 Aug 2018
