TY - GEN
T1 - Hider-finder-combiner
T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
AU - Webber, Jacob J.
AU - Perrotin, Olivier
AU - King, Simon
N1 - Funding Information:
The first author is funded by the Engineering and Physical Sciences Research Council (grant EP/L01503X/1), EPSRC Centre for Doctoral Training in Pervasive Parallelism at the University of Edinburgh, School of Informatics.
Publisher Copyright:
Copyright © 2020 ISCA
PY - 2020/10/31
Y1 - 2020/10/31
N2 - We introduce a prototype system for modifying an arbitrary parameter of a speech signal. Unlike signal-processing approaches that require dedicated methods for different parameters, our system can, in principle, modify any control parameter that the signal can be annotated with. Our system comprises three neural networks. The 'hider' removes all information related to the control parameter, outputting a hidden embedding. The 'finder' is an adversary used to train the 'hider', attempting to detect the value of the control parameter from the hidden embedding. The 'combiner' network recombines the hidden embedding with a desired new value of the control parameter. The input to and output of the system are mel-spectrograms, and we employ a neural vocoder to generate the output speech waveform. As a proof of concept, we use F0 as the control parameter. The system was evaluated in terms of control parameter accuracy and naturalness against a high-quality signal-processing method of F0 modification that also works in the spectrogram domain. We also show that, with modifications only to training data, the system is capable of modifying the 1st and 2nd vocal tract formants, showing progress towards universal signal modification.
KW - Adversarial networks
KW - Speech modification
KW - Speech synthesis
U2 - 10.21437/Interspeech.2020-2558
DO - 10.21437/Interspeech.2020-2558
M3 - Conference contribution
AN - SCOPUS:85098104368
VL - 2020-October
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 3206
EP - 3210
BT - Proceedings of the Annual Conference of the International Speech Communication Association
Y2 - 25 October 2020 through 29 October 2020
ER -