Abstract / Description of output
Our goal is to separate out speaking style from speaker identity in utterance-level representations of speech such as i-vectors and x-vectors. We first show that both i-vectors and x-vectors contain information not only about speaker but also about speaking style (for one data set) or emotion (for another data set), even when projected into a low-dimensional space. To disentangle these factors, we use an autoencoder in which the latent space is split into two subspaces. The entangled information about speaker and style/emotion is pushed apart by the use of auxiliary classifiers that take one of the two latent subspaces as input and that are jointly learned with the autoencoder. We evaluate how well the latent subspaces separate the factors by using them as input to separate style/emotion classification tasks. In traditional speaker identification tasks, speaker-invariant characteristics are factorized from channel and then the channel information is ignored. Our results suggest that this so-called channel may contain exploitable information, which we refer to as style factors. Finally, we propose future work to use information theory to formalize style factors in the context of speaker identity.
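The split-latent arrangement described in the abstract can be illustrated with a minimal NumPy sketch of a single forward pass and a combined loss. This is not the authors' implementation: the layer sizes, the single linear layers, the `tanh` nonlinearity, the equal split of the latent space, and the loss weights `alpha` and `beta` are all assumptions made for illustration, and the input is random data standing in for i-vectors or x-vectors. No training step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: none of these values come from the paper.
EMB_DIM, LATENT_DIM, SPK_DIM = 32, 8, 4  # style subspace gets the remaining dims
N_SPEAKERS, N_STYLES = 5, 3


def init(n_in, n_out):
    """Small random weight matrix and zero bias for one linear layer."""
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)


params = {
    "enc": init(EMB_DIM, LATENT_DIM),
    "dec": init(LATENT_DIM, EMB_DIM),
    "spk_clf": init(SPK_DIM, N_SPEAKERS),              # sees only the speaker subspace
    "sty_clf": init(LATENT_DIM - SPK_DIM, N_STYLES),   # sees only the style subspace
}


def linear(x, layer):
    w, b = layer
    return x @ w + b


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy for integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def joint_loss(x, spk_labels, sty_labels, alpha=1.0, beta=1.0):
    """Forward pass and combined objective; alpha and beta are assumed weights."""
    z = np.tanh(linear(x, params["enc"]))
    # Split the latent vector into two subspaces.
    z_spk, z_sty = z[:, :SPK_DIM], z[:, SPK_DIM:]
    # The decoder reconstructs the input from both subspaces together.
    x_hat = linear(np.concatenate([z_spk, z_sty], axis=1), params["dec"])
    recon = np.mean((x - x_hat) ** 2)
    # Each auxiliary classifier takes exactly one subspace as input.
    spk_ce = cross_entropy(linear(z_spk, params["spk_clf"]), spk_labels)
    sty_ce = cross_entropy(linear(z_sty, params["sty_clf"]), sty_labels)
    return recon + alpha * spk_ce + beta * sty_ce, z_spk, z_sty


# Random stand-ins for utterance-level embeddings and their labels.
x = rng.normal(size=(6, EMB_DIM))
spk = rng.integers(0, N_SPEAKERS, size=6)
sty = rng.integers(0, N_STYLES, size=6)
loss, z_spk, z_sty = joint_loss(x, spk, sty)
print(z_spk.shape, z_sty.shape, float(loss))
```

In a trained version of such a model, minimizing the summed loss would update the encoder, decoder, and both classifiers together, which is one way to realize the joint learning the abstract mentions. The two latent slices could then be passed to separate style or emotion classifiers for evaluation.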
Original language | English |
---|---|
Title of host publication | Proceedings Interspeech 2019 |
Publisher | ISCA |
Pages | 3945-3949 |
Volume | 2019-September |
Publication status | Published - 19 Sept 2019 |
Event | 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, Graz, Austria. Duration: 15 Sept 2019 → 19 Sept 2019. https://www.interspeech2019.org/ |
Publication series
Name | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
---|---|
ISSN (Print) | 2308-457X |
Conference
Conference | 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language |
---|---|
Abbreviated title | INTERSPEECH 2019 |
Country/Territory | Austria |
City | Graz |
Period | 15/09/19 → 19/09/19 |
Keywords / Materials (for Non-textual outputs)
- speaking style
- emotion recognition
- speech disentanglement
- speaker recognition