Abstract
Deep speaker embeddings have been shown to encode a wide variety of attributes relating to a speaker. The aim of this work is to separate out some of these attributes in the embedding space, disentangling these sources of speaker variation into subsets of the embedding dimensions. This is achieved by modifying the training procedure of a typical speaker embedding network, which is usually trained only to classify speakers. This work instead adds pairs of attribute-specific task heads that operate on complementary subsets of the speaker embedding dimensions. While specific dimensions are encouraged to encode an attribute, for example gender, the other dimensions are penalized for containing this information using an adversarial loss. We show that this method is effective in factorizing out multiple attributes in the embedding space, successfully disentangling gender, nationality and age. Using the disentangled representations, we investigate how much removing this information impacts speaker verification and diarization performance, showing that gender is a significant source of separation in the deep speaker embedding space, with nationality and age also contributing to a lesser degree.
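The core idea in the abstract can be sketched as follows: partition the embedding dimensions per attribute, attach an attribute head to one subset and an adversarial head to the complement, and combine their losses with opposite signs. This is a minimal illustrative sketch, not the authors' implementation; the dimension counts, slice layout, and linear heads are all assumptions chosen for clarity.

```python
import numpy as np

# Hypothetical dimension allocation (assumed values, not from the paper):
# a few dims per attribute, the remainder left for residual speaker identity.
EMB_DIM = 192
SLICES = {"gender": slice(0, 16), "nationality": slice(16, 32), "age": slice(32, 48)}

def split_embedding(emb, attr):
    """Split an embedding into (attribute dims, complementary dims)."""
    keep = np.zeros(emb.shape[-1], dtype=bool)
    keep[SLICES[attr]] = True
    return emb[..., keep], emb[..., ~keep]

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch."""
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
emb = rng.standard_normal((4, EMB_DIM))   # a toy batch of 4 embeddings
labels = np.array([0, 1, 0, 1])           # e.g. binary gender labels

attr_dims, rest_dims = split_embedding(emb, "gender")

# Toy linear heads; random weights stand in for trained parameters.
W_attr = rng.standard_normal((attr_dims.shape[-1], 2))
W_adv = rng.standard_normal((rest_dims.shape[-1], 2))

# Attribute head: the designated subset should predict the attribute.
attr_loss = cross_entropy(attr_dims @ W_attr, labels)
# Adversarial head: the complementary subset is penalized for carrying the
# same information (in practice via gradient reversal), hence the minus sign.
total_loss = attr_loss - cross_entropy(rest_dims @ W_adv, labels)
```

In training, the adversarial head itself would still be optimized to predict the attribute while the encoder receives the reversed gradient; a gradient reversal layer is the usual way to realize that, but the sign flip above conveys the objective.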
Original language | English |
---|---|
Title of host publication | Proceedings of Interspeech 2022 |
Editors | Hanseok Ko, John H. L. Hansen |
Publisher | ISCA |
Pages | 610-614 |
Number of pages | 5 |
DOIs | |
Publication status | Published - 18 Sept 2022 |
Event | Interspeech 2022, Incheon, Korea, Republic of. Duration: 18 Sept 2022 → 22 Sept 2022. Conference number: 23. https://interspeech2022.org/ |
Conference
Conference | Interspeech 2022 |
---|---|
Country/Territory | Korea, Republic of |
City | Incheon |
Period | 18/09/22 → 22/09/22 |
Internet address | https://interspeech2022.org/ |