Investigating the contribution of speaker attributes to speaker separability using disentangled speaker representations

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Deep speaker embeddings have been shown to encode a wide variety of attributes relating to a speaker. The aim of this work is to separate out some of these attributes in the embedding space, disentangling these sources of speaker variation into subsets of the embedding dimensions. This is achieved by modifying the training procedure of a standard speaker embedding network, which is normally trained only to classify speakers. This work instead adds pairs of attribute-specific task heads operating on complementary subsets of the speaker embedding dimensions. While specific dimensions are encouraged to encode an attribute, for example gender, the other dimensions are penalized for containing this information using an adversarial loss. We show that this method is effective in factorizing out multiple attributes in the embedding space, successfully disentangling gender, nationality and age. Using the disentangled representations, we investigate how much removing this information impacts speaker verification and diarization performance, showing that gender is a significant source of separation in the deep speaker embedding space, with nationality and age also contributing to a lesser degree.
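The pairing of an attribute head with an adversarial head on complementary dimension subsets can be sketched as below. This is a minimal illustration, not the authors' implementation: the module names, the 256/32 dimension split, and the use of a gradient-reversal layer for the adversarial loss are all assumptions, and the full system would include the speaker classification loss and repeat this pairing for each attribute (gender, nationality, age).

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity on the forward pass, negated gradient on
    the backward pass. (Assumption: one common way to realize an adversarial
    loss; the paper may implement its adversarial objective differently.)"""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output


class AttributeHeadPair(nn.Module):
    """One pair of task heads for a single attribute (e.g. gender).

    The speaker embedding is split into two complementary subsets:
    - the first `attr_dims` dimensions feed a classifier encouraged to
      predict the attribute,
    - the remaining dimensions feed an adversarial classifier behind a
      gradient-reversal layer, penalizing them for carrying the same
      information. Sizes here are illustrative.
    """

    def __init__(self, emb_dim=256, attr_dims=32, n_classes=2):
        super().__init__()
        self.attr_dims = attr_dims
        self.attr_head = nn.Linear(attr_dims, n_classes)
        self.adv_head = nn.Linear(emb_dim - attr_dims, n_classes)

    def forward(self, emb):
        attr_part = emb[:, : self.attr_dims]
        rest = emb[:, self.attr_dims:]
        attr_logits = self.attr_head(attr_part)
        adv_logits = self.adv_head(GradReverse.apply(rest))
        return attr_logits, adv_logits


# Toy usage: one training step's losses for a batch of 4 embeddings.
heads = AttributeHeadPair()
emb = torch.randn(4, 256, requires_grad=True)
labels = torch.tensor([0, 1, 0, 1])  # e.g. binary gender labels
attr_logits, adv_logits = heads(emb)
loss = (nn.functional.cross_entropy(attr_logits, labels)
        + nn.functional.cross_entropy(adv_logits, labels))
loss.backward()  # the adversarial branch pushes information OUT of `rest`
```

Minimizing both cross-entropies trains both heads to predict the attribute, but the reversed gradient flowing through `rest` updates the embedding network to make those dimensions less predictive of it, concentrating the attribute in the designated subset.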
Original language: English
Title of host publication: Proceedings of Interspeech 2022
Editors: Hanseok Ko, John H. L. Hansen
Publisher: ISCA
Pages: 610-614
Number of pages: 5
DOIs
Publication status: Published - 18 Sept 2022
Event: Interspeech 2022 - Incheon, Korea, Republic of
Duration: 18 Sept 2022 - 22 Sept 2022
Conference number: 23
https://interspeech2022.org/

Conference

Conference: Interspeech 2022
Country/Territory: Korea, Republic of
City: Incheon
Period: 18/09/22 - 22/09/22

