Orthogonality and isotropy of speaker and phonetic information in self-supervised speech representations

Mukhtar Mohamed, Oli Danyi Liu, Hao Tang, Sharon Goldwater

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Self-supervised speech representations can hugely benefit downstream speech technologies, yet the properties that make them useful are still poorly understood. Two candidate properties related to the geometry of the representation space have been hypothesized to correlate well with downstream task performance: (1) the degree of orthogonality between the subspaces spanned by the speaker centroids and phone centroids, and (2) the isotropy of the space, i.e., the degree to which all dimensions are effectively utilized. To study them, we introduce a new measure, Cumulative Residual Variance (CRV), which can be used to assess both properties. Using linear classifiers for speaker and phone ID to probe the representations of six different self-supervised models and two untrained baselines, we ask whether either orthogonality or isotropy correlates with linear probing accuracy. We find that both measures correlate with phonetic probing accuracy, though our results on isotropy are more nuanced.
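Two ingredients mentioned in the abstract can be illustrated concretely: centroid subspaces for speakers and phones, and linear probes for speaker/phone ID. The sketch below is a minimal, hypothetical illustration with synthetic data; it uses principal angles as a stand-in for measuring subspace orthogonality and is not the paper's CRV measure. All variable names and data shapes are assumptions for illustration only.

```python
import numpy as np
from scipy.linalg import subspace_angles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical frame-level data: representations with speaker and phone labels.
rng = np.random.default_rng(0)
n_frames, dim, n_speakers, n_phones = 2000, 64, 10, 40
X = rng.normal(size=(n_frames, dim))               # frame-level representations
speakers = rng.integers(0, n_speakers, n_frames)   # speaker label per frame
phones = rng.integers(0, n_phones, n_frames)       # phone label per frame

def centroids(X, labels):
    """Mean representation per label class (one row per class)."""
    return np.stack([X[labels == c].mean(axis=0) for c in np.unique(labels)])

spk_cent = centroids(X, speakers)   # (n_speakers, dim)
pho_cent = centroids(X, phones)     # (n_phones, dim)

# Orthogonality between the subspaces spanned by speaker and phone centroids,
# summarised here with principal angles (illustrative stand-in, not CRV).
angles = subspace_angles(spk_cent.T, pho_cent.T)
print("mean principal angle (rad):", angles.mean())

# Linear probe for phone ID: probing accuracy with a simple linear classifier.
X_tr, X_te, y_tr, y_te = train_test_split(X, phones, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("phone probing accuracy:", probe.score(X_te, y_te))
```

With real self-supervised features in place of the random matrix, the same two quantities (a subspace-orthogonality score and linear probing accuracy) could be compared across models in the spirit of the analysis described above.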
Original language: English
Title of host publication: INTERSPEECH 2024
Publisher: ISCA
Publication status: Accepted/In press - 6 Jun 2024
Event: INTERSPEECH 2024: Speech and Beyond - Kos Island, Greece
Duration: 1 Sept 2024 – 5 Sept 2024
https://interspeech2024.org/

Conference

Conference: INTERSPEECH 2024
Country/Territory: Greece
City: Kos Island
Period: 1/09/24 – 5/09/24
Internet address: https://interspeech2024.org/

Keywords / Materials (for Non-textual outputs)

  • model analysis
  • representational geometry
