TY - GEN
T1 - Phonetic Analysis of Self-supervised Representations of English Speech
AU - Wells, Dan
AU - Tang, Hao
AU - Richmond, Korin
N1 - Funding Information:
Acknowledgements: This work was supported in part by the UKRI Centre for Doctoral Training in Natural Language Processing, funded by the UKRI (grant EP/S022481/1) and the University of Edinburgh, School of Informatics and School of Philosophy, Psychology & Language Sciences.
Publisher Copyright:
Copyright © 2022 ISCA.
PY - 2022/9/18
Y1 - 2022/9/18
N2 - We present an analysis of discrete units discovered via self-supervised representation learning on English speech. We focus on units produced by a pre-trained HuBERT model due to its wide adoption in ASR, speech synthesis, and many other tasks. Whereas previous work has evaluated the quality of such quantization models in aggregate over all phones for a given language, we break our analysis down into broad phonetic classes, taking into account specific aspects of their articulation when considering their alignment to discrete units. We find that these units correspond to sub-phonetic events, and that fine dynamics such as the distinct closure and release portions of plosives tend to be represented by sequences of discrete units. Our work provides a reference for the phonetic properties of discrete units discovered by HuBERT, facilitating analyses of many speech applications based on this model.
KW - self-supervised learning
KW - speech units
UR - http://www.scopus.com/inward/record.url?scp=85140093582&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2022-10884
DO - 10.21437/Interspeech.2022-10884
M3 - Conference contribution
AN - SCOPUS:85140093582
VL - 2022-September
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 3583
EP - 3587
BT - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
A2 - Ko, Hanseok
A2 - Hansen, John H. L.
PB - ISCA
T2 - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022
Y2 - 18 September 2022 through 22 September 2022
ER -