Abstract / Description of output
We present a multi-speaker silent speech recognition (SSR) system trained on articulator features derived from the Tongue and Lips (TaL) corpus, a multi-speaker corpus of ultrasound tongue imaging and lip video data. We extracted articulator features using the pose-estimation software DeepLabCut, then trained recognition models on these point-tracking features using Kaldi. Models were trained on voiced utterances and tested on both voiced and silent utterances. Our multi-speaker SSR improved WER by 23.06% compared with a similar previous multi-speaker SSR system that used image-based rather than point-tracking features. We also found substantial improvements (up to a 15.45% decrease in WER) in recognition of silent speech using fMLLR adaptation compared to raw features. Finally, we investigated differences in articulator trajectories between voiced and silent speech and found that, when speaking silently, speakers tend to miss articulatory targets that are present in voiced speech.
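The abstract describes a pipeline from pose-estimation keypoints to recognizer input features. The paper's exact front end is not given here; as a rough sketch, per-frame articulator keypoints (such as those DeepLabCut produces) could be flattened into feature vectors with first-order deltas appended, a common pre-processing step before acoustic-model-style training in Kaldi. The array shapes, point count, and function name below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def keypoints_to_features(points: np.ndarray) -> np.ndarray:
    """Turn tracked keypoints of shape (frames, n_points, 2) into
    per-frame feature vectors: flattened (x, y) coordinates plus
    first-order deltas. Purely illustrative, not the paper's front end."""
    n_frames = points.shape[0]
    static = points.reshape(n_frames, -1)   # (frames, 2 * n_points)
    deltas = np.gradient(static, axis=0)    # frame-to-frame rate of change
    return np.hstack([static, deltas])      # (frames, 4 * n_points)

# Hypothetical example: 100 frames, 11 tracked points on tongue and lips.
feats = keypoints_to_features(np.random.rand(100, 11, 2))
print(feats.shape)  # (100, 44)
```

A real Kaldi recipe would additionally apply per-speaker normalization and, as the abstract notes, fMLLR transforms estimated per speaker at decode time.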
Original language | English |
---|---|
Title of host publication | Proceedings of the Annual Conference of the International Speech Communication Association |
Subtitle of host publication | Interspeech 2023 |
Editors | Naomi Harte, Julie Carson-Berndsen, Gareth Jones |
Place of Publication | Dublin |
Publisher | ISCA |
Pages | 1149-1153 |
Publication status | Published - Sept 2023 |
Event | Interspeech 2023, Dublin, Ireland, 20 Aug 2023 → 24 Aug 2023 (Conference number: 24), https://www.interspeech2023.org/ |
Publication series
Name | Interspeech - Annual Conference of the International Speech Communication Association |
---|---|
Publisher | ISCA |
ISSN (Electronic) | 2308-457X |
Conference
Conference | Interspeech 2023 |
---|---|
Country/Territory | Ireland |
City | Dublin |
Period | 20/08/23 → 24/08/23 |
Internet address | https://www.interspeech2023.org/ |
Keywords / Materials (for Non-textual outputs)
- silent speech interfaces
- silent speech recognition
- articulator pose estimation
- ultrasound imaging
- lip reading
Datasets
- UltraSuite Repository - sample data
Eshky, A. (Creator), Ribeiro, M. S. (Creator), Cleland, J. (Creator), Renals, S. (Creator), Richmond, K. (Creator), Roxburgh, Z. (Creator), Scobbie, J. (Creator) & Wrench, A. (Creator), Edinburgh DataShare, 11 Feb 2019
Dataset DOI: 10.7488/ds/2495. Related publication DOI: https://doi.org/10.21437/Interspeech.2018-1736