Silent speech recognition with articulator positions estimated from tongue ultrasound and lip video

Rachel Beeson, Korin Richmond

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

We present a multi-speaker silent speech recognition system trained on articulator features derived from the Tongue and Lips corpus, a multi-speaker corpus of ultrasound tongue imaging and lip video data. We extracted articulator features using the pose estimation software DeepLabCut, then trained recognition models with these point-tracking features using Kaldi. We trained with voiced utterances, then tested performance on both voiced and silent utterances. Our multi-speaker SSR improved WER by 23.06% compared to a previous similar multi-speaker SSR system that used image-based rather than point-tracking features. We also found substantial improvements (up to a 15.45% decrease in WER) in recognition of silent speech using fMLLR adaptation compared to raw features. Finally, we investigated differences in articulator trajectories between voiced and silent speech and found that speakers tend to miss articulatory targets that are present in voiced speech when speaking silently.
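
As an illustration of the kind of front end the abstract describes, the sketch below shows one plausible way to flatten DeepLabCut-style point-tracking output (per-frame x, y positions for tongue and lip landmarks) into feature vectors written in Kaldi's text archive format. The landmark names, frame rate, delta features, and file layout are assumptions made for illustration only, not the configuration reported in the paper.

```python
import numpy as np

# Illustrative landmark set and frame rate; these are assumptions, not the paper's setup.
LANDMARKS = ["tongue_root", "tongue_body", "tongue_tip", "lip_upper", "lip_lower"]
FPS = 60  # assumed ultrasound / lip-video frame rate

def landmarks_to_features(points: np.ndarray) -> np.ndarray:
    """Flatten per-frame (x, y) landmark positions and append simple deltas.

    points: array of shape (n_frames, n_landmarks, 2).
    Returns an array of shape (n_frames, n_landmarks * 4): positions plus deltas.
    """
    n_frames = points.shape[0]
    positions = points.reshape(n_frames, -1)   # (n_frames, n_landmarks * 2)
    deltas = np.gradient(positions, axis=0)    # frame-to-frame velocity proxy
    return np.hstack([positions, deltas]).astype(np.float32)

def write_kaldi_text_ark(utt_id: str, feats: np.ndarray, path: str) -> None:
    """Write one utterance's features as a Kaldi text archive (ark,t) matrix."""
    with open(path, "w") as f:
        f.write(f"{utt_id}  [\n")
        for row in feats:
            f.write("  " + " ".join(f"{v:.4f}" for v in row) + "\n")
        f.write("]\n")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for one utterance's tracked points: 120 frames, 5 landmarks, (x, y).
    fake_points = rng.normal(size=(120, len(LANDMARKS), 2))
    feats = landmarks_to_features(fake_points)
    write_kaldi_text_ark("spk1_utt001", feats, "spk1_utt001.ark.txt")
    print(feats.shape)  # (120, 20)
```

In a full pipeline such archives would feed standard Kaldi training and adaptation recipes (for example, per-speaker fMLLR transforms as mentioned in the abstract); that side is omitted here.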
Original language: English
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association
Subtitle of host publication: Interspeech 2023
Editors: Naomi Harte, Julie Carson-Berndsen, Gareth Jones
Place of Publication: Dublin
Publisher: ISCA
Pages: 1149-1153
DOIs
Publication status: Published - Sept 2023
Event: Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 - 24 Aug 2023
Conference number: 24
https://www.interspeech2023.org/

Publication series

Name: Interspeech - Annual Conference of the International Speech Communication Association
Publisher: ISCA
ISSN (Electronic): 2308-457X

Conference

Conference: Interspeech 2023
Country/Territory: Ireland
City: Dublin
Period: 20/08/23 - 24/08/23
Internet address: https://www.interspeech2023.org/

Keywords / Materials (for Non-textual outputs)

  • silent speech interfaces
  • silent speech recognition
  • articulator pose estimation
  • ultrasound imaging
  • lip reading
