Abstract
Audiovisual synchronisation is the task of determining the time offset between speech audio and a video recording of the articulators. In child speech therapy, audio and ultrasound videos of the tongue are captured using instruments which rely on hardware to synchronise the two modalities at recording time. Hardware synchronisation can fail in practice, and no mechanism exists to synchronise the signals post hoc. To address this problem, we employ a two-stream neural network which exploits the correlation between the two modalities to find the offset. We train our model on recordings from 69 speakers, and show that it correctly synchronises 82.9% of test utterances from unseen therapy sessions and unseen speakers, thus considerably reducing the number of utterances to be manually synchronised. An analysis of model performance on the test utterances shows that directed phone articulations are more difficult to automatically synchronise compared to utterances containing natural variation in speech such as words, sentences, or conversations.
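The abstract describes the approach only at a high level. As a concrete illustration, the sketch below shows one plausible shape of a two-stream cross-modal embedding model with a self-supervised contrastive objective and an exhaustive search over candidate offsets. This is a minimal sketch, not the authors' implementation: all class names, layer choices, feature dimensions, and the `find_offset` routine are illustrative assumptions.

```python
# Minimal sketch (NOT the authors' code): a two-stream network that maps
# audio feature windows and ultrasound frames into a shared embedding
# space, trained with a self-supervised contrastive loss, plus an
# exhaustive search over candidate offsets. All dimensions, layer
# choices, and names here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStreamSync(nn.Module):
    def __init__(self, audio_dim=20, ultra_dim=63 * 412, embed_dim=64):
        super().__init__()
        # Audio stream: e.g. a flattened window of MFCC frames.
        self.audio_net = nn.Sequential(
            nn.Linear(audio_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim)
        )
        # Ultrasound stream: a flattened ultrasound frame here for
        # simplicity; a CNN over the raw frames would be more realistic.
        self.ultra_net = nn.Sequential(
            nn.Linear(ultra_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )

    def forward(self, audio, ultra):
        # L2-normalise so distances between the streams are comparable.
        a = F.normalize(self.audio_net(audio), dim=-1)
        u = F.normalize(self.ultra_net(ultra), dim=-1)
        return a, u


def contrastive_loss(a, u, same, margin=1.0):
    """Pull embeddings of synchronous pairs together, push misaligned
    pairs at least `margin` apart. `same` is 1 for true pairs, 0 for
    negatives created by shifting one modality, so no human labels are
    needed (self-supervision)."""
    d2 = (a - u).pow(2).sum(-1)
    d = torch.sqrt(d2 + 1e-8)
    return (same * d2 + (1 - same) * F.relu(margin - d).pow(2)).mean()


@torch.no_grad()
def find_offset(model, audio_seq, ultra_seq, max_shift=50):
    """Return the candidate shift (in frames) that minimises the mean
    embedding distance between the two streams."""
    best_d, best_shift = float("inf"), 0
    for shift in range(-max_shift, max_shift + 1):
        lo = max(0, shift)
        hi = min(len(audio_seq), len(ultra_seq) + shift)
        if hi - lo < 1:
            continue
        a, u = model(audio_seq[lo:hi], ultra_seq[lo - shift:hi - shift])
        d = (a - u).pow(2).sum(-1).mean().item()
        if d < best_d:
            best_d, best_shift = d, shift
    return best_shift
```

The offset search mirrors the idea in the abstract: because embeddings of truly synchronous audio and ultrasound windows are trained to lie close together, the candidate shift with the smallest mean embedding distance is taken as the estimated offset.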
| Original language | English |
|---|---|
| Title of host publication | INTERSPEECH 2019: Proceedings of the 20th Annual Conference of the International Speech Communication Association (ISCA) |
| Place of Publication | Graz, Austria |
| Publisher | International Speech Communication Association |
| Pages | 4100-4104 |
| Number of pages | 5 |
| DOIs | |
| Publication status | Published - 19 Sept 2019 |
| Event | Interspeech 2019, Graz, Austria. Duration: 15 Sept 2019 → 19 Sept 2019. https://www.interspeech2019.org/ |
Publication series

| Name | |
|---|---|
| Publisher | International Speech Communication Association |
| ISSN (Electronic) | 1990-9772 |
Conference

| Conference | Interspeech 2019 |
|---|---|
| Country/Territory | Austria |
| City | Graz |
| Period | 15/09/19 → 19/09/19 |
| Internet address | https://www.interspeech2019.org/ |
Keywords
- audiovisual synchronisation
- audio and ultrasound data
- machine learning
- neural networks
- self-supervision
Projects
- Ultrax2020: Ultrasound Technology for Optimising the Treatment of Speech Disorders
  1/08/17 → 30/11/21
  Project: Research (Finished)
Datasets
- UltraSuite Repository - sample data
  Eshky, A. (Creator), Ribeiro, M. S. (Creator), Cleland, J. (Creator), Renals, S. (Creator), Richmond, K. (Creator), Roxburgh, Z. (Creator), Scobbie, J. (Creator) & Wrench, A. (Creator), Edinburgh DataShare, 11 Feb 2019
  DOI: 10.7488/ds/2495, https://doi.org/10.21437/Interspeech.2018-1736
  Dataset