Abstract / Description of output
Vast amounts of speech data collected for language documentation and research remain untranscribed and unsearchable, but often a small amount of speech may have text translations available. We present a method for partially labeling additional speech with translations in this scenario. We modify an unsupervised speech-to-translation alignment model and obtain prototype speech segments that match the translation words, which are in turn used to discover terms in the unlabelled data. We evaluate our method on a Spanish-English speech translation corpus and on two corpora of endangered languages, Arapaho and Ainu, demonstrating its appropriateness and applicability in an actual very-low-resource scenario.
Original language | English |
---|---|
Title of host publication | Proceedings of the First Workshop on Speech-Centric Natural Language Processing |
Publisher | Association for Computational Linguistics |
Pages | 53-58 |
Number of pages | 6 |
ISBN (Print) | 978-1-945626-92-0 |
DOIs | |
Publication status | Published - 11 Sept 2017 |
Event | First Workshop on Speech-Centric Natural Language Processing - Copenhagen, Denmark. Duration: 7 Sept 2017 → 7 Sept 2017. http://speechnlp.github.io/2017/ |
Conference
Conference | First Workshop on Speech-Centric Natural Language Processing |
---|---|
Abbreviated title | SCNLP 2017 |
Country/Territory | Denmark |
City | Copenhagen |
Period | 7/09/17 → 7/09/17 |
Internet address |