Abstract
Vast amounts of speech data collected for language documentation and research remain untranscribed and unsearchable, but often a small amount of speech may have text translations available. We present a method for partially labeling additional speech with translations in this scenario. We modify an unsupervised speech-to-translation alignment model and obtain prototype speech segments that match the translation words, which are in turn used to discover terms in the unlabelled data. We evaluate our method on a Spanish-English speech translation corpus and on two corpora of endangered languages, Arapaho and Ainu, demonstrating its appropriateness and applicability in an actual very-low-resource scenario.
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the First Workshop on Speech-Centric Natural Language Processing |
| Publisher | Association for Computational Linguistics |
| Pages | 53-58 |
| Number of pages | 6 |
| ISBN (Print) | 978-1-945626-92-0 |
| Publication status | Published - 11 Sept 2017 |
| Event | First Workshop on Speech-Centric Natural Language Processing, Copenhagen, Denmark. Duration: 7 Sept 2017 → 7 Sept 2017. http://speechnlp.github.io/2017/ |
Conference
| Conference | First Workshop on Speech-Centric Natural Language Processing |
|---|---|
| Abbreviated title | SCNLP 2017 |
| Country/Territory | Denmark |
| City | Copenhagen |
| Period | 7/09/17 → 7/09/17 |
| Internet address | http://speechnlp.github.io/2017/ |