Spoken Term Discovery for Language Documentation using Translations

Antonios Anastasopoulos, Sameer Bansal, Sharon Goldwater, Adam Lopez, David Chiang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Vast amounts of speech data collected for language documentation and research remain untranscribed and unsearchable, but often a small amount of speech may have text translations available. We present a method for partially labeling additional speech with translations in this scenario. We modify an unsupervised speech-to-translation alignment model and obtain prototype speech segments that match the translation words, which are in turn used to discover terms in the unlabelled data. We evaluate our method on a Spanish-English speech translation corpus and on two corpora of endangered languages, Arapaho and Ainu, demonstrating its appropriateness and applicability in an actual very-low-resource scenario.
Original language: English
Title of host publication: Proceedings of the First Workshop on Speech-Centric Natural Language Processing
Publisher: Association for Computational Linguistics
Pages: 53-58
Number of pages: 6
ISBN (Print): 978-1-945626-92-0
Publication status: Published - 11 Sept 2017
Event: First Workshop on Speech-Centric Natural Language Processing - Copenhagen, Denmark
Duration: 7 Sept 2017 to 7 Sept 2017
http://speechnlp.github.io/2017/

Conference

Conference: First Workshop on Speech-Centric Natural Language Processing
Abbreviated title: SCNLP 2017
Country/Territory: Denmark
City: Copenhagen
Period: 7/09/17 to 7/09/17
