Data Augmentation via Dependency Tree Morphing for Low Resource Languages

Gözde Gül Sahin, Mark Steedman

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Neural NLP systems achieve high scores in the presence of a sizable training dataset. The lack of such datasets leads to poor system performance in the case of low-resource languages. We present two simple text augmentation techniques using dependency trees, inspired by image processing. We “crop” sentences by removing dependency links, and we “rotate” sentences by moving the tree fragments around the root. We apply these techniques to augment the training sets of low-resource languages in the Universal Dependencies project. We implement a character-level sequence tagging model and evaluate the augmented datasets on a part-of-speech tagging task. We show that crop and rotate provide improvements over the models trained with non-augmented data for the majority of the languages, especially for languages with rich case marking systems.
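The two operations named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Token` type, the helper names, the choice of which fragment to keep, and the toy English sentence are all assumptions made for illustration, and the abstract does not specify which dependency links are eligible for removal or reordering. The sketch returns word forms only; a real augmentation pipeline for tagging would carry the part-of-speech labels along with each token.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    idx: int     # 1-based position in the sentence
    form: str    # surface word
    head: int    # index of the head token, 0 marks the root
    deprel: str  # dependency label


def children_of(tokens: List[Token]) -> Dict[int, List[int]]:
    """Map each head index to the indices of its direct dependents."""
    kids: Dict[int, List[int]] = defaultdict(list)
    for t in tokens:
        kids[t.head].append(t.idx)
    return kids


def subtree(idx: int, kids: Dict[int, List[int]]) -> List[int]:
    """Return idx and all of its descendants, in sentence order."""
    out, stack = [idx], [idx]
    while stack:
        for child in kids[stack.pop()]:
            out.append(child)
            stack.append(child)
    return sorted(out)


def crop(tokens: List[Token], keep: int) -> List[str]:
    """Keep the root plus one fragment; links to the other fragments are removed."""
    kids = children_of(tokens)
    root = kids[0][0]
    kept = set(subtree(keep, kids)) | {root}
    return [t.form for t in tokens if t.idx in kept]


def rotate(tokens: List[Token], order: List[int]) -> List[str]:
    """Re-emit the root and the fragments attached to it in the given order.

    Assumption: `order` lists the root index and the indices of its children.
    """
    kids = children_of(tokens)
    root = kids[0][0]
    form = {t.idx: t.form for t in tokens}
    out: List[str] = []
    for i in order:
        span = [root] if i == root else subtree(i, kids)
        out.extend(form[j] for j in span)
    return out


# Hypothetical toy sentence: "She wrote a book"
sentence = [
    Token(1, "She", 2, "nsubj"),
    Token(2, "wrote", 0, "root"),
    Token(3, "a", 4, "det"),
    Token(4, "book", 2, "obj"),
]

print(crop(sentence, keep=4))         # ['wrote', 'a', 'book']
print(rotate(sentence, [4, 1, 2]))    # ['a', 'book', 'She', 'wrote']
```

The rotated English output is not natural, which is consistent with the abstract's observation that the gains are largest for languages with rich case marking, where word order is typically freer.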
Original language: English
Title of host publication: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Publisher: ACL Anthology
Pages: 5004-5009
Number of pages: 6
Publication status: Published - 4 Nov 2018
Event: 2018 Conference on Empirical Methods in Natural Language Processing - Square Meeting Center, Brussels, Belgium
Duration: 31 Oct 2018 to 4 Nov 2018
http://emnlp2018.org/

Conference

Conference: 2018 Conference on Empirical Methods in Natural Language Processing
Abbreviated title: EMNLP 2018
Country/Territory: Belgium
City: Brussels
Period: 31/10/18 to 4/11/18
