Bootstrapping Generators from Noisy Data

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

A core step in statistical data-to-text generation concerns learning correspondences between structured data representations (e.g., facts in a database) and associated texts. In this paper we aim to bootstrap generators from large-scale datasets where the data (e.g., DBPedia facts) and related texts (e.g., Wikipedia abstracts) are loosely aligned. We tackle this challenging task by introducing a special-purpose content selection mechanism. We use multi-instance learning to automatically discover correspondences between data and text pairs and show how these can be used to enhance the content signal while training an encoder-decoder architecture. Experimental results demonstrate that models trained with content-specific objectives improve upon a vanilla encoder-decoder which solely relies on soft attention. Our code and data are available at https://github.com/EdinburghNLP/wikigen
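To illustrate the multi-instance flavour of the alignment step the abstract describes, here is a minimal sketch that scores how well a structured fact is supported by a candidate sentence: each fact word is matched to its best-matching sentence word, and the scores are averaged. The toy embeddings, word lists, and scoring rule are illustrative assumptions for this sketch, not the authors' implementation.

```python
import math

# Toy word embeddings; in the paper's setting these would be learned
# representations of property-set and sentence words.
EMB = {
    "born":    (1.0, 0.0),
    "1967":    (0.9, 0.1),
    "author":  (0.0, 1.0),
    "novel":   (0.1, 0.9),
    "weather": (0.5, 0.5),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def align_score(fact_words, sentence_words):
    """Multi-instance style score: for each fact word, keep only the
    best-matching sentence word, then average over fact words. A fact
    aligns with a sentence if most of its words find strong support."""
    best_matches = [
        max(cosine(EMB[f], EMB[w]) for w in sentence_words)
        for f in fact_words
    ]
    return sum(best_matches) / len(best_matches)

fact = ["born", "1967"]
supported = align_score(fact, ["born", "1967", "author"])
unrelated = align_score(fact, ["weather"])
assert supported > unrelated  # the supporting sentence scores higher
```

Correspondences discovered this way can then act as an extra content-selection signal when training the encoder-decoder, rather than relying on soft attention alone.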
Original language: English
Title of host publication: The 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Place of Publication: New Orleans, Louisiana
Publisher: Association for Computational Linguistics
Pages: 1516-1527
Number of pages: 12
DOIs
Publication status: Published - 30 Jun 2018
Event: 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Hyatt Regency New Orleans Hotel, New Orleans, United States
Duration: 1 Jun 2018 – 6 Jun 2018
http://naacl2018.org/

Conference

Conference: 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Abbreviated title: NAACL HLT 2018
Country/Territory: United States
City: New Orleans
Period: 1/06/18 – 6/06/18
Internet address: http://naacl2018.org/