Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora

Hainan Xu, Philipp Koehn

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract / Description of output

We introduce Zipporah, a fast and scalable data cleaning system. We propose a novel type of bag-of-words translation feature, and train logistic regression models to classify good data and synthetic noisy data in the proposed feature space. The trained model is used to score parallel sentences in the data pool for selection. As shown in experiments, Zipporah selects a high-quality parallel corpus from a large, mixed quality data pool. In particular, for one noisy dataset, Zipporah achieves a 2.1 BLEU score improvement with using 1/5 of the data over using the entire corpus.
Original languageEnglish
Title of host publicationProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
Place of PublicationCopenhagen, Denmark
PublisherAssociation for Computational Linguistics
Pages2935-2940
Number of pages6
ISBN (Print)978-1-945626-97-5
Publication statusPublished - 11 Sept 2017
EventEMNLP 2017: Conference on Empirical Methods in Natural Language Processing - Copenhagen, Denmark
Duration: 7 Sept 201711 Sept 2017
http://emnlp2017.net/index.html
http://emnlp2017.net/

Conference

ConferenceEMNLP 2017: Conference on Empirical Methods in Natural Language Processing
Abbreviated titleEMNLP 2017
Country/TerritoryDenmark
CityCopenhagen
Period7/09/1711/09/17
Internet address

Fingerprint

Dive into the research topics of 'Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora'. Together they form a unique fingerprint.

Cite this