Quick and Reliable Document Alignment via TF/IDF-weighted Cosine Distance

Christian Buck, Philipp Koehn

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


This work describes our submission to the WMT16 Bilingual Document Alignment task. We show that a very simple distance metric, namely cosine distance of tf/idf-weighted document vectors, provides a quick and reliable way to align documents. We compare many possible variants for constructing the document vectors. We also introduce a greedy algorithm that runs faster and performs better in practice than the optimal solution to bipartite graph matching. Our approach shows competitive performance and can be improved even further through combination with URL-based pair matching.
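The two ideas in the abstract, tf/idf-weighted cosine scoring and greedy one-to-one pairing, can be illustrated with a minimal Python sketch. It is not the authors' implementation. This record does not say which of the compared weighting variants performed best, so the smoothed idf formula below is an assumption, as are the function names `tfidf_vectors`, `cosine_similarity`, and `greedy_align`. The sketch also assumes that documents are already tokenized and share a vocabulary (for example, after one side has been translated), and it reads "greedy" as repeatedly accepting the highest-scoring pair whose documents are both still unmatched. Cosine distance is simply one minus the similarity computed here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: weight} dict."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed variant: raw term count times smoothed idf,
        # idf = log((1 + N) / (1 + df)) + 1.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def greedy_align(source_docs, target_docs):
    """Return (source_index, target_index, similarity) tuples, best first."""
    # Weight all documents together so idf reflects the whole collection.
    vectors = tfidf_vectors(source_docs + target_docs)
    src = vectors[:len(source_docs)]
    tgt = vectors[len(source_docs):]
    scored = [
        (cosine_similarity(s, t), i, j)
        for i, s in enumerate(src)
        for j, t in enumerate(tgt)
    ]
    # Highest similarity first; ties broken by index for determinism.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    used_src, used_tgt, pairs = set(), set(), []
    for sim, i, j in scored:
        # Accept a pair only if neither document has been matched yet.
        if i not in used_src and j not in used_tgt:
            used_src.add(i)
            used_tgt.add(j)
            pairs.append((i, j, sim))
    return pairs


# Toy usage with made-up documents.
source = [["cat", "sat", "mat"], ["stock", "market", "fell"]]
target = [["market", "stock", "prices", "fell"], ["the", "cat", "sat"]]
pairs = greedy_align(source, target)
```

Scoring every source/target pair is quadratic in the number of documents, which is acceptable for a toy example; a real system would restrict candidates, for instance to pages from the same web domain. The greedy pass itself only needs one sort of the scored pairs, whereas an optimal bipartite matching requires a dedicated assignment solver.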
Original language: English
Title of host publication: Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers
Place of Publication: Berlin, Germany
Publisher: Association for Computational Linguistics
Number of pages: 7
ISBN (Print): 978-1-945626-10-4
Publication status: Published - 12 Aug 2016
Event: First Conference on Machine Translation - Berlin, Germany
Duration: 11 Aug 2016 to 12 Aug 2016


Conference: First Conference on Machine Translation
Abbreviated title: WMT16

