Abstract
This work describes our submission to the WMT16 Bilingual Document Alignment task. We show that a very simple distance metric, namely the cosine distance between tf/idf-weighted document vectors, provides a quick and reliable way to align documents. We compare many possible variants for constructing the document vectors. We also introduce a greedy algorithm that runs faster and, in practice, performs better than the optimal solution to bipartite graph matching. Our approach shows competitive performance and can be improved further by combining it with URL-based pair matching.
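To make the idea in the abstract concrete, the following is a minimal sketch of tf/idf weighting, cosine scoring, and a greedy one-to-one matching. It is not the authors' implementation. The function names (`tfidf_vectors`, `cosine_similarity`, `greedy_align`), the smoothed idf formula, and the particular greedy rule (take the best-scoring pair whose documents are both still unmatched) are assumptions made for illustration. The sketch also assumes that both document sets are already tokenized and share one vocabulary, for example because one side has been translated. It scores with cosine similarity; cosine distance is one minus that value.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf/idf vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf (an assumption); keeps every weight strictly positive.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()} for doc in docs]


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def greedy_align(source_docs, target_docs):
    """Greedy one-to-one alignment: repeatedly accept the highest-scoring free pair."""
    # Compute idf over both sides so the weights are comparable.
    vectors = tfidf_vectors(source_docs + target_docs)
    src, tgt = vectors[:len(source_docs)], vectors[len(source_docs):]
    scored = sorted(
        ((cosine_similarity(s, t), i, j)
         for i, s in enumerate(src)
         for j, t in enumerate(tgt)),
        key=lambda item: -item[0],
    )
    used_src, used_tgt, pairs = set(), set(), []
    for score, i, j in scored:
        if i not in used_src and j not in used_tgt:
            pairs.append((i, j, score))
            used_src.add(i)
            used_tgt.add(j)
    return pairs


if __name__ == "__main__":
    source = [["cat", "sat", "mat"], ["stock", "market", "fell"]]
    target = [["stock", "market", "dropped"], ["cat", "on", "mat"]]
    for i, j, score in greedy_align(source, target):
        print(f"source {i} <-> target {j} (similarity {score:.3f})")
```

This sketch scores every source/target pair, so it needs quadratic time and memory. It would need pruning or a sparse index before being applied to a large crawl.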
Original language | English |
---|---|
Title of host publication | Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers |
Place of Publication | Berlin, Germany |
Publisher | Association for Computational Linguistics |
Pages | 669-675 |
Number of pages | 7 |
ISBN (Print) | 978-1-945626-10-4 |
Publication status | Published - 12 Aug 2016 |
Event | First Conference on Machine Translation, Berlin, Germany. Duration: 11 Aug 2016 → 12 Aug 2016. http://www.statmt.org/wmt16/ |
Conference
Conference | First Conference on Machine Translation |
---|---|
Abbreviated title | WMT16 |
Country/Territory | Germany |
City | Berlin |
Period | 11/08/16 → 12/08/16 |
Internet address | http://www.statmt.org/wmt16/ |