Abstract
Recent work has shown that deeper character-based neural machine translation (NMT) models can outperform subword-based models. However, it is still unclear what makes deeper character-based models successful. In this paper, we investigate pure character-based models for Finnish-to-English translation, exploring their ability to learn word senses and morphological inflections as well as the behaviour of the attention mechanism. We demonstrate that word-level information is distributed over the entire character sequence rather than concentrated in a single character, and that characters at different positions play different roles in learning linguistic knowledge. In addition, character-based models need more layers to encode word senses, which explains why only deeper models outperform subword-based models. The attention distribution pattern shows that separators attract a lot of attention, and we explore a sparse word-level attention mechanism to force character hidden states to capture the full word-level information. Experimental results show that word-level attention with a single head leads to a drop of 1.2 BLEU points.
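As a rough illustration of the word-level attention idea mentioned in the abstract, the sketch below shows a single attention head attending over word summaries pooled from character hidden states instead of over individual characters. It is not the authors' implementation; the function and variable names (word_level_attention, char_states, word_ids, query) and the mean-pooling choice are assumptions made for this example.

```python
# Minimal sketch (assumed, not the paper's code) of single-head word-level
# attention over character encoder states: characters are grouped into words,
# each word is summarised by mean pooling, and a decoder query attends over
# the word summaries.
import torch
import torch.nn.functional as F

def word_level_attention(char_states, word_ids, query):
    """char_states: (src_len, d_model) character encoder states
    word_ids:    (src_len,) index of the word each character belongs to
    query:       (d_model,) a single decoder query vector
    Returns the context vector and the word-level attention weights."""
    num_words = int(word_ids.max().item()) + 1
    d_model = char_states.size(1)

    # Mean-pool character states into one summary vector per word.
    word_states = torch.zeros(num_words, d_model)
    word_states.index_add_(0, word_ids, char_states)
    counts = torch.bincount(word_ids, minlength=num_words).clamp(min=1)
    word_states = word_states / counts.unsqueeze(1).float()

    # Scaled dot-product attention with a single head over word summaries.
    scores = word_states @ query / d_model ** 0.5   # (num_words,)
    weights = F.softmax(scores, dim=0)              # word-level distribution
    context = weights @ word_states                 # (d_model,)
    return context, weights

# Toy usage: 6 characters forming 2 words.
char_states = torch.randn(6, 8)
word_ids = torch.tensor([0, 0, 0, 1, 1, 1])
query = torch.randn(8)
context, weights = word_level_attention(char_states, word_ids, query)
```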
Original language | English |
---|---|
Title of host publication | Proceedings of the 28th International Conference on Computational Linguistics |
Place of Publication | Barcelona, Spain (Online) |
Publisher | International Committee on Computational Linguistics |
Pages | 4251-4262 |
Number of pages | 12 |
ISBN (Print) | 978-1-952148-27-9 |
Publication status | Published - 8 Dec 2020 |
Event | The 28th International Conference on Computational Linguistics, Online, 8 Dec 2020 → 13 Dec 2020 (https://coling2020.org/) |
Conference
Conference | The 28th International Conference on Computational Linguistics |
---|---|
Abbreviated title | COLING 2020 |
Period | 8/12/20 → 13/12/20 |
Internet address | https://coling2020.org/ |