Code-switching with word senses for pretraining in neural machine translation

Vivek Iyer, Edoardo Barba, Alexandra Birch, Jeff Z. Pan, Roberto Navigli

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Lexical ambiguity is a significant and pervasive challenge in Neural Machine Translation (NMT), with many state-of-the-art (SOTA) NMT systems struggling to handle polysemous words (Campolungo et al., 2022). The same holds for the NMT pretraining paradigm of denoising synthetic “code-switched” text (Pan et al., 2021; Iyer et al., 2023), where word senses are ignored in the noising stage – leading to harmful sense biases in the pretraining data that are subsequently inherited by the resulting models. In this work, we introduce Word Sense Pretraining for Neural Machine Translation (WSP-NMT) - an end-to-end approach for pretraining multilingual NMT models leveraging word sense-specific information from Knowledge Bases. Our experiments show significant improvements in overall translation quality. Then, we show the robustness of our approach to scale to various challenging data and resource-scarce scenarios and, finally, report fine-grained accuracy improvements on the DiBiMT disambiguation benchmark. Our studies yield interesting and novel insights into the merits and challenges of integrating word sense information and structured knowledge in multilingual pretraining for NMT.
Original language: English
Title of host publication: Findings of the Association for Computational Linguistics
Subtitle of host publication: EMNLP 2023
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Publisher: Association for Computational Linguistics
Pages: 12889–12901
Number of pages: 13
ISBN (Electronic): 9798891760615
Publication status: Published - 10 Dec 2023
Event: The 2023 Conference on Empirical Methods in Natural Language Processing - Resorts World Convention Centre, Sentosa, Singapore
Duration: 6 Dec 2023 – 10 Dec 2023
Conference number: 28
https://2023.emnlp.org/

Conference

Conference: The 2023 Conference on Empirical Methods in Natural Language Processing
Abbreviated title: EMNLP 2023
Country/Territory: Singapore
City: Sentosa
Period: 6/12/23 – 10/12/23