Vision Matters When It Should: Sanity Checking Multimodal Machine Translation Models

Jiaoda Li, Duygu Ataman, Rico Sennrich

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Multimodal machine translation (MMT) systems have been shown to outperform their text-only neural machine translation (NMT) counterparts when visual context is available. However, recent studies have also shown that the performance of MMT models is only marginally impacted when the associated image is replaced with an unrelated image or noise, which suggests that the visual context might not be exploited by the model at all. We hypothesize that this might be caused by the nature of the commonly used evaluation benchmark, Multi30K, where the translations of image captions were prepared without actually showing the images to human translators. In this paper, we present a qualitative study that examines the role of datasets in encouraging models to use the visual modality, and we propose methods to highlight the importance of visual signals in the datasets, which yield improvements in the models' reliance on the source images. Our findings suggest that research on effective MMT architectures is currently impaired by the lack of suitable datasets and that careful consideration must be taken in the creation of future MMT datasets, for which we also provide useful insights.
Original language: English
Title of host publication: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Place of Publication: Online and Punta Cana, Dominican Republic
Publisher: Association for Computational Linguistics
Number of pages: 7
ISBN (Electronic): 978-1-955917-09-4
Publication status: Published - 7 Nov 2021
Event: 2021 Conference on Empirical Methods in Natural Language Processing - Punta Cana, Dominican Republic
Duration: 7 Nov 2021 to 11 Nov 2021


Conference: 2021 Conference on Empirical Methods in Natural Language Processing
Abbreviated title: EMNLP 2021
Country/Territory: Dominican Republic
City: Punta Cana