Mitigating modality collapse in multimodal VAEs via impartial optimization

Adrian Javaloy*, Maryam Meghdadi, Isabel Valera

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

A number of variational autoencoders (VAEs) have recently emerged with the aim of modeling multimodal data, e.g., to jointly model images and their corresponding captions. Still, multimodal VAEs tend to focus solely on a subset of the modalities, e.g., by fitting the image while neglecting the caption. We refer to this limitation as modality collapse. In this work, we argue that this effect is a consequence of conflicting gradients during multimodal VAE training. We show how to detect the sub-graphs in the computational graphs where gradients conflict (impartiality blocks), as well as how to leverage existing gradient-conflict solutions from multitask learning to mitigate modality collapse, that is, to ensure impartial optimization across modalities. We apply our training framework to several multimodal VAE models, losses, and datasets from the literature, and empirically show that our framework significantly improves the reconstruction performance, conditional generation, and coherence of the latent space across modalities.
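
To make the notion of conflicting gradients concrete, the following is a minimal sketch, not the authors' implementation. It assumes two per-modality gradients with respect to the same shared parameters (the vectors `g_image` and `g_text` are made-up toy values), treats a negative dot product as a conflict, and applies a PCGrad-style projection (Yu et al., 2020) as one example of a gradient-conflict solution from multitask learning. All function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))

def conflicts(u, v):
    """Two gradients conflict when they point in opposing directions."""
    return dot(u, v) < 0.0

def project_away(u, v):
    """PCGrad-style step: if u conflicts with v, remove u's component along v."""
    d = dot(u, v)
    if d >= 0.0:
        return list(u)  # no conflict, leave u unchanged
    scale = d / dot(v, v)
    return [a - scale * b for a, b in zip(u, v)]

# Toy gradients of two modality losses w.r.t. shared parameters (assumed values).
g_image = [1.0, 0.0]
g_text = [-1.0, 1.0]

# Naive update direction: the image gradient is cancelled out entirely.
naive = [a + b for a, b in zip(g_image, g_text)]

# Each gradient is projected against the other's ORIGINAL gradient, then summed.
g_image_fixed = project_away(g_image, g_text)
g_text_fixed = project_away(g_text, g_image)
combined = [a + b for a, b in zip(g_image_fixed, g_text_fixed)]
```

In this toy case the naive sum has zero inner product with `g_image`, so a step along it makes no first-order progress on the image loss, whereas the projected sum has a positive inner product with both original gradients.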
Original language: English
Title of host publication: Proceedings of the 39th International Conference on Machine Learning
Editors: Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, Sivan Sabato
Publisher: PMLR
Pages: 9938-9964
Number of pages: 27
Volume: 162
Publication status: Published - 23 Jul 2022
Event: 39th International Conference on Machine Learning - Baltimore, United States
Duration: 17 Jul 2022 to 23 Jul 2022
Conference number: 39
https://icml.cc/Conferences/2022

Conference

Conference: 39th International Conference on Machine Learning
Abbreviated title: ICML 2022
Country/Territory: United States
City: Baltimore
Period: 17/07/22 to 23/07/22
