Duplicate Latent Representation Suppression for Multi-object Variational Autoencoders

Li Nanbo, Robert B Fisher

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Generative object-centric scene representation learning is crucial for structural visual scene understanding. Built upon variational autoencoders (VAEs), current approaches infer a set of latent object representations to interpret a scene observation (e.g. an image) under the assumption that each part (e.g. a pixel) of a scene observation must be explained by one and only one object of the underlying scene. Despite the impressive performance these models have achieved in unsupervised scene factorization and representation learning, we show empirically that they often produce duplicate scene object representations, which directly harms scene factorization performance. In this paper, we address the issue by introducing a differentiable prior that explicitly forces the inference to suppress duplicate latent object representations. The extension is evaluated by adding it to three different unsupervised scene factorization approaches. The results show that models trained with the proposed method not only outperform the original models in scene factorization and produce fewer duplicate representations, but also achieve better variational posterior approximations.
Original language: English
Title of host publication: Proceedings of the 32nd British Machine Vision Conference (BMVC 2021)
Publisher: British Machine Vision Conference
Number of pages: 12
Publication status: Published - 25 Nov 2021
Event: The 32nd British Machine Vision Conference - Virtual
Duration: 22 Nov 2021 to 25 Nov 2021


Conference: The 32nd British Machine Vision Conference
Abbreviated title: BMVC 2021

Keywords / Materials (for Non-textual outputs)

  • object-centric representation learning
  • variational autoencoders
  • scene representation

