CLIPGraphs: Multimodal Graph Networks to Infer Object-Room Affinities

Ayush Agrawal*, Raghav Arora, Ahana Datta, Snehasis Banerjee, Brojeshwar Bhowmick, Krishna Murthy Jatavallabhula, Mohan Sridharan, Madhava Krishna

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

This paper introduces a novel method for determining the best room to place an object in, for embodied scene rearrangement. While state-of-the-art approaches rely on large language models (LLMs) or reinforcement learned (RL) policies for this task, our approach, CLIPGraphs, efficiently combines commonsense domain knowledge, data-driven methods, and recent advances in multimodal learning. Specifically, it (a) encodes a knowledge graph of prior human preferences about the room location of different objects in home environments, (b) incorporates vision-language features to support multimodal queries based on images or text, and (c) uses a graph network to learn object-room affinities based on embeddings of the prior knowledge and the vision-language features. We demonstrate that our approach provides better estimates of the most appropriate location of objects from a benchmark set of object categories in comparison with state-of-the-art baselines. Supplementary material and code:
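
As a rough illustration of steps (a) to (c) in the abstract, the following toy sketch builds a small object-room graph, applies one graph-convolution step to node features, and scores object-room affinity by cosine similarity. It is not the authors' released code: the room and object names, the prior edges, the random stand-in features (in place of real vision-language embeddings), the untrained weight matrix, and all function names are assumptions made for illustration only.

```python
import numpy as np

rooms = ["kitchen", "bathroom"]
objects = ["mug", "toothbrush", "towel"]
nodes = rooms + objects

# Hypothetical prior preferences as (object, room) pairs; not the paper's data.
prior_edges = [("mug", "kitchen"), ("toothbrush", "bathroom"), ("towel", "bathroom")]

rng = np.random.default_rng(0)
dim = 8
# Random stand-ins for vision-language embeddings of each node (assumption).
features = rng.normal(size=(len(nodes), dim))
# Untrained projection matrix; a real model would learn this from data.
weight = rng.normal(size=(dim, dim))


def build_adjacency(node_names, edges):
    """Symmetric 0/1 adjacency matrix for an undirected object-room graph."""
    index = {name: i for i, name in enumerate(node_names)}
    adj = np.zeros((len(node_names), len(node_names)))
    for obj, room in edges:
        adj[index[obj], index[room]] = 1.0
        adj[index[room], index[obj]] = 1.0
    return adj


def graph_conv(adj, feats, w):
    """One linear graph-convolution step: D^-1/2 (A + I) D^-1/2 X W."""
    a_hat = adj + np.eye(adj.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return a_norm @ feats @ w


def cosine_affinity(obj_emb, room_emb):
    """Cosine similarity between every object row and every room row."""
    obj_unit = obj_emb / np.linalg.norm(obj_emb, axis=1, keepdims=True)
    room_unit = room_emb / np.linalg.norm(room_emb, axis=1, keepdims=True)
    return obj_unit @ room_unit.T


adj = build_adjacency(nodes, prior_edges)
embeddings = graph_conv(adj, features, weight)
room_emb = embeddings[: len(rooms)]
obj_emb = embeddings[len(rooms):]

affinity = cosine_affinity(obj_emb, room_emb)   # shape: (objects, rooms)
best_room = {obj: rooms[int(np.argmax(row))] for obj, row in zip(objects, affinity)}
print(best_room)
```

Even with untrained weights, the aggregation step pulls connected nodes toward each other, which is the intuition behind using a graph network over the prior knowledge. A trained model would additionally use nonlinearities, multiple layers, and a learning objective, none of which are shown here.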

Original language: English
Title of host publication: 2023 32nd IEEE International Conference on Robot and Human Interactive Communication, RO-MAN 2023
Publisher: IEEE Computer Society Press
Number of pages: 6
ISBN (Electronic): 9798350336702
Publication status: Published - 13 Nov 2023
Event: 32nd IEEE International Conference on Robot and Human Interactive Communication - Busan, Korea, Republic of
Duration: 28 Aug 2023 to 31 Aug 2023
Conference number: 32

Publication series

Name: IEEE International Workshop on Robot and Human Communication, RO-MAN
ISSN (Print): 1944-9445
ISSN (Electronic): 1944-9437


Conference: 32nd IEEE International Conference on Robot and Human Interactive Communication
Abbreviated title: IEEE RO-MAN 2023
Country/Territory: Korea, Republic of
Internet address:

Keywords / Materials (for Non-textual outputs)

  • Commonsense knowledge
  • graph convolutional network
  • knowledge graph
  • large language models
  • scene rearrangement

