Abstract
This paper introduces a novel method for determining the best room to place an object in, for embodied scene rearrangement. While state-of-the-art approaches rely on large language models (LLMs) or reinforcement learned (RL) policies for this task, our approach, CLIPGraphs, efficiently combines commonsense domain knowledge, data-driven methods, and recent advances in multimodal learning. Specifically, it (a) encodes a knowledge graph of prior human preferences about the room location of different objects in home environments, (b) incorporates vision-language features to support multimodal queries based on images or text, and (c) uses a graph network to learn object-room affinities based on embeddings of the prior knowledge and the vision-language features. We demonstrate that our approach provides better estimates of the most appropriate location of objects from a benchmark set of object categories in comparison with state-of-the-art baselines. Supplementary material and code: https://clipgraphs.github.io
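The three components listed in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the 3-dimensional vectors stand in for real vision-language features, the object and room names are hypothetical, and a single untrained neighbour-averaging step stands in for the learned graph network. It shows only the general idea of mixing prior object-room edges into node embeddings and then ranking rooms by cosine similarity to a query embedding.

```python
import math

# Hypothetical toy vectors standing in for vision-language (e.g. CLIP) features.
FEATURES = {
    "kettle": [0.9, 0.1, 0.0],
    "pillow": [0.0, 0.2, 0.9],
    "kitchen": [1.0, 0.0, 0.1],
    "bedroom": [0.1, 0.0, 1.0],
}

# Hypothetical prior knowledge graph: object -> rooms where people usually keep it.
PRIOR_EDGES = {"kettle": ["kitchen"], "pillow": ["bedroom"]}
ROOMS = ["kitchen", "bedroom"]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def propagate(features, edges):
    """One untrained message-passing step: average each node with its neighbours."""
    neighbours = {name: [] for name in features}
    for obj, rooms in edges.items():
        for room in rooms:
            neighbours[obj].append(room)
            neighbours[room].append(obj)
    out = {}
    for name, vec in features.items():
        group = [vec] + [features[n] for n in neighbours[name]]
        out[name] = [sum(col) / len(group) for col in zip(*group)]
    return out


def best_room(query_vec, node_embeddings, rooms=ROOMS):
    """Return the room whose embedding is most similar to the query embedding."""
    return max(rooms, key=lambda r: cosine(query_vec, node_embeddings[r]))


embeddings = propagate(FEATURES, PRIOR_EDGES)
kettle_room = best_room(FEATURES["kettle"], embeddings)
pillow_room = best_room(FEATURES["pillow"], embeddings)
```

In a real system the query vector would come from an image or text encoder, and the propagation weights would be learned rather than fixed averages.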
Original language | English |
---|---|
Title of host publication | 2023 32nd IEEE International Conference on Robot and Human Interactive Communication, RO-MAN 2023 |
Publisher | IEEE Computer Society Press |
Pages | 2604-2609 |
Number of pages | 6 |
ISBN (Electronic) | 9798350336702 |
Publication status | Published - 13 Nov 2023 |
Event | 32nd IEEE International Conference on Robot and Human Interactive Communication, Busan, Korea, Republic of. Duration: 28 Aug 2023 → 31 Aug 2023. Conference number: 32. https://ro-man2023.org/overview/welcomeMessage |
Publication series
Name | IEEE International Workshop on Robot and Human Communication, RO-MAN |
---|---|
ISSN (Print) | 1944-9445 |
ISSN (Electronic) | 1944-9437 |
Conference
Conference | 32nd IEEE International Conference on Robot and Human Interactive Communication |
---|---|
Abbreviated title | IEEE RO-MAN 2023 |
Country/Territory | Korea, Republic of |
City | Busan |
Period | 28/08/23 → 31/08/23 |
Keywords / Materials (for Non-textual outputs)
- Commonsense knowledge
- graph convolutional network
- knowledge graph
- large language models
- scene rearrangement