Exploring Deep Models for Comprehension of Deictic Gesture-Word Combinations in Cognitive Robotics

Gabriella Pizzuto, Angelo Cangelosi

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution


In the early stages of infant development, gestures and speech are integrated during language acquisition. Such a natural combination is therefore a desirable, yet challenging, goal for fluid human-robot interaction. To achieve this, we propose a multimodal deep learning architecture for the comprehension of complementary gesture-word combinations, implemented on an iCub humanoid robot. This enables human-assisted language learning through interactions such as pointing at a cup and labelling it with a vocal utterance. We evaluate various depths of the Mask Regional Convolutional Neural Network (for object and wrist detection) and the Residual Network (for gesture classification). Validation is carried out with two deictic gestures across ten real-world objects, on frames recorded directly from the iCub’s cameras. The results further strengthen the potential of gesture-word combinations for robot language acquisition.
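The pointing-and-labelling interaction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how detections, gesture class and spoken word are fused, so the nearest-object rule below is an assumption, and every name (`Detection`, `ground_word`, the `"pointing"` label, the `lexicon` dictionary) is hypothetical. The detections stand in for the output of an object and wrist detector, and the gesture label for the output of a gesture classifier.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Detection:
    label: str   # class name reported by the (hypothetical) detector
    box: tuple   # (x_min, y_min, x_max, y_max) in pixels


def centre(box):
    """Return the centre point of a bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def ground_word(wrist, objects, gesture, word, lexicon):
    """Bind a spoken word to the object nearest the wrist.

    ASSUMPTION: the referent of a pointing gesture is taken to be the
    detected object whose box centre is closest to the wrist box centre.
    The source does not state that this rule is used.
    """
    if gesture != "pointing" or not objects:
        return None  # nothing to ground for other gestures or empty scenes
    wx, wy = centre(wrist.box)

    def distance(obj):
        ox, oy = centre(obj.box)
        return hypot(ox - wx, oy - wy)

    target = min(objects, key=distance)
    lexicon[word] = target.label  # remember the word-object association
    return target


# Usage with made-up detections for a single camera frame.
wrist = Detection("wrist", (100, 100, 120, 120))
scene = [Detection("cup", (130, 100, 170, 140)),
         Detection("ball", (300, 300, 340, 340))]
lexicon = {}
referent = ground_word(wrist, scene, "pointing", "mug", lexicon)
print(referent.label, lexicon)
```

A pixel-distance rule is only one possible choice; a pointing-direction estimate or a learned fusion layer would be equally plausible alternatives under the same interface.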
Original language: English
Title of host publication: 2019 International Joint Conference on Neural Networks (IJCNN)
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Number of pages: 7
ISBN (Electronic): 978-1-7281-1985-4
ISBN (Print): 978-1-7281-1986-1
Publication status: Published - 30 Sep 2019

Publication series

Publisher: Institute of Electrical and Electronics Engineers (IEEE)
ISSN (Print): 2161-4393
ISSN (Electronic): 2161-4407

Keywords

  • cognitive systems
  • convolutional neural nets
  • feature extraction
  • gesture recognition
  • humanoid robots
  • human-robot interaction
  • image classification
  • learning (artificial intelligence)
  • neural net architecture
  • object detection
  • robot programming
  • robot vision
  • multimodal deep learning architecture
  • iCub humanoid robot
  • wrist detection
  • gesture classification
  • robot language acquisition
  • deictic gesture-word combinations
  • cognitive robotics
  • mask regional convolutional neural network
  • residual network
  • human-assisted language learning
  • cognitive developmental robotics
  • embodied language acquisition

