Visually Grounded Meaning Representations

Carina Silberer, Vittorio Ferrari, Mirella Lapata

Research output: Contribution to journal › Article › peer-review

Abstract / Description of output

In this paper we address the problem of grounding distributional representations of lexical meaning. We introduce a new model which uses stacked autoencoders to learn higher-level representations from textual and visual input. The visual modality is encoded via vectors of attributes obtained automatically from images. We create a new large-scale taxonomy of 600 visual attributes representing more than 500 concepts and 700K images. We use this dataset to train attribute classifiers and integrate their predictions with text-based distributional models of word meaning. We evaluate our model on its ability to simulate word similarity judgments and concept categorization. On both tasks, our model yields a better fit to behavioral data compared to baselines and related models which either rely on a single modality or do not make use of attribute-based input.
Original language: English
Pages (from-to): 2284-2297
Number of pages: 14
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Issue number: 11
Early online date: 2 Dec 2016
Publication status: Published - 1 Nov 2017

