Semi-supervised multimodal coreference resolution in image narrations

Arushi Goel, Basura Fernando, Frank Keller, Hakan Bilen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper, we study multimodal coreference resolution, specifically the setting where a longer descriptive text, i.e., a narration, is paired with an image. This poses significant challenges due to fine-grained image-text alignment, the inherent ambiguity of narrative language, and the unavailability of large annotated training sets. To tackle these challenges, we present a data-efficient semi-supervised approach that utilizes image-narration pairs to resolve coreferences and ground narratives in a multimodal context. Our approach incorporates losses for both labeled and unlabeled data within a cross-modal framework. Our evaluation shows that the proposed approach outperforms strong baselines both quantitatively and qualitatively on the tasks of coreference resolution and narrative grounding.
Original language: English
Title of host publication: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Publisher: Association for Computational Linguistics
Pages: 11067–11081
Number of pages: 15
ISBN (Electronic): 979-8-89176-060-8
DOIs
Publication status: Published - 1 Dec 2023
Event: The 2023 Conference on Empirical Methods in Natural Language Processing - Singapore
Duration: 6 Dec 2023 – 10 Dec 2023
https://2023.emnlp.org/

Conference

Conference: The 2023 Conference on Empirical Methods in Natural Language Processing
Abbreviated title: EMNLP 2023
Country/Territory: Singapore
Period: 6/12/23 – 10/12/23
Internet address: https://2023.emnlp.org/

Keywords

  • cs.CL
  • cs.CV
