Long-context vision large language models: Empirical insights and a baseline

Yongshuo Zong*, Ismail Elezi, Yongxin Yang, Jiankang Deng, Timothy Hospedales

*Corresponding author for this work

Research output: Contribution to conference › Paper

Abstract / Description of output

The development of long-context large language models (LLMs) has attracted significant interest. However, progress on long-context vision large language models (VLLMs) lags behind, despite their vast potential in applications such as high-resolution input, multimodal in-context learning, multi-image understanding, and video understanding. In this paper, we present an empirical study to identify the major challenges in developing long-context VLLMs and propose a simple yet effective baseline for long-context tasks. By captioning the images separately and aggregating the captions as input, we directly alleviate the input length issue and show that this baseline effectively outperforms other context extension and token reduction strategies.
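The baseline described in the abstract amounts to a caption-then-aggregate pipeline. The sketch below is an illustrative reading of that description, not the authors' released code: `captioner` and `llm` are hypothetical callables standing in for any off-the-shelf image captioner and any text-only LLM.

```python
from typing import Callable, List


def caption_then_aggregate(
    images: List[object],
    question: str,
    captioner: Callable[[object], str],
    llm: Callable[[str], str],
) -> str:
    """Caption each image independently, then answer the question over the
    aggregated captions with a text-only LLM, avoiding the long visual token
    sequence that feeding all images to a VLLM at once would require.
    """
    # 1. Caption every image separately (one short text per image).
    captions = [captioner(img) for img in images]

    # 2. Aggregate the captions into a single textual context.
    context = "\n".join(
        f"Image {i + 1}: {cap}" for i, cap in enumerate(captions)
    )

    # 3. Ask the language model to answer using only the captions.
    prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)
```

Because the aggregated context is plain text, the same routine applies unchanged to multi-image, in-context-learning, or video settings (with frames treated as images), subject to the text LLM's own context limit.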
Original language: English
Pages: 1-8
Number of pages: 8
Publication status: Published - 18 Jun 2024
Event: Workshop on Long Context Foundation Models - Vienna, Austria
Duration: 26 Jul 2024 → 26 Jul 2024
https://longcontextfm.github.io/

Workshop

Workshop: Workshop on Long Context Foundation Models
Abbreviated title: LCFM 2024
Country/Territory: Austria
City: Vienna
Period: 26/07/24 → 26/07/24
Internet address: https://longcontextfm.github.io/

Keywords / Materials (for Non-textual outputs)

  • vision-language models
  • long-context
  • multimodal
