Abstract / Description of output
The development of long-context large language models (LLMs) has attracted significant interest. However, progress on long-context vision large language models (VLLMs) lags behind, despite their vast potential in applications such as high-resolution input, multimodal in-context learning, multi-image understanding, and video understanding. In this paper, we present an empirical study that identifies the major challenges in developing long-context VLLMs and propose a simple yet effective baseline for long-context tasks: captioning the images separately and aggregating the captions as input. This directly alleviates the input-length issue, and we show that it outperforms other context-extension and token-reduction strategies.
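A minimal sketch of the caption-and-aggregate baseline described above, assuming a generic off-the-shelf image captioner (BLIP is used here purely as an illustrative stand-in; the paper does not prescribe a particular captioner or downstream LLM, and the `build_text_only_prompt` helper is hypothetical):

```python
from typing import List

from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

# Illustrative captioner; any image-captioning model could be substituted.
_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
_captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")


def caption_image(image: Image.Image) -> str:
    """Generate a short caption for a single image."""
    inputs = _processor(images=image, return_tensors="pt")
    output_ids = _captioner.generate(**inputs, max_new_tokens=40)
    return _processor.decode(output_ids[0], skip_special_tokens=True)


def build_text_only_prompt(images: List[Image.Image], question: str) -> str:
    """Caption each image independently, then aggregate the captions into a
    single text prompt, so the downstream model never has to ingest all the
    visual tokens at once."""
    captions = [caption_image(img) for img in images]
    caption_block = "\n".join(
        f"Image {i + 1}: {cap}" for i, cap in enumerate(captions)
    )
    return f"{caption_block}\n\nQuestion: {question}\nAnswer:"
```

The resulting prompt can then be passed to any text-only LLM (or a VLLM in text mode), sidestepping the per-image visual-token cost that makes long multi-image contexts expensive.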
Original language | English |
---|---|
Pages | 1-8 |
Number of pages | 8 |
Publication status | Published - 18 Jun 2024 |
Event | Workshop on Long Context Foundation Models, Vienna, Austria. Duration: 26 Jul 2024 → 26 Jul 2024. https://longcontextfm.github.io/ |
Workshop
Workshop | Workshop on Long Context Foundation Models |
---|---|
Abbreviated title | LCFM 2024 |
Country/Territory | Austria |
City | Vienna |
Period | 26/07/24 → 26/07/24 |
Internet address | https://longcontextfm.github.io/ |
Keywords
- vision-language models
- long-context
- multimodal