Abstract
Many self-supervised learning methods are pre-trained on the well-curated ImageNet-1K dataset. In this work, given the excellent scalability of web data, we consider self-supervised pre-training on noisy web-sourced image-text paired data. First, we conduct a benchmark study of representative self-supervised pre-training methods on large-scale web data in a like-for-like setting. We compare a range of methods, including single-modal ones that use masked training objectives and multi-modal ones that use image-text contrastive training. We observe that existing multi-modal methods do not outperform their single-modal counterparts on vision transfer learning tasks. We derive an information-theoretic view to explain these benchmark results, which provides insight into how to design a novel vision learner. Inspired by this insight, we present a new visual representation pre-training method, MUlti-modal Generator (MUG), that learns from scalable web-sourced image-text data. MUG achieves state-of-the-art transfer performance on a variety of tasks and demonstrates promising scaling properties.
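The abstract contrasts two pre-training families: single-modal methods with masked training objectives and multi-modal methods with image-text contrastive training. The sketch below illustrates a generic instance of each in PyTorch: a MAE-style masked reconstruction loss and a CLIP-style symmetric InfoNCE loss. Both function names, the `temperature` value, and the toy shapes are illustrative assumptions for exposition; this is not the paper's MUG objective.

```python
# Illustrative sketches of the two objective families the benchmark
# compares; names and hyperparameters are placeholders, not the
# paper's actual MUG method.
import torch
import torch.nn.functional as F


def masked_reconstruction_loss(pred_patches, target_patches, mask):
    """MAE-style single-modal objective: MSE on masked patches only.

    pred_patches, target_patches: (B, N, D) patch tensors.
    mask: (B, N) float tensor, 1.0 where a patch was masked.
    """
    loss = ((pred_patches - target_patches) ** 2).mean(dim=-1)  # (B, N)
    return (loss * mask).sum() / mask.sum()


def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style multi-modal objective: symmetric InfoNCE over a batch."""
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # logits[i, j] = sim(image_i, text_j); matched pairs on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the image-to-text and text-to-image cross-entropies.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2


if __name__ == "__main__":
    # Toy usage: batch of 8, 196 patches of dim 768, embeddings of dim 512.
    pred, target = torch.randn(8, 196, 768), torch.randn(8, 196, 768)
    mask = (torch.rand(8, 196) < 0.75).float()  # 75% of patches masked
    print(masked_reconstruction_loss(pred, target, mask).item())

    img, txt = torch.randn(8, 512), torch.randn(8, 512)
    print(image_text_contrastive_loss(img, txt).item())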
| Original language | English |
|---|---|
| Pages (from-to) | 1-23 |
| Number of pages | 23 |
| Journal | Transactions on Machine Learning Research |
| Publication status | Published - 13 Sept 2024 |
Keywords
- computer vision and pattern recognition
- artificial intelligence
- computation and language
- machine learning