Abstract

Large-scale pre-trained vision models are becoming increasingly prevalent, offering expressive and generalizable visual representations that benefit various downstream tasks. Recent studies on the emergent properties of these models have revealed their high-level geometric understanding, particularly in the context of depth perception. However, it remains unclear how depth perception arises in these models when no explicit depth supervision is provided during pre-training. To investigate this, we examine whether monocular depth cues, similar to those used by the human visual system, emerge in these models. We introduce a new benchmark, DepthCues, designed to evaluate depth cue understanding, and present findings across 20 diverse and representative pre-trained vision models. Our analysis shows that human-like depth cues emerge in more recent, larger models. We also explore enhancing depth perception in large vision models by fine-tuning on DepthCues, and find that this improves depth estimation even without dense depth supervision. To support further research on depth perception in vision models, our benchmark and evaluation code will be made publicly available.
Original language: English
Title of host publication: Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Publisher: Institute of Electrical and Electronics Engineers
Pages: 1-21
Number of pages: 21
Publication status: Accepted/In press - 26 Feb 2025
Event: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025 - Music City Center, Nashville, United States
Duration: 11 Jun 2025 to 15 Jun 2025
https://cvpr.thecvf.com/

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Publisher: Institute of Electrical and Electronics Engineers
ISSN (Print): 1063-6919
ISSN (Electronic): 2575-7075

Conference

Conference: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025
Abbreviated title: CVPR 2025
Country/Territory: United States
City: Nashville
Period: 11/06/25 to 15/06/25

Keywords / Materials (for Non-textual outputs)

  • computer vision and pattern recognition

Title: DepthCues: Evaluating monocular depth perception in large vision models
