Depth estimation in real-world scenes

Poster Presentation: Tuesday, May 20, 2025, 2:45 – 6:45 pm, Pavilion
Session: Scene Perception: Natural images, virtual environments

Michaela Trescakova1, Wendy Adams1, Matthew Anderson2, James Elder3, Erich Graf1; 1School of Psychology, University of Southampton, 2School of Optometry and Vision Science, University of California, Berkeley, 3Centre for AI & Society, York University, Toronto

Human depth perception is often studied using simple shapes defined by a limited number of depth cues. In this study, we used the Southampton-York Natural Scenes (SYNS) dataset to examine the dynamics of depth estimation and the relative contributions of elevation, binocular disparity, and color to both ordinal and ratio depth judgments. Participants viewed briefly presented images (17–267 ms) from 19 outdoor scene categories in the SYNS dataset. In Experiment 1, images were viewed monocularly; Experiment 2 included both monocular and binocular conditions. Each image featured two crosshairs marking a pair of locations, and participants identified which location appeared farther away. They then used a slider to report the depth of the nearer location as a percentage of the depth of the farther location. Trials varied in the depth difference and mean depth of the target locations. Experiments 3 and 4 introduced color manipulations (natural, color-inverted, and greyscale), implemented in the CIELuv color space. Participants displayed a strong elevation prior across the experiments: they consistently judged the higher crosshair as farther. Participants were able to estimate local depth even at the briefest image presentations, with both elevation and binocular disparity influencing early depth estimates. Color effects were marginal relative to the impact of other cues, suggesting that color information does not substantially enhance early perception of relative depth in outdoor scenes. We found that deep networks trained on other datasets, and K-nearest-neighbours regression trained on SYNS image features, were less accurate than humans. This performance gap was most pronounced when elevation cues were absent and at longer presentation durations. Altogether, these results suggest that human depth estimation relies on cues not fully captured by local image features or current deep learning models.
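The abstract does not give implementation details for the color manipulations, but below is a minimal sketch of one plausible CIELuv pipeline, assuming that color inversion means negating the chromatic u and v channels while preserving lightness L, and that greyscale means zeroing them; scikit-image is used for the color-space conversions.

```python
# Sketch of the three color conditions (natural, color-inverted, greyscale).
# The exact transform is not specified in the abstract, so treat the chroma
# operations below as illustrative assumptions.
import numpy as np
from skimage import color  # pip install scikit-image


def color_conditions(rgb):
    """Return (natural, inverted, greyscale) versions of an RGB image.

    `rgb` is a float array in [0, 1] with shape (H, W, 3). Assumes
    inversion = negating u and v in CIELuv with L fixed, and
    greyscale = zeroing u and v.
    """
    luv = color.rgb2luv(rgb)

    inverted = luv.copy()
    inverted[..., 1:] *= -1          # flip chroma, preserve lightness

    greyscale = luv.copy()
    greyscale[..., 1:] = 0           # remove chroma entirely

    # Clip on the way back: negated chroma can fall outside the RGB gamut.
    to_rgb = lambda x: np.clip(color.luv2rgb(x), 0.0, 1.0)
    return rgb, to_rgb(inverted), to_rgb(greyscale)


if __name__ == "__main__":
    img = np.random.rand(64, 64, 3)  # stand-in for a SYNS photograph
    natural, inv, grey = color_conditions(img)
```

Similarly, a hedged sketch of the K-nearest-neighbours depth-regression baseline: the feature extractor, value of k, and training split are not given in the abstract, so the arrays and k below are placeholders.

```python
# Illustrative KNN depth-regression baseline; the actual SYNS image
# features and k used in the study are not specified in the abstract.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_train = rng.random((1000, 32))   # placeholder per-location image features
y_train = rng.random(1000) * 50.0  # placeholder ground-truth depths (m)

knn = KNeighborsRegressor(n_neighbors=5)  # k = 5 is an assumed value
knn.fit(X_train, y_train)

X_query = rng.random((2, 32))      # features at the two crosshair locations
d_near, d_far = np.sort(knn.predict(X_query))
ratio = 100.0 * d_near / d_far     # depth of nearer point as % of farther
```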

Acknowledgements: This work was funded by EPSRC grant EP/S016368/1 (ROSSINI, PI: Adams)