Scene Perception
Talk Session: Tuesday, May 20, 2025, 10:45 am – 12:15 pm, Talk Room 2
Talk 1, 10:45 am
The Occipital Place Area (OPA) represents left-right information in 8-year-olds, but not 5-year-olds
Rebecca J. Rennert1, Daniel D. Dilks1; 1Emory University
Recent fMRI studies suggest that the occipital place area (OPA) – a scene-selective cortical region involved in “visually-guided navigation” – is surprisingly late developing: it is not involved in visually-guided navigation at all until around 8 years of age and thus appears to develop abruptly (discontinuously). However, it could nevertheless be the case that OPA supports visually-guided navigation (in some “weaker” form) continuously, for example, with earlier emerging representations of one of the most primitive kinds of navigationally-relevant information: “sense” (left-right) information. Indeed, studies have shown that young toddlers and even non-human species, including fish and insects, can perceive and utilize sense information for navigation. Thus, how does OPA develop, continuously or discontinuously? Here we directly address this question by investigating the development of sense representation in OPA. More specifically, using fMRI adaptation in children at 5 and 8 years of age, we measured the response in OPA to repeating images of identical scenes (“same” condition), mirror-image reversals of scenes (“mirror” condition), or different scenes (“different” condition). Importantly, OPA in both groups adapted to the “same” condition (demonstrating that the adaptation paradigm works). Crucially, however, we found that OPA in 8-year-olds, like adults, was sensitive to the “mirror” condition (thus representing sense information), while OPA in 5-year-olds was not (thus not representing sense information). Taken together, these findings i) provide further evidence that the visually-guided navigation system undergoes protracted development, not even supporting sense information until around 8 years of age, and ii) raise the intriguing possibility that OPA develops discontinuously, developing abruptly around 8 years of age.
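As a rough illustration of the adaptation logic described above, the sketch below computes the two paired contrasts per age group: “different” vs. “same” as the check that adaptation occurs, and “mirror” vs. “same” as the test of sensitivity to sense (left-right) information. The data, group sizes, and effect sizes are simulated placeholders, not the study’s values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated per-subject mean OPA responses (percent signal change);
# columns: [same, mirror, different]. Placeholder values only.
n_subjects = 20
responses_8yo = rng.normal(loc=[0.4, 0.7, 0.8], scale=0.15, size=(n_subjects, 3))
responses_5yo = rng.normal(loc=[0.4, 0.4, 0.8], scale=0.15, size=(n_subjects, 3))

def release_from_adaptation(responses):
    """Paired contrasts: 'different' vs 'same' (adaptation check) and
    'mirror' vs 'same' (sensitivity to left-right 'sense' information)."""
    same, mirror, different = responses.T
    adaptation_check = stats.ttest_rel(different, same)
    sense_sensitivity = stats.ttest_rel(mirror, same)
    return adaptation_check, sense_sensitivity

for label, data in [("8-year-olds", responses_8yo), ("5-year-olds", responses_5yo)]:
    adapt, sense = release_from_adaptation(data)
    print(f"{label}: adaptation p={adapt.pvalue:.3f}, mirror vs. same p={sense.pvalue:.3f}")
```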
Talk 2, 11:00 am
Behavioral and Neural Correlates of Surround Suppression in Dynamic Natural Scenes
Merve Kınıklıoğlu1, Daniel Kaiser1,2; 1Neural Computation Group, Department of Mathematics and Computer Science, Physics, Geography, Justus Liebig University Gießen, Germany, 2Center for Mind, Brain and Behavior (CMBB), Philipps-University Marburg, Justus Liebig University Gießen and Technical University Darmstadt, Germany
Sensitivity to a central grating often decreases when it is presented with an annular grating. Such surround suppression is observed in neural activity within regions like MT and also manifests in motion perception. Notably, suppression strength decreases when the central and surrounding stimuli move in opposite directions. While previous studies have primarily investigated surround suppression using low-level stimuli like drifting gratings, evidence from ecologically relevant contexts remains limited. In this study, we investigated the behavioral and neural correlates of surround suppression in humans using dynamic natural scenes. Additionally, we examined whether the suppression effect is modulated by the content similarity between the central and surrounding scenes. Participants viewed panoramic videos created by moving static images behind a circular occluder, with central images shown through a 1.9° aperture and surrounding images through a 2.5°-10.4° annular aperture. The resulting videos depicted natural scenes with varied relationships between the center and surround, including identical exemplars, different exemplars of the same basic-level category, and videos from different basic-level or superordinate categories. In a behavioral experiment, participants reported the superordinate categories of the central and surrounding videos. Results showed that surround suppression was strongest when the central and surrounding videos were from different superordinate categories and weakened as content similarity increased. Furthermore, suppression decreased when the videos moved in opposite directions. In a complementary fMRI experiment, participants viewed the same stimuli while detecting rare target events. Findings revealed that surround suppression was evident in hMT+, with stronger suppression during same-direction videos. In contrast, regions such as V1 and scene-selective areas (OPA and PPA) exhibited surround facilitation. Critically, no content-specific effects were observed in any region of interest. These findings suggest that hMT+ activity alone cannot fully explain the perceptual suppression effect and point to complex interactions among multiple brain regions in surround suppression with dynamic natural scenes.
D.K. is supported by the DFG Grants SFB/TRR135, Project Number 222641018; KA4683/5-1, Project Number 518483074, “The Adaptive Mind,” funded by the Excellence Program of the Hessian Ministry of Higher Education, Science, Research and Art, and an ERC Starting Grant PEP, ERC-2022-STG 101076057.
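A minimal sketch of the center/surround aperture geometry described in the abstract above, assuming an arbitrary image resolution and pixels-per-degree value, and treating the stated aperture values (1.9° central, 2.5°-10.4° annular) as eccentricity bounds; none of these assumptions are taken from the study itself.

```python
import numpy as np

# Assumed display parameters (placeholders for illustration).
H, W = 512, 512
PIX_PER_DEG = 24.0

y, x = np.mgrid[:H, :W]
ecc_deg = np.hypot(x - W / 2, y - H / 2) / PIX_PER_DEG  # eccentricity in degrees

# Aperture values from the abstract, read here as eccentricity bounds (an assumption).
center_mask = ecc_deg <= 1.9
surround_mask = (ecc_deg >= 2.5) & (ecc_deg <= 10.4)

def composite_frame(center_frame, surround_frame, background=0.5):
    """Show the central video through the circular aperture and the surround
    video through the annulus; everything else is a uniform background."""
    out = np.full((H, W), background, dtype=float)
    out[center_mask] = center_frame[center_mask]
    out[surround_mask] = surround_frame[surround_mask]
    return out

# Example: two independent 'video frames' (random luminance here) composited.
frame = composite_frame(np.random.rand(H, W), np.random.rand(H, W))
```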
Talk 3, 11:15 am
Three categorization tasks yield comparable category spaces: a comparison using real-world scenes
Pei-Ling Yang1, Diane M. Beck1; 1University of Illinois
Categorization is a fundamental cognitive function that depends on the similarity among items. Many tasks have been developed to measure item similarities to better understand the underlying similarity space of categories. Most such studies have focused on object categories. The current study compares the similarity outcomes of three categorization tasks for real-world scenes: an arrangement task, an odd-one-out task, and a same-different judgment task. Our study asks whether a stable representation for scene categories (i.e., beach, city, mountain) exists across tasks. To assess the reliability of the tasks, each task was conducted twice for each participant (N = 98 for each task). The arrangement task asked participants to place each scene image relative to the three text anchors that they set at the beginning of the task. The odd-one-out task required participants to choose the scene that is the most dissimilar to the other two in a triplet of scenes. The same-different judgment task asked participants to respond whether a pair of scenes was from the same category or not. The similarity matrices were derived from distances in pixel space for the arrangement task, the probability of choosing a scene as dissimilar for the odd-one-out task, and the probability of reporting two scenes as ‘same’ for the same-different task. Rank correlations were calculated between the two repeats of each task to examine the reliability of the similarities. All three tasks showed comparable rank correlation reliability: arrangement (0.61), odd-one-out (0.60), same-different (0.69). Ordinal multidimensional scaling on the similarity matrices of each task was used to construct 3-D category spaces, where the distances reflect the similarity among stimuli. These rank-derived spaces were moderately correlated across tasks: arrangement & same-different (0.67), arrangement & odd-one-out (0.58), odd-one-out & same-different (0.57). These results imply some stable representation of scene category similarity space that is worth further investigation.
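The analysis pipeline described above (test-retest reliability via rank correlation, then ordinal MDS into a 3-D category space) can be sketched as follows. The dissimilarity matrices here are random placeholders standing in for the task-derived matrices (pixel distances, odd-one-out choice probabilities, or same-different response rates); the item count and MDS settings are assumptions for illustration.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
n_items = 30  # assumed number of scene images

def random_dissimilarity(n):
    """Placeholder symmetric dissimilarity matrix with a zero diagonal."""
    d = rng.random((n, n))
    d = (d + d.T) / 2
    np.fill_diagonal(d, 0.0)
    return d

repeat1, repeat2 = random_dissimilarity(n_items), random_dissimilarity(n_items)

# Reliability: rank correlation between the two repeats (upper triangle only).
iu = np.triu_indices(n_items, k=1)
rho, _ = spearmanr(repeat1[iu], repeat2[iu])
print(f"test-retest rank correlation: {rho:.2f}")

# Ordinal (nonmetric) MDS into a 3-D category space.
mds = MDS(n_components=3, metric=False, dissimilarity="precomputed", random_state=0)
space_3d = mds.fit_transform(repeat1)  # (n_items, 3) coordinates
```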
Talk 4, 11:30 am
Unexpected scene views are prioritized in perceptual awareness in real-world environments
Anna Mynick1, Michael A. Cohen2, Adithi Jayaraman1, Kala Goyal1, Caroline E. Robertson1; 1Dartmouth College, 2Amherst College
Although we only perceive a small portion of the visual world at any given moment, we operate within the world with remarkable efficiency. How does memory for the world around us help facilitate perception in naturalistic environments? Here, we examined how memory for immersive, 360° environments shapes perceptual awareness of the world around us across head turns. Participants (N=64) first studied a set of immersive real-world scenes from the local college campus in head-mounted virtual reality (VR). After studying these scenes, each trial began by showing participants a limited view from a studied scene (prime). Then, the prime disappeared and participants turned their head left or right 90° towards a target view. Targets were initially masked with continuous flash suppression, and participants’ task was to indicate when the target entered perceptual awareness. To ensure true target detection, only half of the target was displayed (a semi-circle) and participants indicated which side of the circle the target was on (left/right). Targets depicted another view from the same scene, either displayed in their learned spatial position (‘expected view’) or displayed 180° opposite their learned position (‘unexpected view’). We found that unexpected scene views entered perceptual awareness roughly 200 ms faster than expected scene views (p=.006). Intriguingly, scene memory alone did not explain these results: the same pattern emerged in a new group of participants (N=31) who completed the same task on a set of unfamiliar, never-studied scenes (p=.04). Taken together, these results suggest that extrapolation beyond the current field of view is sufficient to draw expected and unexpected scene views into perceptual awareness at different rates. More broadly, our findings indicate that expectations of upcoming scene views shape how scene information is prioritized in perceptual awareness across head turns, supporting efficient interaction in immersive environments despite our limited field of view.
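The key behavioral comparison (breakthrough-into-awareness times for expected vs. unexpected target views, compared within participants) reduces to a paired contrast, sketched below on simulated placeholder data; the timing values and variances are not the study’s.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated per-participant mean breakthrough times (seconds) under continuous
# flash suppression for expected vs. unexpected target positions (placeholders).
n = 64
expected = rng.normal(loc=2.0, scale=0.4, size=n)
unexpected = expected - rng.normal(loc=0.2, scale=0.3, size=n)  # ~200 ms faster

t, p = stats.ttest_rel(expected, unexpected)
print(f"mean advantage: {np.mean(expected - unexpected) * 1000:.0f} ms, p = {p:.3f}")
```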
Talk 5, 11:45 am
Modeling human scene understanding fixation-by-fixation using generative models
Ritik Raina1, Alexandros Graikos1, Abe Leite1, Seoyoung Ahn2, Gregory Zelinsky1; 1Stony Brook University, 2UC Berkeley
The gist-level understanding that humans extract from brief exposures to a scene is quickly elaborated into a more detailed scene understanding with each new fixation made during scene viewing. Each fixation enables the extraction of rich visual and contextual features that encode new information about the scene’s objects and their spatial interrelationships. Successive fixations lead to an incrementally evolving representation of the scene's meaning and layout that parallels the viewer’s unfolding understanding of the scene, but this dynamic evolution of scene understanding has yet to be modeled. To this end, we present SparseDiff, a novel approach that leverages generative (latent diffusion) modeling to incrementally generate images that reflect progressively detailed levels of scene understanding and that can be used as hypotheses for behavioral evaluation. Our model uses a self-supervised image encoder (DINOv2) to extract visual representations from the fixated regions of an image that capture both local object and broader scene context features. These fixation-grounded features then condition a pre-trained text-to-image diffusion model to generate full, coherent scene images. As SparseDiff accumulates information from successive fixations, it iteratively generates increasingly refined hypotheses based on the scene’s objects that were fixated while filling in unattended regions with contextually plausible content. We evaluated SparseDiff using the COCO-FreeView dataset and found that increasing the number of fixations provided to the model led to enhanced visual and semantic similarity between generated and original images, as measured by image similarity metrics (e.g., CLIP, DreamSim) that capture high-level semantic alignment as well as mid-level feature similarity. Future work will use same-different tasks to evaluate whether generated hypotheses are scene “metamers” for what was perceived during viewing, and use SparseDiff as a tool to study individual differences in visual perception by comparing the different scene understandings generated from different viewing scanpaths.
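The first stage of the pipeline described above, extracting one DINOv2 embedding per fixated region, can be sketched as below. The backbone variant, crop size, and preprocessing are assumptions for illustration, not necessarily SparseDiff’s choices, and the diffusion-conditioning stage is only noted in a comment rather than reproduced.

```python
import torch
from torchvision import transforms
from PIL import Image

# Load a DINOv2 backbone (ViT-S/14) from torch.hub; the variant is an
# assumption for illustration, not necessarily the one used by SparseDiff.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
dinov2.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def fixation_features(image: Image.Image, fixations, crop_size=128):
    """Extract one DINOv2 embedding per fixated region.
    `fixations` is a list of (x, y) pixel coordinates; crop size is assumed."""
    feats = []
    for x, y in fixations:
        left, top = max(0, x - crop_size // 2), max(0, y - crop_size // 2)
        crop = image.crop((left, top, left + crop_size, top + crop_size))
        with torch.no_grad():
            feats.append(dinov2(preprocess(crop).unsqueeze(0)))
    # Accumulated fixation features; in SparseDiff these condition a
    # pre-trained text-to-image diffusion model (not reproduced here).
    return torch.cat(feats, dim=0)  # (n_fixations, embed_dim)
```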
This work was supported in part by NSF IIS awards 1763981 and 2123920 to G.Z.
Talk 6, 12:00 pm
Contours, not textures, determine orientation tuning in humans
Seohee Han1, Dirk B. Walther1; 1University of Toronto
Ever since the seminal discoveries of Hubel and Wiesel, we have known that cortical representations of visual input start with oriented edges and lines in the primary visual cortex. Determining exactly how orientation information is processed in the human brain is critical for understanding the computational and neural mechanisms of vision. Filter-based and contour-based methods, including simple bars and Gabor filters, have historically been used interchangeably for measuring orientation. While these approaches effectively capture orientation, filter-based methods typically aggregate data across spatial frequencies, blending texture and contour information. This overlap raises important questions about what specific orientation features are most relevant for perception and neural representation. In this study, we address this important question with two complementary approaches. First, we investigated human orientation judgments using image patches with maximal and minimal differences between average orientations computed by steerable pyramid filters and a contour-based method. Behavioural results revealed that human judgments align with contour-based orientation but not with filter-based orientation when the two were in conflict. Observers clearly prioritized contours over textures when summarizing orientation in complex scenes. Second, we evaluated the impact of orientation computation methods on neural maps of orientation selectivity, using Roth and colleagues' (2022) image-computable model as a benchmark. By comparing filter-based methods applied to photographs and line drawings with a contour-based method, we assessed how the choice of computation influences model fit and voxel-level orientation preference in the visual cortex. Again, we found a clear advantage of contours over textures in explaining orientation tuning in the visual cortex. These findings underscore the importance of oriented contours rather than textures as the elemental building blocks of vision. By highlighting the importance of contours for human orientation judgments and neural selectivity in the visual cortex, our work emphasizes the need for methodological alignment in visual neuroscience research.
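To illustrate the contrast drawn above, the sketch below estimates a patch’s dominant orientation two ways: a Gabor filter bank that pools energy across the whole patch (a stand-in for the steerable-pyramid computation, mixing texture and contour content) and an estimate restricted to detected contours. This is an illustrative simplification, not the authors’ specific filter or contour-extraction pipeline, and the filter frequency, edge-detector settings, and bin counts are assumed values.

```python
import numpy as np
from skimage import color, feature, filters

def filter_based_orientation(img, n_orient=8, frequency=0.2):
    """Dominant orientation from a Gabor filter bank pooled over the whole patch."""
    gray = img if img.ndim == 2 else color.rgb2gray(img)
    thetas = np.linspace(0, np.pi, n_orient, endpoint=False)
    energy = [np.sum(np.hypot(*filters.gabor(gray, frequency, theta=t)))
              for t in thetas]
    return thetas[int(np.argmax(energy))]

def contour_based_orientation(img, sigma=2.0, n_bins=8):
    """Dominant orientation measured only along detected contours:
    edge pixels from a Canny detector, orientation from local gradients."""
    gray = img if img.ndim == 2 else color.rgb2gray(img)
    edges = feature.canny(gray, sigma=sigma)
    gy, gx = np.gradient(gray)
    # Contour orientation is perpendicular to the local gradient direction.
    orient = (np.arctan2(gy, gx) + np.pi / 2) % np.pi
    hist, bins = np.histogram(orient[edges], bins=n_bins, range=(0, np.pi))
    return bins[int(np.argmax(hist))]
```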