Perceptual Organization: Neural mechanisms, models

Talk Session: Friday, May 16, 2025, 3:30 – 4:45 pm, Talk Room 2

Talk 1, 3:30 pm

Investigating semantic expectation and prediction error in the visual cortex with a large fMRI vision-language dataset

Shurui Li1, Zheyu Jin1, Ru-Yuan Zhang2, Shi Gu3, Yuanning Li1; 1ShanghaiTech University, Shanghai, China, 2Shanghai Jiao Tong University, Shanghai, China, 3University of Electronic Science and Technology of China, Chengdu, China

Classical models of visual processing in the brain emphasize a predominantly feedforward hierarchical coding scheme, in which lower-level features are progressively integrated into higher-level semantic representations. However, this view fails to fully account for the complex and dynamic nature of semantic information processing in the visual cortex, which involves interactions extending beyond passive feedforward pathways. To study the neural coding of semantic information in the visual hierarchy, we collected a large-scale fMRI vision-language dataset in which each participant processed over 4,400 paired stimuli, each consisting of a text caption followed by a naturalistic image, with the task of evaluating their semantic congruence. Motivated by predictive coding theory, we hypothesized that the early visual cortex can represent semantic expectations and prediction errors. First, we observed that the early visual cortex responded significantly less to images that matched subjects' semantic expectations than to unexpected images. To characterize this neural coding, we built neural encoding models using features extracted from large language and vision models. We found that language-model features could predict the early visual cortex's response after subjects viewed a text caption, indicating that the early visual cortex can generate semantic expectations. Next, we found that neural activity from V1 to V3 encodes prediction mismatches of visual stimuli, representing a progression from low-level to high-level prediction errors. Finally, we found that the degree of response-amplitude reduction correlated with the neural coding of high-level prediction errors. In sum, using a large fMRI vision-language dataset, we provide evidence of cross-modal semantic expectation and prediction-error coding in the visual cortex.
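[Editor's note] The encoding analysis described above regresses language-model caption features onto voxel responses. The following is a minimal sketch of a generic voxelwise ridge-regression encoding model; the simulated data, array shapes, and regularization grid are illustrative assumptions, not the authors' pipeline.

    import numpy as np
    from sklearn.linear_model import RidgeCV
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n_stim, n_feat, n_vox = 4400, 256, 1000           # shapes are illustrative
    X = rng.standard_normal((n_stim, n_feat))         # stand-in caption features
    W = rng.standard_normal((n_feat, n_vox)) * 0.1
    Y = X @ W + rng.standard_normal((n_stim, n_vox))  # simulated voxel responses

    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

    # Ridge regression from caption features to voxel responses,
    # with the regularization strength chosen by cross-validation.
    model = RidgeCV(alphas=np.logspace(0, 4, 9)).fit(X_tr, Y_tr)
    Y_hat = model.predict(X_te)

    # Encoding accuracy: Pearson r between predicted and observed
    # held-out responses, computed independently per voxel.
    r = [np.corrcoef(Y_hat[:, v], Y_te[:, v])[0, 1] for v in range(n_vox)]
    print("median voxel prediction r =", float(np.median(r)))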

This work is supported by the National Natural Science Foundation of China (32371154, Y.L.), Shanghai Rising-Star Program (24QA2705500, Y.L.), Shanghai Pujiang Program (22PJ1410500, Y.L.), and the Lingang Laboratory (LG-GG-202402-06, Y.L.).

Talk 2, 3:45 pm

Mesoscale functional connectivity in human V1 revealed by high-resolution fMRI

Marianna E. Schmidt1,2, Iman Aganj3,4, Jason Stockmann3,4, Berkin Bilgic3,4,5, Yulin Chang6, W. Scott Hoge7, Evgeniya Kirilina1, Nikolaus Weiskopf1,8,9, Shahin Nasr3,4; 1Max Planck Institute for Human Cognitive and Brain Sciences, 2Max Planck School of Cognition, Leipzig, Germany, 3Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, 4Harvard Medical School, 5Harvard/MIT Health Sciences and Technology, 6Siemens Medical Solutions USA Inc., Malvern, PA, USA, 7Imaginostics, Inc., Orlando, FL, USA, 8Felix Bloch Institute for Solid State Physics, Faculty of Physics and Earth System Sciences, Leipzig University, Leipzig, Germany, 9Wellcome Centre for Human Neuroimaging, Institute of Neurology, University College London

Despite its importance in shaping visual perception, the functional connectivity between ocular dominance columns (ODCs), the building blocks of neuronal processing within the human primary visual cortex (V1), remains mostly unknown. In this study, we used high-resolution fMRI (7T; voxel size = 1 mm isotropic) to localize ODCs and assess their resting-state (eyes closed) functional connectivity (rs-FC) in 11 individuals (3 females) with intact vision (age = 30.9±5.9 years). Consistent with studies in animals, we found stronger rs-FC between ODCs with the same rather than opposite ocular preference (p<0.01). The level of rs-FC was generally higher at mid-cortical depths, while selectivity was more pronounced at superficial and deep cortical depths. Surpassing expectations from anatomical studies of ODC connectivity, we found the following. First, the selective rs-FC between ODCs was preserved for distances of up to 4 cm, indicating that connectivity between ODCs remains selective across multiple synapses. Second, rs-FC selectivity was significantly higher between ODCs that exhibited stronger (compared to weaker) ocular preference (p<10⁻³), even though ODC mapping and rs-FC measurements were conducted in separate scan sessions. Third, the extent of selectivity appeared to vary between foveal and peripheral regions, and to a lesser extent between dorsal and ventral regions, suggesting heterogeneity in the distribution of rs-FC within V1. We further tested whether the ODC map was predictable from the rs-FC pattern. Our preliminary results showed a significant correlation between rs-FC and ODC maps (p<10⁻⁵). The level of this correlation declined as the size of the regions of interest increased from 20.21 mm² (r=0.20) to 286.47 mm² (r=0.08). This result points to a promising opportunity for ODC mapping in individuals with disrupted binocular vision (e.g., monocular blindness). In conclusion, our results demonstrate the utility of high-resolution fMRI for studying mesoscale rs-FC within V1, successfully replicating findings from animal models and highlighting promising new opportunities for future exploration.
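[Editor's note] The core selectivity measure, stronger rs-FC between like-eye than unlike-eye ODC pairs, can be sketched as follows. The arrays below are simulated stand-ins for the authors' denoised 7T time courses, and the selectivity index is an assumption for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n_units, n_tr = 200, 300
    ts = rng.standard_normal((n_units, n_tr))  # stand-in resting-state time courses
    eye = rng.integers(0, 2, size=n_units)     # 0 = left-eye ODC, 1 = right-eye ODC

    fc = np.corrcoef(ts)                       # unit-by-unit rs-FC matrix
    iu = np.triu_indices(n_units, k=1)         # keep each pair once
    same = eye[iu[0]] == eye[iu[1]]

    # Selectivity index: mean rs-FC for like-eye pairs minus unlike-eye pairs.
    selectivity = fc[iu][same].mean() - fc[iu][~same].mean()
    print("rs-FC ocular selectivity:", selectivity)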

Talk 3, 4:00 pm

Feature-specific and feature-independent ensemble representations in the human brain

Patxi Elosegi1,2, Yaoda Xu2, David Soto1; 1Basque Center on Cognition, Brain and Language, Donostia, Spain, 2Psychology Department, Yale University, New Haven, CT, US

The human brain overcomes processing limitations by compressing redundant visual input into ensemble representations. While psychophysical studies demonstrate that ensembles are efficiently extracted across low-, mid-, and high-level features, the domain-generality of ensemble perception remains unclear. Neuroimaging holds the key to addressing this question, yet prior findings have been inconsistent. This fMRI study aims to (i) test whether ensembles composed of visual features of increasing complexity (orientation, shape, and animacy) are processed locally in feature-selective areas or by a shared neural substrate, and (ii) assess whether different summary statistical descriptors, such as mean and ratio, involve common brain representations. We collected fMRI data from 24 participants (two scanning sessions each) using a mini-block design. In each mini-block, participants saw a sequential presentation of five ensemble displays consisting of the same twelve objects, which shuffled positions across successive displays. Stimuli were carefully generated to vary both in the ratio of items from each class (e.g., living vs. nonliving items) and along a continuous distribution of the mean ensemble feature (e.g., average lifelikeness). Results from a searchlight MVPA revealed a clear dissociation: mean ensemble features are encoded locally, in a feature-specific manner, along the medial ventral visual pathway, following the texture-sensitive collateral sulcus. In contrast, ratio information, defined by the proportion of items from each class, is encoded in a feature-independent manner in the dorsal visual pathway, particularly along the intraparietal sulcus, as demonstrated by cross-decoding analyses. Thus, ensemble representations are neither completely distributed nor centralized but involve an interplay between sensory and posterior parietal areas that encodes both stimulus-specific and stimulus-independent information. To test the generality of these results, we will next assess whether similar representations emerge in CNN architectures trained to perform the same tasks as the participants. This work bridges gaps in prior neuroimaging research and provides an open-source fMRI dataset, fostering computational models of ensemble processing across diverse visual features.
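[Editor's note] The feature-independence claim rests on cross-decoding: a classifier trained to read out the class ratio from activity patterns evoked by one feature domain is tested on patterns from another. Below is a minimal sketch of that logic with simulated patterns; all names, shapes, and the classifier choice are hypothetical.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    n_trials, n_voxels = 120, 500
    ratio = rng.integers(0, 2, size=n_trials)  # e.g., low vs. high living/nonliving ratio

    # Simulated searchlight patterns for two feature domains that share
    # a common ratio signal (the feature-independent case).
    pat_animacy = rng.standard_normal((n_trials, n_voxels)) + ratio[:, None]
    pat_orient = rng.standard_normal((n_trials, n_voxels)) + ratio[:, None]

    # Train on animacy ensembles, test on orientation ensembles:
    # above-chance accuracy indicates a feature-independent ratio code.
    clf = LogisticRegression(max_iter=1000).fit(pat_animacy, ratio)
    print("cross-feature decoding accuracy:", clf.score(pat_orient, ratio))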

P.E.: Basque Government PREDOC grant. D.S.: Basque Government BERC 2022-2025 program, Spanish Ministry 'Severo Ochoa' Programme for Centres/Units of Excellence in R&D (CEX2020-001010-S), and project grant PID2019-105494GB-I00. Y.X.: US NIH grant R01EY030854.

Talk 4, 4:15 pm

Larger and earlier category-selective neural activity in the human ventral occipito-temporal cortex than in the medial temporal lobe

Simen Hagen1, Corentin Jacques1, Sophie Colnat-Coulbois1,3, Jacques Jonas1,2, Bruno Rossion1,2; 1Université de Lorraine, CNRS, F-54000 Nancy, France, 2Université de Lorraine, CHRU-Nancy, Service de Neurologie, F-54000 Nancy, France, 3Université de Lorraine, CHRU-Nancy, Service de Neurochirurgie, F-54000 Nancy, France

The human ventral occipito-temporal cortex (VOTC) contains spatially dissociated category-selective neural populations. For example, face-selective neural populations (i.e., with significantly different responses to faces than to non-face objects) are found in the lateral fusiform gyrus, while place-selective neural populations (i.e., with significantly different responses to buildings than to non-building objects) are found in and around the more medial parahippocampal gyrus. However, less is known about category-selective populations in the medial temporal lobe (MTL) and their interactions with the VOTC. On the one hand, the socially relevant amygdala (AMG) and the spatially relevant hippocampus (HPC) could relay early activity to face- and place-selective populations in the VOTC, respectively. On the other hand, face- and place-selective neural populations in the VOTC could relay early activity to the AMG and HPC, respectively. Here, we examined the spatio-temporal profiles of face- and place-selective activity in the MTL and the VOTC of a large group of epileptic patients (N=88) implanted with intracerebral electrodes in the grey matter of these regions. Both face-selective and place-selective (i.e., building-selective) neural activity was isolated with separate frequency-tagging protocols, providing objective measures of category-selective neural activity, devoid of low-level confounds, with high spatial and temporal resolution. We find both face- and place-selective contacts in the MTL, with larger and earlier face-selective than place-selective activity in the AMG. In contrast, places elicit earlier category-selective activity than faces in the HPC. Crucially, category-selective activity is more prominent (~3 times higher) and emerges significantly earlier (~50 ms) in the VOTC than in the MTL. These findings cast doubt on the view that the amygdala serves as an early “face detector” in the human brain and suggest that face- and place-selective activity follow different circuits from the VOTC to different MTL regions.
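[Editor's note] Frequency-tagging quantification is typically done by comparing the spectral amplitude at the category-change frequency against neighboring frequency bins. The sketch below illustrates that computation on a simulated signal; the 6 Hz base rate, 1.2 Hz oddball rate, and the noise-bin choice are assumptions for illustration, not the authors' exact parameters.

    import numpy as np

    fs, dur = 512.0, 60.0                    # sampling rate (Hz), duration (s)
    t = np.arange(0, dur, 1 / fs)
    sig = (np.sin(2 * np.pi * 6.0 * t)       # response at the base stimulation rate
           + 0.3 * np.sin(2 * np.pi * 1.2 * t)  # category-selective (oddball) response
           + np.random.default_rng(2).standard_normal(t.size))

    amp = np.abs(np.fft.rfft(sig)) / t.size  # amplitude spectrum
    freqs = np.fft.rfftfreq(t.size, 1 / fs)

    target = np.argmin(np.abs(freqs - 1.2))  # bin at the oddball frequency
    # Noise estimate from surrounding bins, skipping immediate neighbors.
    neighbors = np.r_[target - 12:target - 2, target + 3:target + 13]
    print("SNR at 1.2 Hz:", amp[target] / amp[neighbors].mean())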

Funding declaration: ANR IGBDEV ANR-22-CE28-0028; ERC AdG HUMANFACE 101055175

Talk 5, 4:30 pm

The organization of high-level visual cortex is aligned with visual rather than abstract linguistic information

Adva Shoham1, Rotem Broday-Dvir2, Rafael Malach2, Galit Yovel1; 1Tel Aviv University, 2Weizmann Institute of Science

Recent studies have shown that the response of high-level visual cortex to images can be predicted from their linguistic descriptions, suggesting an alignment between visual and linguistic information. We hypothesized that this alignment is limited to textual descriptions of the visual content of an image and does not extend to abstract descriptions. We therefore distinguish between two types of linguistic descriptions of visual images: visual text, which describes the image's purely visual content, and abstract text, which describes conceptual knowledge unrelated to the immediate visual attributes. Accordingly, we tested the hypothesis that visual text, but not abstract text, predicts the neural response to images in high-level visual cortex. To that end, we used visual and language deep-learning models to predict human iEEG responses to images of familiar faces and places. We generated two types of textual descriptions for the images: visual text, describing the visual content of the image, and abstract text, based on their Wikipedia definitions. We then extracted relational-structure representations from a large language model (GPT-2) for the text descriptions and from a deep neural network (VGG16) for the images. Using these visual and linguistic representations, we predicted the iEEG responses to the images. Neural responses in high-level visual cortex were predicted similarly well by the visual representations of the images and by the linguistic representations of the visual text, but not by the abstract text. Frontal-parietal electrodes showed the reverse pattern. These results are in line with recent findings showing that textual descriptions of image content also predict responses to images in the macaque brain. Together, these findings demonstrate that visual-language alignment in high-level visual cortex is limited to visually grounded language.
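[Editor's note] The prediction analysis follows the logic of representational similarity: dissimilarity structures derived from image features and from text embeddings are compared against the neural dissimilarity structure. Below is a minimal sketch with random stand-in features; all shapes are hypothetical, and the stand-ins replace the GPT-2 and VGG16 embeddings the authors actually extracted.

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.stats import spearmanr

    rng = np.random.default_rng(3)
    n_images = 28
    vgg = rng.standard_normal((n_images, 4096))          # stand-in image features
    gpt_visual = rng.standard_normal((n_images, 768))    # stand-in visual-text embeddings
    gpt_abstract = rng.standard_normal((n_images, 768))  # stand-in abstract-text embeddings
    neural = rng.standard_normal((n_images, 100))        # stand-in iEEG response patterns

    def rdm(x):
        # Condition-by-condition dissimilarity (1 - Pearson r), vectorized.
        return pdist(x, metric="correlation")

    neural_rdm = rdm(neural)
    for name, feats in [("image", vgg), ("visual text", gpt_visual),
                        ("abstract text", gpt_abstract)]:
        rho, _ = spearmanr(neural_rdm, rdm(feats))
        print(f"{name} vs. neural RDM: rho = {rho:.3f}")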