Object Recognition: Models
Talk Session: Sunday, May 18, 2025, 10:45 am – 12:30 pm, Talk Room 2
Talk 1, 10:45 am
Connecting the dots: similar neural and behavioural representations for visual Braille and line-based scripts
Filippo Cerpelloni1,2, Olivier Collignon2,3*, Hans Op de Beeck1*; 1Department of Brain and Cognition, Leuven Brain Institute, KU Leuven, Belgium, 2Institute of Psychology (IPSY) & Institute of Neuroscience (IoNS), UCLouvain, Belgium, 3HES-SO Valais-Wallis, The Sense Innovation and Research Center, Lausanne & Sion, Switzerland
Visual object recognition relies on shape processing and progressive integration of line-junctions. Reading has been thought to co-opt this central mechanism of object perception, since most scripts share similar visual features such as line-junctions. We present a series of studies on visual Braille, a script developed for touch that possesses no explicit shape cues. Behaviourally, we show similar script acquisition in novice participants who learn visual Braille or a custom-made script with Braille dots joined into lines (Line Braille). Across four days of visual training, we observe no differences in the accuracy and speed of transcription of words from the novel to the native script. Only a limited advantage is present in the initial training on single letters, and it is quickly levelled by training on full words. In the brain, the Visual Word Form Area (VWFA) responds preferentially to Braille over scrambled dots in expert visual Braille readers, but not in naïve controls. Moreover, we can decode the linguistic properties of letter strings (e.g. words vs. pseudo-words) in both groups for their known scripts. In experts, the representational similarity between conditions in visual Braille correlates with the similarity structure for the actively read line-based script. We observe this pattern in several key regions of the visual stream (V1, LO), although without being able to decode linguistic properties across scripts, and in linguistic regions (l-PosTemp), where we also find cross-script decoding. Lastly, we replicated the designs tested in humans in computational models, to separate visual and linguistic influences. Overall, converging evidence shows that the linguistic properties of a visual script, rather than its low-level line-junctions, play a major role both in how an individual approaches reading and in how the visual system, and the VWFA in particular, processes scripts.
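For readers unfamiliar with the analysis logic, the sketch below illustrates a generic representational similarity analysis (RSA) of the kind referenced in the abstract: condition-by-condition dissimilarity matrices are computed for each script and then compared. The ROI patterns are random placeholders and the distance and correlation choices are assumptions, not the authors' pipeline.

```python
# Minimal RSA sketch: compare representational geometry across two scripts.
# All data here are random placeholders.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_conditions, n_voxels = 8, 200          # e.g., word/pseudo-word conditions per script

braille_patterns = rng.normal(size=(n_conditions, n_voxels))   # ROI patterns, visual Braille
line_patterns = rng.normal(size=(n_conditions, n_voxels))      # ROI patterns, line-based script

# Representational dissimilarity matrices (condensed upper triangle, correlation distance)
rdm_braille = pdist(braille_patterns, metric="correlation")
rdm_line = pdist(line_patterns, metric="correlation")

# Second-order similarity: do the two scripts share representational structure?
rho, p = spearmanr(rdm_braille, rdm_line)
print(f"Cross-script RDM correlation: rho={rho:.3f}, p={p:.3f}")
```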
Talk 2, 11:00 am
A Bayesian model of camouflage detection by humans
Abhranil Das1, Wilson Geisler1; 1University of Texas at Austin
Camouflage is an impressive feat of biological evolution, but so is its detection by predators and prey. Well-camouflaged animals copy the luminance, contrast, colour and texture of their natural backgrounds, leaving only the animals’ boundary available for detection. This is one of the hardest cases of visual detection and reveals many detection strategies and their limits. We conducted experiments in which humans detected synthetic camouflaged targets across varying conditions, including different textures. To explain these data, we developed a principled detection model that follows human optics and biologically plausible computations and is informed by the statistics of the relevant features in natural images. The model filters an image with human optics, then computes edge gradients and groups them into edge contours. It then computes several contour features: the fraction of area they cover, their lengths, their position and orientation alignments with the true target boundary, their curvatures, and their edge power at five scales. Additionally, it computes histograms of edge gradient magnitude, orientation, and their product (a proxy for their correlation) across all pixels. In parallel, we model the statistical distribution of each of these features over natural images by computing them on our database of optics-filtered natural image patches. Using these known feature distribution families, we then construct optimal Bayesian decision variables that measure whether the features in the camouflage image are the same over the target boundary region as outside it. We combine these feature decision variables using a multivariate Gaussian model, which outputs a final detection response as well as the relative contribution of each feature to detection. We fit these to our experimental data so that a single principled model can predict human camouflage detection performance across our entire array of diverse stimuli and account for many of our parametric experimental observations.
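The general pattern of combining per-feature Bayesian decision variables with a multivariate Gaussian model can be sketched as below. The feature values, their Gaussian distribution families, and the zero decision criterion are illustrative assumptions, not the authors' implementation.

```python
# Sketch: per-feature log-likelihood-ratio decision variables, combined with
# multivariate Gaussian models of each class. Placeholder data throughout.
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(1)
n_trials, n_features = 500, 3

# Placeholder feature measurements on target-present vs target-absent images
present = rng.normal(loc=0.5, scale=1.0, size=(n_trials, n_features))
absent = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_features))

def feature_llr(x, pres, abs_):
    # Per-feature decision variables: log-likelihood ratios under fitted Gaussians
    mu_p, sd_p = pres.mean(0), pres.std(0)
    mu_a, sd_a = abs_.mean(0), abs_.std(0)
    return norm.logpdf(x, mu_p, sd_p) - norm.logpdf(x, mu_a, sd_a)

dv_present = feature_llr(present, present, absent)
dv_absent = feature_llr(absent, present, absent)

# Combine the feature decision variables with multivariate Gaussian models of
# each class to obtain a final detection decision variable.
mvn_p = multivariate_normal(dv_present.mean(0), np.cov(dv_present.T))
mvn_a = multivariate_normal(dv_absent.mean(0), np.cov(dv_absent.T))
combined = mvn_p.logpdf(dv_present) - mvn_a.logpdf(dv_present)
hit_rate = (combined > 0).mean()
print(f"Hit rate at a zero criterion: {hit_rate:.2f}")
```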
Talk 3, 11:15 am
Domain-general object recognition ability has perception and memory subfactors
Conor J. R. Smithson1, Isabel Gauthier1; 1Vanderbilt University
The ability to make subordinate-level object identity judgements tends to be general across object categories. This general object recognition ability (o) is measured using tasks in which errors result from pressure on either perceptual or memory mechanisms. For example, perception can be taxed by high target-distractor similarity, short presentation times, and visual noise, while memory can be taxed by having multiple targets and delays between study and test. Both task types have been assumed to measure a single o construct. Using structural equation modeling, we asked whether object memory and perception abilities are separable, whether they both reflect a general o factor, and whether they are fully explained by established visual, memory, and cognitive abilities. Participants completed eight object recognition tests, four challenging perception and four challenging memory, alongside tests of visual, memory, and intelligence abilities. A model with separate object perception and memory factors fit significantly better than a model with a single object recognition factor. Low-level visual discrimination ability uniquely predicted object perception, but not object memory, while working memory uniquely predicted object memory, but not object perception. General intelligence uniquely predicted both object memory and perception to a similar extent. After regressing out these constructs, the residual correlation between object perception and object memory was very high, with the majority of remaining variance being shared between them. In a model where object memory and object perception were subfactors of a higher-order o, both loaded strongly onto o, while each subfactor’s non-o variance was completely explained by either low-level visual discrimination ability or working memory. The higher-order o factor was strongly related to general intelligence. Object perception and memory abilities are differentiable, but both substantially reflect a more general o factor. o accounts for variance in object recognition tests that is not already explained by established cognitive abilities.
This work was supported by the David K. Wilson Chair Research Fund from Vanderbilt University and NSF BCS Award 2316474.
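The model comparison described in this abstract (a single-factor model versus separate perception and memory factors) can be sketched with the semopy package as below. The simulated data, variable names, and loadings are hypothetical placeholders, not the study's measures or results.

```python
# Sketch of a one-factor vs two-factor SEM comparison with simulated data.
import numpy as np
import pandas as pd
import semopy

rng = np.random.default_rng(2)
n = 300
o = rng.normal(size=n)                       # general object recognition factor
perc = 0.8 * o + 0.6 * rng.normal(size=n)    # perception subfactor
mem = 0.8 * o + 0.6 * rng.normal(size=n)     # memory subfactor

data = pd.DataFrame({
    **{f"perc{i}": 0.7 * perc + 0.7 * rng.normal(size=n) for i in range(1, 5)},
    **{f"mem{i}": 0.7 * mem + 0.7 * rng.normal(size=n) for i in range(1, 5)},
})

one_factor = """
o =~ perc1 + perc2 + perc3 + perc4 + mem1 + mem2 + mem3 + mem4
"""
two_factor = """
perception =~ perc1 + perc2 + perc3 + perc4
memory =~ mem1 + mem2 + mem3 + mem4
perception ~~ memory
"""

for name, desc in [("one-factor", one_factor), ("two-factor", two_factor)]:
    model = semopy.Model(desc)
    model.fit(data)
    stats = semopy.calc_stats(model)          # standard fit indices (CFI, RMSEA, ...)
    print(name, "CFI:", round(float(stats["CFI"].values[0]), 3))
```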
Talk 4, 11:30 am
Embedding Object-Scene Relationships: Insights from Human Behavior and Vision-Language Models
Karim Rajaei1, Hamid Soltanian-Zadeh2,1; 1School of Cognitive Science, IPM, Tehran, Iran, 2University of Tehran
In real-world environments, objects are embedded within scenes, where semantic and syntactic relationships guide perception. The human brain efficiently encodes these contextual regularities, embedding object-scene and object-object relationships into neural systems. However, the mechanisms by which context influences object recognition, and the extent to which computational models replicate these effects, remain poorly understood. To address this, we generated a novel, parametrically controlled dataset using an embodied AI platform (OmniGibson). The dataset included 48 objects presented within either scene or phase-scrambled backgrounds, with parameters such as object distance and viewing angle, occlusion, scene lighting, and crowding varied to create both "simple" and "challenging" recognition tasks. Behavioral experiments assessed human recognition accuracy, while computational experiments evaluated conventional CNNs (e.g., AlexNet, ResNet), transformers (e.g., ViT), vision-language models (e.g., CLIP), and multimodal models. Behavioral results demonstrated that humans consistently recognized objects more accurately in scene contexts than against scrambled backgrounds, an advantage most pronounced under challenging conditions (70% vs. 54% accuracy for scene and scrambled conditions, respectively; p<0.001, two-sided signed-rank test). Computational results revealed that vision-only models failed to achieve human-level performance, even in simpler tasks (83.5% vs. 54% accuracy for humans and ResNet, respectively; p<0.001, two-sided signed-rank test). In contrast, models incorporating language supervision, such as CLIP (87.5% accuracy), or multimodal training approached human-level performance but still fell short under the most challenging conditions. These findings suggest that language-aligned models may embed object-scene relationships into their visual representations and utilize semantic relationships through language-aligned readout mechanisms. This underscores the importance of integrating contextual regularities into computational frameworks, suggesting that multimodal training paradigms incorporating language may better capture the dependencies inherent in real-world perception. Unique contributions of this study include the creation of a parametrically controlled naturalistic object-scene dataset and a direct comparison of humans and state-of-the-art computational models in their ability to recognize objects within scene contexts.
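As a point of reference for the class of language-supervised model compared here, the sketch below shows standard zero-shot classification with CLIP. The image path, label set, and prompt template are placeholders; the study's 48-object dataset and evaluation protocol are not reproduced.

```python
# Minimal zero-shot CLIP classification sketch (placeholder image and labels).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["a chair", "a table", "a lamp", "a sofa"]               # hypothetical categories
image = preprocess(Image.open("scene_object.png")).unsqueeze(0).to(device)
text = clip.tokenize([f"a photo of {l} in a room" for l in labels]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print({l: round(p.item(), 3) for l, p in zip(labels, probs[0])})
```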
Talk 5, 11:45 am
MOSAIC: An aggregated fMRI dataset for robust and generalizable vision research
Benjamin Lahner1, N. Apurva Ratan Murty2, Aude Oliva1; 1Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA, 2School of Psychology, Georgia Institute of Technology, Atlanta, GA, USA
Large-scale fMRI datasets are revolutionizing our understanding of the neural processes underlying human perception, driving new breakthroughs in neuroscience and computational modeling. Yet individual fMRI data collection efforts remain constrained by practical limitations on scan time, creating an inherent tradeoff between the number of subjects, stimuli, and stimulus repetitions. This tradeoff often compromises stimulus diversity, data quality, and the generalizability of findings, such that even the largest fMRI datasets cannot fully leverage the power of high-parameter artificial neural network models and high-dimensional feature spaces. To overcome these challenges, we introduce MOSAIC (Meta-Organized Stimuli And fMRI Imaging data for Computational modeling): a scalable framework for aggregating fMRI responses across multiple subjects and datasets. We preprocessed and registered eight event-related fMRI vision datasets (Natural Scenes Dataset, Natural Object Dataset, BOLD Moments Dataset, BOLD5000, Human Actions Dataset, Deeprecon, Generic Object Decoding, and THINGS) to the fsLR32k cortical surface space with fMRIPrep, yielding 426,245 fMRI-stimulus pairs from 93 subjects and 163,202 unique stimuli. We estimated single-trial beta values with GLMsingle (Prince et al., 2022), obtaining parameter estimates of similar or higher quality than the originally published datasets. Critically, we curated the dataset by eliminating stimuli with perceptual similarity above a defined threshold to prevent test-train leakage. This rigorous pipeline resulted in a well-defined stimulus-response dataset with 145,007 training stimuli, 18,145 test stimuli, and 50 synthetic stimuli, well suited for building and evaluating robust models of human vision. By unifying datasets under an identical preprocessing and registration pipeline, MOSAIC allows researchers to circumvent the limitations of individual datasets and address complex research questions with unprecedented scope. The framework is also extensible: new datasets can be seamlessly incorporated post hoc, enabling this meta-dataset to evolve alongside advances in experimental design and methodology. This adaptability empowers the neuroscience community to collaboratively generate a scalable, generalizable foundation for studying human vision.
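The kind of perceptual-similarity screening used to prevent test-train leakage can be sketched as below: candidate test stimuli whose similarity to any training stimulus exceeds a threshold are dropped. The embeddings, cosine similarity metric, and threshold are assumptions for illustration, not MOSAIC's actual curation pipeline.

```python
# Sketch: remove near-duplicate test stimuli based on feature similarity.
import numpy as np

rng = np.random.default_rng(3)
train_feats = rng.normal(size=(1000, 512))   # e.g., embeddings of training stimuli
test_feats = rng.normal(size=(200, 512))     # e.g., embeddings of candidate test stimuli
threshold = 0.9                              # hypothetical similarity cutoff

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

sim = normalize(test_feats) @ normalize(train_feats).T   # cosine similarity matrix
keep = sim.max(axis=1) < threshold                       # drop stimuli too close to training set
print(f"Kept {keep.sum()} of {len(test_feats)} candidate test stimuli")
```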
Talk 6, 12:00 pm
Weight-similarity Topographic Networks Improve Retinotopy and Noise Robustness
Nhut Truong1, Uri Hasson1; 1University of Trento
Typical deep neural networks (DNNs) lack spatial organization and a concept of unit adjacency. In contrast, topographic DNNs (TDNNs) spatially organize units, and are therefore potential spatio-functional models of cortical organization. In previous work, this spatial organization was achieved by adding a loss term that encourages adjacent neurons to exhibit similar activation patterns (activation-similarity, AS-TDNN). However, this optimization is not biologically grounded, and ideally, these correlations should arise naturally as a consequence of biologically motivated constraints. This led us to develop a new type of TDNN whose training is grounded in the biologically inspired principle that spatially adjacent units should have similar afferent (incoming) synaptic strengths, modeled by similar incoming weight profiles (weight-similarity, WS-TDNN). Using hand-written digit classification (MNIST) as a test domain, we compared the properties of AS-TDNNs, WS-TDNNs, and a control (non-topographic) DNN. Both AS-TDNNs and WS-TDNNs were tested under six different weighting levels for the spatial loss term. While all models achieved nearly identical classification accuracy, WS-TDNNs showed several advantages, including greater robustness to several types of noise, greater resistance to node ablation, and higher unit-level activation variance. Interestingly, WS-TDNNs produced higher correlations between adjacent units than AS-TDNNs, even though the latter were explicitly trained on this objective. Importantly, when tested using standard retinotopy protocols (i.e., rotating wedge and eccentric ring stimuli), WS-TDNNs, but not AS-TDNNs, naturally produced angular and eccentricity-based spatial tuning. This was evident in the smooth transitions in units’ preferred angles and in spatial grouping by preferred eccentricity. Moreover, these properties emerged naturally through end-to-end training, without the separate pre-optimization steps required in recent studies. These results were also replicated using the CIFAR-10 dataset for object recognition. Overall, our results suggest that TDNNs trained with weight-similarity constraints are viable computational models of visual cortical organization.
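A minimal sketch of a weight-similarity spatial loss of the kind described here is given below: units of a fully connected layer are arranged on a 2D grid, and differences between incoming weight vectors of adjacent units are penalized alongside the task loss. The grid size, neighborhood, and loss weighting are illustrative assumptions; the authors' exact formulation may differ.

```python
# Sketch: weight-similarity (WS) spatial loss for a topographic layer in PyTorch.
import torch
import torch.nn as nn

grid_h, grid_w, n_in = 10, 10, 784            # 100 hidden units on a 10x10 grid
layer = nn.Linear(n_in, grid_h * grid_w)

def weight_similarity_loss(weight, grid_h, grid_w):
    # weight: (grid_h * grid_w, n_in) incoming weights, one row per unit
    w = weight.view(grid_h, grid_w, -1)
    # penalize squared differences between vertically and horizontally adjacent units
    dv = (w[1:, :, :] - w[:-1, :, :]).pow(2).mean()
    dh = (w[:, 1:, :] - w[:, :-1, :]).pow(2).mean()
    return dv + dh

x = torch.randn(32, n_in)
task_loss = layer(x).pow(2).mean()            # stand-in for the classification loss
spatial_loss = weight_similarity_loss(layer.weight, grid_h, grid_w)
total_loss = task_loss + 0.1 * spatial_loss   # 0.1 is a hypothetical weighting level
total_loss.backward()
print(float(spatial_loss))
```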
Talk 7, 12:15 pm
Re-evaluating the ability of object-trained convolutional neural networks to classify 'out-of-distribution' images
Connor J. Parde1, Hojin Jang2, Frank Tong1,3; 1Psychology Department, Vanderbilt University, 2Department of Brain and Cognitive Engineering, Korea University, South Korea, 3Vanderbilt Vision Research Center, Vanderbilt University
Convolutional neural networks (CNNs) trained for object classification are severely impaired by almost any form of image degradation (e.g., visual noise, blur, phase-scrambling) unless they receive direct training (Geirhos et al., 2018; Jang, McCormack & Tong, 2021). Although the effect of direct training is not believed to generalize to untrained distortions, recent work has demonstrated that blur-trained CNNs accord better with human performance and exhibit higher overall performance than clear-trained CNNs (Jang & Tong, 2024). Here, we systematically evaluated the performance of CNNs trained and tested on object images that underwent different types of manipulation. We trained ResNet-50 and VGG-19 models on images from the ILSVRC12 dataset using either no manipulation, low-pass filtering, high-pass filtering, uniform noise, salt-and-pepper noise, or phase-scrambled noise. In addition, separate models were trained on either 1000-category or 16-category classification tasks and then tested with the full set of image manipulations. In all cases, 1000-category trained models outperformed their 16-category trained counterparts. In addition, all models performed best on images with the same type of manipulation that was present during training. There was no difference in generalizability between the 1000-class and 16-class trained models. However, the low-pass trained models outperformed their clear-trained counterparts on images with uniform noise and salt-and-pepper noise, and maintained similar performance to the clear-trained models for all other types of manipulation. Further, uniform-noise trained and salt-and-pepper-noise trained CNNs outperformed clear-trained CNNs on images with low-pass filtering, uniform noise, and salt-and-pepper noise. In addition, high-pass trained CNNs outperformed clear-trained CNNs on phase-scrambled images. Our results demonstrate that CNNs trained on degraded images exhibit some ability to generalize to images outside their training distribution. This underscores the importance of challenging or degraded stimuli for learning robust representations of visual categories.
This research was supported by NEI grants R01EY035157 to FT and P30EY008126 to the Vanderbilt Vision Research Center.
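The image degradations described in this abstract can be implemented as training-time transforms, as in the sketch below. The filter size and noise levels are illustrative assumptions, not the parameters used in the study.

```python
# Sketch: low-pass, uniform-noise, and salt-and-pepper degradations as transforms.
import torch
from torchvision import transforms

def uniform_noise(img, amount=0.2):
    # img: float tensor in [0, 1]; add zero-mean uniform pixel noise
    return (img + amount * (torch.rand_like(img) - 0.5)).clamp(0, 1)

def salt_and_pepper(img, prob=0.05):
    mask = torch.rand_like(img)
    img = img.clone()
    img[mask < prob / 2] = 0.0               # pepper
    img[mask > 1 - prob / 2] = 1.0           # salt
    return img

low_pass_train = transforms.Compose([
    transforms.ToTensor(),
    transforms.GaussianBlur(kernel_size=9, sigma=4.0),   # approximate low-pass filter
])
noise_train = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda img: salt_and_pepper(uniform_noise(img))),
])

# Quick check on a random stand-in image tensor (PIL images would enter via ToTensor)
dummy = torch.rand(3, 224, 224)
print(salt_and_pepper(uniform_noise(dummy)).shape)
```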