Object Recognition: Categories and neural mechanisms
Talk Session: Saturday, May 17, 2025, 10:45 am – 12:30 pm, Talk Room 1
Talk 1, 10:45 am
Beyond Button Presses: Larger Motor Actions Facilitate Visual Category Learning
Luke Rosedahl1, Takeo Watanabe1; 1Brown University
Visual category learning has traditionally been viewed as a primarily perceptual-cognitive process, with popular models like the generalized context model, ALCOVE, and SUSTAIN treating it as largely independent of motor systems. Our research challenges this perspective by demonstrating that the magnitude of the motor response in visual categorization tasks significantly affects learning. Participants categorized Gabor discs in a virtual reality paradigm using either controller triggers or lightsaber swings. We tested two category structures: Rule-Based (RB), requiring binary decisions along visual feature dimensions, and Information-Integration (II), requiring integration of multiple feature dimensions (Rosedahl, Eckstein, and Ashby, 2018). Based on evidence that II structures engage basal ganglia systems (Ashby and Ennis, 2006), we hypothesized that larger motor movements might enhance II category learning. In our initial experiment, participants (N=72) were evenly divided across four conditions: RB-Button, RB-Lightsaber, II-Button, and II-Lightsaber. While RB performance remained consistent across response types (t = -.12, p = .91), II category learning significantly improved with lightsaber swings (t = 2.7, p = .006). To determine whether the benefit came from the larger movement, from interacting with the stimulus, or from the longer stimulus presentation times in the swing condition, we conducted a follow-up experiment in which participants (N=30) punched a response box or used the controller triggers to learn the II categories with a fixed stimulus presentation time. The punch group showed significantly faster learning (t = 3.19, p = .002) and higher final-block performance (90% vs. 80%; t = 2.80, p = .01), indicating that large movements enhance category learning. These findings challenge traditional views on visual category learning and highlight how motor engagement shapes visual pattern classification, particularly for complex categories. This suggests the need to revise existing theories of visual category learning to account for motor system involvement, possibly through enhanced engagement of basal ganglia circuits.
This work was supported by the National Eye Institute of the National Institutes of Health under award numbers [K99EY034891, R01EY019466, R01EY027841, and R01EY031705] and NSF-BSF under award number [BCS2241417].
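For readers who want to make the reported group comparisons concrete, the following is a minimal sketch of the kind of between-group test described in the abstract (e.g., II-Lightsaber vs. II-Button final-block accuracy). All variable names and values are placeholders, not the authors' data or analysis code.

```python
# Hypothetical sketch of a between-group accuracy comparison
# (e.g., II-Lightsaber vs. II-Button); data are simulated placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
acc_button = rng.normal(0.80, 0.08, size=18)      # final-block accuracy, button group
acc_lightsaber = rng.normal(0.90, 0.08, size=18)  # final-block accuracy, swing group

t, p = stats.ttest_ind(acc_lightsaber, acc_button)
print(f"t = {t:.2f}, p = {p:.3f}")
```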
Talk 2, 11:00 am
Object Blindsight: Functional and neural processing of unseen visual objects in the cortically blind field
Jessica M. Smith1, Bradford Z. Mahon1; 1Carnegie Mellon University
Cortical blindness refers to the loss of the phenomenal experience of seeing that is caused by lesions along the geniculostriate pathway. In blindsight, visual inputs can continue to be processed by extra-geniculostriate and geniculo-extrastriate pathways that bypass the lesion, and can affect behavior in the absence of awareness. Previous studies have investigated blindsight principally using simple stimuli, such as Gabors or simple shapes defined by chromatic or luminance contrast. Here, we set out to test functional and neural signatures of visual object processing in the cortically blind field. Hemianopic participants (n = 2) and controls (n = 54) completed a series of visual psychophysics and fMRI tasks in which neutral faces, fearful faces, and familiar graspable objects (e.g., a pencil) were presented in seeing and blind regions of the visual field. The psychophysics studies used the redundant target paradigm, in which participants respond with a button press every time they see anything appear; it is classically observed that participants are faster to respond to redundant visual targets than to a single visual onset. While healthy controls exhibited classic redundancy gains, a pattern of redundancy loss was observed when redundant stimuli were presented in the hemianopic field. This redundancy loss indicates that visual stimuli presented in the hemianopic field are processed. Both hemianopic participants had lesions affecting V1, but one participant’s lesion involved the dorsal occipital cortex, while the other participant’s lesion spared this region. We found that graspable objects (tool images) presented in the hemianopic field significantly activated posterior intraparietal sulcus (IPS) only in the patient with spared dorsal occipital cortex. These observations, together with other findings in the field, indicate that non-geniculostriate pathways into the dorsal stream automatically compute the “graspability” of visual objects, even in the absence of an explicit goal to act.
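To illustrate the redundant-target analysis described above, the sketch below computes a per-condition redundancy gain as the mean single-target reaction time minus the mean redundant-target reaction time; a negative value corresponds to the redundancy loss reported for the hemianopic field. The reaction times here are placeholders, not the study's data.

```python
# Hypothetical sketch of a redundancy-gain computation for the redundant
# target paradigm: gain = mean RT(single target) - mean RT(redundant targets).
# A positive value is the classic redundancy gain; a negative value is a loss.
import numpy as np

rt_single = np.array([412, 398, 430, 405])     # ms, single-onset trials (placeholder)
rt_redundant = np.array([455, 441, 470, 452])  # ms, redundant-target trials (placeholder)

gain = rt_single.mean() - rt_redundant.mean()
print(f"redundancy gain = {gain:.1f} ms ({'gain' if gain > 0 else 'loss'})")
```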
Talk 3, 11:15 am
Spatiotemporal dynamics induced by rapid perceptual learning in the human brain at single-neuron resolution
Marcelo Armendariz1,2, Julie Blumberg3, Jed Singer1, Franz Aiple3, Jiye Kim1, Peter Reinacher3, Andreas Schulze-Bonhage3, Gabriel Kreiman1,2; 1Boston Children’s Hospital, Harvard Medical School, Boston, MA, USA, 2Center for Brains, Minds and Machines, Cambridge, MA, USA, 3University Medical Center Freiburg, University of Freiburg, Germany
Humans can swiftly learn to recognize visual objects after just one or a few exposures. A striking example of rapid learning is the sudden recognition of a degraded black-and-white image of an object (Mooney image). These degraded Mooney images are initially unrecognizable. However, Mooney images become easily interpretable after a brief exposure to the original intact version of the image. This rapid learning process necessitates the formation of enduring neural signatures to enable subsequent recognition. Despite extensive behavioral characterization, the neuronal mechanisms underlying perceptual changes induced by rapid learning in the human brain are not well understood. Here, we recorded the spiking activity of neurons in medial occipital and temporal regions of the human brain in patients performing an image recognition task that involved rapid learning of degraded two-tone Mooney images. Neurons in the occipital cortex (OC) and medial temporal lobe (MTL) modulated their firing patterns to encode the identity of recently learned images. Population decoding revealed that occipital neurons resolved the identity of learned images at the cost of additional processing time, with delayed responses observed in MTL neurons. Our findings suggest that OC may not rely on feedback from MTL to support recognition following rapid perceptual learning. Instead, learning-induced dynamics observed in OC may reflect extensive recurrent processing, potentially involving top-down feedback from higher-order cortical areas, before signals reach the MTL. These results highlight the need for further computation beyond bottom-up visual input representations to facilitate recognition after learning and provide spatiotemporal constraints for computational models incorporating such recurrent mechanisms.
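The population decoding analysis described above can be illustrated with a cross-validated classifier applied to spike counts in sliding time windows; the time at which decoding rises above chance gives a latency estimate per region (e.g., OC vs. MTL). The code below is a generic sketch using simulated data, not the authors' recording or analysis pipeline.

```python
# Generic sketch of time-resolved population decoding: a cross-validated
# linear classifier on spike counts in sliding windows (simulated data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_trials, n_neurons, n_bins = 200, 50, 20          # placeholder dimensions
labels = rng.integers(0, 10, size=n_trials)        # image identity per trial
spikes = rng.poisson(2.0, size=(n_trials, n_neurons, n_bins)).astype(float)
spikes[..., 10:] += labels[:, None, None] * 0.1    # inject identity signal in late bins

for t in range(n_bins):
    # Decode image identity from the population response in each time bin.
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          spikes[:, :, t], labels, cv=5).mean()
    print(f"bin {t:2d}: decoding accuracy = {acc:.2f}")
```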
Talk 4, 11:30 am
Triple-N Dataset: Non-human Primate Neural Responses to Natural Scenes
Yipeng Li1, Jia Yang1, Wei Jin1, Wanru Li1, Baoqi Gong1, Xieyi Liu1, Kesheng Wang1, Jingqiu Luo1, Pinglei Bao1; 1Peking University
Understanding the neural mechanisms of visual perception requires data that encompass both large-scale cortical activity and the finer details of single-neuron dynamics. The Natural Scenes Dataset (NSD) has provided substantial insights into visual processing in humans (Allen et al., 2022), yet its reliance on functional magnetic resonance imaging (fMRI) limits the exploration of individual neuron contributions. To bridge this gap, we present a new dataset, the Triple-N dataset, which extends the NSD framework to non-human primates, incorporating single-neuron activity and local field potentials recorded from the inferotemporal (IT) cortex. Using Neuropixels probes, we recorded neuronal responses while macaques passively viewed the 1,000 shared NSD images. Over 60 sessions across 15 sub-regions within the IT cortex were recorded from 5 macaques, capturing over 14,000 neural units with good reliability (split-half correlation > 0.4), including approximately 2,000 single neurons. Many recordings were obtained from fMRI-defined category-selective regions, such as face-, body-, scene-, and color-selective areas. Our dataset enables in-depth exploration of neural responses at multiple levels, from population dynamics to single-neuron activity, offering new insights into the spatiotemporal aspects of visual processing. First, most neurons within category-selective regions exhibit similar tuning properties, but a subset of neurons shows responses that cannot be explained by the population response, reshaping our understanding of the neuronal composition within category-selective areas. Second, with neuron-voxel mapping, our dataset provides a foundation for cross-species comparisons and alignment between human and macaque visual processing. Furthermore, given the diversity in the dynamic changes of neuronal responses and encoding features in the IT cortex, our dataset facilitates the development of computational models of the high-level visual system that emphasize the temporal characteristics of visual processing. Overall, our dataset serves as a valuable resource for advancing our understanding of visual perception and bridging the gap between large-scale neuroimaging and fine-grained electrophysiological signals.
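The split-half reliability criterion mentioned above (split-half correlation > 0.4) can be computed per unit roughly as sketched below: correlate mean responses to the same images across odd versus even repetitions. This is only an illustration of the metric with simulated data, not the dataset's actual preprocessing.

```python
# Illustrative split-half reliability per unit: correlate mean responses to the
# same images across odd vs. even repetitions (simulated data, placeholder sizes).
import numpy as np

rng = np.random.default_rng(2)
n_units, n_images, n_reps = 100, 1000, 6
resp = rng.normal(size=(n_units, n_images, n_reps)) + \
       rng.normal(size=(n_units, n_images, 1))       # shared image-driven signal

odd = resp[:, :, 0::2].mean(axis=2)                  # mean over odd repetitions
even = resp[:, :, 1::2].mean(axis=2)                 # mean over even repetitions
r = np.array([np.corrcoef(o, e)[0, 1] for o, e in zip(odd, even)])
print(f"{(r > 0.4).sum()} / {n_units} units pass the 0.4 split-half criterion")
```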
Talk 5, 11:45 am
Controlling for everything: Canonical size effects with identical stimuli
Chaz Firestone1, Tal Boger1; 1Johns Hopkins University
Among the most impressive effects in recent vision science are those associated with “canonical size”. When a building and a rubber duck occupy the same number of pixels on a display, the mind nevertheless encodes the real-world size difference between them. Such encoding occurs automatically, organizes neural representations, and drives higher-order judgments. However, objects that differ in canonical size also differ in many mid- and low-level visual properties; this makes it difficult—and seemingly impossible—to isolate canonical size from its covariates (which are known to produce similar effects on their own). Can this challenge be overcome? Here, we leverage a new technique called “visual anagrams”, which uses diffusion models to generate static images whose interpretations change with image orientation. For example, such an image may look like a rabbit in one orientation and an elephant when upside-down. We created a stimulus set of visual anagrams whose interpretations differed in canonical size; each image depicted a canonically large object in one orientation but a canonically small object when rotated, while being pixel-wise identical in every other respect. Six experiments show that most (though not all) canonical size effects survive such maximal control. Experiments 1–2 tested Stroop effects probing the automaticity of canonical size encoding; consistent with previous findings, subjects were faster to correctly judge the onscreen size of an object when its canonical size was congruent with its onscreen size. Experiments 3–4 tested effects on viewing-size preferences; consistent with previous findings, subjects chose larger views for canonically larger objects. Experiments 5–6 tested efficient visual search when targets differed from distractors in canonical size; departing from previous findings, we found no such search advantage. This work not only applies a long-awaited control to classic experiments on canonical size, but also presents a case study of the usefulness of visual anagrams for vision science.
NSF BCS 2021053, NSF GRFP
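For the Stroop-type experiments above, the congruency effect is simply the reaction-time difference between incongruent and congruent trials (onscreen size vs. canonical size). The sketch below computes a per-subject effect and a one-sample test against zero; all numbers are placeholders, not the study's data.

```python
# Hypothetical sketch of a canonical-size Stroop analysis: per-subject
# congruency effect (incongruent RT minus congruent RT), then a one-sample test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
rt_congruent = rng.normal(520, 40, size=30)            # ms, placeholder subject means
rt_incongruent = rt_congruent + rng.normal(25, 20, size=30)

effect = rt_incongruent - rt_congruent
t, p = stats.ttest_1samp(effect, 0.0)
print(f"mean congruency effect = {effect.mean():.1f} ms, t = {t:.2f}, p = {p:.3f}")
```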
Talk 6, 12:00 pm
Comparing the multidimensional mental and neural representations of object words and object images
Tonghe Zhuang1,2*, Laura M. Stoinski2,3,4*, Ming Zhou5, Martin N. Hebart1,2,6; 1Department of Medicine, Justus Liebig University, Giessen 35390, Germany, 2Max Planck Institute for Human Cognitive & Brain Sciences, Leipzig 04103, Germany, 3University of Leipzig, Leipzig 04103, Germany, 4International Max Planck Research School on Cognitive NeuroImaging (IMPRS CoNI), 5Beijing Normal University, China, 6Center for Mind, Brain and Behavior, Universities of Marburg, Giessen and Darmstadt, Germany, * equal contribution
Understanding the dimensions underlying mental and neural representations of objects is crucial for uncovering the mechanisms that link perception to semantic knowledge. However, vision and semantics are strongly interrelated, and the degree to which seemingly semantic dimensions (e.g., animacy) can also be explained by visual features (e.g., curvature) has made it challenging to tease apart the effects of vision and semantics in behavior and brain activations. To address this challenge, we examined the mental and neural representations of 1,388 object words and compared them to the representations of matched object images, thereby disentangling the contribution of visual-perceptual features evoked by images from visual-semantic features evoked by words. To this end, we first collected 1.3 million perceived similarity judgments of 1,388 object nouns in a large online sample, which allowed us to identify 50 object dimensions specifically related to words. These word dimensions showed strong overlap with 49 dimensions previously identified for object images (Hebart et al., 2020), but were restricted to high-level semantic information and object shape and did not include color or texture information. This highlights the importance of using images for evoking relevant visual dimensions in similarity judgments. Next, to examine the neural representations of these dimensions, we collected a densely sampled fMRI dataset of 480 object images and 960 matched image pairs in five participants across 15 sessions (4 word-based sessions, 8 image-based sessions, 3 localizer and structural sessions). By mapping the dimensions of words and images to brain activity patterns, we were able to identify the cortical regions related to the mental representation of object images, object words, and their overlap. Together, this work highlights the interplay of vision and semantics in mental and neural object representations and establishes a large, multimodal dataset to support future research on the intersection of vision, semantics, and neural representation.
This work was supported by SFB-TRR135 “Sonderforschungsbereich SFB/Transregio TRR”, ERC Starting Grant project COREDIM (ERC-StG-2021-101039712) and the Hessian Ministry of Higher Education, Science, Research and Art (LOEWE Start Professorship to M.N.H. and Excellence Program ‘The Adaptive Mind’).
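Mapping behavioral dimensions to brain activity, as described above, is commonly done with a voxel-wise linear encoding model. The sketch below regresses voxel responses onto a stimulus-by-dimension embedding with cross-validated ridge regression; shapes and names are placeholders under this assumption, not the authors' dataset or pipeline.

```python
# Generic sketch of a voxel-wise encoding model: predict voxel responses from
# a stimulus-by-dimension embedding via cross-validated ridge regression.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n_stimuli, n_dims, n_voxels = 480, 50, 2000          # placeholder sizes
embedding = rng.random((n_stimuli, n_dims))          # stimulus x dimension weights
betas = rng.normal(size=(n_dims, n_voxels))
voxels = embedding @ betas + rng.normal(size=(n_stimuli, n_voxels))

X_tr, X_te, y_tr, y_te = train_test_split(embedding, voxels, random_state=0)
model = RidgeCV(alphas=np.logspace(-2, 4, 13)).fit(X_tr, y_tr)
pred = model.predict(X_te)
r = [np.corrcoef(pred[:, v], y_te[:, v])[0, 1] for v in range(n_voxels)]
print(f"median prediction correlation across voxels: {np.median(r):.2f}")
```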
Talk 7, 12:15 pm
Object dimensions underlying food selectivity in visual cortex
Davide Cortinovis1, Giulia Orlandi1, Lotte Van Campenhout1,2, Stefania Bracci1; 1University of Trento, Italy, 2KU Leuven, Belgium
The occipitotemporal cortex (OTC) has traditionally been viewed as functionally organized into category-selective areas, such as those responding to faces, body parts, and scenes. More recent studies using the Natural Scenes Dataset identified food-selective areas adjacent to face-selective areas (in both lateral and medial OTC), independent of basic visual features like shape, texture, or color. However, other evidence found overlapping activations between food and tool responses, suggesting that food selectivity could be better understood through a dimensional framework that emphasizes shared properties like manipulability. Our study explored the dimensions underlying food-selective areas in the OTC using fMRI and a stimulus set including images of faces, bodies, hands, food, tools, manipulable objects, scenes, and spiky meaningless objects. For the food and object categories, both grayscale and colored images in different configurations were presented to assess the roles of visual (e.g., color, clutter-complexity) and action-related (e.g., graspability, effector-specificity) properties. Our localizer identified two distinct food-selective clusters in OTC: one medial, localized between regions selective for faces and scenes, and one lateral, partially overlapping with regions selective for tools and manipulable objects. In lateral OTC, no significant overlap was found between hand and food selectivity. However, we replicated the previously known hand-tool overlap, indicating that tools and hands share effector-specific information that is absent for food. Moreover, visual properties like object clutter and, to a lesser extent, color contributed to the representations in the medial (but not lateral) food cluster. Finally, computational models of visual cortex topography only partially captured the observed organization of food-selective areas, showing a similar representation of visual properties but no organization based on action information. Overall, our results show that food responses in OTC may be better understood in light of a dimensional framework that considers both the visual and the action-related properties of food, going beyond a category-centric framework.
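The overlap analyses above (e.g., food versus hand selectivity in lateral OTC) can be summarized with a simple Dice coefficient between thresholded selectivity maps. The sketch below uses simulated binary maps and is only an illustration of the metric, not the study's statistical procedure.

```python
# Minimal sketch of quantifying overlap between two category-selective maps
# (e.g., food-selective vs. hand-selective voxels) with a Dice coefficient.
import numpy as np

rng = np.random.default_rng(5)
food_sel = rng.random(5000) > 0.9    # placeholder binary selectivity map
hand_sel = rng.random(5000) > 0.9    # placeholder binary selectivity map

dice = 2 * np.sum(food_sel & hand_sel) / (food_sel.sum() + hand_sel.sum())
print(f"Dice overlap = {dice:.2f}")
```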