3D Processing
Talk Session: Saturday, May 17, 2025, 8:15 – 9:45 am, Talk Room 2
Talk 1, 8:15 am
Transformer models better account for human 3D shape recognition than convolution-based models
Shuhao Fu1, Philip Kellman1, Hongjing Lu1; 1University of California, Los Angeles
Both humans and deep learning models can recognize objects from 3D shapes depicted with sparse visual information, such as a set of points randomly sampled from the surfaces of 3D objects (termed a point cloud). Although deep learning models achieve human-like performance in recognizing objects from point-cloud displays, it is unclear whether these models acquire 3D shape representations aligned with those used by humans. We conducted three human experiments by varying point density and object orientation (Experiment 1), perturbing the local geometric structure of 3D shapes (Experiment 2), and manipulating the global configuration by scrambling parts (Experiment 3). Results showed that humans rely on global 3D shape representations for object recognition, maintaining robust performance even when local features were disrupted. We tested two deep learning models with different architectures: a convolution-based model (DGCNN) and a vision transformer model (Point Transformer), comparing their performance with humans across the three experiments. The convolution-based DGCNN model relied heavily on local geometric features and was highly susceptible to adversarial perturbations of local geometry. In contrast, the transformer-based model exhibited human-like reliance on global shape representations. Ablation simulations revealed that the global shape representations in the transformer model originated primarily from its downsampling mechanism, which explicitly operationalizes a fine-to-coarse and local-to-global process, analogous to the increasing receptive field sizes of neurons in the human visual system. Secondary contributions stem from position encoding, which maintains spatial information across layers, and attention mechanisms, which adaptively weight contextual information based on similarity. These findings highlight the key computational mechanisms needed to bridge the gap between human and machine representations in 3D object recognition.
NSF BCS 2142269
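The abstract does not specify which downsampling operation carried the fine-to-coarse effect; a minimal sketch of one common choice in point-transformer-style architectures, farthest point sampling, is given below for illustration. Function and variable names are hypothetical and not taken from the models tested.

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Greedily pick points that are maximally spread out (illustrative sketch).

    points: (N, 3) array of 3D coordinates from a point cloud.
    n_samples: number of points kept in the coarser layer.
    Returns indices of the selected points.
    """
    n_points = points.shape[0]
    selected = np.zeros(n_samples, dtype=int)
    # Distance from every point to the nearest already-selected point.
    min_dist = np.full(n_points, np.inf)
    selected[0] = 0  # start from an arbitrary point
    for i in range(1, n_samples):
        d = np.linalg.norm(points - points[selected[i - 1]], axis=1)
        min_dist = np.minimum(min_dist, d)
        selected[i] = int(np.argmax(min_dist))
    return selected

# Example: reduce a 1024-point cloud to 256 points (one coarse stage).
cloud = np.random.rand(1024, 3)
coarse_cloud = cloud[farthest_point_sampling(cloud, 256)]
```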
Talk 2, 8:30 am
Computational Models Exhibit Invariance and Multistability in Shape from Shading
Xinran Han1, Ko Nishino2, Todd Zickler1; 1Harvard University, 2Kyoto University
Shape perception from a single shaded image is both unique and inherently ambiguous. It is unique in that a single geometric representation – the curvature field – explains the underlying shape. At the level of surface orientation, however, it is not unique – multiple valid shape solutions exist. Humans experience this ambiguity as multistable perception, where shape from a single image can be alternately interpreted as either convex or concave. We develop two computational methods to model these phenomena, each focusing on a different output representation. Our first method derives a unique representation underlying the observed shading, specifically the log-Casorati curvature field, and demonstrates that it can be inferred from shading and remains robust under variations in lighting and surface albedo. This invariant representation provides a unified geometric description of the physically ambiguous observation, laying the foundation for shape perception under varying conditions. Our second method employs a diffusion model, a generative neural network, to recover multiple explanations of the surface shape by aggregating putative interpretations from local patches. This model offers a bottom-up mechanism for generating diverse surface orientation proposals, which can later be integrated with top-down queries for further refinement, and may be closer to biological realizations. Both methods are designed to operate without explicit knowledge of lighting direction, inspired by hypotheses suggesting that shape perception often precedes lighting inference. Their reliance on a shift-invariant, bottom-up architecture allows for efficient training on small synthetic datasets while generalizing well to novel scenarios. Our experiments show that these models can mimic ambiguities in shape from shading, including multistable phenomena like the crater illusion. Moreover, our studies highlight shortcomings in prior computational models, which typically produce only a single "best" interpretation. Together, these models provide insight into human perception by demonstrating how invariant and multistable shape representations can emerge from ambiguous inputs.
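The log-Casorati curvature field referenced above has a standard geometric definition in terms of principal curvatures; the brief sketch below uses that textbook definition, not the authors' implementation, and also illustrates why the field is invariant to the convex/concave ambiguity.

```python
import numpy as np

def log_casorati(k1, k2, eps=1e-8):
    """Log of the Casorati curvature C = sqrt((k1^2 + k2^2) / 2).

    k1, k2: arrays of principal curvatures over the surface.
    eps: small constant to keep the log finite on planar regions.
    """
    casorati = np.sqrt((k1 ** 2 + k2 ** 2) / 2.0)
    return np.log(casorati + eps)

# A convex and a concave interpretation of the same shading have
# sign-flipped principal curvatures, (k1, k2) vs (-k1, -k2), yet an
# identical Casorati curvature, so the field is unaffected by the
# convex/concave (crater-illusion) ambiguity.
k1, k2 = 0.4, 0.1
assert np.isclose(log_casorati(k1, k2), log_casorati(-k1, -k2))
```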
Talk 3, 8:45 am
Perceived 3D shape of mirror-like objects: interactions of monocular and binocular cues
Celine Aubuchon1, Emily A-Izzeddin1, Fulvio Domini1,2, Alexander Muryy, Roland W. Fleming1,3; 1Justus Liebig University Giessen, Germany, 2Brown University, 3Center for Mind, Brain and Behavior, Universities of Marburg and Giessen, Germany
When viewed binocularly, purely specular (mirror-like) objects are bizarre and quite fascinating. They present a naturally occurring case of major cue conflict, with monocular and binocular shape cues decoupled almost everywhere in the image. This occurs because reflections are viewpoint dependent. With viewpoint changes (e.g., from left- to right-eye view), specular reflections can slide across the surface, change shape, or disappear altogether. As a result, while ‘normal’ features, such as texture markings, 3D corners, or diffuse shading, obey epipolar geometry and create accurate depth signals, the 2D vector fields describing the interocular shifts of corresponding specular features can have arbitrary sizes and directions. This leads to both unfusible regions and spurious depth signals that are incompatible with monocular shape cues. We created 3D “blob” objects, reconstructed their interocular vector fields, and used them to investigate how the brain deals with these peculiar conflicts. Specifically, we had participants make judgments about the 3D depths and orientations of points on objects with different combinations of cues derived from mirror reflections. We teased apart monocular and binocular cues by shifting image content along the vector field to recreate the binocular characteristics of mirrors while independently varying the information provided by monocular cues. In a control condition, we also ‘painted’ the reflections onto the surface in depth to remove the naturally occurring cue conflict. Our results show that observers’ perception of 3D shape for mirrored objects, when monocular cues are available, correlates with true object shape despite significant biases in perceived shape when mirror disparities are presented alone. We find (1) that the visual system discriminates between “good” and “bad” disparities in determining object shape and (2) that monocular cues are used to constrain perceived shape from specular disparity fields.
Supported by DFG, project number 222641018–SFB/TRR 135 TP C1, the ERC Advanced Grant "STUFF" (project number ERC-2022-AdG-101098225); and the Research Cluster "The Adaptive Mind" funded by the Hessian Ministry for Higher Education, Research, Science and the Arts.
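The stimulus manipulation described above, shifting one eye's image content along a per-pixel vector field, can be sketched with a generic image-warping routine. The code below is an illustrative stand-in, not the authors' stimulus pipeline; the use of scipy and all names are assumptions.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_along_vector_field(image, flow):
    """Shift image content along a per-pixel displacement field.

    image: (H, W) grayscale image rendered for one eye.
    flow:  (H, W, 2) displacement (dy, dx) at each pixel, e.g. an
           interocular vector field measured between eye views.
    Returns the warped image, sampled with bilinear interpolation.
    """
    h, w = image.shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([yy + flow[..., 0], xx + flow[..., 1]])
    return map_coordinates(image, coords, order=1, mode="nearest")

# Example: a horizontal shift gradient as a crude stand-in for a
# disparity field, applied to a random test image.
img = np.random.rand(128, 128)
field = np.zeros((128, 128, 2))
field[..., 1] = np.linspace(0, 3, 128)[None, :]  # shift grows left to right
right_eye_view = warp_along_vector_field(img, field)
```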
Talk 4, 9:00 am
3D slant discrimination in motion parallax depends on retinal gradients and not distal slant
Jovan Kemp1, James Todd2, Fulvio Domini1; 1Brown University, 2The Ohio State University
The problem of slant-from-motion in human perception is often thought to be solved by inverting the retinal velocities using estimates of ancillary properties such as the speed and direction of the object in motion. However, previous evidence suggests that shape estimates emerge directly from retinal velocity gradients through a simple heuristic, rather than solving for parameters to perform inverse geometry. Consequently, slant metamers, in which two distinct surfaces are perceived to have the same slant, can be produced by ensuring that the retinal projections of the two surfaces produce the same retinal velocity gradients. Here we tested this hypothesis by requiring participants to perform (1) a probe adjustment task, which allowed us to measure absolute perceived slant, and (2) a 2-interval forced choice (2-IFC) task, which allowed us to measure discrimination performance for rotating planar surfaces. In the probe adjustment task, participants adjusted a 2-dimensional probe to report their perceived slant relative to surfaces that could have one of three slants and rotate to produce one of three retinal velocity gradients. We find that the perceived slant reflects changes in retinal velocity gradients, rather than the surface slant. In the 2-IFC task, participants judged which of two surfaces presented in separate intervals was more slanted. The fixed standard surface could have one of two slants and rotate to produce one of two retinal velocity gradients. Critically, the variable comparison was chosen to produce a specified retinal velocity gradient either by changing the slant or changing the rotational speed. We find that discrimination thresholds, like perceived slant, are determined by the retinal gradients, rather than by the slant or rotational velocity alone. Taken together, we show compelling evidence that perceived slant-from-motion is largely determined by simple heuristics using retinal information, rather than by more complex computations.
This research was funded by NSF #2120610 to FD.
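The metamer construction described above can be illustrated under a simplifying assumption: to first order, the retinal velocity gradient of a planar surface rotating about a frontoparallel axis grows with both its slant and its angular velocity, so different slant/speed pairs can be matched. The sketch below uses that approximation for illustration only; it is not the authors' stimulus equation, and all names are hypothetical.

```python
import numpy as np

def velocity_gradient(slant_deg, omega_deg_per_s):
    """First-order retinal velocity gradient of a rotating plane.

    Assumption (small-angle approximation): the gradient is roughly
    proportional to tan(slant) * angular velocity.
    """
    return np.tan(np.radians(slant_deg)) * np.radians(omega_deg_per_s)

def metamer_rotation(slant_std, omega_std, slant_cmp):
    """Rotation speed making a comparison surface of a different slant
    produce the same retinal velocity gradient as the standard."""
    target = velocity_gradient(slant_std, omega_std)
    return np.degrees(target / np.tan(np.radians(slant_cmp)))

# Example: a 30 deg surface rotating at 10 deg/s and a 50 deg surface
# rotating at the matched speed below yield equal velocity gradients,
# and the heuristic account predicts they appear equally slanted.
omega_match = metamer_rotation(30, 10, 50)
```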
Talk 5, 9:15 am
Uncertainty in perceptual priors of sports balls' sizes scales independently of experience
Constantin Rothkopf1, Nils Neupaertl1; 1Technical University of Darmstadt
Human perception resolves ambiguities and uncertainties by combining sensory measurements with prior knowledge. Bayesian models of perception have posited that these priors are well-calibrated to the natural environment through experience, so their accuracy and precision should increase with experience. Here, we measured participants’ prior beliefs about the sizes of three sports balls with which they had different degrees of prior experience: tennis ball, baseball, and soccer ball. None of our European participants had ever interacted with a baseball. Sixteen participants viewed pairs of sports balls through an Oculus VR headset in a carefully calibrated scene of the actual laboratory. The experimental procedure used the so-called Markov Chain Monte Carlo with humans paradigm to elicit participants' priors over the balls' sizes. Participants repeatedly selected which of the two balls displayed on a trial looked more realistic in a two-alternative forced-choice decision. The simultaneously shown balls were of the same type and differed only in size. Balls’ sizes in subsequent trials were then sampled probabilistically relative to the preceding choice. This procedure has been shown to elicit and accurately measure individuals’ perceptual priors. Participants’ priors showed different degrees of bias and uncertainty. However, the individual differences in biases across balls were highly correlated within participants. Moreover, the range of biases across subjects scaled linearly with their mean. Surprisingly, the uncertainty in the prior belief, as quantified by its standard deviation, was not the largest for the balls with which participants had the least experience. Instead, the standard deviation scaled linearly with the mean of the size’s prior belief. These results show that perceptual priors for sports ball sizes are not determined only by prior experience. Instead, these results suggest that people have structured size priors across different objects and that the priors’ uncertainty is lawfully related to the prior mean.
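The trial logic of the Markov Chain Monte Carlo with humans paradigm, in which the participant's 2AFC choice plays the role of the acceptance step and the chain of chosen sizes approximates samples from the prior, can be sketched as below. Proposal width, starting value, and the simulated observer are illustrative assumptions, not the study's parameters.

```python
import numpy as np

def mcmc_with_people_chain(ask_participant, start_size_cm,
                           n_trials=100, proposal_sd_cm=0.5):
    """Run one MCMC-with-people chain over a ball's size.

    ask_participant: callback taking (size_a, size_b) in cm and
        returning whichever size was judged more realistic.
    Returns the chain of chosen sizes, which approximates samples
    from the participant's prior over that ball's size.
    """
    rng = np.random.default_rng()
    current = start_size_cm
    chain = []
    for _ in range(n_trials):
        proposal = current + rng.normal(0.0, proposal_sd_cm)
        # The 2AFC judgment acts as the probabilistic acceptance step.
        current = ask_participant(current, proposal)
        chain.append(current)
    return np.array(chain)

# Example with a simulated observer whose prior over tennis-ball size
# is Normal(6.7 cm, 0.3 cm); choices follow Luce's choice rule on the
# prior density, so the chain converges toward that prior.
def simulated_observer(a, b, mu=6.7, sd=0.3):
    rng = np.random.default_rng()
    pa = np.exp(-0.5 * ((a - mu) / sd) ** 2)
    pb = np.exp(-0.5 * ((b - mu) / sd) ** 2)
    return a if rng.random() < pa / (pa + pb) else b

samples = mcmc_with_people_chain(simulated_observer, start_size_cm=8.0)
prior_mean, prior_sd = samples.mean(), samples.std()
```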
Talk 6, 9:30 am
Navigating in a realistic VR environment with central field loss
Jade Guénot1, Preeti Verghese1; 1Smith-Kettlewell Eye Research Institute
Impaired depth perception due to central field loss (CFL) significantly impacts mobility and obstacle avoidance and increases fall risk. We explored obstacle avoidance strategies in CFL patients and control participants under various visual conditions, using a virtual reality environment (HTC Vive Pro Eye headset in conjunction with the PTVR Python toolbox). Participants navigated through a realistic living room to retrieve keys while avoiding or stepping over boxes of varying heights (5-40 cm). Two conditions were tested: an easy condition with six easily avoidable boxes (>15 cm high) and a challenging condition with twelve boxes of mixed heights, including small boxes that were easier to step over than to go around. Participants performed the task binocularly and monocularly. Control participants completed additional trials with and without an artificial gaze-contingent scotoma. We recorded head, hand, and foot positions and rotations, measuring velocity, path tortuosity, foot clearance, and collision frequency, along with gaze behavior through integrated eye tracking. Preliminary results revealed distinct navigation strategies between patients with CFL and controls. CFL patients exhibited significantly slower walking velocities (especially in the monocular condition) and higher path tortuosity ratios, indicating a preference for going around obstacles rather than stepping over them. When stepping over boxes, they showed reduced foot-to-obstacle clearance, resulting in higher collision rates and suggesting difficulties in height estimation. These impairments were even more pronounced than for controls with artificial scotomas. No difference was found between monocular and binocular conditions in patients except for the reduced velocity. Our findings suggest that VR environments can effectively reveal navigation strategies and challenges for patients with CFL, contributing to our understanding of how CFL affects their behavior in the real world, and offering insights into compensatory strategies and risks associated with vision loss. This work may support the development of rehabilitation strategies and assistive technologies for individuals with CFL.
NIH funding R01 EY27390
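The abstract does not define its tortuosity measure; a common definition, the ratio of traveled path length to straight-line distance between start and end points, is sketched below from tracked positions. The function name and sample data are hypothetical.

```python
import numpy as np

def path_tortuosity(positions):
    """Tortuosity ratio: traveled path length / straight-line distance.

    positions: (T, 3) array of tracked head (or foot) positions over a
    trial, e.g. sampled from the VR headset tracker. A ratio of 1 means
    a perfectly straight path; larger values indicate more detouring
    around obstacles.
    """
    steps = np.diff(positions, axis=0)
    path_length = np.linalg.norm(steps, axis=1).sum()
    straight_line = np.linalg.norm(positions[-1] - positions[0])
    return path_length / straight_line

# Example: a slightly curved three-sample walk.
track = np.array([[0.0, 0.0, 0.0], [1.0, 0.2, 0.0], [2.0, 0.0, 0.0]])
ratio = path_tortuosity(track)  # > 1 because of the detour
```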