Visual Feature Classification and Error Analysis in Multi-Modal Large Language Models
Poster Presentation: Tuesday, May 20, 2025, 8:30 am – 12:30 pm, Banyan Breezeway
Session: Object Recognition: Models
Ching-Yi Wang1; 1University of California, Los Angeles (UCLA)
How do multi-modal large language models (MLLMs), such as GPT-4o, approximate human visual perception? Human vision excels at interpreting complex and incomplete stimuli by integrating high-level cognitive processes, such as Gestalt principles (e.g., the Law of Closure) and amodal completion, which enable humans to infer the missing elements of objects. Top-down processing further allows the brain to apply prior knowledge to resolve ambiguous sensory input. In contrast, GPT-4o relies largely on bottom-up processing, focusing primarily on raw sensory data and lacking the complex inference mechanisms characteristic of human perception. We selected GPT-4o for its advanced multi-modal capabilities: it processes textual and visual information jointly, which is essential for probing parallels with human visual perception. We evaluated its performance on 54 3D and 21 2D geometric stimuli (e.g., cubes, prisms, pentagons, hexagons) with attributes such as missing faces, rotations, and stacked elements. Using multiple-regression analysis, we assessed the contributions of different visual features to error rates. Our findings revealed that GPT-4o struggled to distinguish geometrically similar shapes, such as pentagonal and hexagonal prisms, yielding a 32.7% error rate on ambiguous 3D stimuli. The model also showed a 63% error rate in detecting missing faces, reflecting its weakness in handling incomplete structures. In 2D stimuli, the most common error concerned orientation, with a 14.3% error rate, likely because simple 2D shapes lack the depth cues that would otherwise disambiguate orientation. As complexity increased (e.g., overlapping objects), GPT-4o's accuracy declined, highlighting its limitations in managing multiple features and visual ambiguity. Despite these challenges, it recognized simple patterns and depth cues well, achieving a low 3.7% error rate in 3D structure identification. These results suggest that while GPT-4o performs well on structured tasks, cognitive top-down inference remains indispensable for resolving visual ambiguity and processing incomplete information effectively.
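For concreteness, the sketch below illustrates the kind of multiple-regression analysis described above, written in Python with statsmodels. The column names, binary feature coding, and synthetic data are illustrative assumptions, not the study's actual coding scheme or results.

```python
# A minimal sketch of regressing per-stimulus error on visual features.
# Column names (missing_face, rotated, stacked, is_3d, error) are
# hypothetical; the study's actual feature coding may differ.
import pandas as pd
import statsmodels.formula.api as smf

# One row per stimulus: binary flags for visual attributes and a
# binary outcome indicating whether GPT-4o misidentified the stimulus.
trials = pd.DataFrame({
    "missing_face": [1, 0, 1, 0, 0, 1, 0, 1],
    "rotated":      [0, 1, 1, 0, 1, 0, 0, 1],
    "stacked":      [0, 0, 0, 1, 1, 1, 0, 0],
    "is_3d":        [1, 1, 0, 1, 0, 1, 0, 1],
    "error":        [1, 0, 1, 0, 1, 1, 0, 1],
})

# Ordinary least squares on the binary outcome (a linear probability
# model); each coefficient estimates how much that feature shifts the
# error rate, holding the other features constant.
model = smf.ols("error ~ missing_face + rotated + stacked + is_3d",
                data=trials).fit()
print(model.summary())
```

With binary predictors, each OLS coefficient reads as the estimated shift in error rate attributable to that attribute; a logistic regression (smf.logit with the same formula) would be the usual alternative for a binary outcome.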