Foveated Multi-Modal Large Language Model Maps to Predict Time to Understand Scenes

Poster Presentation: Tuesday, May 20, 2025, 2:45 – 6:45 pm, Pavilion
Session: Scene Perception: Categorization, memory, clinical, intuitive physics, models

Ziqi Wen1, Shravan Murlidaran2, Jonathan Skaza3, Miguel Eckstein1,2,3; 1Department of Computer Science, UC Santa Barbara, 2Department of Psychological and Brain Sciences, UC Santa Barbara, 3Graduate Program in Dynamical Neuroscience, UC Santa Barbara

Introduction: Perceptual science has proposed evidence accumulation models (Ratcliff et al., 2016; Rafiei & Rahnev, 2022) to predict response times for simple perceptual tasks, and clutter metrics to predict search times in complex scenes (Rosenholtz, 2005; Deza & Eckstein, 2017). However, there are no image-computable models that predict an observer's time to understand real-world scenes. Here, we use Multi-Modal Large Language Models (MLLMs) to generate foveated scene understanding maps of how the description (understanding) of a scene varies across fixation locations, and we propose new metrics to predict observers' time to understand a scene.

Methods: For each image (n=94), we created a foveated scene understanding map (FSUM) that quantifies how closely the MLLM's scene description at each fixation point matches (measured with LLM sentence similarity, Gemini) a gold-standard description (the MLLM description of the non-foveated image). An FSUM metric was computed by integrating the map with Minkowski pooling. We assessed the ability of the FSUM metric to predict the response time and the number of saccades that seven observers required to accurately describe a scene. For benchmark comparisons, we used image clutter metrics (Feature Congestion and Subband Entropy; Rosenholtz, 2017).

Results: To understand the scenes, observers executed 6.9±0.18 saccades and took 1.89±0.06 s. The FSUM metric's correlation with human RTs and number of eye movements was significantly higher (p<0.001) than that of the clutter metrics (RTs: FSUM=0.547, Feature Congestion=0.1, Subband Entropy=0.099; eye movements: FSUM=0.605, Feature Congestion=0.119, Subband Entropy=0.038). For comparison, the correlation between single-observer RTs and the average RTs of the remaining observers was 0.391.

Conclusion: Combining foveated architectures, multi-modal large language models, and semantic similarity metrics provides a powerful new tool for extending vision science approaches to human understanding of complex scenes.
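
The sketch below illustrates the kind of pipeline the Methods describe: score each fixation-conditioned MLLM description against a gold-standard description, collapse the resulting map into a scalar with Minkowski pooling, and correlate the image-level metric with human response times. It is a minimal illustration, not the authors' implementation; `describe_foveated`, `sentence_similarity`, and the pooling exponent `p` are hypothetical placeholders for the MLLM, the LLM-based similarity measure (Gemini), and an unspecified pooling parameter.

```python
"""Minimal sketch of an FSUM-style metric, under assumed helper functions."""
import numpy as np


def fsum_map(image, fixations, describe_foveated, sentence_similarity, gold_description):
    """Similarity of each fixation-conditioned MLLM description to the gold standard.

    `describe_foveated(image, fixation)` and `sentence_similarity(a, b)` are
    assumed callables (hypothetical, not APIs from the paper).
    """
    return np.array([
        sentence_similarity(describe_foveated(image, fix), gold_description)
        for fix in fixations
    ])


def fsum_metric(similarity_map, p=3.0):
    """Collapse the map into a scalar with Minkowski (p-norm) pooling.

    The exponent p=3.0 is an arbitrary illustrative choice; the abstract does
    not specify the pooling parameter.
    """
    s = np.asarray(similarity_map, dtype=float)
    return float(np.mean(np.abs(s) ** p) ** (1.0 / p))


def correlate_with_behavior(metric_values, response_times):
    """Pearson correlation between the per-image metric and human RTs (or saccade counts)."""
    return float(np.corrcoef(metric_values, response_times)[0, 1])
```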