Scene Understanding Maps: Predicting Most Frequently Fixated Object during Free Viewing with Multi-Modal Large Language Models

Poster Presentation: Tuesday, May 20, 2025, 2:45 – 6:45 pm, Pavilion
Session: Scene Perception: Categorization, memory, clinical, intuitive physics, models

Shravan Murlidaran¹, Miguel P. Eckstein; ¹Department of Psychological and Brain Sciences, University of California, Santa Barbara

Introduction: Which objects are most frequently fixated during free viewing of images depicting complex human behaviors, actions, or interactions? We hypothesized that: 1) objects that contribute most to understanding of the scene are fixated most frequently, and 2) Multi-Modal Large Language Models (MLLMs) can be used to build Scene Understanding Maps (SUMs) that visualize the contribution of individual objects to accurate scene descriptions and can predict the objects humans fixate most during free viewing.

Methods: For each image (n = 38), we created an MLLM-SUM by digitally removing each object in turn and computing a semantic similarity measure (Gemini) between the MLLM description of each manipulated image and that of the intact image. We compared the MLLM-SUM predictions to the impact of the same object-deletion manipulation on human scene descriptions and to human fixations during a separate free-viewing task (2 s presentation). We benchmarked the MLLM-SUM against Graph-Based Visual Saliency (GBVS), DeepGaze models, and human judgments of the meaningfulness of cropped scene patches (meaning maps).

Results: The object most critical to scene understanding according to the MLLM-SUM (the map's minimum) correctly predicted the object most disruptive to human (N = 55) scene descriptions in 73.5 ± 7% of the images (chance performance = 12.5%). The MLLM-SUM also predicted the object humans (N = 50) fixated most in the free-viewing task in 52.6 ± 8% of the images, significantly higher than the maximum of GBVS (29 ± 8%; bootstrap, p = 0.009) and of meaning maps (24 ± 7%; bootstrap, p = 0.002), but not significantly different from the maximum of DeepGaze (47.3 ± 8%; bootstrap, p = 0.26).

Conclusions: Multi-modal large language models combined with semantic similarity metrics provide a powerful new tool for identifying the objects critical to scene understanding and predicting the most fixated object during free viewing.
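
The following is a minimal sketch of the object-deletion pipeline described in the Methods. The wrapper names (describe_scene, semantic_similarity, remove_object) are assumptions for illustration and do not come from the abstract; they stand in for an MLLM captioning call, an embedding-based similarity score (e.g., from a Gemini text-embedding model), and a digital object-removal (inpainting) step.

```python
from typing import Dict, Iterable


def describe_scene(image) -> str:
    """Hypothetical wrapper: return an MLLM free-form description of the image."""
    raise NotImplementedError


def semantic_similarity(desc_a: str, desc_b: str) -> float:
    """Hypothetical wrapper: similarity between two descriptions (e.g., embedding cosine)."""
    raise NotImplementedError


def remove_object(image, obj_id: str):
    """Hypothetical wrapper: return the image with one object digitally removed (inpainted)."""
    raise NotImplementedError


def mllm_sum(image, object_ids: Iterable[str]) -> Dict[str, float]:
    """Build a Scene Understanding Map: one similarity score per deleted object.

    Lower similarity means removing that object changed the scene description more,
    i.e., the object contributed more to scene understanding.
    """
    intact_desc = describe_scene(image)
    sum_map = {}
    for obj_id in object_ids:
        edited_image = remove_object(image, obj_id)
        edited_desc = describe_scene(edited_image)
        sum_map[obj_id] = semantic_similarity(intact_desc, edited_desc)
    return sum_map


def most_critical_object(sum_map: Dict[str, float]) -> str:
    """The minimum of the MLLM-SUM: the object whose removal is most disruptive."""
    return min(sum_map, key=sum_map.get)
```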
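
The Results compare per-image prediction accuracies across models with bootstrap tests. Below is one way such a comparison could be implemented; the paired resampling over images is an assumption about the analysis, not a detail taken from the abstract.

```python
import numpy as np


def bootstrap_accuracy_difference(hits_a, hits_b, n_boot=10_000, seed=0):
    """Paired bootstrap over images (assumed scheme, not from the abstract).

    hits_a, hits_b: 0/1 arrays indicating whether each model's predicted object
    matched the most fixated object on each image.
    Returns (mean accuracy A, mean accuracy B, one-sided p-value for A <= B).
    """
    rng = np.random.default_rng(seed)
    hits_a = np.asarray(hits_a, dtype=float)
    hits_b = np.asarray(hits_b, dtype=float)
    n = len(hits_a)

    # Resample images with replacement and record the accuracy difference each time.
    idx = rng.integers(0, n, size=(n_boot, n))
    diffs = hits_a[idx].mean(axis=1) - hits_b[idx].mean(axis=1)

    # Fraction of resamples in which model A does not outperform model B.
    p_value = float(np.mean(diffs <= 0))
    return float(hits_a.mean()), float(hits_b.mean()), p_value
```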