Predicting gaze behavior in natural walking environments

Poster Presentation: Tuesday, May 20, 2025, 8:30 am – 12:30 pm, Banyan Breezeway
Session: Eye Movements: Natural or complex tasks

Lloyd Acquaye Thompson¹, Zachary Petroff¹, Stephanie M. Shields¹, Kathryn Bonnen¹; ¹Indiana University Bloomington

Understanding human gaze behavior during naturalistic tasks is critical for explaining how visual processing supports real-world activities. Previous work has developed spatial saliency models that predict gaze in static images. Here we examine how well such models predict gaze in dynamic environments during naturalistic tasks. Specifically, we evaluated the gaze prediction performance of DeepGazeIIE (Linardos, Kümmerer, Press, & Bethge, 2021), a prominent spatial saliency model, on data from a previous study of walking in natural outdoor environments that captured scene video and gaze data with a mobile eye tracker (Bonnen et al., 2021). We extracted frames from the head-mounted scene camera video and included only those with gaze estimation confidence of at least 95%. For each included frame, the model produced a log-likelihood map indicating how likely each point in the frame is to be fixated. We evaluated the network’s performance on our dataset (without additional training) using the area under the curve (AUC), computed from the recorded fixation locations. Initial results reveal that DeepGazeIIE provides good predictive accuracy for fixation patterns during natural walking (AUC=0.81). Notably, this is lower than the AUC reported on the original benchmark dataset (AUC=0.88). This performance gap may stem from differences between the two datasets, e.g., in task demands and visual environments. For example, walkers rarely look at their feet, but because feet are visually salient objects, the model nonetheless predicts frequent fixations on them. Future work will test strategies that address these limitations and better account for task-specific gaze patterns (e.g., training a network on egocentric video frames). Broadly, this study demonstrates the feasibility of leveraging existing spatial saliency models to analyze gaze behavior during natural tasks, offering insights into visual processing in dynamic contexts.
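
For illustration, the sketch below shows one way the per-frame AUC evaluation described above could be computed; it is not the authors' analysis code. It assumes a saliency model (e.g., DeepGazeIIE) has already produced a log-likelihood map for a frame, treats the recorded fixation locations as positives and uniformly sampled pixels as negatives (the abstract does not specify the negative-sampling scheme, so that choice is an assumption), and all names are illustrative.

```python
# Minimal sketch of a per-frame AUC evaluation for a spatial saliency map.
# Assumes `log_density` is the model's log-likelihood map for one frame and
# `fixations` holds the recorded (row, col) gaze positions in pixel coordinates.
import numpy as np
from sklearn.metrics import roc_auc_score

def frame_auc(log_density: np.ndarray,
              fixations: list[tuple[int, int]],
              n_negatives: int = 10_000,
              seed: int = 0) -> float:
    """AUC for one frame: fixated pixels are positives, random pixels are negatives."""
    rng = np.random.default_rng(seed)
    h, w = log_density.shape

    # Model scores at the recorded fixation locations (positives).
    pos = np.array([log_density[r, c] for r, c in fixations])

    # Model scores at uniformly sampled image locations (negatives).
    rows = rng.integers(0, h, size=n_negatives)
    cols = rng.integers(0, w, size=n_negatives)
    neg = log_density[rows, cols]

    scores = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    return roc_auc_score(labels, scores)

# Example with synthetic data: one 1080x1920 map and two recorded fixations.
log_density = np.random.randn(1080, 1920)
print(frame_auc(log_density, [(540, 960), (600, 1000)]))
```

In practice, per-frame AUC values like these would be averaged across all frames that pass the gaze-confidence threshold to obtain a dataset-level score comparable to the reported AUC=0.81.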