Relational gaze information predicts human behavior and neural responses to complex social scenes
Poster Presentation: Tuesday, May 20, 2025, 8:30 am – 12:30 pm, Pavilion
Session: Face and Body Perception: Social cognition, behavioural
Wenshuo Qin1, Manasi Malik1, Leyla Isik1; 1Johns Hopkins University
Studies have shown that understanding social events requires capturing interpersonal dynamics, which depends on tracking relational visual information between people. Gaze direction is a critical cue to relational information: it is prioritized behaviorally and represented in the superior temporal sulcus (STS), a region implicated in processing social scenes. However, there has been little work integrating relational gaze information into computational models of social vision. Here, we evaluate how well simple computational models based on relational gaze features match human judgments of social events, and compare their performance to state-of-the-art (SOTA) AI vision and language models. Specifically, we tested SocialGNN, a graph neural network that organizes each video frame into a graph structure based on gaze direction, alongside various recurrent neural networks (RNNs), on a dataset of behavioral and neural responses to 3-second videos of pairs of people engaged in everyday activities. SocialGNN predicts human behavioral ratings of social features remarkably well, achieving results on par with SOTA AI vision and language models despite being trained on far less data and having only a fraction of the layers and tunable parameters. Notably, it also yields high neural encoding accuracy for STS responses to these videos. Follow-up ablation studies reveal that an even simpler RNN model trained on gaze direction alone is sufficient to achieve the observed alignment with human behavior and neural responses, suggesting that including additional visual features in the GNN framework does not further enhance its performance. These findings underscore the primacy of gaze direction as a relational visual cue in computational models for predicting human social judgments and brain responses. Future work will investigate which additional visual cues can be combined with gaze direction to better capture human judgments of social scenes.
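To make the general modeling idea concrete, the sketch below illustrates one plausible way to implement a gaze-based relational model: each frame is represented as a small graph whose directed edges encode who is gazing at whom, a single round of message passing mixes information along those edges, and a recurrent network integrates frame representations over time to predict a social rating. This is a minimal, hypothetical sketch; the class name (GazeGraphRNN), feature dimensions, single message-passing step, and GRU readout are illustrative assumptions, not the authors' actual SocialGNN implementation.

```python
# Hypothetical sketch of a gaze-based graph + RNN model (not the authors' code).
import torch
import torch.nn as nn

class GazeGraphRNN(nn.Module):
    def __init__(self, node_dim=16, hidden_dim=32, n_ratings=1):
        super().__init__()
        self.node_proj = nn.Linear(node_dim, hidden_dim)    # per-person node embedding
        self.msg_proj = nn.Linear(hidden_dim, hidden_dim)   # message passed along gaze edges
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.readout = nn.Linear(hidden_dim, n_ratings)     # predicted social feature rating

    def forward(self, node_feats, gaze_adj):
        # node_feats: (batch, frames, n_people, node_dim) visual features per person
        # gaze_adj:   (batch, frames, n_people, n_people), where gaze_adj[..., i, j] = 1
        #             if person i is looking at person j in that frame
        h = torch.relu(self.node_proj(node_feats))
        # One message-passing round: each person aggregates features of whomever they gaze at.
        msgs = torch.einsum("bfij,bfjd->bfid", gaze_adj, self.msg_proj(h))
        h = torch.relu(h + msgs)
        # Pool people within each frame, then integrate over frames with a GRU.
        frame_repr = h.mean(dim=2)                 # (batch, frames, hidden_dim)
        _, last = self.rnn(frame_repr)
        return self.readout(last.squeeze(0))       # (batch, n_ratings)
```

In this sketch, the gaze-direction ablation described above would correspond to dropping the per-person visual features (node_feats) and feeding only the gaze adjacency information into the recurrent stage.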