High-level visual information underlies social and language processing in the STS during natural movie viewing
Poster Presentation: Tuesday, May 20, 2025, 8:30 am – 12:30 pm, Pavilion
Session: Face and Body Perception: Social cognition, neural mechanisms
Hannah Small1, Haemy Lee Masson2, Ericka Wodka3,4, Stewart H. Mostofsky3,4, Leyla Isik1; 1Johns Hopkins University, 2Durham University, 3Kennedy Krieger Institute, 4Johns Hopkins School of Medicine
Real-world social perception depends on continuously integrating information from both vision and language. However, most prior neuroimaging work has studied vision and language separately, leaving open critical questions about how these distinct social signals are integrated in the human brain. To address this gap, we investigate how rich social visual and verbal semantic signals are processed simultaneously, using controlled and naturalistic fMRI paradigms. Focusing on the superior temporal sulcus (STS), previously shown to be sensitive to both visual social and language signals, we first localized visual social interaction perception and language regions in each participant (n=19) using controlled stimuli from prior work. We show for the first time that social interaction and language voxels in the STS are largely non-overlapping. We then investigate how these regions process a 45-minute naturalistic movie by combining vision (AlexNet) and language (sBERT) deep neural network embeddings with a voxel-wise encoding approach. We find that social interaction-selective regions are best described by vision model embeddings of the movie frames and, to a lesser extent, by language model embeddings of the spoken content.
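As a rough illustration of the voxel-wise encoding approach described above, the sketch below fits cross-validated ridge regression from stacked vision and language embeddings to placeholder BOLD data. The specific model checkpoints (torchvision AlexNet, the all-MiniLM-L6-v2 Sentence-BERT model), the chosen layer, and the preprocessing are assumptions for illustration, not the authors' exact pipeline.

```python
# Illustrative sketch of a voxel-wise encoding analysis; model choices,
# layer selection, and preprocessing are assumptions, not the authors' pipeline.
import numpy as np
import torch
from torchvision.models import alexnet, AlexNet_Weights
from torchvision.models.feature_extraction import create_feature_extractor
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# --- Vision embeddings: one AlexNet activation vector per movie time point ---
weights = AlexNet_Weights.DEFAULT
vision_model = alexnet(weights=weights).eval()
preprocess = weights.transforms()
# Extract a late layer (here the penultimate fully connected layer, classifier.4).
extractor = create_feature_extractor(vision_model, return_nodes={"classifier.4": "fc7"})

frames = torch.rand(100, 3, 256, 256)            # placeholder frames, one per TR
with torch.no_grad():
    vision_feats = extractor(preprocess(frames))["fc7"].numpy()   # (n_TRs, 4096)

# --- Language embeddings: one sentence embedding per spoken utterance/TR ---
sbert = SentenceTransformer("all-MiniLM-L6-v2")   # assumed SBERT variant
transcript = ["placeholder utterance"] * 100       # aligned to TRs in practice
lang_feats = sbert.encode(transcript)               # (n_TRs, 384)

# --- Voxel-wise ridge regression within an ROI (e.g., STS voxels) ---
bold = np.random.randn(100, 500)                    # placeholder BOLD: (n_TRs, n_voxels)
X = np.hstack([vision_feats, lang_feats])
X_tr, X_te, y_tr, y_te = train_test_split(X, bold, test_size=0.2, shuffle=False)

enc = RidgeCV(alphas=np.logspace(-2, 4, 10)).fit(X_tr, y_tr)
pred = enc.predict(X_te)
# Encoding performance: correlation of predicted and measured time courses per voxel.
r = np.array([np.corrcoef(pred[:, v], y_te[:, v])[0, 1] for v in range(bold.shape[1])])
print("median voxel-wise r:", np.median(r))
```

In practice the vision-only and language-only fits would be compared separately (and after hemodynamic-lag handling) to ask which feature space best describes each region.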
Surprisingly, language regions are equally well described by language and vision model embeddings, despite the lack of correlation between these features in the movie. Both regions are best explained by the last two layers of the vision model, suggesting sensitivity to high-level visual information. Follow-up analyses suggest that the most predictive vision model features are similar across social interaction and language regions, but differ from those in the low-level visual region MT. Together, these results suggest that social interaction- and language-selective brain regions respond not only to spoken language content, but also to semantic information in the visual scene. This work highlights the importance of combining controlled and naturalistic approaches to study multimodal social processing.
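One simple way to compare "most predictive" vision-model features across regions, in the spirit of the follow-up analyses above, is to summarize each ROI's fitted encoding weights as a feature-importance profile and rank-correlate profiles between ROIs. The importance metric (mean absolute weight) and the ROI labels below are assumptions for illustration only.

```python
# Illustrative comparison of predictive vision-model features across ROIs.
# Importance metric and ROI labels are assumptions, not the authors' analysis.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_features = 4096
# Placeholder encoding weights per ROI: (n_features, n_voxels), as fit above.
weights = {
    "social_interaction_STS": rng.standard_normal((n_features, 300)),
    "language_STS": rng.standard_normal((n_features, 250)),
    "MT": rng.standard_normal((n_features, 120)),
}

# Summarize each ROI by a single feature-importance profile.
importance = {roi: np.abs(w).mean(axis=1) for roi, w in weights.items()}

# Rank-correlate profiles: similar profiles imply the same vision-model
# features drive predictions in both regions.
for a, b in [("social_interaction_STS", "language_STS"),
             ("social_interaction_STS", "MT"),
             ("language_STS", "MT")]:
    rho, _ = spearmanr(importance[a], importance[b])
    print(f"{a} vs {b}: Spearman rho = {rho:.3f}")
```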