A large-scale vision-language fMRI dataset for multi-modal semantic processing

Poster Presentation: Tuesday, May 20, 2025, 2:45 – 6:45 pm, Pavilion
Session: Scene Perception: Categorization, memory, clinical, intuitive physics, models

Yuanning Li1, Shurui Li1, Zheyu Jin1, Shi Gu2, Ru-Yuan Zhang3; 1ShanghaiTech University, Shanghai, China, 2University of Electronic Science and Technology of China, Chengdu, China, 3Shanghai Jiao Tong University, Shanghai, China

Large-scale functional MRI datasets with naturalistic stimuli provide more ecologically relevant experimental conditions and promote more reproducible research into the neural basis of sensory perception. They also afford the use of advanced AI models to investigate and model the neural coding and processing of language and visual information. However, existing research has largely focused on isolated visual or language networks, and few studies have addressed the interaction between vision and language processing at the semantic level. To facilitate investigation of the neural mechanisms of semantic representation across modalities, we present the Caption Scene Dataset (CSD). Specifically, we designed a paired caption-image semantic matching task and collected extensive 3T fMRI data from 8 subjects, totaling 320 hours of scanning. Each subject viewed more than 4,400 paired stimuli; each pair consisted of a text caption followed by a naturalistic image, and subjects judged whether the caption and image matched semantically. In addition to the task fMRI, we acquired T1-weighted, T2-weighted, and diffusion MRI data, as well as eye-tracking and electrocardiogram (ECG) recordings, to enrich the dataset's utility. We also localized the early visual cortex and category-selective areas in each participant using additional functional localizers. Preprocessing included slice-time correction, EPI distortion correction, and motion correction, followed by alignment to each subject's individual brain space; single-trial neural responses to the text and image stimuli were then estimated with GLMSingle. To illustrate the utility of CSD, we demonstrated that deep neural encoding models can effectively predict neural responses to text and image stimuli across different cortical regions. In sum, our unique, large-scale vision-language fMRI dataset establishes a robust platform for investigating the neural basis of semantic processing across vision and language modalities, fostering cross-disciplinary advances at the intersection of cognitive neuroscience and AI.
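For illustration only, the sketch below shows the general kind of deep-feature encoding analysis described above: a ridge regression mapping per-stimulus feature vectors (e.g., from a pretrained vision or language model) to single-trial voxel responses, scored by voxel-wise correlation on held-out trials. This is a minimal example under assumed array shapes and placeholder data; the variable names, the use of scikit-learn's RidgeCV, and the train/test split are our assumptions, not the authors' pipeline or any code released with CSD.

import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_trials, n_features, n_voxels = 4400, 512, 1000

# Placeholder data: in a real analysis, load GLMSingle single-trial betas
# (n_trials x n_voxels) and deep-network stimulus features (n_trials x n_features).
features = rng.standard_normal((n_trials, n_features))
betas = (features @ rng.standard_normal((n_features, n_voxels))) * 0.1 \
        + rng.standard_normal((n_trials, n_voxels))

X_train, X_test, y_train, y_test = train_test_split(
    features, betas, test_size=0.2, random_state=0)

# Multi-output ridge regression with a cross-validated penalty, fit to all voxels at once.
model = RidgeCV(alphas=np.logspace(1, 5, 9))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Voxel-wise prediction accuracy: Pearson correlation between predicted and observed betas.
def columnwise_corr(a, b):
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    return (a * b).sum(axis=0) / (np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0))

r = columnwise_corr(y_pred, y_test)
print(f"median voxel-wise prediction r = {np.median(r):.3f}")

In practice such models are typically fit separately for caption and image trials, or with modality-specific feature spaces, so that prediction accuracy can be compared across cortical regions and modalities.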

Acknowledgements: This work is supported by the National Natural Science Foundation of China (32371154, Y.L.), Shanghai Rising-Star Program (24QA2705500, Y.L.), Shanghai Pujiang Program (22PJ1410500, Y.L.), and the Lingang Laboratory (LG-GG-202402-06, Y.L.).