How do artificial neural networks (ANNs) respond to audiovisual illusions such as the McGurk effect?
Poster Presentation: Saturday, May 17, 2025, 2:45 – 6:45 pm, Pavilion
Session: Multisensory Processing: Audiovisual integration
Haotian Ma1, Zhengjia Wang1, John F. Magnotti1, Xiang Zhang1, Michael S. Beauchamp1; 1University of Pennsylvania Perelman School of Medicine
Humans perceive speech by integrating auditory information from the talker’s voice with visual information from the talker’s face. Incongruent speech provides a useful experimental tool for probing multisensory integration. For instance, in the McGurk effect, an auditory “ba” paired with a visual “ga” (AbaVga) produces the illusory percept of “da”. Artificial neural networks (ANNs) have made remarkable progress in reproducing human abilities and may provide a useful model for human audiovisual speech perception, prompting the question of how ANNs respond to incongruent audiovisual speech. To answer this question, we presented McGurk and congruent (control) stimuli to human observers and to Audiovisual Hidden-unit Bidirectional Encoder Representations from Transformers (AVHuBERT), an ANN developed by Meta. Twenty McGurk stimuli were tested, each consisting of a single “ga” video paired with a different “ba” auditory recording, all from the same female talker. Amazon Mechanical Turk was used to assess perception in 128 human observers. To model individual differences in human perception, variants of AVHuBERT were created by adding Gaussian noise to the units in one transformer encoder layer of the model. Both human observers and AVHuBERT classified each stimulus as “ba”, “ga”, or “da”. Performance was highly accurate for congruent syllables (mean accuracy of 94% for both humans and AVHuBERT). For McGurk stimuli (AbaVga), there was substantial variability in the rate of McGurk “da” reports across human observers (range: 25% to 65%) and across model variants (range: 16% to 48%). The overall rate of illusory percepts was similar (46% for humans, 34% for AVHuBERT). The similarity between the responses of AVHuBERT and the perceptual reports of human observers to McGurk stimuli suggests that ANNs may be a useful tool for interrogating the perceptual and neural mechanisms of human audiovisual speech perception.
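
A minimal PyTorch sketch of the noise-injection manipulation is shown below. It is an illustration only: the `make_noisy_variant` helper, its arguments, and the way an encoder layer is referenced are hypothetical stand-ins, not the actual AVHuBERT codebase.

```python
import torch

def make_noisy_variant(encoder_layer, noise_std=0.1, seed=0):
    """Create one "observer" variant by perturbing a single layer's units.

    `encoder_layer` stands in for one transformer encoder layer of AVHuBERT;
    the real module path in the AVHuBERT codebase will differ.
    """
    generator = torch.Generator().manual_seed(seed)

    def add_gaussian_noise(module, inputs, output):
        # A forward hook that returns a value replaces the layer's output,
        # so every forward pass sees Gaussian-perturbed hidden units.
        hidden = output[0] if isinstance(output, tuple) else output
        noise = torch.randn(hidden.shape, generator=generator) * noise_std
        noisy = hidden + noise.to(hidden.device)
        return (noisy, *output[1:]) if isinstance(output, tuple) else noisy

    # Each (noise_std, seed) pair yields a distinct, reproducible variant.
    return encoder_layer.register_forward_hook(add_gaussian_noise)

def mcgurk_rate(responses):
    # Fraction of trials classified as the illusory "da" percept.
    return sum(r == "da" for r in responses) / len(responses)
```

Seeding the noise generator makes each variant reproducible, which is one natural way to mimic stable individual differences across observers; calling `.remove()` on the returned hook handle restores the unperturbed model.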