Comparing a Dual-stream Architecture with Single-stream CNNs to Simulate Vision in Locomotor Control
Poster Presentation: Sunday, May 18, 2025, 2:45 – 6:45 pm, Pavilion
Session: Motion: Models, neural mechanisms
Zhenyu Zhu1, Thomas Serre1, William H. Warren1; 1Brown University
Zhu and Warren (2022) asked participants to follow a group of textured objects whose heading direction or speed was briefly perturbed. Overall, locomotor responses were consistent with boundary motion (feature tracking). When the object texture and boundaries moved in Opposite directions (the reverse-phi illusion), responses decreased with increasing boundary blur, but this did not occur in the Same direction condition. This widening gap between the two conditions indicates that visually guided locomotion depends on a weighted combination of feature tracking and motion energy. Here, we leverage deep neural networks to investigate which network architectures can replicate these effects. We evaluated several model architectures, including single-stream 3D convolutional networks (MC3, R(2+1)D, R3D, DorsalNet), a dual-stream network (SlowFast), and a benchmark neurophysiological motion energy model (Nishimoto and Gallant, 2011). The dual-stream network combines a low temporal-frequency/high spatial-frequency stream with a high temporal-frequency/low spatial-frequency stream. These models were fine-tuned to estimate, from 12-frame video sequences, the heading and speed of a group of objects with attached surface textures moving across various backgrounds, as in ecological contexts. The fine-tuned models were then tested on Zhu and Warren's (2022) reverse-phi stimuli. The mean heading estimates of the single-stream and motion energy models decreased significantly in both the Same and Opposite conditions as boundary blur increased (F(2, 594) range: 16.45–96.00, all p < 0.0001), deviating from human responses. These findings suggest that single-stream 3D convolutional networks function like motion energy detectors, without the feature tracking observed in humans. However, the dual-stream SlowFast network also failed to replicate the increasing gap between the Same and Opposite conditions with boundary blur. We conclude that the SlowFast model does not capture human-like feature tracking. These results indicate the need for further architectural improvements, such as incorporating recurrent connections to support feature tracking.
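To make the fine-tuning step concrete, the sketch below adapts one of the single-stream architectures (torchvision's r3d_18) by swapping its classification head for a two-unit regression head over heading and speed. This is a minimal illustration, not the authors' training pipeline: the clip shape, loss, optimizer, and learning rate are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

# Assumed setup: 12-frame RGB clips shaped [batch, 3, 12, H, W] and
# per-clip targets [heading, speed]; MSE loss and Adam are illustrative
# choices, not the study's reported hyperparameters.
model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
model.fc = nn.Linear(model.fc.in_features, 2)  # classification head -> heading/speed regression

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

def fine_tune_step(clips: torch.Tensor, targets: torch.Tensor) -> float:
    """One gradient step on a batch of video clips."""
    optimizer.zero_grad()
    preds = model(clips)            # [batch, 2]
    loss = criterion(preds, targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```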
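The dual-stream structure of SlowFast is visible in how its inputs are packed: the fast pathway receives the full clip, while the slow pathway receives a temporally subsampled copy. A minimal sketch, assuming PyTorchVideo's slowfast_r50 hub model with its default pathway ratio (alpha = 4) and its default 32-frame, 256-pixel clips rather than the 12-frame sequences used in this study:

```python
import torch

# Load the dual-stream SlowFast model from PyTorchVideo's torch hub.
model = torch.hub.load('facebookresearch/pytorchvideo', 'slowfast_r50', pretrained=True)
model.eval()

clip = torch.randn(1, 3, 32, 256, 256)   # [batch, channels, frames, H, W] fast-pathway input
slow = clip[:, :, ::4]                   # every 4th frame -> low temporal-frequency pathway

with torch.no_grad():
    out = model([slow, clip])            # SlowFast expects [slow_pathway, fast_pathway]
```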
Acknowledgements: Funding: NIH R01EY029745, NIH 1S10OD025181, NIH T32MH115895