A deep learning model of self-motion using gaze-centered image sequences

Poster Presentation: Sunday, May 18, 2025, 2:45 – 6:45 pm, Pavilion
Session: Motion: Models, neural mechanisms

Nathaniel Powell1, Youjin Oh1, Mary Hayhoe1; 1University of Texas at Austin

Retinal flow patterns while walking depend not only on gaze location but also on the oscillations of the head (Matthis et al., 2022). When humans fixate a location while walking, the eyes counter-rotate in the orbit to stabilize the retinal image. Additionally, normal gait is accompanied by rhythmic translations and rotations of the head in space, so the momentary heading direction varies greatly over the gait cycle, creating a constantly changing, complex pattern of motion on the retina. These patterns are ubiquitous during development; however, most stimuli used to investigate the coding of self-motion do not reflect the complexity of the actual retinal motion patterns. How, then, might the visual system encode this motion? Neural recordings from visual areas MT and MST show responses to translational, expanding, and rotating patterns of motion. Recent work by Mineault et al. (2021) used deep learning and found that predicting the translation and rotation of a camera moving through space from a stacked sequence of images was sufficient to produce MT- and MST-like receptive fields. The image sequences used in training, however, differed from the natural stimuli people experience while walking. Here, a similar deep learning model was trained on gaze-centered image sequences recorded from a subject walking through virtual environments with different depth structures. The model was trained to predict the translation and rotation of the head. It learned the motion parameters well and developed speed and direction tuning similar to receptive fields in MT and MST. This suggests that the information present in the retinal flow field is sufficient to extract the momentary direction of heading. Surprisingly, the model learned the heading parameters despite the presence of saccades in the input sequences, suggesting that it is robust to the discontinuities caused by saccadic eye movements.
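
For concreteness, the sketch below illustrates one way such a model could be set up: a small spatiotemporal (3D) convolutional network in PyTorch that takes a stacked sequence of gaze-centered frames and regresses the head's translation and rotation. The frame count, layer sizes, six-parameter output (three translation and three rotation components), and mean-squared-error loss are illustrative assumptions, not the architecture or training details of the model described in this abstract or in Mineault et al. (2021).

    # Minimal sketch of a self-motion regressor (illustrative assumptions only;
    # not the authors' actual architecture or training procedure).
    import torch
    import torch.nn as nn

    class SelfMotionNet(nn.Module):
        def __init__(self, n_outputs: int = 6):
            super().__init__()
            # Spatiotemporal convolutions over a stacked sequence of grayscale,
            # gaze-centered frames: input shape (batch, 1, n_frames, H, W).
            self.features = nn.Sequential(
                nn.Conv3d(1, 32, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3)),
                nn.ReLU(inplace=True),
                nn.Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
                nn.ReLU(inplace=True),
                nn.Conv3d(64, 128, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool3d(1),  # collapse space and time to one feature vector
            )
            # Linear read-out of head translation and rotation for the whole sequence.
            self.head = nn.Linear(128, n_outputs)

        def forward(self, clips: torch.Tensor) -> torch.Tensor:
            return self.head(self.features(clips).flatten(1))

    if __name__ == "__main__":
        model = SelfMotionNet()
        clips = torch.randn(4, 1, 10, 128, 128)   # 4 clips of 10 gaze-centered frames
        targets = torch.randn(4, 6)               # per-clip head translation + rotation
        loss = nn.functional.mse_loss(model(clips), targets)
        loss.backward()                           # trainable end to end
        print(loss.item())

Given such a trained network, MT/MST-like tuning would be probed the way the abstract describes: by presenting translational, expanding, and rotating motion patterns and measuring the speed and direction selectivity of intermediate-layer units.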