Decoding Retinal Responses: A Transformer-Based Model for Visual Stimulus Prediction

Poster Presentation: Saturday, May 17, 2025, 8:30 am – 12:30 pm, Banyan Breezeway
Session: Plasticity and Learning: Models

Maria Jose Escobar1, Francisco Miqueles1, John Atkinson2, Adrian G. Palacios3; 1Universidad Técnica Federico Santa María, Department of Electronic Engineering, Valparaíso, Chile, 2Universidad Adolfo Ibáñez, Santiago, Chile, 3Centro Interdisciplinario de Neurociencia de Valparaíso, Facultad de Ciencias, Universidad de Valparaíso, Valparaíso, Chile

The retina encodes visual information into spike trains using diverse functional cell types. While the underlying encoding principles remain unclear, evidence highlights the complementary roles of individual and collective activity. This study explores a transformer-based neural model (POYO) to decode retinal structure and responses. POYO analyzes retinal networks by tokenizing neuronal spiking activity into individual spike tokens and processing them through attention mechanisms, accounting for inter-animal variability. Using data from the diurnal rodent Octodon degus, the model was trained on five retinas over 400 epochs, achieving a loss of 0.019 and an $R^2$ of 0.921. Fine-tuning on a sixth, unseen retina converged rapidly, reaching an $R^2$ of 0.998 within 30 epochs. To examine the role of different cell types, the model was trained on the entire population (496 units) and on the ON (46), OFF (256), and ON-OFF (194) subsets, yielding $R^2$ scores of 0.997, 0.910, 0.882, and 0.937, respectively. Although single-cell-type subsets performed well, the stimulus was fully recovered only when all retinal cell types were included. These findings demonstrate that transformer-based models can effectively predict visual stimuli from retinal responses and reveal the internal structure of retinal populations. Neuronal diversity plays a critical role in model convergence, as restricted inputs fail to capture the full stimulus, and trained models do not generalize across tissues without fine-tuning. The rapid convergence on new retinas indicates that POYO captures generalizable retinal features, enabling efficient adaptation to new datasets. These results underscore the importance of retinal heterogeneity in visual encoding and the potential of transformer models for advancing our understanding of sensory information processing.
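
As a rough illustration of the spike-tokenization and attention scheme described above, the sketch below embeds each spike as a token (learned unit-identity embedding plus a projection of its spike time) and decodes a stimulus estimate with Perceiver-style cross-attention from learned latent queries. All layer names, sizes, the pooled linear readout, and the toy data are assumptions for illustration only; this is not the POYO configuration or training setup used in the poster.

```python
# Minimal sketch: decoding a stimulus from spike tokens with cross-attention.
# Illustrative only; hyperparameters and readout are assumptions, not POYO's.
import torch
import torch.nn as nn

class SpikeTokenDecoder(nn.Module):
    def __init__(self, n_units, d_model=64, n_latents=16, n_heads=4, stim_dim=1):
        super().__init__()
        # Each spike becomes a token: unit-identity embedding + spike-time encoding.
        self.unit_emb = nn.Embedding(n_units, d_model)
        self.time_proj = nn.Linear(1, d_model)
        # Learned latent queries attend over all spike tokens (Perceiver-style).
        self.latents = nn.Parameter(torch.randn(n_latents, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.readout = nn.Linear(d_model, stim_dim)  # maps latents to a stimulus value

    def forward(self, unit_ids, spike_times):
        # unit_ids: (batch, n_spikes) integer unit index per spike
        # spike_times: (batch, n_spikes) spike times in seconds
        tokens = self.unit_emb(unit_ids) + self.time_proj(spike_times.unsqueeze(-1))
        queries = self.latents.unsqueeze(0).expand(tokens.size(0), -1, -1)
        latent, _ = self.cross_attn(queries, tokens, tokens)
        return self.readout(latent.mean(dim=1))  # (batch, stim_dim) stimulus estimate

# Toy usage: 496 units (the full population size reported in the poster).
model = SpikeTokenDecoder(n_units=496)
unit_ids = torch.randint(0, 496, (8, 200))   # 8 windows, 200 spikes each
spike_times = torch.rand(8, 200)
pred = model(unit_ids, spike_times)          # (8, 1) predicted stimulus per window
```

In models of this kind, adapting to a new recording largely amounts to learning fresh unit embeddings while the attention backbone is reused, which may help explain the rapid fine-tuning convergence on the unseen retina reported above.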

Acknowledgements: ANID FONDECYT 1230170 and 1200880