Model-optimized stimuli: more than just pretty pictures

Symposium: Friday, May 16, 2025, 1:00 – 3:00 pm, Talk Room 2

Organizers: William Broderick1, Jenelle Feather1; 1Center for Computational Neuroscience, Flatiron Institute
Presenters: Ruth Rosenholtz, William F. Broderick, Arash Afraz, Jenelle Feather, Binxu Wang, Andreas Tolias

Experiments in vision science rely on designing the appropriate stimulus set to test specific properties of visual systems. Data collection time is inherently limited, so choosing stimuli that get the most "bang for your buck" when collecting behavioral or neural data is imperative. While early experiments relied on simple stimuli such as points of light and visual gratings to test specific hypotheses about visual representations, the advent of large-scale neural recordings and large benchmark datasets encouraged many researchers to shift to natural stimuli such as photographs and video, on which predictive models could be tested. However, both of these approaches have limitations. A model based purely on simple synthetic stimuli may not generalize to natural environments, whereas when relying solely on natural stimuli, the hypotheses about the underlying representations are often unclear and results can be muddled by stimulus correlations. We now have tools to combine these approaches, allowing for the design of complex stimuli that test specific hypotheses, guide responses, or decouple otherwise-confounded properties of the stimuli. In this symposium, we cover recent progress in utilizing model-optimized stimuli to probe visual perception and cognition. Ruth Rosenholtz will describe how stimulus synthesis facilitated the development of a model of peripheral crowding, building intuition about its predictions and correcting conceptual errors along the way. William Broderick will describe "plenoptic," an open-source Python package that offers a framework for synthesizing model-optimized stimuli such as metamers and eigendistortions, enabling the broader vision science community to apply these approaches to their own models. Arash Afraz will detail work with "Perceptography", a technique that combines behavior, optogenetics, and machine learning tools to design stimuli that capture the subjective experience induced by local cortical stimulation. Jenelle Feather will discuss how behavioral experiments with model metamers and other types of model-optimized stimuli can reveal differences between biological and artificial neural networks, and show how these methods can point toward model improvements. Binxu Wang will describe ongoing work using artificial neural networks (ANNs) to fully explore the dynamic range of macaque visual cortical neurons, enabling improved model performance. Andreas Tolias will describe the development and use of "digital twins", performing in-silico experiments on brain foundation models to improve our understanding of the brain while simultaneously developing more comprehensible, energy-efficient AI models. Choosing the right stimuli to test a hypothesis is a fundamental aspect of experimental design, and this symposium will be of general interest to VSS members interested in testing the properties of neural populations underlying visual perception. This symposium will give attendees an overview of the various uses of these stimuli, and inspire them to apply such methods in their own research.

Talk 1

Synthesizing model predictions supercharges understanding

Ruth Rosenholtz1, Benjamin Balas2; 1NVIDIA, 2North Dakota State University

At VSS 2008 we first presented our model of peripheral crowding. Serendipitously, we captured existing intuitions about crowding with Portilla & Simoncelli’s (2001) texture analysis/synthesis algorithm. Doing so allowed us to synthesize images in which distorted and retained image structure revealed the visual information available, ambiguous, or lost according to the model (Balas et al., 2009; Freeman & Simoncelli, 2011; Rosenholtz, Huang, & Ehinger, 2012). Many researchers have similarly used a wide range of image-computable models to visualize predictions, and the value of this approach cannot be overstated. Image-computability combined with synthesized outputs supports rapid development of predictions for a wide range of stimuli and tasks. Formal experiments, almost as an afterthought, can quantify those intuitions as needed, but the instant, informal psychophysics afforded by this framework is often sufficient. In our case, this enabled us to quickly build novel intuitions about crowding, search, reading, scene perception, maze-solving, change blindness, illusions, choice of fixations, and vision for action. This methodology also helped correct conceptual errors, for example the notion that nearby flanker things crowd (a transitive verb) a target thing unless some other mechanism intervenes to relieve crowding. Instead, model syntheses clearly reveal the importance of stimulus and task, and the generality of crowding well beyond the original empirical phenomena.

Talk 2

Plenoptic: A Python library for synthesizing model-optimized visual stimuli

William F. Broderick1, Edoardo Balzani1, Kathryn Bonnen2, Hanna Dettki3, Lyndon Duong4, Pierre-Étienne Fiquet1, Daniel Herrera-Esposito5, Nikhil Parthasarathy6, Thomas Yerxa3, Xinyuan Zhao3, Eero P. Simoncelli1,3; 1Center for Computational Neuroscience, Flatiron Institute, 2Indiana University, 3New York University, 4Apple, 5University of Pennsylvania, 6Google DeepMind

In sensory perception and neuroscience, new computational models are most often tested and compared in terms of their ability to fit existing data sets. However, experimental data are inherently limited, and complex models often saturate their explainable variance, resulting in similar performance across models. Here, we present "Plenoptic", a Python software library for synthesizing model-optimized visual stimuli for understanding, testing, and comparing models. Plenoptic provides a unified framework containing three previously-published synthesis methods -- model metamers (Freeman and Simoncelli, 2011), Maximum Differentiation (MAD) competition (Wang and Simoncelli, 2008), and eigendistortions (Berardino et al., 2017) -- which enable visualization of different aspects of model representations. The resulting images can then be used to experimentally test model alignment with biological visual systems. Plenoptic leverages modern machine-learning methods to enable the application of these syntheses to any computational model that satisfies a small set of common requirements: the model must be image-computable, implemented in PyTorch, and end-to-end differentiable. The package includes examples of several previously-published low- and mid-level visual models, as well as a set of perceptual quality metrics, and is compatible with the pre-trained machine learning models included in PyTorch's torchvision library. Plenoptic is open source, tested, documented, modular, and extensible, allowing the broader research community to contribute new models, examples, and methods. In summary, Plenoptic uses machine learning tools to tighten the scientific hypothesis-testing loop, facilitating the development of computational models aligned with biological visual representations.
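
To make the workflow concrete, here is a minimal conceptual sketch of metamer synthesis in plain PyTorch, under assumed names (synthesize_metamer, a toy pooling model standing in for a visual model). It illustrates only the underlying optimization and is not plenoptic's actual interface.

```python
# Conceptual sketch of model-metamer synthesis (not plenoptic's interface):
# starting from noise, adjust an image by gradient descent until a fixed,
# differentiable PyTorch model assigns it the same representation as a target image.
import torch

def synthesize_metamer(model, target_img, n_steps=500, lr=0.01):
    """Return an image whose model representation matches that of target_img."""
    with torch.no_grad():
        target_rep = model(target_img)                 # representation to be matched
    metamer = torch.rand_like(target_img, requires_grad=True)  # initialize from noise
    optimizer = torch.optim.Adam([metamer], lr=lr)
    for _ in range(n_steps):
        optimizer.zero_grad()
        loss = torch.mean((model(metamer) - target_rep) ** 2)  # representation mismatch
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            metamer.clamp_(0, 1)                       # keep pixel values in range
    return metamer.detach()

# Toy example: local luminance averaging stands in for a visual model; any
# image-computable, differentiable PyTorch module could be used instead.
model = torch.nn.AvgPool2d(kernel_size=8)
target = torch.rand(1, 1, 64, 64)
metamer = synthesize_metamer(model, target)
```

Because the loop only requires a differentiable, image-computable PyTorch model, the same recipe applies to any model meeting the requirements listed above.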

Talk 3

How is visual perception constructed by visually responsive neurons?

Arash Afraz1; 1NIH

Local perturbation of neural activity in high-level visual cortical areas alters visual perception. Quantitative characterization of these perceptual alterations holds the key to understanding the mapping between patterns of neuronal activity and elements of visual perception. Nevertheless, the complexity and subjective nature of these perceptual alterations make them elusive for scientific examination. Here, combining high-throughput behavioral optogenetics with cutting-edge machine learning tools, we introduce a new experimental approach, “Perceptography”, to develop graphical descriptors (pictures) of the subjective experience induced by local cortical stimulation in the inferior temporal cortex of macaque monkeys. According to the “labeled line hypothesis”, the causal contribution of inferior temporal neurons to visual perception is expected to be a constant feature determined by the best visual driver of each neuron. However, our results clearly demonstrate that the perceptual events induced by local neural stimulation in inferior temporal cortex depend strongly on the contents of concurrent visual perception, refuting the labeled line hypothesis.

Talk 4

Synthesizing stimuli for targeted comparison of biological and artificial perception

Jenelle Feather1; 1Center for Computational Neuroscience, Flatiron Institute

The past decade has given rise to artificial neural networks that transform sensory inputs into representations useful for complex visual behaviors. These models can improve our understanding of biological sensory systems and may provide a test bed for technology that aids sensory impairments, provided that model representations resemble those in the brain. Here, I will highlight recent lines of work probing aspects of complex model representations using model-optimized stimuli, and detail how these stimuli can be used to compare model representations with human representations. I will first describe work using “model metamers”: stimuli whose activations within a model stage are matched to those of a natural stimulus. Metamers for state-of-the-art supervised and unsupervised visual neural network models were often completely unrecognizable to humans when generated from late model stages, suggesting differences between model and human invariances. While targeted model changes improved human recognizability of model metamers, they did not fully eliminate the human–model discrepancy. Notably, human recognizability of a model’s metamers was well predicted by their recognizability by other models, suggesting that models contain idiosyncratic invariances in addition to those required by the task, and that removing these idiosyncrasies may lead to better models of visual perception. To conclude, I will discuss how behavioral results on model metamers, adversarial examples, and synthetic “out-of-distribution” stimuli can reveal differences between models even when traditional brain-based benchmarks of similarity do not, demonstrating how coupling behavioral measures with targeted stimuli can be an effective tool for comparing biological and artificial representations.
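
As a hypothetical illustration of the cross-model recognizability analysis mentioned above, the sketch below scores a set of metamer images with a second, pretrained classifier; the metamers and source_labels tensors are dummy stand-ins, and the choice of a torchvision ResNet-50 is arbitrary rather than one of the models used in this work.

```python
# Hypothetical sketch: score one model's metamers with a second, independent classifier.
# The dummy tensors below stand in for metamers generated from some source model
# and the ImageNet class indices of the natural images they were matched to.
import torch
from torchvision.models import resnet50, ResNet50_Weights

def cross_model_recognizability(metamers, source_labels):
    """Fraction of metamers that a second pretrained classifier assigns to the source class."""
    weights = ResNet50_Weights.DEFAULT
    classifier = resnet50(weights=weights).eval()
    preprocess = weights.transforms()                 # standard ImageNet preprocessing
    with torch.no_grad():
        logits = classifier(preprocess(metamers))
    return (logits.argmax(dim=1) == source_labels).float().mean().item()

# Dummy stand-ins, purely for illustration; real metamer images and labels go here.
metamers = torch.rand(8, 3, 224, 224)
source_labels = torch.randint(0, 1000, (8,))
print(cross_model_recognizability(metamers, source_labels))
```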

Talk 5

On the importance of dynamic range and sample statistics in fitting neuronal encoding models

Binxu Wang1, Carlos R. Ponce1; 1Harvard University

Understanding visual neurons involves examining how their responses vary with input stimuli. Treating neuronal response prediction as a regression problem, we identify two key data factors for achieving high R²: sufficient variance in the predictors (images) and in the responses (neuronal firing). Traditionally, we lack control over neuronal firing before image selection. Recently, neuron-guided image synthesis has allowed us to generate images that control neuronal responses in real time, providing an opportunity to study the effect of response range and image covariance on encoding models. From experiments on hidden units in CNNs and on visual cortical neurons, we found that randomly selected natural stimuli often underestimate the full response range, particularly in higher-level visual cortices. Neuron-guided optimization can find stimulus sets with higher activation and variance. Encoding models trained on images generated along the optimization trajectory can predict graded neuronal responses within the dynamic range for held-out generated images. However, randomly selected natural images, which lack this dynamic range, often yield worse R² values. We also compared, via feature visualization, encoding models trained on pre-selected natural images and on neuron-guided image samples from different generative spaces (e.g., DeepSim, BigGAN). Advanced generative models exhibit image statistics similar to those of natural images, helping generalization; however, these priors also bias the encoding model, leading it to infer that neurons respond to more complex features than they actually do. In contrast, simpler generative models lead to more parsimonious features in the encoding model, aiding interpretation. Our study offers valuable insights for the design of image sets for future vision experiments.
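
To make the regression framing concrete, the sketch below fits a ridge-regression encoding model and compares held-out R² for two simulated stimulus sets that differ only in how widely they span the response range; all names and data here (fit_encoding_model, the simulated features and responses) are illustrative assumptions rather than the study's actual pipeline.

```python
# Minimal sketch of the regression framing (hypothetical names and simulated
# data, not the study's pipeline): fit a ridge-regression encoding model and
# compare held-out R^2 for stimulus sets that differ only in response range.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def fit_encoding_model(features, responses):
    """Fit ridge regression from image features to one neuron's responses; return held-out R^2."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, responses, test_size=0.25, random_state=0)
    model = RidgeCV(alphas=np.logspace(-2, 4, 13)).fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))

# Simulated comparison: same linear "neuron" and same noise level, but the
# neuron-guided set spans a wider range of feature space (and hence of responses).
rng = np.random.default_rng(0)
w = rng.normal(size=50)                               # the neuron's (unknown) tuning
feats_natural = rng.normal(size=(300, 50))            # stand-in features of random natural images
feats_guided = 3.0 * rng.normal(size=(300, 50))       # neuron-guided images: broader spread
resp_natural = feats_natural @ w + rng.normal(scale=10.0, size=300)
resp_guided = feats_guided @ w + rng.normal(scale=10.0, size=300)
print(fit_encoding_model(feats_natural, resp_natural))  # narrower response range -> lower R^2
print(fit_encoding_model(feats_guided, resp_guided))    # wider response range -> higher R^2
```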

Talk 6

A Less Artificial Intelligence

Andreas Tolias1; 1Stanford University

Neural activity fundamentally shapes our perceptions, behaviors, and cognition, propelling one of neuroscience’s greatest quests: decrypting the neural code. This challenge is hindered by our limited ability to precisely record and manipulate extensive neuronal networks under complex conditions and to accurately model the relationships between stimuli, behaviors, and brain states within the natural world’s complexity. Recent advances have begun to address these barriers. Concurrently, advances in AI now enable analysis of these complex data, facilitating the construction of brain foundation models. These models, akin to AI systems like Video-LLaMA, which decipher video and language relationships, can systematically compile large-scale neural and behavioral data from diverse natural settings. These digital twins of the brain allow for unlimited in silico experiments and the application of AI interpretability tools, enhancing our understanding of neural computations. By applying these insights to AI, we aim to develop more robust, energy-efficient, and comprehensible systems, advancing beyond Big Tech’s practice of scaling models with just more behavioral data. Additionally, brain foundation models could revolutionize the diagnosis and treatment of neuropsychiatric disorders. To effectively build these models, we must now decisively move away from traditional hypothesis-driven neuroscience and commit to generating extensive, combined neural and behavioral data across a range of diverse natural tasks.