Foveated sensing with KNN-convolutional neural networks
Poster Presentation: Tuesday, May 20, 2025, 8:30 am – 12:30 pm, Banyan Breezeway
Session: Object Recognition: Models
Nicholas M. Blauch1, George A. Alvarez1, Talia Konkle1; 1Harvard University
Human vision prioritizes the center of gaze through spatially variant retinal sampling, leading to magnification of the fovea in cortical visual maps. In contrast, deep neural network models (DNNs) almost always operate on spatially uniform inputs, a severe mismatch that limits their use in understanding the active and foveated nature of human vision. Some work has explored foveated sampling in DNNs; however, these methods have been forced to wrangle retinal samples into grid-like representations, sacrificing faithful cortical retinotopy and creating undesirable warped receptive field shapes that depend on eccentricity. Here, we offer an alternative approach by adapting the model architecture to enable realistic foveated encoding of visual space. First, we use a spatially variant input sensor derived from the log-polar map model, which links retinal sampling to cortical magnification (Schwartz, 1980) but does not produce grid-like images. To handle the sensor’s outputs, we convert the spatial kernels used for convolution and pooling into k-nearest neighborhoods (KNNs) defined in pixel space, and generalize convolution to operate over KNNs. Filters are learned in a canonical reference frame and are spatially mapped into each neighborhood for perception. This approach allows us to build hierarchical KNN convolutional neural networks (KNN-CNNs) closely matched to their CNN counterparts. Architecturally, these models naturally exhibit realistic cortical retinotopy and desirable receptive field properties, such as exponentially increasing size and constant shape as a function of eccentricity. Training these models end-to-end on natural images, we find that they perform competitively with resource-matched CNNs trained on grid-like foveated images, and that their performance increases with multiple fixations. Broadly, this model class offers a more biologically aligned sampling of the visual world, enabling future computational work to model the active and spatial nature of human vision, with applications in understanding visual recognition, crowding, and visual search. Finally, this approach holds promise for building more neurally mappable models.
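For concreteness, the sketch below illustrates the two ingredients described above in PyTorch: log-polar sampling of (x, y) sensor positions, and a convolution generalized to k-nearest neighborhoods. It is a minimal illustration under our own assumptions, not the authors' implementation; in particular, sharing one learned weight per neighbor rank is only a crude stand-in for the abstract's spatial mapping of a canonical filter into each neighborhood, and every name, shape, and hyperparameter here is hypothetical.

```python
# Hypothetical minimal sketch of a log-polar sensor and a KNN-convolution
# layer, loosely following the abstract's description. Illustrative only.
import math

import torch
import torch.nn as nn


def log_polar_coords(n_rings=32, n_wedges=64, r_min=0.05, r_max=1.0):
    """Sample (x, y) sensor positions on a log-polar grid: radii spaced
    exponentially (approximating cortical magnification), angles uniform."""
    radii = torch.logspace(math.log10(r_min), math.log10(r_max), n_rings)
    angles = torch.linspace(0.0, 2.0 * math.pi, n_wedges + 1)[:-1]
    r, a = torch.meshgrid(radii, angles, indexing="ij")
    xy = torch.stack([r * torch.cos(a), r * torch.sin(a)], dim=-1)
    return xy.reshape(-1, 2)               # (N, 2), N = n_rings * n_wedges


class KNNConv(nn.Module):
    """Convolution generalized to k-nearest neighborhoods: each output unit
    aggregates its k nearest sensor positions. One weight per neighbor rank
    is an assumed simplification of mapping a canonical filter into each
    neighborhood."""

    def __init__(self, in_ch, out_ch, coords, k=9):
        super().__init__()
        # Precompute each position's k nearest neighbors in pixel space
        # (each unit is its own nearest neighbor, i.e., the kernel center).
        dists = torch.cdist(coords, coords)                        # (N, N)
        self.register_buffer("nbr_idx", dists.topk(k, largest=False).indices)
        self.weight = nn.Parameter(
            torch.randn(out_ch, in_ch, k) / math.sqrt(in_ch * k))
        self.bias = nn.Parameter(torch.zeros(out_ch))

    def forward(self, x):                  # x: (B, N, in_ch)
        nbrs = x[:, self.nbr_idx]          # gather neighbors: (B, N, k, in_ch)
        return torch.einsum("bnki,oik->bno", nbrs, self.weight) + self.bias


coords = log_polar_coords()               # (2048, 2) sensor positions
layer = KNNConv(in_ch=3, out_ch=16, coords=coords, k=9)
feats = torch.randn(4, coords.shape[0], 3) # a batch of sensor readouts
print(layer(feats).shape)                  # torch.Size([4, 2048, 16])
```

Because neighborhoods are defined by distances between sensor positions rather than by a grid, the same layer applies unchanged to the non-uniform log-polar sensor, which is what allows receptive fields to keep a constant shape while growing in size with eccentricity.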
Acknowledgements: This work was supported by NSF CRCNS 2309041.