Towards Holistic Vision in Deep Neural Networks: Disentangling Local and Global Processing

Poster Presentation: Tuesday, May 20, 2025, 8:30 am – 12:30 pm, Banyan Breezeway
Session: Object Recognition: Models

Fenil R. Doshi1,2, Thomas Fel1,2, Talia Konkle1,2, George A. Alvarez1,2; 1Harvard University, 2Kempner Institute for the Study of Natural and Artificial Intelligence

Configural shape processing, the integration of parts into cohesive wholes, is a hallmark of human vision. However, current deep neural networks, even those exhibiting shape bias (Geirhos et al., 2019), struggle to capture this capacity because they rely on spurious local features such as texture and color (Baker & Elder, 2022). This texture bias limits their ability to learn robust but more complex shape-based features (Shah et al., 2020). We propose a novel training routine that simultaneously trains two distinct network architectures: a target network, ConfigNet, a standard CNN with progressively expanding receptive fields, and an auxiliary network, BagNet, an all-convolutional architecture whose fixed, restricted receptive field enforces local featural processing. We introduce Divergence–Variance–Covariance Loss (DVCL), a novel objective function that disentangles local and global processing across the two networks. DVCL enforces orthogonality between the models’ intermediate representations by decorrelating redundant features while preserving variance. We hypothesize that, because ConfigNet cannot rely on the local strategy employed by BagNet, it must instead learn a more global encoding strategy. We test this hypothesis on a colored MNIST dataset in which both shape and color perfectly predict the digit category. After training, BagNet relies primarily on color for classification, while ConfigNet relies on shape and is more robust to color changes. When colors are shuffled at test time, decoupling them from digit identity, BagNet’s performance drops to chance (10%), with misclassifications strongly predicted by the shuffled color patterns. In contrast, ConfigNet maintains 42.6% accuracy with a near-diagonal confusion matrix, indicating reliance on shape. When shapes are removed, leaving only color, ConfigNet’s accuracy drops to 21.67%, while BagNet achieves 80.17%. This double dissociation highlights ConfigNet’s global, shape-based strategy and BagNet’s local, color-based strategy. Our findings demonstrate DVCL’s potential to steer networks toward global configural processing, fostering robust, shape-based representations and advancing vision architectures toward human-like perception.
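To make the objective more concrete, the sketch below gives one plausible PyTorch formulation of a DVCL-style loss, modeled loosely on variance-covariance regularization objectives from the self-supervised learning literature. The function name, term weights, and the exact form of each term are assumptions for illustration; the abstract does not specify the authors' formulation.

```python
# Hypothetical sketch of a DVCL-style objective. All names, weights, and the
# exact form of each term are assumptions, not the authors' implementation.
import torch
import torch.nn.functional as F


def dvcl_loss(z_config, z_bagnet, var_w=1.0, cov_w=1.0, div_w=1.0, eps=1e-4):
    """Decorrelate two networks' features while preserving variance.

    z_config, z_bagnet: (batch, dim) intermediate representations from
    ConfigNet and BagNet, taken at matched layers.
    """
    # Center each representation across the batch.
    zc = z_config - z_config.mean(dim=0)
    zb = z_bagnet - z_bagnet.mean(dim=0)
    n = zc.shape[0]

    # Variance term: hinge each feature's std at 1 so neither network can
    # satisfy the decorrelation terms by collapsing its representation.
    std_c = torch.sqrt(zc.var(dim=0) + eps)
    std_b = torch.sqrt(zb.var(dim=0) + eps)
    var_loss = F.relu(1.0 - std_c).mean() + F.relu(1.0 - std_b).mean()

    # Divergence term: penalize cross-covariance between the two networks,
    # pushing ConfigNet's features to be orthogonal to BagNet's local features.
    cross_cov = (zc.T @ zb) / (n - 1)
    div_loss = cross_cov.pow(2).mean()

    # Covariance term: suppress off-diagonal covariance within each network,
    # decorrelating redundant feature dimensions.
    def off_diag_cov(z):
        cov = (z.T @ z) / (n - 1)
        off = cov - torch.diag(torch.diag(cov))
        return off.pow(2).sum() / z.shape[1]

    cov_loss = off_diag_cov(zc) + off_diag_cov(zb)

    return var_w * var_loss + cov_w * cov_loss + div_w * div_loss
```

In a training loop of the kind the abstract describes, a term like this would presumably be added to each network's standard classification loss, with feature maps pooled or flattened to (batch, dim) before the comparison; how the layers are matched and weighted is likewise an assumption here.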

Acknowledgements: NSF PAC COMP-COG 1946308 to GAA; NSF CAREER BCS-1942438 to TK; Kempner Institute Graduate Fellowship to FD