Convolutional neural networks lack human-level robustness at recognizing reversed-contrast and two-tone Mooney faces

Undergraduate Just-In-Time Abstract

Poster Presentation: Tuesday, May 20, 2025, 2:45 – 6:45 pm, Banyan Breezeway
Session: Undergraduate Just-In-Time 2

Zelin (Linda) Zhao1, Connor Parde1, Frank Tong1; 1Vanderbilt University

Although convolutional neural networks (CNNs) have attained human-level accuracy at identifying faces (e.g., O'Toole & Castillo, 2021), it remains unclear whether their visual strategies match those of human observers. Here, we compared human and CNN performance using image manipulations known to challenge human perception. Mooney faces are two-tone, black-and-white images that can be perceived as faces despite lacking clear facial features (Mooney, 1957). Reversing contrast polarity (i.e., photographic negation) further impairs face recognition by disrupting the interpretation of shading and shadows. We tested humans and CNNs with four conditions: blurry grayscale images (Gaussian blur, sigma = 2 pixels), blurry grayscale images with reversed contrast polarity, thresholded Mooney images, and Mooney images with reversed contrast polarity. Participants viewed face images of 10 different celebrities and were asked to report the face identity. Each face image was presented with one of the four image manipulations. Likewise, the entire set of images was presented to two different CNNs (AlexNet and VGG19), both trained on the FaceScrub database. Humans outperformed CNNs overall, with the performance gap becoming more pronounced under more challenging conditions. Average human accuracy was 97.1, 69.8, 76.2, and 53.1% for blurry faces, reversed-contrast blur, Mooney, and reversed-contrast Mooney, respectively, whereas CNN accuracy was 96.7, 24.2, 37.5, and 18.0%. CNN face identification was disproportionately impaired by contrast reversal, with accuracy dropping sharply in both reversed-contrast conditions. By comparison, humans showed greater robustness to these manipulations, highlighting fundamental differences in face-processing strategies. Our findings reveal the vulnerability of CNNs when tasked with identifying faces in ambiguous contexts and generalizing to novel visual conditions.
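The four stimulus conditions described above can be sketched with standard image-processing operations. The snippet below is an illustrative reconstruction, not the authors' actual stimulus-generation code: the Gaussian sigma of 2 pixels is taken from the abstract, but the choice of a median threshold for the Mooney binarization and the [0, 1] intensity scaling are assumptions made here for the sketch.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blur(img, sigma=2.0):
    """Gaussian blur; sigma = 2 pixels, as stated in the abstract."""
    return gaussian_filter(img, sigma=sigma)

def reverse_contrast(img):
    """Photographic negation for images scaled to [0, 1]."""
    return 1.0 - img

def mooney(img, sigma=2.0, threshold=None):
    """Two-tone (Mooney) rendering: smooth, then binarize.
    Thresholding at the median is an illustrative assumption,
    not necessarily the procedure used in the study."""
    smoothed = gaussian_filter(img, sigma=sigma)
    t = np.median(smoothed) if threshold is None else threshold
    return (smoothed > t).astype(np.float32)

# Apply all four conditions to a stand-in grayscale "face" image.
rng = np.random.default_rng(0)
face = rng.random((128, 128)).astype(np.float32)
conditions = {
    "blur": blur(face),
    "blur_reversed": reverse_contrast(blur(face)),
    "mooney": mooney(face),
    "mooney_reversed": reverse_contrast(mooney(face)),
}
```

In this sketch the Mooney images are strictly binary (values 0 and 1), so contrast reversal simply swaps the two tones, whereas for the blurred grayscale images it inverts the full intensity gradient that carries shading cues.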

Acknowledgements: Supported by NEI grant R01EY035157 to F.T.