CHARM is essential for human perception and aesthetics, and can be implemented in computer vision too
Poster Presentation: Tuesday, May 20, 2025, 8:30 am – 12:30 pm, Banyan Breezeway
Session: Perceptual Organization: Aesthetics
Fatemeh Behrad1, Tinne Tuytelaars1, Johan Wagemans1; 1KU Leuven
Human perception is shaped by a complex interaction of bottom-up and top-down processes. Our visual system actively focuses on critical details, preserves spatial relationships, and interprets scenes at multiple scales. This process relies on selective attention and eye movements, which allow us to focus on essential elements while integrating them into cohesive scenes. All of these characteristics of the human visual system are equally important, but they are not (yet) incorporated into computer vision systems, in spite of claims to the contrary. The ability to balance fine details with a broader compositional awareness is key to our aesthetic appreciation as well. Computer vision and machine learning models typically use fixed-size inputs obtained through downscaling or cropping, a practice that ignores the essential ways in which human brains handle large-scale inputs. This standard approach can lead to significant information loss, limiting models’ sensitivity to detail and spatial organization, as well as their capacity for aesthetic assessment. We introduce CHARM, a novel approach inspired by principles of human visual perception that preserves Composition, High-resolution, Aspect Ratio, and Multiscale information for Vision Transformers (ViTs). CHARM enables ViTs to mimic the brain’s selective processing strategy by retaining high-resolution details in important regions and downscaling less relevant areas. This approach avoids the need for cropping or altering the aspect ratio, providing the model with richer contextual and compositional cues that enhance its performance and generalizability in image aesthetic assessment. In experiments on multiple image aesthetic assessment datasets, CHARM achieves up to an 8.1% performance improvement.
CHARM marks an advancement in ViTs' capability to process visual information in a way that mirrors human perceptual efficiency, emphasizing the importance of high-resolution details, compositional integrity, and multiscale processing in both human and artificial vision systems.
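The selective strategy described above (full resolution kept in important regions, less relevant regions downscaled, no cropping or aspect-ratio change) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `mixed_resolution_patches`, its parameters, the use of pixel variance as the importance score, and the edge padding to a multiple of the patch size are all assumptions made for illustration; the abstract does not say how CHARM selects important regions.

```python
import numpy as np


def mixed_resolution_patches(image, patch=32, keep_ratio=0.25, factor=2):
    """Tile an image and keep only the most 'important' tiles at full resolution.

    Hypothetical illustration: importance is approximated by pixel variance,
    which is an assumption and not the criterion used by CHARM.
    Returns (tokens, (rows, cols)); each token records its grid position,
    so the spatial layout of the original image is preserved.
    """
    if patch % factor != 0:
        raise ValueError("patch must be divisible by factor")

    img = np.asarray(image, dtype=np.float32)
    if img.ndim == 2:  # grayscale: add a channel axis
        img = img[..., None]
    h, w, c = img.shape

    # Assumption: pad by edge replication up to a multiple of `patch`
    # instead of cropping or resizing, so no content is discarded.
    img = np.pad(img, ((0, (-h) % patch), (0, (-w) % patch), (0, 0)), mode="edge")
    rows, cols = img.shape[0] // patch, img.shape[1] // patch

    # Rearrange into a grid of tiles: (rows, cols, patch, patch, c).
    tiles = img.reshape(rows, patch, cols, patch, c).swapaxes(1, 2)

    # Stand-in importance score: variance of each tile.
    scores = tiles.var(axis=(2, 3, 4)).ravel()
    n_keep = max(1, int(round(keep_ratio * scores.size)))
    keep = set(np.argsort(scores)[::-1][:n_keep].tolist())

    small = patch // factor
    tokens = []
    for idx in range(rows * cols):
        r, col = divmod(idx, cols)
        tile = tiles[r, col]
        if idx not in keep:
            # Less important tile: average-pool by `factor` in both directions.
            tile = tile.reshape(small, factor, small, factor, c).mean(axis=(1, 3))
        tokens.append(
            {"row": r, "col": col, "full_res": idx in keep, "pixels": tile}
        )
    return tokens, (rows, cols)
```

In a ViT setting, the full-resolution and downscaled tiles would each need further splitting into fixed-size tokens with matching positional information. That step is omitted here because the abstract does not describe it.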