Visual Relation Detection in Humans and Deep Learning Models
Poster Presentation: Tuesday, May 20, 2025, 8:30 am – 12:30 pm, Banyan Breezeway
Session: Perceptual Organization: Individual differences, events and relations
Hongjing Lu1, Tanaya Jog2, Zehui Zhao3, Shuhao Fu4; 1UCLA
Humans can rapidly detect visual relations in images, but how the visual system represents these relations between objects remains unclear. We conducted two experiments to investigate the detection of spatial and agentic relations in both naturalistic and controlled settings. In Experiment 1, participants viewed a sequence of realistic images, each displayed for 67 ms, and performed a Go/No-Go task, pressing a button upon detecting a target relation in an image. Participants detected agentic relations more accurately than spatial relations. In Experiment 2, we removed the complex backgrounds and used synthetic images containing two objects that were either spatially related (e.g., "on top of," "next to") or unrelated. On each trial, participants viewed a pair of images containing the same objects in different relational configurations, separated by a mask image, and identified which image appeared first. Humans achieved above-chance performance even with image display durations as brief as 40 ms and reached a performance plateau at durations exceeding 100 ms. We then examined whether deep learning models can account for human performance in relation detection. Four models were evaluated: two vision models (ResNet and Vision Transformer) without explicit relational representations, and two relation detection models with explicit relational embeddings, one with a closed vocabulary based on a fixed set of object concepts and one with an open vocabulary without constraints on object concepts. Each model computed the similarity of the two images in each trial. While the two vision models showed some correlation with human performance, the relation model with an open vocabulary of object concepts showed the strongest correlation (r=0.38, p<.001). In contrast, the relation model with a closed vocabulary failed to account for human performance. These findings suggest that humans' sensitivity in detecting visual relations results from flexible representations of object and relational concepts, as exemplified by the open-vocabulary relational model.
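The abstract states that each model computed a similarity score for the two images in a trial and that these scores were correlated with human performance. The sketch below illustrates that style of analysis only; it assumes a pretrained ResNet-50 as the feature extractor, cosine similarity over pooled embeddings, and a hypothetical `trials` list of (image_a, image_b, human_accuracy) tuples, none of which are specified by the authors.

```python
# Minimal sketch (not the authors' code): estimate image-pair similarity with a
# pretrained vision model and correlate it with per-trial human accuracy.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from scipy.stats import pearsonr

# Pretrained ResNet-50 used as a generic feature extractor (no relational head).
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()  # drop the classification layer, keep pooled features
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(path: str) -> torch.Tensor:
    """Return an L2-normalized embedding for one image."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return torch.nn.functional.normalize(resnet(x), dim=-1).squeeze(0)

def pair_similarity(path_a: str, path_b: str) -> float:
    """Cosine similarity between the two images shown in a trial."""
    return float(embed(path_a) @ embed(path_b))

def correlate_with_humans(trials):
    """trials: list of (image_a, image_b, human_accuracy) -- hypothetical format."""
    model_sims = [pair_similarity(a, b) for a, b, _ in trials]
    human_acc = [acc for _, _, acc in trials]
    return pearsonr(model_sims, human_acc)  # returns (r, p)
```

The relation detection models described in the abstract would replace the generic embedding step with explicit relational embeddings; the correlation step would remain the same.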
Acknowledgements: NSF BCS 2142269