Visual Cognition in Vision-Language Models

Poster Presentation: Saturday, May 17, 2025, 8:30 am – 12:30 pm, Pavilion
Session: Theory

Krista A. Ehinger¹; ¹The University of Melbourne

Large language models (LLMs) show human-level performance on a range of language tasks such as question answering, text editing, and text composition. These models are trained on massive text datasets and show an impressive ability to flexibly recombine what they have learned in novel ways to perform arbitrary tasks (Brown et al., 2020). The latest generation of LLMs is multimodal, able to process images as well as text. Do these vision-language models (VLMs) learn similarly flexible representations for visual tasks?