Assaying the effect of recent sensory history on object categorization via human psychophysics and computational modeling
Poster Presentation: Tuesday, May 20, 2025, 8:30 am – 12:30 pm, Banyan Breezeway
Session: Object Recognition: Models
Lynn K. A. Sörensen1, Michael J. Lee1,2, James J. DiCarlo1,2; 1McGovern Institute for Brain Research at MIT, 2MIT Quest for Intelligence
Sensory history is thought to strongly influence object perception. While the field has image-computable models of how single images map to behavioral reports, we lack a comparable understanding of how dynamic image sequences affect object perception. Here, we take initial steps to address this gap: we measured how sensory history affects human categorization performance (online psychophysics, N=500). Using 300 naturalistic videos and a binary object detection task, we compared pre-cued categorization reports based on video clips (200–1600 ms) ending at a particular target frame with reports on the same frame shown statically for 200 ms. Surprisingly, single-frame-based reports explained substantial behavioral variance, even for the longest clips, challenging the notion that object recognition heavily depends on sensory history. Still, reports based on longer sensory histories increasingly diverged from frame-based reports, yielding performance gains suggestive of evidence accumulation over time. Next, we asked what mechanisms might explain these effects of sensory history. We hypothesized that frame-based encoding (e.g., via the ventral visual stream), combined with downstream temporal integration mechanisms, might account for the differences that emerge with longer sensory histories. To test specific instantiations of this hypothesis, we augmented a pre-trained artificial neural network with diverse temporal decoders, including max-pooling, mean integration, leaky integrators, and recurrent architectures (RNNs, GRUs, LSTMs), each optimized for categorization on a separate set of videos (40 repetitions). Interestingly, we found that, unlike simpler decoders, non-linear temporal decoders increasingly captured the unique behavioral variance emerging with extended sensory history. Still, compared to human frame-based reports, frame-based ANN predictions (without temporal decoders) proved much less powerful at explaining human behavior overall, highlighting weaknesses of current image-based encoding models. Leveraging powerful, rapid, frame-based inferences as a foundation, our results demonstrate how sensory history could enrich object recognition through dynamic temporal integration of high-level visual representations.
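To make the modeling approach concrete, the sketch below illustrates (under loose assumptions; this is not the study's implementation) how a frame-based encoder can be combined with alternative temporal decoders, a linear leaky integrator and a non-linear GRU, each ending in a binary "target present" readout. The encoder, feature dimensions, and hyperparameters are placeholders: in the study the encoder is a pre-trained ANN and each decoder is optimized for categorization on held-out videos.

```python
# Minimal sketch (assumptions, not the authors' code): a frame-based encoder
# feeding alternative temporal decoders, one linear (leaky integrator) and one
# non-linear (GRU), each ending in a binary categorization readout.
import torch
import torch.nn as nn


class FrameEncoder(nn.Module):
    """Stand-in for a pre-trained, image-computable ANN (e.g., a ventral-stream
    model); here just a small convnet producing one feature vector per frame."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, frames):                 # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = self.conv(frames.flatten(0, 1)).flatten(1)    # (B*T, 64)
        return self.proj(x).view(b, t, -1)                 # (B, T, feat_dim)


class LeakyIntegratorDecoder(nn.Module):
    """Linear temporal decoder: exponentially weighted running average of
    frame features, followed by a binary readout."""
    def __init__(self, feat_dim=256, leak=0.8):
        super().__init__()
        self.leak = leak
        self.readout = nn.Linear(feat_dim, 1)

    def forward(self, feats):                  # feats: (B, T, D)
        state = torch.zeros_like(feats[:, 0])
        for t in range(feats.shape[1]):
            state = self.leak * state + (1 - self.leak) * feats[:, t]
        return self.readout(state)             # (B, 1) logit: target present?


class GRUDecoder(nn.Module):
    """Non-linear temporal decoder: a GRU over frame features whose final
    hidden state drives the binary readout."""
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, 1)

    def forward(self, feats):                  # feats: (B, T, D)
        _, h = self.gru(feats)                 # h: (1, B, hidden)
        return self.readout(h[-1])             # (B, 1)


if __name__ == "__main__":
    encoder = FrameEncoder().eval()            # would be frozen & pre-trained
    clips = torch.randn(4, 8, 3, 64, 64)       # 4 dummy clips x 8 frames each
    with torch.no_grad():
        feats = encoder(clips)
    for decoder in (LeakyIntegratorDecoder(), GRUDecoder()):
        print(decoder.__class__.__name__, decoder(feats).shape)  # torch.Size([4, 1])
```

The contrast between the two decoders mirrors the comparison in the abstract: a simple linear integrator versus a recurrent, non-linear integrator operating over extended sensory history, both reading out from the same frame-based high-level representations.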
Acknowledgements: This work was funded in part by the National Science Foundation-STC [CCF-1231216] and [2124136] (JJD) and a DFG Walter-Benjamin Fellowship 547591872 (LKAS).