Statistics of egocentric video during everyday tasks
Poster Presentation: Tuesday, May 20, 2025, 2:45 – 6:45 pm, Pavilion
Session: Scene Perception: Natural images, virtual environments
Xueyan Niu1,2, Lili Zhang1, Charlie Burlingham1, Romain Bachy1, James Hillis1; 1Reality Labs, Meta Platforms Inc., 2New York University
Power spectrum analysis of curated images provides good insight into the statistical regularities of visual inputs and explains many observations about visual processing, but those images differ strongly from the dynamic visual experience of active behavior in daily life. In this work, we revisited these frequency characterizations on a very large egocentric video dataset recorded with the lightweight glasses designed for Project Aria (Engel et al., 2023), worn by people performing a variety of everyday activities (gardening, sports, cooking, etc.). We found that: 1) While photographs of different image categories have distinct average spatial power spectra (Torralba & Oliva, 2003), our analyses of video frames showed similar power spectra (in both orientation and spatial scale) across different activities. Notably, previous work analyzing photographs found that man-made indoor environments have a steeper spectral falloff than natural outdoor ones and suggested that this may be one driver of myopia (Flitcroft, Harb, & Wildsoet, 2020). However, we found much smaller differences between indoor and outdoor egocentric images, emphasizing how different data sources may lead to different conclusions. 2) The spatiotemporal power spectrum of egocentric videos showed a significant boost at mid-to-high temporal frequencies and an essentially whitened temporal spectrum at mid-to-high spatial frequencies, inconsistent with previous analyses of curated videos that identified an inseparable relationship between the spatial and temporal power spectra (Dong & Atick, 1995). This whitening effect was previously observed only when accounting for eye movements in participants wearing a heavy head-mounted device (DuTell et al., 2020). The differences we found between the statistics of egocentric video during natural interaction and those of curated images and videos point to the importance of using data representative of natural human behavior and, we believe, lay a better foundation for understanding the mechanisms of visual encoding.
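For readers unfamiliar with these measures, the sketch below illustrates the general form of such an analysis: a windowed FFT of a frame for the spatial power spectrum, radial averaging to collapse orientation onto spatial frequency, and a 3D FFT over a short clip for the joint spatiotemporal spectrum. This is a minimal illustration, not the authors' pipeline; it assumes grayscale frames already loaded as NumPy arrays and omits dataset handling, gaze/head-motion compensation, and averaging over many clips. All function names are illustrative.

```python
# Minimal sketch of spatial and spatiotemporal power spectrum estimation.
# Assumptions (not from the abstract): grayscale float frames/clips as
# NumPy arrays; frequencies in cycles/pixel; no eye- or head-movement
# compensation. Names here are illustrative, not the authors' code.
import numpy as np

def hann2d(h, w):
    """Separable 2D Hann window to reduce FFT edge artifacts."""
    return np.outer(np.hanning(h), np.hanning(w))

def spatial_power(frame):
    """2D power spectrum of one frame, DC component at the center."""
    f = np.fft.fftshift(np.fft.fft2(frame * hann2d(*frame.shape)))
    return np.abs(f) ** 2

def radial_profile(power, n_bins=50):
    """Average power over annuli -> power vs. spatial frequency."""
    h, w = power.shape
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    r = np.hypot(fy, fx)                      # radial spatial frequency
    edges = np.linspace(0.0, 0.5, n_bins + 1) # up to Nyquist (0.5 cyc/px)
    idx = np.digitize(r.ravel(), edges)
    p = power.ravel()
    return np.array([p[idx == i].mean() for i in range(1, n_bins + 1)])

def spatiotemporal_power(clip, n_sf=40):
    """Power vs. (temporal frequency, spatial frequency) for a (T,H,W) clip."""
    t, h, w = clip.shape
    win = np.hanning(t)[:, None, None] * hann2d(h, w)[None]
    f = np.fft.fftshift(np.fft.fftn(clip * win))
    power = np.abs(f) ** 2
    # Collapse the two spatial axes to radial spatial frequency, one row
    # per (shifted) temporal frequency band.
    return np.stack([radial_profile(power[k], n_sf) for k in range(t)])
```

In practice one would average these spectra over many frames and clips per activity and fit the falloff slope on log-log axes; a separable spectrum would factor into a product of the spatial and temporal marginals, which is the structure the abstract reports breaking down at mid-to-high spatial frequencies.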