Visual features for egocentric action recognition

Poster Presentation: Sunday, May 18, 2025, 8:30 am – 12:30 pm, Pavilion
Session: Action: Perception and recognition

Filip Rybansky¹, Sadegh Rahmani², Andrew Gilbert², Frank Guerin², Anya Hurlbert¹, Quoc Vuong¹; ¹Newcastle University, ²University of Surrey

People quickly recognize everyday actions (e.g., washing dishes) in visual stimuli. Evidence suggests that Minimal Recognizable Configurations (MIRCs) contain the spatial and temporal features needed for reliable recognition (Ben-Yosef et al., 2020) and could therefore be used to identify features that improve computer vision. To further investigate the contribution of MIRCs to action recognition, we progressively reduced the spatial information available in egocentric action videos. From the Epic-Kitchens-100 dataset (Damen et al., 2022), we selected 18 videos that our computer vision network (Ahmadian et al., 2023) categorized correctly (Easy) and 18 that it categorized incorrectly (Hard). Participants (N=3800) viewed Easy and Hard videos online and identified the action. Videos recognized by ≥50% of participants were cropped into four reduced quadrants for the next round of data collection. A recognizable quadrant was identified as a spatial MIRC if none of its sub-quadrants (sub-MIRCs) were recognized. Videos were reduced until we obtained at least one spatial MIRC per video. In total we tested 7604 video quadrants, identifying on average 15.17 spatial MIRCs per video. Using the recognition gap, the difference in accuracy between a MIRC and its sub-MIRC quadrants, we extended Ben-Yosef et al.'s findings to complex egocentric action videos: recognition gaps were significantly greater for Hard (Md=0.40, IQR=0.25) than Easy (Md=0.35, IQR=0.20) videos, suggesting that feature importance is concentrated in a smaller subset of features in Hard than in Easy videos. A SHAP importance analysis of surface-area features for recognizability supported this claim and indicated that the visibility of the active hand and the active object is important for action recognition. Finally, Graph-Based Visual Saliency (Harel et al., 2006) was significantly greater in MIRCs (Md=0.32, IQR=0.19) than in randomly located, size-matched quadrants (Md=0.26, IQR=0.19), and in Hard (Md=0.33, IQR=0.20) than in Easy (Md=0.31, IQR=0.19) MIRCs. Our results suggest that MIRCs reveal the visual features important for egocentric action recognition.
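
To make the reduction procedure concrete, the following is a minimal Python sketch of the iterative quadrant search described above. Everything here is illustrative: the function names, the array layout, and recognition_rate are hypothetical stand-ins, and recognition_rate abstracts an entire round of online data collection.

    RECOGNITION_THRESHOLD = 0.5  # a quadrant counts as recognized if >=50% of viewers name the action

    def quadrants(region):
        # Split a region (a numpy array of shape (frames, height, width))
        # into its four spatial quadrants.
        h, w = region.shape[1] // 2, region.shape[2] // 2
        return [region[:, :h, :w], region[:, :h, w:],
                region[:, h:, :w], region[:, h:, w:]]

    def find_spatial_mircs(clip, recognition_rate):
        # recognition_rate(q) stands in for one data-collection round:
        # the fraction of participants who correctly identified the action in q.
        mircs, frontier = [], [clip]
        while frontier:
            region = frontier.pop()
            if recognition_rate(region) < RECOGNITION_THRESHOLD:
                continue  # unrecognized regions are not reduced further
            subs = quadrants(region)
            if all(recognition_rate(s) < RECOGNITION_THRESHOLD for s in subs):
                mircs.append(region)  # spatial MIRC: recognizable, but no sub-quadrant is
            else:
                frontier.extend(subs)  # keep reducing recognizable sub-quadrants
        return mircs

In practice each call to recognition_rate would be cached, since every rating corresponds to a crowdsourced experiment rather than a cheap function call.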
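
The recognition-gap comparison can be sketched in the same spirit. Two caveats: the abstract does not state how the gap aggregates over a MIRC's four sub-MIRCs, nor which significance test was used; the sketch below assumes the maximum sub-MIRC accuracy and a Mann-Whitney U test, and the numbers are placeholders, not study data.

    import numpy as np
    from scipy.stats import mannwhitneyu

    def recognition_gap(mirc_acc, sub_accs):
        # Drop in accuracy from a MIRC to its best-recognized sub-MIRC
        # (assumed aggregation; the abstract only says "difference in accuracy").
        return mirc_acc - max(sub_accs)

    def median_iqr(values):
        # Summary statistics in the form reported above, e.g. Md=0.40, IQR=0.25.
        q1, md, q3 = np.percentile(values, [25, 50, 75])
        return md, q3 - q1

    # One gap per identified MIRC; placeholder values only.
    hard_gaps = [0.42, 0.38, 0.55, 0.36]
    easy_gaps = [0.30, 0.36, 0.33, 0.37]
    stat, p = mannwhitneyu(hard_gaps, easy_gaps, alternative="greater")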
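
For the feature-importance step, a generic SHAP workflow looks like the following. The choice of model and the surface-area feature set (e.g., fraction of a quadrant occupied by the active hand or active object) are assumptions, since the abstract does not detail them, and the data below are synthetic.

    import numpy as np
    import shap
    from sklearn.ensemble import RandomForestRegressor

    # X: per-quadrant surface-area features (hypothetical columns:
    # active-hand area, active-object area, background area);
    # y: the quadrant's recognition rate from the online experiments.
    rng = np.random.default_rng(0)
    X = rng.random((200, 3))
    y = 0.6 * X[:, 0] + 0.3 * X[:, 1] + 0.1 * rng.random(200)  # synthetic stand-in

    model = RandomForestRegressor(n_estimators=100).fit(X, y)
    explainer = shap.TreeExplainer(model)       # SHAP values for tree ensembles
    shap_values = explainer.shap_values(X)      # per-feature contribution to predicted recognizability
    mean_abs_importance = np.abs(shap_values).mean(axis=0)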
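
Finally, the saliency comparison reduces to scoring MIRC regions against randomly placed, size-matched regions on the same saliency maps. Public implementations of Graph-Based Visual Saliency exist (Harel et al., 2006); here the map itself is taken as given, and the box coordinates are hypothetical.

    import numpy as np

    def box_saliency(sal_map, box):
        # Median saliency inside a (top, left, height, width) crop.
        t, l, h, w = box
        return float(np.median(sal_map[t:t + h, l:l + w]))

    def random_matched_box(sal_map, h, w, rng):
        # Randomly located box with the same size as a MIRC quadrant.
        H, W = sal_map.shape
        t = int(rng.integers(0, H - h + 1))
        l = int(rng.integers(0, W - w + 1))
        return (t, l, h, w)

    rng = np.random.default_rng(0)
    sal_map = rng.random((256, 256))   # placeholder for a GBVS map of one frame
    mirc_box = (64, 64, 64, 64)        # hypothetical MIRC location
    control = random_matched_box(sal_map, 64, 64, rng)
    print(box_saliency(sal_map, mirc_box), box_saliency(sal_map, control))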

Acknowledgements: Leverhulme Trust