Farnoosh Heidarivincheh, Majid Mirmehdi, Dima Damen
Robust motion representations for action recognition have achieved remarkable performance in both controlled and in-the-wild scenarios.
Such representations are primarily assessed for their ability to label a sequence according to some predefined action classes (e.g. walk, wave, open). Although increasingly accurate, these classifiers are likely to label a sequence even when the action has not been fully completed, because the observed motion is similar enough to the training set. Consider the case where one attempts to drink but realises the beverage is too hot. A drinking-vs-all classifier is likely to recognise this action as drinking regardless.
We introduce the term action completion as a step beyond the task of action recognition. It aims to recognise whether the action's goal has been successfully achieved. The notion of completion differs per action and could be infeasible to verify using a visual sensor; however, for many actions, an observer would be able to make the distinction by noticing subtle differences in motion.
Since the notion of completion differs per action, a general action completion method should investigate the performance of different types of features to accommodate the various action classes. For example, for the action pick, the difference between complete and incomplete actions originates from the subtle change in body pose when holding an object, or from observing an object in the hand. On the other hand, for the action drink, the speed at which the action is performed better indicates completion. We propose a method that selects the feature(s) suitable for recognising completion from a pool of depth features, using leave-one-person-out cross-validation on the training set to automatically identify the most discriminative feature(s).
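The selection step above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the feature names ("speed", "pose"), the scalar feature values, and the midpoint-threshold classifier are hypothetical stand-ins for the depth features and classifier used in practice. The structure it shows is the stated one: for each candidate feature, estimate accuracy by holding out one person at a time, then keep the feature with the best cross-validated score.

```python
# Hypothetical sketch of per-action feature selection via
# leave-one-person-out cross-validation. Each sample is
# (person_id, {feature_name: value}, label), label 1 = complete, 0 = incomplete.

def loo_person_accuracy(samples, feature):
    """Mean accuracy of a simple threshold classifier on `feature`,
    training on all persons except one and testing on the held-out person."""
    persons = sorted({p for p, _, _ in samples})
    accs = []
    for held_out in persons:
        train = [(f[feature], y) for p, f, y in samples if p != held_out]
        test = [(f[feature], y) for p, f, y in samples if p == held_out]
        # Stand-in classifier: threshold at the midpoint of the class means.
        n1 = max(1, sum(y for _, y in train))
        n0 = max(1, sum(1 - y for _, y in train))
        mean1 = sum(v for v, y in train if y == 1) / n1
        mean0 = sum(v for v, y in train if y == 0) / n0
        thr = (mean0 + mean1) / 2.0
        sign = 1 if mean1 >= mean0 else -1
        correct = sum(1 for v, y in test
                      if (1 if sign * (v - thr) >= 0 else 0) == y)
        accs.append(correct / len(test))
    return sum(accs) / len(accs)

def select_feature(samples, features):
    """Return the candidate feature with the highest cross-validated accuracy."""
    return max(features, key=lambda f: loo_person_accuracy(samples, f))

# Toy data (invented): "speed" separates complete from incomplete drinking,
# "pose" does not, so the selector should pick "speed".
samples = [
    (1, {"speed": 1.0, "pose": 0.4}, 1), (1, {"speed": 0.2, "pose": 0.5}, 0),
    (2, {"speed": 0.9, "pose": 0.6}, 1), (2, {"speed": 0.3, "pose": 0.4}, 0),
    (3, {"speed": 1.1, "pose": 0.5}, 1), (3, {"speed": 0.1, "pose": 0.6}, 0),
]
print(select_feature(samples, ["speed", "pose"]))  # → speed
```

Splitting by person rather than by sample matters here: it prevents the selected feature from exploiting person-specific idiosyncrasies, so the chosen feature is the one that generalises across subjects.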