Language Motivated Approaches for Human Action Recognition and Spotting
Malgireddy, Manavender Reddy
Action recognition has become an important area of computer vision research. Given a sequence of images of people performing different actions over time, can a system be designed to automatically recognize which action is being performed and in which frames it occurred? To date, much of the computer vision community has approached this problem from a single-action perspective, reducing it to classifying an image sequence that contains exactly one action: given a sequence, it is assumed that only one major action from a known set of classes occurs in it. This dissertation targets not only the recognition of actions but also the spotting (localization) of actions in video data. Our approach shares sub-actions across actions to uncover the underlying motion patterns and uses these patterns for both recognition and spotting.

First, as a proof of concept, we build a framework that models an action as a predefined sequence of sub-actions, and we show experimentally that this framework is useful for action recognition and spotting. We then extend this approach to learn sub-actions automatically rather than defining them manually. To gain statistical insight into the underlying motion patterns in actions, we develop a dynamic, hierarchical Bayesian model that connects low-level visual features in videos with poses, motion patterns, and classes of activities. The process is analogous to discovering topics or categories in documents from their word content, except that our documents are dynamic. The proposed generative model combines the temporal-ordering power of dynamic Bayesian networks, such as hidden Markov models (HMMs), with the automatic clustering power of hierarchical Bayesian models, such as latent Dirichlet allocation (LDA).

We also introduce a probabilistic framework for detecting and localizing pre-specified actions (or gestures) in a video sequence, analogous to the use of filler models for keyword detection in speech processing. We demonstrate the robustness of our classification model and our spotting framework by recognizing actions in unconstrained real-life video sequences and by spotting gestures with a one-shot-learning approach. Owing to advances in human action recognition, several publicly available datasets now contain a large number of actions collected from various media sources, reflecting real-world scenarios. We evaluate the proposed methods on these datasets and outperform several techniques described in the literature. Overall, we propose a robust framework for modeling actions that offers insight into the building blocks of actions rather than performing recognition alone.
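To make the HMM/LDA combination described above concrete, the following is a minimal generative sketch, not the dissertation's exact model: a per-video Dirichlet mixture over shared motion patterns plays the LDA role, while a Markov chain over those patterns supplies the HMM-style temporal ordering. All dimensions, priors, and the self-transition bias are illustrative assumptions.

    import numpy as np

    def sample_video(n_frames, n_patterns=10, vocab_size=500,
                     alpha=0.5, rng=np.random.default_rng(0)):
        # Per-video mixture over shared motion patterns (LDA-like clustering).
        theta = rng.dirichlet(alpha * np.ones(n_patterns))
        # Pattern transition matrix gives HMM-like temporal ordering; rows are
        # biased toward self-transitions so a pattern persists across frames.
        trans = rng.dirichlet(alpha * np.ones(n_patterns), size=n_patterns)
        trans = 0.8 * np.eye(n_patterns) + 0.2 * trans
        # Each motion pattern emits low-level visual words (quantized features).
        emit = rng.dirichlet(alpha * np.ones(vocab_size), size=n_patterns)

        z = rng.choice(n_patterns, p=theta)  # initial pattern from the mixture
        states, words = [], []
        for _ in range(n_frames):
            states.append(z)
            words.append(rng.choice(vocab_size, p=emit[z]))
            z = rng.choice(n_patterns, p=trans[z])  # Markov step between patterns
        return states, words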
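Likewise, a hedged sketch of the filler-model analogy for spotting: slide a window over the sequence and flag spans where a keyword (action) model out-scores a generic filler model, as in keyword spotting for speech. Here score_keyword and score_filler are hypothetical per-window log-likelihood functions (e.g., HMM forward scores), and the window size and threshold are illustrative.

    def spot(frames, score_keyword, score_filler, win=30, thresh=2.0):
        # Return (start, end, score) spans where the action model beats
        # the filler model by more than the threshold.
        hits = []
        for start in range(len(frames) - win + 1):
            window = frames[start:start + win]
            llr = score_keyword(window) - score_filler(window)  # log-likelihood ratio
            if llr > thresh:
                hits.append((start, start + win, llr))
        return hits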