Structured models for video understanding
Video understanding comprises rich research topics and directions and attracts researchers from the computer vision, machine learning, data mining, multimedia, and NLP communities. We understand a video as a composition of objects (especially humans), motion, scene, and so on, from which we may produce a predicted action-class label, a sequence of bounding boxes, or even natural language describing the video, depending on the granularity at which the video is analyzed. Our core motivation is therefore to analyze videos and generate descriptions in a structured, compositional way. Specifically, we follow two research directions: 1) structured models for human action modeling and 2) more generic video perception and language generation.

First, we utilize human pose coupled with local motion around the body joints as a mid-level representation for activity recognition. This method demonstrates advantages over skeleton-only pose representations and is complementary to other low-level motion and appearance features. However, the orderless bag-of-dynamic-pose framework leverages human pose models trained on still images, making the coupling of pose space and motion space vulnerable to complex backgrounds or unreliable appearance features. Therefore, we design a framework that captures the variable structure of dynamic human activity over a long range. At the lower level, we learn the compositionality of low-level features, i.e., dense trajectories, by recursively grouping frequently co-occurring pairs of trajectories. At the higher level, we design a locally articulated spatiotemporal deformable parts model to capture the global articulation of an action with parts that are more locally discriminative.

We then move toward understanding unconstrained videos "in the wild". To generate more complex and natural output from video, we further study how to jointly model video and language by introducing a unified framework that connects the two modalities.
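To make the idea of a shared video-language space concrete, the following is a minimal sketch of a joint embedding trained with a margin-based ranking loss, a standard formulation for cross-modal retrieval. The linear projections `W_v` and `W_s`, the cosine scoring, and the margin value are illustrative assumptions, not the dissertation's actual model.

```python
import numpy as np

def embed(x, W):
    # Project a feature vector into the shared space and L2-normalize,
    # so that dot products between embeddings are cosine similarities.
    z = W @ x
    return z / np.linalg.norm(z)

def ranking_loss(video_feats, sent_feats, W_v, W_s, margin=0.2):
    # Margin-based ranking loss over matched (video_i, sentence_i) pairs:
    # a matched pair should outscore every mismatched pair by `margin`.
    v = np.stack([embed(x, W_v) for x in video_feats])
    s = np.stack([embed(x, W_s) for x in sent_feats])
    scores = v @ s.T  # pairwise similarities in the shared space
    n = len(video_feats)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                # hinge terms for mismatched sentences and mismatched videos
                loss += max(0.0, margin - scores[i, i] + scores[i, j])
                loss += max(0.0, margin - scores[i, i] + scores[j, i])
    return loss / (n * (n - 1))
```

At retrieval time, the same `scores` matrix ranks sentences for a query video (rows) or videos for a query sentence (columns); a loss of zero means every matched pair already beats every mismatch by the margin.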
Our model jointly learns the compositionality of words and the embeddings between videos and sentences. The joint model can perform video retrieval, text retrieval, and natural language generation from novel videos. Our fourth contribution is a distributed word learning framework that uses visual features as global context and text segments as local context. The learned word embeddings capture visually grounded semantic meaning better than those of traditional language models trained without visual context. To conclude, this dissertation presents five works that demonstrate how structured human action models and video-language models lead to better video understanding.
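The "global visual context plus local text context" idea can be pictured with a CBOW-style prediction step: the surrounding words supply a local context vector, and a projection of the video's visual feature is added as a global context signal before predicting the target word. All names here (`W_in`, `W_vis`, `W_out`, the averaging, the additive combination) are assumptions for illustration, not the framework's actual parameterization.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over vocabulary scores.
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_word(context_idxs, visual_feat, W_in, W_vis, W_out):
    # Predict a target word from the average of its local text context
    # embeddings plus a projection of the visual feature (global context).
    text_ctx = W_in[context_idxs].mean(axis=0)  # local context: nearby words
    vis_ctx = W_vis @ visual_feat               # global context: visual feature
    h = text_ctx + vis_ctx                      # combined context vector
    return softmax(W_out @ h)                   # distribution over vocabulary
```

Training would push probability mass toward the observed word, so rows of `W_in` for words that co-occur with similar visual contexts drift together, which is one way visually grounded meaning can enter the embedding space.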