Dynamics-based human pose estimation using monocular vision
Human pose estimation using monocular vision is a challenging problem for both the computer vision and robotics communities. Past work has focused on developing efficient inference algorithms and probabilistic models that employ articulated-multibody-dynamics-based priors learned from captured kinematic/dynamic measurements. However, such algorithms generalize poorly beyond the training dataset: tracking performance depends strongly on the underlying articulated-multibody system-parameter estimates, which are difficult to obtain from unstructured, uncalibrated video sequences. In this work, we propose a model-based generative approach for estimating human pose solely from uncalibrated monocular video in unconstrained environments, without any prior learning on motion-capture or image-annotation data. We propose a novel Product of Heading Experts (PoHE) based generalized heading estimation framework that probabilistically merges heading outputs (probabilistic or non-probabilistic) from a time-varying number of estimators. Our current implementation employs a motion-cue-based human heading estimator to bootstrap a synergistically integrated probabilistic-deterministic sequential optimization framework that robustly estimates human pose. Novel pixel-distance-based performance measures are developed to penalize false human detections and to ensure identity-maintained human tracking. We test our framework with varied inputs (silhouettes and bounding boxes) to evaluate, compare, and benchmark it against ground-truth data (collected using our human annotation tool) for 52 video vignettes in the publicly available Defense Advanced Research Projects Agency (DARPA) Mind's Eye Year 1 dataset (ARL-RT1). Results show robust pose estimates on this challenging dataset of highly diverse activities.
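The core idea of merging heading outputs from a varying number of estimators can be sketched as a product of experts over the heading angle. The minimal Python sketch below is illustrative, not the thesis implementation: it assumes each expert reports a von Mises-distributed heading (mean and concentration), whose product is again von Mises with additive natural parameters. The function name `fuse_headings` and its parameters are hypothetical.

```python
import math

def fuse_headings(mus, kappas):
    """Fuse heading estimates (radians) from a varying number of experts.

    Assumes each expert i contributes a von Mises density with mean mu_i
    and concentration kappa_i. The (unnormalized) product of von Mises
    densities is again von Mises, with the natural parameters
    (kappa * cos(mu), kappa * sin(mu)) summing across experts.
    """
    c = sum(k * math.cos(m) for m, k in zip(mus, kappas))
    s = sum(k * math.sin(m) for m, k in zip(mus, kappas))
    mu_fused = math.atan2(s, c)       # fused heading (radians)
    kappa_fused = math.hypot(c, s)    # fused concentration (confidence)
    return mu_fused, kappa_fused
```

Because the expert set simply contributes additive terms, estimators can drop in and out frame by frame, which matches the time-varying number of heading sources described above. A deterministic (non-probabilistic) estimator can be folded in by assigning it a fixed concentration.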
Building upon this framework, we further propose a technique for estimating the lower-limb dynamics of a human solely from behavior captured with an uncalibrated monocular video camera. We leverage the proposed pose estimation framework to (i) deduce the correct sequence of temporally coherent, gap-filled pose estimates, (ii) estimate physical parameters using a dynamics model that incorporates anthropometric constraints, and (iii) filter the optimized gap-filled pose estimates using an Unscented Kalman Filter (UKF) with the estimated dynamically-equivalent human dynamics model. We test this extended framework on videos from the publicly available DARPA Mind's Eye Year 1 corpus (ARL-RT1). The combined estimation and filtering framework not only yields more accurate, physically plausible pose estimates, but also provides pose estimates for frames where the original human pose estimation framework failed to provide one.
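The thesis filters pose estimates with a UKF driven by the estimated nonlinear dynamics model; as a simplified stand-in, the linear constant-velocity Kalman filter below illustrates the same predict/update mechanism on a single joint coordinate, including how prediction alone gap-fills frames where the pose estimator returned nothing. The function name `kf_smooth_joint` and the noise parameters are illustrative assumptions, not the thesis code.

```python
def kf_smooth_joint(observations, q=1e-3, r=1e-2):
    """Constant-velocity Kalman filter over one joint coordinate.

    observations: list of floats, with None marking frames where the
    pose estimator failed; such frames are gap-filled by prediction.
    State is [position, velocity]; q is process noise, r measurement noise.
    (Simplified linear sketch; the thesis uses a UKF with an estimated
    human dynamics model instead of this constant-velocity model.)
    """
    x = observations[0] if observations[0] is not None else 0.0
    v = 0.0
    P = [[1.0, 0.0], [0.0, 1.0]]  # state covariance
    out = []
    for z in observations:
        # predict with F = [[1, 1], [0, 1]] (unit time step)
        x = x + v
        P = [[P[0][0] + 2 * P[0][1] + P[1][1] + q, P[0][1] + P[1][1]],
             [P[0][1] + P[1][1], P[1][1] + q]]
        if z is not None:
            # update on the observed position (H = [1, 0])
            S = P[0][0] + r
            K0, K1 = P[0][0] / S, P[0][1] / S
            resid = z - x
            x += K0 * resid
            v += K1 * resid
            P = [[(1 - K0) * P[0][0], (1 - K0) * P[0][1]],
                 [P[0][1] - K1 * P[0][0], P[1][1] - K1 * P[0][1]]]
        out.append(x)
    return out
```

Frames with `None` receive the predicted position, which is how the filtering stage can supply pose estimates for frames where the original estimator produced none.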