3D human pose estimation in video with temporal convolutions and semi-supervised training

Dario Pavllo, Christoph Feichtenhofer, David Grangier, Michael Auli
Facebook AI Research, in CVPR 2019
Paper | Code | Demo

Abstract

In this work, we demonstrate that 3D poses in video can be effectively estimated with a fully convolutional model based on dilated temporal convolutions over 2D keypoints. We also introduce back-projection, a simple and effective semi-supervised training method that leverages unlabeled video data. We start with predicted 2D keypoints for unlabeled video, then estimate 3D poses, and finally back-project to the input 2D keypoints. In the supervised setting, our fully convolutional model outperforms the previous best result from the literature by 6 mm mean per-joint position error on Human3.6M, corresponding to an error reduction of 11%, and the model also shows significant improvements on HumanEva-I. Moreover, experiments with back-projection show that it comfortably outperforms previous state-of-the-art results in semi-supervised settings where labeled data is scarce.

Temporal convolutions

We build on the approach of state-of-the-art methods which formulate the problem as 2D keypoint detection followed by 3D pose estimation. While splitting up the problem arguably reduces the difficulty of the task, it is inherently ambiguous, as multiple 3D poses can map to the same 2D keypoints. Previous work tackled this ambiguity by modeling temporal information with recurrent neural networks. By contrast, we adopt a convolutional approach which performs 1D dilated convolutions across time to cover a large receptive field. Compared to approaches relying on RNNs, it provides higher accuracy, simplicity, and efficiency, both in terms of computational complexity and the number of parameters. Additionally, convolutional models enable parallel processing of multiple frames, which is not possible with recurrent networks.

Semi-supervised learning via back-projection

Equipped with a highly accurate and efficient architecture, we turn to settings where labeled training data is scarce and introduce a new scheme to leverage unlabeled video data for semi-supervised training. Low-resource settings are particularly challenging for neural network models, which require large amounts of labeled training data; collecting labels for 3D human pose estimation requires an expensive motion capture setup as well as lengthy recording sessions.

Our method is inspired by unsupervised machine translation, where a sentence available in only a single language is translated to another language and then back into the original language. Specifically, we predict 2D keypoints for an unlabeled video with an off-the-shelf 2D keypoint detector, predict 3D poses, and then map these back to 2D space. The key idea is to solve an autoencoding problem with the unlabeled data, where the 3D pose estimator is used as the encoder and the predicted poses are then mapped back to 2D space, from which a reconstruction loss can be computed.

Due to the perspective projection, the 2D pose on the screen depends both on the trajectory (i.e. the position of the human reference frame in space at each time step) and the 3D pose (the position of the joints relative to the human reference frame). We therefore also regress the 3D trajectory of the person, so that the back-projection to 2D can be performed correctly. At this stage, however, the model has no incentive to predict a plausible 3D pose and might just learn to copy the input. To avoid this, we add a soft constraint to approximately match the mean bone lengths of the subjects in the unlabeled batch to the subjects of the labeled batch.

Results

Our model achieves state-of-the-art results on Human3.6M and HumanEva-I. We provide below an overview of the main Human3.6M results with various keypoint detectors and architectures.
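The dilated temporal convolution idea can be illustrated with a minimal NumPy sketch (this is a hypothetical toy model, not the paper's implementation; the channel width of 64 and the dilation schedule 1, 3, 9, 27 are illustrative choices):

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    # "Valid" 1D convolution across time with the given dilation.
    # x: (T, C_in) sequence of frames; w: (K, C_in, C_out) kernel.
    K = w.shape[0]
    T_out = x.shape[0] - (K - 1) * dilation
    y = np.zeros((T_out, w.shape[2]))
    for t in range(T_out):
        for k in range(K):
            y[t] += x[t + k * dilation] @ w[k]
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=(81, 34))      # 81 frames of 17 2D keypoints (x, y)
h = x
for d in (1, 3, 9, 27):            # exponentially growing dilation
    w = rng.normal(size=(3, h.shape[1], 64)) * 0.1
    h = np.maximum(dilated_conv1d(h, w, d), 0.0)  # conv + ReLU
print(h.shape)                     # (1, 64)
```

With kernel size 3, the receptive field grows by 2 * dilation per layer, so four layers cover 1 + 2 * (1 + 3 + 9 + 27) = 81 frames: a single output frame sees 81 input frames, and all output frames of a longer clip can be computed in parallel, unlike with a recurrent model.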
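The back-projection objective described above can be sketched as follows (a hypothetical NumPy sketch: `project`, `backprojection_loss`, and `bone_length_penalty` are illustrative names, and the pinhole-camera intrinsics are made up for the example):

```python
import numpy as np

def project(pose_3d, traj, f=1000.0, c=(500.0, 500.0)):
    # Pinhole projection of a root-relative pose placed at the predicted
    # camera-space trajectory (root position); joint depths must be positive.
    p = pose_3d + traj                        # (J, 3) absolute joints
    return f * p[:, :2] / p[:, 2:3] + np.asarray(c)

def backprojection_loss(pose_3d, traj, keypoints_2d):
    # Reconstruction loss against the off-the-shelf 2D detections.
    return np.mean(np.linalg.norm(project(pose_3d, traj) - keypoints_2d, axis=1))

def bone_length_penalty(unlabeled_lengths, labeled_lengths):
    # Soft constraint: mean bone lengths of the unlabeled batch should
    # approximately match those of the labeled batch.
    return np.mean(np.abs(unlabeled_lengths.mean(axis=0)
                          - labeled_lengths.mean(axis=0)))

# A 3D pose whose re-projection matches the 2D detections gives zero loss:
rng = np.random.default_rng(0)
pose = rng.normal(scale=0.3, size=(17, 3))    # root-relative joints (metres)
traj = np.array([0.0, 0.0, 5.0])              # person 5 m in front of camera
detections = project(pose, traj)
loss = backprojection_loss(pose, traj, detections)
```

Without the trajectory term, the projection of a root-relative pose would be placed at an arbitrary depth and could not line up with the detected keypoints; the bone-length penalty is what prevents the autoencoder from collapsing to a trivial copy of its 2D input.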