Learning Hand-Held Object Reconstruction
from In-The-Wild Videos

ARXIV 2023

Aditya Prakash · Matthew Chang · Matthew Jin · Ruisen Tu · Saurabh Gupta
  • University of Illinois Urbana-Champaign
Abstract

Prior works for reconstructing hand-held objects from a single image rely on direct 3D shape supervision, which is challenging to gather in the real world at scale. Consequently, these approaches do not generalize well when presented with novel objects in in-the-wild settings. While 3D supervision is a major bottleneck, there is an abundance of in-the-wild raw video data showing hand-object interactions. In this paper, we automatically extract 3D supervision (via multiview 2D supervision) from such raw video data to scale up the learning of models for hand-held object reconstruction. Specifically, we train our models on 144 object categories (obtained from videos in the VISOR dataset), which is 4 times larger than existing hand-object datasets for this task. This requires tackling two key challenges: unknown camera pose and occlusion. For the former, we use hand pose (predicted using existing techniques, e.g. FrankMocap) as a proxy for object pose. For the latter, we learn data-driven object shape priors from existing datasets. We use these indirect 3D cues to train occupancy networks that predict the 3D shape of objects from a single RGB image. Our experiments in the challenging object generalization setting on the in-the-wild MOW dataset show a 13% relative improvement over models trained with 3D supervision on existing datasets. Code, data, and models will be made available upon acceptance.
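To make the supervision scheme concrete, below is a minimal sketch (not the authors' released code) of an occupancy network trained with multiview 2D object masks, where per-view hand poses stand in for the unknown object/camera pose. All module names, tensor shapes, and the exact loss form are illustrative assumptions.

```python
# Minimal sketch, assuming: a CNN image encoder + MLP occupancy decoder,
# per-view hand poses (e.g. from FrankMocap) as proxy object poses, and
# 2D object masks from video frames as the only supervision signal.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OccupancyNet(nn.Module):
    """Predicts occupancy of 3D query points conditioned on an image feature."""
    def __init__(self, feat_dim=256, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(  # stand-in for a full image backbone
            nn.Conv2d(3, 32, 7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, image, points):
        # image: (B, 3, H, W); points: (B, N, 3) in a hand-centric frame
        feat = self.encoder(image)                                   # (B, F)
        feat = feat[:, None, :].expand(-1, points.shape[1], -1)
        return self.decoder(torch.cat([feat, points], -1)).squeeze(-1)  # logits

def project(points, pose, K):
    """Map hand-frame points into a view using that view's hand pose (proxy
    for object pose) and intrinsics K. points: (B, N, 3); pose: (B, 4, 4)."""
    R, t = pose[:, :3, :3], pose[:, :3, 3:]
    cam = points @ R.transpose(1, 2) + t.transpose(1, 2)             # (B, N, 3)
    uv = cam @ K.transpose(1, 2)
    return uv[..., :2] / uv[..., 2:].clamp(min=1e-6)                 # pixels

def multiview_mask_loss(net, ref_image, points, poses, masks, K):
    """2D supervision: a point projecting outside the object mask in any view
    is labeled empty; points inside all masks act as weak positives."""
    logits = net(ref_image, points)                                  # (B, N)
    inside_all = torch.ones_like(logits, dtype=torch.bool)
    H, W = masks.shape[-2:]
    for v in range(poses.shape[1]):                                  # views
        uv = project(points, poses[:, v], K)                         # (B, N, 2)
        grid = torch.stack([uv[..., 0] / (W - 1) * 2 - 1,
                            uv[..., 1] / (H - 1) * 2 - 1], -1)
        m = F.grid_sample(masks[:, v:v+1], grid[:, None], align_corners=True)
        inside_all &= m.squeeze(1).squeeze(1) > 0.5
    return F.binary_cross_entropy_with_logits(logits, inside_all.float())
```

This roughly mirrors a visual-hull style of multiview 2D supervision; in the paper, data-driven shape priors learned from existing datasets additionally regularize regions that the masks alone cannot constrain due to hand occlusion.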

Citation

@article{Prakash2023ARXIV,
  author  = {Prakash, Aditya and Chang, Matthew and Jin, Matthew and Tu, Ruisen and Gupta, Saurabh},
  title   = {Learning Hand-Held Object Reconstruction from In-The-Wild Videos},
  journal = {arXiv},
  volume  = {2305.03036},
  year    = {2023}
}