Bimanual 3D Hand Motion and Articulation Forecasting in Everyday Images

Aditya Prakash · David Forsyth · Saurabh Gupta
University of Illinois Urbana-Champaign
Abstract

We tackle the problem of forecasting bimanual 3D hand motion and articulation from a single image in everyday settings. To address the lack of 3D hand annotations in diverse settings, we design an annotation pipeline built around a diffusion model that lifts 2D hand keypoint sequences to 4D hand motion. For the forecasting model, we adopt a diffusion loss to account for the multimodality of the hand motion distribution. Extensive experiments across 6 datasets show the benefits of training on diverse data with imputed labels (14% improvement) and the effectiveness of our lifting (42% better) and forecasting (16.4% gain) models over the best baselines, especially in zero-shot generalization to everyday images.

Pipeline

We first use the 2D and 3D annotations in lab datasets to train a lifting diffusion model that maps 2D keypoint sequences to 3D MANO hands. We then run the lifting model on diverse datasets with 2D annotations to generate 3D MANO annotations. Finally, the forecasting model is trained on both lab and diverse datasets with complete 3D supervision.
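The three stages can be summarized in a short sketch. This is only an illustration: the HandClip container, the pseudo_label helper, and the lifting_model callable are hypothetical names, not the released code.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical data container; the released code may organize this differently.
@dataclass
class HandClip:
    keypoints_2d: list                      # per-frame 2D hand keypoints
    camera: dict                            # intrinsics / extrinsics
    mano_sequence: Optional[list] = None    # 3D MANO params; None if unlabeled

def pseudo_label(diverse_clips: List[HandClip],
                 lifting_model: Callable) -> List[HandClip]:
    """Stage 2: impute 3D MANO labels for clips that only have 2D keypoints."""
    for clip in diverse_clips:
        if clip.mano_sequence is None:
            clip.mano_sequence = lifting_model(clip.keypoints_2d, clip.camera)
    return diverse_clips

# Stage 1 trains lifting_model on lab clips (paired 2D + 3D annotations).
# Stage 3 trains the forecasting model on lab_clips + pseudo_label(diverse_clips,
# lifting_model), so every training clip ends up with full 3D supervision.
```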

Zero-shot generalization results
Predictions from ForeHand4D on more datasets: ARCTIC, H2O, DexYCB.

Lifting Model

We modify MDM to condition on a sequence of 2D hand keypoints and camera parameters. The conditioning module combines different input representations: the camera's 3D pose (rotation and translation), Plücker rays, and KPE.
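As an illustration of the ray conditioning, a generic Plücker-ray construction for 2D keypoints (direction plus moment, 6 values per ray) can be computed as below. This is a minimal NumPy sketch under assumed conventions (pixel keypoints, intrinsics K, world-to-camera extrinsics [R | t]); the paper's exact conditioning features may differ.

```python
import numpy as np

def plucker_rays(kp2d: np.ndarray, K: np.ndarray,
                 R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Build 6D Plücker-ray features for N 2D keypoints.

    kp2d: (N, 2) pixel coordinates, K: (3, 3) intrinsics,
    R: (3, 3), t: (3,) world-to-camera extrinsics.
    Returns an (N, 6) array of (direction, moment) per keypoint.
    """
    # Back-project pixels to camera-frame ray directions.
    ones = np.ones((kp2d.shape[0], 1))
    dirs_cam = (np.linalg.inv(K) @ np.hstack([kp2d, ones]).T).T
    # Rotate directions into the world frame and normalize.
    dirs_world = dirs_cam @ R            # row-wise R^T @ d_cam
    dirs_world /= np.linalg.norm(dirs_world, axis=1, keepdims=True)
    # Camera center in world coordinates: o = -R^T t.
    origin = -R.T @ t
    # Plücker moment m = o x d.
    moments = np.cross(origin, dirs_world)
    return np.concatenate([dirs_world, moments], axis=1)
```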

Forecasting Model

We modify MDM to condition on image features extracted from a ViT backbone. Each input and output token is 198-dimensional: 2 hands × (16 joints × 6 (6D rotation per joint) + 3 (wrist translation)).
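To make the token arithmetic concrete, here is a minimal NumPy sketch of packing one frame of bimanual pose into a 198-dimensional token. The ordering within the token (per-hand rotations followed by wrist translation) is an assumption; only the total dimensionality comes from the description above.

```python
import numpy as np

NUM_HANDS, NUM_JOINTS = 2, 16
TOKEN_DIM = NUM_HANDS * (NUM_JOINTS * 6 + 3)   # 2 * (96 + 3) = 198

def pack_token(rot6d: np.ndarray, wrist_t: np.ndarray) -> np.ndarray:
    """Flatten one frame of bimanual pose into a 198-D token.

    rot6d:   (2, 16, 6) 6D rotations per hand and joint
    wrist_t: (2, 3)     wrist translation per hand
    """
    per_hand = [np.concatenate([rot6d[h].reshape(-1), wrist_t[h]])
                for h in range(NUM_HANDS)]
    token = np.concatenate(per_hand)
    assert token.shape == (TOKEN_DIM,)
    return token
```

A forecast of T future frames is then a (T, 198) sequence that the diffusion model denoises, conditioned on the ViT image features.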

Citation

@article{Prakash2025Forehand4D,
  author  = {Prakash, Aditya and Forsyth, David and Gupta, Saurabh},
  title   = {Bimanual 3D Hand Motion and Articulation Forecasting in Everyday Images},
  journal = {arXiv:2510.06145},
  year    = {2025}
}
