Bimanual 3D Hand Motion and Articulation Forecasting in Everyday Images

  • University of Illinois Urbana-Champaign
Abstract

We tackle the problem of forecasting bimanual 3D hand motion & articulation from a single image in everyday settings. To address the lack of 3D hand annotations in diverse settings, we design an annotation pipeline consisting of a diffusion model that lifts 2D hand keypoint sequences to 4D hand motion. For the forecasting model, we adopt a diffusion loss to account for the multimodal distribution of hand motion. Extensive experiments across 6 datasets show the benefits of training on diverse data with imputed labels (14% improvement) and the effectiveness of our lifting (42% better) & forecasting (16.4% gain) models over the best baselines, especially in zero-shot generalization to everyday images.

Pipeline

We first use the 2D & 3D annotations in lab datasets to train a lifting diffusion model that maps 2D keypoint sequences to 3D MANO hands. We then run the lifting model on diverse datasets with 2D annotations to generate 3D MANO annotations. Finally, the forecasting model is trained on both lab & diverse datasets with complete 3D supervision.
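Both the lifting and forecasting models are diffusion models. As a rough illustration of the training objective, here is a minimal DDPM-style noise-prediction loss on motion tokens; the linear beta schedule and the epsilon-prediction target are assumptions for this sketch (MDM variants often predict the clean sample instead), not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_loss(denoise_fn, x0, T=1000):
    """DDPM-style loss on motion tokens x0 of shape (B, L, D).

    Illustrative sketch: linear beta schedule, epsilon-prediction target.
    denoise_fn(x_t, t) predicts the noise added to x0.
    """
    betas = np.linspace(1e-4, 0.02, T)
    alphas_bar = np.cumprod(1.0 - betas)
    t = rng.integers(0, T, size=x0.shape[0])          # random timestep per sample
    eps = rng.standard_normal(x0.shape)               # Gaussian noise
    a = np.sqrt(alphas_bar[t])[:, None, None]
    s = np.sqrt(1.0 - alphas_bar[t])[:, None, None]
    x_t = a * x0 + s * eps                            # forward diffusion q(x_t | x0)
    return np.mean((denoise_fn(x_t, t) - eps) ** 2)   # simple MSE objective

# A trivial "model" that always predicts zero noise, just to exercise the loss.
loss = diffusion_loss(lambda x_t, t: np.zeros_like(x_t), np.zeros((4, 12, 198)))
```

Because the prediction target is the sampled noise itself, this objective lets the model represent multimodal futures: different noise draws decode to different plausible motions.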

Zero-shot generalization results on EgoExo4D

Predictions from the Transformer Regressor baseline are roughly placed near the center of the image, which could indicate that it is regressing to the mean pose of all hands:

Predictions from the ForeHand4D model span longer trajectories, and are smoother & better placed in the scene than the baseline's:

More predictions from our ForeHand4D model:

Predictions from ForeHand4D on more datasets

ARCTIC:

H2O:

DexYCB:

Lifting Model

We modify MDM to condition on a sequence of 2D hand keypoints & camera parameters. The conditioning module combines different input representations: the 3D camera pose (rotation & translation), Plücker rays & KPE.
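For the Plücker-ray representation, one standard construction assigns each pixel a 6-vector (d, m) with direction d and moment m = c × d, where c is the camera center. The paper's exact conventions are not specified here, so this is a hedged sketch of that construction:

```python
import numpy as np

def plucker_rays(K, R, t, uv):
    """Per-pixel Plücker ray coordinates (d, m), with m = c x d.

    K: (3,3) intrinsics; R, t: world-to-camera extrinsics; uv: (N,2) pixels.
    One standard construction; the paper's exact conventions may differ.
    """
    c = -R.T @ t                                        # camera center in world frame
    pix = np.concatenate([uv, np.ones((len(uv), 1))], axis=1)
    d = (R.T @ np.linalg.solve(K, pix.T)).T             # back-projected ray directions
    d /= np.linalg.norm(d, axis=1, keepdims=True)       # unit-normalize directions
    m = np.cross(np.broadcast_to(c, d.shape), d)        # moment vectors
    return np.concatenate([d, m], axis=1)               # (N, 6) Plücker coordinates

rays = plucker_rays(np.eye(3), np.eye(3), np.zeros(3), np.array([[320.0, 240.0]]))
```

Unlike raw pixel coordinates, Plücker coordinates encode the full 3D ray, so the conditioning is aware of camera pose rather than only the image plane.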

Forecasting Model

We modify MDM to condition on image features extracted from a ViT backbone. Each input & output token is 198-dimensional: 2 hands x (16 joints x 6 (6D rotation per joint) + 3 (wrist translation)).
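The token arithmetic above (2 × (16 × 6 + 3) = 198) can be made concrete with a small packing sketch; the within-token layout and the helper name are assumptions for illustration:

```python
import numpy as np

def pack_hand_token(rot6d, wrist_t):
    """Pack one frame into a 198-D token: 2 hands x (16 joints x 6D rot + 3 transl.).

    rot6d: (2, 16, 6) continuous 6D rotations (first two rotation-matrix columns);
    wrist_t: (2, 3) wrist translations. The layout within the token is assumed.
    """
    per_hand = np.concatenate([rot6d.reshape(2, 16 * 6), wrist_t], axis=1)  # (2, 99)
    return per_hand.reshape(-1)                                             # (198,)

token = pack_hand_token(np.zeros((2, 16, 6)), np.zeros((2, 3)))
```

The 6D rotation representation (the first two columns of the rotation matrix) is a common choice for regression because it is continuous, unlike axis-angle or quaternions.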

Citation

@article{Prakash2025Forehand4D,
author = {Prakash, Aditya and Forsyth, David and Gupta, Saurabh},
title = {Bimanual 3D Hand Motion and Articulation Forecasting in Everyday Images},
journal = {arXiv:2510.06145},
year = {2025}
}

Template from this website