How Do I Do That?
Synthesizing 3D Hand Motion and Contacts for Everyday Interactions

CVPR 2025

  • 1University of Illinois Urbana-Champaign
  • 2Microsoft
  • Part of the work was done during an internship at Microsoft
  • Joint last authors; indicates equal contribution
Abstract

We tackle the novel problem of predicting 3D hand motion and contact maps (or Interaction Trajectories) given a single RGB view, action text, and a 3D contact point on the object as input. Our approach consists of (1) Interaction Codebook: a VQVAE model to learn a latent codebook of hand poses and contact points, effectively tokenizing interaction trajectories, (2) Interaction Predictor: a transformer-decoder module to predict the interaction trajectory from test-time inputs by using an indexer module to retrieve a latent affordance from the learned codebook. To train our model, we develop a data engine that extracts 3D hand poses and contact trajectories from the diverse HoloAssist dataset. We evaluate our model on a benchmark that is 2.5-10X larger than existing works, in terms of diversity of objects and interactions observed, and test for generalization of the model across object categories, action categories, tasks, and scenes. Experimental results show the effectiveness of our approach over transformer & diffusion baselines across all settings.

Data Engine

We process diverse interaction videos from the HoloAssist dataset showing atomic actions. Since HoloAssist does not provide 3D annotations for hand poses & contact points, we design a semi-automatic annotation pipeline using 2D segmentation masks, 3D hand poses & 2D contact regions. Object masks are extracted using SAMv2, 3D hand poses & hand masks (2D renderings of the mesh) come from HaMeR, and contact points are computed by projecting the 3D hand points into the 2D contact region (the intersection of the hand & object masks). While we only consider right-hand motion in this work, the pipeline can be extended to both hands.
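For concreteness, here is a minimal sketch of the contact-point step, assuming binary hand/object masks (e.g., a HaMeR hand-mask rendering and a SAMv2 object mask) and the HaMeR mesh vertices together with their 2D image projections; the function and variable names are illustrative, not the actual pipeline code.

import numpy as np

def extract_contact_points(hand_mask, obj_mask, hand_verts_3d, hand_verts_2d):
    """Return the 3D hand vertices whose 2D projections fall inside the
    2D contact region (intersection of the hand & object masks)."""
    contact_region = np.logical_and(hand_mask > 0, obj_mask > 0)  # (H, W) bool
    h, w = contact_region.shape

    # Round projections to pixel coordinates and keep those inside the image.
    uv = np.round(hand_verts_2d).astype(int)                      # (N, 2) as (u, v)
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)

    # A vertex is in contact if its pixel lies in the intersection mask.
    in_contact = np.zeros(len(uv), dtype=bool)
    in_contact[valid] = contact_region[uv[valid, 1], uv[valid, 0]]
    return hand_verts_3d[in_contact], in_contact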

Some examples of the extracted annotations are visualized below. We show the input video on the top left along with a text description of the action being performed. The hand mesh and contacts are rendered from 2 different viewpoints. Each sequence consists of 30 timesteps (rendered at 5 fps), which is about 1-1.25 seconds in duration.

Approach

Our framework involves a 2-stage training procedure: (1) Interaction Codebook: learning a latent codebook of hand poses and contact points, i.e., tokenizing interaction trajectories, (2) a learned Indexer & an Interaction Predictor module to predict the interaction trajectories from a single image, action text & a 3D contact point. We use pretrained features for images (from DeiT) and text (from CLIP). The 3D contact point is input as a 3D Gaussian heatmap in a 3D voxel grid (omitted here for clarity).
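As a small illustration, here is a sketch of encoding a 3D contact point as a Gaussian heatmap on a voxel grid, as described above; the grid resolution, spatial extent, and sigma are assumptions for illustration rather than the model's exact values.

import torch

def contact_heatmap(point_xyz, grid_size=32, extent=1.0, sigma=0.05):
    """point_xyz: (3,) tensor in a cube [-extent, extent]^3 around the object."""
    coords = torch.linspace(-extent, extent, grid_size)
    zz, yy, xx = torch.meshgrid(coords, coords, coords, indexing="ij")
    grid = torch.stack([xx, yy, zz], dim=-1)              # (D, H, W, 3) voxel centers
    sq_dist = ((grid - point_xyz) ** 2).sum(dim=-1)       # (D, H, W) squared distances
    return torch.exp(-sq_dist / (2 * sigma ** 2))         # peak of 1 at the contact point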

Interaction Codebook is a VQVAE that learns a latent codebook of 3D interaction trajectories, built on a transformer encoder-decoder architecture. The encoder features are used to sample codebook indices and the corresponding embeddings, which are passed to a decoder that also takes in the video input and the text describing the action, and reconstructs the 3D interaction trajectories.
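The quantization step at the heart of such a VQVAE can be sketched as follows; the codebook size, embedding dimension, and tensor shapes are assumptions, not the exact model configuration.

import torch
import torch.nn as nn

class Quantizer(nn.Module):
    def __init__(self, num_codes=512, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                  # z: (B, T, dim) encoder features
        flat = z.reshape(-1, z.shape[-1])                  # (B*T, dim)
        dists = torch.cdist(flat, self.codebook.weight)    # (B*T, num_codes)
        idx = dists.argmin(dim=-1)                         # nearest codebook indices
        z_q = self.codebook(idx).view_as(z)                # quantized embeddings
        z_q = z + (z_q - z).detach()                       # straight-through estimator
        return z_q, idx.view(z.shape[:-1])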

Indexer module maps the test-time inputs to codebook indices to extract the relevant embeddings. These are then passed to an Interaction Predictor module that outputs the 3D interaction trajectories. We consider 2 settings: (1) Forecasting: text describing the action, a single image & a 3D contact point, (2) Interpolation: additionally providing the goal image showing the final state of the interaction (not shown here for clarity).
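One way to realize such an indexer is as a classifier over codebook indices whose selected embeddings then condition the predictor. The sketch below assumes a fused feature vector from the image, text & contact inputs; the dimensions and fusion scheme are illustrative rather than the paper's exact architecture.

import torch
import torch.nn as nn

class Indexer(nn.Module):
    def __init__(self, feat_dim=768, num_codes=512, num_tokens=8):
        super().__init__()
        self.num_tokens = num_tokens
        self.num_codes = num_codes
        self.head = nn.Linear(feat_dim, num_tokens * num_codes)

    def forward(self, fused_feats, codebook):              # fused_feats: (B, feat_dim)
        logits = self.head(fused_feats).view(-1, self.num_tokens, self.num_codes)
        idx = logits.argmax(dim=-1)                        # (B, num_tokens) codebook indices
        return codebook(idx)                               # (B, num_tokens, dim) retrieved embeddings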

Results

We consider 2 settings: (1) Forecasting: the input consists of a textual description of the action, a current image showing the object of interaction, and a 3D contact point. (2) Interpolation: we additionally provide the goal image showing the final state of the interaction. For each setting, we consider two variants: (a) current image with a hand, (b) current image without a hand (the hand may not be visible before the start of the interaction in many practical scenarios).

Forecasting (no hands)
Forecasting (with hands)
Interpolation (no hands)
Interpolation (with hands)

Citation

@inproceedings{Prakash2025LatentAct,
author = {Prakash, Aditya and Lundell, Benjamin and Andreychuk, Dmitry and Forsyth, David and Gupta, Saurabh and Sawhney, Harpreet},
title = {How Do I Do That? Synthesizing 3D Hand Motion and Contacts for Everyday Interactions},
booktitle = {Computer Vision and Pattern Recognition (CVPR)},
year = {2025}
}
