3D Hand Pose Estimation in Egocentric Images in the Wild
Aditya Prakash,
Ruisen Tu,
Matthew Chang,
Saurabh Gupta
arXiv, 2023
abstract /
bibtex /
project page /
video
We present WildHands, a method for 3D hand pose estimation in egocentric images in the wild. This is challenging due to (a) the lack of 3D hand pose annotations for images in the wild, and (b) a form of perspective distortion-induced shape ambiguity that arises in the analysis of crops around hands. For the former, we use auxiliary supervision on in-the-wild data in the form of segmentation masks & grasp labels, in addition to the 3D supervision available in lab datasets. For the latter, we provide spatial cues about the location of the hand crop in the camera's field of view. Our approach achieves the best 3D hand pose results on the ARCTIC leaderboard and outperforms FrankMocap, a popular and robust approach for estimating hand pose in the wild, by 45.3% on 2D hand pose estimation on our EPIC-HandKps dataset.
@article{Prakash2023Hands,
author = {Prakash, Aditya and Tu, Ruisen and Chang, Matthew and Gupta, Saurabh},
title = {3D Hand Pose Estimation in Egocentric Images in the Wild},
journal = {arXiv},
volume = {2312.06583},
year = {2023}
}
Mitigating Perspective Distortion-induced Shape Ambiguity in Image Crops
Aditya Prakash,
Arjun Gupta,
Saurabh Gupta
arXiv, 2023
abstract /
bibtex /
project page /
video
Objects undergo varying amounts of perspective distortion as they move across a camera's field of view. Models for predicting 3D from a single image often work with crops around the object of interest and ignore the location of the object in the camera's field of view. We note that ignoring this location information further exaggerates the inherent ambiguity in making 3D inferences from 2D images and can prevent models from even fitting to the training data. To mitigate this ambiguity, we propose Intrinsics-Aware Positional Encoding (KPE), which incorporates information about the location of crops in the image and the camera intrinsics. Experiments on three popular 3D-from-a-single-image benchmarks (depth prediction on NYU, 3D object detection on KITTI & nuScenes, and predicting 3D shapes of articulated objects on ARCTIC) show the benefits of KPE.
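As a concrete illustration, here is a minimal numpy sketch of one way such an intrinsics-aware encoding could be computed: back-project the crop corners to viewing rays with the inverse intrinsics and embed them sinusoidally. The function name, the use of corner rays, and the frequency count are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def kpe_encoding(box, K, num_freqs=4):
    """Sketch of an intrinsics-aware positional encoding for an image crop.

    box: (x0, y0, x1, y1) crop corners in pixel coordinates.
    K:   3x3 camera intrinsics matrix.
    Returns a 1D feature encoding the viewing directions of the crop corners.
    """
    x0, y0, x1, y1 = box
    corners = np.array([[x0, y0, 1.0], [x1, y0, 1.0],
                        [x0, y1, 1.0], [x1, y1, 1.0]])
    # Back-project pixel corners to unit ray directions using K^{-1}.
    rays = (np.linalg.inv(K) @ corners.T).T
    rays = rays / np.linalg.norm(rays, axis=1, keepdims=True)
    # Standard sinusoidal embedding of the ray directions.
    freqs = 2.0 ** np.arange(num_freqs) * np.pi
    ang = rays[..., None] * freqs              # (4 corners, 3 dims, num_freqs)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1).ravel()

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
feat = kpe_encoding((100, 80, 260, 240), K)
print(feat.shape)  # (96,) = 4 corners * 3 dims * 4 freqs * 2 (sin, cos)
```

Because the encoding depends on where the crop sits in the image, two crops of identical size at different locations receive different features, which is exactly the cue the model otherwise loses.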
@article{Prakash2023Ambiguity,
author = {Prakash, Aditya and Gupta, Arjun and Gupta, Saurabh},
title = {Mitigating Perspective Distortion-induced Shape Ambiguity in Image Crops},
journal = {arXiv},
volume = {2312.06594},
year = {2023}
}
Look Ma, No Hands! Agent-Environment Factorization of Egocentric Videos
Matthew Chang,
Aditya Prakash,
Saurabh Gupta
Neural Information Processing Systems (NeurIPS), 2023
abstract /
bibtex /
project page /
video
The analysis and use of egocentric videos for robotic tasks is made challenging by occlusion due to the hand and the visual mismatch between the human hand and a robot end-effector. In this sense, the human hand presents a nuisance. However, often hands also provide a valuable signal, e.g. the hand pose may suggest what kind of object is being held. In this work, we propose to extract a factored representation of the scene that separates the agent (human hand) and the environment. This alleviates both occlusion and mismatch while preserving the signal, thereby easing the design of models for downstream robotics tasks. At the heart of this factorization is our proposed Video Inpainting via Diffusion Model (VIDM) that leverages both a prior on real-world images (through a large-scale pre-trained diffusion model) and the appearance of the object in earlier frames of the video (through attention). Our experiments demonstrate the effectiveness of VIDM at improving inpainting quality on egocentric videos and the power of our factored representation for numerous tasks: object detection, 3D reconstruction of manipulated objects, and learning of reward functions, policies, and affordances from videos.
@inproceedings{Chang2023NEURIPS,
author = {Chang, Matthew and Prakash, Aditya and Gupta, Saurabh},
title = {Look Ma, No Hands! Agent-Environment Factorization of Egocentric Videos},
booktitle = {Neural Information Processing Systems (NeurIPS)},
year = {2023}
}
Learning Hand-Held Object Reconstruction from In-The-Wild Videos
Aditya Prakash,
Matthew Chang,
Matthew Jin,
Saurabh Gupta
arXiv, 2023
abstract /
bibtex /
project page /
video
Prior works for reconstructing hand-held objects from a single image rely on direct 3D shape supervision, which is challenging to gather in the real world at scale. Consequently, these approaches do not generalize well when presented with novel objects in in-the-wild settings. While 3D supervision is a major bottleneck, there is an abundance of in-the-wild raw video data showing hand-object interactions. In this paper, we automatically extract 3D supervision (via multiview 2D supervision) from such raw video data to scale up the learning of models for hand-held object reconstruction. This requires tackling two key challenges: unknown camera pose and occlusion. For the former, we use hand pose (predicted from existing techniques, e.g. FrankMocap) as a proxy for object pose. For the latter, we learn data-driven 3D shape priors using synthetic objects from the ObMan dataset. We use these indirect 3D cues to train occupancy networks that predict the 3D shape of objects from a single RGB image. Our experiments on the MOW and HO3D datasets show the effectiveness of these supervisory signals at predicting the 3D shape of real-world hand-held objects without any direct real-world 3D supervision.
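The occupancy-network idea above can be sketched in a few lines: condition a small MLP on an image feature vector and query it at arbitrary 3D points to get occupancy probabilities. The dimensions and random weights below are toy placeholders, not the trained model described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, weights):
    # Tiny ReLU MLP; the last layer produces a single logit per query point.
    for i, (W, b) in enumerate(weights):
        x = x @ W + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

def occupancy(points, img_feat, weights):
    """Predict an occupancy probability for each 3D query point,
    conditioned on a global image feature vector."""
    cond = np.broadcast_to(img_feat, (points.shape[0], img_feat.shape[0]))
    logits = mlp(np.concatenate([points, cond], axis=1), weights)
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid -> probability in (0, 1)

d_feat, hidden = 8, 16
weights = [(rng.normal(size=(3 + d_feat, hidden)), np.zeros(hidden)),
           (rng.normal(size=(hidden, 1)), np.zeros(1))]
pts = rng.uniform(-0.5, 0.5, size=(5, 3))      # query points near the object
probs = occupancy(pts, rng.normal(size=d_feat), weights)
print(probs.shape)  # (5, 1), each entry in (0, 1)
```

A mesh can then be extracted from such a field by thresholding and running marching cubes, which is why occupancy networks pair naturally with indirect 2D supervision.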
@article{Prakash2023ARXIV,
author = {Prakash, Aditya and Chang, Matthew and Jin, Matthew and Gupta, Saurabh},
title = {Learning Hand-Held Object Reconstruction from In-The-Wild Videos},
journal = {arXiv},
volume = {2305.03036},
year = {2023}
}
TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving
Kashyap Chitta,
Aditya Prakash,
Bernhard Jaeger,
Zehao Yu,
Katrin Renz,
Andreas Geiger
Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
abstract /
bibtex /
code
How should we integrate representations from complementary sensors for autonomous driving? Geometry-based fusion has shown promise for perception (e.g. object detection, motion forecasting). However, in the context of end-to-end driving, we find that imitation learning based on existing sensor fusion methods underperforms in complex driving scenarios with a high density of dynamic agents. Therefore, we propose TransFuser, a mechanism to integrate image and LiDAR representations using self-attention. Our approach uses transformer modules at multiple resolutions to fuse perspective view and bird's eye view feature maps. We experimentally validate its efficacy on a challenging new benchmark with long routes and dense traffic, as well as the official leaderboard of the CARLA urban driving simulator. At the time of submission, TransFuser outperforms all prior work on the CARLA leaderboard in terms of driving score by a large margin. Compared to geometry-based fusion, TransFuser reduces the average collisions per kilometer by 48%.
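The core fusion step can be illustrated with a toy self-attention layer over the concatenation of image and LiDAR tokens, so that every token attends to both modalities. The real model uses learned Q/K/V projections, multiple heads, and fusion at multiple resolutions; this sketch uses identity projections for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(img_tokens, lidar_tokens):
    """Fuse image and LiDAR feature tokens with a single self-attention
    layer and split the fused result back into the two modalities."""
    x = np.concatenate([img_tokens, lidar_tokens], axis=0)   # (N_img + N_lidar, d)
    attn = softmax(x @ x.T / np.sqrt(x.shape[1]), axis=-1)   # cross-modal attention weights
    fused = attn @ x
    n = img_tokens.shape[0]
    return fused[:n], fused[n:]

rng = np.random.default_rng(0)
img, lidar = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))
f_img, f_lidar = attention_fusion(img, lidar)
print(f_img.shape, f_lidar.shape)  # (4, 8) (6, 8)
```

The point of the construction is that each image token's output is a weighted mix that includes LiDAR tokens (and vice versa), unlike geometry-based fusion, which only exchanges information between spatially corresponding locations.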
@article{Chitta2022PAMI,
author = {Chitta, Kashyap and Prakash, Aditya and Jaeger, Bernhard and Yu, Zehao and Renz, Katrin and Geiger, Andreas},
title = {TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving},
journal = {Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
year = {2022}
}
NEAT: Neural Attention Fields for End-to-End Autonomous Driving
Kashyap Chitta*,
Aditya Prakash*,
Andreas Geiger
International Conference on Computer Vision (ICCV), 2021
Transformers for Vision (T4V) Workshop at CVPR 2022 (Spotlight)
abstract /
bibtex /
code /
video
Efficient reasoning about the semantic, spatial, and temporal structure of a scene is a crucial prerequisite for autonomous driving. We present NEural ATtention fields (NEAT), a novel representation that enables such reasoning for end-to-end Imitation Learning (IL) models. Our representation is a continuous function which maps locations in Bird's Eye View (BEV) scene coordinates to waypoints and semantics, using intermediate attention maps to iteratively compress high-dimensional 2D image features into a compact representation. This allows our model to selectively attend to relevant regions in the input while ignoring information irrelevant to the driving task, effectively associating the images with the BEV representation. NEAT nearly matches the state-of-the-art on the CARLA Leaderboard while being far less resource-intensive. Furthermore, visualizing the attention maps for models with NEAT intermediate representations provides improved interpretability. On a new evaluation setting involving adverse environmental conditions and challenging scenarios, NEAT outperforms several strong baselines and achieves driving scores on par with the privileged CARLA expert used to generate its training data.
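The continuous-field idea can be sketched as an MLP that maps BEV coordinates (conditioned on a scene feature) to per-location semantic logits and waypoint offsets. The actual model additionally uses iterative attention over image features, which this toy version omits; all dimensions and weights are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def neat_field(coords, scene_feat, W1, b1, W2, b2, n_classes=3):
    """A continuous field: map BEV locations (plus a scene feature) to
    per-location semantic logits and a 2D waypoint offset."""
    cond = np.broadcast_to(scene_feat, (coords.shape[0], scene_feat.shape[0]))
    h = np.maximum(np.concatenate([coords, cond], axis=1) @ W1 + b1, 0.0)
    out = h @ W2 + b2
    return out[:, :n_classes], out[:, n_classes:]   # semantics, waypoint offsets

d_feat, hidden, n_cls = 8, 16, 3
W1 = rng.normal(size=(2 + d_feat, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(size=(hidden, n_cls + 2));  b2 = np.zeros(n_cls + 2)
bev = rng.uniform(-1, 1, size=(10, 2))              # query locations in BEV
sem, wp = neat_field(bev, rng.normal(size=d_feat), W1, b1, W2, b2)
print(sem.shape, wp.shape)  # (10, 3) (10, 2)
```

Because the field is queried pointwise, it can be evaluated at any BEV resolution without changing the network, which is what makes the representation compact.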
@inproceedings{Chitta2021ICCV,
author = {Chitta, Kashyap and Prakash, Aditya and Geiger, Andreas},
title = {NEAT: Neural Attention Fields for End-to-End Autonomous Driving},
booktitle = {International Conference on Computer Vision (ICCV)},
year = {2021}
}
Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
Aditya Prakash*,
Kashyap Chitta*,
Andreas Geiger
Conference on Computer Vision and Pattern Recognition (CVPR), 2021
abstract /
bibtex /
project page /
code /
video /
poster /
blog
How should representations from complementary sensors be integrated for autonomous driving? Geometry-based sensor fusion has shown great promise for perception tasks such as object detection and motion forecasting. However, for the actual driving task, the global context of the 3D scene is key, e.g. a change in traffic light state can affect the behavior of a vehicle geometrically distant from that traffic light. Geometry alone may therefore be insufficient for effectively fusing representations in end-to-end driving models. In this work, we demonstrate that imitation learning policies based on existing sensor fusion methods underperform in the presence of a high density of dynamic agents and complex scenarios, which require global contextual reasoning, such as handling traffic oncoming from multiple directions at uncontrolled intersections. Therefore, we propose TransFuser, a novel Multi-Modal Fusion Transformer, to integrate image and LiDAR representations using attention. We experimentally validate the efficacy of our approach in urban settings involving complex scenarios using the CARLA urban driving simulator. Our approach achieves state-of-the-art driving performance while reducing collisions by 76% compared to geometry-based fusion.
@inproceedings{Prakash2021CVPR,
author = {Prakash, Aditya and Chitta, Kashyap and Geiger, Andreas},
title = {Multi-Modal Fusion Transformer for End-to-End Autonomous Driving},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2021}
}
Exploring Data Aggregation in Policy Learning for Vision-based Urban Autonomous Driving
Aditya Prakash,
Aseem Behl,
Eshed Ohn-Bar,
Kashyap Chitta,
Andreas Geiger
Conference on Computer Vision and Pattern Recognition (CVPR), 2020
abstract /
bibtex /
project page /
code /
video
Data aggregation techniques can significantly improve vision-based policy learning within a training environment, e.g., learning to drive in a specific simulation condition. However, as on-policy data is sequentially sampled and added in an iterative manner, the policy can specialize and overfit to the training conditions. For real-world applications, it is useful for the learned policy to generalize to novel scenarios that differ from the training conditions. To improve policy learning while maintaining robustness when training end-to-end driving policies, we perform an extensive analysis of data aggregation techniques in the CARLA environment. We demonstrate how the majority of them have poor generalization performance, and develop a novel approach with empirically better generalization performance compared to existing techniques. Our two key ideas are (1) to sample critical states from the collected on-policy data based on the utility they provide to the learned policy in terms of driving behavior, and (2) to incorporate a replay buffer which progressively focuses on the high uncertainty regions of the policy's state distribution. We evaluate the proposed approach on the CARLA NoCrash benchmark, focusing on the most challenging driving scenarios with dense pedestrian and vehicle traffic. Our approach improves driving success rate by 16% over the state of the art, achieving 87% of the expert performance while also reducing the collision rate by an order of magnitude, without the use of any additional modality, auxiliary tasks, architectural modifications or reward from the environment.
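Idea (2) above can be sketched as sampling states for aggregation with probability proportional to the policy's uncertainty, so training progressively concentrates on the hard regions. The utility and uncertainty measures used in the paper are more involved than this placeholder.

```python
import numpy as np

def sample_replay(states, uncertainties, k, rng):
    """Draw k distinct states for aggregation, with probability proportional
    to the policy's uncertainty at each state, focusing on critical regions."""
    p = np.asarray(uncertainties, dtype=float)
    p = p / p.sum()                                   # normalize to a distribution
    idx = rng.choice(len(states), size=k, replace=False, p=p)
    return [states[i] for i in idx]

rng = np.random.default_rng(0)
states = [f"s{i}" for i in range(100)]
unc = rng.uniform(size=100) ** 4     # a few states dominate the uncertainty mass
batch = sample_replay(states, unc, k=10, rng=rng)
print(len(batch))  # 10
```

Compared to uniform DAgger-style aggregation, the weighted draw biases the buffer toward states where the policy is least reliable, which is the behavior the paper's analysis motivates.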
@inproceedings{Prakash2020CVPR,
author = {Prakash, Aditya and Behl, Aseem and Ohn-Bar, Eshed and Chitta, Kashyap and Geiger, Andreas},
title = {Exploring Data Aggregation in Policy Learning for Vision-based Urban Autonomous Driving},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2020}
}
Learning Situational Driving
Eshed Ohn-Bar,
Aditya Prakash,
Aseem Behl,
Kashyap Chitta,
Andreas Geiger
Conference on Computer Vision and Pattern Recognition (CVPR), 2020
abstract /
bibtex /
project page /
video
Human drivers have a remarkable ability to drive in diverse visual conditions and situations, e.g., from maneuvering in rainy, limited visibility conditions with no lane markings to turning in a busy intersection while yielding to pedestrians. In contrast, we find that state-of-the-art sensorimotor driving models struggle when encountering diverse settings with varying relationships between observation and action. To generalize when making decisions across diverse conditions, humans leverage multiple types of situation-specific reasoning and learning strategies. Motivated by this observation, we develop a framework for learning a situational driving policy that effectively captures reasoning under varying types of scenarios. Our key idea is to learn a mixture model with a set of policies that can capture multiple driving modes. We first optimize the mixture model through behavior cloning, and show it to result in significant gains in terms of driving performance in diverse conditions. We then refine the model by directly optimizing for the driving task itself, i.e., supervised with the navigation task reward. Our method is more scalable than methods assuming access to privileged information, e.g., perception labels, as it only assumes demonstration and reward-based supervision. We achieve over 98% success rate on the CARLA driving benchmark as well as state-of-the-art performance on a newly introduced generalization benchmark.
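The mixture-of-policies idea can be sketched with a softmax gating network that weighs a set of expert policies, with the final action being their convex combination. The experts in the paper are deep sensorimotor networks; the linear toys below are placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mixture_policy(obs, experts, gate_W):
    """Situation-conditioned mixture: a gating network weighs expert
    policies and the final action is their convex combination."""
    weights = softmax(obs @ gate_W)                    # (n_experts,)
    actions = np.stack([f(obs) for f in experts])      # (n_experts, action_dim)
    return weights @ actions

rng = np.random.default_rng(0)
experts = [lambda o, W=rng.normal(size=(4, 2)): o @ W for _ in range(3)]
gate_W = rng.normal(size=(4, 3))
action = mixture_policy(rng.normal(size=4), experts, gate_W)
print(action.shape)  # (2,)
```

Training would first fit the experts and gate by behavior cloning and then fine-tune the whole mixture on the task reward, mirroring the two-stage procedure described above.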
@inproceedings{Ohn-Bar2020CVPR,
author = {Ohn-Bar, Eshed and Prakash, Aditya and Behl, Aseem and Chitta, Kashyap and Geiger, Andreas},
title = {Learning Situational Driving},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2020}
}
Label Efficient Visual Abstractions for Autonomous Driving
Aseem Behl*,
Kashyap Chitta*,
Aditya Prakash,
Eshed Ohn-Bar,
Andreas Geiger
International Conference on Intelligent Robots and Systems (IROS), 2020
abstract /
bibtex /
arxiv
It is well known that semantic segmentation can be used as an effective intermediate representation for learning driving policies. However, the task of street scene semantic segmentation requires expensive annotations. Furthermore, segmentation algorithms are often trained irrespective of the actual driving task, using auxiliary image-space loss functions which are not guaranteed to maximize driving metrics such as safety or distance traveled per intervention. In this work, we seek to quantify the impact of reducing segmentation annotation costs on learned behavior cloning agents. We analyze several segmentation-based intermediate representations. We use these visual abstractions to systematically study the trade-off between annotation efficiency and driving performance, i.e., the types of classes labeled, the number of image samples used to learn the visual abstraction model, and their granularity (e.g., object masks vs. 2D bounding boxes). Our analysis uncovers several practical insights into how segmentation-based visual abstractions can be exploited in a more label efficient manner. Surprisingly, we find that state-of-the-art driving performance can be achieved with orders of magnitude reduction in annotation cost. Beyond label efficiency, we find several additional training benefits when leveraging visual abstractions, such as a significant reduction in the variance of the learned policy when compared to state-of-the-art end-to-end driving models.
@inproceedings{Behl2020IROS,
author = {Behl, Aseem and Chitta, Kashyap and Prakash, Aditya and Ohn-Bar, Eshed and Geiger, Andreas},
title = {Label Efficient Visual Abstractions for Autonomous Driving},
booktitle = {International Conference on Intelligent Robots and Systems (IROS)},
year = {2020}
}
Deep Fundamental Matrix Estimation without Correspondences
Omid Poursaeed*,
Guandao Yang*,
Aditya Prakash*,
Qiuren Fang,
Hanqing Jiang,
Bharath Hariharan,
Serge Belongie
European Conference on Computer Vision (ECCV) Workshops, 2018
abstract /
bibtex /
arxiv /
code
Estimating fundamental matrices is a classic problem in computer vision. Traditional methods rely heavily on the correctness of estimated key-point correspondences, which can be noisy and unreliable. As a result, it is difficult for these methods to handle image pairs with large occlusion or significantly different camera poses. In this paper, we propose novel neural network architectures to estimate fundamental matrices in an end-to-end manner without relying on point correspondences. New modules and layers are introduced in order to preserve the mathematical properties of the fundamental matrix as a homogeneous rank-2 matrix with seven degrees of freedom. We analyze the performance of the proposed models on the KITTI dataset, and show that they achieve competitive performance with traditional methods without the need to extract correspondences.
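The rank-2, seven-degrees-of-freedom property mentioned above can be enforced with a simple projection step: zero the smallest singular value and normalize the overall scale. This is a standard construction and not necessarily the paper's exact parameterization.

```python
import numpy as np

def project_to_fundamental(M):
    """Project an arbitrary 3x3 matrix onto the space of fundamental
    matrices: truncate to rank 2 via SVD and normalize the scale
    (the matrix is homogeneous, leaving 7 degrees of freedom)."""
    U, s, Vt = np.linalg.svd(M)
    s[2] = 0.0                        # enforce det(F) = 0 (rank 2)
    F = U @ np.diag(s) @ Vt
    return F / np.linalg.norm(F)      # fix the homogeneous scale

rng = np.random.default_rng(0)
F = project_to_fundamental(rng.normal(size=(3, 3)))
print(int(np.linalg.matrix_rank(F)), round(float(np.linalg.norm(F)), 6))  # 2 1.0
```

A layer like this can sit at the end of a regression network so that whatever the network outputs, the final prediction is always a valid fundamental matrix.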
@inproceedings{Poursaeed2018ECCVW,
author = {Poursaeed, Omid and Yang, Guandao and Prakash, Aditya and Fang, Qiuren and Jiang, Hanqing and Hariharan, Bharath and Belongie, Serge},
title = {Deep Fundamental Matrix Estimation without Correspondences},
booktitle = {European Conference on Computer Vision (ECCV) Workshops},
year = {2018}
}
iSPA-Net: Iterative Semantic Pose Alignment Network
Jogendra Nath Kundu*,
Aditya Ganeshan*,
Rahul M Venkatesh*,
Aditya Prakash,
R. Venkatesh Babu
ACM Conference on Multimedia (ACM MM), 2018
abstract /
bibtex /
arxiv /
code
Understanding and extracting 3D information about objects from monocular 2D images is a fundamental problem in computer vision. In the task of 3D object pose estimation, recent data-driven deep neural network approaches suffer from a scarcity of real images with 3D keypoint and pose annotations. Drawing inspiration from human cognition, where annotators use a 3D CAD model as a structural reference to acquire ground-truth viewpoints for real images, we propose an iterative Semantic Pose Alignment Network, called iSPA-Net. Our approach exploits semantic 3D structural regularity to solve the task of fine-grained pose estimation by predicting the viewpoint difference between a given pair of images. Such an image-comparison-based approach also alleviates the problem of data scarcity and hence enhances the scalability of the proposed approach to novel object categories with minimal annotation. The fine-grained object pose estimator is further aided by correspondences from a learned spatial descriptor of the input image pair. The proposed pose alignment framework can refine its initial pose estimate over consecutive iterations by utilizing an online rendering setup along with a non-uniform bin classification of the pose difference. This enables iSPA-Net to achieve state-of-the-art performance on various real-image viewpoint estimation datasets. Further, we demonstrate the effectiveness of the approach for multiple applications. First, we show results for active object viewpoint localization, capturing images from a similar pose given only a single image as pose reference. Second, we demonstrate the ability of the learned semantic correspondence to perform unsupervised part-segmentation transfer using only a single part-annotated 3D template model per object class. To encourage reproducible research, we have released the code for our proposed algorithm.
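The non-uniform bin classification of pose difference can be illustrated with polynomially spaced bin edges, which give finer resolution to small viewpoint differences than to large ones. The edge spacing, bin count, and exponent here are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def nonuniform_bins(max_angle=180.0, n_bins=10, power=2.0):
    """Non-uniform bin edges for pose differences: edges grow polynomially,
    so small viewpoint differences get finer angular resolution."""
    return max_angle * (np.linspace(0.0, 1.0, n_bins + 1) ** power)

def to_bin(angle, edges):
    """Map an angular pose difference (degrees) to its bin index."""
    return int(np.clip(np.searchsorted(edges, angle, side="right") - 1,
                       0, len(edges) - 2))

edges = nonuniform_bins()
# The first bin spans only 1.8 degrees, while the last spans over 34 degrees.
print(to_bin(3.0, edges), to_bin(90.0, edges))  # 1 7
```

Treating pose refinement as classification over such bins lets the iterative estimator take large corrective steps when far from the target and small, precise steps once it is close.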
@inproceedings{Kundu2018ACMMM,
author = {Kundu, Jogendra Nath and Ganeshan, Aditya and Venkatesh, Rahul M. and Prakash, Aditya and Babu, R. Venkatesh},
title = {iSPA-Net: Iterative Semantic Pose Alignment Network},
booktitle = {ACM Conference on Multimedia (ACM MM)},
year = {2018}
}