Contact-rich bimanual manipulation involves precise coordination of two arms to change the state of objects through strategically selected contacts and motions. Due to the inherent complexity of these tasks, acquiring sufficient demonstration data and training policies that generalize to unseen scenarios remain largely unresolved challenges. Building on recent advances in planning through contacts, we introduce Generalizable Planning-Guided Diffusion Policy Learning (GLIDE), an approach that learns to solve contact-rich bimanual manipulation tasks by leveraging model-based motion planners to generate demonstration data in high-fidelity physics simulation. Through efficient planning in randomized environments, our approach generates large-scale, high-quality synthetic motion trajectories for tasks involving diverse objects and transformations. We then train a task-conditioned diffusion policy on these demonstrations via behavior cloning. To tackle the sim-to-real gap, we propose a set of essential design choices in feature extraction, task representation, action prediction, and data augmentation that enable the policy to robustly predict smooth action sequences and generalize to unseen scenarios. Through experiments in both simulation and the real world, we demonstrate that our approach enables a bimanual robotic system to effectively manipulate objects of diverse geometries, dimensions, and physical properties.
We propose Generalizable Planning-Guided Diffusion Policy Learning (GLIDE). We specifically consider a tabletop environment in which a bimanual robot reorients a target object to a goal pose in SE(2). The policy takes as input visual observations (specifically, uncolored point clouds), proprioceptive robot joint angles, and a task-conditioning vector (i.e., the delta transformation between the current and target object poses).
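Concretely, the policy input can be assembled as in the sketch below. The function and field names are illustrative assumptions rather than the exact GLIDE interface; only the observation modalities and the SE(2) delta-transformation conditioning come from the description above, and expressing the delta in the current object frame is one common convention.

```python
import numpy as np

def build_policy_input(points_xyz, joint_angles, current_pose, target_pose):
    """Assemble one policy observation (illustrative sketch).

    points_xyz:   (N, 3) uncolored point cloud
    joint_angles: (J,) proprioceptive joint angles of both arms
    current_pose, target_pose: (3, 3) homogeneous SE(2) object poses on the table
    """
    # Task conditioning: delta transformation from the current to the target pose.
    delta = np.linalg.inv(current_pose) @ target_pose
    # Flatten the delta into (dx, dy, dtheta) for conditioning the diffusion policy.
    dx, dy = delta[0, 2], delta[1, 2]
    dtheta = np.arctan2(delta[1, 0], delta[0, 0])
    return {
        "point_cloud": points_xyz.astype(np.float32),
        "joint_angles": joint_angles.astype(np.float32),
        "task_cond": np.array([dx, dy, dtheta], dtype=np.float32),
    }
```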
Overall, we adopt a sim-to-real approach. We generate large-scale demonstrations in randomized simulation environments by building on recent advances in efficient motion planning through contact. We then filter out unsuccessful planner trajectories as well as suboptimal trajectories that take too long to reach the goal, resulting in a set of high-quality synthetic demonstrations.
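A minimal sketch of this filtering step is shown below. The trajectory fields (`success`, `actions`) and the step budget are assumptions for illustration; the description above only specifies that failed and overly long plans are discarded.

```python
def filter_demonstrations(trajectories, max_steps=400):
    """Keep only successful plans that reach the goal within a step budget."""
    kept = []
    for traj in trajectories:
        if not traj["success"]:
            continue  # discard plans that never reached the goal pose
        if len(traj["actions"]) > max_steps:
            continue  # discard suboptimal plans that take too long
        kept.append(traj)
    return kept
```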
Next, we train a goal-conditioned point cloud diffusion policy on these synthetic demonstrations via Behavior Cloning. We crop the input point cloud to the robot's workspace and remove the irrelevant background. We also introduce several essential design choices that significantly enhance our policy's ability to transfer to the real world and generalize to unseen scenarios: (1) we introduce a Flying Point Augmentation approach that adds large Gaussian noise to points with a small probability (sketched below); (2) our policy predicts residual robot joint actions rather than absolute joint actions; (3) during inference, we use a longer action sequence length than originally used in Diffusion Policy and DP3.
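The sketch below illustrates design choices (1) and (2). The probability, noise scale, and the chaining of residuals into absolute joint targets are assumptions chosen for illustration, not the exact training configuration.

```python
import numpy as np

def flying_point_augmentation(points, prob=0.01, noise_std=0.5, rng=None):
    """Perturb a small random subset of points with large Gaussian noise.

    points: (N, 3) point cloud in the robot frame.
    """
    if rng is None:
        rng = np.random.default_rng()
    points = points.copy()
    mask = rng.random(len(points)) < prob  # pick a few "flying" points
    points[mask] += rng.normal(0.0, noise_std, size=(int(mask.sum()), 3))
    return points

def apply_residual_actions(current_joints, predicted_deltas):
    """Convert residual joint actions into absolute joint targets.

    current_joints: (J,), predicted_deltas: (T, J) -> (T, J) targets.
    Chaining each residual onto the previous target is one possible convention.
    """
    return current_joints[None, :] + np.cumsum(predicted_deltas, axis=0)
```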
To deploy our policy in the real world, we need to track the change in pose of arbitrary objects. We achieve this by using open-world detection (Grounding-DINO) and segmentation (EfficientViT-SAM) to segment the target object in the first frame. We then select keypoints within the segmented mask and track them across subsequent frames, allowing us to compute the delta transformation of the object pose.
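One way to recover the delta SE(2) transformation from tracked keypoints is a standard 2D Procrustes (Kabsch) fit, sketched below under the assumption that matched keypoints have already been projected into the table plane; this is an illustrative reconstruction, not necessarily the exact computation used in the pipeline.

```python
import numpy as np

def delta_se2_from_keypoints(kp_prev, kp_curr):
    """kp_prev, kp_curr: (K, 2) matched keypoint positions in the table plane."""
    mu_p, mu_c = kp_prev.mean(axis=0), kp_curr.mean(axis=0)
    P, C = kp_prev - mu_p, kp_curr - mu_c      # centered point sets
    H = P.T @ C                                # 2x2 cross-covariance (prev -> curr)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, d]) @ U.T         # rotation mapping prev points to curr
    t = mu_c - R @ mu_p                        # in-plane translation
    theta = np.arctan2(R[1, 0], R[0, 0])
    return np.array([t[0], t[1], theta])       # (dx, dy, dtheta)
```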
To create our demonstration dataset, we generate 2,000 rectangular box primitives in simulation with randomized dimensions, masses, and friction coefficients. We then use our motion planning pipeline to generate ~30k successful demonstrations (about 2 days on a 96-core CPU machine). We finally filter out suboptimal trajectories, rebalance the data, sample ~10k trajectories, and train our point cloud diffusion policy via Behavior Cloning.
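A sketch of the environment randomization is given below; the sampling ranges are placeholder assumptions, since the description above only states that dimensions, mass, and friction coefficients are randomized.

```python
import numpy as np

def sample_box_primitive(rng):
    """Sample one rectangular box with randomized dimensions, mass, and friction."""
    return {
        "size_xyz": rng.uniform([0.10, 0.10, 0.05], [0.40, 0.40, 0.30]),  # meters
        "mass": rng.uniform(0.2, 2.0),                                    # kilograms
        "friction": rng.uniform(0.3, 1.0),                                # coefficient
    }

rng = np.random.default_rng(0)
boxes = [sample_box_primitive(rng) for _ in range(2000)]
```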
We perform evaluation on objects whose geometries, dimensions, and physical properties are within the distribution of training demonstration trajectories.
We perform evaluation on challenging out-of-distribution scenarios, including objects with novel geometries and physical properties, as well as external perturbations. Although we train our policy exclusively on box primitives, it already achieves good out-of-distribution generalization. A promising future direction is to further diversify the training environments by incorporating objects from large-scale datasets such as Objaverse.