JUICER: Data-Efficient Imitation Learning for Robotic Assembly

Harvard University · Massachusetts Institute of Technology · Improbable AI Lab

JUICER improves performance of imitation learning policies for assembly.

Abstract

While learning from demonstrations is powerful for acquiring visuomotor policies, high-performance imitation without large demonstration datasets remains challenging for tasks requiring precise, long-horizon manipulation. This paper proposes a pipeline for improving imitation learning performance with a small human demonstration budget. We apply our approach to assembly tasks that require precisely grasping, reorienting, and inserting multiple parts over long horizons and multiple task phases. Our pipeline combines expressive policy architectures and various techniques for dataset expansion and simulation-based data augmentation. These help expand dataset support and supervise the model with locally corrective actions near bottleneck regions requiring high precision. We demonstrate our pipeline on four furniture assembly tasks in simulation, enabling a manipulator to assemble up to five parts over nearly 2500 time steps directly from RGB images, outperforming imitation and data augmentation baselines.

TL;DR

JUICER is a pipeline for learning image-based policies for complex multi-step long-horizon assembly tasks from a small number of demonstrations by combining expressive policy architectures and various techniques for dataset expansion.

Policies get an egocentric RGB image from the wrist (left video above) and an RGB image from the front view (right video above) as input, as well as the proprioceptive state of the robot end-effector. No explicit object pose is provided to the policy, either from the AprilTags or from the simulator. Note: AprilTags are present as part of the benchmark environment.

These policies are trained in the environment with AprilTags included and rolled out zero-shot in the environment with the tags removed. The rest of the experiments are conducted in the unaltered environment.
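The observation the policy receives at each control step can be sketched as a small dict of two camera images plus end-effector proprioception. This is an illustrative assumption, not the paper's code: the function name, keys, image resolution, and proprioception layout are all hypothetical.

```python
import numpy as np

def make_observation(wrist_rgb, front_rgb, ee_pos, ee_quat, gripper_width):
    """Bundle raw sensor readings into a policy input dict (hypothetical sketch).

    Note: no object poses appear anywhere in the observation -- only
    images and the robot's own end-effector state.
    """
    assert wrist_rgb.shape == front_rgb.shape == (224, 224, 3)
    # 3-D position + 4-D quaternion + scalar gripper width -> (8,) vector
    proprio = np.concatenate([ee_pos, ee_quat, [gripper_width]])
    return {
        "wrist_rgb": wrist_rgb,   # egocentric camera mounted on the wrist
        "front_rgb": front_rgb,   # static front-view camera
        "proprio": proprio,       # end-effector state only
    }

obs = make_observation(
    wrist_rgb=np.zeros((224, 224, 3), dtype=np.uint8),
    front_rgb=np.zeros((224, 224, 3), dtype=np.uint8),
    ee_pos=np.zeros(3),
    ee_quat=np.array([0.0, 0.0, 0.0, 1.0]),
    gripper_width=0.04,
)
```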

Methods

System Overview

JUICER is a pipeline for improving imitation learning performance with a small human demonstration budget, consisting of five main steps illustrated in this figure:

A schematic representation of the JUICER method

Overview of the proposed approach. (1) Collect a small number of demonstrations for the task (and related tasks, if available). (2) Using annotations of bottleneck states, augment the demonstration trajectories to create synthetic corrective actions and increase coverage around the bottleneck states. (3) Use this dataset to train models with different hyperparameters. (4) Store all rollouts throughout model evaluations. (5) Add any successful rollout to the training set and train the best architecture on all data amassed.
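The five steps above form a simple data flywheel, sketched below in heavily simplified form. Every function here is a stand-in (an assumption for illustration), not the authors' actual implementation; the stub success check in particular replaces real policy evaluation.

```python
import random

def run_pipeline(n_demos=10, n_models=3, n_eval_rollouts=20):
    """Minimal sketch of the 5-step JUICER loop (all names are hypothetical)."""
    # (1) Collect a small set of human demonstrations.
    demos = [f"demo_{i}" for i in range(n_demos)]
    # (2) Augment each trajectory around annotated bottleneck states.
    dataset = demos + [f"aug_{d}" for d in demos]
    # (3) Train several model variants on the augmented dataset.
    for model_id in range(n_models):
        # (4) Roll out each model and store every evaluation episode.
        for rollout in range(n_eval_rollouts):
            success = random.random() < 0.5  # stub for real policy evaluation
            if success:
                # (5) Keep successful rollouts as extra training data.
                dataset.append(f"rollout_{model_id}_{rollout}")
    # Finally, the best architecture is retrained on all data amassed.
    return dataset

data = run_pipeline(n_demos=4, n_models=2, n_eval_rollouts=5)
```

The key design point is that evaluation rollouts, which would otherwise be discarded, are recycled: any success becomes free additional supervision for the final training run.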


Trajectory Augmentation System

By identifying bottleneck states in the demonstration trajectories, we can generate synthetic trajectories that supervise the model with locally corrective actions near bottleneck regions requiring high precision.

A schematic representation of the trajectory augmentation tool

An example of extracting "bottleneck" states and using the trajectory augmentation tool to create an arbitrary number of counterfactual trajectories near the bottleneck, effectively increasing the data support and teaching corrective actions.
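One way such an augmentation tool could work is sketched below: given an annotated bottleneck end-effector position, sample perturbed poses around it and synthesize short corrective trajectories that lead back to the bottleneck. This is an assumption about the mechanism, not the paper's exact implementation; the linear interpolation and Gaussian perturbation are illustrative choices.

```python
import numpy as np

def augment_bottleneck(bottleneck_pos, n_trajs=5, noise_scale=0.02,
                       n_steps=10, rng=None):
    """Generate synthetic corrective trajectories around a bottleneck pose.

    Hypothetical sketch: perturbs the approach position and interpolates
    linearly back to the bottleneck, densifying data support there.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    trajectories = []
    for _ in range(n_trajs):
        # Perturbed start pose: the "counterfactual" approach.
        start = bottleneck_pos + rng.normal(0.0, noise_scale, size=3)
        # Corrective trajectory: move back toward the bottleneck.
        alphas = np.linspace(0.0, 1.0, n_steps)[:, None]
        traj = (1 - alphas) * start + alphas * bottleneck_pos
        trajectories.append(traj)
    return trajectories

trajs = augment_bottleneck(np.array([0.5, 0.0, 0.2]))
```

Because every synthetic trajectory terminates exactly at the bottleneck, the policy sees many slightly-off approaches paired with actions that correct them, which is the intended supervision signal.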

Numerical Results

Comparison of Pipeline and Baselines

We test the methods on four furniture assembly tasks in simulation, and compare the success rates for the different methods across the tasks below.

The main results presented as a grouped bar chart

Average and maximum success rates (%) of methods across tasks. Bolded methods are components of our JUICER pipeline.


Starting Pipeline from Minimal Human Demonstration

Furthermore, we show that JUICER can start from only 10 human demonstrations and outperform models trained on 50 demonstrations without JUICER.

Success rate for the one_leg task with and without JUICER

Examples of Successful Rollouts

Round Table

Lamp

Square Table

An interesting behavior to notice in these square table rollouts is that the robot adjusts the tabletop up against the guard after screwing in the first leg, ensuring it stays more in-distribution for the second. It does not do this in the first rollout, as the tabletop is already sufficiently aligned with the guard.

Examples of Failure Modes

Round Table

Failing to insert the leg into the tabletop

Failing to insert the base into the leg

Lamp

Missing the hole when inserting the bulb

Partially screwing in bulb

Bulb rolling into an unseen location

Misgrasping the base, causing distribution shift

Square Table

Not properly inserting the 3rd leg

Missing the insertion of the 1st leg

Incompletely screwing in the 1st leg and misplacing the 2nd

Misgrasping the 2nd leg

Related Links

There's a lot of excellent work related to ours in the space of manipulation and assembly, for example:

FurnitureBench introduces a real-world furniture assembly benchmark that aims to provide a reproducible and easy-to-use platform for complex, long-horizon robotic manipulation; we build on this benchmark in our work.

ASAP is a physics-based planning approach for automatically generating sequences for general-shaped assemblies, accounting for gravity to design a sequence where each sub-assembly is physically stable.

InsertionNet 1.0 and 2.0 address the problem of insertion specifically and propose regression-based methods that combine visual and force inputs to solve various insertion tasks efficiently and robustly. InsertionNet 2.0 improves upon the original by introducing multimodal input, contrastive learning, and a one-shot learning technique using a relation network scheme, achieving near-perfect performance on 16 real-life insertion tasks while minimizing execution time and contact during insertion.

Grasping with Chopsticks develops an autonomous chopsticks-equipped robotic manipulator for picking up small objects using mainly two approaches to reduce covariate shift and improve generalization: applying an invariant operator to increase data support at critical grasping steps and generating synthetic corrective labels to prevent error accumulation.

There's been an increasing amount of theoretical analysis of imitation learning, with recent works focusing on the properties of noise injection and corrective actions. Provable Guarantees for Generative Behavior Cloning proposes a framework for generative behavior cloning, ensuring continuity through data augmentation and noise injection. CCIL generates corrective data using local continuity in environment dynamics, while TaSIL penalizes deviations in higher-order Taylor series terms between learned and expert policies. These works aim to enhance the robustness and sample efficiency of imitation learning algorithms.

BibTeX

@article{ankile2024juicer,
  author    = {Ankile, Lars and Simeonov, Anthony and Shenfeld, Idan and Agrawal, Pulkit},
  title     = {JUICER: Data-Efficient Imitation Learning for Robotic Assembly},
  journal   = {arXiv},
  year      = {2024},
}