💬 Rebuttal Materials

  1. Detailed comparison between DICE-RL and ResFiT (Reviewer b6jW, t2wZ, N9LR)
  2. DICE-RL on LIBERO-10 (Reviewer b6jW, t2wZ)
  3. DICE-RL on LIBERO-10 with VLA (Reviewer b6jW, 24a2, t2wZ, N9LR)
  4. DSRL with varying amounts of pretraining data (Reviewer t2wZ)
  5. Updated results on BoK action sampling (Reviewer t2wZ)
  6. DICE-RL with various RLPD schedules (Reviewer t2wZ)
  7. Reward labeling for real world tasks (Reviewer 24a2)
  8. Wall-clock time cost (Reviewer 24a2)

1. Detailed comparison between DICE-RL and ResFiT (Reviewer b6jW, t2wZ, N9LR)

ablation ours vs resfit breakdown

Caption: We compare DICE-RL against three variants: DICE-RL without the BC loss, DICE-RL without chunked RL, and ResFiT. DICE-RL outperforms both DICE-RL without the BC loss and DICE-RL without chunked RL, and both of these variants in turn outperform ResFiT. These results support the two core design differences between DICE-RL and ResFiT: (i) DICE-RL couples chunked residual action learning with chunked value learning, whereas ResFiT uses a per-step residual actor with a per-step critic; and (ii) DICE-RL includes an explicit BC regularization term that keeps fine-tuning close to the pretrained prior.

2. DICE-RL on LIBERO-10 (Reviewer b6jW, t2wZ)

ablation libero average

Caption: Average success rate across the 10 tasks in LIBERO-10, using approximately 3,000 online episodes in total across all tasks. The pretrained policy is a flow-matching policy trained on the full LIBERO-10 dataset.

ablation libero per task

Caption: Average success rate for each task. The pretrained policy is a flow-matching policy trained on the full LIBERO-10 dataset.

3. DICE-RL on LIBERO-10 with VLA (π0) (Reviewer b6jW, 24a2, t2wZ, N9LR)

ablation libero vla average

Caption: Average success rate across the 10 tasks in LIBERO-10, using approximately 3,000 online episodes in total across all tasks. The pretrained policy is π0 finetuned on a subset of the LIBERO-10 dataset using 30 demonstrations per task.

ablation libero vla per task

Caption: Average success rate for each task. The pretrained policy is π0 finetuned on a subset of the LIBERO-10 dataset using 30 demonstrations per task.

4. DSRL with varying amounts of pretraining data (Reviewer t2wZ)

analysis num demo dsrl

Caption: DSRL finetuning with a varying number of demonstrations used for BC policy pretraining.

5. Updated results on BoK action sampling (Reviewer t2wZ)

best of k bar plot

Caption: We evaluate DICE-RL with K = 4, 16, 32, 64, where K is the number of action samples used in Best-of-K action selection. On Tool Hang, K = 16 achieves the best performance. Increasing K further to 64 leads to a slight drop in performance, likely due to critic overestimation.
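For concreteness, a minimal PyTorch-style sketch of Best-of-K selection as described above; `policy.sample` and `critic` are hypothetical interfaces standing in for the action sampler and the (chunked) critic:

```python
import torch

def best_of_k_action(obs, policy, critic, k=16):
    """Sample K candidate actions from the policy and return the one the critic
    scores highest. Larger K exploits the critic more aggressively, which can
    amplify value-overestimation errors (consistent with the drop at K = 64)."""
    candidates = policy.sample(obs, num_samples=k)        # (K, action_dim), hypothetical API
    obs_batch = obs.unsqueeze(0).expand(k, *obs.shape)    # repeat the observation per candidate
    values = critic(obs_batch, candidates).squeeze(-1)    # (K,) critic values
    return candidates[values.argmax()]
```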

6. DICE-RL with various RLPD schedules (Reviewer t2wZ)

ablation rlpd schedule

Caption: DICE-RL finetuning with varying RLPD schedules. DICE-RL is not particularly sensitive to the ratio of offline data.
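For context, the RLPD schedule controls how each training batch is split between offline demonstrations and online replay data. A minimal sketch, assuming simple buffer objects with a `sample` method returning dicts of arrays (not the exact implementation):

```python
import numpy as np

def sample_mixed_batch(offline_buffer, online_buffer, batch_size=256, offline_ratio=0.5):
    """Draw a fixed fraction of each batch from the offline demos and the rest from
    the online replay buffer; the ablation varies offline_ratio across training."""
    n_offline = int(batch_size * offline_ratio)
    offline = offline_buffer.sample(n_offline)                 # hypothetical buffer API
    online = online_buffer.sample(batch_size - n_offline)
    return {key: np.concatenate([offline[key], online[key]], axis=0) for key in offline}
```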

7. Reward labeling for real world tasks (Reviewer 24a2)

Pretrained BC Policy (10× speed): 56.67%

Finetuned RL Policy (10× speed): 90%

Caption: Example task with automatic reward labeling. We use OpenCV-based color thresholding to detect whether the light bulb is lit, and assign a binary reward of 0 or 1 accordingly. Part of the video is blurred to preserve anonymity.
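A minimal sketch of this kind of detector, assuming a BGR camera frame; the region of interest and HSV thresholds below are illustrative placeholders that must be calibrated for the actual setup:

```python
import cv2
import numpy as np

def bulb_lit_reward(frame_bgr, roi=(100, 100, 200, 200), min_bright_fraction=0.2):
    """Binary reward: 1 if the light bulb region looks lit, else 0."""
    x, y, w, h = roi                                   # placeholder crop around the bulb
    patch = frame_bgr[y:y + h, x:x + w]
    hsv = cv2.cvtColor(patch, cv2.COLOR_BGR2HSV)
    # A lit bulb shows up as low-saturation, high-brightness pixels.
    mask = cv2.inRange(hsv, (0, 0, 200), (180, 60, 255))
    bright_fraction = float(np.count_nonzero(mask)) / mask.size
    return 1 if bright_fraction > min_bright_fraction else 0
```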

8. Wall-clock time cost (Reviewer 24a2)

Timelapse of the gear insertion task (10× speed)

Caption: Timelapse of the gear insertion task (10× speed). The wall-clock time required to improve performance from 46.67% to 90% is 1.5 hours. For the light bulb insertion and gear assembly tasks, the corresponding wall-clock times are approximately 3 hours and 6 hours, respectively.

Pretrained BC Policy (10× speed): 46.67%

Finetuned RL Policy (10× speed): 90%


📄 Original Submission

From Prior to Pro:
Efficient Skill Mastering via Distribution Contractive RL Finetuning

teaser figure main results bar plot

TL;DR

We introduce Distribution Contractive Reinforcement Learning (DICE-RL), a framework that uses reinforcement learning (RL) as a "distribution contractor" to refine pretrained generative robot policies. DICE-RL turns a pretrained behavior prior into a high-performing "pro" policy by amplifying high-success behaviors from online feedback. We pretrain a diffusion-based policy for broad behavioral coverage, then finetune it with a stable, sample-efficient residual off-policy RL framework that combines selective behavior regularization with value-guided action selection. Extensive experiments and analyses show that DICE-RL reliably improves performance with strong stability and sample efficiency, enabling mastery of complex long-horizon manipulation skills both in simulation and on a real robot.

Real Robot Results

We evaluate DICE-RL on a challenging real-world Belt Assembly task from the NIST benchmark. The BC policy pretrained with 265 demos has three dominant failure modes. After 420 online episodes, DICE-RL reliably succeeds on this contact-rich task. We overlay a representative rollout with the running change in action entropy (\(\Delta H\)) and value improvement (\(\Delta V\)); the largest entropy drops and value gains occur around critical contact transitions.

real robot results

DICE-RL improves the pretrained policy's success rate from 56.67% to 93.33% over 30 runs. See our uncut evaluation video (10× speed).

Pretrained BC Policy (10× speed)

Finetuned RL Policy (10× speed)

Simulation Results & Analyses

Main Results

We compare DICE-RL against prior methods, focusing on approaches that build on pretrained diffusion-based policies. We benchmark on the Can, Square, Transport, and Tool Hang tasks from the Robomimic benchmark, and report results for both state-based and pixel-based observations. For Can, the BC policies are trained on 20 demonstrations, while for the other tasks they are trained on 50 demonstrations from the Proficient-Human (PH) dataset.

main results 2x4 main results legend

DICE-RL attains the highest final performance while also being more stable and sample efficient across all tasks, and it succeeds across all difficulty levels with a single training recipe.

Understanding DICE-RL

Distribution Sharpening

Our finetuning objective combines critic value maximization with a BC-style residual penalty. As training progresses, this shifts probability mass away from low-value action samples and toward consistently high-value regions, yielding a sharper action distribution at visited states.
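A minimal PyTorch-style sketch of this objective; the module names (`base_policy`, `residual_actor`, `chunk_critic`) and the `bc_weight` coefficient are illustrative stand-ins, not the exact released implementation:

```python
import torch

def actor_loss(obs, base_policy, residual_actor, chunk_critic, bc_weight=1.0):
    """Critic value maximization plus a BC-style residual penalty that keeps the
    finetuned policy close to the pretrained prior while concentrating mass on
    high-value actions."""
    with torch.no_grad():
        base_chunk = base_policy(obs)              # action chunk from the pretrained prior
    residual = residual_actor(obs, base_chunk)     # learned chunked residual
    action = base_chunk + residual
    q_value = chunk_critic(obs, action).mean()     # push probability mass toward high-value actions
    bc_penalty = residual.pow(2).mean()            # penalize deviation from the behavior prior
    return -q_value + bc_weight * bc_penalty
```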

value entropy correlation scatter

Using states from the offline demonstrations as anchors, we sample actions from the finetuned policy and compute (i) value gain relative to the pretrained BC policy and (ii) the empirical entropy drop of the sampled actions. We find a clear coupling: states with larger value improvements exhibit larger entropy drops, suggesting that successful finetuning coincides with stronger distributional concentration.
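A sketch of the per-state statistics, assuming a sampling interface on both policies and a k-nearest-neighbor entropy estimator (the exact estimator we use may differ):

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_entropy(samples, k=3):
    """Kozachenko-Leonenko style k-NN entropy estimate (up to additive constants)."""
    n, d = samples.shape
    dists, _ = cKDTree(samples).query(samples, k=k + 1)   # first column is the point itself
    radius = np.maximum(dists[:, -1], 1e-12)
    return d * np.mean(np.log(radius)) + np.log(n - 1)

def state_statistics(state, bc_policy, rl_policy, critic, n_samples=64):
    """Value gain of the finetuned policy over the BC prior and the entropy drop
    of its action samples, evaluated at a single anchor state."""
    a_bc = bc_policy.sample(state, n_samples)              # (N, action_dim), hypothetical API
    a_rl = rl_policy.sample(state, n_samples)
    value_gain = critic(state, a_rl).mean() - critic(state, a_bc).mean()
    entropy_drop = knn_entropy(a_bc) - knn_entropy(a_rl)
    return value_gain, entropy_drop
```

Plotting `entropy_drop` against `value_gain` across anchor states gives the scatter above.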

value improve entropy reduction

We show a representative rollout trajectory of the RL policy on Tool Hang, together with the running change in action entropy (\(\Delta H\)) and value improvement (\(\Delta V\)). We zoom in on frames where value improvement spikes and action entropy drops; these states are often critical for task success (e.g., pre-insertion and insertion). In contrast, during free-space motions that are less consequential for success, we observe less reduction in action entropy.

Contraction and Robustness

Distribution sharpening is a state-local effect. Contraction, in contrast, is a trajectory-level property of the closed-loop dynamics induced by a policy: over a task-relevant region and in a chosen metric, trajectories initialized from nearby states move closer over time, reflecting reduced sensitivity to initial conditions. This notion is closely related to incremental stability and connects to robust-control "funnel" or trajectory-tube intuitions, where stability of a tube around nominal behavior implies improved robustness to perturbations.

To probe contraction empirically, we sample many pairs of nearby anchor states \((s_0,s'_0)\) from the offline demonstrations \(D_{\text{demo}}\). From each pair, we roll out (i) the finetuned RL policy for \(T\) steps to obtain \(\{s_t^{\text{RL}}\}_{t=0}^T\) and \(\{s_t^{\prime\,\text{RL}}\}_{t=0}^T\), (ii) the pretrained BC policy for \(T\) steps to obtain \(\{s_t^{\text{Pre}}\}_{t=0}^T\) and \(\{s_t^{\prime\,\text{Pre}}\}_{t=0}^T\), and (iii) the corresponding expert trajectories of length \(T\) starting from the same anchors, denoted \(\{s_t^{\text{E}}\}_{t=0}^T\) and \(\{s_t^{\prime\,\text{E}}\}_{t=0}^T\). We then measure the normalized pairwise divergence for each rollout type \(x\in\{\text{RL},\text{Pre},\text{E}\}\): \[ c^{x}(t)\;=\;\frac{\big\|s_t^{x}-s_t^{\prime\,x}\big\|_2^2}{\big\|s_0-s'_0\big\|_2^2}. \] Rollouts under our RL policy exhibit a more stable (and typically smaller) evolution of \(c(t)\) than both the pretrained BC policy and the expert demonstration rollouts, indicating stronger contraction of the closed-loop behavior.
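Numerically, each divergence curve reduces to a few lines, assuming rollouts are stored as arrays of states:

```python
import numpy as np

def divergence_curve(traj, traj_prime):
    """Normalized pairwise divergence c(t) = ||s_t - s'_t||^2 / ||s_0 - s'_0||^2
    for two equal-length rollouts started from nearby anchor states."""
    traj, traj_prime = np.asarray(traj), np.asarray(traj_prime)
    sq_dist = np.sum((traj - traj_prime) ** 2, axis=-1)   # squared gap at every timestep
    return sq_dist / max(sq_dist[0], 1e-12)               # normalize by the anchor gap
```

Averaging `divergence_curve` over many anchor pairs for the RL, pretrained, and expert rollouts yields the \(c(t)\) curves we compare.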

Rollout Distribution Visualization (Tool Hang)

We overlay the rollouts from the finetuned RL policy with the demonstration trajectories. The RL policy contracts around critical, contact-rich states.

tool hang visualization
Rollout Distribution Visualization (Transport)
transport visualization