Caption: We compare DICE-RL against two ablations, DICE-RL without the BC loss and DICE-RL without chunked RL, as well as ResFiT. DICE-RL outperforms both ablations, and both ablations in turn outperform ResFiT. These results support the two core design differences between DICE-RL and ResFiT: (i) DICE-RL couples chunked residual action learning with chunked value learning, whereas ResFiT uses a per-step residual actor with a per-step critic; and (ii) DICE-RL includes an explicit BC regularization term that keeps fine-tuning close to the pretrained prior.
Caption: Average success rate across the 10 tasks in LIBERO-10, using approximately 3,000 online episodes in total across all tasks. The pretrained policy is a flow-matching policy trained on the full LIBERO-10 dataset.
Caption: Average success rate for each task. The pretrained policy is a flow-matching policy trained on the full LIBERO-10 dataset.
Caption: Average success rate across the 10 tasks in LIBERO-10, using approximately 3,000 online episodes in total across all tasks. The pretrained policy is π0 finetuned on a subset of the LIBERO-10 dataset using 30 demonstrations per task.
Caption: Average success rate for each task. The pretrained policy is π0 finetuned on a subset of the LIBERO-10 dataset using 30 demonstrations per task.
Caption: DSRL finetuning with varying numbers of demonstrations used for BC policy pretraining.
Caption: We evaluate DICE-RL with K = 4, 16, 32, 64, where K is the number of action samples used in Best-of-K action selection. On Tool Hang, K = 16 achieves the best performance. Increasing K further to 64 leads to a slight drop in performance, likely due to critic overestimation.
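For reference, Best-of-K selection amounts to scoring the K sampled action chunks with the learned critic and executing the highest-scoring one. The notation below (\(Q_\phi\) for the critic, \(\pi_\theta\) for the finetuned policy) is ours and only sketches the selection rule:
\[
a^{\star} \;=\; \arg\max_{k\in\{1,\dots,K\}} Q_\phi\big(s, a^{(k)}\big), \qquad a^{(k)} \sim \pi_\theta(\cdot \mid s),\; k=1,\dots,K.
\]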
Caption: DICE-RL finetuning with varying RLPD schedules. DICE-RL is not particularly sensitive to the ratio of offline data.
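For context, the RLPD-style schedule controls how each training batch mixes offline demonstrations with online replay data. The sketch below (buffer and function names are our own, not from the codebase) shows one common way to implement this mixed sampling; the canonical RLPD recipe corresponds to offline_ratio = 0.5.

```python
import numpy as np

def sample_mixed_batch(offline_buffer, online_buffer, batch_size=256, offline_ratio=0.5):
    # RLPD-style mixing: a fixed fraction of every batch is drawn from the offline
    # demonstrations, the remainder from the online replay buffer.
    # offline_ratio is the quantity varied in this ablation.
    n_offline = int(batch_size * offline_ratio)
    offline_idx = np.random.randint(len(offline_buffer), size=n_offline)
    online_idx = np.random.randint(len(online_buffer), size=batch_size - n_offline)
    return [offline_buffer[i] for i in offline_idx] + [online_buffer[i] for i in online_idx]
```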
Pretrained BC Policy (x10) 56.67%
Finetuned RL Policy (x10) 90%
Caption: Example task with automatic reward labeling. We use OpenCV-based color thresholding to detect whether the light bulb is lit, and assign a binary reward of 0 or 1 accordingly. Part of the video is blurred to preserve anonymity.
Caption: Timelapse of the gear insertion task (10× speed). The wall-clock time required to improve performance from 46.67% to 90% is 1.5 hours. For the light bulb insertion and gear assembly tasks, the corresponding wall-clock times are approximately 3 hours and 6 hours, respectively.
Pretrained BC Policy (x10) 46.67%
Finetuned RL Policy (x10) 90%
We evaluate DICE-RL on a challenging real-world Belt Assembly task from the NIST benchmark. The BC policy pretrained with 265 demos has three dominant failure modes. After 420 online episodes, DICE-RL reliably succeeds on this contact-rich task. We overlay a representative rollout with the running change in action entropy (\(\Delta H\)) and value improvement (\(\Delta V\)); the largest entropy drops and value gains occur around critical contact transitions.
DICE-RL improves the pretrained policy from 56.67% to 93.33% success rate over 30 runs. See our uncut evaluation video (10x).
Pretrained BC Policy (x10)
Finetuned RL Policy (x10)
We compare DICE-RL against prior methods, focusing on approaches that build on pretrained diffusion-based policies. We benchmark on the Can, Square, Transport, and Tool Hang tasks from the Robomimic benchmark, and report results for both state-based and pixel-based observations. For Can, the BC policies are trained on 20 demonstrations, while the other tasks use 50 demonstrations from the Proficient-Human (PH) dataset.
DICE-RL attains the highest final performance while also being more stable and sample efficient across all tasks, and it succeeds across all difficulty levels with a single training recipe.
Our finetuning objective combines critic value maximization with a BC-style residual penalty. As training progresses, this shifts probability mass away from low-value action samples and toward consistently high-value regions, yielding a sharper action distribution at visited states.
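One way to write this objective, with a residual head \(f_\theta\) added to the pretrained action \(a^{\text{pre}}(s)\), a learned critic \(Q_\phi\), and a BC weight \(\lambda\) (this notation and the exact form of the penalty are our illustration, not necessarily the implemented loss), is
\[
\mathcal{L}_{\text{actor}}(\theta)\;=\;\mathbb{E}_{s}\Big[-\,Q_\phi\big(s,\;a^{\text{pre}}(s)+f_\theta(s)\big)\;+\;\lambda\,\big\|f_\theta(s)\big\|_2^2\Big],
\]
so that increasing the critic's value estimate is traded off against staying close to the pretrained prior.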
Using states from the offline demonstrations as anchors, we sample actions from the finetuned policy and compute (i) value gain relative to the pretrained BC policy and (ii) the empirical entropy drop of the sampled actions. We find a clear coupling: states with larger value improvements exhibit larger entropy drops, suggesting that successful finetuning coincides with stronger distributional concentration.
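A minimal sketch of this per-state analysis, assuming access to action samplers for both policies and to the learned critic; the Gaussian (covariance-based) entropy estimator and all names below are illustrative choices, not necessarily those used in our experiments.

```python
import numpy as np

def gaussian_entropy(actions):
    # Covariance-based entropy estimate of sampled actions, shape (k, action_dim).
    d = actions.shape[1]
    cov = np.cov(actions, rowvar=False) + 1e-6 * np.eye(d)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

def value_gain_and_entropy_drop(anchor_states, sample_pre, sample_rl, critic, k=64):
    # For each anchor state: (i) value gain of the finetuned policy over the BC prior,
    # (ii) entropy drop of the sampled action distribution after finetuning.
    gains, drops = [], []
    for s in anchor_states:
        a_pre = sample_pre(s, k)   # (k, action_dim) samples from the pretrained policy
        a_rl = sample_rl(s, k)     # (k, action_dim) samples from the finetuned policy
        gains.append(critic(s, a_rl).mean() - critic(s, a_pre).mean())
        drops.append(gaussian_entropy(a_pre) - gaussian_entropy(a_rl))
    return np.array(gains), np.array(drops)
```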
We show a representative rollout trajectory of the RL policy on Tool Hang, together with the running change in action entropy (\(\Delta H\)) and value improvement (\(\Delta V\)). We zoom in on frames where value improvement spikes and action entropy drops; these states are often critical for task success (e.g., pre-insertion and insertion). In contrast, during free-space motions that are less consequential for success, we observe less reduction in action entropy.
Distribution sharpening is a state-local effect. Contraction, in contrast, is a trajectory-level property of the closed-loop dynamics induced by a policy: over a task-relevant region and in a chosen metric, trajectories initialized from nearby states move closer over time, reflecting reduced sensitivity to initial conditions. This notion is closely related to incremental stability and connects to robust-control "funnel" or trajectory-tube intuitions, where stability of a tube around nominal behavior implies improved robustness to perturbations.
To probe contraction empirically, we sample many pairs of nearby anchor states \((s_0,s'_0)\) from the offline demonstrations \(D_{\text{demo}}\). From each pair, we roll out (i) the finetuned RL policy for \(T\) steps to obtain \(\{s_t^{\text{RL}}\}_{t=0}^T\) and \(\{s_t^{\prime\,\text{RL}}\}_{t=0}^T\), (ii) the pretrained BC policy for \(T\) steps to obtain \(\{s_t^{\text{Pre}}\}_{t=0}^T\) and \(\{s_t^{\prime\,\text{Pre}}\}_{t=0}^T\), and (iii) the corresponding expert trajectories of length \(T\) starting from the same anchors, denoted \(\{s_t^{\text{E}}\}_{t=0}^T\) and \(\{s_t^{\prime\,\text{E}}\}_{t=0}^T\). We then measure the normalized pairwise divergence for each rollout type \(x\in\{\text{RL},\text{Pre},\text{E}\}\): \[ c^{x}(t)\;=\;\frac{\big\|s_t^{x}-s_t^{\prime\,x}\big\|_2^2}{\big\|s_0-s'_0\big\|_2^2} \] Rollouts under our RL policy exhibit a more stable (and typically smaller) evolution of \(c(t)\) than both the pretrained BC policy and the expert demonstration rollouts, indicating stronger contraction of the closed-loop behavior.
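The divergence curves can be computed directly from stored rollouts; the short sketch below (array shapes and function names are ours) implements the definition above for one rollout type and averages it over anchor pairs.

```python
import numpy as np

def normalized_divergence(traj_a, traj_b):
    # traj_a, traj_b: (T + 1, state_dim) rollouts started from a pair of nearby anchors.
    # Returns c(t) = ||s_t - s'_t||^2 / ||s_0 - s'_0||^2 for t = 0, ..., T.
    d0 = np.sum((traj_a[0] - traj_b[0]) ** 2)
    return np.sum((traj_a - traj_b) ** 2, axis=1) / d0

def mean_divergence(rollout_pairs):
    # Average c(t) over many anchor pairs for one rollout type (RL, Pre, or E).
    return np.mean([normalized_divergence(a, b) for a, b in rollout_pairs], axis=0)
```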
We overlay the rollouts from the finetuned RL policy with the demonstration trajectories. The RL policy contracts around critical, contact-rich states.