Posterior Optimization with Clipped Objective for Bridging Efficiency and Stability in Generative Policy Learning

Yuhui Chen 1,2 , Haoran Li 1,2 , Zhennan Jiang 1,2 , Yuxing Qin 1,2 , Yuxuan Wan 3 , Weiheng Liu 1,2 , Dongbin Zhao 1,2
1 SKL-MAIS, Institute of Automation, Chinese Academy of Sciences
2 School of Artificial Intelligence, University of Chinese Academy of Sciences
3 CFCS, School of Computer Science, Peking University

Abstract

Expressive generative models have advanced robotic manipulation by capturing complex, multi-modal action distributions over temporally extended trajectories. However, fine-tuning these policies via RL remains challenging due to instability and sample inefficiency. We introduce Posterior Optimization with Clipped Objective (POCO), a principled RL framework that formulates policy improvement as a posterior inference problem tailored for temporal action chunks. Through an Expectation-Maximization procedure, POCO distills a reward-weighted implicit posterior into the policy without likelihood estimation. Furthermore, POCO adopts an offline-to-online paradigm that anchors online exploration to pre-trained priors, and its model-agnostic design scales to fine-tune large VLA models without architectural modifications. Evaluations across 7 simulation benchmarks and 4 contact-rich real-world tasks demonstrate that POCO prevents catastrophic policy collapse, outperforms SOTA baselines, and achieves a 96.7% success rate on real-world tasks.

Overview

We present Posterior Optimization with Clipped Objective (POCO), a two-stage offline-to-online reinforcement learning framework designed to bridge efficiency and stability for generative policy fine-tuning in real-world environments. During the offline pre-training stage, POCO focuses on extracting a robust generative prior via supervised learning from pre-collected expert demonstrations, providing a safe and high-quality behavioral initialization. During online fine-tuning, POCO formulates policy improvement as a likelihood-free posterior inference problem. By integrating an implicit Expectation-Maximization procedure with a clipped surrogate objective, it achieves rapid, stable exploration and securely anchors updates to the pre-trained prior, seamlessly scaling to large-scale VLA models.

Method

When fine-tuning a pre-trained generative policy (e.g., a Diffusion, Flow Matching, or VLA model) in real-world environments, the immediate instinct is to reach for standard RL algorithms. However, we quickly hit a dilemma:

  1. Off-policy methods (e.g., SAC): Highly sample-efficient but unstable. Backpropagating noisy Q-gradients from overestimated out-of-distribution (OOD) states shatters the pre-trained generative manifold, causing catastrophic policy collapse.

  2. On-policy methods (e.g., PPO): Safe and stable via strict trust regions, but their inability to reuse offline data makes them far too sample-inefficient for physical robots.

To achieve off-policy efficiency with on-policy safety, we therefore avoid “direct parameter optimization” via Q-gradients and shift our mathematical perspective to posterior inference.

The Offline-to-Online Paradigm

To safely deploy RL on physical robots, POCO operates under a two-stage offline-to-online training paradigm:

  1. Offline Pre-training: We first train the policy on a static dataset of expert demonstrations using standard supervised learning. This equips the robot with a robust initial behavioral prior.

  2. Online Fine-tuning: The robot then interacts with the real environment to explore and improve beyond the suboptimal expert data, with the replay buffer $\mathcal{D}$ initialized from the offline dataset. This is exactly where POCO steps in, guaranteeing stable and sample-efficient policy improvement rather than destroying the pre-trained prior.
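The two-stage loop above can be sketched in a few lines. Everything here (the buffer class, the update hooks, the environment stub) is an illustrative placeholder, not the authors' implementation:

```python
import random

class ReplayBuffer:
    """Toy replay buffer; seeded with the offline dataset before fine-tuning."""
    def __init__(self, initial=()):
        self.data = list(initial)

    def add(self, transition):
        self.data.append(transition)

    def sample(self, n):
        return random.sample(self.data, min(n, len(self.data)))

def offline_pretrain(update_fn, demos, epochs=1):
    # Stage 1: supervised learning on expert demonstrations.
    for _ in range(epochs):
        for s, a in demos:
            update_fn(s, a)

def online_finetune(update_fn, env_step, buffer, steps=5, batch=2):
    # Stage 2: explore, store new transitions, update from the mixed buffer
    # that contains both offline demos and fresh online experience.
    for _ in range(steps):
        buffer.add(env_step())
        for s, a in buffer.sample(batch):
            update_fn(s, a)
```

Seeding the buffer with the offline dataset is what lets online updates keep reusing the expert data instead of discarding it, which is the source of the off-policy sample efficiency described above.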

Step 1

The Perspective Shift — RL as “Posterior Inference”

We define an "Optimality Variable" $\mathcal{O}=1$, indicating that a trajectory is desirable, e.g., it reaches a goal, accumulates high reward, or satisfies task constraints. Under the Maximum-Entropy framework, the probability of a trajectory being optimal is exponential in its cumulative reward:

\[p(\mathcal{O}=1|\tau) \propto \exp\left(\sum_t \frac{r_t}{\eta}\right)\]

To make our policy $\pi_\theta$ produce optimal trajectories, we want to maximize the log-likelihood \(\log p_{\pi}(\mathcal{O}=1)\). We use \(\vec{\mathbf{a}}_t\) to denote the action chunk \(a_{t:t+T}\). By introducing an auxiliary variational distribution $q(\tau)$ and applying Jensen’s Inequality, we derive the Evidence Lower Bound (ELBO):

\[\mathcal{J}(q, \pi) = \mathbb{E}_q \left[ \sum_{t=0}^H \gamma^t [r_t - \eta D_{KL}(q(\vec{\mathbf{a}}_t|s_t) || \pi(\vec{\mathbf{a}}_t|s_t, \theta))] \right]+\log p(\theta)\]

This justifies a classic Expectation-Maximization (E-M) procedure: find the optimal $q$ (E-step), then update $\theta$ towards it (M-step).

In the E-step, solving for the optimal $q$ by taking the derivative of the ELBO yields a closed-form proportional solution:

\[q_i(\vec{\mathbf{a}}_t|s_t) \propto \pi(\vec{\mathbf{a}}_t|s_t, \theta_i) \exp\left(\frac{Q_{\pi_i}(s_t, \vec{\mathbf{a}}_t)}{\eta}\right)\]
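This closed form follows from a standard variational argument: taking the stationarity condition of the per-step ELBO with respect to $q$, with a Lagrange multiplier $\lambda$ enforcing normalization, gives

\[\frac{\partial}{\partial q}\left[\mathbb{E}_q[Q_{\pi_i}(s_t, \vec{\mathbf{a}}_t)] - \eta D_{KL}(q || \pi) + \lambda\left(1 - \int q \, d\vec{\mathbf{a}}_t\right)\right] = 0 \;\Rightarrow\; \log q = \log \pi(\vec{\mathbf{a}}_t|s_t, \theta_i) + \frac{Q_{\pi_i}(s_t, \vec{\mathbf{a}}_t)}{\eta} + \text{const},\]

and exponentiating recovers the Boltzmann re-weighting of the prior policy shown above.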

Given the posterior \(q_i(\vec{\mathbf{a}}_t|s_t)\), the M-step updates the parametric policy $\pi_\theta$ to match this target distribution. We instantiate the Bayesian prior $\log p(\theta)$ as a trust-region constraint to ensure stability. Solving this constrained optimization via Lagrange multipliers yields the following weighted objective:

\[\mathcal{J}_{\text{M-step}}(\theta)=\mathbb{E}_{s_t\sim\mu(s_t)}[\mathbb{E}_{a\sim q_i(\vec{\mathbf{a}}_t|s_t)}[\log\pi(\vec{\mathbf{a}}_t|s_t,\theta)]-\eta D_{\text{KL}}(\pi(\cdot|s_t,\theta_i)||\pi(\cdot|s_t,\theta))]\]

Step 2

POCO’s Implicit E-step — Parameter-Free Action Audition

However, when we apply this E-M framework to expressive generative models, we run into a mathematical obstacle: the likelihood problem. To execute the standard E-M projection, we must explicitly evaluate the policy's log-likelihood $\log \pi_\theta(\vec{\mathbf{a}}_t|s_t)$. Generating an action in models like Diffusion or Flow Matching involves simulating an Ordinary Differential Equation (ODE) or a multi-step denoising process, so their explicit probability density is intractable or computationally prohibitive to evaluate in real time. If we cannot compute the likelihood, the inference framework breaks down. This is where POCO comes in.

POCO bypasses the need for an analytical formula entirely. We approximate $q$ using Monte Carlo Sampling. For a given state, POCO prompts the current policy to generate $N$ candidate action chunks. The Critic scores these actions via Q-values, and we assign a normalized importance weight to each:

\[\bar{w}_j = \frac{\exp(Q(s_t, \vec{\mathbf{a}}_t^j)/\eta)}{\sum_{k=1}^N \exp(Q(s_t, \vec{\mathbf{a}}_t^k)/\eta)}\]
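This normalized weighting is a temperature-scaled softmax over critic scores. A minimal sketch (the function name and the use of NumPy are our own, not from the paper):

```python
import numpy as np

def posterior_weights(q_values, eta):
    """Normalized importance weights over N candidate action chunks.

    Computes softmax(Q / eta) with a max-shift for numerical stability.
    q_values: array of shape (N,), critic scores Q(s_t, a_t^j).
    eta: temperature; smaller eta concentrates weight on the best candidate.
    """
    z = np.asarray(q_values, dtype=np.float64) / eta
    z -= z.max()                # shift so exp() cannot overflow
    w = np.exp(z)
    return w / w.sum()
```

Lowering $\eta$ makes the implicit posterior greedier toward the highest-Q candidate, while a large $\eta$ keeps it close to the prior's uniform treatment of samples.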

This sidesteps the likelihood trap entirely: using only the set of weighted sample particles \(\{\vec{\mathbf{a}}_t^j, \bar{w}_j\}\), we implicitly construct the high-reward posterior distribution.

Step 3

POCO’s Likelihood-Free M-step

Next, we update the policy parameters $\theta$ to fit this high-value distribution $q_i$. Because $\log \pi_\theta$ is intractable, POCO exploits a key variational bound from generative modeling theory: the standard supervised training loss $\mathcal{L}_{BC}$ (e.g., the vector-field matching loss of a flow matching policy) used to pre-train the generative model is a variational upper bound on the negative log-likelihood:

\[\mathcal{L}_{BC, \theta}(\vec{\mathbf{a}}_t|s_t) \approx -\log \pi_\theta(\vec{\mathbf{a}}_t|s_t) + C\]

Furthermore, the M-step inherently requires a trust-region constraint (the KL divergence in the ELBO) to anchor the update. POCO approximates this reference distribution using the replay buffer $\mathcal{D}$. By replacing the intractable likelihood with the native supervised loss, the M-step transforms into a tractable objective coupled with a BC regularization term:

\[\mathcal{J}_{\text{M-step}}(\theta) \approx \mathbb{E}_{(s_t, \vec{\mathbf{a}}_t) \sim \mathcal{D},\, \{\vec{\mathbf{a}}_t^j\}_{j=1}^N \sim \pi(\cdot|s_t,\theta_i)} \left[ \beta \sum_{j=1}^N \bar{w}_j \mathcal{L}_{BC, \theta}(\vec{\mathbf{a}}_t^j, s_t) + \mathcal{L}_{BC, \theta}(\vec{\mathbf{a}}_t|s_t) \right]\]
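To make the structure of this objective concrete, here is a minimal numerical sketch in which a plain squared error stands in for the policy's native supervised loss $\mathcal{L}_{BC}$ (for a real diffusion or flow matching policy this would be the denoising or vector-field matching loss); all names are illustrative:

```python
import numpy as np

def m_step_loss(policy_mean, a_buffer, candidates, weights, beta):
    """Likelihood-free M-step objective (unclipped).

    policy_mean : the policy's output for the current state; a squared
                  error to it is our stand-in for L_BC(a | s).
    a_buffer    : the action stored in the replay buffer (BC anchor term).
    candidates  : N action chunks sampled from the current policy.
    weights     : normalized posterior weights w_j for the candidates.
    beta        : trade-off between posterior matching and the BC anchor.
    """
    def bc_loss(a):                                 # surrogate L_BC(a | s)
        return float(np.sum((np.asarray(a) - policy_mean) ** 2))
    weighted = beta * sum(w * bc_loss(a_j)
                          for a_j, w in zip(candidates, weights))
    anchor = bc_loss(a_buffer)                      # BC regularization term
    return weighted + anchor
```

The first term pulls the policy toward high-Q candidates in proportion to their posterior weight; the second keeps it anchored to the buffer data that approximates the trust-region reference.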

Step 4

The Safety Belt — The Clipped Objective

During early exploration, the Critic will inevitably hallucinate, assigning massive weights $\bar{w}_j$ to highly sub-optimal, OOD actions. If the M-step tries to fit such a heavily weighted outlier, the supervised loss \(\mathcal{L}_{BC}\) explodes and destroys the offline pre-trained prior.

POCO solves this with the Clipped Surrogate Objective. We introduce a safety threshold $\zeta$ to mathematically bound the maximum geometric deformation allowed per update on the generated actions, while retaining the baseline BC anchor:

\[\mathcal{J}_{POCO}(\theta) \approx \mathbb{E}_{(s_t, \vec{\mathbf{a}}_t) \sim \mathcal{D},\, \{\vec{\mathbf{a}}_t^j\}_{j=1}^N \sim \pi(\cdot|s_t,\theta_i)} \left[ \beta \sum_{j=1}^N \bar{w}_j \, \text{clip}\left(\mathcal{L}_{BC, \theta}(\vec{\mathbf{a}}_t^j, s_t), 0, \zeta\right) + \mathcal{L}_{BC, \theta}(\vec{\mathbf{a}}_t|s_t) \right]\]
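The clipping itself is a one-line change to the weighted term: each per-candidate loss is bounded by $\zeta$ before weighting, so even a candidate that receives nearly all of the weight can contribute at most $\beta\zeta$. A sketch (names are ours, not from the paper):

```python
import numpy as np

def clipped_weighted_term(candidate_losses, weights, beta, zeta):
    """Clipped surrogate term of the POCO objective.

    Each per-candidate supervised loss is clipped to [0, zeta] before
    weighting, bounding how far a single high-weight outlier (e.g. a
    Critic hallucination) can pull the policy in one update.
    """
    clipped = np.clip(np.asarray(candidate_losses, dtype=np.float64),
                      0.0, zeta)
    return beta * float(np.dot(weights, clipped))
```

Without the clip, a hallucinated candidate with loss 1000 and weight 0.99 would dominate the gradient; with $\zeta=5$ its contribution is capped at $0.99 \times 5$.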

By fundamentally restructuring RL into likelihood-free posterior inference and wrapping it in a clipped objective, POCO successfully fine-tunes generative models, allowing them to learn autonomously and safely in the physical world.

Experiments

Simulation Experiments

Pure Online Learning Curves

Offline-to-online Learning Curves

Real-world Experiments

Learning Curves

Policy Rollouts

Pick Cube
Route Cable
Insert USB
Assemble SSD

Contact

If you have any questions, please feel free to contact Yuhui Chen.