Posterior Optimization with Clipped Objective for Bridging Efficiency and Stability in Generative Policy Learning
Yuhui Chen 1,2, Haoran Li 1,2, Zhennan Jiang 1,2, Yuxing Qin 1,2, Yuxuan Wan 3, Weiheng Liu 1,2, Dongbin Zhao 1,2
2 School of Artificial Intelligence, University of Chinese Academy of Sciences
3 CFCS, School of Computer Science, Peking University
Abstract
Expressive generative models have advanced robotic manipulation by capturing complex, multi-modal action distributions over temporally extended trajectories. However, fine-tuning these policies via RL remains challenging due to instability and sample inefficiency. We introduce Posterior Optimization with Clipped Objective (POCO), a principled RL framework that formulates policy improvement as a posterior inference problem tailored for temporal action chunks. Through an Expectation-Maximization procedure, POCO distills a reward-weighted implicit posterior into the policy without likelihood estimation. Furthermore, POCO adopts an offline-to-online paradigm that anchors online exploration to pre-trained priors, and its model-agnostic design scales to fine-tune large VLA models without architectural modifications. Evaluations across 7 simulation benchmarks and 4 contact-rich real-world tasks demonstrate that POCO prevents catastrophic policy collapse, outperforms SOTA baselines, and achieves a 96.7% success rate on real-world tasks.
Overview
We present Posterior Optimization with Clipped Objective (POCO), a two-stage offline-to-online reinforcement learning framework designed to bridge efficiency and stability for generative policy fine-tuning in real-world environments. During the offline pre-training stage, POCO focuses on extracting a robust generative prior via supervised learning from pre-collected expert demonstrations, providing a safe and high-quality behavioral initialization. During online fine-tuning, POCO formulates policy improvement as a likelihood-free posterior inference problem. By integrating an implicit Expectation-Maximization procedure with a clipped surrogate objective, it achieves rapid, stable exploration and securely anchors updates to the pre-trained prior, seamlessly scaling to large-scale VLA models.
Method
When fine-tuning a pre-trained generative policy (e.g., Diffusion, Flow Matching, or VLA models) in real-world environments, the immediate instinct is to reach for standard RL algorithms. However, we quickly hit a brutal dilemma:

- Off-policy methods (e.g., SAC): highly sample-efficient but dangerously unstable. Backpropagating noisy Q-gradients from overestimated OOD states instantly shatters the pre-trained generative manifold, causing catastrophic policy collapse.
- On-policy methods (e.g., PPO): safe and stable thanks to strict trust regions, but their inability to reuse offline data makes them far too sample-inefficient for physical robots.
To achieve off-policy efficiency with on-policy safety, we therefore avoid “direct parameter optimization” via Q-gradients and shift our mathematical perspective to posterior inference.
The Offline-to-Online Paradigm
To safely deploy RL on physical robots, POCO operates under a two-stage offline-to-online training paradigm:

- Offline Pre-training: We first train the policy on a static dataset of expert demonstrations using standard supervised learning. This equips the robot with a robust initial behavioral prior.
- Online Fine-tuning: The robot then interacts with the real environment to explore and improve beyond the suboptimal expert data, with the replay buffer $\mathcal{D}$ initialized from the offline dataset. This is exactly where POCO steps in, guaranteeing stable, sample-efficient policy improvement instead of destroying the pre-trained prior.
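The two-stage paradigm above can be sketched as a simple training loop. This is an illustrative skeleton, not the authors' code: `policy`, `env`, and `update_fn` are hypothetical stand-ins, with `update_fn` playing the role of POCO's online update (Steps 1-4 below).

```python
def pretrain_bc(policy, demos, epochs=1):
    """Offline stage: supervised learning on expert (state, action-chunk) pairs."""
    for _ in range(epochs):
        for state, chunk in demos:
            policy.fit_step(state, chunk)  # standard BC / flow-matching step
    return policy

def offline_to_online(policy, demos, env, update_fn, online_steps=100):
    """Two-stage paradigm: BC pre-training, then online fine-tuning with the
    replay buffer D initialized from the offline dataset."""
    policy = pretrain_bc(policy, demos)
    buffer = list(demos)                   # D starts as the offline dataset
    state = env.reset()
    for _ in range(online_steps):
        chunk = policy.sample(state)       # generate an action chunk
        state, reward, done = env.step(chunk)
        buffer.append((state, chunk))
        update_fn(policy, buffer)          # the RL update (POCO's Steps 1-4)
        if done:
            state = env.reset()
    return policy
```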
Step 1
The Perspective Shift — RL as “Posterior Inference”
We define an “Optimality Variable” $\mathcal{O}=1$, indicating that a trajectory is desirable: reaching a goal, accumulating high reward, or satisfying task constraints. Under the Maximum-Entropy framework, the probability of a trajectory being optimal is exponentially proportional to its cumulative reward:
\[p(\mathcal{O}=1|\tau) \propto \exp\left(\sum_t \frac{r_t}{\eta}\right)\]

To make our policy $\pi_\theta$ produce optimal trajectories, we want to maximize the log-likelihood \(\log p_{\pi}(\mathcal{O}=1)\). We use \(\vec{\mathbf{a}}_t\) to denote the action chunk \(a_{t:t+T}\). By introducing an auxiliary variational distribution $q(\tau)$ and applying Jensen’s Inequality, we derive the Evidence Lower Bound (ELBO):
\[\mathcal{J}(q, \pi) = \mathbb{E}_q \left[ \sum_{t=0}^H \gamma^t [r_t - \eta D_{KL}(q(\vec{\mathbf{a}}_t|s_t) || \pi(\vec{\mathbf{a}}_t|s_t, \theta))] \right]+\log p(\theta)\]

This justifies a classic Expectation-Maximization (E-M) procedure: find the optimal $q$ (E-step), then update $\theta$ towards it (M-step).
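For completeness, the Jensen step behind this bound can be spelled out as a standard control-as-inference manipulation (with $q(\tau)$ any variational distribution over trajectories):

```latex
\log p_\pi(\mathcal{O}=1)
  = \log \mathbb{E}_{q(\tau)}\!\left[\frac{p(\mathcal{O}=1\mid\tau)\,p_\pi(\tau)}{q(\tau)}\right]
  \ge \mathbb{E}_{q(\tau)}\!\left[\log p(\mathcal{O}=1\mid\tau)\right]
     - D_{KL}\big(q(\tau)\,\|\,p_\pi(\tau)\big)
```

Substituting $p(\mathcal{O}=1|\tau) \propto \exp(\sum_t r_t/\eta)$ recovers the reward-minus-KL structure of the ELBO above.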
In the E-step, solving for the optimal $q$ by taking the derivative of the ELBO yields a closed-form proportional solution:
\[q_i(\vec{\mathbf{a}}_t|s_t) \propto \pi(\vec{\mathbf{a}}_t|s_t, \theta_i) \exp\left(\frac{Q_{\pi_i}(s_t, \vec{\mathbf{a}}_t)}{\eta}\right)\]

Given the posterior \(q_i(\vec{\mathbf{a}}_t|s_t)\), the M-step updates the parametric policy $\pi_\theta$ to match this target distribution. We instantiate the Bayesian prior $\log p(\theta)$ as a trust-region constraint to ensure stability. Solving this constrained optimization via Lagrange multipliers yields the following weighted objective:
\[\mathcal{J}_{\text{M-step}}(\theta)=\mathbb{E}_{s_t\sim\mu(s_t)}[\mathbb{E}_{\vec{\mathbf{a}}_t\sim q_i(\vec{\mathbf{a}}_t|s_t)}[\log\pi(\vec{\mathbf{a}}_t|s_t,\theta)]-\eta D_{\text{KL}}(\pi(\cdot|s_t,\theta_i)||\pi(\cdot|s_t,\theta))]\]

Step 2
POCO’s Implicit E-step — Parameter-Free Action Audition
However, when we apply this E-M framework to expressive generative models, we hit a mathematical challenge: the likelihood problem. To execute the standard E-M projection, we have to explicitly evaluate the log-likelihood $\log \pi_\theta(\vec{\mathbf{a}}_t|s_t)$ of the policy. Generating an action in models like Diffusion or Flow Matching involves simulating an Ordinary Differential Equation (ODE) or a multi-step denoising process, so their explicit probability density is intractable or computationally prohibitive to evaluate in real time. If we cannot calculate the likelihood, the inference framework breaks down. This is where POCO comes in.
POCO bypasses the need for an analytical formula entirely. We approximate $q$ using Monte Carlo Sampling. For a given state, POCO prompts the current policy to generate $N$ candidate action chunks. The Critic scores these actions via Q-values, and we assign a normalized importance weight to each:
\[\bar{w}_j = \frac{\exp(Q(s_t, \vec{\mathbf{a}}_t^j)/\eta)}{\sum_{k=1}^N \exp(Q(s_t, \vec{\mathbf{a}}_t^k)/\eta)}\]

We have thus sidestepped the likelihood trap: using just a set of weighted sample particles \(\{\vec{\mathbf{a}}_t^j, \bar{w}_j\}\), we implicitly construct our high-reward posterior distribution.
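The weight computation is just a temperature-scaled softmax over critic scores. A minimal NumPy sketch (the function name and the stabilization trick are our additions; `q_values` stands in for the Critic's scores of the $N$ candidate chunks):

```python
import numpy as np

def posterior_weights(q_values, eta=1.0):
    """Normalized importance weights w_j = softmax(Q(s, a_j) / eta).

    A smaller temperature eta sharpens the weights toward the highest-Q
    candidate; a larger eta flattens them toward uniform.
    """
    z = np.asarray(q_values, dtype=np.float64) / eta
    z -= z.max()              # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

# N = 4 candidate action chunks scored by the critic
w = posterior_weights([1.0, 2.0, 0.5, 2.0], eta=0.5)
```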
Step 3
POCO’s Likelihood-Free M-step
Next, we update the policy $\theta$ to fit this high-value distribution $q_i$. Because $\log \pi_\theta$ is intractable, POCO leverages a variational bound from generative modeling theory: the standard supervised training loss $\mathcal{L}_{BC}$ (e.g., the vector-field matching loss for flow-matching policies) used to pre-train the generative model is mathematically a variational upper bound on the negative log-likelihood:
\[\mathcal{L}_{BC, \theta}(\vec{\mathbf{a}}_t|s_t) \approx -\log \pi_\theta(\vec{\mathbf{a}}_t|s_t) + C\]

Furthermore, the M-step inherently requires a trust-region constraint (the KL divergence in the ELBO) to anchor the update. POCO approximates this reference distribution using the replay buffer $\mathcal{D}$. By replacing the intractable likelihood with the native supervised loss, the M-step transforms into a tractable objective coupled with a BC regularization term:
\[\mathcal{J}_{\text{M-step}}(\theta) \approx \mathbb{E}_{(s_t, \vec{\mathbf{a}}_t) \sim \mathcal{D},\ \{\vec{\mathbf{a}}_t^j\}_{j=1}^N \sim \pi(\cdot|s_t,\theta_i)} \left[ \beta \sum_{j=1}^N \bar{w}_j \mathcal{L}_{BC, \theta}(\vec{\mathbf{a}}_t^j, s_t) + \mathcal{L}_{BC, \theta}(\vec{\mathbf{a}}_t|s_t) \right]\]

Step 4
The Safety Belt — The Clipped Objective
During early exploration, the Critic will inevitably hallucinate, assigning a massive weight $\bar{w}_j$ to a highly sub-optimal OOD action. If the M-step tries to fit this heavily weighted outlier, the supervised loss \(\mathcal{L}_{BC}\) will explode, destroying the offline pre-trained prior.
POCO solves this with the Clipped Surrogate Objective. We introduce a safety threshold $\zeta$ to mathematically bound the maximum geometric deformation allowed per update on the generated actions, while retaining the baseline BC anchor:
\[\mathcal{J}_{POCO}(\theta) \approx \mathbb{E}_{(s_t, \vec{\mathbf{a}}_t) \sim \mathcal{D},\ \{\vec{\mathbf{a}}_t^j\}_{j=1}^N \sim \pi(\cdot|s_t,\theta_i)} \left[ \beta \sum_{j=1}^N \bar{w}_j\, \text{clip}\left(\mathcal{L}_{BC, \theta}(\vec{\mathbf{a}}_t^j, s_t), 0, \zeta\right) + \mathcal{L}_{BC, \theta}(\vec{\mathbf{a}}_t|s_t) \right]\]

By fundamentally restructuring RL into likelihood-free posterior inference and wrapping it in a clipped objective, POCO successfully fine-tunes generative models, allowing them to learn autonomously and safely in the physical world.
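Putting Steps 3 and 4 together, the per-update loss can be sketched numerically. This is a hypothetical stand-in, not the authors' implementation: it assumes the per-candidate BC losses and posterior weights have already been computed, and uses scalar inputs in place of gradient-carrying tensors.

```python
import numpy as np

def poco_loss(candidate_bc_losses, weights, anchor_bc_loss, beta=1.0, zeta=1.0):
    """Clipped surrogate objective (sketch).

    candidate_bc_losses: L_BC of the policy on each of the N sampled chunks a^j.
    weights:             normalized posterior weights w_j from the implicit E-step.
    anchor_bc_loss:      L_BC on the (s_t, a_t) pair drawn from the replay buffer D.

    Each weighted candidate loss is clipped to [0, zeta], so a hallucinated
    critic weight on an OOD action cannot blow up the update, while the
    unclipped anchor term keeps the policy tied to the pre-trained prior.
    """
    losses = np.asarray(candidate_bc_losses, dtype=np.float64)
    w = np.asarray(weights, dtype=np.float64)
    weighted = float(np.sum(w * np.clip(losses, 0.0, zeta)))
    return beta * weighted + anchor_bc_loss
```

Note how an outlier candidate with a huge BC loss contributes at most $\bar{w}_j \cdot \zeta$, bounding the deformation any single hallucinated weight can cause.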
Experiments
Simulation Experiments
Pure Online Learning Curves
Offline-to-online Learning Curves
Real-world Experiments
Learning Curves
Policy Rollouts
Contact
If you have any questions, please feel free to contact Yuhui Chen.