Posterior Optimization with Clipped Objective for Bridging Efficiency and Stability in Generative Policy Learning
Yuhui Chen 1,2, Haoran Li 1,2, Zhennan Jiang 1,2, Yuxing Qin 1,2, Yuxuan Wan 3, Weiheng Liu 1,2, Dongbin Zhao 1,2
2 School of Artificial Intelligence, University of Chinese Academy of Sciences
3 CFCS, School of Computer Science, Peking University
Abstract
Expressive generative models have advanced robotic manipulation by capturing complex, multi-modal action distributions over temporally extended trajectories. However, fine-tuning these policies via RL remains challenging due to instability and sample inefficiency. We introduce Posterior Optimization with Clipped Objective (POCO), a principled RL framework that formulates policy improvement as a posterior inference problem tailored for temporal action chunks. Through an Expectation-Maximization procedure, POCO distills a reward-weighted implicit posterior into the policy without likelihood estimation. Furthermore, POCO adopts an offline-to-online paradigm that anchors online exploration to pre-trained priors, and its model-agnostic design scales to fine-tune large VLA models without architectural modifications. Evaluations across 7 simulation benchmarks and 4 contact-rich real-world tasks demonstrate that POCO prevents catastrophic policy collapse, outperforms SOTA baselines, and achieves a 96.7% success rate on real-world tasks.
Overview
We present Posterior Optimization with Clipped Objective (POCO), a two-stage offline-to-online reinforcement learning framework designed to bridge efficiency and stability for generative policy fine-tuning in real-world environments. During the offline pre-training stage, POCO focuses on extracting a robust generative prior via supervised learning from pre-collected expert demonstrations, providing a safe and high-quality behavioral initialization. During online fine-tuning, POCO formulates policy improvement as a likelihood-free posterior inference problem. By integrating an implicit Expectation-Maximization procedure with a clipped surrogate objective, it achieves rapid, stable exploration and securely anchors updates to the pre-trained prior, seamlessly scaling to large-scale VLA models.
Method
When trying to fine-tune a pre-trained generative policy (like Diffusion, Flow Matching or VLAs) in real-world environments, the immediate instinct is to use standard RL algorithms. However, we quickly hit a brutal dilemma:
- Off-policy methods (e.g., SAC): Highly sample-efficient but violently unstable. Backpropagating noisy Q-gradients from overestimated OOD states instantly shatters the pre-trained generative manifold, causing catastrophic policy collapse.
- On-policy methods (e.g., PPO): Safe and stable via strict trust regions, but their inability to reuse offline data makes them far too sample-inefficient for physical robots.
To achieve off-policy efficiency with on-policy safety, we therefore avoid “direct parameter optimization” via Q-gradients and shift our mathematical perspective to posterior inference.
The Offline-to-Online Paradigm
To safely deploy RL on physical robots, POCO operates under a two-stage offline-to-online training paradigm:
- Offline Pre-training: We first train the policy on a static dataset of expert demonstrations using standard supervised learning. This equips the robot with a robust initial behavioral prior.
- Online Fine-tuning: The robot then interacts with the real environment to explore and improve beyond the suboptimal expert data, with the replay buffer $\mathcal{D}$ initialized with the offline dataset. This is exactly where POCO steps in: it targets stable, sample-efficient policy improvement without destroying the pre-trained prior.
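As a rough illustration of the buffer initialization described above, the sketch below seeds a replay buffer with offline transitions before any online data arrives. The `ReplayBuffer` class, the transition layout, and the placeholder data are all hypothetical and not taken from the POCO codebase; in particular, the FIFO eviction used here is an assumption, since the text does not say how offline data is retained once the buffer fills.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical FIFO buffer of (state, action_chunk, reward, next_state, done)."""

    def __init__(self, capacity):
        # Assumption: oldest transitions are evicted first once capacity is reached.
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement over offline and online data alike.
        k = min(batch_size, len(self.storage))
        return rng.sample(list(self.storage), k)

    def __len__(self):
        return len(self.storage)


# Placeholder expert demonstrations (made-up values, for illustration only).
offline_dataset = [
    ("s0", ["a0", "a1"], 0.0, "s1", False),
    ("s1", ["a2", "a3"], 0.0, "s2", False),
    ("s2", ["a4", "a5"], 1.0, "s3", True),
]

# Stage 2 starts with the buffer already holding the offline data.
buffer = ReplayBuffer(capacity=1000)
for transition in offline_dataset:
    buffer.add(transition)

# Online interaction then appends new transitions to the same buffer.
buffer.add(("s0", ["b0", "b1"], 0.0, "s4", False))
batch = buffer.sample(batch_size=2)
```

Because both sources share one buffer, every sampled batch can mix expert and online transitions, which is one simple way to keep the expert data in play during fine-tuning.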
Step 1
The Perspective Shift — RL as “Posterior Inference”
We define an “Optimality Variable” $\mathcal{O}=1$, indicating that a trajectory is optimal, for example because it reaches a goal, accumulates high reward, or satisfies task constraints. Under the Maximum-Entropy framework, the probability of a trajectory being optimal is proportional to the exponential of its cumulative reward:
\[p(\mathcal{O}=1|\tau) \propto \exp\left(\sum_t \frac{r_t}{\eta}\right)\]
To make our policy $\pi_\theta$ produce optimal trajectories, we want to maximize the log-likelihood \(\log p_{\pi}(\mathcal{O}=1)\). We use \(\vec{\mathbf{a}}_t\) to denote the action chunk \(a_{t:t+T}\). By introducing an auxiliary variational distribution $q(\tau)$ and applying Jensen’s Inequality, we derive the Evidence Lower Bound (ELBO):
\[\mathcal{J}(q, \pi) = \mathbb{E}_q \left[ \sum_{t=0}^H \gamma^t [r_t - \eta D_{KL}(q(\vec{\mathbf{a}}_t|s_t) || \pi(\vec{\mathbf{a}}_t|s_t, \theta))] \right]+\log p(\theta)\]
This justifies a classic Expectation-Maximization (E-M) procedure: find the optimal $q$ (E-step), then update $\theta$ towards it (M-step).
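For readers who want the intermediate step, the bound can be sketched at the trajectory level as follows. This is the standard control-as-inference argument and not necessarily the exact derivation used in the paper; $p_\pi(\tau)$, the trajectory distribution induced by $\pi_\theta$, is notation introduced here, and the discounting, the per-step decomposition, and the prior term $\log p(\theta)$ are left out.

\[\begin{aligned}
\log p_\pi(\mathcal{O}=1) &= \log \int p_\pi(\tau)\, p(\mathcal{O}=1|\tau)\, d\tau \\
&= \log \mathbb{E}_{q(\tau)}\left[\frac{p_\pi(\tau)\, p(\mathcal{O}=1|\tau)}{q(\tau)}\right] \\
&\geq \mathbb{E}_{q(\tau)}\left[\log p(\mathcal{O}=1|\tau)\right] - D_{KL}(q(\tau)\,||\,p_\pi(\tau)) \\
&= \mathbb{E}_{q(\tau)}\left[\sum_t \frac{r_t}{\eta}\right] - D_{KL}(q(\tau)\,||\,p_\pi(\tau)) + \text{const}
\end{aligned}\]

Multiplying by $\eta > 0$ gives the reward-minus-$\eta$-KL form. If $q$ and $\pi$ share the same environment dynamics, the trajectory-level KL reduces to an expected sum of per-step policy KL terms, which matches the structure of $\mathcal{J}(q, \pi)$.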
In the E-step, solving for the optimal $q$ by taking the derivative of the ELBO yields a closed-form solution, up to normalization:
\[q_i(\vec{\mathbf{a}}_t|s_t) \propto \pi(\vec{\mathbf{a}}_t|s_t, \theta_i) \exp\left(\frac{Q_{\pi_i}(s_t, \vec{\mathbf{a}}_t)}{\eta}\right)\]
Given the posterior \(q_i(\vec{\mathbf{a}}_t|s_t)\), the M-step updates the parametric policy $\pi_\theta$ to match this target distribution. We instantiate the Bayesian prior $\log p(\theta)$ as a trust-region constraint to ensure stability. Solving this constrained optimization via Lagrange multipliers yields the following weighted objective:
\[\mathcal{J}_{\text{M-step}}(\theta)=\mathbb{E}_{s_t\sim\mu(s_t)}[\mathbb{E}_{\vec{\mathbf{a}}_t\sim q_i(\cdot|s_t)}[\log\pi(\vec{\mathbf{a}}_t|s_t,\theta)]-\eta D_{\text{KL}}(\pi(\cdot|s_t,\theta_i)||\pi(\cdot|s_t,\theta))]\]
Step 2
POCO’s Implicit E-step — Parameter-Free Action Audition
However, when we apply this E-M framework to expressive generative models, we hit a mathematical obstacle: the Likelihood Problem. To execute the standard E-M projection, we have to explicitly evaluate the log-likelihood $\log \pi_\theta(\vec{\mathbf{a}}_t|s_t)$ of the policy. Generating an action in models like Diffusion or Flow Matching involves simulating an Ordinary Differential Equation (ODE) or a multi-step denoising process, so their explicit probability density is intractable or computationally prohibitive to calculate in real time. If we cannot calculate the likelihood, the inference framework breaks down. This is where POCO comes in.
POCO bypasses the need for an analytical formula entirely. We approximate $q$ using Monte Carlo Sampling. For a given state, POCO prompts the current policy to generate $N$ candidate action chunks. The Critic scores these actions via Q-values, and we assign a normalized importance weight to each:
\[\bar{w}_j = \frac{\exp(Q(s_t, \vec{\mathbf{a}}_t^j)/\eta)}{\sum_{k=1}^N \exp(Q(s_t, \vec{\mathbf{a}}_t^k)/\eta)}\]
With this, we have sidestepped the likelihood trap. Using just a set of weighted sample particles $\{\vec{\mathbf{a}}_t^j, \bar{w}_j\}_{j=1}^N$, we implicitly construct our high-reward posterior distribution.
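The weighting rule above is a softmax over critic scores with temperature $\eta$. A minimal sketch in plain Python follows, assuming the $N$ Q-values for one state have already been computed; the function name `posterior_weights` and the example scores are made up for illustration and are not taken from the POCO implementation.

```python
import math


def posterior_weights(q_values, eta):
    """Return w_j = exp(Q_j / eta) / sum_k exp(Q_k / eta) for one state."""
    if eta <= 0:
        raise ValueError("eta must be positive")
    if not q_values:
        raise ValueError("need at least one candidate action chunk")
    scaled = [q / eta for q in q_values]
    # Subtracting the maximum avoids overflow in exp(); the shift cancels
    # in the ratio, so the weights are unchanged.
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical critic scores for N = 4 candidate action chunks.
q_values = [1.0, 2.0, 0.5, 2.0]
weights = posterior_weights(q_values, eta=0.5)

for q, w in zip(q_values, weights):
    print(f"Q = {q:.1f} -> weight = {w:.3f}")
```

A smaller $\eta$ concentrates the weight on the highest-scoring candidates, while a larger $\eta$ flattens the weights toward uniform, so $\eta$ controls how aggressively the particles are re-weighted.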
Step 3
POCO’s Likelihood-Free M-step
Next, we update the policy $\theta$ to fit this high-value distribution $q_i$. Because $\log \pi_\theta$ is intractable, POCO relies on a variational mapping from generative modeling theory: the standard supervised training loss $\mathcal{L}_{BC}$ (e.g., the vector-field matching loss for a flow matching policy) used to pre-train the generative model is, up to a constant, a variational upper bound on the negative log-likelihood, and we use it as a surrogate:
\[\mathcal{L}_{BC, \theta}(\vec{\mathbf{a}}_t|s_t) \approx -\log \pi_\theta(\vec{\mathbf{a}}_t|s_t) + C\]
Furthermore, the M-step inherently requires a trust-region constraint (the KL term in $\mathcal{J}_{\text{M-step}}$) to anchor the update. POCO approximates the reference distribution in this constraint using the replay buffer $\mathcal{D}$. By replacing the intractable likelihood with the native supervised loss, the M-step transforms into a tractable objective, now minimized because $\mathcal{L}_{BC}$ stands in for the negative log-likelihood, coupled with a BC regularization term:
\[\mathcal{J}_{\text{M-step}}(\theta) \approx \mathbb{E}_{(s_t, \vec{\mathbf{a}}_t) \sim \mathcal{D}, \{\vec{\mathbf{a}}_t^j\}_{j=1}^N \sim \pi(\cdot|s_t,\theta_i)} \left[ \beta \sum_{j=1}^N \bar{w}_j \mathcal{L}_{BC, \theta}(\vec{\mathbf{a}}_t^j|s_t) + \mathcal{L}_{BC, \theta}(\vec{\mathbf{a}}_t|s_t) \right]\]
Here $\beta$ scales the posterior-weighted term relative to the BC anchor on buffer samples.
Step 4
The Safety Belt — The Clipped Objective
During early exploration, the Critic will inevitably hallucinate, assigning a massive weight $\bar{w}_j$ to a highly sub-optimal, OOD action. If the M-step tries to fit this heavily weighted outlier, the supervised loss \(\mathcal{L}_{BC}\) will explode and destroy the offline pre-trained prior.
POCO solves this with the Clipped Surrogate Objective. We introduce a safety threshold $\zeta$ that bounds the maximum geometric deformation allowed per update on the generated actions, while retaining the baseline BC anchor:
\[\mathcal{J}_{POCO}(\theta) \approx \mathbb{E}_{(s_t, \vec{\mathbf{a}}_t) \sim \mathcal{D}, \{\vec{\mathbf{a}}_t^j\}_{j=1}^N \sim \pi(\cdot|s_t,\theta_i)} \left[ \beta \sum_{j=1}^N \bar{w}_j \text{clip}\left(\mathcal{L}_{BC, \theta}(\vec{\mathbf{a}}_t^j|s_t), 0, \zeta\right) + \mathcal{L}_{BC, \theta}(\vec{\mathbf{a}}_t|s_t) \right]\]
By restructuring RL into likelihood-free posterior inference and wrapping it in a clipped objective, POCO fine-tunes generative models stably, allowing them to learn autonomously and safely in the physical world.
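To make the effect of the clip concrete, the sketch below evaluates the objective for a single state, with and without the threshold $\zeta$. It assumes the per-particle BC losses, the weights $\bar{w}_j$, and the BC loss on the buffer sample are already available as plain floats; the function names and all numbers are hypothetical, and a real implementation would operate on differentiable tensors rather than Python scalars.

```python
def clip(x, low, high):
    """Clamp x into the interval [low, high]."""
    return max(low, min(x, high))


def poco_objective(particle_losses, weights, demo_loss, beta, zeta=None):
    """beta * sum_j w_j * clip(L_j, 0, zeta) + L_demo for one state.

    With zeta=None the clip is skipped, which gives the unclipped
    M-step objective from Step 3.
    """
    if len(particle_losses) != len(weights):
        raise ValueError("one weight per particle is required")
    weighted = 0.0
    for loss_j, w_j in zip(particle_losses, weights):
        if zeta is not None:
            loss_j = clip(loss_j, 0.0, zeta)
        weighted += w_j * loss_j
    return beta * weighted + demo_loss


# Made-up scenario: the critic overrates the third particle, an OOD action
# chunk whose BC loss is far larger than the others.
particle_losses = [0.2, 0.3, 50.0]
weights = [0.05, 0.05, 0.90]
demo_loss = 0.25  # BC loss on the (state, action chunk) pair from the buffer

unclipped = poco_objective(particle_losses, weights, demo_loss, beta=1.0)
clipped = poco_objective(particle_losses, weights, demo_loss, beta=1.0, zeta=1.0)
print(f"unclipped: {unclipped:.3f}, clipped: {clipped:.3f}")
```

Because the weights sum to one, the clipped particle term can never exceed $\beta\zeta$, whatever the critic outputs. Under this form of the clip, a particle whose loss is above $\zeta$ contributes the constant $\zeta$ and therefore no gradient, so the overrated outlier stops pulling on the policy while the BC anchor term is left untouched.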
Experiments
Simulation Experiments
Pure Online Learning Curves
Offline-to-online Learning Curves
Real-world Experiments
Learning Curves
Policy Rollouts
Cite our paper
If you find our research helpful and would like to reference it in your work, please consider using one of the following citations, depending on the format that best suits your needs:
- For the arXiv version:
@article{chen2026pocp,
  title={Posterior Optimization with Clipped Objective for Bridging Efficiency and Stability in Generative Policy Learning},
  author={Chen, Yuhui and Li, Haoran and Jiang, Zhennan and Qin, Yuxing and Wan, Yuxuan and Liu, Weiheng and Zhao, Dongbin},
  journal={arXiv preprint arXiv:2604.01860},
  year={2026}
}
Contact
If you have any questions, please feel free to contact Yuhui Chen.