🎉 News • 📖 Introduction • 📝 Unified Policy Gradient Estimator • ✨ Hybrid Post-Training
🚀 Getting Started • 📊 Main Results • 💖 Acknowledgements • 📨 Contact • 🎈 Citation
Two major sources of training data exist for post-training modern language models: on-policy data (rollouts generated by the model itself) and off-policy data (demonstrations from humans or other models).
In this paper, we show that these two approaches are not in conflict but are instances of a single optimization process.
- [2025-09-05] We introduce the Unified Policy Gradient Estimator and Hybrid Post-Training (HPT).
We introduce the Unified Policy Gradient Estimator, a unified theoretical framework that bridges a broad class of LLM post-training algorithms, including SFT and RL. Building on the insights derived from this framework, we further propose Hybrid Post-Training (HPT), an adaptive algorithm that dynamically alternates between SFT and RL signals in response to model performance.
SFT and RL, though usually seen as separate training paradigms, are actually instances of a single optimization process.
We derive a unified framework that mathematically shows how diverse post-training methods naturally emerge as gradients of a shared objective, shaped by assumptions about the data distribution and the bias–variance tradeoff. Specifically, we decompose the gradient of all post-training algorithms into four interchangeable components:
- Stabilization mask
- Reference-policy denominator
- Advantage estimate
- Likelihood gradient
Within this framework, the policy gradient of different post-training algorithms can be expressed as:
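Roughly, and with notation assumed here for illustration rather than copied from the paper, the four components compose multiplicatively at each token:

```latex
% Sketch only: symbols are illustrative assumptions, not the paper's exact notation.
% \mathbb{1}_{\mathrm{stable}}  : stabilization mask (e.g., a clipping / trust-region indicator)
% \pi_{\mathrm{ref}}            : reference-policy denominator
% \hat{A}_t                     : advantage estimate at token t
% \nabla_\theta \pi_\theta      : likelihood gradient
\nabla_\theta \mathcal{L}_{\mathrm{unified}}
  = \mathbb{E}\left[ \sum_{t}
      \mathbb{1}_{\mathrm{stable}} \,
      \frac{1}{\pi_{\mathrm{ref}}\!\left(y_t \mid y_{<t}, x\right)} \,
      \hat{A}_t \,
      \nabla_\theta \pi_\theta\!\left(y_t \mid y_{<t}, x\right)
    \right]
```

Under this reading, SFT corresponds to taking the current policy itself as the reference with an advantage of one, so the ratio collapses to ∇_θ log π_θ on demonstration tokens, while RL methods substitute a rollout (old) policy for π_ref together with a reward-based advantage and a clipping-style stabilization mask.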
The figure below illustrates how several widely used post-training algorithms can be represented under this unified view:
Our HPT algorithm dynamically adapts the mixing ratio between the SFT and RL losses based on model performance; see the repository code for the full implementation.
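In outline, the switching logic looks roughly like the sketch below; the function and argument names, the fixed threshold, and the hard 0/1 weights are placeholders for illustration, not the repository's actual implementation:

```python
import torch

def hpt_loss(sft_loss: torch.Tensor,
             rl_loss: torch.Tensor,
             rollout_reward: float,
             reward_threshold: float = 0.5) -> torch.Tensor:
    """Sketch of HPT's performance-gated mixing of SFT and RL losses.

    sft_loss:        negative log-likelihood on the off-policy demonstration
    rl_loss:         policy-gradient loss computed from on-policy rollouts
    rollout_reward:  mean reward of the model's own rollouts for this prompt,
                     used as the performance signal
    """
    if rollout_reward >= reward_threshold:
        # The model already handles this prompt well: rely on its own rollouts (RL).
        sft_weight, rl_weight = 0.0, 1.0
    else:
        # The model struggles: fall back to imitating the demonstration (SFT).
        sft_weight, rl_weight = 1.0, 0.0
    return sft_weight * sft_loss + rl_weight * rl_loss
```

In the actual repository the gating is presumably applied per prompt inside the verl training loop, and the exact switching rule and schedule may differ from this sketch.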
To run the Hybrid Post-Training (HPT) algorithm, follow these steps. First, set up the environment:

```bash
conda create -n hpt python=3.10
conda activate hpt
cd hpt
pip install git+https://github.com/NICTA/pyairports.git
pip install -r requirements.txt
pip install -e .
cd verl
pip install -e .
# For NVIDIA H20 devices
pip install nvidia-cublas-cu12==12.4.5.8
```

For training data, you can directly download LUFFY's openr1.parquet from Elliott/Openr1-Math-46k-8192 and put it in the data folder, or run the following script:

```bash
cd data
python prepare_train.py
```
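Alternatively, here is a minimal sketch of a programmatic download via huggingface_hub, assuming openr1.parquet sits at the top level of the Elliott/Openr1-Math-46k-8192 dataset repo on the Hugging Face Hub:

```python
from huggingface_hub import hf_hub_download

# Fetch LUFFY's training parquet into the local data/ folder.
# The filename and repo layout are assumptions; adjust them if the repo differs.
path = hf_hub_download(
    repo_id="Elliott/Openr1-Math-46k-8192",
    filename="openr1.parquet",
    repo_type="dataset",
    local_dir="data",
)
print(f"Downloaded training data to {path}")
```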
For validation data, run the preprocessing script:

```bash
cd data
python preprocess.py
```

You can run the following command to start training with HPT:

```bash
# For Qwen Model
bash exp_scripts/train.sh
# For LLaMA Model
bash exp_scripts/train_llama.sh
```

We also provide scripts for our main baselines:

```bash
# LUFFY
bash exp_scripts/train_luffy.sh
# SRFT
bash exp_scripts/train_srft.sh
```

We perform the evaluation using the scripts provided by DeepMath. You can conduct the evaluation by following its instructions.
HPT demonstrates consistent improvements across multiple models and benchmarks.
Our project mainly builds upon LUFFY and veRL, and we use vLLM for inference. We also use the datasets released with LUFFY and the Qwen2.5-Math and Llama-3.1 backbone models. We are grateful for these significant open-source contributions.
For questions about this work, please contact:
- Xingtai Lv: lvxt24@mails.tsinghua.edu.cn
- Youbang Sun: ybsun@mail.tsinghua.edu.cn
- Ning Ding: dn97@mail.tsinghua.edu.cn
If you find this work helpful, please cite our paper:
```bibtex
@article{lv2025towards,
  title={Towards a Unified View of Large Language Model Post-Training},
  author={Lv, Xingtai and Zuo, Yuxin and Sun, Youbang and Liu, Hongyi and Wei, Yuntian and Chen, Zhekai and He, Lixuan and Zhu, Xuekai and Zhang, Kaiyan and Wang, Bingning and others},
  journal={arXiv preprint arXiv:2509.04419},
  year={2025}
}
```



