🎉 News • 📖 Introduction • 📝 Unified Policy Gradient Estimator • ✨ Hybrid Post-Training
🚀 Getting Started • 📊 Main Results • 💖 Acknowledgements • 📨 Contact • 🎈 Citation
Two major sources of training data exist for post-training modern language models: on-policy data (rollouts generated by the model itself) and off-policy data (demonstrations from humans or other models).
In this paper, we show that these two approaches are not in conflict but are instances of a single optimization process.
- [2025-09-05] We introduce the Unified Policy Gradient Estimator and Hybrid Post-Training (HPT).
We introduce the Unified Policy Gradient Estimator, a unified theoretical framework that bridges a broad class of LLM post-training algorithms, including SFT and RL. Building on the insights derived from this framework, we further propose Hybrid Post-Training (HPT), an adaptive algorithm that dynamically alternates between SFT and RL signals in response to model performance.
SFT and RL, though usually seen as separate training paradigms, are actually instances of a single optimization process.
We derive a unified framework that mathematically shows how diverse post-training methods naturally emerge as gradients of a shared objective, shaped by assumptions about the data distribution and the bias–variance tradeoff. Specifically, we decompose the gradient of all post-training algorithms into four interchangeable components:
- Stabilization mask
- Reference-policy denominator
- Advantage estimate
- Likelihood gradient
Within this framework, the policy gradient of different post-training algorithms can be expressed as:
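Roughly, and with notation assumed here for illustration rather than copied from the paper, the four components compose multiplicatively at each token:

```latex
% Sketch only: symbols are illustrative assumptions, not the paper's exact notation.
% \mathbb{1}_{\mathrm{stable}}  : stabilization mask (e.g., a clipping / trust-region indicator)
% \pi_{\mathrm{ref}}            : reference-policy denominator
% \hat{A}_t                     : advantage estimate at token t
% \nabla_\theta \pi_\theta      : likelihood gradient
\nabla_\theta \mathcal{L}_{\mathrm{unified}}
  = \mathbb{E}\left[ \sum_{t}
      \mathbb{1}_{\mathrm{stable}} \,
      \frac{1}{\pi_{\mathrm{ref}}\!\left(y_t \mid y_{<t}, x\right)} \,
      \hat{A}_t \,
      \nabla_\theta \pi_\theta\!\left(y_t \mid y_{<t}, x\right)
    \right]
```

Under this reading, SFT corresponds to taking the current policy itself as the reference with an advantage of one, so the ratio collapses to ∇_θ log π_θ on demonstration tokens, while RL methods substitute a rollout (old) policy for π_ref together with a reward-based advantage and a clipping-style stabilization mask.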
The figure below illustrates how several widely used post-training algorithms can be represented under this unified view:
Our HPT algorithm dynamically adapts the mixing ratio between the SFT and RL losses based on model performance; see the repository code for the full implementation.
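In outline, the switching logic looks roughly like the sketch below; the function and argument names, the fixed threshold, and the hard 0/1 weights are placeholders for illustration, not the repository's actual implementation:

```python
import torch

def hpt_loss(sft_loss: torch.Tensor,
             rl_loss: torch.Tensor,
             rollout_reward: float,
             reward_threshold: float = 0.5) -> torch.Tensor:
    """Sketch of HPT's performance-gated mixing of SFT and RL losses.

    sft_loss:        negative log-likelihood on the off-policy demonstration
    rl_loss:         policy-gradient loss computed from on-policy rollouts
    rollout_reward:  mean reward of the model's own rollouts for this prompt,
                     used as the performance signal
    """
    if rollout_reward >= reward_threshold:
        # The model already handles this prompt well: rely on its own rollouts (RL).
        sft_weight, rl_weight = 0.0, 1.0
    else:
        # The model struggles: fall back to imitating the demonstration (SFT).
        sft_weight, rl_weight = 1.0, 0.0
    return sft_weight * sft_loss + rl_weight * rl_loss
```

In the actual repository the gating is presumably applied per prompt inside the verl training loop, and the exact switching rule and schedule may differ from this sketch.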
To run the Hybrid Post-Training (HPT) algorithm, follow these steps. First, set up the environment:

```bash
conda create -n hpt python=3.10
conda activate hpt
cd hpt
pip install git+https://github.com/NICTA/pyairports.git
pip install -r requirements.txt
pip install -e .
cd verl
pip install -e .
# For NVIDIA H20 devices
pip install nvidia-cublas-cu12==12.4.5.8
```

For training data, you can directly download LUFFY's openr1.parquet from Elliott/Openr1-Math-46k-8192 and put it in the data folder, or run the following script:

```bash
cd data
python prepare_train.py
```
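Alternatively, here is a minimal sketch of a programmatic download via huggingface_hub, assuming openr1.parquet sits at the top level of the Elliott/Openr1-Math-46k-8192 dataset repo on the Hugging Face Hub:

```python
from huggingface_hub import hf_hub_download

# Fetch LUFFY's training parquet into the local data/ folder.
# The filename and repo layout are assumptions; adjust them if the repo differs.
path = hf_hub_download(
    repo_id="Elliott/Openr1-Math-46k-8192",
    filename="openr1.parquet",
    repo_type="dataset",
    local_dir="data",
)
print(f"Downloaded training data to {path}")
```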
For validation data, run the preprocessing script:

```bash
cd data
python preprocess.py
```

You can run the following command to start training with HPT:

```bash
# For Qwen Model
bash exp_scripts/train.sh
# For LLaMA Model
bash exp_scripts/train_llama.sh
```

We also provide scripts for our main baselines:

```bash
# LUFFY
bash exp_scripts/train_luffy.sh
# SRFT
bash exp_scripts/train_srft.sh
```

We perform the evaluation using the scripts provided by DeepMath. You can conduct the evaluation by following its instructions.
HPT demonstrates consistent improvements across multiple models and benchmarks.
Our project mainly builds upon LUFFY and veRL, and we use vLLM for inference. We also use the datasets released with LUFFY and the Qwen2.5-Math and Llama-3.1 backbone models. We are grateful for these significant open-source contributions.
For questions about this work, please contact:
- Xingtai Lv: lvxt24@mails.tsinghua.edu.cn
- Youbang Sun: ybsun@mail.tsinghua.edu.cn
- Ning Ding: dn97@mail.tsinghua.edu.cn
If you find this work helpful, please cite our paper:
```bibtex
@article{lv2025towards,
  title={Towards a Unified View of Large Language Model Post-Training},
  author={Lv, Xingtai and Zuo, Yuxin and Sun, Youbang and Liu, Hongyi and Wei, Yuntian and Chen, Zhekai and He, Lixuan and Zhu, Xuekai and Zhang, Kaiyan and Wang, Bingning and others},
  journal={arXiv preprint arXiv:2509.04419},
  year={2025}
}
```



