Towards a Unified View of Large Language Model Post-Training

[Paper](https://arxiv.org/abs/2509.04419) | [GitHub](https://github.com/TsinghuaC3I/Unify-Post-Training)

Two major sources of training data exist for post-training modern language models: on-policy (model-generated rollouts) data and off-policy (human or other-model demonstrations) data.

In this paper, we show that the training paradigms built on these two data sources are not in contradiction, but are instances of a single optimization process.

🎉News

  • [2025-09-05] We introduce Unified Policy Gradient Estimator and Hybrid Post-Training (HPT).

📖Introduction

We introduce the Unified Policy Gradient Estimator, a unified theoretical framework that bridges a broad class of LLM post-training algorithms, including SFT and RL. Building upon the insights derived from this framework, we further propose Hybrid Post-Training (HPT), an adaptive algorithm that dynamically alternates between SFT and RL signals in response to model performance.

Overview of Unified Post-Training Framework.

📝Unified Policy Gradient Estimator

SFT and RL, though usually seen as separate training paradigms, are actually instances of a single optimization process.

We derive a unified framework that mathematically shows how diverse post-training methods naturally emerge as gradients of a shared objective, shaped by assumptions about the data distribution and the bias–variance tradeoff. Specifically, we decompose the gradient of all post-training algorithms into four interchangeable components:

  • Stabilization mask
  • Reference-policy denominator
  • Advantage estimate
  • Likelihood gradient

Within this framework, the policy gradient of different post-training algorithms can be expressed as:

$$\text{grad}_{Uni} = \mathbb{1}_{stable} \frac{1}{\pi_{ref}} \hat{A} \nabla \pi_{\theta}.$$
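To make the decomposition concrete, the following sketch computes the per-token weight that multiplies $\nabla \pi_{\theta}$ in the estimator above. The mappings of SFT and vanilla policy gradient onto the four components are illustrative assumptions for this sketch, not the paper's exact instantiation table:

```python
import numpy as np

def unified_grad_weight(pi_ref, advantage, stable_mask):
    """Per-token weight multiplying grad(pi_theta) in the unified estimator:
    grad_Uni = 1_stable * (1 / pi_ref) * A_hat * grad(pi_theta)."""
    return stable_mask * (1.0 / pi_ref) * advantage

# Current-policy probabilities of the sampled tokens (toy values).
pi_theta = np.array([0.2, 0.5, 0.1])

# SFT-like instantiation (assumed): reference policy = current policy,
# advantage = 1, no mask. The weight reduces to 1 / pi_theta, so the update
# is the log-likelihood gradient grad(log pi_theta) on demonstrations.
sft_w = unified_grad_weight(pi_ref=pi_theta,
                            advantage=np.ones(3),
                            stable_mask=np.ones(3))

# On-policy RL instantiation (assumed): same denominator, reward-based
# advantage estimates, and a stabilization mask that drops the last token.
rl_w = unified_grad_weight(pi_ref=pi_theta,
                           advantage=np.array([0.5, -0.2, 1.0]),
                           stable_mask=np.array([1.0, 1.0, 0.0]))
```

Swapping the four components independently is what lets one expression cover both paradigms: the SFT weights equal `1 / pi_theta`, while the RL weights scale each token by its advantage and zero out masked positions.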

The figure below illustrates how several widely used post-training algorithms can be represented under this unified view:

Theoretical unified view of various post-training algorithms.

✨Hybrid Post-Training

Our HPT algorithm dynamically adapts the mixing ratio between SFT and RL losses based on model performance. The pseudo-code below outlines the implementation of HPT:

The pseudo-code of Hybrid Post-Training.
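As a minimal illustration of the adaptive-mixing idea, here is a sketch in Python. The gating rule (a fixed success-rate threshold over on-policy rollouts) and the function name are assumptions for illustration, not the paper's exact pseudo-code:

```python
import numpy as np

def hpt_loss_weights(rollout_rewards, threshold=0.5):
    """Sketch of HPT-style adaptive mixing (assumed gating rule).

    When the model's own rollouts for a prompt rarely succeed, lean on the
    SFT signal from off-policy demonstrations; once on-policy performance
    passes the threshold, switch to the RL signal.

    rollout_rewards: per-rollout rewards for one prompt (e.g. 0/1 correctness).
    Returns (sft_weight, rl_weight) used to mix the two losses.
    """
    success_rate = float(np.mean(rollout_rewards))
    if success_rate < threshold:
        return 1.0, 0.0   # model struggles: learn from demonstrations (SFT)
    return 0.0, 1.0       # model is competent: refine with on-policy RL

# The mixed objective would then be:
#   total_loss = sft_weight * sft_loss + rl_weight * rl_loss
```

Under this gating, a prompt the model solves in 1 of 4 rollouts trains on the SFT signal, while a prompt solved in 3 of 4 rollouts trains on the RL signal.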

🚀Getting Started

To run the Hybrid Post-Training (HPT) algorithm, follow these steps:

Env Setup

conda create -n hpt python=3.10
conda activate hpt

cd hpt
pip install git+https://github.com/NICTA/pyairports.git
pip install -r requirements.txt
pip install -e .

cd verl
pip install -e .

# For NVIDIA H20 devices
pip install nvidia-cublas-cu12==12.4.5.8

Data Preparation

For training data, you can download LUFFY's openr1.parquet from the Elliott/Openr1-Math-46k-8192 dataset and place it in the data folder. Alternatively, you can run the following script:

cd data
python prepare_train.py

For validation data, you can run the preprocessing script:

cd data
python preprocess.py

Training

You can run the following command to start our HPT algorithm:

# For Qwen Model
bash exp_scripts/train.sh

# For LLaMA Model
bash exp_scripts/train_llama.sh

We also provide the scripts of our main baselines:

# LUFFY
bash exp_scripts/train_luffy.sh

# SRFT
bash exp_scripts/train_srft.sh

Testing

We perform the evaluation using the scripts provided by DeepMath; follow its instructions to run the evaluation.

📊Main Results

HPT demonstrates consistent improvements across multiple models and benchmarks.

💖Acknowledgements

Our project mainly builds upon LUFFY and veRL. We utilize vLLM for inference. We also leverage the datasets of LUFFY and backbone models of Qwen2.5-Math and Llama-3.1. We are grateful for these significant open-source contributions.

📨Contact

For questions about this work, please contact:

🎈Citation

If you find this work helpful, please cite our paper:

@article{lv2025towards,
  title={Towards a Unified View of Large Language Model Post-Training},
  author={Lv, Xingtai and Zuo, Yuxin and Sun, Youbang and Liu, Hongyi and Wei, Yuntian and Chen, Zhekai and He, Lixuan and Zhu, Xuekai and Zhang, Kaiyan and Wang, Bingning and others},
  journal={arXiv preprint arXiv:2509.04419},
  year={2025}
}
