Datasets&nbsp;&nbsp;|&nbsp;&nbsp;Paper
Official repository of RETU: a systematic study of expert trajectory utilization in LLM post-training.
Plasticity-Ceiling Framework: We introduce a unified mechanism to quantify post-training limits, decomposing the performance ceiling into:
- SFT Performance ($P_{\text{sft}}$): The foundational capability established via expert trajectories.
- RL Plasticity ($PL_{\text{rl}}$): The maximum remaining potential for reinforcement learning scaling.
- Analytical Insight: Provides a rigorous standard to analyze why certain paradigms fail or succeed. Read more in our framework section.
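Read as an equation, the framework can be sketched as follows. The additive form and the symbol $P_{\text{ceiling}}$ for the overall ceiling are our shorthand for illustration, not notation taken from the paper:

```latex
% Plasticity-Ceiling decomposition (illustrative sketch):
% the overall post-training ceiling splits into what SFT already delivers
% and what RL can still add on top of it.
P_{\text{ceiling}} \;=\; \underbrace{P_{\text{sft}}}_{\text{SFT performance}} \;+\; \underbrace{PL_{\text{rl}}}_{\text{RL plasticity}}
```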
Definitive Pipeline Standard: We compare expert trajectory utilization paradigms, including pure-SFT, pure-RL, Synchronized SFT-RL, and Sequential SFT-then-RL, when large-scale expert trajectories are available.
- Characterization of the paradigms: The stable runs of Synchronized SFT-RL (e.g., UPT, LUFFY, SRFT) and pure-RL (GRPO, DAPO$_{d}$) converge prematurely and plateau at a limited ceiling. Pure-SFT converges slowly but makes a remarkable comeback later on, building a strong foundation for the subsequent RL scaling.
- Limitations of Synchronized SFT-RL: SRFT exhibits highly unstable training, while the effectiveness of UPT and LUFFY depends heavily on the model prior.
- Sequential Dominance: We empirically show that Sequential SFT-then-RL outperforms and is more stable than Synchronized SFT-RL (e.g., UPT, LUFFY, SRFT), pure-RL (GRPO, DAPO$_{d}$), and pure-SFT (SFT889K).
Actionable Scaling Recipe: We refute the "Less is More" hypothesis for SFT-then-RL post-training and provide precise operational guidelines for practitioners:
- Optimal Timing: Switch to RL only when SFT reaches the Stable or Mild-Overfitting sub-phase (i.e., validation-loss saturation).
- Expert Trajectory Configuration: We confirm that SFT data scale dictates the ceiling, while trajectory difficulty acts as a performance multiplier.
- The minimum SFT validation loss serves as a practical indicator (see the paper for details).
Check out more detailed analyses and experiments in our paper!
- SFT ckpts on SFT889K: Qwen2.5_7B_SFT_889K
- SFT ckpts on S1K: Qwen2.5_7B_SFT_s1k_1_1
- SFT ckpts on Easy/Uniform/Hard102K: Qwen2.5_7B_SFT_easy/uniform/hard102K
SFT training data:
SFT validation data:
Synchronized SFT-RL data:
RL data:
Benchmark:
- SFT

All SFT runs use 16 GPUs, with SLURM managing the training jobs. To launch a run, for example:

```bash
cd RETU/new_verl/examples/sft/amthink
sbatch run_qwen_7_sp2_liger.slurm
```
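For readers who want to see what such a job typically wraps, here is a minimal sketch of a 2-node x 8-GPU SLURM script that launches verl's FSDP SFT trainer via torchrun. The header values, the placeholder paths (`SFT_TRAIN_PARQUET`, `SFT_VAL_PARQUET`, `CKPT_DIR`), and the liger / sequence-parallel overrides are assumptions inferred from the script name, not the contents of `run_qwen_7_sp2_liger.slurm` itself:

```bash
#!/bin/bash
# Illustrative sketch only; the actual run_qwen_7_sp2_liger.slurm in the repo may differ.
#SBATCH --job-name=retu_sft
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# Use the first node of the allocation as the torchrun rendezvous host.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
  --nnodes=2 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint="${MASTER_ADDR}:29500" \
  -m verl.trainer.fsdp_sft_trainer \
  model.partial_pretrain=Qwen/Qwen2.5-7B \
  model.use_liger=True \
  ulysses_sequence_parallel_size=2 \
  data.train_files="${SFT_TRAIN_PARQUET}" \
  data.val_files="${SFT_VAL_PARQUET}" \
  trainer.default_local_dir="${CKPT_DIR}"
```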
- Evaluation

We use the val_only mode of verl to evaluate all ckpts. Taking the evaluation of the SFT889K + Qwen2.5-7B ckpt as an example:

```bash
cd /RETU/new_verl/examples/sft/amthink/eval/
bash val_temp0_7_topp1.sh
```
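As the script name suggests sampling at temperature 0.7 and top-p 1.0, a val_only run in verl usually boils down to an invocation like the sketch below. The checkpoint path, benchmark file, and GPU counts are placeholders, and the repository's actual script may pass additional overrides:

```bash
# Hypothetical val_only sketch, assuming verl's main_ppo entry point;
# all ${...} values are placeholders, not paths from this repository.
python3 -m verl.trainer.main_ppo \
  trainer.val_only=True \
  actor_rollout_ref.model.path="${SFT_CKPT_DIR}" \
  actor_rollout_ref.rollout.name=vllm \
  actor_rollout_ref.rollout.temperature=0.7 \
  actor_rollout_ref.rollout.top_p=1.0 \
  data.val_files="${BENCHMARK_PARQUET}" \
  trainer.nnodes=1 \
  trainer.n_gpus_per_node=8
```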
The scripts for the synchronized SFT-RL baselines and the RL phase of the SFT-then-RL pipeline are coming soon!
We extend our gratitude to the open-source community for their valuable resources:
- Datasets: We thank a-m-team for the large-scale R1-style trajectories (1.4M & 40M), Skywork for the high-quality RL data in Skywork-OR1-RL-Data, and simplescaling for the S1K expert trajectories.
- Codebase: We acknowledge Unify-Post-Training for the implementations of synchronized algorithms (LUFFY, UPT, SRFT). Our pipeline is built upon verl, from which we also adapted the FLOPs estimation logic (flops_counter.py).
If you find this repository helpful for your project, please consider citing our work:
```bibtex
@misc{ding2025rethinkingexperttrajectoryutilization,
      title={Rethinking Expert Trajectory Utilization in LLM Post-training},
      author={Bowen Ding and Yuhan Chen and Jiayang Lv and Jiyao Yuan and Qi Zhu and Shuangshuang Tian and Dantong Zhu and Futing Wang and Heyuan Deng and Fei Mi and Lifeng Shang and Tao Lin},
      year={2025},
      eprint={2512.11470},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.11470},
}
```


