Rethinking Expert Trajectory Utilization in LLM Post-training

🤗 Datasets | 📑 Paper

Official Repository of RETU: a systematic study of expert trajectory utilization in LLM post-training.

✨ Features

πŸ“ Plasticity-Ceiling Framework: We introduce a unified mechanism to quantify post-training limits, decomposing the performance ceiling ($A_{\text{post}}$) into measurable components:

  • SFT Performance ($P_{\text{sft}}$): The foundational capability established via expert trajectories.
  • RL Plasticity ($PL_{\text{rl}}$): The maximum remaining potential for reinforcement learning scaling.
  • 📊 Analytical Insight: Provides a rigorous standard to analyze why certain paradigms fail or succeed. Read more in our framework section.
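
As a back-of-the-envelope illustration, the minimal sketch below assumes the decomposition is additive ($A_{\text{post}} = P_{\text{sft}} + PL_{\text{rl}}$; see the paper for the precise definitions) and reads the remaining RL plasticity of an SFT checkpoint off two measured accuracies:

```python
# Minimal sketch -- assumes the additive decomposition A_post = P_sft + PL_rl;
# see the paper for the exact definition of each quantity.
def rl_plasticity(p_sft: float, a_post: float) -> float:
    """Remaining RL potential of an SFT checkpoint under the additive view.

    p_sft  : benchmark accuracy of the SFT checkpoint (fraction in [0, 1])
    a_post : accuracy ceiling reached after the subsequent RL phase
    """
    return a_post - p_sft

# Hypothetical numbers, for illustration only:
print(rl_plasticity(p_sft=0.52, a_post=0.61))  # ~0.09 of accuracy left for RL scaling
```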

πŸ† Definitive Pipeline Standard: We compare the expert trajectory utilization paradigms, including pure-SFT, pure-RL, Synchronized SFT-RL, and Sequential SFT-then-RL, when large-scale expert trajectories are available.

  • ⚖️ Characterization of the paradigms: Stable runs of Synchronized SFT-RL (e.g., UPT, LUFFY, SRFT) and pure-RL (GRPO, DAPO$_d$) converge prematurely and plateau at a limited ceiling. Pure-SFT converges slowly but makes a remarkable comeback later, building a strong foundation for the subsequent RL scaling.
  • 🫀 Limitations of Synchronized SFT-RL: SRFT exhibits highly unstable training, while the effectiveness of UPT and LUFFY depends heavily on the model prior.
  • ✔️ Sequential Dominance: We empirically show that Sequential SFT-then-RL outperforms, and is more stable than, Synchronized SFT-RL (e.g., UPT, LUFFY, SRFT), pure-RL (GRPO, DAPO$_d$), and pure-SFT (SFT889K).

🧭 Actionable Scaling Recipe: We refute the "Less is More" hypothesis for SFT-then-RL post-training and provide precise operational guidelines for practitioners:

  • ✔️ Optimal Timing: Switch to RL only when SFT reaches the Stable or Mild-Overfitting Sub-phase (Validation Loss Saturation); see the sketch below.
  • ✔️ Expert Trajectory Configuration: We confirm that SFT data scale dictates the ceiling, while trajectory difficulty acts as a performance multiplier.
  • ✔️ The Minimum SFT Validation Loss as an Indicator:

We identify a strong negative correlation between the minimum SFT validation loss and the maximal subsequent post-training ceiling. This establishes the minimum validation loss as a valuable a priori indicator that requires no expensive RL training: a lower minimum loss reliably signals greater overall post-training capacity within the SFT-then-RL pipeline.
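
As a minimal, illustrative sketch of the two recipes above (the saturation-based switch timing and the minimum-validation-loss indicator), the snippet below uses an assumed window size, tolerance, and toy numbers; these are not the exact criteria or data from the paper:

```python
from scipy.stats import spearmanr  # any rank-correlation routine would do

def val_loss_saturated(val_losses, window=3, rel_tol=0.005):
    """Heuristic switch signal: SFT validation loss has stopped improving.

    Returns True when the best loss within the last `window` evaluations improves
    on the previous best by less than `rel_tol` (relative), i.e. SFT has entered
    the Stable / Mild-Overfitting sub-phase and the RL phase can start.
    """
    if len(val_losses) <= window:
        return False
    prev_best = min(val_losses[:-window])
    recent_best = min(val_losses[-window:])
    return (prev_best - recent_best) / prev_best < rel_tol

# A priori ranking of SFT configurations: a lower minimum validation loss should
# go with a higher subsequent post-training ceiling (strong negative correlation).
min_val_losses = [0.62, 0.55, 0.48, 0.44]      # hypothetical per-run minima
post_ceilings  = [0.41, 0.47, 0.55, 0.58]      # hypothetical SFT-then-RL ceilings
rho, _ = spearmanr(min_val_losses, post_ceilings)
print(f"Spearman rho = {rho:.2f}")             # -1.00 on this toy data
```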

📑 See more detailed analysis and experiments in our paper!

🤖 SFT CKPTs:

⛁ Datasets:

SFT training data:

SFT validation data:

Synchronized SFT-RL data:

RL data:

Benchmark:

Scripts

  • SFT

All SFT runs use 16 GPUs, with SLURM managing the training jobs. To launch a run, for example:

cd RETU/new_verl/examples/sft/amthink
sbatch run_qwen_7_sp2_liger.slurm
  • Evaluation

We use the val_only mode of verl to evaluate all ckpts. Taking the SFT889K + Qwen2.5-7B ckpt evaluation as an example:
cd /RETU/new_verl/examples/sft/amthink/eval/
bash val_temp0_7_topp1.sh

The scripts for the Synchronized SFT-RL baselines and the RL phase of the SFT-then-RL pipeline are coming soon!

💖 Acknowledgement

We extend our gratitude to the open-source community for their valuable resources:

📚 Bibliography

If you find this repository helpful for your project, please consider citing our work:

@misc{ding2025rethinkingexperttrajectoryutilization,
      title={Rethinking Expert Trajectory Utilization in LLM Post-training}, 
      author={Bowen Ding and Yuhan Chen and Jiayang Lv and Jiyao Yuan and Qi Zhu and Shuangshuang Tian and Dantong Zhu and Futing Wang and Heyuan Deng and Fei Mi and Lifeng Shang and Tao Lin},
      year={2025},
      eprint={2512.11470},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.11470}, 
}
