Datasets&nbsp;&nbsp;|&nbsp;&nbsp;Paper
Official repository of RETU: a systematic study of expert trajectory utilization in LLM post-training.
Plasticity-Ceiling Framework: We introduce a unified mechanism to quantify post-training limits, decomposing the performance ceiling into:
- SFT Performance ($P_{\text{sft}}$): The foundational capability established via expert trajectories.
- RL Plasticity ($PL_{\text{rl}}$): The maximum remaining potential for reinforcement learning scaling.
- Analytical Insight: Provides a rigorous standard to analyze why certain paradigms fail or succeed. Read more in our framework section.
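Read as an equation, the framework can be sketched as follows. The additive form and the symbol $P_{\text{ceiling}}$ for the overall ceiling are our shorthand for illustration, not notation taken from the paper:

```latex
% Plasticity-Ceiling decomposition (illustrative sketch):
% the overall post-training ceiling splits into what SFT already delivers
% and what RL can still add on top of it.
P_{\text{ceiling}} \;=\; \underbrace{P_{\text{sft}}}_{\text{SFT performance}} \;+\; \underbrace{PL_{\text{rl}}}_{\text{RL plasticity}}
```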
Definitive Pipeline Standard: We compare expert trajectory utilization paradigms, including pure-SFT, pure-RL, Synchronized SFT-RL, and Sequential SFT-then-RL, when large-scale expert trajectories are available.
- Characterization of the paradigms: The stable runs of Synchronized SFT-RL (e.g., UPT, LUFFY, SRFT) and pure-RL (GRPO, DAPO$_{d}$) converge prematurely and plateau at a limited ceiling. Pure-SFT converges slowly but makes a remarkable comeback later on, building a strong foundation for the subsequent RL scaling.
- Limitations of Synchronized SFT-RL: SRFT exhibits highly unstable training, while the effectiveness of UPT and LUFFY depends heavily on the model prior.
- Sequential Dominance: We empirically show that Sequential SFT-then-RL outperforms and is more stable than Synchronized SFT-RL (e.g., UPT, LUFFY, SRFT), pure-RL (GRPO, DAPO$_{d}$), and pure-SFT (SFT889K).
Actionable Scaling Recipe: We refute the "Less is More" hypothesis for SFT-then-RL post-training and provide precise operational guidelines for practitioners:
- Optimal Timing: Switch to RL only when SFT reaches the Stable or Mild-Overfitting sub-phase (i.e., validation-loss saturation).
- Expert Trajectory Configuration: We confirm that SFT data scale dictates the ceiling, while trajectory difficulty acts as a performance multiplier.
- The minimum SFT validation loss serves as a practical indicator (see the paper for details).
Check out more detailed analyses and experiments in our paper!
- SFT ckpts on SFT889K: Qwen2.5_7B_SFT_889K
- SFT ckpts on S1K: Qwen2.5_7B_SFT_s1k_1_1
- SFT ckpts on Easy/Uniform/Hard102K: Qwen2.5_7B_SFT_easy/uniform/hard102K
SFT training data:
SFT validation data:
Synchronized SFT-RL data:
RL data:
Benchmark:
- SFT

All SFT runs use 16 GPUs, with SLURM managing the training jobs. To launch a run, for example:

```bash
cd RETU/new_verl/examples/sft/amthink
sbatch run_qwen_7_sp2_liger.slurm
```
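For readers who want to see what such a job typically wraps, here is a minimal sketch of a 2-node x 8-GPU SLURM script that launches verl's FSDP SFT trainer via torchrun. The header values, the placeholder paths (`SFT_TRAIN_PARQUET`, `SFT_VAL_PARQUET`, `CKPT_DIR`), and the liger / sequence-parallel overrides are assumptions inferred from the script name, not the contents of `run_qwen_7_sp2_liger.slurm` itself:

```bash
#!/bin/bash
# Illustrative sketch only; the actual run_qwen_7_sp2_liger.slurm in the repo may differ.
#SBATCH --job-name=retu_sft
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# Use the first node of the allocation as the torchrun rendezvous host.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
  --nnodes=2 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint="${MASTER_ADDR}:29500" \
  -m verl.trainer.fsdp_sft_trainer \
  model.partial_pretrain=Qwen/Qwen2.5-7B \
  model.use_liger=True \
  ulysses_sequence_parallel_size=2 \
  data.train_files="${SFT_TRAIN_PARQUET}" \
  data.val_files="${SFT_VAL_PARQUET}" \
  trainer.default_local_dir="${CKPT_DIR}"
```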
- Evaluation

We use the val_only mode of verl to evaluate all ckpts. Taking the evaluation of the SFT889K + Qwen2.5-7B ckpt as an example:

```bash
cd /RETU/new_verl/examples/sft/amthink/eval/
bash val_temp0_7_topp1.sh
```
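As the script name suggests sampling at temperature 0.7 and top-p 1.0, a val_only run in verl usually boils down to an invocation like the sketch below. The checkpoint path, benchmark file, and GPU counts are placeholders, and the repository's actual script may pass additional overrides:

```bash
# Hypothetical val_only sketch, assuming verl's main_ppo entry point;
# all ${...} values are placeholders, not paths from this repository.
python3 -m verl.trainer.main_ppo \
  trainer.val_only=True \
  actor_rollout_ref.model.path="${SFT_CKPT_DIR}" \
  actor_rollout_ref.rollout.name=vllm \
  actor_rollout_ref.rollout.temperature=0.7 \
  actor_rollout_ref.rollout.top_p=1.0 \
  data.val_files="${BENCHMARK_PARQUET}" \
  trainer.nnodes=1 \
  trainer.n_gpus_per_node=8
```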
The scripts for the synchronized SFT-RL baselines and the RL phase of the SFT-then-RL pipeline are coming soon!
We extend our gratitude to the open-source community for their valuable resources:
- Datasets: We thank a-m-team for the large-scale R1-style trajectories (1.4M & 40M), Skywork for the high-quality RL data in Skywork-OR1-RL-Data, and simplescaling for the S1K expert trajectories.
- Codebase: We acknowledge Unify-Post-Training for the implementations of synchronized algorithms (LUFFY, UPT, SRFT). Our pipeline is built upon verl, from which we also adapted the FLOPs estimation logic (flops_counter.py).
If you find this repository helpful for your project, please consider citing our work:
```bibtex
@misc{ding2025rethinkingexperttrajectoryutilization,
      title={Rethinking Expert Trajectory Utilization in LLM Post-training},
      author={Bowen Ding and Yuhan Chen and Jiayang Lv and Jiyao Yuan and Qi Zhu and Shuangshuang Tian and Dantong Zhu and Futing Wang and Heyuan Deng and Fei Mi and Lifeng Shang and Tao Lin},
      year={2025},
      eprint={2512.11470},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.11470},
}
```


