Great work!
Could you please clarify whether the reported results for MATH-500, AMC, and AIME 2024 (with the number of episodes set to 10, 30, and 80, respectively) come from the last-step checkpoint or from the best checkpoint observed during training?
While reproducing GRPO on AIME 2024, I observed that the last-step avg@16 performance is 13.13, whereas the best performance during training can exceed 17.
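To make the distinction concrete, here is a minimal sketch of the two selection rules I am asking about. It assumes avg@k means the mean accuracy over k sampled answers per problem, averaged over problems and reported as a percentage. The function names and the toy numbers are hypothetical illustrations, not results from my run or from the paper.

```python
from statistics import mean


def avg_at_k(correct):
    """Mean accuracy in percent.

    `correct` is a list with one entry per problem; each entry is a list
    of k flags (1 = sampled answer correct, 0 = incorrect).
    """
    return 100.0 * mean(mean(samples) for samples in correct)


def last_and_best(scores_by_step):
    """Return ((last_step, score), (best_step, score)).

    `scores_by_step` maps a training step to the score of its checkpoint.
    """
    last_step = max(scores_by_step)
    best_step = max(scores_by_step, key=scores_by_step.get)
    return (
        (last_step, scores_by_step[last_step]),
        (best_step, scores_by_step[best_step]),
    )


# Toy values for illustration only (hypothetical, not measured).
toy_scores = {100: 10.0, 200: 15.0, 300: 12.5}
last, best = last_and_best(toy_scores)
print("last-step:", last)
print("best:", best)
```

Reporting the best checkpoint amounts to selecting on the evaluation set itself, which is why the two numbers can differ noticeably on a small benchmark such as AIME 2024.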
Thank you for your assistance!