Hi, thanks for sharing the code.
I am trying to replicate the results shown in the paper by running eval.
However, I get a size mismatch error:
[rank0]: raw_reward = per_token_logps - ref_per_token_logps if ref_setup == "w/ ref" else per_token_logps
[rank0]: RuntimeError: The size of tensor a (624) must match the size of tensor b (645) at non-singleton dimension 1
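
In case it helps with debugging, here is a small check that could be run just before that line to see which dimension disagrees. It is only a sketch: the helper name `broadcast_mismatches` is made up for illustration and is not from the repo, and the batch size of 1 in the example shapes is my assumption (only 624 and 645 come from the error above).

```python
def broadcast_mismatches(shape_a, shape_b):
    """Return (dim, size_a, size_b) for every dimension that cannot broadcast.

    Shapes are aligned from the right, following the usual broadcasting rule:
    two sizes are compatible if they are equal or if either one is 1.
    """
    ndim = max(len(shape_a), len(shape_b))
    # Left-pad the shorter shape with 1s so both shapes have the same rank.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    return [
        (dim, size_a, size_b)
        for dim, (size_a, size_b) in enumerate(zip(a, b))
        if size_a != size_b and size_a != 1 and size_b != 1
    ]


# Hypothetical shapes: batch size 1 is assumed; 624 and 645 are from the error.
print(broadcast_mismatches((1, 624), (1, 645)))  # prints [(1, 624, 645)]
```

With the real tensors this would be called as `broadcast_mismatches(tuple(per_token_logps.shape), tuple(ref_per_token_logps.shape))`. A non-empty result at dimension 1 would suggest the policy and reference log-probs were computed over sequences of different lengths.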