Thank you very much for open-sourcing the code. I noticed that the DPO data preparation method formalise_dpo_data creates the chosen/rejected samples using the step-level labels, although in the paper you show that this doesn't bring any gain in performance.
Moreover, the way the training samples are generated loses the ordering of the steps, so I assume the reward model only learns whether a step is valid at all for a given problem, not whether a given step is taken at the correct stage of the reasoning. Is that correct?
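Concretely, my understanding is that the pairing amounts to a cross product of the correct and the incorrect steps. Here is a minimal sketch of that reading (the helper name build_dpo_pairs is just for illustration, not the repo's actual code):

```python
from itertools import product

def build_dpo_pairs(prompt, responses, correctness, dataset, qid=0):
    """Hypothetical sketch of the pairing I'm describing, not the repo's
    formalise_dpo_data: every step labelled correct is paired against every
    step labelled incorrect, and the ordering of steps is discarded."""
    chosen_steps = [s for s, ok in zip(responses, correctness) if ok]
    rejected_steps = [s for s, ok in zip(responses, correctness) if not ok]
    pairs = []
    for i, (good, bad) in enumerate(product(chosen_steps, rejected_steps)):
        pairs.append({
            "chosen": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": good},
            ],
            "rejected": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": bad},
            ],
            "id": f"{qid}-{i}",
            "dataset": dataset,
        })
    return pairs
```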
For example, let's say these are the responses and the correctness list:
responses = ["aaa", "bbb", "ccc", "ddd", "eee"]
correctness = [False, False, True, True, False]
Then the preference datapoints look something like this:
{
"chosen": [
{
"role": "user",
"content": "prompt text here"
},
{
"role": "assistant",
"content": "ccc"
}
],
"rejected": [
{
"role": "user",
"content": "prompt text here"
},
{
"role": "assistant",
"content": "aaa"
}
],
"id": "0-0",
"dataset": "dataset_name_here"
}
{
"chosen": [
{
"role": "user",
"content": "prompt text here"
},
{
"role": "assistant",
"content": "ccc"
}
],
"rejected": [
{
"role": "user",
"content": "prompt text here"
},
{
"role": "assistant",
"content": "bbb"
}
],
"id": "0-1",
"dataset": "dataset_name_here"
}
{
"chosen": [
{
"role": "user",
"content": "prompt text here"
},
{
"role": "assistant",
"content": "ccc"
}
],
"rejected": [
{
"role": "user",
"content": "prompt text here"
},
{
"role": "assistant",
"content": "eee"
}
],
"id": "0-2",
"dataset": "dataset_name_here"
}
{
"chosen": [
{
"role": "user",
"content": "prompt text here"
},
{
"role": "assistant",
"content": "ddd"
}
],
"rejected": [
{
"role": "user",
"content": "prompt text here"
},
{
"role": "assistant",
"content": "aaa"
}
],
"id": "0-3",
"dataset": "dataset_name_here"
}
{
"chosen": [
{
"role": "user",
"content": "prompt text here"
},
{
"role": "assistant",
"content": "ddd"
}
],
"rejected": [
{
"role": "user",
"content": "prompt text here"
},
{
"role": "assistant",
"content": "bbb"
}
],
"id": "0-4",
"dataset": "dataset_name_here"
}
{
"chosen": [
{
"role": "user",
"content": "prompt text here"
},
{
"role": "assistant",
"content": "ddd"
}
],
"rejected": [
{
"role": "user",
"content": "prompt text here"
},
{
"role": "assistant",
"content": "eee"
}
],
"id": "0-5",
"dataset": "dataset_name_here"
}