Skip to content

DPO using step level loses the ordering of steps information #4

@kevinNejad

Description

@kevinNejad

thank you very much for open-sourcing the code. I noticed dpo data preparation method formalise_dpo_data creates chosen, rejected samples using the step-level labels, although in the paper you show that this doesn't bring any gain in performance.

Moreover, the way that training samples are generated loses the ordering of steps, so I assume the reward model only learns if a step is valid at all for a given problem, and not evaluating if a given steps is taken at a correct stage of reasoning. is that correct?

For example, let's say these are the responses and correctness_lists
responses ["aaa", "bbb", "ccc", "ddd", "eee"]
correctness = [False, False, True, True, False]
then the preference datapoints looks something like this

{
    "chosen": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "ccc"
        }
    ],
    "rejected": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "aaa"
        }
    ],
    "id": "0-0",
    "dataset": "dataset_name_here"
}
{
    "chosen": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "ccc"
        }
    ],
    "rejected": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "bbb"
        }
    ],
    "id": "0-1",
    "dataset": "dataset_name_here"
}
{
    "chosen": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "ccc"
        }
    ],
    "rejected": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "eee"
        }
    ],
    "id": "0-2",
    "dataset": "dataset_name_here"
}
{
    "chosen": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "ddd"
        }
    ],
    "rejected": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "aaa"
        }
    ],
    "id": "0-3",
    "dataset": "dataset_name_here"
}
{
    "chosen": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "ddd"
        }
    ],
    "rejected": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "bbb"
        }
    ],
    "id": "0-4",
    "dataset": "dataset_name_here"
}
{
    "chosen": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "ddd"
        }
    ],
    "rejected": [
        {
            "role": "user",
            "content": "prompt text here"
        },
        {
            "role": "assistant",
            "content": "eee"
        }
    ],
    "id": "0-5",
    "dataset": "dataset_name_here"
}

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions