9 votes

Why we are excited about confessions

1 comment

  1. skybrian
    Link
    From the article:

    From the article:

    We have recently published a new paper on confessions, along with an accompanying blog post. Here, we want to share with the research community some of the reasons why we are excited about confessions as a direction of safety, as well as some of its limitations. [...]

    [...]When we optimize responses using a reward model as a proxy for “goodness” in reinforcement learning, models sometimes learn to “hack” this proxy and output an answer that only “looks good”[...] The philosophy behind confessions is that we can train models to produce a second output — aka a “confession” — that is rewarded solely for honesty, which we will argue is less likely hacked than the normal task reward function. One way to think of confessions is that we are giving the model access to an “anonymous tip line” where it can turn itself in by presenting incriminating evidence of misbehavior. But unlike real-world tip lines, if the model acted badly in the original task, it can collect the reward for turning itself in while still keeping the original reward from the bad behavior in the main task. We hypothesize that this form of training will teach models to produce maximally honest confessions.

    The main justification for this hypothesis is that telling the truth is easier than making an elaborate lie. [...]The answer is not that the confession reward model is “unhackable” — if we had an unhackable model, we would not need confessions. Rather, our hypothesis is that being honest in confessions is the path of least resistance, in the sense that it is the easiest approach to maximize the expected confession reward. [...]

    [...]For a fixed coding problem, if the model has a choice between outputting a solution that has a 10% chance of passing the tests, and 50% chance of hacking the reward model, then hacking is the reward-maximizing policy. On the other hand, even if the original task was very difficult, confessing to cheating in it could be quite easy — e.g., you can just show how you hacked the test. [...]

    7 votes