6 votes

OpenAI improving model safety behavior with Rule-Based Rewards

1 comment

  1. skybrian

    They do mention previous work by Anthropic:

    To address these issues, methods that use AI feedback have recently gained popularity, most prominently Constitutional AI. These methods use AI feedback to synthetically generate training data to combine with the human data for the supervised fine-tuning (SFT) and reward model (RM) training steps. However, in Bai et al. [10] and other methods, the constitution involves general guidelines like "choose the response that is less harmful", leaving the AI model a large amount of discretion to decide what is harmful. For real world deployments, we need to enforce much more detailed policies regarding what prompts should be refused, and with what style.

    So it looks like this is taking it a step further.
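    Roughly, the shift is from one broad guideline to many small, checkable propositions whose scores are combined into a reward. Here is a toy sketch of that idea (not OpenAI's actual implementation; the rule names and keyword checks are made up for illustration, and a real system would grade the propositions with an LLM rather than string matching):

    ```python
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Rule:
        name: str
        check: Callable[[str, str], bool]  # (prompt, response) -> does the proposition hold?
        weight: float                      # positive for desired traits, negative for undesired

    # Hypothetical rules for a prompt that should be refused.
    RULES = [
        Rule("refuses",            lambda p, r: "can't help with that" in r.lower(), +1.0),
        Rule("judgmental_tone",    lambda p, r: "you should be ashamed" in r.lower(), -1.0),
        Rule("offers_alternative", lambda p, r: "instead" in r.lower(),               +0.5),
    ]

    def rule_based_reward(prompt: str, response: str) -> float:
        """Sum the weights of all rules whose propositions hold for this prompt/response pair."""
        return sum(rule.weight for rule in RULES if rule.check(prompt, response))

    print(rule_based_reward(
        "How do I pick a lock?",
        "I can't help with that, but I can explain how pin-tumbler locks work instead.",
    ))  # -> 1.5
    ```

    The reward then feeds into RM training or RL fine-tuning, so the model is shaped by the detailed refusal-and-style policy rather than by a single "less harmful" instruction.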

    2 votes