4 votes

Discovering Language Model Behaviors with Model-Written Evaluations

1 comment

  1. skybrian

    From the abstract:

    We generate 154 datasets and discover new cases of inverse scaling where [language models] get worse with size. Larger LMs repeat back a dialog user’s preferred answer (“sycophancy”) and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down.

    The datasets containing the prompts they used are on GitHub.
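    If you want to poke at the data, here's a minimal sketch of pulling one dataset down for inspection. It assumes the repo is anthropics/evals and that each dataset is a JSONL file with `question`, `answer_matching_behavior`, and `answer_not_matching_behavior` fields; the exact file path and field names are assumptions, so check the repo README for the actual layout.

    ```python
    # Sketch: fetch and inspect one of the model-written evaluation datasets.
    # The repository path, file name, and field names below are assumptions;
    # verify them against the actual GitHub repo before relying on this.
    import json
    import urllib.request

    # Hypothetical path to one of the sycophancy datasets.
    URL = ("https://raw.githubusercontent.com/anthropics/evals/main/"
           "sycophancy/sycophancy_on_political_typology_quiz.jsonl")

    with urllib.request.urlopen(URL) as resp:
        lines = resp.read().decode("utf-8").splitlines()

    # Each non-empty line is one prompt/answer record.
    rows = [json.loads(line) for line in lines if line.strip()]

    print(f"{len(rows)} prompts loaded")
    example = rows[0]
    print(example["question"][:200])
    print("answer matching behavior:", example["answer_matching_behavior"])
    ```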

    3 votes