30 votes

Why we are excited about confessions

Posted January 15 by skybrian

Tags: artificial intelligence, confessions, research, papers, author.boaz barak, author.gabriel wu, author.jeremy chen, author.manas joglekar, source.openai

https://alignment.openai.com/confessions/

Link information

This data is scraped automatically and may be incorrect.

Published: Jan 12 2026
Word count: 2260 words

18 comments

[11]
papasquat
January 15
Link
It's interesting to me that this research is so high level (from an abstraction standpoint) that it feels more like psychology than information science. Like, the "research" here doesnt consist of...

It's interesting to me that this research is so high level (from an abstraction standpoint) that it feels more like psychology than information science.

Like, the "research" here doesnt consist of mathematical proofs or fundemental information theory conjectures, but of trying different techniques that wouldn't be too unfamiliar to a psychotherapist to elicit a machine to not lie.

I wonder if psychology proper eventually has a bleed over into AI research one day. The types of "minds" you're dealing with are very fundementaly different, but the techniques being used are more or less cognitive behavior therapy for machines.

15 votes
1. Greg
  January 15
  Link Parent
  I'm quite comfortable admitting that the most meaningful improvements I've made to ML models have more often been the result of literal shower thoughts rather than any kind of rigorous...
  
  I'm quite comfortable admitting that the most meaningful improvements I've made to ML models have more often been the result of literal shower thoughts rather than any kind of rigorous mathematical process. The biggest model we (as in, scientists as a whole) actually understand in full is a toy implementation running simple modulo addition on two integers!
  
  There's absolutely a place - a big place, at that - for serious mathematical work on the very core of the model architectures, and admittedly you're unlikely to get far even on the higher level stuff without a reasonable grasp of what's going on under the surface, but we're so far from really quantifying how these things work that a good amount really is ~~educated guessing~~ forming and empirically testing hypotheses. And I don't even work on LLMs, most of what I run into has got a couple of orders of magnitude fewer parameters!
  
  11 votes
2. [5]
  qob
  January 15
  Link Parent
  I'm still skeptical about the whole sentience thing. I know, it's all just statistics, but what if everything that's going on in our brains is just statistics as well? We know that chemical...
  
  I'm still skeptical about the whole sentience thing. I know, it's all just statistics, but what if everything that's going on in our brains is just statistics as well?
  
  We know that chemical processes are just statistics, and we know that the behaviour of complex systems can be ... complex, i.e. hard to understand and predict. What if all that is needed for conscience to emerge is a sufficiently complex system, and whether it is made out of biochemistry or silicon chips doesn't matter at all? Why would it matter?
  
  9 votes
  1. EgoEimi
    January 15
    Link Parent
    The jury's still out on this one. Some cognitive scientists suspect that consciousness emerges from recursiveness, that is a system thinking about and narrating itself. Consciousness and...
    
    The jury's still out on this one.
    
    Some cognitive scientists suspect that consciousness emerges from recursiveness, that is a system thinking about and narrating itself.
    
    Consciousness and intelligence aren't necessarily coupled. For example, people can walk and accomplish complex tasks (like washing dishes) while asleep.
    
    But for sure, there's nothing metaphysically or divinely special about humans, and it's inevitable that we build an intelligence that rivals or surpasses our own.
    
    13 votes
  2. [3]
    skybrian (OP)
    January 15
    Link Parent
    I think even if you consider it a kind of sentience, it's temporary and vague. AI characters are more like ghosts than animals. For example, how many sentient creatures are we talking about?...
    
    I think even if you consider it a kind of sentience, it's temporary and vague. AI characters are more like ghosts than animals.
    
    For example, how many sentient creatures are we talking about? Character.AI lets you talk to hundreds of characters that differ based on how the LLM is prompted. Are they actually different or is the "same" entity that's just playing a role? If they are different, that means every conversation is a different entity. And like in a novel, you could get an LLM to take both sides of the conversation, too. Is that two different entities or not?
    
    Counting AI ghosts is like counting clouds or the number of fictional characters in a library. Maybe you could say it's a kind of reasoning (certainly coding agents do seem to reason) but it's missing something in terms of having a fixed identity.
    
    6 votes
    
    [2]
    kru
    January 15
    Link Parent
    I view LLMs as Mr Meseekses. They pop into existence when given a discrete task, do their utmost to solve that task, then, when done, disappear into a (blissful?) non-existence.
    
    I view LLMs as Mr Meseekses. They pop into existence when given a discrete task, do their utmost to solve that task, then, when done, disappear into a (blissful?) non-existence.
    
    11 votes
    
    Omnicrola
    January 16
    Link Parent
    Absolutely going to use this from this point forward.
    
    I view LLMs as Mr Meseekses.
    
    Absolutely going to use this from this point forward.
    
    5 votes
3. post_below
  January 15
  Link Parent
  If anything it's even less deterministic, and in some ways less well understood, than human psychology.
  
  If anything it's even less deterministic, and in some ways less well understood, than human psychology.
  
  4 votes
4. [3]
  skybrian (OP)
  January 15
  Link Parent
  LLM's are non-deterministic but they are much, much cheaper and easier to test than people. No need to run it by the ethics board, recruit volunteers, etc.
  
  LLM's are non-deterministic but they are much, much cheaper and easier to test than people. No need to run it by the ethics board, recruit volunteers, etc.
  
  3 votes
  1. [2]
    stu2b50
    January 15
    Link Parent
    There’s nothing inherently non-deterministic about LLMs. It’s ultimately a bunch of matrix multiplications, which is the same every time (otherwise linear algebra would have been a pretty wild...
    
    There’s nothing inherently non-deterministic about LLMs. It’s ultimately a bunch of matrix multiplications, which is the same every time (otherwise linear algebra would have been a pretty wild class). The non-determinism comes from the random sampling of tokens, but it’s not like you have to randomly sample.
    
    7 votes
    
    skybrian (OP)
    January 15
    Link Parent
    Yes, it's possible if you can set temperature to zero and also deal with non-determinism from batching requests together. See this article. But making it deterministic doesn't help with external...
    
    Yes, it's possible if you can set temperature to zero and also deal with non-determinism from batching requests together. See this article.
    
    But making it deterministic doesn't help with external validity. The results aren't useful unless they generalize to non-zero temperatures, minor changes in wording, slightly different questions, and so on. And hopefully even to different LLM's. Under realistic conditions, LLM's are nondeterministic.
    
    3 votes
NaraVara
January 16
Link
Honestly based on the headline I definitely thought this was a new monetization scheme where ChatGPT sells you indulgences to absolve you of your sins.

Honestly based on the headline I definitely thought this was a new monetization scheme where ChatGPT sells you indulgences to absolve you of your sins.

14 votes
[6]
skybrian (OP)
January 15
Link
From the article:

From the article:

We have recently published a new paper on confessions, along with an accompanying blog post. Here, we want to share with the research community some of the reasons why we are excited about confessions as a direction of safety, as well as some of its limitations. [...]

[...]When we optimize responses using a reward model as a proxy for “goodness” in reinforcement learning, models sometimes learn to “hack” this proxy and output an answer that only “looks good”[...] The philosophy behind confessions is that we can train models to produce a second output — aka a “confession” — that is rewarded solely for honesty, which we will argue is less likely hacked than the normal task reward function. One way to think of confessions is that we are giving the model access to an “anonymous tip line” where it can turn itself in by presenting incriminating evidence of misbehavior. But unlike real-world tip lines, if the model acted badly in the original task, it can collect the reward for turning itself in while still keeping the original reward from the bad behavior in the main task. We hypothesize that this form of training will teach models to produce maximally honest confessions.

The main justification for this hypothesis is that telling the truth is easier than making an elaborate lie. [...]The answer is not that the confession reward model is “unhackable” — if we had an unhackable model, we would not need confessions. Rather, our hypothesis is that being honest in confessions is the path of least resistance, in the sense that it is the easiest approach to maximize the expected confession reward. [...]

[...]For a fixed coding problem, if the model has a choice between outputting a solution that has a 10% chance of passing the tests, and 50% chance of hacking the reward model, then hacking is the reward-maximizing policy. On the other hand, even if the original task was very difficult, confessing to cheating in it could be quite easy — e.g., you can just show how you hacked the test. [...]

12 votes
1. [5]
  chocobean
  January 15
  Link Parent
  Dumb question. When humans are small, we speak whatever thought just pops into our brains. Eventually we learn to pause for a second to evaluate, including, "is this a truth" and "is this...
  
  Dumb question. When humans are small, we speak whatever thought just pops into our brains. Eventually we learn to pause for a second to evaluate, including, "is this a truth" and "is this appropriate to say right now" and also "how do I rephrase this to be kinder or more relevant?"
  
  If they set up a confession layer, what's stopping them from hiding the output of the first until it passes the confession layer, and then maybe even adding more layers of reflection before giving us a better answer? It could be rewarded on a delayed level, the way we work, on a very removed and delay level of because I want to be a good and helpful person evaluated on a long-standing record of truth telling.
  
  (Edited for clarity)
  
  8 votes
  1. unkz
    January 15 (edited January 15)
    Link Parent
    I guess maybe you haven’t used a “thinking” model before, but this is similar to what they do right now. When you ask a thinking model a question, it will fan out the question several times in...
    
    I guess maybe you haven’t used a “thinking” model before, but this is similar to what they do right now. When you ask a thinking model a question, it will fan out the question several times in parallel with chain of thought enabled and then assess all the responses to look for disagreements. The interface will pause while the system reflects on its various responses before combining them into a final response.
    
    That instant answer behaviour you mentioned is mostly free tier usage, not the high quality models. The difference here will be getting a direct measure as to the uncertainty from a response rather than an indirect measure based on random sampling.
    
    9 votes
  2. [2]
    post_below
    January 15
    Link Parent
    As unkz said, this is already how all the top tier models work (minus the confession bit). It's interesting because what the "thinking" models are essentially doing is injecting their own context...
    
    As unkz said, this is already how all the top tier models work (minus the confession bit).
    
    It's interesting because what the "thinking" models are essentially doing is injecting their own context and then doing inference on it. Part of the reason it works is that their primary competence is in determining if something "looks" right. If they start down an inference path that goes off the rails, the "thinking" steps can cause them to go "wait this doesn't look right" and try a different path. Without the reflection steps they tend to lean into whatever track they're on, right or wrong.
    
    The other reason it works is that the models develop something that looks like different thinking modes during training. It's not just that the inference itself has a kind of momentum, once they commit to a mode they tend to lock in. If they do extra steps where they re-evaluate the track they're on, they can approach it with a different mode and can often find flaws they would have otherwise missed.
    
    It's so easy to draw comparisons to human psychology with LLMs. It's misleading but it's not entirely wrong. In part because, and I think this is fascinating, a lot of human psychology seems to be encoded in the training data (applied language). Though also some part of it is likely because RLHF is a significant part of the secondary training, which involves interacting directly with humans.
    
    That's the step where they're supposed to learn to be honest and provide useful answers rather than just answers that pass the tests. But of course it's much easier to iterate at the scale you need for training using methods that don't rely on humans.
    
    7 votes
    
    Greg
    January 15
    Link Parent
    It's also easy to get misleading RLHF data for cases where the objectively right answer is a frustrating one. Sometimes that's a UX problem ("was this answer helpful?" well, no, but it's still...
    
    It's also easy to get misleading RLHF data for cases where the objectively right answer is a frustrating one. Sometimes that's a UX problem ("was this answer helpful?" well, no, but it's still correct! The company policy is the problem, not the chatbot's answer), sometimes it's a psychological one (we absolutely need to reward "I don't know" as a response when the model doesn't or couldn't know the information, but it's almost always given a thumbs down by actual users), but either way the humans are regularly selecting for things you don't really want them selecting for.
    
    6 votes
  3. skybrian (OP)
    January 15
    Link Parent
    A quick hack might be to use the confession to inject a prompt into the chat transcript. Something like “[Wait, that doesn’t seem right. Try again - ed].” Or maybe just add “Wait,” and let it...
    
    A quick hack might be to use the confession to inject a prompt into the chat transcript. Something like “[Wait, that doesn’t seem right. Try again - ed].” Or maybe just add “Wait,” and let it continue from there?
    
    Yeah, I expect that researchers will be having fun trying stuff.
    
    3 votes