9 votes

Deep Think with Confidence

3 comments

  1. [2]
    skybrian
    (edited)
    Link

    From the web page announcing the paper:

    Deep Think with Confidence (DeepConf) is a parallel thinking method that enhances both LLM reasoning performance and efficiency at test time. It leverages model-internal confidence signals to dynamically filter low-quality reasoning traces during or after generation. It requires no additional model training or hyperparameter tuning and can be seamlessly integrated into existing serving frameworks (see the vLLM example provided, full source codes will be released soon). It achieves up to 99.9% accuracy on AIME 2025 while reducing generated tokens by up to 84.7% compared to the standard thinking approaches.

    Here’s an overview of the paper.

    Reasoning models are currently very inefficient, generating lots of tokens, but this might not continue for very long?
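
    As a rough illustration, the core filtering idea seems to be something like this (my own reading of the abstract; the (answer, confidence) trace format, the mean-token-logprob confidence signal, and the keep fraction are assumptions, not the paper's released code):

    ```python
    from collections import Counter

    def deepconf_vote(traces, keep_fraction=0.5):
        """Majority-vote over only the most confident reasoning traces.

        Each trace is assumed to be an (answer, confidence) pair, where
        confidence is a model-internal signal such as the mean token
        logprob of the trace (an assumption, not the paper's definition).
        """
        # Rank traces by confidence and drop the low-confidence tail.
        ranked = sorted(traces, key=lambda t: t[1], reverse=True)
        kept = ranked[:max(1, int(len(ranked) * keep_fraction))]

        # Majority vote over the surviving traces' final answers.
        votes = Counter(answer for answer, _ in kept)
        return votes.most_common(1)[0][0]

    # Hypothetical traces: (final answer, mean token logprob).
    traces = [("42", -0.12), ("42", -0.15), ("17", -0.90), ("42", -0.20)]
    print(deepconf_vote(traces))  # -> "42"
    ```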

    3 votes
    1. okiyama
      Link Parent

      Wow, 85% cut is something else!

      2 votes
  2. SloMoMonday
    Link

    This is probably the single biggest issue with "reasoning"/inference models, and I am practically militant about how it keeps being brushed off by so many people who encourage their use: the same input does not produce the same output (I know static seeding is possible, but that is basically an exceptionally counterintuitive query system that changes with any shift in context or parameter update). But there has been untold anxiety, loss of income, trauma, and at least one death (that we know of) caused by this technology acting in unexpected ways. And no attention is paid to the fact that the safety measures keep failing because those measures are built on the same unreliable technology.

    It's one of the reasons I've not been able to figure out a suitable AI product that I could put to market with a clear conscience. I can't understand why I have a far lower failure tolerance than the multi-billion-dollar companies forcing this tech on retail and enterprise clients, but I guess that's why I'm not making the big bucks.

    I'm still reading through the specifics of this study, and I do agree with the idea of a multi-factor confidence and multi-model assessment framework. An error reduction this big makes the single-check, mega-model strategies favored by OpenAI and other big players seem outright reckless. But the gripe I have with this and most other strategies for solving big inference issues is the insistence on developing only the method of inference while paying no attention to refining non-prompt inputs: parameter narrowing, cleaning context, establishing certainty of intent, dynamic diagnostics.

    I've been working on my own ideas, and it's obviously slow using retail hardware and having a few hours every other week. But I've been trying my best to isolate and rectify error-prone variables. My current strategy to prevent model collapse and hallucinations is to inject logical checks within and between queries. These checks include creating variables as they occur in context, querying known quantities within the dataset, simple arithmetic, and security queries to ensure that important system prompts have not bled out. This adds a lot of time when working with a single environment, but splitting the validation off to a mirror instance and measuring the delta allows me to prompt the user for clarification or clearly communicate areas of uncertainty.
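
    A minimal sketch of the mirror-instance check I mean; call_model and the probe set are placeholders for whatever backend and dataset are in use, not a real API:

    ```python
    def validate_context(call_model, probes):
        """Run known-answer probes on two instances and compare.

        probes maps a probe prompt (simple arithmetic, a known quantity
        in the dataset, a system-prompt leak check) to its expected
        answer. Returns the probes whose answers diverged, so the
        caller can ask the user for clarification or flag that area
        of the context as uncertain.
        """
        diverged = {}
        for prompt, expected in probes.items():
            primary = call_model(prompt, instance="primary")
            mirror = call_model(prompt, instance="mirror")
            # Any delta between the instances, or against the known
            # answer, marks the context around that probe as suspect.
            if primary != mirror or primary != expected:
                diverged[prompt] = (expected, primary, mirror)
        return diverged

    # Hypothetical probes with known answers (illustrative only).
    probes = {
        "What is 17 + 25?": "42",
        "How many rows are in the loaded dataset?": "1204",
    }
    ```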

    Results are a mixed bag, but I'm hoping it's a moot point if the bubble bursts. And then we do at least one good thing as a society and have a sci-fi-style ban on all AI research, because corporations and billionaires cannot be trusted with it.

    2 votes