I’m reasonably confident that this is basically just a fine-tune with built-in “chain of thought” (possibly with Monte Carlo or similar parallel techniques), so you don’t need to do it explicitly yourself. The benchmarks they give seem to match that, and it explains the lack of streaming. It doesn’t seem meaningfully better than what some of us have already been doing with GPT-4, but the convenience counts for something.
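Roughly the kind of thing I mean by “doing it yourself” is a two-pass prompt like the sketch below. The model name, prompt wording, and two-call structure are just my illustration, not anything OpenAI has said about how o1 works internally.

    # Minimal sketch of explicit chain-of-thought prompting with the standard
    # chat completions API (illustrative only; not OpenAI's internal approach).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def answer_with_cot(question: str, model: str = "gpt-4") -> str:
        # Pass 1: have the model reason step by step (the part o1 now hides).
        reasoning = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "Work through the problem step by step before answering."},
                {"role": "user", "content": question},
            ],
        ).choices[0].message.content

        # Pass 2: ask for a concise final answer conditioned on that reasoning,
        # roughly the "summary" the user actually sees.
        final = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "user", "content": question},
                {"role": "assistant", "content": reasoning},
                {"role": "user", "content": "Now give only the final answer, as concisely as possible."},
            ],
        ).choices[0].message.content
        return final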
I read a paper in Nature a little while ago that used entropy and clustering techniques to determine if ChatGPT was hallucinating. They fed in the same prompt and slight variations of it, and clustered the responses based on the entropy of the answers.
I forget the details, but I think the gist was that when ChatGPT consistently provided answers in a particular cluster, there was a greater likelihood that the answer was not a hallucination and was correct. If the answers were spread evenly across clusters, it could mean the model was conflicted and giving incorrect answers confidently. If small perturbations in the prompt led to drastically different answers, it was likely hallucinating or had low confidence in the answer.
Or some variation of the above. It would be interesting if they baked that into the responses as a second stage to improve results.
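From memory, the shape of it was something like the sketch below: sample several answers to the same question, cluster the ones that mean the same thing, and use the entropy over the clusters as the uncertainty signal. (The paper clustered by semantic equivalence with an entailment model; the crude string normalisation here is just a stand-in to show the idea.)

    import math
    from collections import Counter

    def cluster_key(answer: str) -> str:
        # Stand-in for real semantic clustering: lowercase, strip punctuation.
        return "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace()).strip()

    def semantic_entropy(answers: list[str]) -> float:
        # Entropy over answer clusters: low when samples agree, high when they scatter.
        clusters = Counter(cluster_key(a) for a in answers)
        total = sum(clusters.values())
        return -sum((n / total) * math.log(n / total) for n in clusters.values())

    print(semantic_entropy(["Paris", "Paris.", "paris"]))    # ~0.0 -> consistent, likely not hallucinated
    print(semantic_entropy(["Paris", "Lyon", "Marseille"]))  # ~1.1 -> scattered, likely low confidence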
In our tests, the next model update performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology. We also found that it excels in math and coding. In a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o correctly solved only 13% of problems, while the reasoning model scored 83%. Their coding abilities were evaluated in contests and reached the 89th percentile in Codeforces competitions.
This new model by OpenAI (which, judging by the branding and marketing, seems to be a new foundation model not based on the prior work of GPT-4) has one new feature: it "thinks" before giving an output. In the sample videos they shared demoing the product, this does seem to be the case, and I'm cautiously optimistic that it might significantly reduce the number of hallucinations and errors the model makes, but I do find it suspicious that they're only rolling it out to their highest rollers in the API first.
Also: Isn't it kinda weird that they assured everyone GPT-4o was called that because it was "omnimodal", but now they're saying that o1 is such a departure and increase in capability that they're restarting the numbering system and completely abandoning the GPT-number naming convention, despite o1 being monomodal (text only)? What's the O stand for now, just OpenAI?
I do find it suspicious that they're only rolling it out to their highest rollers in the API first.
To be honest, when they speak of
This is a preview and we expect regular updates and improvements.
they probably just don’t want “the masses”, meaning normal small-scale API or even web-interface users, to be disappointed by an early-ish version of a completely new thing. It might be slower, it might be more costly, it might be incorrect in edge cases (despite people being led to believe it’s not). It doesn’t support streaming and allows fewer requests per minute, which could disappoint users accustomed to GPT-4o – we’ll see in the coming weeks how it performs and improves, I presume.
I for one am pretty excited about this. I don’t use the tech much, but favoring reasoning over quick/hallucinated output seems like a great idea.
The increase in performance is impressive, and the way it "thinks" to itself under the hood is intriguing (it literally goes "Hmm..." at points!). But I worry about the massive increase in computational cost -- not just in training, but in deployment. For example, its solution to the cipher problem is a reasonable 1968 characters/805 tokens, but that's just a summary -- the chain of thought process behind it is a whopping 14,800 characters/5334 tokens. Ramp that approach up to ChatGPT scale, even just subscribers, and it could more than erase any efficiency gains they've made in terms of power consumption.
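To put rough numbers on that (back-of-the-envelope only, and assuming generation cost scales more or less linearly with tokens emitted):

    # Using the cipher example's own figures: hidden chain of thought vs. visible answer.
    summary_tokens = 805    # the visible answer
    cot_tokens = 5334       # the hidden chain of thought behind it

    total = summary_tokens + cot_tokens
    print(f"{total} tokens generated, {total / summary_tokens:.1f}x the visible answer alone")
    # -> 6139 tokens generated, 7.6x the visible answer alone
    # i.e. roughly 7.6x the generation compute for the same amount of text the user sees.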