I'm back, the teacher with another article and some ramblings to say.
With all the talk about the benchmarks that new LLM models beat on release, I finally got around to seeing how LLMs do a year after I last checked.
A year ago I tested OpenAI’s o1 pro on real CEMC Problems of the Week and came away thinking secondary math was still “mostly safe.” I reran the experiment this year with GPT-5 on a new set, and I no longer doubt that frontier LLMs can now just truck through standard curriculum math, while education is mostly reacting as if we’re still in last year’s world. Very few institutions are running their own local benchmarks or “job interviews” for these tools.
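If you want to run that kind of local check yourself, the loop really is small. Here is a minimal sketch using the OpenAI Python client; the model name, file name, and problem format are placeholders of mine, and grading is left to a human reading the output:

```python
# Minimal local-benchmark sketch: run your own problem set past a model
# and compare against your reference answers by hand.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# problems.json: [{"question": "...", "answer": "..."}, ...]  (hypothetical format)
with open("problems.json") as f:
    problems = json.load(f)

for i, p in enumerate(problems, start=1):
    resp = client.chat.completions.create(
        model="gpt-5",  # placeholder; use whichever model you are evaluating
        messages=[
            {"role": "system", "content": "Solve the problem. Put the final answer on the last line."},
            {"role": "user", "content": p["question"]},
        ],
    )
    reply = resp.choices[0].message.content
    print(f"--- Problem {i} ---")
    print("Model said:", reply.strip().splitlines()[-1])
    print("Reference :", p["answer"])
```

The point is less the code than the habit: feed it your own problems, kept off the internet, and grade it the way you would grade a student.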
As before, I’m mostly looking for readers, different perspectives, and whatever really. I’ve been very busy the past few months, so I’m happy I got to truck this out.
I think the following quote is an interesting point, and likely where the future of education (or at least self-guided exercises) in this new context lies:
If different students pick different AI “vibes” and get different types of explanations, hints, and levels of hand-holding, we will need to think carefully about equity, scaffolding, and what we count as independent work. The same underlying model might behave like a patient tutor for one student and an efficiency-obsessed problem solver for another.
Personally, I have been using ChatGPT 5 to work through exercise sheets on theoretical computer science (Turing machines, finite automata, register machines, Rice's theorem, and much more), largely because the material I have been provided with is in a language I do not yet speak well (German), and because it consists of slides which, in many cases, are enough to prompt a lecturer but which, without said lecturer, do not give the entire picture. The amount of time it takes for the LLM to insist on providing you with its own, fully fleshed-out answer to any problem stated is infuriatingly short. And this is without any form of vibes manipulation; this is standard behaviour.
It has been an incredibly useful tool for translation (once I have transcribed the slides; it still struggles mightily with the unusual layouts and mathematical formulae), and for trying to understand a concept and produce my own notes on it. I can see it becoming a standard in learning contexts, especially when it helps level the playing field for people facing issues such as language barriers, but there is a marked difference in utility between using the tool to further your understanding and asking it to simply solve the problems for you. Which of these it does seems ultimately to be left to the honesty of the student, or at least to their understanding that having the answers handed to you will not help you later in the curriculum.
I imagine that with the right prompting, it should be possible to build a decent math tutor that doesn’t tell students the answer? For example, Khan Academy has built Khanmigo, which promises “no answers.” I haven’t tried it.
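For what it's worth, the bare-bones version of that idea is just a system prompt that withholds final answers. A sketch of what I mean, using the OpenAI Python client; the prompt wording and model name are my own guesses, and a real product like Khanmigo presumably does far more than this:

```python
# Bare-bones "no answers" tutor sketch: a system prompt that withholds
# final answers and nudges with questions instead. Prompt wording is my own.
from openai import OpenAI

client = OpenAI()

TUTOR_PROMPT = (
    "You are a math tutor. Never state the final answer or a complete solution. "
    "Ask one guiding question at a time, point out errors in the student's "
    "reasoning, and give a small hint only if the student is stuck."
)

history = [{"role": "system", "content": TUTOR_PROMPT}]

def ask_tutor(student_message: str) -> str:
    history.append({"role": "user", "content": student_message})
    resp = client.chat.completions.create(
        model="gpt-5",  # placeholder model name
        messages=history,
    )
    reply = resp.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

print(ask_tutor("How do I solve 3x + 5 = 20? Just give me x."))
```

Of course, a prompt-level guardrail like this is easy for a determined student to talk around, which is why the honesty point above still carries most of the weight.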
There seem to be lots of tools out there! Reviewing them to figure out which ones are best would be a lot of work.
I mean, I use these things to shit-test my math ideas before they make it into papers. "Hey, this theorem seems interesting. Can you prove or disprove it?" [A few attempts, all of which the model...
I mean, I use these things to shit-test my math ideas before they make it into papers.
"Hey, this theorem seems interesting. Can you prove or disprove it?"
[A few attempts, all of which the model discards itself.]
"Ok, well, here's a line of thinking that could lead somewhere:"
[actually reasonably fleshed out theory].
Seriously. These things are that good.
To be fair, I'm doing the math as part of other studies, so this is applied math territory I suppose.
As always, double-check LLM results before you embarrass yourself in peer review. But honestly? The fact that I can (ab)use these models like a research assistant, throw half-cooked ideas at them, and expect a halfway cromulent proof back? Baffling two years ago, but we're there.
One thing I'd say is that I am not entirely sure it is the LLMs themselves that have gotten better at math. Rather, from my understanding, a lot of the commercial models do tool calling in the background for things they are not natively good at. I might be wrong there, though, or it might be more complex. But I figured I'd mention it as context anyway.
This is an interesting one, actually! I’m not fully up to date on it (did a bit of work with a team who were competing for the AIMO prize, but that was last year, so basically a lifetime ago in this field), but the balance between tool calling and LLM parsing actually skews (or did skew, at least) pretty heavily towards the latter for mathematical problems.
Most successful approaches absolutely did use code generation and a python interpreter for the actual arithmetic - but basic arithmetic, and even not-too-complicated symbolic algebra, are pretty much solved problems for computers. For a problem to be challenging at all to an LLM with tool calling abilities, you’re inherently testing its capacity to parse, “understand”, and reformulate the conceptual problem in order to use those tools effectively.
It’s similar to allowing calculators in an exam: we could spend time doing long division on paper, and similarly we could spend time training LLMs to do accurate arithmetic and symbolic manipulation internally, but for most real-world tests it’s fair to assume that tools will be available to assist. The questions are then formulated to test understanding rather than mechanical skills: did you, or did the LLM, select the right numbers to put into the calculator (if there are numbers in the answer at all)? The only way to do so is to interpret the question correctly, which puts the onus on the human or LLM rather than on the calculator or python runtime.
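To make the division of labour concrete, here is a toy sketch of the tool side. The interesting work is choosing the equation, which I have hand-written here to stand in for what the model would emit after parsing a word problem; sympy just grinds out the answer:

```python
# Toy illustration of the calculator analogy: the hard part is choosing
# the right expression; evaluating it is a solved problem for the machine.
import sympy as sp

x = sp.symbols("x")

# Stand-in for what an LLM would emit after parsing a word problem like
# "A number tripled and increased by 7 equals 2026. What is the number?"
equation = sp.Eq(3 * x + 7, 2026)

solution = sp.solve(equation, x)
print(solution)  # [673]
```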
One of the unexpectedly tricky bits is actually getting decent questions to benchmark with, though! LLMs are generally extremely good at recalling things from their training data, and it’s natural to train a mathematically focused LLM on any question-answer pairs you can find. That means that if you’re testing at the high-school or early-university difficulty level, you’ll have to write and test with completely new questions that have never been published on the internet if you want a baseline of how well the model can actually generalise concepts. If you don’t do that, you’re likely to end up testing recall more than generalisation - which is worthwhile in itself, as long as you’re aware that’s what you’re doing, but performance will fall off a cliff once the model hits something it doesn’t have a close example for encoded in its training data.
My understanding is that this tool-calling is usually explicit in the output, though perhaps collapsed-by-default. With the providers I'm using anyway, that was my impression, though of course it could be done on the down-low too.
But I'd be almost certain that they're being trained using such tools. Like, simply generating symbolic algebra problems, throwing them into a computer algebra system, and then training an LLM to do the same thing is very low-hanging fruit, but could pay dividends on all kinds of other problems of interest.
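As a sketch of that low-hanging fruit, with sympy standing in for the computer algebra system and the prompt/completion format being just my guess at what such training pairs might look like:

```python
# Sketch of the "generate problems, let a CAS answer them" idea:
# random symbolic integrands paired with CAS-computed antiderivatives,
# dumped as prompt/completion pairs an LLM could be trained on.
import json
import random
import sympy as sp

x = sp.symbols("x")

def random_polynomial(max_degree: int = 3) -> sp.Expr:
    degree = random.randint(1, max_degree)
    coeffs = [random.randint(-9, 9) for _ in range(degree + 1)]
    return sum(c * x**i for i, c in enumerate(coeffs))

pairs = []
for _ in range(5):
    expr = random_polynomial()
    answer = sp.integrate(expr, x)  # the CAS does the actual work
    pairs.append({
        "prompt": f"Integrate {sp.sstr(expr)} with respect to x.",
        "completion": sp.sstr(answer),
    })

print(json.dumps(pairs, indent=2))
```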
It’s difficult to separate tool use from the LLM when you look at how it works in training. The training corpus contains saved outputs from the tools, so, much in the same way that good students use LLMs to improve their understanding, LLMs use their tools to improve their understanding even when they aren’t actively calling them.
That's more true for things like proofs, but LLMs are not calculators. They're trained on plenty of math, yet they frequently get basic arithmetic wrong. They don't have any deeper "understanding"...
That's more true for things like proofs, but LLMs are not calculators. They're trained on plenty of math, yet they frequently get basic arithmetic wrong. They don't have any deeper "understanding" of math. If they did, they'd be able to consistently and correctly apply basic mathematical operations.