One thing I'd say is that I am not entirely sure it is the LLMs themselves that have gotten better at math. Rather, from my understanding, a lot of the commercial models do tool calling in the background for things they are not natively good at. I might be wrong there though, or it might be more complex. But I figured I'd mention it as context anyway.
This is an interesting one, actually! I’m not fully up to date on it (did a bit of work with a team who were competing for the AIMO prize, but that was last year, so basically a lifetime ago in this field), but the balance between tool calling and LLM parsing actually skews (or did skew, at least) pretty heavily towards the latter for mathematical problems.
Most successful approaches absolutely did use code generation and a python interpreter for the actual arithmetic - but basic arithmetic, and even not-too-complicated symbolic algebra, are pretty much solved problems for computers. For a problem to be challenging at all to an LLM with tool calling abilities, you’re inherently testing its capacity to parse, “understand”, and reformulate the conceptual problem in order to use those tools effectively.
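To give a feel for the division of labour (this is just my own sketch, not any particular team's pipeline), the code the model generates is usually doing the boring part:

```python
# The "mechanical" work an LLM typically delegates to a Python interpreter:
# exact arithmetic and routine symbolic algebra via SymPy.
from sympy import symbols, solve, simplify, Rational

x = symbols("x")

# Root-finding - trivial for a CAS once the right equation has been set up
print(solve(x**2 - 5*x + 6, x))          # [2, 3]

# Exact rational arithmetic, no floating-point slop
print(Rational(3, 7) + Rational(2, 21))  # 11/21

# Routine simplification
print(simplify((x**2 - 1) / (x - 1)))    # x + 1
```

The hard part - the part actually being tested - is deciding that these are the right expressions to hand over in the first place.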
It’s similar to allowing calculators in an exam: we could spend time doing long division on paper, and similarly we could spend time training LLMs to do accurate arithmetic and symbolic manipulation internally, but for most real-world tests it’s fair to assume that tools will be available to assist. The questions are then formulated to test understanding rather than mechanical skills: did you, or did the LLM, select the right numbers to put into the calculator (if there are numbers in the answer at all)? The only way to do so is to interpret the question correctly, which puts the onus on the human or LLM rather than on the calculator or python runtime.
One of the unexpectedly tricky bits is actually getting decent questions to benchmark with, though! LLMs generally have an extremely good ability to recall things from their training data, and it’s natural to train a mathematically-focused LLM on any question-answer pairs you can find. That means that if you’re testing at the high school or early university difficulty level, you’re going to have to write and test with completely new questions that have never been seen or published on the internet if you want a baseline of how well the model can actually generalise concepts. If you don’t do that, you’re likely to end up testing recall more than generalisation - which is worthwhile in itself, as long as you’re aware that’s what you’re doing, but will fall off a cliff once the model hits something it doesn’t have a close example for encoded in the training data.
My understanding is that this tool-calling is usually explicit in the output, though perhaps collapsed-by-default. With the providers I'm using anyway, that was my impression, though of course it could be done on the down-low too.
But I'd be almost certain that they're being trained using such tools. Like, simply generating symbolic algebra problems, throwing them into a computer algebra system, and then training an LLM to do the same thing is very low-hanging fruit, but could pay dividends on all kinds of other problems of interest.
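To make the low-hanging-fruit idea concrete, here's a rough sketch of what I mean - generate problems, let a CAS produce the reference answer, and keep the pairs as training data. The problem type and coefficient ranges are made up for illustration; I have no idea what any lab actually does.

```python
# Sketch only: mint (problem, answer) pairs where a CAS supplies the target
# output. The problem family and ranges here are illustrative, not anyone's
# actual training recipe.
import random
from sympy import symbols, expand, factor, latex

x = symbols("x")
NONZERO = [n for n in range(-9, 10) if n != 0]

def make_pair(rng: random.Random) -> tuple[str, str]:
    # Random factorable quadratic: (a*x + b) * (c*x + d)
    a, c = rng.choice(NONZERO), rng.choice(NONZERO)
    b, d = rng.randint(-9, 9), rng.randint(-9, 9)
    expr = expand((a * x + b) * (c * x + d))
    question = f"Factor: {latex(expr)}"
    answer = latex(factor(expr))  # the CAS is the source of truth
    return question, answer

rng = random.Random(0)
for _ in range(3):
    question, answer = make_pair(rng)
    print(question, "->", answer)
```

Scale that across enough problem families and you get as much verified supervision as you care to generate.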
It’s difficult to separate tool use from the LLM when you look at how training works. The training corpus contains saved outputs from the tools, so, much in the same way that good students use LLMs to improve their understanding, LLMs use their tools to improve their understanding even when they aren’t actively using them.
That's more true for things like proofs, but LLMs are not calculators. They're trained on plenty of math, yet they frequently get basic arithmetic wrong. They don't have any deeper "understanding" of math. If they did, they'd be able to consistently and correctly apply basic mathematical operations.
From my experience, which is limited as I can do basic arithmetic faster in my head than pulling out a device, they really don't get the basics wrong much anymore. Admittedly this is just a vibe check on my end, but it feels like the errors are getting fewer and fewer over time.
Here's a blog post about why they tend to get 3.10 - 3.9 wrong, especially in English. I personally wouldn't trust them, especially when calculators are so ubiquitous.
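For what it's worth, here are the two readings side by side (my own illustration, not something from the post): read as decimal numbers the difference is negative, while a version-number reading, where 3.10 comes after 3.9, points the other way.

```python
from decimal import Decimal

# Read as decimal numbers, 3.10 is just 3.1, so the difference is negative
print(Decimal("3.10") - Decimal("3.9"))  # -0.80

# Plain floats agree, up to the usual binary rounding noise
print(3.10 - 3.9)                        # -0.7999999999999998

# Read version-style, (3, 10) sorts *after* (3, 9), suggesting a positive
# answer - one plausible source of the mix-up
print((3, 10) > (3, 9))                  # True
```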
I actually wrote in the past about some of the reasons they may get that wrong. I find it mostly held true for older models from a year or two ago, and the error has largely gone away with newer models.
I'm back - the teacher - with another article and some ramblings to share.
With all the talk about benchmarks being beaten every time a new LLM is released, I finally got around to seeing how LLMs do a year after I last checked.
A year ago I tested OpenAI’s o1 pro on real CEMC Problems of the Week and came away thinking secondary math was still “mostly safe.” I reran the experiment this year with GPT-5 on a new set, and I no longer doubt that frontier LLMs can now just truck through standard curriculum math, while education is mostly reacting as if we’re still in last year’s world. Very few institutions are running their own local benchmarks or “job interviews” for these tools.
As before, I’m mostly looking for readers, different perspectives, and whatever really. Been very very busy the past few months, happy I got to truck this out.
I mean, I use these things to shit-test my math ideas before they make it into papers.
"Hey, this theorem seems interesting. Can you prove or disprove it?"
[A few attempts, all of which the model discards itself.]
"Ok, well, here's a line of thinking that could lead somewhere:"
[actually reasonably fleshed out theory].
Seriously. These things are that good.
To be fair, I'm doing the math as part of other studies, so this is applied math territory I suppose.
As always, double-check LLM results before you embarrass yourself in peer review. But honestly? The fact that I can (ab)use these models like a research assistant, throw half-cooked ideas at them, and expect a halfway cromulent proof back? Baffling 2 years ago, but we're there.
It's moving so fast that a mental model of LLM capabilities from even 6 months ago is completely out of date. I assume there will be a plateau at some point, but so far all the claims of impending plateaus have been proven wrong by the next generation of frontier models.
And it's not just math, or audio, or brainstorming, or video, or coding, or learning, or research, or (pseudo) reasoning; it's all of the above and a lot more besides. It looks more and more like it will prove to be the most impactful advancement in the digital revolution so far, with all the exciting and frightening upsides and downsides that go along with that.
Another way to look at it is AI doesn't have to be accurate to give you a list of reasons why an idea is bad. You're the smart one and can easily tell which of those reasons are valid. So in certain domains and situations, you don't care if AI gives you bad answers. It's the variety you want, so you can find any hidden good answers that you yourself overlooked.
I do this now in software. I'll write code, then bias the AI with a question like "what did I do wrong?" Or "why is this sometimes not working?"
It'll always answer as if the code is bad and explain why I'm seeing bugs. Most of the time it's hallucinating in order to agree with me, and the critique makes no sense. I can easily tell that and throw it away. Other times it'll surprise me with a bug that it predicted and I didn't, which I can easily tell is legit.
When I'm asking about things I don't know as much about, I'm a lot more careful to ask for background and citations. I try (increasingly hard) to back up what I'm told with human sources.
AI works best when either:
I say that I only use AI when I already know the answer or don't care if it's right.
Also true of web search?
Eh, search can also be about discoverability. There's more than one reason someone might perform a web search.
I think the following quote is an interesting point, and likely where the future of education (or at least self-guided exercises) in this new context lies:
If different students pick different AI “vibes” and get different types of explanations, hints, and levels of hand-holding, we will need to think carefully about equity, scaffolding, and what we count as independent work. The same underlying model might behave like a patient tutor for one student and an efficiency-obsessed problem solver for another.
Personally, I have been using ChatGPT 5 to work through exercise sheets on theoretical computer science (Turing machines, finite automata, register machines, Rice's theorem, and much more), largely because the material I have been provided with is in a language I do not yet speak well (German), and because it consists of slides which, in many cases, are enough to prompt a lecturer but do not provide the entire picture without one. The amount of time it takes for the LLM to insist on providing its own fully fleshed-out answer to any problem you state is infuriatingly short. And this is without any form of vibes manipulation; this is standard behaviour.
It has been an incredibly useful tool for translation (once I have transcribed the slides - it still struggles mightily with the unusual layouts and mathematical formulae), and for trying to understand a concept and produce my own notes on it. I can see it becoming a standard in learning contexts, especially when it helps level the playing field for people facing issues such as language barriers, but there is a marked difference in utility between using the tool to further your understanding and asking it to simply solve the problems for you. Which of these it does seems ultimately to be left up to the honesty of the student, or at least to their understanding that having the answers handed to you will not help you later on in the curriculum.
I imagine that with the right prompting, it should be possible to build a decent math tutor that doesn’t tell students the answer? For example, Khan Academy has built Khanmigo, which promises “no answers.” I haven’t tried it.
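As a minimal sketch of what I have in mind (using the OpenAI Python client; the model name and prompt wording are placeholders of mine, not Khanmigo's actual approach):

```python
# A "no answers" tutor where all of the behaviour lives in a system prompt,
# so it's only as strong as the model's willingness to comply with it.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TUTOR_PROMPT = (
    "You are a math tutor. Never state the final answer. "
    "Ask one guiding question at a time, check the student's reasoning, "
    "and only confirm or gently challenge their own steps."
)

def tutor_turn(student_message: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": TUTOR_PROMPT},
            {"role": "user", "content": student_message},
        ],
    )
    return response.choices[0].message.content

print(tutor_turn("Solve 2x + 6 = 14 for me."))
```

The obvious weakness is that a determined student can simply ask it to drop the act, which is presumably why purpose-built tools layer more than a prompt on top.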
There seem to be lots of tools out there! Reviewing them to figure out which ones are best would be a lot of work.
I've been working through Harvard's online CS50 course, and they have an AI assistant designed as a sort of enhanced rubber duck debugger. I haven't used it much, but the few times I have, it has kept asking leading questions rather than giving answers about the problem sets.
Also, they have a 'stamina' system set up with it such that you can only ask so many questions in a given period of time. I think the idea is that it forces students to think a bit more before asking the AI, but in practice, I think it could be even stricter.
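I don't know how CS50 actually implements it, but the 'stamina' idea is essentially a slowly regenerating allowance; a toy version (with made-up numbers) might look like:

```python
# Toy "stamina" gate: each question costs one point, and points regenerate
# slowly over time. The numbers are invented, not CS50's actual policy.
import time

class Stamina:
    def __init__(self, max_points: int = 10, regen_seconds: float = 180.0):
        self.max_points = max_points
        self.regen_seconds = regen_seconds  # one point back every N seconds
        self.points = float(max_points)
        self.last = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.points = min(self.max_points,
                          self.points + (now - self.last) / self.regen_seconds)
        self.last = now

    def try_ask(self) -> bool:
        self._refill()
        if self.points >= 1.0:
            self.points -= 1.0
            return True
        return False  # the student has to wait (and, ideally, think) a bit

gate = Stamina()
print(gate.try_ask())  # True - the question goes through to the assistant
```

Making it stricter is just a matter of lowering max_points or raising regen_seconds.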
ChatGPT, for example, has Study Mode. However, it's really just another mode with a new system prompt.
Benjamin Breen wrote a bit about it, and I mentioned it briefly, and I tend to agree with him: it's too agreeable. I learn best with a disagreeable teacher who pushes me, rather than the fairly sycophantic route most LLMs take.
I’m not an educator, and given the challenges you face I really wouldn’t want to be, but as someone who enjoys math, I feel the best way to prevent cheating is to show students what math can really do.
LLMs can do your homework, but so can Wolfram Alpha and that’s been around much longer. The one thing they can’t do is give you intuition or insight into how to apply math concepts. Being able to combine multiple concepts you’ve learned to “discover” the next step is something that comes from personal experience with the underlying concepts.
My issue with much of math education is that you rarely get the chance to make these discoveries on your own until you get to more advanced classes. There is certainly room for some self discovery in middle and high school math classes. If more students got to experience the rush of figuring something out before the teacher told them, I’m guessing we’d have a lot more students who enjoyed math and were slightly more willing to put in the work to master certain concepts.
What you are describing is effectively the inquiry approach to Mathematics, which works well as long as your foundation is solid.
Unfortunately, as many schools progress students by grade (rather than by ability level per subject), it can cause a lot of learning issues for a decent percentage of your class if the school isn't rigorously set up, with the proper supports, for that approach to learning.
I have a lot of opinions on how K-12 math is taught in the US.
I went to private schools for middle and high school, so I understand that I had a lot more resources available than the majority of students, but I was able to graduate high school at a math level above the state minimums.
In middle and high school, you can take a variety of different classes without being “held back” or “skipping a grade”. If we were more strict about what ability level you need to reach to move to the next math class, more students could master the fundamentals even if they didn’t get to take calculus in high school.
If all students could graduate high school with a solid understanding of Algebra and Trigonometry, they would have a much easier time in general life and college than someone who has a shaky grasp on calculus and its prerequisite math concepts.
There's a parallel in literacy education. Fourth graders in Mississippi recently beat their peers in Minnesota in reading on the NAEP. Black students in Mississippi have also seen big gains. All this despite Mississippi being the poorest state.
The main interventions Mississippi made: holding back third graders who don't meet reading proficiency, embracing phonics, and rejecting whole-language theory. See this article on the Mississippi Miracle.
In liberal states, I think that good intentions around equity have led to corrupting incentives that cause schools to prioritize gaming metrics instead of actually educating their students.
Schools passing along unprepared students so they can claim good passing and graduation rates.
Colleges dropping SAT/ACT requirements in order to admit more students of color on the theory that SAT/ACT scores don't indicate academic readiness... but this has led to decreased academic performance, because scores do predict readiness. This train wreck is still in motion: I think it's a disservice to students to admit them into programs they're not prepared to fully utilize, and the consequences will unfold in the years ahead.
San Francisco school district getting rid of 8th grade algebra in the name of equity.