15 votes

GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models

12 comments

  1. [12]
    skybrian
    Link

    This is a paper that was discussed several times on Hacker News, but I just saw it today. From the paper’s summary:

    To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn't contribute to the reasoning chain needed for the final answer.

    Perhaps ignoring irrelevant clauses is something that LLMs could get better at with the right architecture?
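
    For anyone curious what the “symbolic templates” look like in practice, here’s a rough sketch of how I imagine the variant generation works. The template wording, the variable ranges, and the distractor clause below are my own invention, not taken from the paper:

    ```python
    import random

    # Hypothetical GSM-style template: the name and numbers are re-sampled for
    # each instantiation, so the surface form changes while the reasoning
    # needed to solve the problem stays the same.
    TEMPLATE_BODY = (
        "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
        "{name} gives {z} apples to a friend."
    )
    # An irrelevant clause in the spirit of the "single distracting clause"
    # experiment: it mentions apples but changes nothing about the count.
    DISTRACTOR = " {d} of the apples turn out to be slightly smaller than average."
    QUESTION = " How many apples does {name} have left?"

    def instantiate(seed=None, add_distractor=False):
        rng = random.Random(seed)
        name = rng.choice(["Peter", "Sarah", "Liam", "Mia"])
        x, y = rng.randint(5, 50), rng.randint(5, 50)
        z = rng.randint(1, x + y)  # keep the answer non-negative
        text = TEMPLATE_BODY.format(name=name, x=x, y=y, z=z)
        if add_distractor:
            text += DISTRACTOR.format(d=rng.randint(1, z))
        text += QUESTION.format(name=name)
        return text, x + y - z  # ground truth computed from the template

    q, a = instantiate(seed=0, add_distractor=True)
    print(q)
    print("Expected answer:", a)
    ```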

    7 votes
    1. [3]
      carsonc
      Link Parent

      Getting dismissed by ChatGPT on the basis of perceived irrelevance would be hilarious.
      "I can't delve into your question right now because there doesn't seem to be anything relevant in your prompt. Perhaps you can reword it to establish a higher degree of relevance."

      The article doesn't seem to treat the issue of negation in LLMs. Is this also <ahem> irrelevant?

      2 votes
      1. [2]
        skybrian
        Link Parent

        The paper is about a new benchmark that adds irrelevant clauses to math problems as a distraction. (It also changes the specific numbers used in the problems.) It isn’t really about chatting in general.

        5 votes
        1. carsonc
          Link Parent

          Even in the context of logical reasoning, I would think that an LLM's insensitivity to negation would be problematic. For example, I can imagine that wording a problem as "Peter has fewer apples than Sarah" might get a different response than "Peter does not have as many apples as Sarah", even though both statements can play the same, necessary role in a word problem.

          I didn't see it explored in the paper, but I wonder if it would affect the outcomes.
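
          If someone wanted to check, a minimal probe might just be pairs of problems that differ only in how the comparison is phrased. The phrasings below are mine rather than the paper's, and ask_model is a stand-in for whatever LLM call you would actually use:

          ```python
          # Minimal sketch of a negation probe: the same word problem stated with
          # "fewer than" versus "does not have as many as". Both phrasings should
          # yield the same answer; a gap between them would point to the kind of
          # negation insensitivity described above.
          PAIRS = [
              (
                  "Sarah has 12 apples. Peter has 5 fewer apples than Sarah. "
                  "How many apples does Peter have?",
                  "Sarah has 12 apples. Peter does not have as many apples as Sarah; "
                  "he has 5 fewer. How many apples does Peter have?",
              ),
          ]

          def negation_gap(ask_model):
              """Fraction of pairs where the two phrasings get different answers."""
              mismatches = sum(
                  ask_model(direct).strip() != ask_model(negated).strip()
                  for direct, negated in PAIRS
              )
              return mismatches / len(PAIRS)
          ```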

          2 votes
    2. [8]
      ogre
      Link Parent

      Would the right architecture be something other than transformers?

      1 vote
      1. [3]
        skybrian
        Link Parent

        Possibly. I also think it’s fairly common to try to use LLMs as part of a larger program. Their pattern-matching ability might still be useful when taking a different approach.

        3 votes
        1. [2]
          ogre
          Link Parent

          Do you know of any LLM math use cases? I don’t see why processing unstructured text would be useful for calculations, specifically because of the variance in results.

          1 vote
          1. skybrian
            Link Parent

            Explaining math problems can be useful for education, even if the answers are sometimes wrong. Hints are often useful when you have some other way of deciding whether they’re true.

            But here it’s used as a simplified benchmark for logical reasoning. The logical reasoning used in math problems is similar to what we use to understand computer programs, and that has direct applications. (Even though the results are often wrong, too.)

            4 votes
      2. [4]
        sparksbet
        Link Parent

        Certainly not on its own, at least nothing that currently exists. The only reason LLMs can do any mathematical reasoning at all is the sheer size of the training data and the sheer number of parameters. Models prior to transformers would stop improving at a certain number of parameters, which has not seemed to be the case for transformer-based models in the same way. Iirc it was a big deal when GPT-2 could do a bit of simple math.

        3 votes
        1. [3]
          Greg
          Link Parent

          There’s a bit of promising research going on around combining state space models (for more efficient handling of longer/denser context windows) and Kolmogorov-Arnold networks (for a “cleaner”, more interpretable way of getting outputs), but it’s very early days and I’m not aware of anyone having done it at real scale yet. KANs didn’t even exist six months ago, so to say it’s untested waters would be an understatement!

          I did some work on mathematical reasoning with normal, transformer-based LLMs earlier in the year, and the gains from fine-tuning at each intermediate step of a solution rather than the answer as a whole were significant, so my suspicion is that anything facilitating that approach will be useful. Gut feeling tells me that a mamba-style model might lend itself well to doing that kind of "generate a whole tree of answers and then predict which path has the highest chance of being right" work, but I don't have any actual data to back that up yet.

          And to pull my little ramble back to @ogre’s original question about whether transformers are the right approach: maybe, maybe not, but the thing we have seen so far is that the approach to training makes a huge difference in this type of work. Significantly more than the actual model architecture, in my experience.
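
          For what it's worth, the "tree of answers" idea I'm hand-waving at looks roughly like the sketch below. score_step is a stand-in for whatever learned verifier you would actually train, and the brute-force enumeration is only there to keep the example short; a real system would use beam search or sampling:

          ```python
          from itertools import product

          def best_path(candidate_steps, score_step):
              """Pick the solution path whose per-step scores multiply to the highest value.

              candidate_steps: list of lists; candidate_steps[i] holds the alternative
              continuations sampled for step i of the solution.
              score_step(context, step): stand-in for a verifier that returns the
              probability that `step` is correct given the steps taken so far.
              """
              best, best_score = None, float("-inf")
              for path in product(*candidate_steps):
                  score, context = 1.0, []
                  for step in path:
                      score *= score_step(context, step)
                      context.append(step)
                  if score > best_score:
                      best, best_score = list(path), score
              return best, best_score
          ```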

          3 votes
          1. [2]
            sparksbet
            Link Parent

            Ooh, I'll have to look into KANs, this is my first time hearing about them!

            Ultimately I think LLMs on their own aren't really well suited to solving math problems. The question is what we do when we have a task that has some level of basic mathematical reasoning as a prerequisite. imo at this stage, it's almost always going to be better to use a language model just for extracting relevant information from the natural language problem (probably fine-tuning it to an extent on how to do this, depending on the expected input, since as you say the specific training does matter a lot for these things) and then using something that isn't machine learning to ultimately do the math. But ofc that depends a lot on what your ultimate goal is when it comes to this -- whether you're trying to automate a task or to expand what LLMs are capable of doing.
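
            To make that concrete, something like the sketch below is what I have in mind. The JSON schema is made up and the extraction model is left out entirely; the point is just that the arithmetic itself never touches the network:

            ```python
            import json

            # The language model's only job is to turn the word problem into a small
            # structured record (schema invented for this sketch); plain Python does
            # the arithmetic afterwards.
            def solve(record_json: str) -> float:
                record = json.loads(record_json)
                total = record["start"]
                for op, value in record["operations"]:
                    if op == "add":
                        total += value
                    elif op == "subtract":
                        total -= value
                    else:
                        raise ValueError(f"unknown operation: {op}")
                return total

            # e.g. what the extraction step might emit for
            # "Sarah has 12 apples, eats 2, and buys 5 more. How many does she have?"
            example = '{"start": 12, "operations": [["subtract", 2], ["add", 5]]}'
            print(solve(example))  # 15
            ```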

            3 votes
            1. Greg
              Link Parent

              Yeah, I'm with you on that! The most successful examples I've seen all start fine-tuning from a model that's already been trained for coding, and give it access to a python interpreter to execute the actual manipulation; whether the model can successfully translate a complex problem from natural language into a formalised representation is by far the more interesting question to me than whether it can then do the subsequent arithmetic. Although to be fair I'm primarily familiar with this problem space from a "research into techniques for improving LLMs themselves and neural nets more broadly" angle rather than "this may have an immediate use case", which definitely colours my perception.

              It's an interestingly tough question even just having a model that can accurately do that translation from normal text to something interpretable (code, formal logic, symbolic representation, etc.) - like, "$2M bounty for whoever can hit a specific milestone by April" kind of tough - so I'm reasonably confident that a better understanding of how to bake in this kind of reasoning will improve LLMs in general.
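
              As a toy illustration of the hand-off I mean: the equations below are a made-up example of what a model might emit for a word problem, and SymPy stands in for the interpreter step that does the actual manipulation.

              ```python
              from sympy import Eq, solve, symbols

              # Hypothetical model output for: "Peter and Sarah have 19 apples between
              # them, and Sarah has 5 more than Peter. How many does Peter have?"
              peter, sarah = symbols("peter sarah")
              equations = [Eq(peter + sarah, 19), Eq(sarah, peter + 5)]

              # The model never does the arithmetic; the solver does.
              print(solve(equations, [peter, sarah]))  # {peter: 7, sarah: 12}
              ```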

              Ooh, I'll have to look into KANs, this is my first time hearing about them!

              The repo and docs that accompany the original paper are one of the nicest examples of academic coding I've seen, actually! Well written examples and explanations, functions baked right into the codebase to understand and visualise what's happening in your own networks, the whole nine yards.

              The vast majority of what I see nowadays is published with zero documentation beyond the paper and is an absolute mess of copy-paste coding and half-finished paths from experiments that never went anywhere. I don't blame the academics for that - time is always short, and I'd rather spend a week deciphering their spaghetti code than six months trying to reimplement it from scratch with no guarantee of success - but it's a breath of fresh air to see someone who was able to put all the polish on their release.

              1 vote