17 votes

No, GPT4 can’t ace MIT - a critical analysis of “Exploring the MIT Mathematics and EECS Curriculum Using Large Language Models”

6 comments

  1. [4]
    boxer_dogs_dance
    Link

    This article critically examines a recent paper claiming that ChatGPT aced an exam at the Massachusetts Institute of Technology, and sets out to disprove that finding. It reminds me of https://en.wikipedia.org/wiki/Clever_Hans, the famous horse.

    12 votes
    1. [3]
      paper_reactor
      Link Parent

      Definitely reminiscent of Clever Hans. Another issue I have with a lot of AI-driven papers, including the ChatGPT research, is to what extent training a large language model really constitutes research. Training an LLM is of course required for application, but does that really make it research in the grand scheme of things? I haven't read the original paper, and I know it's on arXiv, which isn't peer reviewed, but the AI hype is a little crazy. I saw something not too long ago where someone used machine learning to generate synthetic data from a data set (how does adding noise to an existing data set, just to have more of it, lead to better results?), and I didn't see the real significance in it.
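
      For what it's worth, the usual argument for the noise trick is that jittered copies of the training data act as a regularizer rather than as genuinely new information. Here's a rough sketch of that effect on a toy problem (the data, model, and numbers are all made up for illustration, not taken from the paper I saw):

          import numpy as np

          # Toy illustration of "add noise to get more data": jittered copies
          # act like a regularizer for an overfit-prone model. Everything here
          # is invented for the sketch.
          rng = np.random.default_rng(0)

          # 1-D regression data: y = sin(x) plus measurement noise.
          x = rng.uniform(0, 3, size=20)
          y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

          def augment(x, y, copies=5, sigma=0.05):
              """Make noisy copies of the inputs (a crude augmentation scheme)."""
              xs = np.concatenate([x + rng.normal(scale=sigma, size=x.size)
                                   for _ in range(copies)])
              ys = np.tile(y, copies)
              return xs, ys

          x_aug, y_aug = augment(x, y)

          # Fit a deliberately over-flexible polynomial to both data sets.
          fit_raw = np.polynomial.Polynomial.fit(x, y, deg=15)
          fit_aug = np.polynomial.Polynomial.fit(x_aug, y_aug, deg=15)

          # Compare error against the true function on held-out points; the
          # augmented fit usually comes out smoother on this toy setup.
          x_test = np.linspace(0, 3, 200)
          for name, fit in [("raw", fit_raw), ("augmented", fit_aug)]:
              mse = np.mean((fit(x_test) - np.sin(x_test)) ** 2)
              print(f"{name:9s} test MSE: {mse:.4f}")

      So there is a coherent story for why it can help, but it's smoothing, not new information, which is why the significance claims deserve scrutiny.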

      I work in neutronics, and this reminds me of seeing published papers where someone models a nuclear reactor using some Monte Carlo code. It's not rocket science to set up a model in a well-established code and pull out some results. At some point, people just focus on the doing without focusing on the significance.
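
      To give a flavor of what I mean: the conceptual core of a Monte Carlo transport run is simple, even though production codes (MCNP, OpenMC, and the like) bury it under decades of physics and engineering. A toy sketch with invented parameters, estimating neutron transmission through a purely absorbing slab:

          import math
          import random

          def transmission(thickness, mean_free_path, n_particles=100_000, seed=42):
              """Toy estimate of the fraction of neutrons crossing an absorbing slab."""
              rng = random.Random(seed)
              transmitted = 0
              for _ in range(n_particles):
                  # Distance to the first collision in a uniform medium is
                  # exponentially distributed with mean = mean free path.
                  distance = rng.expovariate(1.0 / mean_free_path)
                  if distance > thickness:
                      transmitted += 1
              return transmitted / n_particles

          t, lam = 2.0, 1.0  # invented values, in arbitrary consistent units
          print("MC estimate:", transmission(t, lam))
          print("analytic:   ", math.exp(-t / lam))  # exact answer for pure absorption

      The real work is in the physics data, geometry, and validation; running the sampler is the easy part, which is exactly why "I modeled X in code Y" isn't significant by itself.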

      8 votes
      1. [2]
        TallUntidyGothGF
        Link Parent

        I think it is following the usual pattern of scientific publishing when a new method captures the public and scientific imagination. For example, when whole genome sequencing became feasible, you could easily get a big paper just by sequencing [species x]. Over time this in itself became less publishable, since it became obvious that you'd be able to do it, and the mere doing of it didn't really lead to what it promised. Then, eventually, to publish with it, you had to actually find something new with, or about, the method.

        Your overall point is interesting. I do, personally, see an application in this context as research: you're performing an experimental application of a method, recording results, and you're evaluating it systematically. Ideally you're using it to find things out about the method, and the world. I see systematic evaluation as the thing that makes science science.

        Also kind of a tangent, but I work in health informatics over several points along the translational pipeline between basic methodological development and implementation in healthcare systems. When you're developing your method, nobody really cares about your application (except as a kind of comparative proof of principle), and when you're applying it, no one really cares about the method. It took me a while to learn this in the beginning: you have to write separate papers, for separate audiences, using completely different language, and different modes of evaluation.

        In translational medicine, however, it has always struck me as a difficult setup - many issues arise when, for example, the statistically illiterate, the medically illiterate, or the ethically illiterate are the ones implementing and evaluating the models used in healthcare, and so on...

        That said, I find it kind of exciting, really, that it's not always clear what the implications or significance of more basic results are at the time, and enjoy the creativity of finding new ways they can be combined, modified, and applied, to actually achieve something in the world.

        edit: grammar

        4 votes
        1. paper_reactor
          Link Parent

          I agree, this always happens when something reaches such a widespread audience. And like you said, there absolutely is validity to publishing something where application is the purpose, rather than a novel method or a novel set of results. Validation and application of a method are key; after all, application is ultimately what makes something useful, and usefulness is worth publishing. I took a glance at the paper, and I do think the ultimate goal of looking at how classes are linked is interesting.

          That said, I do think the significance of results sometimes gets pushed to the background when something as widespread as AI and machine learning takes center stage so commandingly. It's something I struggle with in my own research. Significance has a bunch of contributing factors, including solid data, solid methods, and solid results, along with discussion of those results and why they could matter. And like you said, literacy is important to explaining significance.

          Significance takes on many flavors too, so being novel isn't the only qualifier for being significant. I think significance gets diluted in crowded fields, where the pressure to put out results is that much higher. I find AI exciting, and it has so many potential applications; I've seen some truly cool applications of machine learning that have me excited.

          Additionally, and this is nothing new, it doesn't help that the general public can misinterpret and skew the intent of research while spreading it.

          1 vote
  2. pageupdraws
    Link

    This is a fascinating analysis and critical in all the right ways. Has anyone seen similar treatments for other claims about GPT4 scores for other well known exams?

    I recall reading claims about scores on the bar exam, AP Calculus, etc., and I am curious to learn whether those are well substantiated.

    4 votes
  3. ebonGavia
    Link

    The conclusions reached in the article do not support the headline as written. The authors explicitly disclaim any knowledge about whether GPT4 could pass the exam – the criticisms are of the testing methodology, which does indeed seem to have been fraught with errors.

    2 votes