19 votes

The leading AI models are now very good historians

16 comments

  1. [4]
    balooga
    Link

    I’ve played around with ChatGPT’s text analysis enough to know it’s quite good at analyzing works it’s familiar with, and half-decent at BS-ing about stuff it’s never seen.

    For me, the most impressive thing in this post was its image recognition capabilities. It did amazingly well with that photo of an 18th century manuscript. Everything was stacked against it:

    1. Non-English source (and an archaic one at that)
    2. Irregular line breaks
    3. Portions of some words are just missing
    4. Hand-written lettering
    5. Uneven line alignment
    6. Pictographic letterforms
    7. Some superscript
    8. Decorative elements
    9. Tons of smudges, either stained from an adjacent page or visible through the paper from the reverse side

    Even with all this, ChatGPT was apparently able to get things mostly correct. That’s very impressive. I can’t vouch for its historical interpretations but I was blown away by its literal description of the contents of the image.

    I’m curious about what model was used. The author says it was o1, but as far as I can tell there’s no way to upload images to that one. Either this was a mistake and it’s actually 4o, or there’s some way to do it that I’m unaware of. A higher subscription tier maybe? Not sure.

Just on Saturday I was out and about and saw a bird I didn’t recognize. I snapped a pic of it with my phone and asked ChatGPT (4o, the only model that I believe accepts file uploads) to identify it. It instantly and unequivocally gave me the right answer. No hedging language or lists of possible matches. It just identified the bird from my low-quality snapshot and did so perfectly. I continue to find useful new use cases for this thing.
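    If anyone wants to try the same thing outside the app, here's a rough sketch of that kind of image query using the official openai Python package. The model name, file path, and prompt below are just placeholders (not exactly what I did), and the request shape may differ slightly depending on your SDK version:

    ```python
    # Rough sketch: asking a vision-capable model to identify a bird from a photo.
    # Assumes the official `openai` package (v1+) and an OPENAI_API_KEY in the
    # environment; "gpt-4o", "bird.jpg", and the prompt are placeholders.
    import base64
    from openai import OpenAI

    client = OpenAI()

    with open("bird.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What bird is this? One species name, please."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
    )

    print(response.choices[0].message.content)
    ```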

    12 votes
    1. [3]
      V17
      Link Parent

      I’m curious about what model was used. The author says it was o1, but as far as I can tell there’s no way to upload images to that one. Either this was a mistake and it’s actually 4o, or there’s some way to do it that I’m unaware of.

      I have the standard 20 USD subscription and can upload files, image recognition works. That is in the web version from my computer. I think it wasn't present during the preview phase and possibly during launch, but it's there now.

      4 votes
      1. [2]
        balooga
        Link Parent

        Interesting, it seems to vary depending on your UI. I see the file upload button when I use the web app, but it's not there (for o1) in the iOS app. Thanks for confirming, I'll need to play around with it in a browser and see how it compares.

        2 votes
        1. EpicAglet
          Link Parent

          For me it seems I can only upload images, not other files to o1.

  2. [3]
    skybrian
    Link

    From the blog post:

    I realize that some people reading this will probably be thinking around this point “why write a book about William James and the history of science, if the next OpenAI model is likely going to be able to auto-generate a decent approximation of it?”

    The answer is that I genuinely do believe that human consciousness and creativity is both an end in itself and a source of value in itself [....]

    ...

    After all: when you get down to it, o1 talking about a panopticon and Foucault in the above snippet is very, very similar to what a first year history PhD student might produce.

    This makes sense: 2025 is, after all, being hailed as the year that PhD-level AI agents will flourish. I can personally report that, in the field of history, we’re already there.

    But the architecture of these models, the data that feeds them and the human training that guides them, all converge on the median. The supposedly “boundary-pushing” ideas it generated were all pretty much what a class of grad students would come up with — high level and well-informed, but predictable.

    ...

    All that said — yes, these things can definitely “do” historical research and analysis now, and I am 100% certain that they will improve many aspects of the work historians do to understand the past, especially in the realms of transcription, translation, and image analysis. I find that pretty exciting.

    9 votes
    1. [2]
      heraplem
      Link Parent

      The answer is that I genuinely do believe that human consciousness and creativity is both an end in itself and a source of value in itself

      Won't pay the bills, though.

      6 votes
      1. Minori
        (edited )
        Link Parent

        That's true, but humans still have a role to play in history. While these models might be fantastic at helping historians work through primary sources or develop ideas, humans still have some comparative advantages which give an economic reason to keep a human involved. Maybe someone will try replacing their professional historians with AI models, but it's more likely that existing historians will simply become more productive with these new tools at their disposal!

  3. [3]
    Deely
    Link

    As a software developer... I just don't understand. Currently I'm using a somewhat niche technology (definitely not the latest frameworks), and from time to time I run a test by asking different LLMs some simple questions about this technology stack. Just to explain: it's not a web-based tech, it's a specialized financial/procurement-oriented tech for non-web applications on Windows. (Actually web too, but that's a separate tech branch.) It's not popular, but it's also not hidden knowledge or abandonware.

    So... every time, the LLM's answer is wrong. And I can't say that my question is hard; it's on the level of "how to find records in a specialized graphical list control by condition". Nothing fancy or tricky. It looks like the LLM read the first two paragraphs of the related documentation (it can produce a code example that will work maybe half the time), but it completely skips the special cases mentioned at the bottom of the documentation page.

    And... how can I trust an LLM on any other questions? LLMs that become “historians”, “lawyers”, “advisors”, “helpers”, etc.?

    8 votes
    1. heraplem
      (edited )
      Link Parent

      This is similar to my experience. I struggle to get useful information out of LLMs.

      I have a hypothesis for why this is: if I'm turning to an LLM for information, it's because searching on Google has failed to yield satisfactory results. But if that's the case, then the LLM probably doesn't have a good answer in its training set, so it's unlikely to be able to help much.

      Something that I find really irritating about LLMs is that, in my experience, they really struggle to distinguish between important and unimportant information. If I ask ChatGPT a question, the exact words I use significantly influence the output. But I'm not likely to know precisely which information is significant to the question when I don't know the answer! And I find that if I don't use the right "keywords", it will often fail to give me the correct answer. I've also found that it struggles to solve the XY problem.

      The best use-cases for LLMs that I've been able to find involve literally just churning out a bunch of text that requires more creativity than a macro system but not enough that I need to be super deliberate about it (e.g., writing a bunch of test cases, and even then that was just as a supplement to my own hand-written test cases, and even then some of the test cases had mistakes in them).

      I think that, unless there is a revolutionary advancement in the fundamental technology, the most significant use-cases for LLMs will involve fine-tuned models in specific applications, and even then there will probably need to be significant guardrails.

      9 votes
    2. onceuponaban
      Link Parent

      How can I trust an LLM on any other questions?

      As far as I'm concerned: you don't. Not to say that LLMs are useless, but I see them more as ways to automate the "obvious" parts wherever traditional algorithms wouldn't be worth the effort to develop for that purpose. If you can reliably control for unsound output, or it doesn't actually matter in this situation, sure, go ahead and see if it makes your tasks easier. This is why for example I can see it making sense as a search engine assistant, since if it sources what it claims to be the information you're looking for, you can check if it's in fact based on the content of its source and not it suddenly turning into a word salad engine. But in any case where correctness is critical, if you cannot make sure that a wrong answer from an LLM will be noticed, then it must not be used. Not in its current state, at least.

      5 votes
  4. [4]
    thefactthat
    Link

    It's interesting to see this at the same time as this research coming to a very different conclusion, namely that LLMs struggle to "understand" complex historical topics. From my perspective, history is a very broad church, both in scope (i.e. time period and geography covered) and approach (i.e. research methodology), and what the linked article suggests is LLM proficiency in one particular area, namely archival research and early modern European history. Not to say it isn't impressive, but it doesn't cover the whole of what being a good historian is about. It would also be interesting to see the LLM's proficiency in drawing conclusions about documents from different historical periods and geographical areas.

    4 votes
    1. [2]
      creesch
      Link Parent

      The title very much fits the current hype, while the context of the article shows that LLMs have good potential to be useful tools in the hands of a competent historian for specific tasks.

      7 votes
      1. skybrian
        Link Parent

        Yes, it's just a few examples showing some progress. Even evaluating those tasks could be done more rigorously. Someone should study how accurate they actually are at deciphering handwriting from different historical periods.

        It seems promising as a tool, though?
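
        For what it's worth, the standard way OCR and handwriting-recognition studies score this is character error rate against a ground-truth transcription. A minimal sketch of that metric, with a hand-rolled edit distance and made-up example strings:

        ```python
        # Minimal sketch of character error rate (CER), a common metric for
        # transcription accuracy: edit distance between the model's output and a
        # ground-truth transcription, divided by the ground-truth length.
        # Example strings are made up.

        def edit_distance(a: str, b: str) -> int:
            """Levenshtein distance between strings a and b."""
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, start=1):
                curr = [i]
                for j, cb in enumerate(b, start=1):
                    cost = 0 if ca == cb else 1
                    curr.append(min(prev[j] + 1,          # deletion
                                    curr[j - 1] + 1,      # insertion
                                    prev[j - 1] + cost))  # substitution
                prev = curr
            return prev[-1]

        def cer(hypothesis: str, reference: str) -> float:
            return edit_distance(hypothesis, reference) / max(len(reference), 1)

        ground_truth = "in the yeare of our Lord one thousand seven hundred"
        model_output = "in the year of our Lord one thousand seven hundred"
        print(f"CER: {cer(model_output, ground_truth):.3f}")  # one dropped character -> ~0.02
        ```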

    2. onceuponaban
      (edited )
      Link Parent

      In my experience, and as far as I can tell due to the nature of the technology itself, generative AI's capabilities are as broad as they are shallow. Picking LLMs as a specific example: they are, to oversimplify, the equivalent of your phone's autocomplete if the data it could "learn" from had been massively scaled up. It can "answer" queries on many subjects, computed from which sequences of words were the most likely to follow the input according to its training. And, with careful enough selection and a broad enough dataset, that results in approaching a knowledge base you can converse with directly... but that's all it does. It approaches one.
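
      To make the "scaled-up autocomplete" framing concrete, here is a toy sketch (plain Python, a made-up one-line corpus, nothing like a real model) of that core idea: suggest whichever word most often followed the current one in the training text.

      ```python
      # Toy "autocomplete" sketch: predict the next word as the one that most often
      # followed the current word in a (tiny, made-up) training corpus. Real LLMs use
      # neural networks over subword tokens, but the objective is analogous:
      # produce a likely continuation of the input, not a verified fact.
      from collections import Counter, defaultdict

      corpus = "the cat sat on the mat and the cat slept on the mat".split()

      # Count how often each word follows each other word (a bigram model).
      following = defaultdict(Counter)
      for current, nxt in zip(corpus, corpus[1:]):
          following[current][nxt] += 1

      def suggest_next(word: str) -> str:
          """Return the most frequent continuation seen in training, or '?' if unseen."""
          counts = following.get(word)
          return counts.most_common(1)[0][0] if counts else "?"

      print(suggest_next("the"))  # -> "cat" or "mat", whichever was seen more often
      print(suggest_next("dog"))  # -> "?"  (outside the training data, no grounding)
      ```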

      At no point is the correctness or soundness of the statement's meaning part of controlling the generation of the tokens that eventually become the answer you receive from the model. For anything that goes outside the bounds of the data it's been trained on, things go haywire quickly, as what the model's weights calculate to be the most likely follow-up to the input diverges completely from truth. Even if the correct answer IS somewhere in the original dataset, there is no guarantee that the calculations establishing the weights of the model correlate with coherent reasoning given relevant input. Because that's simply not what this technology is meant for. That's why the quality of an LLM's answers drops off a cliff at arbitrary points and it "hallucinates" while still sounding just as confident as when it happens to be correct, i.e. when the data it was trained on happened to induce statements that match the truth.

      As efficiency and model size increase, allowing larger datasets to be leveraged into larger models in a way that is more likely to generate correct results, so does the perceived knowledge in a given subject for the user, as well as the number of topics where this baseline surface-level reliability can be established. But the proverbial cliff, and the resulting confidently wrong word salads, still exist, because that is an inherent limit of the technology in this application. Can this be useful for learning? For truly baseline knowledge about ubiquitous topics, and if you can reliably tell apart correct answers from automated misinformation, sure, and as LLMs improve, the range of topics for which this returns useful enough results will expand accordingly, to a point. But the whole issue with using an LLM for learning is that you can't tell, without researching it yourself, whether what it told you is correct, and while particularly obvious instances of "hallucinations" are notorious, it can just as easily be subtly but dangerously wrong information, provided to you with the same confidence as everything else.

      There are ways to mitigate this, for example through integration with a search engine, as you can check actual sources of information to cross-check an LLM's claims and further your knowledge that way. This could be an overall improvement for the user if it happens to be faster than querying a search engine yourself (I personally haven't found that useful yet, but I can see the value in being able to use natural language to search for information rather than approach queries as you would a traditional data indexer). And if used that way, I can see their potential as an "AI historian". There will, no matter the improvements, still be a hard limit to how reliable the model can be before hitting the limit of how "deep" its knowledge is and its output degenerates into nonsense, but it can certainly improve its surface-level reach in more subjects to be useful in ways previous models weren't before, so long as people understand the limitations of this technology... and don't fall to AI-related grifter bullshit, but that's another matter entirely.

      3 votes
  5. [2]
    skybrian
    Link

    Here's a thread about handwriting. I wonder if LLMs have been tested on harder handwriting problems?

    1. Minori
      Link Parent

      I don't think LLMs are being used for images (if they are, I have a lot of questions). I can see some of the newer transformer-based image models being useful for deciphering handwriting, but it's still difficult to decipher some scripts unless there's something like a Rosetta Stone to train the model on.