AI might not be coming for lawyers’ jobs anytime soon

1 comment

  1. skybrian

    Despite the headline making a hedged prediction, this article seems to be more about the present than the future:

    [L]awyers say that LLMs are a long way from reasoning well enough to replace them. Lucas Hale, a junior associate at McDermott Will & Schulte, has been embracing AI for many routine chores. He uses Relativity to sift through long documents and Microsoft Copilot for drafting legal citations. But when he turns to ChatGPT with a complex legal question, he finds the chatbot spewing hallucinations, rambling off topic, or drawing a blank.

    “In the case where we have a very narrow question or a question of first impression for the court,” he says, referring to a novel legal question that a court has never decided before, “that’s the kind of thinking that the tool can’t do.”

    ...

    [N]ew benchmarks are aiming to better measure the models’ ability to do legal work in the real world. The Professional Reasoning Benchmark, published by ScaleAI in November, evaluated leading LLMs on legal and financial tasks designed by professionals in the field. The study found that the models have critical gaps in their reliability for professional adoption, with the best-performing model scoring only 37% on the most difficult legal problems, meaning it met just over a third of possible points on the evaluation criteria. The models frequently made inaccurate legal judgments, and if they did reach correct conclusions, they did so through incomplete or opaque reasoning processes.

    Looks like there is a leaderboard for that benchmark here. Note that there are two datasets: GPT-5-Pro got 37% on the "hard subset" and almost 50% on the full dataset (shown on the right side of the leaderboard).

    Defining a new, tougher benchmark is certainly useful. But AI labs sometimes saturate benchmarks in a year or two, so we'll see.

    If this benchmark turns out to be too easy to capture what lawyers do, I assume there will be more benchmarks.