I don't buy any LLM benchmarks. They mislead laypeople about the tech's capabilities. They give people the impression that once a model hits 100% (or close to it), a career has been automated away. LLMs often score 60-70% or better on software engineering benchmarks, but in my experience they can do far less than the majority of my work. Sure, a higher number should generally mean a better tool. But benchmarks can - and do - make it into the training data for models. They can include leading questions that give away the answer. If the humans writing the benchmark questions know the correct answer, they can easily and accidentally encode part of the answer in the way they word the question.
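To make the contamination point concrete, here is roughly what a crude check looks like: measure n-gram overlap between a benchmark question and training text. This is only a sketch - the corpus, question, window size, and threshold below are all invented for illustration, not any lab's actual decontamination pipeline:

    # Crude contamination check: what fraction of a question's n-gram
    # token windows also appear somewhere in the training corpus?
    # (n=8 here to keep the demo short; published contamination studies
    # often use longer windows, e.g. ~13 tokens.)
    def ngrams(text, n=8):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def contamination_score(question, corpus_docs, n=8):
        q_grams = ngrams(question, n)
        if not q_grams:
            return 0.0
        corpus_grams = set()
        for doc in corpus_docs:
            corpus_grams |= ngrams(doc, n)
        return len(q_grams & corpus_grams) / len(q_grams)

    # Hypothetical usage: flag benchmark questions that overlap heavily
    # with training text. Both lists are stand-ins.
    docs = ["Q: What is the holding of Marbury v. Madison? A: ..."]
    questions = ["What is the holding of Marbury v. Madison?"]
    for question in questions:
        if contamination_score(question, docs) > 0.5:
            print("possible leak:", question)

Of course, a check like this only catches near-verbatim overlap; paraphrased leakage is much harder to detect.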
And then consider how much of a job isn't addressed by tests at all. How much of a lawyer's job is working with and finding clients? Or consider the Achilles' heel of AI benchmarks - the real world isn't a test with questions. It's a flow of information that often requires a person to realize, unprompted, that there is a problem, then search to refine their understanding of that problem, and only then try to answer it. It's the difference between book smarts and experience.
Yes, that’s why benchmarks keep getting replaced by harder benchmarks. I think they do show progress of a sort, but maybe not as much as some might think, and not as well as writing custom benchmarks for whatever a business really cares about.
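For what it’s worth, a business-specific benchmark doesn’t have to be elaborate. Here’s a sketch of the idea, where model_call is a made-up stub standing in for whatever API or local model you’d actually use, and the task/rubric format is invented for illustration:

    # Tiny custom-eval harness: run your own tasks through a model and
    # grade the answers against a rubric.
    def model_call(prompt):
        # Stub so the sketch runs; replace with a real LLM call.
        return "The vendor must indemnify the buyer against third-party claims."

    def grade(answer, rubric):
        # Crudest possible grader: every must-have phrase must appear.
        # Real graders might use exact match, regexes, or human review.
        return all(p.lower() in answer.lower() for p in rubric["must_include"])

    tasks = [
        {"prompt": "Summarize the indemnification clause in contract X.",
         "rubric": {"must_include": ["indemnify", "third-party claims"]}},
    ]

    results = [grade(model_call(t["prompt"]), t["rubric"]) for t in tasks]
    print("pass rate: %d/%d" % (sum(results), len(results)))

Even something this crude forces you to write down what “good” means for your own tasks, which a public leaderboard can’t do for you.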
There are also more subtle ways benchmarks can be beaten: not by the exact questions leaking into the training data, but by training on a bunch of questions somewhat like what’s on the test. LLMs do generalize somewhat, so that’s often enough.
We’ve had at least 40 years of ordinary software engineering being used to automate routine office work. Sometimes jobs disappear, but more often automation changes the nature of the work. I see LLMs continuing that trend at a somewhat faster pace.
Despite the headline making a hedged prediction, this article seems to be more about the present than the future:
[L]awyers say that LLMs are a long way from reasoning well enough to replace them. Lucas Hale, a junior associate at McDermott Will & Schulte, has been embracing AI for many routine chores. He uses Relativity to sift through long documents and Microsoft Copilot for drafting legal citations. But when he turns to ChatGPT with a complex legal question, he finds the chatbot spewing hallucinations, rambling off topic, or drawing a blank.
“In the case where we have a very narrow question or a question of first impression for the court,” he says, referring to a novel legal question that a court has never decided before, “that’s the kind of thinking that the tool can’t do.”
...
[N]ew benchmarks are aiming to better measure the models’ ability to do legal work in the real world. The Professional Reasoning Benchmark, published by ScaleAI in November, evaluated leading LLMs on legal and financial tasks designed by professionals in the field. The study found that the models have critical gaps in their reliability for professional adoption, with the best-performing model scoring only 37% on the most difficult legal problems, meaning it met just over a third of possible points on the evaluation criteria. The models frequently made inaccurate legal judgments, and if they did reach correct conclusions, they did so through incomplete or opaque reasoning processes.
Looks like there is a leaderboard for that benchmark here. Note that there are two datasets. GPT-5-Pro got 37% on the "hard subset" and almost 50% on the full dataset (shown on the right side).
Defining a new, tougher benchmark is certainly useful. But AI labs sometimes saturate benchmarks in a year or two, so we'll see.
If this benchmark turns out to be too easy to really capture what lawyers do, I assume there will be more benchmarks.