The portion on classification (which I've been working with for years now) definitely reflects what I would consider the best evaluation metrics for the task no matter what model you're using, and as a result, I have a lot of faith in the author's picks for the other tasks, even those I haven't worked with myself! Even when LLMs stop being the next big thing, good evaluation metrics on tasks we care about will almost certainly be applicable to whatever comes next.
From the article:
If you’ve run off-the-shelf evals for your tasks, you may have found that most don’t work. They barely correlate with application-specific performance and aren’t discriminative enough to use in production. As a result, we could spend weeks and still not have evals that reliably measure how we’re doing on our tasks.
To save us some time, I’m sharing some evals I’ve found useful. The goal is to spend less time figuring out evals so we can spend more time shipping to users. We’ll focus on simple, common tasks like classification/extraction, summarization, and translation. (Although classification evals are basic, having a good understanding helps with the meta problem of evaluating evals.) We’ll also discuss how to measure copyright regurgitation and toxicity.
(I don’t expect I’ll ever do this kind of programming, but I think it’s an interesting window into what it takes to use LLMs in production applications.)
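To make the classification point concrete, here’s the kind of eval I usually reach for. This is a minimal sketch using scikit-learn (my own example, not code from the article): precision and recall at a fixed decision threshold, plus ROC-AUC, which is threshold-free and tends to be more discriminative when you’re comparing models or prompts.

    # A minimal sketch of standard classification eval metrics (my own example,
    # not code from the article): precision and recall at a fixed threshold,
    # plus ROC-AUC, which doesn't depend on the threshold at all.
    from sklearn.metrics import precision_score, recall_score, roc_auc_score

    # Hypothetical labeled eval set: 1 = positive class, 0 = negative class.
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    # Scores from the model under evaluation (e.g., probability of the positive class).
    y_score = [0.9, 0.2, 0.7, 0.4, 0.3, 0.1, 0.8, 0.6]

    # Turn scores into hard predictions at a chosen threshold.
    threshold = 0.5
    y_pred = [1 if s >= threshold else 0 for s in y_score]

    print("precision:", precision_score(y_true, y_pred))
    print("recall:   ", recall_score(y_true, y_pred))
    print("ROC-AUC:  ", roc_auc_score(y_true, y_score))

Moving the threshold trades precision against recall, but ROC-AUC stays put, which is why it’s handy for ranking candidate models before you commit to an operating point.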