The portion on classification (which I've been working with for years now) definitely reflects what I would consider the best evaluation metrics for the task no matter what model you're using, and as a result, I have a lot of faith in the author's picks for the other tasks, even those I haven't worked with myself! Even when LLMs stop being the next big thing, good evaluation metrics on tasks we care about will almost certainly be applicable to whatever comes next.
From the article:
If you’ve run off-the-shelf evals for your tasks, you may have found that most don’t work. They barely correlate with application-specific performance and aren’t discriminative enough to use in production. As a result, we could spend weeks and still not have evals that reliably measure how we’re doing on our tasks.
To save us some time, I’m sharing some evals I’ve found useful. The goal is to spend less time figuring out evals so we can spend more time shipping to users. We’ll focus on simple, common tasks like classification/extraction, summarization, and translation. (Although classification evals are basic, having a good understanding helps with the meta problem of evaluating evals.) We’ll also discuss how to measure copyright regurgitation and toxicity.
(I don’t expect I’ll ever do this kind of programming, but I think it’s an interesting window into what it takes to use LLMs in production applications.)
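To make the classification point concrete, here’s the kind of eval I usually reach for. This is a minimal sketch using scikit-learn (my own example, not code from the article): precision and recall at a fixed decision threshold, plus ROC-AUC, which is threshold-free and tends to be more discriminative when you’re comparing models or prompts.

    # A minimal sketch of standard classification eval metrics (my own example,
    # not code from the article): precision and recall at a fixed threshold,
    # plus ROC-AUC, which doesn't depend on the threshold at all.
    from sklearn.metrics import precision_score, recall_score, roc_auc_score

    # Hypothetical labeled eval set: 1 = positive class, 0 = negative class.
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    # Scores from the model under evaluation (e.g., probability of the positive class).
    y_score = [0.9, 0.2, 0.7, 0.4, 0.3, 0.1, 0.8, 0.6]

    # Turn scores into hard predictions at a chosen threshold.
    threshold = 0.5
    y_pred = [1 if s >= threshold else 0 for s in y_score]

    print("precision:", precision_score(y_true, y_pred))
    print("recall:   ", recall_score(y_true, y_pred))
    print("ROC-AUC:  ", roc_auc_score(y_true, y_score))

Moving the threshold trades precision against recall, but ROC-AUC stays put, which is why it’s handy for ranking candidate models before you commit to an operating point.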