This all feels quite damning, and in stark contrast to the marketing claims and the suggestions people make based on benchmark performance and the like. Where is the disconnect here? Inability to handle new information, weaker models being used for news summarisation, benchmarks testing different kinds of problem, poor product design, the models just not being as good as they're made out to be? Some combination? In any case, rolling out any kind of software with this error rate seems insane - and a really bad PR move given where we are with misinformation proliferation. Did they even do any validation? It fits the narrative that these companies are desperate to show some kind of real value added from this tech…
sure, i agree. their (and other media corps') incentive is that they want human eyes on their web pages, for various reasons. otoh, most of the tests we hear about are run by people who really believe in LLMs (often that they're a straight line to AGI), if not directly conducted by those who stand to profit from people believing that, so it's interesting to see a test that isn't biased in that particular way (even if it leans somewhat in the other direction). like, given the results here, i would not feel safe at all relying on an LLM-mediated summary or QA session for something complex like a research paper.
i also think that, in spite of the conflict of interest, the risk of bias here is tempered by the very simple study design. on the other hand, what they really should have done is see how those same journalist experts rated human-synthesised summaries of the same stories, as a control. i'm sure anyone who is a subject matter expert has had the sad experience of seeing their topic done dirty for a mass audience, and it's easy to see misinformation propagate even via good-faith re-reporting of news stories.
edit: they also should have blinded the experts to whether an AI or a human was answering their questions, to control for anti-AI bias