42 votes

AI eroded doctors’ ability to spot cancer within months in study

50 comments

  1. [16]
    unkz
    Link

    It's an interesting finding, but I'm unsure of how to take it. After all, technology has "eroded" our societal abilities to do many tasks, like making rope by hand or forging knives from raw iron, but this is generally accepted as a good trade-off for letting us focus on higher order tasks. Is it not possible that we would be better off with a small number of specialized medical data scientists building widely distributed models that free up doctors' time for other specializations or actually performing surgery?

    archive link

    21 votes
    1. [2]
      gpl
      (edited )
      Link Parent

      If we already had widely distributed (open?) models that were shown to be at least as free of bias as scans/analyses without said models, I think I might agree. But I don't like the idea of my doctor becoming reliant on a tool that they may not always have access to, that may not always be around, that they may not completely understand, and whose performance may change unexpectedly with future iterations.

      47 votes
      1. Nichaes
        Link Parent

        There are countless tools doctors use all the time that they definitely don't need to fully understand to use, that might not be readily available, and that are proprietary (MRIs, CT scanners, blood analyzers, robotics for surgery, pacemakers, etc.). I feel like the only actual problematic thing here is unreliable performance.

        7 votes
    2. [12]
      CannibalisticApple
      Link Parent

      When it comes to health, I personally think it's better to have multiple checks in place, especially for something as serious as cancer. That way if one misses it for whatever reason, there's still a chance the other can catch it. It's basically the same principle as getting second opinions from other doctors.

      Besides, not all doctors and facilities will have access to AI. Rural areas, third world countries, hospitals with limited budgets... Even places that have AI can have scenarios where it becomes useless while the rest of the hospital can still operate, such as the AI malfunctioning, some fatal design flaw with a specific model, or some update/patch screwing up all AI software. Given how tight scheduling can be, having to reschedule even a single day's worth of appointments due to a technical issue could delay a screening by weeks, which could be just enough time for cancer to progress and spread to a point of no return. It would be better for the doctors to be able to run and analyze those screenings themselves ASAP.

      So to summarize: this is a vital skill that I do NOT want to see eroded on a wide scale. The skills you mentioned still have individual practitioners who can pass on the knowledge, but medicine is a much higher stakes field. Medical schools and universities seem likely to be in a top position to have AI, so I'd be worried about the skills of the teachers degrading. If the students get subpar training or become too dependent on AI during education, that can cause problems if they enter an environment without it.

      Also, one more question for people with more knowledge about hospitals: is there any chance that a ransomware attack could screw with AI at a hospital? Asking because I remember the ransomware attacks last year messed up some hospitals really badly, and led to some fatalities. If AI would also be vulnerable to ransomware attacks, and if hospital usage of AI extends beyond cancer detection to other tests and functions... Yeah, I absolutely do NOT want those skills to erode either.

      15 votes
      1. [8]
        stu2b50
        Link Parent

        It's fair not to want skills to expire, but as the study noted, the rate of cancer detection went up after they started using the new imaging tools. It's a tough sell to say, "Hey, we didn't catch your colon cancer because we want our surgeons to practice finding cancer. Sorry, bud."

        12 votes
        1. [7]
          vord
          Link Parent

          Of course the reverse is true too.

          Sorry we didn't find your cancer because we can't afford the annual fee to the privately-owned cancer-detection monopoly, and none of the doctors we can afford have the skill to detect without it.

          10 votes
          1. [3]
            unkz
            Link Parent

            Not much different than “sorry, we can’t afford a PET scanner at this hospital”? But our solution in that case is to try to raise funds for a PET scanner, since we want the best possible technology.

            7 votes
            1. [2]
              vord
              Link Parent

              Yes and no. I'll bet there is a lot more variety of PET scanners to choose from. With a refurb and used market for more affordable options.

              3 votes
              1. Minori
                Link Parent

                It's theoretically easy to make a public medical imaging model. Especially for countries with large public healthcare systems, it'd make sense to fund and improve medical technology as populations age.

                3 votes
          2. [3]
            PepperJackson
            Link Parent

            I also wonder what the future of this is. Meaning, okay, all pathologists are replaced with this technology, now how do we identify new entities? I think these models would be very good at picking up qualities that suggest a disease that humans do not notice at all, which would address this issue.

            I worry about where we go when we have no people with the skills to be able to train the next generation of pathologists, who could innovate on what we currently know. But on the other hand, I suppose the entire arm of medicine could vaporize and we could pay technologists to click the button without any medical training. Currently pathology relies a lot on clinical context, which I'm sure the models could incorporate.

            6 votes
            1. vord
              Link Parent

              This sounds about right.

              Those who don't grasp satire are doomed to make it a reality.

              In our world, the billionaires will get access to the doctors that don't fumble the probes. The rest of us will get McCVS.

              4 votes
            2. PeeingRedAgain
              Link Parent

              I think a reasonably nuanced strategy would be to note the strengths and weaknesses of the human-driven and AI-driven processes and use them to complement each other's weaknesses. This may be something as simple as an AI model highlighting a scan or section of frozen tissue for a qualified individual to review that may have otherwise been missed by human error.

              The problem is that reasonably well-thought-out solutions are at odds with the entities trying to make as much money as possible, those trying to push a narrative for their own gain (monetary or otherwise), as well as an increasingly polarized environment where any setback is immediately viewed as proof of a complete failure of one of the two processes (depending on whatever your "gut" tells you the future of humans' or AI's roles in medicine should be).

              1 vote
      2. [3]
        tanglisha
        Link Parent

        I don’t know about hospitals specifically, but in general anything connected to the internet is at risk of a ransomware attack.

        3 votes
        1. [2]
          Diff
          Link Parent

          And anything with AI is going to be at risk of an internet or service outage. Or an outright service collapse.

          1. tanglisha
            Link Parent

            Let’s not forget some company buying it out and then shutting it down.

    3. tanglisha
      Link Parent

      It would be great if all the diagnostic information could be wrapped up in one thing like a tricorder. This would be especially helpful in places with no specialists and for folks who aren’t the usual kind of patients for a specific doctor.

      On my first visit to the VA my doctor was surprised to see me. “Oh, all my other patients are men over 60!” He would not have been able to diagnose my hormonal disorder; I’m glad I walked in with paperwork for it.

      2 votes
  2. [34]
    stu2b50
    Link

    I feel like this is really a case where the, fairly useless, term "AI" is harming the ability to talk about things. Right now, LLMs take up most of the oxygen for "AI". Which is fair, I think. But when Bloomberg writes "AI" in the headline, I feel like it will cause people to conflate things which should not be conflated. LLMs have specific properties which are very different than what "AI" covers

    (which, in practical terms, is anything that involves fitting matrices to data. Isn't that insane? Those have nothing to do with each other. How have we gotten in a state where matrices whose values are determined with statistical analysis = "artificial intelligence". But I digress)

    For example, LLMs hallucinate. Does a CNN based bounding box detector hallucinate? No. It gets things wrong, but it does not hallucinate.

    LLMs hallucinate because what you train the model on and what you do with the model are very different - the model weights are trained on predicting the most likely next token. Then, you "manipulate" the input space so that the output is more likely to be what you want. If the input is "What is the capital of the United States?" then "The capital of the United States is Washington D.C" is more likely than "Spongebob killed my brother three years ago" as the following tokens.

    That's very different than the hotdog-or-not model that is directly trained to produce a binary response, and is having its weights directly modified for that result on the training dataset.

    These "AI" cancer-detection models are not LLMs. That's not to say that they can't be wrong, but more that the risks, benefits, and cons are just completely different than an LLM. And so, calling them "AI" is just directly harmful to a layman's ability to understand what's going on.
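
    A rough way to see the distinction in code (a toy PyTorch sketch with made-up shapes, not the actual models involved): the classifier's loss is computed directly against the label we care about, while the language model's loss only measures how likely a continuation is, never whether it's true.

    ```python
    # Toy sketch (assumes PyTorch); these tiny "models" are stand-ins, not real architectures.
    import torch
    import torch.nn as nn

    # Hotdog-or-not / polyp-style classifier: optimized directly on the label we care about.
    image = torch.randn(1, 3, 224, 224)                 # dummy image
    label = torch.tensor([1.0])                         # 1 = positive class, 0 = negative
    cnn = nn.Sequential(nn.Conv2d(3, 8, 3), nn.AdaptiveAvgPool2d(1),
                        nn.Flatten(), nn.Linear(8, 1))
    cls_loss = nn.BCEWithLogitsLoss()(cnn(image).squeeze(1), label)

    # Language model: optimized only on next-token likelihood, not factual correctness.
    vocab, dim = 50_000, 64
    tokens = torch.randint(0, vocab, (1, 16))           # dummy token ids
    lm = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
    logits = lm(tokens[:, :-1])                         # predict token t+1 from token t
    lm_loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))
    # Nothing in lm_loss checks whether the continuation is true, only whether it is likely.
    ```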

    19 votes
    1. [33]
      unkz
      Link Parent

      For example, LLMs hallucinate. Does a CNN based bounding box detector hallucinate? No. It gets things wrong, but it does not hallucinate.

      LLMs hallucinate because what you train the model on and what you do with the model are very different

      Is that accurate? Or, what is the difference between a CNN (or vision transformers which share even more characteristics with LLMs) finding faces in static noise and an LLM making up the name of the capital of Germany? Why is one just “wrong” and the other a “hallucination”? They even face the same kinds of vulnerabilities, e.g. adversarial images vs LLM jailbreaks.

      4 votes
      1. [32]
        stu2b50
        (edited )
        Link Parent

        The difference is that an LLM is not trained to answer questions. It's trained to find the most likely subsequent token. You make it into something that "answers" questions by manipulating the input space such that the most likely tokens are probably something that is an answer to your question.

        When a chatbot says that "nyc is the capital of the united states" is that response wrong? It's not, really. That sentence, probabilistically, is likely fairly frequent. That's a hallucination. The model isn't doing the wrong thing, it's doing the right thing, you're just not using the model for the task it was directly trained on but, rather, trying to get at a side effect. It happens that correct answers are more likely to be next to questions in training sets. It doesn't have to be.

        An LLM output being wrong would be if you gave it the input "What is the capital of the united states" and it said the most likely following tokens are "这̴̢̨̯̞͉͚̟̱̓́̐͐͛̊̄͘是̶̯̇̏̅͌͑͌̄̎̓̔̓̐͆̃͝苹̷̻̈́̋̉̍̏͂́͑͆̚͝果̴̧̛̹̳̲̙̮͙͖͈̥̲̟̈́͆͋̃͗̒̑͝". That's probably not what the most likely following tokens are.
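
        To make the same point as a sketch (plain Python; score_next is a hypothetical stand-in for whatever the trained network computes): generation is just repeatedly taking a likely continuation, with no truth check anywhere in the loop.

        ```python
        # Hypothetical sketch: "answering" a prompt is just picking likely continuations.
        def generate(prompt: list[str], score_next, steps: int = 10) -> list[str]:
            tokens = list(prompt)
            for _ in range(steps):
                scores = score_next(tokens)                 # {token: likelihood} given the context so far
                tokens.append(max(scores, key=scores.get))  # greedily take the most likely token
            return tokens

        # If "NYC" happens to score higher than "Washington" after "The capital of the US is",
        # that's what comes out; the model is still doing exactly what it was trained to do.
        ```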

        7 votes
        1. [2]
          unkz
          (edited )
          Link Parent

          It's trained to find the most likely subsequent token.

          Sort of, but what you're describing is a foundation model (with some caveats). Chatbot LLMs absolutely are trained to provide answers via RLHF, DPO, etc., and what they are generating is no longer what's statistically common in the training corpus but something else entirely.

          3 votes
          1. stu2b50
            Link Parent

            True, and those make these chatbots hallucinate less, as they are a more direct form of training. But ultimately the bulk of the values of the model weights are from the supervised training on token prediction.

            Those methods also suffer from, well, sucking. RLHF is reinforcement learning, which is infamously shit even in scenarios where you can run billions of iterations in a simulation, and it's from human feedback, which introduces a massive bottleneck called: humans. You're getting extremely low signal data (2 bits, essentially), and it's puny in size.

            5 votes
        2. [29]
          Drewbahr
          Link Parent

          When a chatbot says that "nyc is the capital of the united states" is that response wrong? It's not, really.

          Yes, it is literally wrong though.

          1 vote
          1. [28]
            stu2b50
            Link Parent

            Why is it wrong? You trained a model to predict the most likely tokens in sequence, and it gave some likely tokens in sequence.

            8 votes
            1. [24]
              Drewbahr
              Link Parent

              It's wrong because New York City is not the capital of the United States.

              3 votes
              1. [23]
                stu2b50
                Link Parent

                That depends on your criteria, no? It's wrong in a geographic sense, if you extract the geographic meaning from that string, but it's not wrong on the measure of "is this a likely sequence of characters following the string 'what is the capital of the united states?'". The latter of which is what almost all of the model weights were trained on.

                And that's my point: this is very different from the "A""I" used in colon cancer detection, which is a fairly traditional use of computer vision, and is doing the first order thing it's trained on, rather than some nebulous second order property as LLM chatbots are used for.

                https://www.med.uio.no/helsam/english/research/projects/endobrain-international/index.html

                8 votes
                1. [22]
                  Drewbahr
                  Link Parent

                  The criteria is in the question. New York City is literally not the capital of The United States.

                  I don't personally care about "model weights" or any of that training criteria. I care about whether the answer is right, or wrong. And the answer, claiming that New York City is the capital of the United States, is literally, in all ways, wrong.

                  2 votes
                  1. [21]
                    stu2b50
                    Link Parent

                    Sure, and that's the difference between an LLM chatbot and a computer vision polyp classifier. The criteria you're using to judge the output isn't the criteria the model was trained on.

                    Whereas the criteria that latter was trained on is aligned with the intended behavior: whether or not an image of a polyp is cancerous.

                    10 votes
                    1. [20]
                      Drewbahr
                      Link Parent

                      This is where I rankle at the whole notion of "AI", "LLM", and general nomenclature surrounding this stuff. Sure, an LLM may not be a true AI - but nothing is, at least not yet. So AI becomes a useful shorthand for working within the space that LLMs currently occupy. It's similar, in my mind, to how everyone uses Kleenex instead of "facial tissue" or Q-Tips instead of ... whatever else you'd call them. It's become a generic term.

                      The idea that an LLM "hallucinates" seems absolutely bonkers to me. I understand what you're getting at - the LLM is only as "accurate" as the information it's trained on. If an LLM is only ever given pictures of dogs, it's going to assume everything with four legs is a dog of some variety, or whatever.

                      But it's not "hallucinating" anything. It's just wrong. An LLM can be, and often is, wrong. That's just the nature of things. "Hallucinating", to me, is just a euphemism that folks use to offset the discomfort they feel when this tool gives them bullshit.

                      When a child points to a dog and calls it a horse, we don't say the kid was "hallucinating". They're just wrong. Adorably wrong, but still wrong. And yes, they're wrong because they don't know better until they've been corrected - but they're still wrong.

                      2 votes
                      1. [19]
                        stu2b50
                        Link Parent

                        I understand what you're getting at - the LLM is only as "accurate" as the information it's trained on. If an LLM is only ever given pictures of dogs, it's going to assume everything with four legs is a dog of some variety, or whatever.

                        That… wasn’t what I was getting at at all. My point is that LLM chatbots are trying to make use of second order properties of the metric they are trained on, which makes them unique in many ways compared to any other kind of multilayer perceptron.

                        To reorient, since I’m not entirely sure what this is even on about anymore, the fundamental thing I’m positing in this thread is that the algorithm being used to detect cancer in colon polyps is fundamentally different from chatbot LLMs and therefore has fundamentally different properties.

                        Do you disagree with this? Or not? I can’t tell at this point.

                        8 votes
                        1. [18]
                          Drewbahr
                          Link Parent

                          My whole disagreement is here:

                          When a chatbot says that "nyc is the capital of the united states" is that response wrong? It's not, really.

                          It is wrong. Full stop.

                          1 vote
                          1. [16]
                            vord
                            Link Parent

                            Like, I'm never one to shy away from pedantic specifics of discussion, but what you're arguing is a bit like saying

                            "The screwdriver is a bad hammer because it doesn't hammer in nails."

                            In this case, the chatbot is the hammer, and the LLM is the screwdriver. The LLM is a bad chatbot because being a chatbot is not what it was designed to do.

                            What @stu2b50 is getting at is these cancer screeners are much more akin to using a screwdriver as a screwdriver.

                            4 votes
                            1. [15]
                              Drewbahr
                              Link Parent

                              A tool, even when used for the purpose it wasn't intended for, can still just be wrong.

                              Please, I beg you, understand that I'm solely arguing that an LLM can be wrong. New York City is not the capital of the United States, no matter who or what says it is.

                              I feel like I'm taking crazy pills here.

                              5 votes
                              1. [14]
                                Minori
                                (edited )
                                Link Parent

                                They're saying that the LLM's purpose is to generate statistically-likely strings of text. While you might want it to generate factually true statements, that's not how the model works. The model has no concept of truth.

                                The purpose of a system is what it does. For LLMs, they generate conceivable strings of text. Coincidentally, those strings are sometimes factually true.

                                The model's text may be factually wrong, but the system is working correctly. The language model is doing what it's programmed to do. True statements from LLMs are also "hallucinations". They're never guaranteed to generate factually true text.


                                It's like creating a line of best fit from a scatter plot of data. Some data points may be precisely on the line. Some data points are further away from the line's predictions. In either case, the line isn't "wrong" because it's doing exactly what it's supposed to. The line is predicting where a data point might be. A line of best fit isn't a source of truth, and it never can be.
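
                                To make the analogy concrete, a minimal least-squares example (assumes numpy; the numbers are made up):

                                ```python
                                import numpy as np

                                x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
                                y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])     # noisy data, roughly y = 2x

                                slope, intercept = np.polyfit(x, y, deg=1)  # ordinary least-squares fit
                                prediction = slope * 2.0 + intercept        # predicted y at x = 2.0

                                # prediction won't equal the observed 3.9 exactly, but the fit isn't "broken":
                                # least squares minimizes overall error; it never promises to hit any one point.
                                print(prediction, y[1])
                                ```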

                                8 votes
                                1. [12]
                                  Drewbahr
                                  Link Parent

                                  I get that. I understand that LLMs only occasionally end up saying factual things, and even then only by chance or coincidence. I understand statistical probabilities, I really truly do.

                                  I don't care if the system is working correctly. I'm focused entirely on the statement I quoted.

                                  I must be taking crazy pills

                                  6 votes
                                  1. [6]
                                    DefinitelyNotAFae
                                    Link Parent

                                    You're not. You're correct that it is a factually inaccurate statement, regardless of the fact that the LLM is not designed to give factually accurate statements. To roughly use the metaphor above, you're saying a screwdriver is bad at hammering nails and they're explaining screwdrivers to you over and over again.

                                    I don't know why they think you're not getting their point, but I think they're not being actually helpful. It's not just you.

                                    9 votes
                                    1. [5]
                                      vord
                                      (edited )
                                      Link Parent

                                      To jump back to the problem quote:

                                      When a chatbot says that "nyc is the capital of the united states" is that response wrong? It's not, really.

                                      When @Drewbahr said "It literally is," it caused a rift in the discussion. Because it was quite obvious (to me) from tone (and a quick non-LLM search result) that we knew NYC is not the capital...it was a rhetorical question. So it felt, to me at least, that it was directed at the "It's not really" part, which made it sound like a rejection of this bit.

                                      That sentence, probabilistically is likely fairly frequent. That's a hallucination. The model isn't doing the wrong thing, it's doing the right thing, you're just not using the model for the task it was directly trained

                                      Which does suggest that the point being made was missed somehow. And thus the "how can I reframe this" chain began.

                                      I'm betting there is some degree of a complex Venn diagram of neuro(a)typical thinking and language/grammar interpretation which has resulted in two camps essentially saying "I don't know why we're fighting over this" at each other.

                                      6 votes
                                      1. DefinitelyNotAFae
                                        Link Parent

                                        It feels like the commitment to the bit here is the problem with a mix of "failing to define terms before getting incredibly technical."

                                        I understood the point being made and understood the frustration being expressed and it feels like the commitment to the bit has overridden useful communication. It's why I tried to suggest two alternative terms that could mean "wrong" but in different ways.

                                        6 votes
                                      2. [3]
                                        Drewbahr
                                        Link Parent

                                        People can have disagreements and misunderstandings without being called neuroatypical.

                                        I'm going to walk away, because yet another person is coming in here trying to explain how my observation is incorrect.

                                        5 votes
                                        1. slabs37
                                          Link Parent

                                          Good on ya for stepping away. I see your point and their point: you're looking at it in a way where the output is just plain wrong; they're looking at it in a way where the process that reached the output can't be wrong, because the process is probability based and there's no wrong answer.

                                          If we want something that is accurate to knowledge, it is wrong, full stop.
                                          But what we have here is a mere language model that only sounds plausible. This is a step towards what we want, but it's not there yet.

                                          4 votes
                                        2. vord
                                          Link Parent

                                          Sorry that came across as an insult. Not my intent at all. Just trying to (unsuccessfully) convey a potential explanation for the dialog. Neurotypicality is a spectrum after all, there is no dividing line (and IMO the idea of a true normal is a false one). I was just as much throwing out my own atypicality.

                                          I was reminded of a comedic youtube short I have no hope of finding where the atypical person just kept repeating the same thing with different inflections to a "normal" person. It went something like:

                                          A: "I remembered you mentioned this thing you were working on, so I sent you a link I found, I thought you might like it."
                                          N: "Are you implying that I can't do my own research?"
                                          (Rinse/repeat a few times with different responses, but upon hitting the correct intonation)
                                          A: "I remembered you mentioned this thing you were working on, so I sent you a link I found, I thought you might like it"
                                          N: Oh, thank you!
                                          A: How do you normal people ever manage to communicate anything?

                                          That really stuck with me, and makes me mad I can't find it.

                                          1 vote
                                  2. [5]
                                    CptBluebear
                                    Link Parent

                                    No I'm with you. "Hallucination" is this weird term used to anthropomorphize and cover up that the machine gave an output that isn't correct. As if it's just a quirk of nature.

                                    From a user perspective the output is wrong.

                                    But.. they're not arguing that. As designed, the machine decided that it's a statistical probability that would be an answer to that question. That isn't wrong, that's just the way the machine works.
                                    That said, I personally think their side of the argument does not matter if it ever wants to be a useful product. Because right now it's just a funny word salad machine not beholden to any correct response whatsoever, so who cares that it functions "technically correct".

                                    3 votes
                                    1. [4]
                                      stu2b50
                                      Link Parent

                                      That said, I personally think their side of the argument does not matter if it ever wants to be a useful product. Because right now it's just a funny word salad machine not beholden to any correct response whatsoever, so who cares that it functions "technically correct".

                                      Honestly, just out of curiosity, is that really what you thought I was arguing?

                                      In the top level post I wrote, my main thesis, if you were to really boil it down, is that I think the polyp cancer detection model, which is not an LLM, is good and that I wish it would not be saddled with the baggage that LLMs have by lumping both together as “AI”.

                                      I cannot fathom where this impression came from that this was some kind of defense or argument for LLMs. If anything, to construct the former point, I was mainly making statements that showed how LLMs are extra bad at their job.

                                      4 votes
                                      1. [3]
                                        CptBluebear
                                        Link Parent

                                        Same as @Drewbahr:

                                        My whole disagreement is here:

                                        When a chatbot says that "nyc is the capital of the united states" is that response wrong? It's not, really.

                                        It is wrong. Full stop.

                                        I did not respond to the argument about specialised tool kits. I'm less worried about that. A healthcare-trained tool will likely end up making fewer mistakes than doctors do (or by extension an LLM), so you won't hear me complain. The caveat is that we don't know where responsibility lies when it does make an error.

                                        1. [2]
                                          stu2b50
                                          Link Parent

                                          You seem to agree with me that it’s not wrong?

                                          As designed, the machine decided that it's a statistical probability that would be an answer to that question. That isn't wrong, that's just the way the machine works.

                                          That’s something you said.

                                          The context for this discussion was that someone else asked what the difference was between a normal probabilistic model and an LLM chatbot. My answer is that LLM chatbots are extra bad, because even when they’re working correctly, they can be wrong.

                                          3 votes
                                          1. CptBluebear
                                            Link Parent

                                            I'm in partial agreement. I understand where you're coming from but I fail to see how that's relevant unless you work in the field. From a user perspective, which is to say the people that use LLMs, the answer is by all measurements incorrect. Whether or not that means the code behind it works as intended doesn't really matter. Even if I understand the technological reason it's not outputting the right answer, I can't very well tell when it's outputting something factually incorrect. The DC example is rather plainly incorrect, but what about a topic I know little about? Yeah the system works as intended but I'm walking away with the wrong assumption.

                                            This discussion in particular is on the sidelines of your general point (which is a point I agree with). I also saw Drewbahr spiraling and figured I'd pipe up to share an opinion matching theirs.

                                            Long story short: No I don't fully agree. For any user, the output is clearly factually incorrect and by default makes using it unreliable. But that's talking about the secondary discussion spawned by Drewbahr and hallucinations, otherwise I am in full agreement.

                                            This thread is mainly miscommunication.

                                            2 votes
                                2. unkz
                                  (edited )
                                  Link Parent

                                  They're saying that the LLM's purpose is to generate statistically-likely strings of text. While you might want it to generate factually true statements, that's not how the model works. The model has no concept of truth.

                                  There are two issues that I take with this.

                                  First, an LLM does not generate statistically likely strings of text. Even base models before fine-tuning don’t generate statistically likely strings of text. This conception of what an LLM does is rooted in Markov chains, which are a simple predecessor of LLMs and do indeed generate statistically likely strings of text. LLMs also output tokens from a probability distribution, but those probabilities don’t come from raw counts. They emerge from deep, nonlinear transformations over high-dimensional, sparsely activated features that encode concepts and constraints. All that to say if the training corpus contains 90 cases of "dogs are mammals" and 10 cases of "dogs are fish" it does not follow that it will simply report that dogs are mammals 90% of the time.

                                  Second, and this follows from the first, the way these non-linear transforms and normalizations and the resulting sparse features operate results in stable activation patterns, similar to attractor states in dynamical systems theory, that serve to efficiently minimize training loss across diverse paraphrases of the same concepts. These conceptual basins are qualitatively different than simple statistical representations and act very much like a "concept of truth" that is encoded within the parameters. In fact, I would argue that this is a strong parallel to what human beings use as a concept of truth, where conflicting facts are frequently held in superposition in our own minds, depending on the context in which the "fact" is being recalled.
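
                                  As a hedged illustration of the first point above (toy PyTorch code; the tiny "head" below is a stand-in, not a real LLM): the output distribution is a softmax over logits produced by a learned nonlinear function, so it has no reason to match raw corpus frequencies the way a count-based Markov model would.

                                  ```python
                                  import torch
                                  import torch.nn as nn

                                  # Raw-count (Markov-style) estimate from the hypothetical 90/10 corpus above:
                                  counts = {"mammals": 90, "fish": 10}
                                  count_probs = {k: v / sum(counts.values()) for k, v in counts.items()}  # 0.9 / 0.1

                                  # Toy stand-in for an LLM head: logits are a nonlinear function of a context vector.
                                  context = torch.randn(16)                     # pretend hidden state for "dogs are ..."
                                  head = nn.Sequential(nn.Linear(16, 16), nn.GELU(), nn.Linear(16, 2))
                                  probs = torch.softmax(head(context), dim=-1)  # need not be anywhere near 0.9 / 0.1

                                  print(count_probs, probs.tolist())
                                  ```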

                                  3 votes
                          2. stu2b50
                            Link Parent

                            I’ll leave it at this, since I don’t think anything further can be gained, but I said multiple times that it’s semantically wrong. I mean, it’s not even a real example, I made it up as an example of a hypothetical where an LLM chatbot makes a semantic mistake. Its whole purpose is to be wrong.

                            I feel like from the other reply you think this is a defense of LLM chatbots, that they can’t be wrong, but it’s literally the opposite. I’m making the argument that LLMs are wrong more of the time. A normal model produces true positives, true negatives, false positives, false negatives. An LLM chatbot can produce a true positive and still be semantically wrong.

                            That’s not only means that a greater portion of the potential outputs are semantically wrong, it means that these are very hard to correct for, since it was a true positive result.

                            4 votes
            2. [3]
              DefinitelyNotAFae
              Link Parent

              Perhaps "wrong" is too broad. "It's incorrect but not broken" might be clearer?

              1. [2]
                stu2b50
                Link Parent

                There's nothing incorrect about the input and output of the model as to what it was trained to do.

                Here's an analogous situation. You know the models that draw boxes around subjects in images for security cameras?

                Let's say you had a need to generate green boxes of varying sizes. You're going to use this model to do that. You do this by supplying the model with pictures of cats in your living room which are roughly the same size and proportions as the box you want it to draw. This is what people are doing when they are prompt engineering - you're shaping the input such that the output has a higher likelihood of having the properties you want.

                One time you gave it a picture with two cats in it, one in the foreground (the one you want the box of) and one in the background. The model draws the box around both the cat in the background and the foreground, making one big box. That's not the box you wanted! This is like what happens when the LLM says "NYC is the capital of the united states".

                Is it wrong? Not really, it generated boxes around entities in the image. But when you're after second-order properties (like the shape of the bounding box rather than the accuracy of the bounding box, or the logical soundness of a reply), then it's inherently going to get a lot more inaccurate, because many outputs that are correct in the first order will not have the second order properties you want (things that sound reasonable, but aren't true, like "nyc is the capital of the united states" - the validity of that sequence of tokens is a second order property).

                4 votes
                1. DefinitelyNotAFae
                  Link Parent

                  If I shake a magic 8 ball and ask it if it's currently the year 1999 and it says "Yes" the answer is incorrect. But it's not broken. It did what was intended. Calling the answer wrong would be normal and wouldn't imply the Magic 8 Ball isn't working as intended.

                  Arguing that it isn't wrong or incorrect because the Magic 8 Ball worked as intended would not make sense to most people.

                  I take your point, but you're not providing clarity. While you don't have to take my language, not providing some clarification besides the word "wrong," which means many things, not just "not working as intended," is unhelpful.

                  6 votes