I've made the argument repeatedly that calling falsehoods output by an LLM "hallucinations" is a category error: the models don't have two modes, "normal" and "hallucinating"; they only have one mode, and they are hallucinating all the time. It just happens to be the case that sometimes the things they hallucinate are true, or at least have enough verisimilitude to pass a sniff test.
(In a sense, this shouldn't be surprising. An LLM isn't going to generate a phrase like "colorless green ideas sleep furiously"¹, not because it doesn't mean anything, but because people don't say things like that—it's exactly the same reason it won't put adjectives in the wrong order, essentially a syntactic property of the training data. Stripping "stuff people don't say" from the set of all possible grammatical sentences heavily biases it towards true, or at least meaningful-seeming, statements, without actually taking any form of semantic truth into account.)
The way this article frames it is both clearer and more descriptive: LLMs bullshit, all the time. It's intuitively obvious to people that bullshitters often say true things, but that nevertheless you shouldn't take their word for anything.
¹ Though it obviously will generate that specific phrase, and expound upon it at length, because it already exists in the public discourse which is used as training data.
Yes, I agree that people need to be wary all the time. Treating it as a hint generator is best. The LLM doesn’t verify things, you do. (Or perhaps some other part of a larger system.)
Nonetheless, there’s a distinction between the stuff it will say every time because it’s found frequently in its training data (maybe not word-for-word, but at a higher level of abstraction), and the stuff it says now and then because the output is based on a random number generator.
The non-random parts are what make it seem convincing and a useful source of hints. The random parts make it seem smarter than it is, because sometimes it gets lucky and people stop there. Distinguishing the two requires multiple trials and maybe some statistics, and people mostly don’t do that.
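For instance, a minimal sketch of what "multiple trials and maybe some statistics" could look like, with ask_llm() as a stand-in for whatever chat API is actually being called (not a real library function):

```python
from collections import Counter

def ask_llm(prompt: str) -> str:
    """Stand-in for whatever chat API you actually call."""
    raise NotImplementedError

def answer_tally(prompt: str, trials: int = 10) -> Counter:
    """Ask the same question repeatedly and tally the distinct answers.

    Answers that dominate the tally are the "non-random" part; the one-offs
    are the dice rolls. Note that a consistent answer can still be wrong.
    """
    return Counter(ask_llm(prompt).strip().lower() for _ in range(trials))
```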
Since my boss's boss just sent my division an email clearly written using ChatGPT, complete with a list (which LLMs also love making) of points that were mostly bullshit, I fully support the idea of calling anything they do bullshit.
I'm all for people using LLMs where appropriate. But I'd also support formal reprimands for people that waste everyone's time with LLM bullshit.
Where do you think LLMs are appropriate?
I'm a software engineer. I use them all day every day.
They're great at getting you half of the way to understanding a new topic as long as you know 90% about some adjacent topics, just enough to call it out on being wrong. It's important to clarify that I'm using it to learn about software and programming, stuff where I will find out pretty quickly if I've been lied to.
They're very useful for language transformation tasks. I've taken 1000 lines of API documentation, pasted it into GPT-4, and it's produced a perfect TypeScript definition for the API.
When programming, it's a nice little helper to have it auto-complete lines for you. Oftentimes it's easy enough to guess what the next snippet of code will be, and having it write that for me, adapting to my code style and correctly interpolating variables, is a time saver.
Related to the above: writing code that tests other bits of code is a repetitive task. Having an LLM write tests for you is amazing. I wouldn't trust it for anything complicated here, but oftentimes you're just iterating through all of the edge cases, and it's easy to read through the auto-complete suggestion to verify it.
GPT-4 is great at explaining words in context. A lot of words in technical communication are overloaded: they have 10 subtly different definitions. Being able to explain the context and then copy/paste the source material with a query about one of the words' contextual definitions is powerful. The alternative here would not be a dictionary; it would either be (a) just guessing and being about half right or (b) researching for an hour to try to figure it out for yourself. LLMs offer a nice middle ground where you're probably moving closer to being correct than guessing while spending 99% less time than researching.
GPT-4 is pretty good at generating Python visualization code. I don't want LLMs writing most of the code in my repository, but for throwaway stuff I'll happily let it write 50 lines of matplotlib function calls and list comprehensions (see the sketch below).
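For a sense of scale, the kind of throwaway plotting code I mean looks something like this (illustrative only, with made-up numbers):

```python
import matplotlib.pyplot as plt

# Made-up latency samples per service, the sort of ad hoc data I just want to eyeball.
latencies = {
    "auth": [12, 15, 11, 30, 14, 13],
    "search": [40, 42, 55, 38, 41, 90],
    "billing": [25, 22, 27, 26, 24, 23],
}
means = [sum(vals) / len(vals) for vals in latencies.values()]  # quick list comprehension

fig, ax = plt.subplots(figsize=(6, 4))
ax.boxplot(list(latencies.values()), labels=list(latencies.keys()))
ax.plot(range(1, len(means) + 1), means, "r^", label="mean")
ax.set_ylabel("latency (ms)")
ax.legend()
plt.tight_layout()
plt.show()
```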
I'm a security engineer who does some coding, but I'm not writing code all day every day. For me, LLMs are incredibly useful for getting the framework of a script running, which I can then tweak and adapt to do what I actually want it to do.
I don't have the innate familiarity with common patterns that software developers do, so a common use case like sending requests to an API with error handling and so on would have me reinventing the wheel constantly. Instead, I can just ask an LLM to write a script to get x from this API, then take that (usually nonfunctional) code and adapt it to what I want it to do. It saves literal hours of work that pops up as incidental to the things that I'm actually paid to do.
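The kind of boilerplate structure I mean, roughly; the endpoint here is hypothetical, and the LLM's first draft of something like this usually needs fixing before it runs:

```python
import requests

def fetch_items(base_url: str, token: str, timeout: int = 10) -> list[dict]:
    """Fetch items from a (hypothetical) REST endpoint with basic error handling."""
    try:
        resp = requests.get(
            f"{base_url}/api/v1/items",  # placeholder path, not a real API
            headers={"Authorization": f"Bearer {token}"},
            timeout=timeout,
        )
        resp.raise_for_status()
        return resp.json().get("items", [])
    except requests.exceptions.RequestException as exc:
        print(f"Request failed: {exc}")
        return []
```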
It's also great at writing domain-specific queries in languages like KQL, Splunk, or other proprietary SQL-ish languages.
Any time I need something that's sorta right-ish, it's often faster to use an LLM and then correct its output than to write all the boring boilerplate-type stuff myself.
Project Zero was playing with it recently and shared some thoughts you might be interested in. The Google Security blog also found some success in using it for reporting security incidents.
It seems like every domain is currently experimenting with how LLMs might help (or hinder) their chosen niche. It's interesting to see this learning process take place across so many fields at once.
I needed a fairly simple shape object in QML a few days ago and was very impressed when I asked ChatGPT to write the code for me and it gave me perfectly working code on the first try. I then asked it to rewrite the code at a different size and it again returned a perfect result. It's something I could have done just as well, but I would have had to do a bit of figuring to determine where the points of the line segments ended, and then when I decided I wanted a new size I would have had to do it again. It's these little time savers that are great.
Then again, other times it just writes complete nonsense.
I've been using ChatGPT to help me learn Japanese. I copy-paste some Japanese text and ask it to break it down, explain the grammar, and translate the vocabulary. It works super well at parsing Japanese sentences, which aren't broken up with spaces, in a way that online dictionaries can't.
I think that this is like when people say that plastic surgery looks terrible: they never notice plastic surgery done well.
When an LLM gets the tone of voice perfect, it's indistinguishable from a human. By now you've certainly interacted with such an LLM.
Abstract
Recently, there has been considerable interest in large language models: machine learning systems which produce human-like text and dialogue. Applications of these systems have been plagued by persistent inaccuracies in their output; these are often called “AI hallucinations”. We argue that these falsehoods, and the overall activity of large language models, is better understood as bullshit in the sense explored by Frankfurt (On Bullshit, Princeton, 2005): the models are in an important way indifferent to the truth of their outputs. We distinguish two ways in which the models can be said to be bullshitters, and argue that they clearly meet at least one of these definitions. We further argue that describing AI misrepresentations as bullshit is both a more useful and more accurate way of predicting and discussing the behaviour of these systems.
Interesting, I'll have to dig in later. Idk if folks saw the "ChatGPT doesn't know what letters are" discourse, but amidst someone on Bluesky insisting that lawyers should use AI for discovering "relevant" legal information in other languages and translating it, a side conversation appeared demonstrating how ChatGPT 3.5 and 4.0 are both incapable of listing the states that start and end with the same letter, or that contain an "M", or whatever. It'll give partially correct answers but omit others, and when corrected will give a different subset of correct but incomplete answers, or flat-out incorrect ones (California has no "m").
Framing that as "bullshit" makes sense so thanks for this!
One I've seen before is asking it to count the number of letters in a phrase or fit a response into a character limit. It breaks the illusion fairly well: a human with actual cognition has no trouble knowing what characters are and counting them, which should be a trivial task for a mathematical machine, but it's incapable because all it does is select statistically likely words in a sequence.
I have seen this come up from time to time as a criticism, but I don't think it's very fair to grade LLMs on this sort of task. The human brain can think of things without saying (or writing) those thoughts. LLMs have no ability to think without writing. Their short-term memory is limited to the input given by the user and the output they print out. If you ask a human to write a 12-word poem, they will probably try a few combinations in their head, count the words, and reject the combinations that don't meet the 12-word requirement. They will probably NOT be able to come up with a poem that is 12 words without trying more than one combination. And the human brain has entire sections dedicated to counting and mathematics. LLMs have no such section of their "brain".
If we are willing to work with the limitations of current AI models, they become significantly better at this sort of task. One way is to allow them to "collect their thoughts" by writing arbitrary text that is NOT sent to the user in the conversation. Current AI chat platforms do not allow for this, but we can work around it by having the LLM delineate between "spoken" and "thought" words.
For some examples, I asked llama3:8b for a 12-word poem. First response:
Golden sunsets fade slowly, summer's warmth gives way to winter's chill.
11 words. I responded with this prompt:
Count the words in that poem. If it is not exactly 12 words, write a new poem with 12 words
I apologize for the mistake! Here are the words in the original poem:
Golden
sunsets
fade
slowly
summer's
warmth
gives
way
to
winter's
chill
The original poem has 11 words, not 12. Here is a new poem with exactly 12 words:
Softly
falls
the
snowflake
gently
landing
on
my
outstretched
hand
in
silence
By counting out the words, it is able to fit the requirements I gave it. With the correct prompt, you can even get it to perform this introspection itself without additional prompting. To use anthropomorphic terms, we have not yet taught LLMs introspection, and we see the errors that would be expected of humans who were never taught introspection. It is possible that more training or more advanced models will have this introspection built in, but for now we have to use creative prompts to account for this deficiency.
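In code, that workaround looks something like the loop below: ask, count outside the model, and re-prompt until the constraint is met. ask_llm() is a placeholder for whatever model API is being called.

```python
def ask_llm(prompt: str) -> str:
    """Placeholder for a call to llama3, ChatGPT, or any other chat API."""
    raise NotImplementedError

def poem_with_word_count(target: int = 12, max_attempts: int = 5) -> str:
    """Re-prompt until the poem actually has the requested number of words.

    The counting happens outside the model, which is exactly the part
    the model can't do reliably on its own.
    """
    prompt = f"Write a poem that is exactly {target} words long."
    for _ in range(max_attempts):
        poem = ask_llm(prompt).strip()
        if len(poem.split()) == target:
            return poem
        prompt = (
            f"That poem had {len(poem.split())} words. "
            f"Write a new poem with exactly {target} words."
        )
    raise RuntimeError(f"No {target}-word poem after {max_attempts} attempts")
```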
Just ask it to play tic-tac-toe; it can't remember the board. Also, if you tell it you won, it will agree. Then you can tell it that actually you were lying and you lost, and it will agree.
Another thing it wasn't able to do: tell me ten jokes without repeating them, and then tell me ten more jokes without repeating them. It'll keep apologizing and promising to do better and do it again and again.
Yeah it kept apologizing for giving me 3 states, and then giving me 2, and then 3 and then 2 and then finally Ohio, but just Ohio, etc. and that was just me recreating what others had done. My few free "4.0" answers weren't better.
Thinking of it as a bullshitter makes a lot of sense to me.
This reminds me of when I was using Code Interpreter and it made mistakes writing code so many times that after a while, it learned the pattern and always assumed it was wrong.
Here's how I explain it: to an LLM, repetition is a pattern. If it sees a lot of repetition, it decides that it's most likely autocomplete, so it will continue it. This is one reason a random number generator is often used, to try to add enough variation to break some patterns.
So it doesn't pay to ask it to correct errors more than once or twice. Better to edit your prompt and try again, or start a new chat session.
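To make the "random number generator" part concrete, here's a toy sketch of temperature sampling, the usual way randomness is injected when the next token is picked:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Pick a token index from raw model scores (logits).

    Low temperature sharpens the distribution (more deterministic, more prone
    to repeating itself); high temperature flattens it (more variation, more
    likely to break out of a repetitive pattern).
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Toy scores for three candidate next tokens:
print([sample_next_token([2.0, 1.0, 0.5], t) for t in (0.2, 1.0, 2.0)])
```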
Fwiw, I did edit the prompt too, and no improvement. That is how it identified Ohio, but it never could get Ohio and Arizona on the same list.
It's why bullshitting defines it for me. I know it's autocompleting. It's funny how bad it is at doing so. I was just entertaining myself by attempting to replicate what others had already shared screenshots of.
Depending on the model you were using, this could have just been an artifact of a limited context length. If the older jokes were not being passed to the model because of the context length, it is literally impossible for the model to remember them (assuming nothing like RAG on items outside the context window, but I haven't heard of a system that does that).
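A rough sketch of the kind of history trimming I mean, with word count standing in for a real tokenizer:

```python
def trim_history(messages: list[str], budget: int = 2048) -> list[str]:
    """Keep only the most recent messages that fit in the context budget.

    `messages` is oldest-first. Word count stands in for real tokens here;
    the point is that the oldest turns (the earlier jokes) silently drop out
    and are never sent to the model at all.
    """
    kept, used = [], 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = len(msg.split())
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore oldest-first order
```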
I tested this "10 jokes without repeating" request on a locally run llama3 8b and it successfully gave me 20 jokes without repeating any. And for the state names with M, it did pretty well. It seemed to only be able to list states that started with M, but since the letters within words aren't how these LLMs work, it isn't surprising that they aren't good at that task. Judge a fish by its ability to climb a tree and all that.
Someone tried to claim it only got capital Ms, but it missed New Mexico in that example. I'm assuming it's bullshitting by using the answers to "what states start with M", but it's wild how you can get it into a loop of apologizing and giving a worse answer, then apologizing and giving the first answer, etc., all wrong answers.
I think folks looking for a silver bullet are right that it's bullshit for that. But as someone with ADHD, it's a lifesaver for me organizationally. I have all this info trapped up in my head that I usually just dump onto the page in a confusing, haphazard manner. It used to take lots of editing and re-editing and a whole bunch of folks helping me focus and narrow my points. Now I use ChatGPT to give me the structure of a grant or a white paper or a report, and then I go in and flesh it out. There are still folks reviewing and editing, but the lift on their end is much lighter.
I liked when folks were calling it the "drunk intern". In an age where few people proofread anymore, of course there are going to be issues. But if it's used as a tool with a specific aim, I think it's pretty helpful. The problem with the articles and op-eds on AI is that they act like it can do everything, and that is bullshit. But it has some really powerful uses if you are aware of its drawbacks and flaws, and it can enable some pretty big jumps in productivity. Grants used to take me about a month to get into submittable order; that has been cut down to about a week with ChatGPT's help.
This just seems like people not using it properly. It's good at writing tedious text: emails, cover letters, memos, etc. It can write decent stories, poems, and so on. It can help brainstorm ideas, and it's pretty good at analyzing text you feed it in various ways.
But it's not Google and was never meant to be.
It's a great Google butler as long as you wrap your prompt in stipulations. E.g. today I asked it if male pattern baldness is transferred via the mother, and specified "only refer to peer reviewed articles and cite your sources".
Its answer was, indeed, based on peer reviewed articles and it cited 6 different peer reviewed articles that all pertained to my question.
You can't just rely on it without any fact checking or skepticism, but I've found it extremely useful in my day to day life honestly. I got the paid version while I was job hunting to help with cover letters, but ever since 4o I'm using it so much that I decided to keep paying for it. YMMV of course.
I'm sure you did this in this instance, but reminder to everyone here to be sure to check whether any sources ChatGPT cites actually exist! Because it absolutely will make them up.
Oh yes, sorry if that wasn't clear - I indeed specifically went to each individual source and confirmed that 1) they were peer-reviewed articles published in respected journals; 2) each abstract was indeed relevant; and 3) none of the abstracts went against ChatGPT's conclusions, and they were complementary to the arguments ChatGPT was making.
Follow the advice above folks - verify like it's 2009 Wikipedia and you could just add yourself to a wiki page pretending to be Tom Cruise's cousin!
Until about a week ago, I had this exact opinion about ChatGPT, but then I realized what a blessing it is for developers.
ChatGPT could mean different things for different folks. I consider it in the same league as PHP, Bootstrap, Google Translate, font-awesome, etc., i.e., infrastructure tools and resources web developers can use to solve their problems.
Especially as a freelance programmer, I'm finding it very useful. It reduces a lot of my grunt work.
For example, I just have to say, "Please create a Bootstrap form with labels and inputs with the following selectors. And please add a placeholder attribute that is the same as the label text." And voila! Instead of me toiling for half an hour in Notepad++, ChatGPT gives me the code ready-made!
I also discovered that ChatGPT can double up as Google Translate! The other day, I had to converse with a Spanish client, and all I had to ask was, "Dear ChatGPT, please translate the following text to Spanish".
In the old days, they used to say that Google was omniscient. With respect to ChatGPT, the only analogue I can think of is the story of Aladdin and the bottled genie!
Trusting ChatGPT (or even Google Translate) for speaking with customers in a language you don't understand is a little scary, imo.
That's usually true, but in this case I'd already informed the client at the start that I use translation software to converse, and they were fine with it.
chatgpt is good with meal plans, double feature movie nights with a theme, and a lot of related stuff. it also wrote me some great github actions.
i’m still a hater on some fronts, but it’s slowly winning me over with mundane decisions.
I've only used ChatGPT for two things:
I like to keep a copy of my virtual textbook locally on my computer, but everything is browser based and they even restricted the number of pages you can print. So I copy-paste everything, which is a little annoying but it slows me down enough to read the material as I paste it. However, if I try to paste any tables, the formatting never carries over properly and it's a mess. So I tell ChatGPT to make the text a table (or bulleted list) and it does it.
The other thing is initial drafts for cover letters when applying for jobs. Most of those land in the digital garbage bin before being read, so I'm fine with getting help with them.
Edit: My point is that I only use it for very, very basic tasks. I've used it for starting points for hobby programming projects, and it seems to have worked well, but when you start to dive into more specialized areas or things that are more specific to your environment, it starts to fall apart.
I'm going to admit to having not read the whole thing, but I thumbed through it and hit most paragraphs.
I generally agree with the sentiment, and it reminds me of my Business Intelligence class, in which we briefly covered language modelling for business analysis, things like sentiment analysis to identify the vibe of reviews, and even predictive text and I believe generative models as an example. Turns out it was a bit of a primer on how Large Language Models work back in 2019.
That section stuck, and this article led my thinking to shift a bit, at least in how I would articulate how LLMs work. The "they use the most likely word" explanation that people give is sort of accurate, but the model also analyzes the input, its sentiment and meaning, before generating a response.
The shift in thinking for me is this notion of "bullshit," and it reminded me of my job. I'm an "IT Consultant," which is to say a dude paid to do the work our clients don't want to do, or don't have the capacity for. I, and many of my coworkers, were hired on knowing nothing, just having knowledge articles and Google (so, everybody's public-facing knowledge articles), and completing an internship. I actually use Bing Chat to start every search if I get stuck at work, not because I suck at Googling, but because so much stuff is simply hard to google, especially in today's SEO dystopia. Bing Chat will go "Here's a solution and some links," and use real terminology I was unfamiliar with or hadn't encountered before.
I'm generally bullshitting to my clients in the same way the model is bullshitting to me. We're both predicting the need of the end-user based on an aggregate of information, but where I draw comfort in my job security is I can apply a "reasonableness" measure to it. I can't quantify it as "this is 80% likely," but I can look at a response on a forum, extrapolate from a Microsoft Learn article, or use Copilot output and links and give it a sniff test for my scenario, cross-reference with my experience, documented practice, or even the aggregate as I can perceive it, and propose a next step in a problem.
A note: I don't use LLMs to write any of my communications. If a client asks me a question, I won't hit ChatGPT and regurgitate the response; I will take some time to find the proper answer and offer a response I am confident in.
OP - mind posting a summary or commentary or something? 8000 words is a lot.
The conclusion is just 229 words and summarizes the, well, conclusion of the article concisely. But if even that is tl, the tl;dr is that "truthhood" is neither a design goal nor an actual property of LLMs, and the word we use for language produced without concern for the truth is "bullshit".
Honestly, judging by the abstract and introduction, and a quick skim through the paper itself, this seems to just be another pedantic argument about how we should refer to AI output. Unfortunately, wrapping a blog post in a white paper format doesn't make the content any more insightful.
It's probably meant to be tongue-in-cheek, but I think there are more interesting papers to read on AI.
Token generators with self-attention mechanisms (i.e., LLMs) are definitely still AI, as defined in the field. They're the latest step beyond traditional machine learning and neural nets, which are also considered to be AI.
They're not AGI, but nobody is claiming that they are (except perhaps for Blake Lemoine). I haven't seen anybody else make the argument that these chatbots are conscious.
The term "hallucinate" stems from a technical article written by one of the foremost scientists working in this area. It was adopted by others in the field and now describes a specific class of problem. Arguing over the language feels like prescriptivism at best, and bikeshedding at worst.
Personally, the term "hallucinate" reminds me of Do Androids Dream of Electric Sheep?, and I like that it removes any assumptions of intent or veracity. But it doesn't really matter what gets used, so long as people understand the limitations and capabilities of these tools.
For the record, I (and at least some others in the field, fwiw) dislike the term hallucinations because it carries connotations that the model intends to convey the truth and has a knowledge or perception problem -- after all, this is what we're typically talking about when we use the term for humans. That framing introduces a lot of very faulty assumptions, especially for laypeople. The problem, as pointed out in this article, is that these models have zero regard for whether the information they present is true or not. Any veracity in the information conveyed by the generated utterances is a useful side effect of the things these models are actually trained for. I think the term hallucination obfuscates the fact that there is no fundamental underlying difference between what the model is doing when its information is accurate and when it isn't.
Of course, "hallucination" has become the industry term, so I still use it to discuss this problem like anyone else. But criticizing particular technical terminology for being inadequate, confusing, or misleading isn't "prescriptivism". In my undergraduate thesis I had to spend paragraphs disambiguating between different things referred to as a "topic" in the literature, and linguists have no qualms commenting on bad terminology within our field. Whether that's ever effective at actually changing the terminology is another matter.
Any existing term you use is going to carry connotations that aren't accurate, because the language we have as humans was not developed with LLMs in mind; they didn't exist.
Just like how, before aircraft existed, the word "pilot" meant a local expert who guided a ship into harbor, not someone who directly steers a vessel; now it means both things.
"Bullshit" isn't exactly accurate either, and it's probably less accurate than hallucinate, because bullshitting is universally something done with the intent to deceive. LLMs don't mean to deceive us. They don't mean to do anything, they're machines with no motivations whatsoever.
I agree that "bullshit" carries similar problems, but I still think the existing term is worthy of criticism. I think it's possible to choose analogical language like this in more mindful ways -- "stochastic parrots" is an example of a paper that did this effectively imo.