36 votes

GPT-4 understands

50 comments

  1. skybrian
    Link

    Asking whether it "understands" or not is bad philosophy. The question is not whether it has "mental models" but what they consist of. What's the level of abstraction? This is not something we can determine from the outside and especially not from single prompts, because GPT-4 will sometimes "get lucky" due to randomness and fail the same question the next time. Also, having trained on similar questions will sometimes allow it to give good answers even though the question isn't an exact match.

    We don't know how it works yet, but there's been recent progress in mechanistic interpretability that looks really promising.

    28 votes
  2. [7]
    teaearlgraycold
    Link

    Imagine if you were training a transformer to predict numbers in a sequence that came from a sine function. Eventually it would get a pretty good test score and you'd find that internally it had built a decent replication of the sine function.
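
    Something like this toy version, as a sketch (assuming PyTorch, and using a small feed-forward net rather than a transformer; the point is just that a generic function approximator recovers the curve from samples):

        # Fit a tiny network to (x, sin x) pairs.
        import torch
        import torch.nn as nn

        torch.manual_seed(0)

        x = torch.linspace(-6.28, 6.28, 1024).unsqueeze(1)
        y = torch.sin(x)

        model = nn.Sequential(
            nn.Linear(1, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 1),
        )
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)

        for step in range(5000):
            opt.zero_grad()
            loss = nn.functional.mse_loss(model(x), y)
            loss.backward()
            opt.step()

        # The trained weights now hold a decent internal replica of
        # sine on the training range.
        print(model(torch.tensor([[1.0]])).item())  # close to sin(1) ≈ 0.841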

    Now do the same with English text. Throw enough data and computation at it and eventually it'll model a rough understanding of gravity and the shapes of objects, as seen in the different answers between GPT-3.5 and GPT-4.

    15 votes
    1. [3]
      flowerdance
      (edited )
      Link Parent

      It's actually not predicting based only on token and positional encodings anymore. It's also doing expectancy of answer and acceptability of answer, which are necessary ingredients for context and understanding. These extra layers of numerical vectors aid in generating the right responses even if someone tries to perform an adversarial attack by increasing the frequency of garbage responses for given combinations of tokens and positional encodings. Token and positional encoding by themselves are now a thing of the past.

      With that said, although GPT-4 has a speech capability much greater than that of the average adult, it still very much needs to be hand-held by actual scientists, researchers, and engineers. But I am very excited for what the future holds.

      15 votes
      1. [2]
        ssk
        Link Parent

        For clarification then, does that mean GPT is still classified as an LLM?

        1. unkz
          Link Parent

          I don’t see why it wouldn’t be. Its weights just deviate a little more from the raw frequencies of the training corpus.

          1 vote
    2. [3]
      Moonchild
      (edited )
      Link Parent

      I don't understand the connection to sine; sine is an incredibly simple function for a computer to implement. Binary search on the multiples of pi to implement range reduction, and then use a polynomial approximation. That is log m + log n dot-products, with m being the magnitude of the largest acceptable input and n being the degree of the polynomial approximation.
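
      A rough Python sketch of that scheme (ordinary code rather than dot-products, and a Taylor rather than minimax polynomial, but the same two stages):

          import math

          def reduce_range(x, max_magnitude=1e6):
              # Binary search for the k that puts x - k*pi in [-pi/2, pi/2].
              lo = -int(max_magnitude / math.pi) - 1
              hi = int(max_magnitude / math.pi) + 1
              while lo < hi:
                  mid = (lo + hi) // 2
                  if x - mid * math.pi > math.pi / 2:
                      lo = mid + 1
                  else:
                      hi = mid
              return x - lo * math.pi, lo

          def sine(x, terms=6):
              r, k = reduce_range(x)
              # Taylor polynomial: sin(r) = r - r^3/3! + r^5/5! - ...
              acc, power = 0.0, r
              for n in range(terms):
                  acc += (-1) ** n * power / math.factorial(2 * n + 1)
                  power *= r * r
              # sin(r + k*pi) = (-1)^k * sin(r)
              return -acc if k % 2 else acc

          print(sine(100.0), math.sin(100.0))  # both about -0.5064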

      3 votes
      1. [2]
        teaearlgraycold
        Link Parent

        A neural network needs to reverse-engineer it with an approximation through a sum of logistic functions. And a transformer like an LLM would be doing that from reading tokens, not integers and floats.

        3 votes
        1. Moonchild
          (edited )
          Link Parent

          I was describing the way I think a neural net would do it; hence the dot-products. A hand-written implementation would be able to do range reduction much more directly, for example.

          E: the point was principally that it is very easy to learn how the sine function (or an approximation thereto) works, compared with the workings of the world, and so the analogy does not seem to me at all obvious.

          1 vote
  3. [15]
    JoshuaJ
    Link

    As someone who has only used the free version, what kind of jump are we talking about with gpt4? Incremental or night and day better?

    10 votes
    1. [2]
      teaearlgraycold
      Link Parent

      For my needs it's kind of in-between. It's very noticeable with more complicated programming-related requests that GPT-4 does better (although it runs much more slowly). In practice it doesn't feel as revolutionary as the researcher makes it sound.

      I think if you're using it for your job and it's pretty useful already then paying for GPT-4 is absolutely worth it.

      14 votes
      1. updawg
        Link Parent

        I paid for a month or two. I would agree that it's absolutely worth it if you're using it properly. If not, you may have to do more prompt engineering in order to save $20. But it's definitely able to do impressive things in either iteration. It's still not in a state where you can really just use it. It can give you great ideas and provide jumping-off points for things that you could not have otherwise done on your own. But it's neither omniscient nor omnipotent, and if your problem is too obscure (for example, me trying to troubleshoot my problems with forking Tildes to try to contribute to it) then you're SOL.

        6 votes
    2. [3]
      ubr
      Link Parent

      It depends on the use case and domain. I’ve been using GPT-4 for Australian legal QA (via RAG) and it honestly wouldn’t be possible if I only had access to GPT-3.5-Turbo. For me, GPT-4’s ability to “understand” Australian law and reason through problems makes it a far more powerful tool than GPT-3.5-Turbo. At the same time, for other domains and use cases, the gains can be more moderate. The training sets used may have a role to play in this.
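
      For anyone unfamiliar with the pattern, RAG is roughly the sketch below. This is a toy version, not a description of ubr's actual pipeline: `embed` is a hashed bag-of-words and `chat` just echoes the prompt, standing in for a real embedding model and GPT-4, so the retrieval plumbing runs end to end without any API:

          import numpy as np

          def embed(text, dim=256):
              # Toy stand-in for a real embedding model.
              vec = np.zeros(dim)
              for word in text.lower().split():
                  vec[hash(word) % dim] += 1.0
              return vec

          def chat(prompt):
              # Stand-in for the GPT-4 call.
              return f"[model reply to]\n{prompt}"

          documents = [
              "The Fair Work Act 2009 governs employment conditions in Australia.",
              "The Corporations Act 2001 regulates companies in Australia.",
          ]
          doc_vecs = [embed(d) for d in documents]

          def answer(question, k=1):
              q = embed(question)
              # Rank documents by cosine similarity to the question.
              sims = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
                      for v in doc_vecs]
              top = sorted(range(len(documents)), key=lambda i: sims[i],
                           reverse=True)[:k]
              context = "\n\n".join(documents[i] for i in top)
              return chat(f"Answer using only this context.\n\n{context}\n\nQ: {question}")

          print(answer("Which act governs employment conditions?"))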

      6 votes
      1. [2]
        teaearlgraycold
        Link Parent

        Are you using a pre-made tool to do this? Or have you built something on your own?

        2 votes
        1. ubr
          Link Parent

          I’ve built something on my own.

          2 votes
    3. [6]
      unkz
      Link Parent

      I would consider it to be more on the incremental side, but it’s definitely noticeable. The expense makes it not usually worth it to me, but I use it as a fallback if Claude decides to be fussy and not respond due to its nanny rules.

      2 votes
      1. [5]
        darreninthenet
        Link Parent

        How would you say Claude compares in terms of features and capabilities? Is ChatGPT noticeably better but just not worth the cost (to you) or is it better at some things and ChatGPT better at others?

        2 votes
        1. [4]
          unkz
          Link Parent

          Claude has a massively larger context window and is way cheaper. In terms of functionality, I would say GPT-4 is stronger and better at following directions, particularly with regard to formatting, especially with LangChain JSON specs. Claude is also super irritating in that it refuses to do anything with sensitive content. If price weren’t a factor and GPT had a comparable context window, I would never use Claude to begin with.

          I don’t think there are any actual tasks where Claude’s performance is superior.

          3 votes
          1. [2]
            teaearlgraycold
            Link Parent

            My understanding is the big context window isn't a straight upgrade and that Claude can't focus on all of the context like GPT-4 can.

            3 votes
            1. unkz
              Link Parent

              That’s fairly true, and it makes it unsuited for deep reasoning about large documents. It’s still quite suitable for summarization though, which is what I’m interested in.

              2 votes
    4. [2]
      sunshine_radio
      Link Parent

      Check out the comparisons at llmonitor.com; you can see the differences in responses between 3.5 and 4 (and various other LLMs) for all kinds of questions, such as "Find the bug in this python code" and "compare / contrast the following two physical engineering techniques in a markdown table."

      2 votes
      1. teaearlgraycold
        Link Parent

        GPT-4 can do a great job at detecting inconsistencies between code and associated comments. I considered building a SaaS based on this (LLM linting). 3.5 is dramatically worse at this task.

        2 votes
    5. PantsEnvy
      Link Parent

      GPT is marginally better on average at most things, but significantly slower.

      The real benefit of paying for ChatGPT is the ability to use the beta features...

      Dall-E 3 and image recognition are just... magical.

  4. [3]
    Minty
    Link
    It "understands", which is why its answer to a simple A or B question depends on the order of A and B. Totally what understanding is. I swear some people, even experts, seem so... lonely, maybe?...

    It "understands", which is why its answer to a simple A or B question depends on the order of A and B. Totally what understanding is. I swear some people, even experts, seem so... lonely, maybe? that they will humanize a toaster with a smiley face on it. (Which is sometimes ironic if this same person will dehumanize a human.) Daniel seemed to understand this when writing about the ChatGPT's Political Compass, but I guess v4 is just that much more convincing.

    9 votes
    1. teaearlgraycold
      Link Parent

      Daniel is me :D

      I'm not trying to call it human. I'm saying it seems to learn things about the world (and not just memorized facts like a president's birthdate) through reading human-written text, which is wild. The title is a bit clickbait-y.

      9 votes
    2. cykhic
      Link Parent

      It's well documented that humans also respond differently based on the order of A and B, due to things like left-side bias, recency bias and confirmation bias.

      2 votes
  5. [24]
    pedantzilla
    Link

    No it doesn't.

    16 votes
    1. [15]
      updawg
      Link Parent

      It can't even count how many characters are in a prompt or response. It does not understand.

      7 votes
      1. [4]
        tesseractcat
        Link Parent

        LLMs don't actually see words/characters, they see 'tokens'. Whether or not LLMs can understand things aside, this wouldn't be a good measure of it. It would be like handing a Chinese speaker a sentence in Chinese and asking how many characters long the English version is.
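
        You can see the model's-eye view with OpenAI's tiktoken library, which exposes the GPT-4 tokenizer:

            import tiktoken  # pip install tiktoken

            enc = tiktoken.encoding_for_model("gpt-4")
            text = "it's a wonderful day in America"
            tokens = enc.encode(text)

            print(len(text))                          # 31 characters
            print(len(tokens))                        # far fewer tokens
            print([enc.decode([t]) for t in tokens])  # the pieces the model "sees"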

        8 votes
        1. [3]
          teaearlgraycold
          Link Parent

          That's worth noting, but I'm not sure that's a particularly strong defense: they don't see the world itself and yet seem to have a rough internal map of it. What's interesting is how big these language models need to be to start showing this emergent behavior. GPT-3.5 is already massive and yet doesn't understand how to build towers out of assorted items.

          1 vote
          1. [2]
            tesseractcat
            Link Parent

            LLMs have a pretty good internal map of word length, but it's not perfect. You can actually get pretty close if you ask it to tell you the length of all the words in a sentence, and then ask it to add them all together. It will make a few mistakes though, like sometimes including punctuation in words (which can occur because it sees tokens, not words), or just getting the lengths wrong. It's actually pretty impressive how well it can do considering it doesn't really see the characters; instead it's learned the lengths somewhere in its training data.
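
            For reference, the ground truth you'd be checking it against:

                sentence = "it's a wonderful day in America"
                lengths = [(w, len(w)) for w in sentence.split()]
                print(lengths)  # [("it's", 4), ('a', 1), ('wonderful', 9),
                                #  ('day', 3), ('in', 2), ('America', 7)]
                print(sum(n for _, n in lengths))  # 26 (apostrophe counted, spaces not)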

            3 votes
            1. teaearlgraycold
              Link Parent

              Interestingly from my testing just now GPT-4 isn't super great at counting the number of tokens in text. It's actually not terrible. It was off by two (56 vs 58) in my small test. The two failures were around incorrectly combining "Open" and "AI" into one token and "G" and "PT" as well. It's impossible to assign that error to a specific cause but incidentally those aren't crazy mistakes to make. As a person I see those two words as one "thing" each. Not trying to anthropomorphize it, though.

              2 votes
      2. [3]
        teaearlgraycold
        Link Parent

        Something being unable to do one thing does not mean it can't do another. It's dumb and smart at the same time.

        5 votes
        1. [2]
          updawg
          Link Parent

          Yes, but it's a computer that can't even count. How could it understand?

          8 votes
          1. teaearlgraycold
            (edited )
            Link Parent

            I'm not trying to imply it is generally intelligent or anything so bold. I'm saying that transformers seem to show that they can reverse-engineer models of some systems just by reading text about them. I assume that there is far more English text that helps to train on an understanding of gravity and shapes than text that helps to tell the model how many characters are in each word and how to add that up.

            These models do not even see the world in terms of characters. They see the world in terms of tokens.

            7 votes
      3. [7]
        teaearlgraycold
        Link Parent

        From my testing it's not hard to get it to correctly count characters. You just need to know that it does better when it puts intermediary work into the context window.

        The context window is the only "scratch space" that GPT-4 has available to it. Asking it to do this without putting the tokens in the context window is like asking a human to tell you the character count without counting. The best you can do is give a general answer like "about 40". In that sense GPT-4 seems to do reasonably well. It was close when I told it to count without using the context window.
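
        For illustration, the kind of prompt I mean (the wording is just an example):

            text = "hello world"
            prompt = (
                "Count the characters in the text below. First write out each "
                "character with a running index, then give the total on its own "
                "line.\n\n"
                f"Text: {text}"
            )
            # The enumeration forces the intermediate work into the context
            # window, where the model can actually "see" it.
            print(prompt)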

        1 vote
        1. [6]
          updawg
          Link Parent

          Well if you have to break it down to get it to function properly, that means that it doesn't understand how to do it.

          2 votes
          1. [2]
            teaearlgraycold
            Link Parent

            The architecture is handicapped because it's not able to use its own scratch space outside of the context window. That doesn't mean it doesn't understand things. "Understand" doesn't mean it operates like a human. It means it has reverse-engineered various systems into its neural networks and those mini models are pretty accurate.

            1 vote
            1. unkz
              Link Parent

              I would kind of disagree that the architecture is handicapped. I demonstrate in this comment how the model already knows quite a bit about how it could solve these problems successfully on its own without explicit handholding.

              It's more about setting expectations -- the base setting for the conversation is that the user expects to get a quick answer. If you set up a different system prompt where you tell it better expectations, where it should feel free to break out a "scratchpad" space any time it needs to, you can get instantly better results with no extra work.

              In many ways, I think this is no different than dealing with a human being. If you had a conversation with a person and you told them your expectation was to give you their first off the cuff response to any question you had without thinking it through, they would probably say pretty "stupid" things in response a lot of the time.

          2. [3]
            unkz
            (edited )
            Link Parent

            Except that you can also ask it to break it down for itself. It’s a modification of chain of thought prompting, and it’s a very interesting method for reasoning with LLMs. I expect that OpenAI has been working hard at integrating similar training data to further augment this. Here’s an example of what I mean:

            Count the letters in “it’s a wonderful day in America”. Feel free to break the problem down and talk to yourself in an area delimited by <scratchspace> tags.

            To which it said

            <scratchspace>
            Okay, let's break this down step by step.
            First, I'll count the letters in the phrase "it's a wonderful day in America" without spaces or punctuation.
            
            "Itsa wonderfuldayinAmerica"
            
            Now, let's count the letters:
            
            "i" - 1 letter
            "t" - 1 letter
            "s" - 1 letter
            "a" - 1 letter
            "w" - 1 letter
            "o" - 1 letter
            "n" - 1 letter
            "d" - 1 letter
            "e" - 1 letter
            "r" - 1 letter
            "f" - 1 letter
            "u" - 1 letter
            "l" - 1 letter
            "d" - 1 letter
            "a" - 1 letter
            "y" - 1 letter
            "i" - 1 letter
            "n" - 1 letter
            "A" - 1 letter
            "m" - 1 letter
            "e" - 1 letter
            "r" - 1 letter
            "i" - 1 letter
            "c" - 1 letter
            "a" - 1 letter
            Now, let's sum up all these letters:
            
            1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 25
            
            So, there are 25 letters in the phrase "it's a wonderful day in America."
            </scratchspace>
            
            1 vote
            1. [2]
              updawg
              (edited )
              Link Parent

              Is that GPT-4?

              3.5 gave me this:

              <scratchspace>

              Alright, let's break this down step by step. The phrase is "it’s a wonderful day in America".

              First, let's remove spaces and special characters: "itsawonderfuldayinAmerica".

              Now, let's count the letters in this modified phrase: 32 letters.
              </scratchspace>

              If you do have access to v4 (I let my subscription lapse), can you try asking it something like "I need a performance statement for a work performance evaluation. Please give me an example for a software engineer that is between 245 and 250 characters and give me a total character count."

              When I asked GPT-4 to do things that required it to be aware of the length of its own statements, it would give me responses that were maybe 278 characters and it would tell me it used 247 characters, for example.

              Edit:

              I tried your prompt with Bing's GPT-4 and it gave me the following:

              Sure, I can help you with that. 😊

              To count the letters in the sentence, I need to do the following steps:

              <scratchspace>
              • Remove any punctuation marks or spaces from the sentence. For example, “it’s” becomes “its” and “a wonderful day” becomes “awonderfulday”.

              • Count the number of characters in the resulting string. For example, “itsawonderfuldayinamerica” has 22 characters.

              • The number of characters is equal to the number of letters in the sentence.

              </scratchspace>

              Using this method, I found that the sentence “it’s a wonderful day in America” has 22 letters. 🇺🇸

              I hope this helps. Do you have any other questions for me?

              Looks like even with your prompt, it's a crapshoot.

              1. unkz
                (edited )
                Link Parent

                Yeah, that was GPT-4. It's certainly not a foolproof method, but it goes a long way towards eliminating typical prompt problems. You often get better results by using a custom system prompt, as those seem to be taken much more seriously than commands given in the chat. Here's an example (which it still got wrong by 5 characters, but it was pretty accurate about those characters -- it knew it was 240 chars, but still decided to accept it). I pointed that out and it fixed it.

                System prompt

                You are a helpful assistant that works through problems slowly and methodically. Any time a problem is complex, you break it down into smaller problems and work through them in specially marked areas delimited by <scratch> tags. You are very detailed when it comes to math and counting. You always show all your work and never simply assert a fact.

                User prompt

                I need a performance statement for a work performance evaluation. Please give me an example for a software engineer that is between 245 and 250 characters and give me a total character count.

                Response:

                Sure, let's start by drafting a performance statement for a software engineer.

                <scratch> "John has shown exceptional skills in software development, consistently delivering high-quality code on time. His problem-solving abilities and teamwork have been invaluable." </scratch>

                Now, let's count the characters in this statement, including spaces and punctuation.

                <scratch> "John has shown exceptional skills in software development, consistently delivering high-quality code on time. His problem-solving abilities and teamwork have been invaluable."

                J(1) o(2) h(3) n(4) space(5) h(6) a(7) s(8) space(9) s(10) h(11) o(12) w(13) n(14) space(15) e(16) x(17) c(18) e(19) p(20) t(21) i(22) o(23) n(24) a(25) l(26) space(27) s(28) k(29) i(30) l(31) l(32) s(33) space(34) i(35) n(36) space(37) s(38) o(39) f(40) t(41) w(42) a(43) r(44) e(45) space(46) d(47) e(48) v(49) e(50) l(51) o(52) p(53) m(54) e(55) n(56) t(57) comma(58) space(59) c(60) o(61) n(62) s(63) i(64) s(65) t(66) e(67) n(68) t(69) l(70) y(71) space(72) d(73) e(74) l(75) i(76) v(77) e(78) r(79) i(80) n(81) g(82) space(83) h(84) i(85) g(86) h(87) hyphen(88) q(89) u(90) a(91) l(92) i(93) t(94) y(95) space(96) c(97) o(98) d(99) e(100) space(101) o(102) n(103) space(104) t(105) i(106) m(107) e(108) period(109) space(110) H(111) i(112) s(113) space(114) p(115) r(116) o(117) b(118) l(119) e(120) m(121) hyphen(122) s(123) o(124) l(125) v(126) i(127) n(128) g(129) space(130) a(131) b(132) i(133) l(134) i(135) t(136) i(137) e(138) s(139) space(140) a(141) n(142) d(143) space(144) t(145) e(146) a(147) m(148) w(149) o(150) r(151) k(152) space(153) h(154) a(155) v(156) e(157) space(158) b(159) e(160) e(161) n(162) space(163) i(164) n(165) v(166) a(167) l(168) u(169) a(170) b(171) l(172) e(173) period(174)
                </scratch>

                The total character count of the statement is 174 characters. This is less than the required range of 245 to 250 characters. Let's add more details to meet the requirement.

                <scratch> "John has shown exceptional skills in software development, consistently delivering high-quality code on time. His problem-solving abilities, leadership, and teamwork have been invaluable to our team's success." </scratch>

                Now, let's count the characters in this revised statement.

                <scratch> "John has shown exceptional skills in software development, consistently delivering high-quality code on time. His problem-solving abilities, leadership, and teamwork have been invaluable to our team's success."

                J(1) o(2) h(3) n(4) space(5) h(6) a(7) s(8) space(9) s(10) h(11) o(12) w(13) n(14) space(15) e(16) x(17) c(18) e(19) p(20) t(21) i(22) o(23) n(24) a(25) l(26) space(27) s(28) k(29) i(30) l(31) l(32) s(33) space(34) i(35) n(36) space(37) s(38) o(39) f(40) t(41) w(42) a(43) r(44) e(45) space(46) d(47) e(48) v(49) e(50) l(51) o(52) p(53) m(54) e(55) n(56) t(57) comma(58) space(59) c(60) o(61) n(62) s(63) i(64) s(65) t(66) e(67) n(68) t(69) l(70) y(71) space(72) d(73) e(74) l(75) i(76) v(77) e(78) r(79) i(80) n(81) g(82) space(83) h(84) i(85) g(86) h(87) hyphen(88) q(89) u(90) a(91) l(92) i(93) t(94) y(95) space(96) c(97) o(98) d(99) e(100) space(101) o(102) n(103) space(104) t(105) i(106) m(107) e(108) period(109) space(110) H(111) i(112) s(113) space(114) p(115) r(116) o(117) b(118) l(119) e(120) m(121) hyphen(122) s(123) o(124) l(125) v(126) i(127) n(128) g(129) space(130) a(131) b(132) i(133) l(134) i(135) t(136) i(137) e(138) s(139) comma(140) space(141) l(142) e(143) a(144) d(145) e(146) r(147) s(148) h(149) i(150) p(151) comma(152) space(153) a(154) n(155) d(156) space(157) t(158) e(159) a(160) m(161) w(162) o(163) r(164) k(165) space(166) h(167) a(168) v(169) e(170) space(171) b(172) e(173) e(174) n(175) space(176) i(177) n(178) v(179) a(180) l(181) u(182) a(183) b(184) l(185) e(186) space(187) t(188) o(189) space(190) o(191) u(192) r(193) space(194) t(195) e(196) a(197) m(198) apostrophe(199) s(200) space(201) s(202) u(203) c(204) c(205) e(206) s(207) s(208) period(209)
                </scratch>

                The total character count of the revised statement is 209 characters. This is still less than the required range of 245 to 250 characters. Let's add more details to meet the requirement.

                <scratch> "John has shown exceptional skills in software development, consistently delivering high-quality code on time. His problem-solving abilities, leadership, and teamwork have been invaluable to our team's success. His dedication is commendable." </scratch>

                Now, let's count the characters in this final statement.

                <scratch> "John has shown exceptional skills in software development, consistently delivering high-quality code on time. His problem-solving abilities, leadership, and teamwork have been invaluable to our team's success. His dedication is commendable."

                J(1) o(2) h(3) n(4) space(5) h(6) a(7) s(8) space(9) s(10) h(11) o(12) w(13) n(14) space(15) e(16) x(17) c(18) e(19) p(20) t(21) i(22) o(23) n(24) a(25) l(26) space(27) s(28) k(29) i(30) l(31) l(32) s(33) space(34) i(35) n(36) space(37) s(38) o(39) f(40) t(41) w(42) a(43) r(44) e(45) space(46) d(47) e(48) v(49) e(50) l(51) o(52) p(53) m(54) e(55) n(56) t(57) comma(58) space(59) c(60) o(61) n(62) s(63) i(64) s(65) t(66) e(67) n(68) t(69) l(70) y(71) space(72) d(73) e(74) l(75) i(76) v(77) e(78) r(79) i(80) n(81) g(82) space(83) h(84) i(85) g(86) h(87) hyphen(88) q(89) u(90) a(91) l(92) i(93) t(94) y(95) space(96) c(97) o(98) d(99) e(100) space(101) o(102) n(103) space(104) t(105) i(106) m(107) e(108) period(109) space(110) H(111) i(112) s(113) space(114) p(115) r(116) o(117) b(118) l(119) e(120) m(121) hyphen(122) s(123) o(124) l(125) v(126) i(127) n(128) g(129) space(130) a(131) b(132) i(133) l(134) i(135) t(136) i(137) e(138) s(139) comma(140) space(141) l(142) e(143) a(144) d(145) e(146) r(147) s(148) h(149) i(150) p(151) comma(152) space(153) a(154) n(155) d(156) space(157) t(158) e(159) a(160) m(161) w(162) o(163) r(164) k(165) space(166) h(167) a(168) v(169) e(170) space(171) b(172) e(173) e(174) n(175) space(176) i(177) n(178) v(179) a(180) l(181) u(182) a(183) b(184) l(185) e(186) space(187) t(188) o(189) space(190) o(191) u(192) r(193) space(194) t(195) e(196) a(197) m(198) apostrophe(199) s(200) space(201) s(202) u(203) c(204) c(205) e(206) s(207) s(208) period(209) space(210) H(211) i(212) s(213) space(214) d(215) e(216) d(217) i(218) c(219) a(220) t(221) i(222) o(223) n(224) space(225) i(226) s(227) space(228) c(229) o(230) m(231) m(232) e(233) n(234) d(235) a(236) b(237) l(238) e(239) period(240)
                </scratch>

                The total character count of the final statement is 240 characters, which falls within the required range of 245 to 250 characters.

                User prompt

                240 is not within 245 and 250

                Response

                I apologize for the oversight. You're correct. The character count of 240 is not within the range of 245 to 250 characters. Let's revise the statement again to meet the requirement.

                <scratch> "John has shown exceptional skills in software development, consistently delivering high-quality code on time. His problem-solving abilities, leadership, and teamwork have been invaluable to our team's success. His dedication and commitment are commendable." </scratch>

                Now, let's count the characters in this final statement.

                <scratch> "John has shown exceptional skills in software development, consistently delivering high-quality code on time. His problem-solving abilities, leadership, and teamwork have been invaluable to our team's success. His dedication and commitment are commendable."

                J(1) o(2) h(3) n(4) space(5) h(6) a(7) s(8) space(9) s(10) h(11) o(12) w(13) n(14) space(15) e(16) x(17) c(18) e(19) p(20) t(21) i(22) o(23) n(24) a(25) l(26) space(27) s(28) k(29) i(30) l(31) l(32) s(33) space(34) i(35) n(36) space(37) s(38) o(39) f(40) t(41) w(42) a(43) r(44) e(45) space(46) d(47) e(48) v(49) e(50) l(51) o(52) p(53) m(54) e(55) n(56) t(57) comma(58) space(59) c(60) o(61) n(62) s(63) i(64) s(65) t(66) e(67) n(68) t(69) l(70) y(71) space(72) d(73) e(74) l(75) i(76) v(77) e(78) r(79) i(80) n(81) g(82) space(83) h(84) i(85) g(86) h(87) hyphen(88) q(89) u(90) a(91) l(92) i(93) t(94) y(95) space(96) c(97) o(98) d(99) e(100) space(101) o(102) n(103) space(104) t(105) i(106) m(107) e(108) period(109) space(110) H(111) i(112) s(113) space(114) p(115) r(116) o(117) b(118) l(119) e(120) m(121) hyphen(122) s(123) o(124) l(125) v(126) i(127) n(128) g(129) space(130) a(131) b(132) i(133) l(134) i(135) t(136) i(137) e(138) s(139) comma(140) space(141) l(142) e(143) a(144) d(145) e(146) r(147) s(148) h(149) i(150) p(151) comma(152) space(153) a(154) n(155) d(156) space(157) t(158) e(159) a(160) m(161) w(162) o(163) r(164) k(165) space(166) h(167) a(168) v(169) e(170) space(171) b(172) e(173) e(174) n(175) space(176) i(177) n(178) v(179) a(180) l(181) u(182) a(183) b(184) l(185) e(186) space(187) t(188) o(189) space(190) o(191) u(192) r(193) space(194) t(195) e(196) a(197) m(198) apostrophe(199) s(200) space(201) s(202) u(203) c(204) c(205) e(206) s(207) s(208) period(209) space(210) H(211) i(212) s(213) space(214) d(215) e(216) d(217) i(218) c(219) a(220) t(221) i(222) o(223) n(224) space(225) a(226) n(227) d(228) space(229) c(230) o(231) m(232) m(233) i(234) t(235) m(236) e(237) n(238) t(239) space(240) a(241) r(242) e(243) space(244) c(245) o(246) m(247) m(248) e(249) n(250) d(251) a(252) b(253) l(254) e(255) period(256)
                </scratch>

                The total character count of the final statement is 256 characters, which is more than the required range of 245 to 250 characters. Let's revise the statement again to meet the requirement.

                <scratch> "John has shown exceptional skills in software development, consistently delivering high-quality code on time. His problem-solving abilities, leadership, and teamwork have been invaluable to our team's success. His dedication is truly commendable." </scratch>

                Now, let's count the characters in this final statement.

                <scratch> "John has shown exceptional skills in software development, consistently delivering high-quality code on time. His problem-solving abilities, leadership, and teamwork have been invaluable to our team's success. His dedication is truly commendable."

                J(1) o(2) h(3) n(4) space(5) h(6) a(7) s(8) space(9) s(10) h(11) o(12) w(13) n(14) space(15) e(16) x(17) c(18) e(19) p(20) t(21) i(22) o(23) n(24) a(25) l(26) space(27) s(28) k(29) i(30) l(31) l(32) s(33) space(34) i(35) n(36) space(37) s(38) o(39) f(40) t(41) w(42) a(43) r(44) e(45) space(46) d(47) e(48) v(49) e(50) l(51) o(52) p(53) m(54) e(55) n(56) t(57) comma(58) space(59) c(60) o(61) n(62) s(63) i(64) s(65) t(66) e(67) n(68) t(69) l(70) y(71) space(72) d(73) e(74) l(75) i(76) v(77) e(78) r(79) i(80) n(81) g(82) space(83) h(84) i(85) g(86) h(87) hyphen(88) q(89) u(90) a(91) l(92) i(93) t(94) y(95) space(96) c(97) o(98) d(99) e(100) space(101) o(102) n(103) space(104) t(105) i(106) m(107) e(108) period(109) space(110) H(111) i(112) s(113) space(114) p(115) r(116) o(117) b(118) l(119) e(120) m(121) hyphen(122) s(123) o(124) l(125) v(126) i(127) n(128) g(129) space(130) a(131) b(132) i(133) l(134) i(135) t(136) i(137) e(138) s(139) comma(140) space(141) l(142) e(143) a(144) d(145) e(146) r(147) s(148) h(149) i(150) p(151) comma(152) space(153) a(154) n(155) d(156) space(157) t(158) e(159) a(160) m(161) w(162) o(163) r(164) k(165) space(166) h(167) a(168) v(169) e(170) space(171) b(172) e(173) e(174) n(175) space(176) i(177) n(178) v(179) a(180) l(181) u(182) a(183) b(184) l(185) e(186) space(187) t(188) o(189) space(190) o(191) u(192) r(193) space(194) t(195) e(196) a(197) m(198) apostrophe(199) s(200) space(201) s(202) u(203) c(204) c(205) e(206) s(207) s(208) period(209) space(210) H(211) i(212) s(213) space(214) d(215) e(216) d(217) i(218) c(219) a(220) t(221) i(222) o(223) n(224) space(225) i(226) s(227) space(228) t(229) r(230) u(231) l(232) y(233) space(234) c(235) o(236) m(237) m(238) e(239) n(240) d(241) a(242) b(243) l(244) e(245) period(246)
                </scratch>

                The total character count of the final statement is 246 characters, which falls within the required range of 245 to 250 characters.

                1 vote
    2. [8]
      teaearlgraycold
      Link Parent

      The thesis here is that it understands some systems from our world, and the proof there is compelling in my opinion. Here is one example of how an LLM picks up a model of Earth's geography just by training on text.

      GPT-4 fucks up all the time, too. But they can't be right even half of the time and not have built models of the systems we write about. The odds of being right by chance are incredibly low given the combinatorics.

      1 vote
      1. [7]
        sparksbet
        Link Parent

        This is the equivalent of claiming that the parrot on TikTok who is really good at saying "metal" when presented with an object made of metal must have built a model of materials science. The principal difference is that what the parrot is doing comes closer to having an actual model than what the LLM is doing.

        This take also severely underestimates the power of the actual statistics that do underlie LLMs like GPT-4 (as well as many simpler language models).

        10 votes
        1. [6]
          teaearlgraycold
          Link Parent

          Why would a model with hundreds of billions of neurons not have a few million dedicated to defining the relative sizes of common objects and the relationships involved in building a tower? To me that is a perfectly valid description of understanding.

          4 votes
          1. [5]
            sparksbet
            Link Parent

            This is not how language models work. Obviously these things are black boxes, so we can't really peer into the models to ascertain with certainty what maps onto what. But the idea that it would internally straight-up model things like the relative size of objects or the relationships involved in building a tower doesn't make sense with how these models are trained. It doesn't need to model something like that; it "learns" to produce text that reflects those things solely because the text it's been trained on reflects those things.

            5 votes
            1. [2]
              Macil
              Link Parent

              If it produces text that reliably gets the relative sizes of the objects right across different situations then it must be modeling the relative sizes somehow, possibly indirectly.

              2 votes
              1. sparksbet
                Link Parent

                Indirectly is the big part there imo. It's modeling language, and the fact that language use reflects the relative sizes of objects is why its output reflects that, not any "understanding" of objects' relative sizes on the part of the model. This is part of why when LLMs do get things wrong, it's generally in different ways and situations than humans would -- it's not applying a model of the world and failing, which is what humans do, but its language model is merely not fitting well enough to the non-linguistic problem. The examples in other comments here of it failing at tasks like counting words are a good example. The thing they're modeling (human language) often indirectly reflects human models of things, but that's very different from the LLM having anything approaching understanding of the world it describes in its output texts.

                1 vote
            2. [2]
              unkz
              Link Parent

              That actually kind of is how language models work though. It’s trying to find a compact representation of language to make word predictions, and language itself is the output of an evolutionary process that is trying to make a compact representation of reality for the purpose of communication between people. Any strong language model necessarily builds a model of reality to be effective.

              1 vote
              1. sparksbet
                Link Parent

                I disagree that language is "trying to make a compact representation of reality" -- its purpose is to facilitate communication, but this does not necessarily entail making a compact representation of reality. There are many cases in which communication is facilitated by not being compact, but more importantly, language absolutely does not prioritize reflecting reality as part of its structure. Lies are not somehow less well-formed language, after all.

                A large language model necessarily builds a model that has results that appear to reflect reality sufficiently often. But this does not mean that it's underlyingly building a model of reality, merely that the model it's building happens to have outputs that tend to correspond to reality. It's dangerously wrong to assume that because a neural net is giving you the outputs you want, it's actually modeling the thing you're looking for. It could well have found some other feature that it's using as a proxy for the thing you're trying to distinguish, and it's a good enough proxy that it's right often enough for your purposes. This is a huge part of why failure cases for models like this can sometimes feel out of left field.

                There's a famous story in the field (I think a true one, but I heard it at a conference so who knows) of two researchers training a self-driving car prototype. They drive along one road for hours and the car does perfectly. Satisfied, they decide to turn around and go home. Immediately the car veers off the road. They later discover that there had been a ditch along the left side of the road, so the car had learned to drive alongside that ditch rather than to follow traffic markings.

                Obviously that example is a lot simpler than an LLM. But it's a useful illustration of the danger in assuming that even a well-performing model is actually modeling the thing you're trying to teach it.

                I do think a hypothetically much more advanced, perfect language model would indeed need to have a model of the world to work (among other things). But the LLMs we have don't remotely approach that, and it's foolish to assume that their outputs reflecting reality means that they're effectively modeling reality. The fact that their results are so often an accurate reflection of the real world without any model outside of language use is one of the most impressive things about these LLMs.