64 votes

ChatGPT's odds of getting code questions correct are worse than a coin flip

46 comments

  1. [7]
    hushbucket
    Link

    I'm not a pro but I code here and there when required professionally and for fun. ChatGPT has been incredibly helpful for me. I guess I'm in the sweet spot: doing beginner to intermediate level things and I understand the basics. Like u/friendly said, I'm not really asking it for canned code snippets. I'm asking for syntax and the right language/terms to dig deeper with. I use Google the same way, but the hit rate is awful when you're not using the right keywords. You can start a conversation off sloppily with ChatGPT and keep refining it until you get value. That's the biggest boon imo, how it uses your previous inputs (its "memory"). Great tool.

    59 votes
    1. [3]
      DeaconBlue
      Link Parent

      I am with you here, in the context of only asking for a jumping-off point or modifying code rather than "please write me a full program to do this thing".

      Some examples have been:

      • Finding syntax for registering a service bus connection with the .NET DI container

        • I assume that there is an MSDN article I could have just as easily copied and pasted from
      • Replacing some logic for trajectory calculations from Z-up to Y-up (see the sketch after this list)

        • The computer is less likely to miss a negative or positive swap over 80 lines of code than I am
      • Finding the name of an algorithm to search for to solve a problem
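
      For illustration, the Z-up to Y-up change in the second bullet boils down to an axis swap plus a sign flip. A minimal Python sketch of that idea (assuming both frames are right-handed, which is exactly the convention detail that is easy to get wrong over 80 lines by hand):

      def z_up_to_y_up(point):
          # Map a point from a Z-up frame to a Y-up frame, assuming both frames
          # are right-handed: the old Z axis becomes the new Y axis, and the old
          # Y axis becomes the new -Z axis. Other tools use other conventions,
          # so the signs should be double-checked against yours.
          x, y, z = point
          return (x, z, -y)

      # A point one unit "up" in the Z-up frame is one unit along Y afterwards.
      assert z_up_to_y_up((0.0, 0.0, 1.0)) == (0.0, 1.0, 0.0)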

      19 votes
      1. teaearlgraycold
        Link Parent

        It's really good when you're learning a new framework or language and can state the problem but don't have the vocabulary yet to form a good Google search query.

        12 votes
      2. Fortner
        Link Parent

        I've had the opposite experience. I was simply trying to figure out how to exclude some symbols in an input, but include others. I spent hours asking ChatGPT. It gave me hundreds of results that were all different and none of them worked.

        5 votes
    2. [2]
      Rudism
      Link Parent

      I've been kind of impressed with the phind.com model:

      • first use AI to clean up the user's question into something that will give good search results
      • perform the web search (limiting to sites that will give good answers to the question, such as (arguably) Stack Overflow)
      • feed the original question and the content of the top results back to the AI and get it to answer
      • allow the user to ask further or clarifying questions

      I've found that it can come back with useful (if not always correct) answers to even pretty esoteric questions about languages or features of languages that there is generally little information about in the LLM's training data (which ChatGPT proper would either refuse to answer or hallucinate nonsense wholesale). But I agree it's still just a jumping-off point for further research and won't actually write working code snippets for you.
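
      For illustration, a minimal Python-style sketch of that loop might look like the following; ask_llm and web_search are hypothetical placeholders for whatever model and search backend phind actually uses, so this is just the shape of the flow, not their implementation:

      from typing import Callable, List

      def answer_with_search(
          question: str,
          ask_llm: Callable[[str], str],
          web_search: Callable[[str], List[str]],
      ) -> str:
          # 1. Use the model to clean the user's question up into a good search query.
          search_query = ask_llm(f"Rewrite this as a concise search query: {question}")

          # 2. Run the web search (the real service limits it to sites likely to hold
          #    good answers, e.g. Stack Overflow) and keep the top results.
          top_results = web_search(search_query)[:5]

          # 3. Feed the original question plus the retrieved text back to the model.
          context = "\n\n".join(top_results)
          return ask_llm(f"Question: {question}\n\nSources:\n{context}\n\nAnswer using the sources.")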

      10 votes
      1. hushbucket
        Link Parent

        Very cool. First time hearing about phind, so I'll have to go check it out.

    3. Dr_Amazing
      Link Parent

      This is my experience as well. "Write me code for this" is pretty iffy, especially for a larger program.

      But it's pretty good at "read this code and tell me why it's not working."

      And really helpful with "explain the difference between these two things" or "what's the syntax for this thing."

      6 votes
  2. [7]
    friendly
    Link

    For context, I'm a working developer with a long tech history but less than 1 year of programming experience.

    When I've used ChatGPT to help me write code, the more complicated the context is, the less successful it has been in providing anything close to a solution.

    If I ask ChatGPT for simple programming logic it is helpful at giving me an overview of what a solution could look like. Sometimes it works out of the box, but more often I am just using it as a sounding board for ideas and syntax.

    14 votes
    1. [6]
      merry-cherry
      Link Parent

      Yeah, it's worked well as a more advanced autocomplete. Trying to get a whole block of software out of it is fraught with peril. It's a real coin toss to ask it to optimize code too. I've had it simply break the code in very nonsensical ways.

      7 votes
      1. [2]
        prairir001
        Link Parent

        Lately I have been using it for advanced autocomplete. I've noticed it's much, much better in well-known languages like Rust. I've been trying it with Zig and it often spits out nonsense.

        3 votes
        1. merry-cherry
          Link Parent

          Oh yeah, it is definitely limited to the most popular languages. This is because it doesn't understand the language at all and is simply regurgitating a guess based on patterns. So uncommon languages simply have a lot less data to mimic from.

          6 votes
      2. [3]
        ComicSans72
        Link Parent

        Is there some advantage to it over Copilot then? Just cost?

        1. merry-cherry
          Link Parent

          Copilot is GPT-3/4 with the prompts built in. You can have a conversation with GPT directly, like saying "hey this sucks", which you can't do in Copilot, but you have to actually paste in all of the relevant code. I'd say Copilot is better for programmers and GPT is better for learners.

        2. skybrian
          Link Parent

          Sometimes you want to ask questions. For example, I sometimes ask GPT4 to suggest alternative ways to do something and pick between them myself, which is different from having Copilot autocomplete its best guess on how to do something based on context and randomness.

          (I think Copilot might have a chat dialog too, but I haven't tried it yet.)

  3. [3]
    honzabe
    Link

    I did a cursory reading of both the article and the paper and I only see "ChatGPT" mentioned without specifying whether they used the free version or paid GPT-4 version. That should be the first and most important fact mentioned because in my experience the results would be radically different. I would easily believe the results if they used GPT-3.5, I would be extremely skeptical if they said they used GPT-4.

    12 votes
    1. [2]
      lel
      Link Parent

      According to the paper they used 3.5. That being said, in my experience with GPT-4, it is overhyped and hallucinates just as often, though I would really like to see this repeated with 4.

      7 votes
      1. teaearlgraycold
        Link Parent

        4 is noticeably better in my opinion. But it’s still just an LLM.

        4 votes
  4. skybrian
    Link

    When you’re trying to solve a mystery, a source of hints that is 50% accurate is a really good source. This is like appearing on the top half of the front page of a Google search.

    11 votes
  5. [4]
    Fortner
    Link

    "Our analysis shows that 52 percent of ChatGPT answers are incorrect and 77 percent are verbose," the team's paper concluded. "Nonetheless, ChatGPT answers are still preferred 39.34 percent of the time due to their comprehensiveness and well-articulated language style." Among the set of preferred ChatGPT answers, 77 percent were wrong.

    This just furthers my theory that there is little difference between AI and politicians in how they act, and how people react to them. Even though they're wrong most of the time, people still vote for them.

    7 votes
    1. [3]
      Grayscail
      Link Parent

      I think it's very accurate. Politicians ultimately say whatever will get them approval. ChatGPT ultimately is trained to say whatever will sound convincing to a human. To some extent, that means gaining the human's approval.

      6 votes
      1. [2]
        tanglisha
        Link Parent

        I wonder how long it will be before someone manages to push an LLM through an election.

        1. g33kphr33k
          Link Parent

          America had a bright orange LLM that played golf a lot.

          It was very verbose, said mostly the wrong things, managed to get a lot of human approval and turned the US inside out with a silly slogan.

  6. [3]
    domukin
    Link

    I don’t know much about coding, but I suppose it is as valid a field as any to judge ChatGPT’s prowess. In my limited non-scientific experience, it tends to get even basic arithmetic wrong but will be very confident/convincing and will on occasion make up excuses when corrected.

    5 votes
    1. teaearlgraycold
      Link Parent

      It’s not a stupid AGI. It’s a smart language model. Don’t expect it to get arithmetic right. It’s good at language tasks. Incidentally GPT4 has modeled some simple systems relatively well.

      6 votes
    2. honzabe
      (edited )
      Link Parent

      I wish people specified whether they mean GPT-4 or GPT-3.5. I see your comment the same way I see the linked article - if you mean GPT-3.5, I don't doubt that. If you are talking about GPT-4, I would like to ask you to kindly link to some examples.

      4 votes
  7. [3]
    Grumble4681
    Link

    If you could train a coin to code for you correctly even half the time, I'd suspect you'd feel quite lucky.

    Others mentioned being coders in some form or another and their experiences; I'll add mine as someone who isn't a coder. The most I've ever really done is read documentation on AutoHotkey and put together some very basic shit to automate mouse or key presses, that type of thing.

    I tried doing that with ChatGPT recently (getting it to make a few AutoHotkey scripts for me; even though I knew I could do it myself, I figured it might save me time) and it was moderately OK, but often there was something wrong that I had to fix, and sometimes that was my fault for not knowing how it processes requests. I would tell it I wanted it to do X, Y, and Z, and often it could do those things, but it would also prevent me from doing A, B, and C as part of normally using my PC, when I had assumed it would implicitly understand that I didn't want it to interfere with normal use of the machine.

    So if I told it I wanted the script to use a specific key as a toggle to trigger a sequence of other keypresses or movements, but only while a specific application is active/focused, it would do that, but it would also make that key inoperable when the application isn't in focus, even though if I were writing the script myself I would have understood that I still wanted the key to work as normal outside that application. At that point I started learning to phrase my requests better, and it would try to meet them, but either I wasn't being specific enough or it just couldn't accomplish it.

    Sometimes I would use vague instructions, something like 'not prevent normal use of the keys when the application is in focus', and you could see it attempt something like that, and on some occasions it worked or sort of worked, but in many cases it didn't. Even when it did manage to fix that one flaw, sometimes the adjustment introduced some other flaw. It was a little aggravating because I knew I could do it myself, but I was trying to learn how to give it instructions in a way that would get me the solutions I wanted. In some cases I suspect there might not be any magic words that make it output the right solution if the problem is too complex.

    5 votes
    1. [2]
      onyxleopard
      Link Parent

      I’ll note that this is fundamentally an issue with formulating requirements. It’s really hard to do this well and to communicate well-formulated requirements to software engineers. Some human software engineers make bad assumptions about implicit requirements or get bitten when they just ignore things that are not explicit in the specifications they are given. The best human software engineers are the ones who don’t even start coding until they have a good understanding of what the software is supposed to do. GPT-* and other current LLMs don’t have access to the wider context or experience of humans, so they either start coding without understanding the real requirements, or they get the implicit requirements correct by chance. (This happens with junior software engineers as well.) If your software applications are low stakes, though, going back and forth between describing your requirements and filling in implicit requirements when they are not met is a slower, but feasible, cycle to get to usable software.

      9 votes
      1. Grumble4681
        Link Parent

        Yeah, that's what I came to understand the more I used it: I was just going back and forth filling in different implicit requirements. I think that was the slightly aggravating part, because it was difficult for me to keep accounting for all of them. Basically I could tell it I want X, Y, Z, then find out right away after testing that it missed an implicit requirement. Then I'd give it another instruction, and other unresolved implicit requirements would reveal themselves once I got past the previous one. Then giving more specific instructions to deal with the newest revealed requirement isn't good enough, because you also have to keep accounting for the past ones. The first few can be fine, but once you get past that, it becomes a more complex web of instructions that is hard to convey.

        Eventually I just started using it to get the base of something and then finished it myself; that way I often didn't have to look up the documentation anymore to find the specific syntax or functions I needed, and I could account for my own implicit requirements more easily.

        2 votes
  8. chiliedogg
    Link

    As a less-than-novice, my odds of getting it right with a coin flip are zero, not 50/50.

    "Correct 48% of the time" is remarkable.

    4 votes
  9. [6]
    tomf
    Link

    chatGPT is popular in the spreadsheet world among people who have never used a spreadsheet before. it writes terrible formulas and, if they even work, they’re just not natural.

    for QUERY in sheets, it writes it like SQL instead of learning from the documentation.

    i run a sheets sub and automatically remove all mentions of this (contextual) trash.

    3 votes
    1. [5]
      bret
      Link Parent

      as one of those guys who has very limited spreadsheet knowledge, I used chatgpt to write this monstrosity:

      =LET(vb,LOOKUP(2,1/(K$2:K9<>""),K$2:K9),va,INDEX(FILTER(K9:K1999,K9:K1999<>"",""),1),db,INDEX(B:B,LOOKUP(2,1/(K$2:K9<>""),ROW(K$2:K9))),da,INDEX(B:B,INDEX(FILTER(ROW(K9:K1999),K9:K1999<>"",""),1)),vb+((va-vb)/(da-db+1)*(B9-db+1)))

      it works (takes values from a list of date/values and gives me interpolated values in a list of EVERY possible date), but SOMETHING is telling me there's a better way to write it :^)
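
      For what it's worth, the logic that formula encodes is plain linear interpolation between the surrounding dated values. A small Python sketch of the same idea (an illustration of the technique, not a translation of the exact formula):

      from datetime import date, timedelta

      def interpolate_daily(points):
          # points is a list of (date, value) pairs sorted by date; return a
          # (date, value) pair for every date in the range, filling the gaps
          # by linear interpolation between the surrounding known points.
          filled = []
          for (d0, v0), (d1, v1) in zip(points, points[1:]):
              span = (d1 - d0).days
              for i in range(span):
                  filled.append((d0 + timedelta(days=i), v0 + (v1 - v0) * i / span))
          filled.append(points[-1])  # keep the final known point
          return filled

      # Values on Jan 1 and Jan 4 fill in Jan 2 and Jan 3 in between.
      print(interpolate_daily([(date(2023, 1, 1), 10.0), (date(2023, 1, 4), 40.0)]))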

      3 votes
      1. [2]
        shusaku
        Link Parent

        This is my favorite use of ChatGPT: for writing in annoying languages. I’m never going to write an awk command or bash script again, it’s simply better at handling the syntax than I am.

        5 votes
        1. stu2b50
          Link Parent

          Anytime I touch nginx configs, ChatGPT is doing the work now. It's like an NP problem: I can verify that it's not doing something terrible in a short amount of time, so I'm not worried ChatGPT is going to give me some insecure server config, but if I had to go around and remember all the bullshit it's going to take me far more time than iterating with ChatGPT.

          2 votes
      2. skybrian
        Link Parent

        You could ask ChatGPT what the problems are in doing it this way, if there are alternative ways to do it, whether it could be simplified, and to explain how each part works.

        Don’t blindly trust the answers, though. You need to keep going until you either understand it or see the problem.

        3 votes
      3. tomf
        Link Parent

        haha. At least it used LET to shorten it a bit. It's funny that it used $ in the lookups for the starting row but nowhere else.

        1 vote
  10. [4]
    supported
    Link

    Yes this is my personal experience. ChatGPT is way smart and also way stupid.

    2 votes
    1. [3]
      Ashelyn
      Link Parent

      I think I've seen it best described as confident and authoritative regardless of how correct it actually is. Most language models are programmed to be authoritative because that style of prose is more convincing, and the companies that develop them want you to feel like you're speaking to an expert even if that's not really the case because it sells the experience better.

      8 votes
      1. [2]
        pedantzilla
        Link Parent

        I think I've seen it best described as confident and authoritative regardless of how correct it actually is.

        So, automated mansplaining.

        4 votes
        1. teaearlgraycold
          Link Parent

          It is trained on a lot of reddit comments…

          5 votes
  11. [3]
    0xSim
    Link

    Regarding code generation, I find Copilot to be magnitudes more useful than ChatGPT.

    ChatGPT can give me some high level ideas or patterns to solve a problem, but I find it almost useless with code. It simply takes too much time to write & tune questions & context, and at the end I have a buggy answer that I could have written better myself.

    Copilot has the full context of my project, gives suggestions in real time, and while it certainly won't write everything for me, it excels at boilerplate and boring code.

    2 votes
    1. [2]
      hushbucket
      Link Parent

      Copilot isn't available for free though, correct?

      1 vote
      1. 0xSim
        Link Parent

        Correct, it's $10/mo after a free trial. It's free if you have a popular open source project. The definition of "popular" isn't clear, but it's generally agreed it means 1,000+ stars.

        2 votes
  12. bret
    Link

    I've used ChatGPT to successfully write some basic SQL code (SQL being something I'm unfamiliar with). The prompt would be something like, "WRITE SQL code to, in XYZ table, remove columns A and B."

    Then, in Excel, I've had it write a formula which, for a list of dates and values, will grab the interpolated values of all dates between the given dates. I strongly suspect the code is not optimized in any way and generally super janky, as it looks like this:
    "=LET(vb,LOOKUP(2,1/(K$2:K9<>""),K$2:K9),va,INDEX(FILTER(K9:K1999,K9:K1999<>"",""),1),db,INDEX(B:B,LOOKUP(2,1/(K$2:K9<>""),ROW(K$2:K9))),da,INDEX(B:B,INDEX(FILTER(ROW(K9:K1999),K9:K1999<>"",""),1)),vb+((va-vb)/(da-db+1)*(B9-db+1)))"

    ... but it DOES work

    1 vote
  13. DanBC
    Link

    I'd be really interested to see the researchers do a similar analysis on a similar number of human-answered questions to see how accurate etc they are.

    Or maybe some attempt at blinding - here are 1000 answers, and here are the questions. Some of the answers were created by ChatGPT. How many of the answers are any good?

    1 vote
  14. schmonie
    Link

    Has anyone else noticed ChatGPT getting way worse lately? I understand OpenAI are very keen on safety, and have written posts/papers (sic) on what they call the “Alignment Tax” (which back in ye olde days of < 1 year ago we called, I think, “catastrophic forgetting”), but I find it has gotten consistently worse at programming in particular over time. Friends and colleagues who use it to validate math proofs also say it’s gotten worse there.

    My tin-foil-hat theory is that they are trying to upsell GPT-4 by making 3.5 “safer” and “faster” (meaning probably smaller and less performant).

  15. lou
    Link

    I'm still so impressed it gets half right...