ChatGPT's odds of getting code questions correct are worse than a coin flip
Link information
This data is scraped automatically and may be incorrect.
- Title: ChatGPT gets code questions wrong 52% of the time
- Published: Aug 7, 2023
- Word count: 1,201 words
I'm not a pro but I code here and there when required professionally and for fun. ChatGPT has been incredibly helpful for me. I guess I'm in the sweet spot: doing beginner-to-intermediate-level things, and I understand the basics. Like u/friendly said, I'm not really asking it for canned code snippets. I'm asking for syntax and the right language/terms to dig deeper with. I use Google the same way, but the hit rate is awful when you're not using the right keywords. You can start a conversation off sloppily with ChatGPT and keep refining it until you get value. That's the biggest boon imo: how it uses your previous inputs (its "memory"). Great tool.
I am with you here, in the context of only asking for a jumping off point or modifying code rather than "please write me a full program to do this thing"
Some examples have been:
Finding syntax for registering a service bus connection with the .NET DI container
Replacing some logic for trajectory calculations from Z-up to Y-up (see the sketch below)
Finding the name of an algorithm to search for to solve a problem
It's really good when you're learning a new framework or language and can state the problem but don't have the vocabulary yet to form a good Google search query.
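For the Z-up to Y-up one, the core change is a coordinate remap along these lines (a rough Python sketch, assuming both systems are right-handed and the old Y axis becomes the new -Z axis; the exact mapping depends on the engines involved, so treat it as illustrative only):

# Hypothetical sketch: convert a point from a Z-up coordinate system to a Y-up one.
# Assumes right-handed systems on both sides; a different engine pairing may need
# a different axis mapping or sign flip.
def z_up_to_y_up(point):
    x, y, z = point
    return (x, z, -y)

print(z_up_to_y_up((1.0, 2.0, 3.0)))  # -> (1.0, 3.0, -2.0)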
I've had the opposite experience. I was simply trying to figure out how to exclude some symbols in an input, but include others. I spent hours asking ChatGPT. It gave me hundreds of results that were all different and none of them worked.
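To be clear about the kind of thing I was asking for, it boils down to something like this (Python sketch; the kept and excluded symbols here are made up purely for illustration):

import re

# Hypothetical example: strip symbols from an input, but keep a chosen few
# (here hyphen and underscore); the real allow-list would differ.
def filter_symbols(text, keep="-_"):
    pattern = re.compile(r"[^\w\s" + re.escape(keep) + r"]")
    return pattern.sub("", text)

print(filter_symbols("hello, world_foo-bar!"))  # -> "hello world_foo-bar"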
I've been kind of impressed with the phind.com model:
I've found that it can come back with useful (if not always correct) answers to even pretty esoteric questions about languages, or features of languages, that generally have little coverage in an LLM's training data (questions ChatGPT proper would either refuse to answer or hallucinate wholesale nonsense for). But I agree it's still just a jumping off point for further research and won't actually write working code snippets for you.
Very cool. First time hearing about phind, so I'll have to go check it out.
This is my experience as well. "Write me code for this" is pretty iffy, especially for a larger program.
But it's pretty good at "read this code and tell me why it's not working."
And really helpful with "explain the difference between these two things" or "what's the syntax for this thing."
For context, I'm a working developer with a long tech history but less than 1 year of programming experience.
When I've used ChatGPT to help me write code, the more complicated the context is, the less successful it has been in providing anything close to a solution.
If I ask ChatGPT for simple programming logic, it is helpful at giving me an overview of what a solution could look like. Sometimes it works out of the box, but more often I am just using it as a sounding board for ideas and syntax.
Yeah, it's worked well as a more advanced autocomplete. Trying to get a whole block of software out of it is fraught with peril. It's a real coin toss to ask it to optimize code too. I've had it simply break the code in very nonsensical ways.
Lately I have been using it for advanced autocomplete. I've noticed it's much, much better in well-known languages like Rust. I've been trying it with Zig and it often spits out nonsense.
Oh yeah, it is definitely limited to the most popular languages. This is because it doesn't understand the language at all; it's just regurgitating a guess based on patterns, so uncommon languages have a lot less data to mimic.
Is there some advantage to it over copilot then? Just cost?
Copilot is GPT-3/4 with the prompts built in. You can have a conversation with GPT directly, like saying "hey, this sucks," which you can't do in Copilot, but you have to actually paste in all of the relevant code. I'd say Copilot is better for programmers and GPT is better for learners.
Sometimes you want to ask questions. For example, I sometimes ask GPT4 to suggest alternative ways to do something and pick between them myself, which is different from having Copilot autocomplete its best guess on how to do something based on context and randomness.
(I think Copilot might have a chat dialog too, but I haven't tried it yet.)
I did a cursory reading of both the article and the paper, and I only see "ChatGPT" mentioned, without specifying whether they used the free version or the paid GPT-4 version. That should be the first and most important fact mentioned, because in my experience the results would be radically different. I would easily believe the results if they used GPT-3.5; I would be extremely skeptical if they said they used GPT-4.
According to the paper they used 3.5. That being said, in my experience with GPT-4, it is overhyped and hallucinates just as often, though I would really like to see this repeated with 4.
4 is noticeably better in my opinion. But it’s still just an LLM.
When you’re trying to solve a mystery, a source of hints that is 50% accurate is a really good source. This is like appearing on the top half of the front page of a Google search.
This just furthers my theory that there is little difference between AI and politicians in how they act, and how people react to them. Even though they're wrong most of the time, people still vote for them.
I think it's very accurate. Politicians ultimately say whatever will get them approval. ChatGPT is ultimately trained to say whatever will sound convincing to a human. To some extent, that means gaining the human's approval.
I wonder how long it will be before someone manages to push an LLM through an election.
America had a bright orange LLM that played golf a lot.
It was very verbose, said mostly the wrong things, managed to get a lot of human approval and turned the US inside out with a silly slogan.
I don't know much about coding, but I suppose it is as valid a field as any to judge ChatGPT's prowess. In my limited, non-scientific experience, it tends to get even basic arithmetic wrong but will be very confident/convincing, and will on occasion make up excuses when corrected.
It's not a stupid AGI. It's a smart language model. Don't expect it to get arithmetic right. It's good at language tasks. Incidentally, GPT-4 has modeled some simple systems relatively well.
I wish people specified whether they mean GPT-4 or GPT-3.5. I see your comment the same way I see the linked article - if you mean GPT-3.5, I don't doubt that. If you are talking about GPT-4, I would like to ask you to kindly link to some examples.
If you could train a coin to code for you correctly even half the time, I suspect you'd feel quite lucky.
Others mentioned being coders in some form or another and their experiences; I'll add mine as someone who isn't a coder. The most I've ever really done is read documentation on AutoHotkey and put together some very basic shit to automate mouse or key presses, that type of thing.
I tried doing that with ChatGPT once recently (getting it to make a few AutoHotkey scripts for me; even though I knew I could do it myself, I figured it might save me time) and it was moderately OK, but often there was something wrong that I had to fix, and sometimes that was my fault for not knowing how it processes requests. I would tell it I wanted it to do X, Y, and Z, and often it could do those things, but it would also prevent me from doing A, B, and C as part of normally using my PC, when I thought it would implicitly understand that I didn't want it to interfere with my use of the machine.
So if I told it I wanted the script to use a specific key as a toggle that triggers a sequence of other keypresses or movements, but only while a specific application is active/focused, it would do that, but it would also make that key inoperable whenever the application wasn't in focus. If I were writing the script myself, I would have understood that I still wanted the key to work as normal, just not while the application I specified was active.

At that point I started learning to phrase my requests better, and it would try to meet them, but either I wasn't being specific enough or it just couldn't accomplish it. Sometimes I would use vague instructions, something like 'not prevent normal use of the keys when application is in focus', and you could see it attempt something like that; on some occasions it worked or sorta worked, but in many cases it didn't. Hell, even when it did manage to fix that one flaw, sometimes it would introduce some other flaw with the adjustment. It was a little aggravating because I knew I could do it myself, but I was trying to learn how to give it instructions in a way that would get me the solutions I wanted. In some cases I suspect there might not be any magic words that make it output the solution if the problem is too complex.
I'll note that this is fundamentally an issue with formulating requirements. It's really hard to do this well and communicate well-formulated requirements to software engineers. Some human software engineers make bad assumptions about implicit requirements, or get bitten when they just ignore things that are not explicit in the specifications they are given. The best human software engineers are the ones who don't even start coding until they have a good understanding of what the software is supposed to do. GPT-* and other current LLMs don't have access to the wider context or experience of humans, so they either start coding without understanding the real requirements, or they get the implicit requirements correct by chance. (This happens with junior software engineers as well.) If your software applications are low stakes, though, going back and forth between describing your requirements and filling in implicit requirements when they are not met is a slower but feasible cycle for getting to usable software.
Yeah, that's what I started understanding the more I used it: I was just going back and forth filling in different implicit requirements. I think that was the slightly aggravating part, because it was difficult for me to keep accounting for all of them. Basically, I could tell it I want X, Y, Z, then find out right away after testing that it missed an implicit requirement. Then I'd give it another instruction, and other implicit requirements that hadn't been resolved would reveal themselves once I got past the previous one. Then giving more specific instructions to deal with the newest implicit requirement isn't good enough, because you also have to keep accounting for the past ones. The first few can be fine, but once you get past that, it becomes a more complex web of instructions that is hard to convey.
Eventually I just started using it to get the base of something and then finished it myself. That way I often didn't have to look up the documentation anymore to find the specific syntax or functions I needed, and I could account for my own implicit requirements more easily.
As a less-than-novice, my odds of getting it right with a coin flip are zero, not 50/50.
"Correct 48% of the time" is remarkable.
chatGPT is popular in the spreadsheet world among people who have never used a spreadsheet before. it writes terrible formulas and, if they even work, they’re just not natural.
for QUERY in sheets, it writes it like SQL instead of learning from the documentation.
i run a sheets sub and automatically remove all mentions of this (contextual) trash.
as one of those guys who has very limited spreadsheet knowledge, I used chatgpt to write this monstrosity:
=LET(vb,LOOKUP(2,1/(K$2:K9<>""),K$2:K9),va,INDEX(FILTER(K9:K1999,K9:K1999<>"",""),1),db,INDEX(B:B,LOOKUP(2,1/(K$2:K9<>""),ROW(K$2:K9))),da,INDEX(B:B,INDEX(FILTER(ROW(K9:K1999),K9:K1999<>"",""),1)),vb+((va-vb)/(da-db+1)*(B9-db+1)))
it works (takes values from a list of date/values and gives me interpolated values in a list of EVERY possible date), but SOMETHING is telling me there's a better way to write it :^)
This is my favorite use of ChatGPT: writing in annoying languages. I'm never going to write an awk command or bash script again; it's simply better at handling the syntax than I am.
Anytime I touch nginx configs, chatgpt is doing the work now. It's like a problem in NP: verifying an answer is quick even when producing one isn't. I can verify that it's not doing something terrible in a short amount of time, so I'm not worried chatgpt is going to give me some insecure server configs, but if I had to go around and remember all the bullshit myself, it's going to take me far more time than iterating with chatgpt.
You could ask ChatGPT what the problems are in doing it this way, if there are alternative ways to do it, whether it could be simplified, and to explain how each part works.
Don’t blindly trust the answers, though. You need to keep going until you either understand it or see the problem.
haha. At least it used LET to shorten it a bit. It's funny that it used $ in the lookups for the starting row but nowhere else.
Yes this is my personal experience. ChatGPT is way smart and also way stupid.
I think I've seen it best described as confident and authoritative regardless of how correct it actually is. Most language models are trained to sound authoritative because that style of prose is more convincing, and the companies that develop them want you to feel like you're speaking to an expert, even if that's not really the case, because it sells the experience better.
So, automated mansplaining.
It is trained on a lot of reddit comments…
Regarding code generation, I find Copilot to be magnitudes more useful than ChatGPT.
ChatGPT can give me some high level ideas or patterns to solve a problem, but I find it almost useless with code. It simply takes too much time to write & tune questions & context, and at the end I have a buggy answer that I could have written better myself.
Copilot has the full context of my project, gives suggestions in real time, and while it certainly won't write everything for me, it excels at boilerplate and boring code.
Copilot isn't available for free though, correct?
Correct, it's $10/mo after a free trial. It's free if you have a popular open source project. The definition of "popular" isn't clear, but it's generally agreed to mean 1000+ stars.
I've used ChatGPT to successfully write some basic SQL code (SQL being something I'm unfamiliar with). The prompt would be something like, "Write SQL code to, in XYZ table, remove columns A and B."
Then, in Excel, I've had it write a formula which, for a list of dates and values, will grab the interpolated values of all dates between the given dates. I strongly suspect the code is not optimized in any way and generally super janky, as it looks like this:
"=LET(vb,LOOKUP(2,1/(K$2:K9<>""),K$2:K9),va,INDEX(FILTER(K9:K1999,K9:K1999<>"",""),1),db,INDEX(B:B,LOOKUP(2,1/(K$2:K9<>""),ROW(K$2:K9))),da,INDEX(B:B,INDEX(FILTER(ROW(K9:K1999),K9:K1999<>"",""),1)),vb+((va-vb)/(da-db+1)*(B9-db+1)))"
... but it DOES work
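For what it's worth, what that formula seems to be doing is roughly this, sketched in Python (illustration only; the dates and values are made up): take the sparse (date, value) anchor points and linearly interpolate a value for every day in between.

from datetime import date, timedelta

# Rough sketch of the same idea: given sparse (date, value) anchors sorted by
# date, produce a value for every calendar day between them.
def interpolate_daily(anchors):
    out = []
    for (d0, v0), (d1, v1) in zip(anchors, anchors[1:]):
        span = (d1 - d0).days
        for i in range(span):
            out.append((d0 + timedelta(days=i), v0 + (v1 - v0) * i / span))
    out.append(anchors[-1])
    return out

anchors = [(date(2023, 8, 1), 10.0), (date(2023, 8, 5), 30.0)]
for d, v in interpolate_daily(anchors):
    print(d, v)  # 10.0, 15.0, 20.0, 25.0, 30.0 across Aug 1-5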
I'd be really interested to see the researchers do a similar analysis on a similar number of human-answered questions, to see how accurate (etc.) they are.
Or maybe some attempt at blinding: here are 1000 answers, and here are the questions. Some of the answers were created by ChatGPT. How many of the answers are any good?
Has anyone else noticed ChatGPT getting way worse lately? I understand OpenAI are very keen on safety, and have written posts/papers (sic) on what they call the "Alignment Tax" (which back in ye olde days of < 1 year ago we called, I think, "catastrophic forgetting"), but I find it has gotten consistently worse at programming in particular over time. Friends and colleagues who use it to validate math proofs also say it's gotten worse there.
My tinfoil-hat theory is that they are trying to upsell GPT-4 by making 3.5 "safer" and "faster" (meaning probably smaller and less performant).
I'm still so impressed it gets half right...