A study on robustness and reliability of large language model code generation

teraflop · on Aug 28, 2023

I'm sympathetic to the viewpoint that GPT-4 is prone to mistakes when writing code. Unfortunately, the analysis in this paper is pretty bad and doesn't support that conclusion.

The authors assume that for any given method under consideration, it must only occur within a particular pattern of other method calls and control flow instructions. But the templates they have chosen are clearly only applicable in certain situations.

For example, they claim that I/O operations are "wrong" unless they are wrapped in exception handlers that log any errors:

    try {
        ...
    } catch (IOException e) {
        e.printStackTrace();
    }

But of course, this will cause execution to continue as though the I/O was successful, which might be exactly the wrong thing to do! In many cases, you want the exception to propagate, so that the caller can decide how to handle the failure. (And even if you do want to report the error somehow, writing it to stderr might not be correct; it's pointless in a GUI app.)

Similarly, the authors assume that every time you create a file or directory, you always want to call .exists() first (even though doing so has an inherent race condition); that Map.get() must always be followed by an "if" block; that List.get() must always be guarded by an explicit bounds check; that after doing a database query, you always want to close the connection; and so on. None of those rules are universally applicable.

I would expect the real problem with LLM-generated code to be semantic bugs and "misunderstandings" of the requirements, which would not be caught by superficial checks like this.

oceanplexian · on Aug 28, 2023

I see humans do this all the time, especially abuse of exception handling, even among “Senior” developers. They don’t have a semantic understanding of what they are doing or why they are creating a race conditions or creating perverse control flow logic N layers down in the stack.

The fact that researchers get it wrong is, well, unsurprising. LLMs might actually be an improvement.

JackFr · on Aug 28, 2023

Well, there's no "correct" answer. Depending on the context, the "correct" thing might be 1) to log and swallow the exception and move on; 2) Let the exception percolate up to caller, who can handle and recover; 3) implement your own recovery and handling; 4) kill the process. To know which one to do, one needs to understand the context. Logging and swallowing is not the worst default, but it's also likely not the best.

anonzzzies · on Aug 28, 2023

But this is because almost everyone gets taught this badly. When you ask how to do exception handling best practices etc, even from teachers/college profs etc you get wildly different answers from every single one of them. So people either learn themselves or from colleagues or, most likely, not at all.

afpx · on Aug 28, 2023

What’s a “senior” developer these days? 3-4 years experience on average?

kuchenbecker · on Aug 28, 2023

Knows how to build a feature or small system without handholding.

Anduia · on Aug 28, 2023

Hi, do you have a recommended read for those of us who might inadvertently create race conditions?

lifeisstillgood · on Aug 28, 2023

Now hopefully I won't get horribly dinged for mistakes and poor advice here. What I am trying to say is that I too read "even senior devs don't understand the race conditions they create downstream." And I thought - oh God, don't I.

But five minutes thought can help you walk through most issues. For most applications most of the time you can reason your way through without fear, and when you do encounter gnarly problems they often can go away by redesigning your application! Often the problems you encounter you caused. retrace your steps and find an easier path. Save the hard thinking for genuine problems / value creation.

So race conditions are simply when two processes / threads are likely to affect a single resource. In this case it's a file - and the problem is test if a file exists, then if it does not, create it and then write to it.

If two threads do this, say a log file, the first one creates the file and logs it's important stuff, the second then creates it again wiping out the first log data.

Solutions in this area include

- create as append file (the concept is basically deviates old because this is a decades old problem)

- avoid sharing resources. for logging log to per thread locations. Not always possible but you sure as hell can minimise this to one or two resources you must share.

- hand off creating files to a seperate part of the application. There is a balance between "scripting" and "application" and using small little library functions to do in one line indirectly something that also takes one line using the methods shown in the docs.

- handing off batons / mutexes etc etc. This gets wildly complex. Honestly given a world of async libraries, Erlang, and distributed computing, if you find yourself having to use multi-threading think very carefully if this is the right approach.

magicalhippo · on Aug 28, 2023

For the example in question one approach which might work well is to realize the test is superfluous. Just open the file with the right flags.

That is, the flag for create if not already existing and flaf for deny sharing so any other process/thread will fail to concurrently open it.

Then you just handle the case when the open fails.

Of course this relies on those flags being available and working. For example shared files might not respect the sharing lock.

Many cases of avoiding race conditions boil down to something similar: avoid the problem by not doing extra work.

pulvinar · on Aug 28, 2023

Or, sidestep it by creating an empty log file (if it doesn't exist) during initialization before creating multiple threads.

lifeisstillgood · on Aug 28, 2023

Yeah that's more or less "don't do simple one liners in your application code - write a simple library that does the simple thing, but wraps the simple thing in lots of checks"

There probably is a "design pattern" for that but darned if I can draw it in UML

layer8 · on Aug 28, 2023

For Java, Effective Java, 3rd Edition, chapter 11. For a more detailed treatment, Java Concurrency in Practice is the book to read.

Lerc · on Aug 28, 2023

I have only glanced at a few examples in the paper so far but I would rate some of the cases listed as being better responses from the LLM than their exemplars.

If you ask someone how to open a can of soda you do not want them to tell you how to check to see if the can has been shaken and what to do if it has. You want the instructions to the question you asked.

If anything I would a LLM to produce code that does the fundamental operation, possibly I might like it to offer a more robust framing of the operation as an extended example but I suspect even that would get annoying after a while.

I absolutely would not want The most technically correct but harder to read response to a question of "how do I X?"

13yuan · on Aug 28, 2023

The paper actually explains that it targets at the API misuse problem, not the semantic alignment or bugs. Semantic bugs are difficult to detect, which is already a consensus in software engineering field and still many ongoing work on it. And when we say 'semantic' in this special problem it means more than 'semantic' in programs but also how developers express their semantics, which is definitely a problem bigger than checking the code itself. The API usage patterns are created to help check the code snippets given by LLMs, while the race condition you mentioned could exist but not checkable unless given other components of the programs. If adding the bugs you mentioned, the buggy code generated by LLMs could even exceed the number claimed in the paper.

teraflop · on Aug 28, 2023

But what is "misuse", then, if not something that causes a bug?

"Misuse" is not a formally defined thing in Java, and the authors never define what they mean by it.

> If adding the bugs you mentioned, the buggy code generated by LLMs could even exceed the number claimed in the paper.

Or it could be much fewer. Aside from a few scarce examples, the paper gives no evidence that most of the patterns that were selected are actually associated with "misuse". They just declare it to be so. (The 2018 paper they cite for their dataset also provides little such evidence. It just says that the authors reviewed the patterns, reviewed documentation, and decided which ones they considered to be poor code quality.)

It is easy to come up with valid situations where a piece of code violates those patterns, but behaves as intended and is not misusing the API. So why should I assign any meaning to the fact that some percentage of code snippets violate the patterns?

I could make a list of adjectives that frequently appear in comments near buggy code, and then count how many LLM outputs contain those outputs, and then say that means the LLM output is buggy. But I would not be saying anything meaningful about the LLM's quality by doing so.

VMG · on Aug 28, 2023

> But of course, this will cause execution to continue as though the I/O was successful, which might be exactly the wrong thing to do!

Almost always is the wrong thing to do when I encounter this antipattern. GPT-4 beats these researches in code quality is my takeaway here.

gonzo41 · on Aug 28, 2023

If you take the point of view that LLM's have been trained on all the code that humans write, then that sort of things is probably 'correct' to the model.

For coding it seems like there almost needs to be a weighting to certain 'correct and orthodox' handling of events if you're going to give over control of code generation to an LLM.

keyle · on Aug 27, 2023

Ok but what are we calling API misuses?

    We collect 1208 coding questions from StackOverflow on 24 representative Java APIs. We summarize thecommon misuse patterns of these APIs and evaluate them oncurrent popular LLMs. The evaluation results show that evenfor GPT-4, 62% of the generated code contains API misuses,which would cause unexpected consequences if the code isintroduced into real-world software.

Should have used ChatGPT to correct those sentences. /s

What do we consider an API misuse at the end of the day? And also against the Java APIs... Which are some of the oldest.

I'm not saying ChatGPT answers work, 50% of the time they don't work out of the box. But overal, the time savings are incredible, compared to the fossil digging way of googling. And I value the time savings towards the equally correct way of googling for answers. You reap what you sow. The knowledge the LLM is based upon is the same stuff you used to distill manually. It shouldn't be more or less correct, if you think of it that way.

Surprise, ChatGPT will not replace devs. But it is like having the electrical drill vs. screwdriver type of situation.

dventimi · on Aug 28, 2023

Evidently, THEY are calling the following API misuses:

"To evaluate the API usage correctness in code, RO- BUSTAPI detects the API misuses against the API usage rules by extracting call consequences and control structures from the code snippet, as shown in Figure 2. The code checker firstly check the code snippets to see whether it is a snippet of a method or a method of a class, so that it can enclose this code snippet and constructs abstract syntax tree (AST) from the code snippet. Then the checker traverses the AST to record all the method calls and control structures in order, which generate a call sequence. Next, the checker compare the call sequence against the API usage rules. It infers the instance type of each method calls and use the type and method as keys to retrieve corresponding API us- age rules. Finally, the checker computes the longest common sequence between the call sequence and the API usage rules. If the call sequence does not match the expected API usage rules, the checker will report API misuses."

simonw · on Aug 28, 2023

"compare the call sequence against the API usage rules"

The validity of this entire paper would seem to depend on how robust and agreed upon these "API usage rules" are.

dventimi · on Aug 28, 2023

Everything depends on something else. That's life.

wredue · on Aug 28, 2023

I just don’t see how it could save time.

Programmers nearly universally agree that reading code is harder than writing it.

So when you have chat GPT writing your code, you have to read and understand it to ensure it’s actually doing what you need it to without awkward bugs or problems.

Given that it’s harder to read than write code, it seems to stand to reason that the “time savings” must be nil.

simonw · on Aug 28, 2023

"I just don’t see how it could save time."

Have you tried it much?

It's saved me a ton of time over the past 8 months. I don't know what else to say - I'm not the kind of person to deceive myself into thinking something is saving me time when it isn't.

The time saved is mainly in the micro-research you no longer have to do - the bits where you have to go and look up how to write a for loop in Bash, or how to call super() in a Python class, or which exceptions you need to catch for a call to httpx.get().

GPT-4 writes the code right 90% of the time. The 10% of the time it doesn't you catch in the same testing you would have done against code you had written yourself the long way.

maxbond · on Aug 28, 2023

I have found it to save time when I'm learning a new, widely used framework. (Presumably this would hold for a language as well.) I used it to learn Keras and React + Redux, and it was helpful there. For instance I was able to ask it about what sort of messages would be emitted by Redux in a given circumstance, and the precise values were wrong but it didn't really matter. That was very cool.

Once I have a good sense for how things work, I'd much rather go to the documentation. Most of the time I can do this straight from my editor by typing the method I want and querying the LSP for documentation. In those cases it's more or less instant, there's no improvement to be made there.

Maybe in some of the other cases it would be faster via an LLM integrated with my editor, but I don't think it would justify the cognitive load of considering whether or not it's correct. I'd like to worry about whether the application is correct. And I don't really think I'm wasting time waiting for the Python documentation to load, I'm continuing to chew on the problem.

Additionally, when I go to the documentation, I'm looking for callouts about the safety of an API, and I take the lack of such callouts as authoritative. Eg, recently I was reading that in node-postgres you have to return connections to the pool, which was a good refresher for me, because previously I'd been using Rust where resources are generally cleaned up automatically. I just don't trust a lack of a callout from GPT-4 as being a confirmation of a lack of a safety issue.

jiggawatts · on Aug 28, 2023

GPT can read and understand both code and the comments about 1000x faster than a human.

It's not just about the code it can generate out of a vacuum, it can also understand code for you.

I use it to generate doc-comments, check for differences between the comment and the code, or to explain code I'm not familiar with and don't understand myself.

E.g.: I don't use Python and I had to figure out what a spaghetti-code Python script actually did. I fed it to GPT 4 and started asking questions. That's way faster than learning Python first and then deciphering the script.

dagw · on Aug 28, 2023

GPT won't only writes code, but also documentation and unit tests. It can also read code for you and explain what is going on. Furthermore, even as an academic researcher, most of the code I write is simple boilerplate code that GPT excels at. This means I get to spend more time on the actual code that is hard and novel and that GPT cannot really help with.

Back when GitHub Copilot was first released I though very much like you and was somewhat underwhelmed by my experience with it, however since then the state of the art has improved dramatically.

Yodel0914 · on Aug 28, 2023

It depends on your baseline. If you're comparing it with "I know how to do this already but don't want to type" I think ChatGPT is slower. If you're comparing it to "I can't remember how to use this API so I'm going to read a bunch of Stack Overflow posts" than ChatGPT is seriously faster.

One of the reasons is that you can iterate to refine and expand your answer. Here's a transcript of a recent chat, where I was too lazy to write a script:

Me> I have a CSV file with a header row, whose first column is a date formatted like this: "March 16, 2022 1:51:19 PM". Could you write a script to convert the dates in this CSV to ISO date format? I am using a mac would would prefer to not install any software, but other than that you can choose any language you'd like

ChatGPT>Sure! Since macOS comes with Python pre-installed, you can use a Python script to read the CSV file and convert the dates to the ISO date format.

Me> the first row is a header row, could you please modify the script to handle that?

ChatGPT> Certainly! We'll modify the script to handle the header row separately, ensuring that it is copied to the output file without modification.

Me> can you make it so the input file is a commandline argument?

ChatGPT> Certainly! We can modify the script to accept the input file path as a command-line argument. Here's the updated code

You get the idea.

naasking · on Aug 28, 2023

> Programmers nearly universally agree that reading code is harder than writing it.

I don't think there's universal agreement on that at all. Programmers universally agree that we spend more time reading code than writing it, but that doesn't somehow translate into it being harder.

Various studies have shown that programmers, on average, can only add at about 10 correct lines of code per day. We of course add more lines of code, but they tend to contain many bugs, so in the end only 10 lines of those are correct.

Maybe we only get 10 correct lines of code in because we're not reading the surrounding code correctly, and thus reading code is harder, or maybe 10 correct lines of code is just harder to write. I'm not aware of any hard data on this question.

hmottestad · on Aug 28, 2023

Sounds like those programmers are spending a decent amount of time stuck in meetings :P

sharemywin · on Aug 28, 2023

ChatGPT is pretty good about telling you what a line of code does. So I just copy it in and ask it.

gonzan · on Aug 28, 2023

Not all reading is the same. If I tell an AI to write code that does X then checking if a provided code actually does X is a lot faster than writing it myself.

The way I use AI is to think at a high level how I want to implement something and then instead of writing the code tell the AI to write it by giving it specific instructions. Then proof-read it and fix the 1-2 things that it inevitably gets wrong.

I'm not reading some random code that I don't know what it's supposed to be doing. I'm reading the output of an instruction I gave that include some technical detail on how to implement it. Reading it and doing minor fixes is really fast.

Yiin · on Aug 28, 2023

To write the code you need to know what to write. In many cases I find it easier to read proposed solution to my problem and then figure out from there how I actually want to implement a solution. It's great starting point.

taneq · on Aug 28, 2023

Depends what you're doing. If you tend to hop between languages, software stacks, platforms etc. then you probably (as I do) spend a lot of time looking up what the options are to achieve any given task, finding and interpreting API calls, etc. Having an LLM spit out code that's in the right direction would be a huge time saver.

makestuff · on Aug 27, 2023

I have been trying to use ChatGPT to make a VueJS website (I’m a backend dev). What I have noticed with VueJS specifically is that it has little to no knowledge of the composition API, and when I ask it to help build a component it is pretty good at the scaffolding. The component will generally work; however, it is pretty useless for CSS styling.

I probably should try Copilot as I bet that would work better. Also the VS Code integration would be really nice.

keyle · on Aug 27, 2023

Copilot probably wins because it holds more of _your_ code against the desired output.

I find ChatGPT is really great at getting something going, think scaffolding wise. But if you enter the realm of gnarly stuff, it can get completely delirious with answers full of red herrings and sends you running in circles very quickly.

coffeebeqn · on Aug 28, 2023

The only truly helpful use-case I’ve found is in generating some well defined text in a different form. Say - here’s a JSON object , please write a MD file describing it or turn it to X language struct/object.

simonw · on Aug 28, 2023

How stable was the Vue composition API prior to the September 2021 training cut-off date for the OpenAI models?

Looks to me like https://blog.vuejs.org/posts/vue-2-7-naruto back-ported it in July of 2022, and the earliest snapshot of the documentation for it in Vue 3 was https://web.archive.org/web/20220210121247/https://vuejs.org... in February 2022 - so my guess is it mostly happened after that magic date.

earthboundkid · on Aug 28, 2023

ChatGPT is very bad at CSS. The only good part is you can test it and see that it’s not working very quickly. Unfortunately if you go back to ChatGPT it will waste your time with endless apologies and more broken CSS.

ramraj07 · on Aug 28, 2023

Can you confirm if you used gpt-4? This was not my experience

makestuff · on Aug 28, 2023

It was gpt-3.5 (the free version offered). I will give get-4 a try though that is good to know.

naasking · on Aug 28, 2023

Use Bing chat to get access to GPT-4 for free.

robwwilliams · on Aug 28, 2023

Is this worth reading? When authors cannot write a grammatically correct abstract I tune out. Try to make sense of this monster sentence:

“The misuse of APIs in the generated code could lead to severe problem, such as resource leaks, program crashes, this http URL make things worse, the users of LLM code generation services are actually the developers that are most vulnerable to these code that seems right -- They are always novice developers that are not familiar with the APIs that LLMs generate code for them.”

maxbond · on Aug 28, 2023

Bear in mind that most researchers are not native English speakers.

bowsamic · on Aug 28, 2023

I’m confused, do you think we’re not aware of that?

maxbond · on Aug 28, 2023

I try to restrict my comments to addressing the content of the parent comment, rather than what my assumptions about people are.

bowsamic · on Aug 28, 2023

I don't see how it added any extra information to the parent comment

maxbond · on Aug 28, 2023

They asked whether this paper was worth reading, because the English wasn't conventional in their dialect.

I'm pointing out that that isn't a good way to screen papers, because most English speakers don't share the same dialect.

Perhaps unstated or understated is the implication that, were you to discard papers in this way, you're liable to miss out.

I'll leave it to you whether that added information. That wasn't really my goal in this case, my goal was to respond to the question in the manner I viewed appropriate. Feel free to downvote my comment if you don't feel it was appropriate or informative or otherwise didn't uphold whatever criteria you're judging it by.

bowsamic · on Aug 28, 2023

> I'm pointing out that that isn't a good way to screen papers, because most English speakers don't share the same dialect.

You did not make that clear at all.

> Perhaps unstated or understated is the implication that, were you to discard papers in this way, you're liable to miss out.

Indeed, that implication was totally missing.

> Feel free to downvote my comment if you don't feel it was appropriate or informative or otherwise didn't uphold whatever criteria you're judging it by.

I would certainly do so. Like their paper, it is quite poor, and not helpful for this forum. Unfortunately, /u/dang has removed my ability to downvote.

maxbond · on Aug 28, 2023

Alrighty. Thanks for the feedback. Have a great day.

(For what it's worth, I often feel my comments are too long and wordy, and occasionally I've gotten feedback to that effect, so I was trying to be brief in my original comment.)

bt1a · on Aug 28, 2023

i understood your original comment :)

totally agree that dismissing the content of a paper for its grammatical mistakes alone is foolish.

this isn’t a paper about classic roman literature

Tiberium · on Aug 28, 2023

This argument is not really valid anymore, especially in a paper that talks about LLMs.

maxbond · on Aug 28, 2023

It's an observation, not an argument, but note that this is a paper about LLM mistakes. But if the language bothers you and you view LLMs as the solution, you're certainly free to feed it into an LLM yourself.

To be frank, I find it a little off-putting to suggest being a non-native English speaker is "no longer valid."

clarionbell · on Aug 28, 2023

I'm not a native english speaker. For serious work, that is, anything that may be printed or published with my real name on it, I use grammar checks. That's the least I can do.

boxed · on Aug 28, 2023

Maybe the previous commenter meant that they could have used an LLM to translate from Mandarin to English to get a better sentence?

Which is... ironic I think :P

QuadrupleA · on Aug 28, 2023

Just a grammar check or a proofreading pass through an LLM is all it needs. It's an English-language paper, so we e.g. separate words by spaces.

Hate this modern kneejerk reaction to get preachy and supercilious whenever anything vaguely cultural comes up.

maxbond · on Aug 28, 2023

There might be preaching going on here, but it's not coming from me. :)

QuadrupleA · on Aug 28, 2023

I'm rubber and you're glue...

barbarr · on Aug 28, 2023

I found it somewhat charming that the authors, who express skepticism about the use of LLMs by novices, indeed stuck to their views and decided not to run their phrasings through an LLM. This will probably be one of the dwindling number of papers released each year with "fairly bad" grammar - and part of me wonders if we're losing something in the process of LLM-meditated homogenization.

mdp2021 · on Aug 28, 2023

> not really valid anymore

You have to express reasons when not evident.

In case, just in case (and to also show how lack of definition may trigger any weak interpretation), after suggestion from a nearby member, the idea would had been that today people could have sentences reformulated by language models: the composer intends to express a logic, so any translator would have to grasp said logic primarily and first.

splatzone · on Aug 28, 2023

Run the code it writes, if it gives an error, paste it into the chat and 90% of the time the LLM can fix the issue.

People are missing the point here - it's not about writing code in one-shot. An LLM-enabled loop can generate the code and test and refine it until it works

tedunangst · on Aug 28, 2023

Someone posted a transcript of having chatgpt write them a bash script with the comment "see how much easier this was than figuring it out" and it took like a dozen tries of pasting error messages back in. I was infuriated just reading it. I cannot imagine trying to develop this way. But if people like it, whatever blows your hair back.

Allegedly the benefit is it will do the boring stuff like write the error handling code for you, but I have no faith it would work if it takes so many tries to get the happy path correct.

arp242 · on Aug 28, 2023

> I cannot imagine trying to develop this way. But if people like it, whatever blows your hair back.

I agree; who am I to comment on how you write the code, as long as it gets written, right? That said, I've already had a "this is really odd code you committed last week, what's up with that?" and the reply was "oh, dunno, that's just what ChatGPT gave me". Meh. Actually writing things yourself does increase your understanding, so it's not the same, not really. It's when they told you to take notes at school: I didn't believe my teachers it'll help retain knowledge on account of being a stubborn little idiot, but they were right!

For people who have already done this programming thing for a while I guess it'll work out, but my main worry is the effect it will have on more junior people who will "grow up" on ChatGPT. "Figuring it out" yourself has a lot of value.

Just because the code is free of syntax errors doesn't mean it's free of bugs. Shell scripting is a classic case where it's hard to make something work but also buggy. You need to actually understand the code and reason about it. Trail-and-error programming rarely leads to good code.

I also fear we'll end up with hard to read overly verbose/repetitive code, "because ChatGPT/copilot will just generate it".

teaearlgraycold · on Aug 28, 2023

Anyone committing code they don’t understand should be put on a PIP.

arp242 · on Aug 28, 2023

He was the team lead shrug

This is one of several reasons I worked there for a week only.

Etheryte · on Aug 28, 2023

Man I sure love running buggy bash scripts on my system just to see what happens.

wavemode · on Aug 28, 2023

I also cannot understand how people develop that way.

That being said, better workflows do exist. I imagine IDE-integrated LLM tools like Copilot and CodeGPT are much more productivity-enhancing than copy-pasting code between a browser and an editor.

wokwokwok · on Aug 28, 2023

I think you’re missing the point here.

Language models scale.

It doesn’t matter if a single pass doesn’t solve the problem, has syntax errors, etc. A single pass costs a fraction of a cent.

You can just automate the process of code -> create variant -> fix from LLM -> apply deterministic tests until the code at least compiles -> pass it to the user; the fact you can’t do that with chatgpt is just because it’s not a coding tool.

Heck, there are already companies doing exactly this with vulnerability scanners (ie. the deterministic feedback loop) to suggest security fixes for code.

You just repeat a heap of times until you get a solution that passes all the scanners.

If it takes 10 tries, or 100 tries, it still is zero effort from a human.

Of course, whether the result does the right thing is another matter, but the frustrating ergonomics of copy-paste cycle is because of the ui, not the technology.

You will see “surprisingly good” output from professional tools in this space (we already are); but a lot of it is not magic sauce; it’s just the same tools, run multiple times, in a way that saves you doing it manually and just shows the best results, with a pretty ribbon on it.

ChatGTP · on Aug 28, 2023

Sounds kind of silly, just run the thing till it gives the right answer ? How do you know each pass has not introduced a new problem ?

wokwokwok · on Aug 28, 2023

Yup.

…and if it works (and it does) then it really goes to show that you (and me, and people in general) have poor intuition about this stuff.

ChatGTP · on Aug 28, 2023

Yeah, please don't try bullshit me dude.

It's a silly, potentially dangerous and inefficient way to "code", which is the easiest part of my job. If my job was just coding all day, I'd be happy.

_bramses · on Aug 28, 2023

Completely agree. Any metric unto itself will only show a partial picture. Using LLMs as an iteration partner can solve any problem which is embedded in latent space (which is pretty much all problems that devs get paid to solve).

People seemingly desire to zealously defend the LeetCode way of solving problems for some inane reason.

maxbond · on Aug 28, 2023

If you're writing eg a bash script and it contains a difficult to spot bug which causes causes data loss (see [1] caused by a single space), what then?

[1] https://github.com/MrMEEE/bumblebee-Old-and-abbandoned/issue...

YeGoblynQueenne · on Aug 28, 2023

Then you ask ChatGPT to write you an email to apologise to your boss.

They'll probably ask ChatGPT to write an email to reply, so that's all good.

taspeotis · on Aug 27, 2023

I mean the landing page of GitHub Copilot just shoves text unencoded into a urlencoded body, this is nothing new...

    const response = await fetch(`http://text-processing.com/api/sentiment/`, {
      method: "POST",
      body: `text=${text}`,
      headers: {
        "Content-Type": "application/x-www-form-urlencoded",
      },
    });

rany_ · on Aug 27, 2023

At least they can't be sued for false advertising. That's unfortunately commendable these days.

earthboundkid · on Aug 28, 2023

I find it kind of crazy that they haven’t fixed given how prominently it was called out. The Go example is wrong too. It doesn’t check the database call for a final error.

teaearlgraycold · on Aug 28, 2023

I think to avoid false advertising claims they are using actual outputs. And if the listed examples are doing their job (getting people to sign up) they won’t change them.

willsmith72 · on Aug 27, 2023

It's an interesting idea, but why is this written so weirdly?

"the users of LLM code generation services are actually the developers that are most vulnerable to these code that seems right -- They are always novice developers that are not familiar with the APIs that LLMs generate code for them"

It looks like they could've benefited from an LLM doing a pass-over. Are they saying it's only "novice developers" using LLMs?

sen · on Aug 28, 2023

Just looks to me like the people who wrote this don’t speak English as a first language? It’s not difficult to parse what they meant to say.

Most of the users who are likely to turn to LLMs to help them code, are people who aren’t very experienced programmers. Hence needing help. These same people are more likely to fall for mistakes in the generated code as they don’t have enough experience to notice the errors.

LLMs are very good at sounding confidently correct about things that are subtly wrong. You need experience in the domain to notice these issues usually.

jprete · on Aug 27, 2023

I think the author is simply weak at English, although there are some typos that really should have been corrected even in non-idiomatic English (e.g. missing spaces between words).

Really, as long as the premise, evidence, and results are clear and reproducible, the quality of the English doesn’t much matter.

dools · on Aug 27, 2023

The claim “They are always novice developers that are not familiar with the APIs that LLMs generate code for them” seems pretty outlandish to me.

robwwilliams · on Aug 28, 2023

Writing is code too. Writing English well does matter if you want to have your work read.

dools · on Aug 27, 2023

Yeah that statement seemed very strange to me too. I’ve been writing software for 20 years and I use ChatGPT4 everyday because it’s faster to edit and debug some code with a LLM than it is to write it all from scratch

version_five · on Aug 28, 2023

There was a good video posted here yesterday about a (for me) seemingly typical experience writing code with chatGPT https://www.youtube.com/watch?v=U2Q3KAOSAEY

The combination of hallucination and just bad advice makes it time consuming and not clearly worth it. I think it's a step in the right direction for conversational interfaces. The fact that you really can have a back and forth to clarify intent, make corrections, and arrive at a "meeting of the minds" is lightyears beyond "intent recognition" type chat, and what a conversation should be. It just needs more work, which may or may not be straightforward.

Lerc · on Aug 28, 2023

Having the ChatGPT 4 icon being purple has been a godsend when evaluating claims like these.

version_five · on Aug 28, 2023

Interesting, I never noticed but I guess you're saying he used 3.5? I've only used 4, which I understand is way better, and I've still had the same kind of problems, but I agree it's disingenuous to equate 3.5 with the current state of the technology.

stephc_int13 · on Aug 27, 2023

I used Copilot for a few months, and I've toyed with different versions of Chat-GPT, for code.

Not impressed so far. With some effort and improvement this kind of tech can be used to fill boilerplate and glue code, and that is very nice as this is something that can help to stay in the flow by avoiding boring work.

But, the problem is that it is too unreliable even for this style of simple tasks, sometimes it works directly but it is too often plain wrong. Frustrating to use, not enough value, so I will try again later.

damowangcy · on Aug 28, 2023

API misuses is one thing, the more concerning outcome is the misuse of AI itself.

Just like how there is "common misuse pattern of x language", there is also "common misuse of copilot/chatgpt4".

From my observation, those who are successful with code generation tools are using it as an assistant, they already know the language quite well, AI is there just to help, most of the code are boilerplate code or common patterns.

Those who complaint are usually using it to do more than just that, relying it to generate working piece of code for a particular function without knowledge of the language. Currently, we use StackOverflow for this purpose, when we asked question, we don't just say, "hey, code please" but we are trying to understand why and how things work instead. This knowledge will then help us to code contextually, which current AIs are incapable of. The plus side is that code generation is faster.

Also the point of the paper is not to prove that GPT-4 is unusable but that it can be further improve. They didn't just manually determine misuses but did it via an API checker that they designed. So, in the future, AIs are just going to be more reliable not less.

We are not at a point that AI is common enough to talk about misuse yet though, most don't even have access to CoPilot or ChatGPT 4.

zeroCalories · on Aug 28, 2023

While I do see LLMs helping programmers a lot, I'm not super impressed by seeing it write what looks like a lot of boilerplate. If things become so common that it can be replicated by an LLM, it seems like we need to be abstracting it away.

ChatGTP · on Aug 28, 2023

The worst programmers I know are the busiest.

We’re already in this world. Seriously develop and app with Django and tell me how much actually “code” you write ?

I’ve been a developer for almost 2 decades and the coding part of the job is about 5% of my time :) The rest is stopping people from building stupid shit they don’t need to build and trying to make sense of “the business”

With generative AI, you can have all the stupid code you want in 1/1000th of the time :) :)

For me personally, we invented computers so we can spend time at the beach…ever seen the early Apple computer ads ? We shouldn’t be writing code constantly if we’re doing our jobs properly. We should be enjoying our lives which technology should be working and affording us the time to do so.

zeroCalories · on Aug 28, 2023

Yeah I agree. Webdev in particular could use a lot more work. I think serverless does a good job, but platform lock-in prevents it from becoming the standard. I think databases are a great example of abstraction done right. I rarely see database code(sql, configs) that seems copy-pasted.

Gigachad · on Aug 27, 2023

I feel like AI for programming has been so overhyped. I've attempted to use ChatGPT for programming so many times, and almost every time it's just wasted my time. Giving me outright lies or generating stuff that looks right but doesn't work. And just throwing out it's answer and starting from scratch is faster than fixing its output.

I've only found it useful for explaining basic terminology or concepts for a topic I'm not familiar with. Though this is the most dangerous use case since I'm not familiar with the answer, I can't easily fact check it.

thomasfromcdnjs · on Aug 27, 2023

I've heard this from many people now but I cannot relate at all. Copilot has changed my entire strategy of programming (which I have been doing for 20 years)

It writes lots of code before I even think about what I was going to do in that file. Half the time it just works outright, and if not, with minor changes.

It is fantastic when I have to build functionality in languages I am not great at.

I regularly copy and paste code from my colleagues PR's and paste it into GPT-4, it explains it perfectly and gives tips on how to improve it (which I add as comments to my PR review)

enumjorge · on Aug 28, 2023

It's comments like these that make me wish I could stand behind the chair of the people who get these results because I feel like there's something lost in translation. I don't know if I'm taking your comment too literally but this part confuses me:

> It writes lots of code before I even think about what I was going to do in that file.

If you haven't even thought of what you want to do in the file, how are you prompting Copilot to write the code you need? Am I understanding correctly that Copilot not only generates the correct code, but it's also able to deduce what code you meant to ask for without any input from you? I don't see how this is possible except in the uncommon case where you are bringing a file to be more in line with other files in your code base that have a similar structure.

I do think Copilot can be useful, but I'm confused at how different my experience is from what others are getting out of the tool.

abootstrapper · on Aug 28, 2023

ChatGPT 4 is amazing! I use it when I’m blocked to get unstuck. I don’t need perfectly functioning code, just something to get my gears turning. I use it to review my own code. It often has pretty good suggestions, or can validate my approach. And I use it to sketch out unit tests for me. I still have fill in pieces, but it’s absolutely improved my output. It’s like having an egoless programming partner. Yeah, it makes mistakes, but I treat the AI like a programming buddy, not a flawless code wizard. It doesn’t solve problems for me, but we come to solutions together.

squeaky-clean · on Aug 28, 2023

> Am I understanding correctly that Copilot not only generates the correct code, but it's also able to deduce what code you meant to ask for without any input from you?

You can know what you want to program, but not how you would approach the code. "Hey ChatGPT, how do I draw a curved line plot in JUCE?"

If I just ask it "Can you write me a script that gathers the current stories on the front page of hacker news and sorts them by number of comments, then prints the title and number of comments?" it writes a script in Python using requests and BeautifulSoup4.

thomasfromcdnjs · on Aug 28, 2023

What I find useful is I write two lines of comments when starting a new file, about what it is going to do.

Otherwise, even just creating a function, generally they will be filled out automatically.

I suppose I'm quite big on naming (vars and functions)

Maybe if your language or code is more terse, it might struggle more.

redhale · on Aug 28, 2023

This!

I wonder if it's a language difference thing (I use C# mostly), but my experience with Copilot has been nothing short of mind-blowing.

It's not writing every line, and when writing truly new feature code it's less useful. But here are two patterns that I've noticed it is especially and consistently good at as a potential place to look for initial value if you are skeptical:

- If you have any kind of repeated pattern, even very complex ones, like performing a set of operations on a set of objects, or initializing a set of things, or whatever, it will guess everything else after the first line 90% of the time. This stuff is almost always plumbing/wiring/boilerplate-type code, so pure time gained. Think about Excel filling incrementing numbers down a column, but for pattern-matched lines or blocks of code.

- For any reasonably testable class, if I write the name of the unit test, Copilot will write the entire unit test perfectly 90% of the time, down to variable names and //Arrange//Act//Assert comments that I stylistically prefer. Seriously, it's sort of scary how good it is at this.

toyg · on Aug 28, 2023

Languages like C# and Java have always required a lot of boilerplate and repetitive (sorry, "patterned") code. They are perfectly suited to GPTs.

justinlloyd · on Aug 28, 2023

I've used ChatGPT and Copilot with Javascript & Typescript and a bunch of frameworks. Python, C, C++, C#, Go, 6502, Z80, ARM and a few other languages so far. They've all worked pretty good. I wish it had a wider breadth of APIs and documentation, but it is pretty good so far.

hedora · on Aug 28, 2023

Meh. My coworkers do what you’re doing, and then I end up rewriting their stuff. (AI generated stuff generally passes code reviews because it gets the benefit of the doubt.)

wesleywt · on Aug 28, 2023

Meh. I started coding professionally in the age of AI. My productivity is way higher than it should be. I don't let the AI "write" the code for me. I use it as a companion that I asked question of and suggestions.

It suggests solutions to problems, I add the complexity. So far my code has been both clear and performant. This should not be the case at my level.

enumjorge · on Aug 28, 2023

> I started coding professionally in the age of AI.

What does this mean?

Copilot was released in March of this year. ChatGPT 4, which the community considers to be the only version of ChatGPT to be competent enough at coding tasks, was also released in March.

My guess is that you didn't mean that statement to be a fancy way of saying "I've been coding professionally for around 6 months", but I don't know what else you mean by the "age of AI".

anon25783 · on Aug 28, 2023

Maybe they think this is LinkedIn? /j

Philpax · on Aug 28, 2023

Minor correction - the original version of GitHub Copilot was released in late-2021.

FeepingCreature · on Aug 28, 2023

The community is wrong, ChatGPT 3.5 is fine.

master_crab · on Aug 28, 2023

I feel like using suggestions for line by line composition might lend itself to better review by programmers.

As opposed to ChatGPT which spits out dozens of lines of code based on a request and therefore requires more involved editing after the fact to understand what’s happening.

wesleywt · on Aug 28, 2023

Yes, GPT is an aid not a programmer. You should still be doing the programming while ChatGPT gives you quick solutions to problems without having to look up Stack Overflow.

dev1ycan · on Aug 28, 2023

copilot? lol

I can understand using chatgpt 4 but copilot??? copilot can barely autocomplete... and half the time it does it incorrectly when variables and other stuff are involved

simonw · on Aug 28, 2023

Sounds like you would benefit from spending some time learning how to use Copilot.

Which isn't easy, because there's virtually no documentation for it.

Once you get the hang of it it's incredibly useful. I really feel it when I'm working in an editing environment without it now.

_k4hw · on Aug 28, 2023

Copilot has been incredibly useful for me

skepticATX · on Aug 28, 2023

There's a big gap between people using it to write greenfield applications and/or smaller tools, and people working on large codebases.

I am in the latter group, and I don't find it all that helpful. There simply aren't any tools that can plug into a massive codebase with millions of lines. I never just work on one specific repository either - a change generally involves multiple repositories.

On the other hand, if you're writing a smaller standalone utility, or working on a greenfield application, it can be helpful (with all the useful caveats about hallucations).

I think that a lot of ML folks, especially those in research, fall into the former category, which is why there has been so much hype.

plorkyeran · on Aug 28, 2023

I do think that's the big disconnect. There's jobs where you basically only ever create brand new code from scratch - writing lots of little scripts, creating new screens in a webapp, or whatever - and current AI tools can save a ton of time for that. There's also jobs where it's normal to spend a week figuring out what to change and then five minutes changing it, and current AI tools are worthless for that. Most jobs are somewhere in between those extremes, but the hype is coming from people pretty far on the "mostly writing new code" end.

version_five · on Aug 28, 2023

I'd go further and say the sweet spot is probably writing a simple function that fits in the model's context window, in a language you don't know. Even here it screws up, but you can go back and forth and get somewhere. I can only imagine the mess one would make using it to build an actual application of any size. The analogy that comes to mind is using google translate to try and write an essay.

CharlesW · on Aug 28, 2023

> There simply aren't any tools that can plug into a massive codebase with millions of lines.

If Sourcegraph's offering (https://about.sourcegraph.com/cody) has limits I'm not aware of, this is just a matter of time. But in my experience, this is a "nice to have" rather than a "must have" when it comes to the benefits of GenAI as it applies to coding.

ianbutler · on Aug 28, 2023

Yeah I think this is really one of the key issues. I found all of this tech to be really great with my side projects and really lacking with my big corpo codebase fragmented across multiple teams, projects and libraries.

throwaway09223 · on Aug 28, 2023

You have to put it in context: How good is the human alternative?

I have interviewed hundreds of engineers across the entire skill spectrum. I think GPT4 right now is about on-par with a mid level developer.

It makes a lot of mistakes, especially around counting, and usually can notice and fix them if they're pointed out. Humans do this all the time -- how often do you get a compiler error from something silly?

It often misapplies interfaces on the first go-around. Again, this is just like humans. We make mistakes, notice them, fix them. If you simulate this by telling GPT that its code produced an error it will often correct itself.

The absolutely killer feature of GPT4 is that it has these skills in every subject. It's fluent in kernel operations. Databases. Networking. Various UX frameworks. Any language.

It's definitely not perfect. But, if the alternative is hiring a mid-level human engineer, GPT4 is a really compelling alternative.

xtagon · on Aug 28, 2023

A mid-level human engineer can iteratively fix the mistakes they start making, instead of that just being the final result. Can GPT-4 do that?

throwaway09223 · on Aug 28, 2023

Yes, like I said if you give it the results, eg "That code resulted in this error. What's wrong?" it will more often than not correct its mistake.

Hooking it up to automatically run the code in question and examine the output is a trivial undertaking - many folks have already done this.

jhugo · on Aug 28, 2023

My experience has usually been that it will apologize and then produce another snippet with a different flaw. When that is pointed out, it will usually go back to the original snippet with the original flaw. It sometimes also insists that it has “run the test suite and they all pass”, just like a human programmer that is trying to fake it until they make it I guess?

kubrickslair · on Aug 28, 2023

It does it pretty well for Python with code interpreter, especially for quant/ ML heavy code.

tomjakubowski · on Aug 28, 2023

I wonder if anyone has done a feedback loop with GPT and compiler errors to train this sort of thing

aldarisbm · on Aug 28, 2023

they have, with langchain

Yodel0914 · on Aug 28, 2023

> I think GPT4 right now is about on-par with a mid level developer.

At writing code, that might be a slight exaggeration. At explaining code it's at least at mid level, possibly better.

saberience · on Aug 28, 2023

How many mid level developers do you know that understand the syntax and apis of every single programming language in existence?

I’ve used chatgpt recently to build an entire c++ game with sound, keyboard and mouse control, ai, etc, having never used c++ before. I’ve also never built a game before. I’ve worked with a lot mid level developers and I’ve never seen any that can write me a a c++ file giving me working code for a collision system in 30 seconds. Let me know if you find one though.

bcrosby95 · on Aug 28, 2023

I would peg gpt as between junior and mid. Mids (around 5-10 years of experience) tend to produce overly complex code for the problem at hand.

saberience · on Aug 28, 2023

Dude, chatgpt as a junior or mid level programmer? Have you actually met junior or mid level programmers?

I’ve got 20 years of experience in the tech industry and chatgpt is far far better than any mid level engineer I’ve ever met and I include FAANG engineers too.

Chatgpt can answer any leetcode programming question in essentially every programming language in existence in less than 20 seconds.

Chatgpt can analyze classes for errors written in every language in seconds.

Chatgpt helped me write animation, ai, sound handling code in c++ and helped me build a full working game, the total code generation time for chatgpt “thinking” was on the order of one hour.

I don’t know any senior engineer who could help me (a c++ newbie) write a whole working game engine with all the necessary systems in less than one hour of “thinking time”.

I’m starting to think that all the “naysayers” of chatgpt simply used it once or twice, saw one error and dismissed the whole thing. Or they’re simply afraid for their own skills and avoid chatgpt as the alternative scares them too much.

bcrosby95 · on Aug 28, 2023

Yes, we employ them regularly. And yes, I use both ChatGPT and copilot regularly.

In my experience, if you compare the unedited output of ChatGPT to the unedited output of someone with a couple years of experience in the given domain, ChatGPT will have more subtle bugs than the human provided output.

Hence my between junior & mid assessment. It doesn't mean that it's useless though. Having a tool at that level of experience for every domain in the world that can cook up code near instantaneously is damn useful. And I assume it'll only get better.

YeGoblynQueenne · on Aug 28, 2023

Can you share a link to the game you made with ChatGPT?

capableweb · on Aug 28, 2023

Which version of GPT are you using?

I've found GPT4 moderately helpful, like getting help from someone with a wide but shallow experience of lots of things.

> I can't easily fact check it

Why not? How did you find out things about basic terminology or concepts you're not familiar about before GPT? Apply the same methods, although you just have to fact check rather than coming up with what to check, so removes a small initial discovery, at least for me.

Gigachad · on Aug 28, 2023

>Why not? How did you find out things about basic terminology or concepts you're not familiar about before GPT?

Because if it tells me something that could just be found on google, I could have found it there first and not have to do as much fact checking. And if it tells me something I can't easily find, I can't tell if it's got some deep insight on harder to locate info, or if it's just made up.

FeepingCreature · on Aug 28, 2023

ChatGPT gives you ideas and terms that you can google. That's often a rarity in a new language/framework.

Retric · on Aug 28, 2023

It’s a question of time investment, fact checking is slower you than actually learning the material directly. So if you need to fact check every single thing you’re better off just leaning it in the first place.

pnpnp · on Aug 28, 2023

In the original comment’s defense, sometimes you need to find where to start. For example: when writing a research paper, one might hastily cast a wide net before locating and drilling into a topic from a respected source.

I’m not well-versed in JavaScript (I’d like to be, but neither me nor my company have the resources right now to commit). When JS work comes up, I’ll often Google/ChatGPT to point me in a general direction before pulling up technical documentation on whatever comes back. I’m pretty good at learning, but sometimes it takes me a while to find out where to start.

Retric · on Aug 28, 2023

I get where you’re coming from. However it’s easy to fall into the generalist trap where muddling through works well enough that you never spend the time to actually learn anything. It’s risky to really lean the details of a web framework which you may only use for a single 6 month project.

However, there’s a few things like SQL and JavaScrip that are likely to stick around mostly unchanged for your entire career. The risks are therefore much lower.

IMO, a reasonable heuristic is spend around 10% of the time you expect to work on technology in the next year actually learning the fundamentals until you feel comfortable. Often what looks like a major time sink goes away when you stop fumbling around.

jay_kyburz · on Aug 28, 2023

Yeah, I have been learning some new topics lately, and GTP is actually very helpful in giving me the language I need to use start searching.

I'm not interested in starting at the beginning and learning a whole new language or API, I just want to get my task done, so I can ask GTP to get me started, then I can run the code it writes and start building tests and things to verify it works as expected.

pnpnp · on Aug 28, 2023

This is how I use it as well. I don’t trust it, but find it more useful than Google these days for zeroing in on a topic.

moralestapia · on Aug 28, 2023

Particularly with Google Search getting shittier and shittier with time.

pnpnp · on Aug 28, 2023

It’s really too bad! I’d rather not be using AI as a search replacement, but I do feel like search results have degraded significantly.

corethree · on Aug 28, 2023

The over-hype is causing it to look under-hyped. 48% of the code being freaking correct from 0% about 5 years ago is huge.

With everyone and all the news talking about it constantly inevitably people are going to start rolling their eyes.

This is bias. 48% is half way to 100%. Once it reaches 100% you don't have a job. You realize that right? It's halfway their to taking your job and you're underwhelmed. The bias is on your side.

It's typical. It's like an indie band is only popular when not many people know about it. Once everybody starts talking about it all the time it loses it's popularity. People start thinking it's not cool anymore. That's the bias you are suffering from.

jhugo · on Aug 28, 2023

You assume that getting from “48% as good” as a human programmer (on an undoubtedly biased sample) to “100% as good” is about equally hard as getting from 0% to 48%. That seems a pretty big assumption.

Many programmers also do work that doesn’t just involve stitching APIs together; GPT consistently fails badly when forced to reason, making it useless for many tasks. For example, ask it to implement a common algorithm like SHA - it will do a decent job. Then ask it to do it with an arbitrary limitation which no implementation it was trained on would have (e.g. use no integer type wider than 8 bits). In my experience, it cannot achieve this task with a correct result, even with significant hand-holding.

corethree · on Aug 28, 2023

>You assume that getting from “48% as good” as a human programmer (on an undoubtedly biased sample) to “100% as good” is about equally hard as getting from 0% to 48%. That seems a pretty big assumption.

I meant 38% aka 40% made a mistake on my math.

It's not a big assumption. It's the most reasonable assumption. When you drive 50% of 10 miles the next 50% takes the same amount of time. It's the default assumption.

I understand where you're coming from though. Many products made by entrepreneurs follow that model where the remaining percentage is always harder than the beginning green fielded tech that was built. But can we honestly say that this is what AGI will be like? The LLM was an unexpected jump forward by an unprecedented amount. It could be we fill the gap to 100% with another such jump.

__loam · on Aug 28, 2023

> It's not a big assumption. It's the most reasonable assumption. When you drive 50% of 10 miles the next 50% takes the same amount of time. It's the default assumption.

And it's wrong, as anyone with experience in ML will tell you. 60% is easy. 70-85% is not too bad. 95% is hard. 100% is effectively impossible. This is the core problem of ML systems. They're good enough for 90% of cases but that last 10% can be incredibly important, to the point that the systems become unusable.

corethree · on Aug 28, 2023

The topic at hand is the pace of AI technology. This one is hard to quantify as it's been mostly linear in terms of technological progress. Example: From gans to LLMs. That progress has not been diminishing.

You seem to think the topic is about the current state of training algorithms and how it's hard to bring the model to 100 percent. That's a different topic with a subtle distinction.

jsunderland323 · on Aug 28, 2023

> 48% if half way to 100% … It's halfway their to taking your job and your underwhelmed.

You’re implying 100% reliability is actually attainable. If that were the case, wouldn’t that mean the halting problem would have been solved by AI? I’m not an expert but I’ve heard that’s like one of those fundamental laws of information theory that really can’t be broken.

> It's typical. It's like an indie band is only popular when not many people know about it. Once everybody starts talking about it all the time it loses it's popularity.

I think applying the sociology of hipster band fans to LLMs is a mistake and I’m not sure how the vogueness of the technology correlates with the correctness of the actual models. Sometimes an early technology seems docile or useless at first but eventually reaches ubiquity and in retrospect the utility is obvious. But sometimes (more often than not) standard adoption curves don’t make sense to apply to a new technology because it isn’t useful enough to go through an adoption cycle. I think there’s a temptation to apply the analogy of something like the early internet or smartphone to LLMs but those are networked products. LLMs don’t really improve with the various applications built on top of them if the LLM is itself fundamentally broken or faulty to the point that it is unsafe to use in practice. Furthermore given the massive amount of premature hype due to the AI zeitgeist, you can safely assume enough user hours have been spent messing with LLMs to get a verdict on their utility. Unlike the feeble technology that takes a longtime to reach a scale to know its utility, we don’t have to wait 10 years for 100m people to try LLMs, it happened in a week. My only point being, I think we should be apprehensive about trying to draw analogies of other adoption cycles that structurally are very very different to this one.

Obviously, I, like probably everyone else on this website would love LLMs to be reliable to a high degree. Just this morning I had such a good use for an LLM that I was seriously considering building (and admittedly still am pondering) but the second I started to think through the LLM faultiness, I had to consider the complexity of the safe guards and weigh if it was really better to use GPT or just write a nasty regex script and constrain the problem. I’m leaning heavily towards the latter but I’d much prefer a silver bullet if it really killed vampires. Until then (if that day ever comes), it’s lead bullets for me.

corethree · on Aug 28, 2023

>You’re implying 100% reliability is actually attainable. If that were the case, wouldn’t that mean the halting problem would have been solved by AI? I’m not an expert but I’ve heard that’s like one of those fundamental laws of information theory that really can’t be broken.

Nah even 100% reliability isn't attainable by a human. 100% obviously doesn't imply solving the halting problem.

100% as in 100% as reliable as a human. Even surpassing a human. Sort of in the same way as a computer now beats humans at chess.

>I think applying the sociology of hipster band fans to LLMs is a mistake

Why would it be a mistake? Human psychology is similar across all spectrums. What happens in one area is likely possible in another area. If it happens among hipster bands it can happen among hipster technology fads.

>I think there’s a temptation to apply the analogy of something like the early internet or smartphone to LLMs but those are networked products. LLMs don’t really improve with the various applications built on top of them if the LLM is itself fundamentally broken or faulty to the point that it is unsafe to use in practice.

This is a valid speculation. But it's speculation. Basically your saying that LLMs are fundamentally broken and stuck at 38% forever because of fundamental and permanent flaws. The jury on that one is still out. And your point is highly, highly speculative.

We see quantitative improvements on LLMs constantly AND this is a nascent technology we don't completely understand yet. The most probably and logical conclusion is to follow the technological trendline. That trendline is pointing up.

To speculate on fundamental flaws of the LLM when people don't even fully understand what's going on with the LLM is illogical because you can't derive conclusions from something you don't understand. We can only generalize the trendline and constant improvements we've seen in AI for the past decade. Again that trendline is pointing to further break throughs in the future.

>Obviously, I, like probably everyone else on this website would love LLMs to be reliable to a high degree.

No this is not obvious to me. I disagree. I think some people are like you but other people, for example Geoffrey Hinton are in the apocalyptic camp. Personally I'm in the middle, I think it could go either way. It will definitely harm a segment of our society by taking over work, but whether the benefits of AGI outweighs the harm remains to be seen.

jsunderland323 · on Aug 28, 2023

I'm not going to rebut everything but I want to point out that I think Geoffrey Hinton is way more in your camp in regards to faith in the technology (which is what we're discussing). I'm in the camp that thinks LLMs are at best a viable competitor to furby -- I have zero existential fears and have very little confidence in them. I'm not opposed to AIs, I just think the state of the art is really really bad (sorry) and the doom and existentialism is a marketing ploy to a world imbued in conspiracy theory and institutional distrust in order to compensate for a wildly over promised and under delivered product. But hey, when you gotta raise money, you gotta raise money, and as you put it, the "trend line is going up" and that's all that matters.

corethree · on Aug 28, 2023

>I'm not going to rebut everything but I want to point out that I think Geoffrey Hinton is way more in your camp in regards to faith in the technology (which is what we're discussing). I'm in the camp that thinks LLMs are at best a viable competitor to furby

Your point was on whether everybody wants to see AI take over. To that Geoffrey Hinton doesn't want AI to take over, while you do. I for one am not sure.

As for whether Geoffrey Hinton thinks AI is legit or not is a different story. On that topic he's on my side, but that was besides the point I brought him up to point out that your take on "everybody" wanting AI to develop further is incorrect.

> and the doom and existentialism is a marketing ploy to a world imbued in conspiracy theory and institutional distrust in order to compensate for a wildly over promised and under delivered product.

Other way around. With all the money and business interests going into LLMs business interests are promoting LLMs in a future that you want and that is not apocalyptic. The conspiracy theories aren't a thing. It makes no sense as those theories don't align with where the money is being thrown.

>But hey, when you gotta raise money, you gotta raise money, and as you put it, the "trend line is going up" and that's all that matters.

Bro. I am not saying "trendline" as if it's something I have to keep throwing money at to support.

I am talking about a mathematical projection based on data. The pace of technology in the past when graphed points to an ever increasing line on a line graph. When you take the slope of that line and use it to do a quantitative prediction, that line just points up. That's just a fact of reality. The logical outcome of the data we see.

FeepingCreature · on Aug 28, 2023

Your hypothesis is "our company will destroy the world" is a marketing move?

Look. People in reality are basically never this gigabrained. Maybe consider as your first-line theory, that when people say "AI will destroy the world", that they mean to express that they believe that AI will destroy the world?

__loam · on Aug 28, 2023

It's definitely a marketing strategy. "Our technology is so good it might destroy the world. So we're letting you use it for 19.99 a month." It's just a completely inconsistent position. Who benefits from the government saying only openAI and a few other companies can make this stuff? Would openAI rather talk about the (non-existent) existential threat of AGI or actual problems with their technology like data privacy issues and the amount of power their bullshit consumes?

jsunderland323 · on Aug 28, 2023

Yeah, I think that's well said. I think the doomerism is more of a sell to investors than users sorta situation but it cuts both ways, which makes me sad for the tech industry having just made this same mistake with crypto. Who benefits from this? Anyone invested who doesn’t want to disappoint an investor (LPS, VCs, and founders) and/or anyone who’s ego is invested in the space and/or anyone who is raising money in the space and reliant on the technology and/or AI companies who benefit from the publicity/legitimacy of having their CEO talk to congress and/or AI companies who get an entire NYT piece run for a week about how an LLM is going to kill us all and/or any public company that can benefit from a stock rally by pushing the narrative that they’re in a land grab market and have the coveted goods and/or any large company that could benefit from blocking out nimbler competitors by getting arduous regulation passed… I could go on. You realize that like every 10th article on HN is an AI existentialism piece? Reddit is even worse. It also diverts the conversation from LLMs being shitty to the machines are coming. I’m fine with doomerism (getting hysterical can be fun) but I think it’s completely unwarranted with where we’re at. I don’t think we should overlook the risks associated with intellectual property, impersonation, data privacy, the ever growing body of bot spam destroying the internet etc. but worrying about the more dramatic risks has proven to be a pretty effective diversion tactic from the obvious acute problems — namely that LLMs have very little practical utility as they are now and the little utility they do have is primarily beneficial to spammers/scammers. I can already feel the person chiming in to tell me they love co-pilot (please just learn the language you’re working in).

But to be clear I’m not saying this is a methodical highly coordinated ad campaign amongst multiple companies. I don’t think human beings are that competent. I think it’s way more grassroots and feeds on cultural tropes that have existed since at least the 1950s but probably the early 1900s. I think Anthropic’s leadership probably genuinely believes their own bullshit but I also think they understand that that same bullshit has raised over a billion in funding.

__loam · on Aug 28, 2023

We'll said, and I agree. The primary use cases are things where accuracy doesn't matter like spam and scams.

__loam · on Aug 28, 2023

I agree with the majority of your well considered comment. I'm okay with LLMs being really unreliable though. I think if they actually worked, it would be an unmitigated disaster for workers globally.

jsunderland323 · on Aug 28, 2023

I think there’s arguments to be made both ways. I‘m generally of the belief that technology that works is a net win for humanity. But I think we need to stop being vague about what it means for an LLM to work. For me, I just want to tell an LLM to replace all the hard coded strings in my app with translation tags. Sadly, this isn’t possible to do reliably with what we have today. I’m not sure who’s job this would eliminate other than my own. I think the more existential questions are about if we made a god box that could perform all the white collar jobs. I think we’re so far away from that despite being sold that vision, that we might as well consider the moral and philosophical implications of a time machine or teleportation device if we’re going to continue to entertain AI doomerism for LLMs.

__loam · on Aug 28, 2023

Hah fair point

wpietri · on Aug 28, 2023

> Once it reaches 100% you don't have a job. You realize that right? It's halfway their to taking your job and you're underwhelmed. The bias is on your side.

What this study looked at was feeding StackOverflow questions to LLMs and then looking at the quality of the code. If you think a programmer's job is just turning an english-language description of a function into isolated code that never gets modified, I don't know what to tell you. In my opinion, anybody like that should not have a job today, never mind the future.

A proper professional programming job involved AGI-level understanding of human users and their needs as embedded in a social context. Plus the ability to create novel solutions. Plus the ability to use code as a collaborative medium to make a code base that is sustainable over the long term by their colleagues.

LLMs are not even 1% of the way to replacing professional developers.

corethree · on Aug 28, 2023

Not true. They are well past 1%. That's not looking at reality. You can lower it below 38% but 1% is pure fantasy.

wpietri · on Aug 28, 2023

First off, you did not address most of my point.

To answer your complaint, I am absolutely looking at reality. Please point us to an actual real-world professional programming job, matching the criteria I list above, for which a current LLM can economically replace a human for over a 5-year period.

My believe is that there are exactly zero such jobs. If your claim is that we're at least at 1%, then you're claiming that there are at least 269k fully automatable programmer jobs [1]. It shouldn't be hard for you to find at least one to start.

[1] Wikipedia says there are an estimated 26.9m professional programmers: https://en.wikipedia.org/wiki/Software_engineering_demograph...

corethree · on Aug 28, 2023

>My believe is that there are exactly zero such jobs. If your claim is that we're at least at 1%, then you're claiming that there are at least 269k fully automatable programmer jobs [1]. It shouldn't be hard for you to find at least one to start.

If I built an entire car but I am missing the key. I cannot drive the car but the car is 99% complete. It's just missing the key. So what happens is, it can replace 0% of human locomotion but it's 99% complete? get it?

Just because the tool can't be used doesn't mean it's zero percent of the way to being drive able. Same with your job. The LLM can't replace any job yet, but it doesn't mean it's 1% of the way there.

But you see what I just explained to you is obvious. You already know this just like I and every one else on the face of the earth already knows this fact.

You're taking the discussion into a play on words. What does 1% apply to? Rather then use common sense and derive what I mean you prefer to redirect the conversation into a very specific definition of 1% that serves your own purpose.

We can play this game all day. We can discuss whether your application of 1% is more fitting than my application of 1%. What a waste of everyone's time.

I think it's better if rather then playing these games to "win" discussions, use your common sense to move the discussion past these games. Otherwise we're going to be talking about obvious things all day and getting all worked up about personal definitions.

Clearly the LLM is more than just 1%.

wpietri · on Aug 28, 2023

Your "1%" is a fantasy measure. It is purely subjective. It creates a false sense of linearity when the progress curve is not only not linear, it may not be possible to complete. You just handwave away any criticism with a "clearly" when it isn't clear at all.

Mine is clearly measurable. You wanted to talk about job replacement; I'm measuring jobs replaced. You believe that we're at 48% complete (or 38% or whatever) but you can't point to even one programming job that ChatGPT can take over. If your fantasy 48% measure means 0% real-world success, I think the 0% is the more useful number to look at.

Humans may not be smart enough to build human-grade minds, in the same way that cats will never learn to program no matter how much they paw keyboards. People mistaking ChatGPT for AGI strikes me as the same kind of error where they mistook Eliza for a person, or 1970s/1980s computers as the things that would develop consciousness and maybe take over. It doesn't tell us much about the power of computers, but rather the inability of some humans to understand how complex and powerful minds are.

corethree · on Aug 29, 2023

>Your "1%" is a fantasy measure. It is purely subjective

unfortunately the hard topics and the topics that matter are the questions that matter most are often qualitative.

Measurable answers are easy. Nobody cares about those things because it's obvious. Has ChatGPT replaced any jobs for programmers right at this moment? Overall no. But why argue a point everybody already knows?

Will chatgpt replace you in the future as the technology develops? Is it a precursor to the machine that will replace you? Qualitative trend-lines point to an unknown artificial entity that does replace all of us. It is useful to speculate and extend the trendline in that direction.

You may wish to shield yourself to reality and only look at things with quantitative numbers and lengths that are measurable by rulers but the shield is an illusion what you are doing is blindness.

Natural selection and natural history is built from qualitative understanding. Without subjective analysis on qualitative data we would not be able to understand natural selection from the macro perspective. You would be ignorant of the concept of evolution and natural selection and not much different than a creationist.

Obviously you aren't like that, you're just using a tactic to win a discussion. Always use hard data works when it works. But ultimately this strategy has failed.

raincole · on Aug 28, 2023

38%.

And 38% is not "a third way to 100%". I'm sure you know what diminishing return is.

corethree · on Aug 28, 2023

my mistake. 40% of the way there, point still stands.

raincole · on Aug 28, 2023

> point still stands

No?

I assume you actually don't know the concept of diminishing return, and it's totally fine. Every other comment here is explaining it to you already. I won't bother to repeat. Please read the sibling comments carefully.

corethree · on Aug 28, 2023

"Hey do you know what exponential growth is?" You think I just say that sentence and suddenly all your arguments are suddenly flushed down the toilet because I stated some random concept out of nowhere? Come on. You need evidence.

You brought up this question of diminishing returns out of nowhere and offer no evidence for it. So why would my point not stand?

I read the sibling comments. One person brought it up but offered no proof that this is what's happening. Simply did what you did in a manner that was less rude, just stated the concept of diminishing returns applies to LLMs without offering a shred of evidence that indicates this is what is happening.

There is a difference between introducing the concept of diminishing returns out of nowhere, and providing evidence that diminishing returns is WHAT is actually HAPPENING with LLMS.

We're about a year out from the introduction of chatgpt, that's not enough time to know if all the gains in the past decade of AI has suddenly hit wall of diminishing returns.

__loam · on Aug 28, 2023

I think you're misunderstanding the problem here. It could be 97% correct and it would still be unusable. Getting things to 100% is so hard it might as well be impossible, and the fact that you always need a person to check all the output limits the utility.

jki275 · on Aug 28, 2023

Even if it were possible to get to 100% (it isn't), I'd still have a job.

It's a great assistant. I use it. It's not taking anybody's job.

corethree · on Aug 28, 2023

It's taking your assistants job.

jki275 · on Aug 29, 2023

That's one contention, but I don't agree with that either.

Panzer04 · on Aug 28, 2023

Except the eternal problem with all "AI" so far is edge cases. A product that is 90% correct may as well be useless, honestly.

corethree · on Aug 28, 2023

Well then we are part of the way their still right?

It works but only edge cases are a problem. Next step is to fix the edge cases.

jki275 · on Aug 28, 2023

I wouldn't argue useless, I'd argue it's a product that still needs an expert user to validate.

tofukant · on Aug 28, 2023

100-62=38%

corethree · on Aug 28, 2023

dumb. I admit the mistake.

albedoa · on Aug 28, 2023

[flagged]

corethree · on Aug 28, 2023

Oh shut up. Don't be a dick.

Blahah · on Aug 28, 2023

100% - 62% = 38%

corethree · on Aug 28, 2023

my mistake.

ianbutler · on Aug 28, 2023

I don't relate, I'm currently designing a lower level data structure based on some papers and its been a pretty valuable rubber duck. I've only had one real instance of it being incorrect that wasted my time, higher ranked trait bounds in Rust can only be applied over lifetimes and instead it was suggesting I try to use them for non lifetime generics as well.

That said I'm not asking ChatGPT to write code that I use directly, I'm basically asking clarifying questions about the papers and other implementations I have on hand, and that's similar to how I use it in most cases.

For code generation I rely on co-pilot, which actually has the context of my codebase.

Edit: I will say this because someone else made me think of it: I find this tech all super useful on my like 20k loc and smaller side projects which are all self contained where as when I tried co pilot at work on a fragmented big corp code base I found it comparatively lackluster.

bugglebeetle · on Aug 28, 2023

It’s both under and over-hyped. The whole “AI is going to write your entire app for you with a couple sentences prompt” crowd is full of shit, but for small sections of code, variations on existing scripts and functions, writing tests, etc, GPT-4 is pretty incredible.

I work in data science and have to write a lot of repetitive stuff for parsing and cleaning data. It’s reduced my toil so dramatically in this respect, I can’t imagine going back to writing all that stuff again. I’m now much more ambitious in what I’ll experiment with as well because I know setting up the first stages of the data pipeline are going to be 10X less work than before.

justinlloyd · on Aug 28, 2023

I have been happy about 90% to 95% of the time with code that gets spat out by ChatGPT, though I don't fear for my job being taken by an LLM any time soon. It usually gives good insight in to unfamiliar code, or code in a language that I am familiar with but might have forgotten how to use. That said, I don't use it for every piece of code I write. Usually sections where I am exploring an idea.

At times, I need to correct the LLM on some usually obscure detail. It got the memory layout of the video screen on the BBC Micro completely mixed up with the ZX Spectrum, and I had to correct it about five times before it got it, and then it stuck for the rest of the conversation.

I've been on some wild goose chases, where I have fooled myself into thinking a particular approach would work and ChatGPT has been enthusiastically right there cheering me on.

It's like my dog going on an adventure with me, the dog doesn't care about where we're going or what we're doing, it's just excited to be part of the journey.

And in other cases, ChatGPT has been exceptionally useful in pointing out some blindingly obvious mistakes. It is like having a very knowledgeable, but exceptionally junior developer at your elbow.

If you ask the right questions, and don't except one shot questions to provide perfect answers, it works well for the most part.

ChatGTP · on Aug 28, 2023

you probably don’t fear your job loss because you’re still generating a bunch of shit your PMO is still confused by. That’s the game developers are going to be playing for a while I think.

drewfis · on Aug 28, 2023

Chat GPT is a game changer for a lot of my work. I make it write stored procedures in MS SQL along with the calling code in .NET, C#, and Dapper.

The generated code will likely have a few small issues, but it still makes me way more productive. It allows me to work ~10 hours a week less and enjoy my life.

toyg · on Aug 28, 2023

Until your boss/customer gets wind of it, and claims back that time.

beatthatflight · on Aug 28, 2023

I've used it for performance boosts. It's not perfect, but if I paste a method (my own code) and say 'rewrite in fewer lines' to simplify, or 'rewrite to run faster', I've been in awe at what it comes up with that hadn't occured to me. Tricks in python I wasn't aware of, or hadn't considered using a different datastructure that can be searched faster (for some reason my brain always defaults to lists, when a set may be better, for example). It absolutely makes errors though, so unit tests are critical, but I've got code running so much faster as a result.

FeepingCreature · on Aug 28, 2023

My experience is that ChatGPT often makes errors, but the errors are different than the errors I'd make. So it's easy to work with it collaboratively: I take a turn, GPT takes a turn, I take a turn... and it has an extremely broad spectrum of surface-level knowledge about tech.

The most annoying part is it's impossible for it to say that it doesn't know something. Hard to train for, I know.

malux85 · on Aug 28, 2023

Yeah, {zero, one}-shot is frequently wrong, to get good results you need to ToT or GoT (Tree of thought, graph of thought) which is currently only useful in more automated codewriting systems (like I'm building with https://atomictessellator.com) and not really useful in a co-pilot scenario.

_2z1p · on Aug 28, 2023

AI for nearly everything has been very overhyped. Paintings of people have extra fingers and other strange artifacts. Using it for writing prose is hit or miss. Full self driving cars are thwarted by rogue traffic cones.

I think we'll have AI some day, but today it's just not at the level that all the hype claims it is.

rvz · on Aug 28, 2023

There you go. But even before that 8 years ago, ConvNets, AlexNet, etc were getting all the hype and then came the adversarial images and issues which weren't addressed and then that hype fizzled out very quickly.

Now the AI bros here are attempting to sell us their new snake-oil in the form of a stochastic parrot which promises to be the solution to everything.

Well, unsurprisingly it is another unexplainable AI black box which still requires the human to check every single output so that it doesn't hallucinate something incorrect, which it does almost all the time with a lack of transparent explainability as to why it hallucinated.

So the fact is, it is already overpromising and under-delivering for serious use-cases. Just like FSD (Fools Self Driving) was when that was over-hyped and with little to no Tesla Robo-taxis on the road.

m3kw9 · on Aug 28, 2023

My experience is that you need to do a few back and forth get get it right especially for a more detailed functions. What works really well is if you need it to explain to you how something works, but to verify, you need to have a good sniff test which novices won’t have so they take it dangerously straight up

Retric · on Aug 28, 2023

Yea, I had a thought early on that programmers might start poisoning the well with these LLM by uploading a lot of broken code. But I think people have been unintentionally doing so for years.

rvz · on Aug 28, 2023

I can guarantee you that no-one here fully trusts any LLM to write complicated bug free code without you (a human) to keep checking over it.

Anyone with an understanding of unexplainable black-box AIs knows that LLMs hallucinating is not 'the same thing as a human'. Humans can be held accountable for their mistakes and can explain themselves transparently. LLMs fundamentally cannot reason or explain themselves transparently, other than rewording its original answer(s) to make themselves sound credible; like an expert sophist.

It goes to show that this overhyped snake-oil is now at the late stage peak of inflated expectations of the Gartner hype cycle.

FeepingCreature · on Aug 28, 2023

Of course, humans also often try to bullshit their way through wrong answers.

Denote6737 · on Aug 28, 2023

Given that chat gpt is trained on the internet. Places like Reddit.

Think to how many times you have seen code written with the caveat 'this is a general gist. I haven't tested this code'

I've seen it many times. More than anything else really. So it should come as no surprise that models trained on that data produce similarly wonky code.

This is then compounded by the fact that the user thinks of the model as they would a person. They receive a humanesqe response. So attribute the expected human backend. The understanding of the request and the feedback.

All current "AI" LLMs are just Chinese rooms.

rckrd · on Aug 27, 2023

In more impressive news, "38% of code generated by GPT-4 does not contain API misuses"

mekster · on Aug 28, 2023

Seems to be already surpassing humans.

floridsleeves · on Aug 27, 2023

The misuse of APIs in the generated code could lead to severe problem, such as resource leaks, program crashes, etc. Existing code evaluation benchmark and datasets focus on crafting small tasks such as programming questions in coding interviews, which however deviates from the problem that developers would ask LLM for real-world coding help. To fill the missing piece, researchers from UCSD propose a dataset RobustAPI for evaluating the reliability and robustness of code generated by LLMs. By collecting 1208 coding questions from StackOverflow on 24 representative Java APIs, they evaluate them on popular LLMs including GPT-3.5, GPT-4, Llama2, Vicuna. The evaluation results show that even for GPT-4, 62% of the generated code contains API misuses, which would cause severe consequences if the code is introduced into real-world software.

jrm4 · on Aug 27, 2023

That's more an indictment of APIs, (which fundamentally do sort of suck) than it is of, e.g. GPTs

ajross · on Aug 28, 2023

I mean, of course? It's trained on real world code. Real world code "misuses" APIs at, sure, a 62% rate. Sounds roughly right to me. Obviously there's dithering to be done over what constitutes "misuse" vs. actual bugs, etc... But really this sounds unsurprising.

AI isn't going to give you perfect code (or answers, or anything really). It's going to give you typical code, based on extremely broad "intuition" about how others ("all" others, really) have solved (or answered) the same problem. And that has value.

But it's not going to produce something better than the existing consensus, by definition.

jojobas · on Aug 27, 2023

Garbage in, garbage out. I wonder if it can contextualize the code in StackOverflow questions (i.e. why doesn't this work?) as bad and that in highly rated answers as good, or "code is code".