I'm actually having a hard time interpreting your meaning.
Are you criticizing LLMs? Highlighting the importance of this training and why we're trained that way even as children? That it is an important part of what we call reasoning?
Or are you giving LLMs the benefit of the doubt, saying that even humans have these failure modes?[0]
Though my point is more that natural language is far more ambiguous than people give it credit for. I'm personally always surprised that so many programmers don't understand why programming languages were developed in the first place. The reason they're hard to use is precisely their lack of ambiguity, at least compared to natural languages. And we can see clear trade-offs in how high-level a language is: duck typing is incredibly helpful while also being a major nuisance. It's the same reason even a technical manager often has a hard time communicating instructions. Compressing ideas isn't easy.
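As a minimal sketch of that duck-typing trade-off (the `Duck`/`Robot` classes are made-up illustrations, not anything from the thread): the flexibility and the hazard come from the same mechanism.

```python
class Duck:
    def speak(self):
        return "quack"

class Robot:
    def speak(self):
        return "beep"

def greet(thing):
    # Duck typing: any object with a .speak() method works,
    # with no interface declaration required -- convenient...
    return thing.speak()

print(greet(Duck()))   # quack
print(greet(Robot()))  # beep

# ...but nothing stops us from passing an object without .speak();
# the mistake only surfaces at runtime, not before.
try:
    greet(42)
except AttributeError as e:
    print("runtime failure:", e)
```

The same ambiguity that makes the function easy to write is what lets the bad call slip through until execution.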
[0] I've never fully understood that argument. Wouldn't we call a person stupid for giving a similar answer? How does the existence of stupid people mean we can't call LLMs stupid? It's simultaneously anthropomorphising and mechanistic.
I was pointing out that humans and LLMs share this failure mode, so in a lot of ways it's no big deal/not some smoking gun proving LLMs are useless and dangerous, or at least no more useless and dangerous than humans.
I personally would stay away from calling someone, or an LLM, 'stupid' for making this mistake, for several reasons. First, objectively intelligent, high-functioning people can and do make mistakes similar to this, so a blanket judgement of 'stupid' is pretty premature based on a common mistake. Second, everything is a probability, even in people. That is why scams work on security professionals as well as on your grandparents. The probability for a professional may be 1 in 10,000 while for your grandparents it may be 1 in 100, but that just means the professional needs to see a lot more phishing attempts before they accidentally bite. Someone/something isn't stupid for making a mistake, or even for systematically making a mistake; everyone has blind spots that are unique to them. The bar for 'stupid' needs to be higher.
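The arithmetic behind that comparison is just the mean of a geometric distribution: if each attempt is an independent trial with success probability p, the expected number of attempts before someone bites is 1/p (the 1-in-10,000 and 1-in-100 rates are the illustrative figures from above, not real data).

```python
def expected_attempts(p):
    # Mean of a geometric distribution: expected trials until first success.
    return 1 / p

professional = expected_attempts(1 / 10_000)  # about 10,000 attempts on average
grandparent = expected_attempts(1 / 100)      # about 100 attempts on average
print(professional, grandparent)
```

Same failure mode in both cases; the only difference is how much exposure it takes to trigger it.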
There are a lot of 'gotcha' articles like this one that point out some big mistake an LLM made, or some systemic blind spot in current LLMs, and then conclude, or at least heavily imply, that LLMs are dangerous and broken. If the whole world put me under a microscope and all of my mistakes made the front page of HN, there would be no room left for anything other than documentation of my daily failures (the front page would likely need to grow just to keep up with the last hour's worth of mistakes).
I totally agree with the language ambiguity point. I think that is a feature and not a bug. It allows creativity to jump in. You say something ambiguous and it helps you find alternative paths to go down. It helps the people you are talking to discover alternative paths more easily, too. This is really important in conflicts, since it can help smooth over ill intentions: both sides can try to find ways of saying things that bridge their internal feelings with the external reality of dialogue. Finally, we often really don't know enough, but we still need to say something, and, like gradient descent, an ambiguous statement may take us a step closer to a useful answer.
> I personally would stay away from calling someone, or an LLM, 'stupid' for making this mistake because of several reasons.
I wouldn't, because there's a difference between calling someone's action stupid and saying that someone is stupid; which one applies depends entirely on the context of the claim. Smart people frequently do stupid stuff. I have a PhD, and by some metric that makes me "smart," but you'll also see me do plenty of stupid stuff every single day. Language is fuzzy...
But I think responses like yours are entirely dismissive of what's being demonstrated. What's being shown is how easily these systems are fooled. Another popular example right now is the cup with a sealed top and open bottom (lol, "world model"?).
> There are a lot of 'gotcha' articles
The point isn't about getting some gotcha; it's about providing a clear and concise example of how these systems fail.
What would not be a clear and concise example is something that requires domain expertise. That's absolutely useless as an example to anyone who isn't a subject matter expert.
The point of these types of experiments is to make people think: "if they're making errors that I can easily tell are foolish, then how often are they making errors where I'm unable to vet or evaluate the accuracy of their outputs?" This is literally the Gell-Mann Amnesia Effect in action[0].
> I totally agree with the language ambiguity point. I think that is a feature and not a bug.
So does everybody. But there are limits to natural language and we've been discussing them for quite a long time[1]. There is in fact a reason we invented math and programming languages.
> Finally, we often really don't know enough but we still need to say something and like gradient descent, an ambiguous statement may take us a step closer to a useful answer.
Was this sentence an illustrative example?
Sometimes I think we don't need to say something. I think we all (myself included) could benefit from spending a bit longer before we open our mouths, or even from not opening them as often. There are times when it's important to speak out, but there are also times when it's important to stay silent. It's okay to not know things, and it's okay to not be an expert on everything.
> This is literally the Gell-Mann Amnesia Effect in action.
Absolutely! But there is some nuance here. The failure mode is for an ambiguous question, which is an open research topic. There is no objectively correct answer to "Should I walk or drive?" given the provided constraints.
Because handling ambiguity is a problem that researchers are actively working on, I have confidence that models will improve in these situations. The improvements may asymptotically approach zero, leading to ever more absurd examples of the failure mode. But that's okay, too. It means the models will increase in accuracy without becoming perfect. (I think I agree with Stephen Wolfram's take on computational irreducibility [1]: that handling ambiguity is a computationally irreducible problem.)
EWD was right, of course, and you are too for pointing out rigorous languages. But the interactivity with an LLM is different. A programming language cannot ask clarifying questions. It can only produce broken code or throw a compiler error. We prefer the compiler errors because broken code does not work, by definition. (Ignoring the "feature not a bug" gag.)
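The contrast between the two failure modes can be sketched in a few lines of Python (the off-by-one `average` bug is a made-up illustration): a malformed program is rejected up front, while a logically broken one runs happily and produces a wrong answer.

```python
# "Compiler error" analogue: Python refuses malformed code outright,
# so the mistake is surfaced immediately and cheaply.
try:
    compile("def f(:", "<example>", "exec")
except SyntaxError as e:
    print("caught at parse time:", e.msg)

# "Broken code": parses and runs fine, but silently does the wrong thing.
def average(xs):
    return sum(xs) / len(xs) + 1  # off-by-one bug, undetected until tested

print(average([2, 4, 6]))  # 5.0 instead of the correct 4.0
```

The loud rejection is the cheap failure; the silent wrong answer is the expensive one, which is the asymmetry the compiler-error preference rests on.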
Most of the current models are fine-tuned to "produce broken code" rather than "compiler error" in these situations. They have the capability of asking clarifying questions, they just tend not to, because the RL schedule doesn't reward it.
Producing fewer "compiler errors" and more "broken code" errors is a fundamental failure. The cost of detecting a compiler error is lower than the cost of detecting broken code. If the cost of detecting and fixing broken code increases at the same rate as LLMs "improve," then their net benefit will remain fixed. I asked my five-year-old the above "brain teaser" and he got it right. As a follow-up, I asked what he should wash at a car wash if he walked there; he said, "my hands." Chat answered with more gibberish.
I agree it is a fundamental failure of the current state of models. I believe it is solvable. The nuance is just that "solving" the problem might not look like what we think of as a solution. Hence the asymptote.