
Here’s a simple rule, based on the fact no one has shown that an llm or a compound llm system can produce an output that doesn’t need to be verified for correctness by a human across any input:

The rate at which llm/llm compound systems can produce output > the rate at which humans can verify the output

I think it follows that we should not use llms for anything critical.

The gung-ho adoption and ham-fisting of llms into critical processes, like an AWS migration to Java 17 or root cause analysis, is plainly premature, naive, and dangerous.



This is a highly relevant and accurate point. Let me explain how this happens in real life instead of breathless C-type hucksterism:

We have a project working on a very large codebase in .NET Web Forms (and other old tech) that needs to be updated to more modern tech so it can be in .NET 8 and run on Linux to save hosting costs. I realize this is more complicated than just converting to later versions of Java, but it's roughly the same idea. The original estimate was for 5 devs for 5 years. C-types decide it's time to use LLMs to help this get done. We use Copilot and later others, of which Claude turned out to be the most useful. Senior devs create processes that offshore teams start using to convert code. Target tech can be varied based on updated requirements, so some went to Razor Pages, some to JS with a .NET API, some to other stuff. Looks to be a pretty good modernization at the start.

Then the Senior devs start trying to vet the changes. This turns out to be a monumental undertaking. Literally swamped code reviewing output from the offshore teams. Many, many subtle bugs were introduced. It was noted that the bugs were from the LLMs, not the offshore team.

A very real fatigue sets in among senior devs where all they're doing is vetting machine-generated code. I can't tell you how mind-numbing this becomes. You start to use the LLMs to help review, which seems good but really compounds the problem.

Due to the time this is taking, some parts of the code start to be vetted by just the offshore team, and only the "important things" get reviewed by Senior devs.

This works fine for exactly 5 weeks after the first live deploy. At that point the live system experiences a major meltdown and causes an outage affecting a large number of customers. All hands on deck, trying to find the problem. Days go by, the system limps along on restarts and patches, until the actual primary culprit is found, which turns out to be a == for some reason being turned into a != in a particularly gnarly set of boolean logic. There were other problems as well, but that particular one wreaked the most havoc.
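A flipped comparison like that is the kind of change an exhaustive truth-table comparison can catch, at least when the logic depends on only a handful of boolean inputs. A minimal sketch in Python, where `legacy_rule`, `migrated_rule`, and their three flags are hypothetical stand-ins (the actual logic from the outage is not shown in this thread):

```python
from itertools import product

def legacy_rule(is_active: bool, is_admin: bool, is_locked: bool) -> bool:
    # Hypothetical original logic: admins bypass the lock check.
    return is_active and (is_admin or is_locked == False)

def migrated_rule(is_active: bool, is_admin: bool, is_locked: bool) -> bool:
    # Hypothetical migrated logic with the == silently flipped to !=.
    return is_active and (is_admin or is_locked != False)

def find_mismatches(old, new, arity: int) -> list:
    """Return every boolean input tuple on which the two predicates disagree."""
    return [
        args
        for args in product([False, True], repeat=arity)
        if old(*args) != new(*args)
    ]

mismatches = find_mismatches(legacy_rule, migrated_rule, 3)
for args in mismatches:
    print("behavior changed for inputs:", args)
```

This only works when the old code can still be run side by side with the new code and the input space is small enough to enumerate; with more inputs or non-boolean state it degrades to sampling.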

Now they're back to formal, very careful code reviews, and I moved on to a different project on threat of leaving. If this is the future of programming, it's going to be a royal slog.


> Here’s a simple rule, based on the fact no one has shown that an llm or a compound llm system can produce an output that doesn’t need to be verified for correctness by a human across any input:

I’m still not sure why some of us are so convinced there isn’t an answer to properly verifying LLM output. In so many circumstances, output that is 90-95% of the way there is very easily pushed to 100% by topping it off with a deterministic system.

Do I depend on an LLM to perform 8 digit multiplication? Absolutely not, because like you say, I can’t verify the correctness that would drive the statistics of whatever answer it spits out. But why can’t I ask an LLM to write the python code to perform the same calculation and read me its output?

> I think it follows that we should not use llms for anything critical.

While we are at it I think we should also institute an IQ threshold for employees to contribute to or operate around critical systems. If we can’t be sure to an absolute degree that they will not make a mistake, then there is no purpose to using them. All of their work will simply need to be double checked and verified anyway.


There isn’t one answer to how to do it. If you have an answer to validation for your specific use case, go for it. This is not trivial, because most flashy things people want to use llms for, like code generation and automated RCAs, are hard or impossible to verify without running into the I Need A More Intelligent Model problem.

2. I believe this is falsely equating what llms do with human intelligence. There is a skill threshold for interacting with critical systems; for humans it comes down to “will they screw this up?” And the human can do it because humans are generally intelligent. The human can make good decisions to predict and handle potential failure modes because of this.


Also, let’s remember the most important thing about replacing humans with AI - a human is accountable for what they do.

That is, ignoring all the other myriad, multidimensional nuances of human/social interactions that allow you to trust a person (and which are non-existent when you interact with an AI).


Why not automate verification itself then? While not possible now, and I would probably never advocate for using LLMs in critical settings, it might be possible to build field-specific verification systems for LLMs with robustness guarantees as well.


If the verification systems for LLMs are built out of LLMs, you haven't addressed the problem at all, just hand-waved a homunculus that itself requires verification.

If the verification systems for LLMs are not built out of LLMs and they're somehow more robust than LLMs at human-language problem solving and analysis, then you should be using the technology the verification system uses instead of LLMs in the first place!


> If the verification systems for LLMs are not built out of LLMs and they're somehow more robust than LLMs at human-language problem solving and analysis, then you should be using the technology the verification system uses instead of LLMs in the first place!

The issue is not in the verification system, but in putting quantifiable bounds on your answer set. If I ask an LLM to multiply large numbers together I can also very easily verify the generated answer by topping it with a deterministic function.

I.e. rather than hoping that an LLM can accurately multiply two 10-digit numbers, I have a much easier (and verified) solution by instead asking it to perform this calculation using Python and reading me the output.
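A minimal sketch of that idea, assuming the model's answer arrives as an integer. No LLM is actually called here; `claimed` and the function name `check_product` are hypothetical placeholders for whatever the model returned and however the check is wired in:

```python
def check_product(a: int, b: int, claimed: int) -> bool:
    """Deterministically check a claimed product.

    Python integers are arbitrary precision, so a * b is computed exactly
    no matter how many digits the operands have.
    """
    return a * b == claimed

a, b = 1234567890, 9876543210   # two 10-digit operands
exact = a * b                   # ground truth from exact integer arithmetic

# Stand-ins for a model's answer: one correct, one off by one.
print(check_product(a, b, exact))      # accepted
print(check_product(a, b, exact + 1))  # rejected
```

The check is cheap here because the correct answer is fully specified by the inputs. That is the "quantifiable bounds" case; it does not by itself say anything about tasks where no such oracle exists.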


Spitballing, if you had a digital model of a commercial airplane, you could have an llm write all of the component code for the flight system, then iteratively test the digital model under all possible real world circumstances.

I think automating verification generally might require general intelligence, not an expert though.


The same is true of computers. In fact, it has been mathematically proven that it is impossible to answer the general question of whether a computer program is correct.

But that hasn't stopped the last 40 years from happening because computers made fewer mistakes than the next best alternative. The same needs to be true of LLMs.


The theorem you’re alluding to says it is impossible to create a general algorithm that decides any non-trivial property of any computer program.

There is nothing in the theorem that prevents you from creating a program that verifies a particular specific program.

There is an entire field dedicated to doing just that.


The issue is that to verify a program you need to have a spec. To generate a spec you need to solve the general problem.

This is what gets swept under the rug whenever formal methods are brought up.


That is not true at all. You do not need to generate a spec. All you need to do is prove a property. This can be done in many ways.

For example, many things can be proven about the following program without having to solve any general problem at all:

echo "hello world"

Similarly for quick sort, merge sort, and all sorts of things. The degree of formality doesn’t have to go all the way to formal methods, which are only a very small part of the whole field.
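As a rough illustration of checking a property rather than writing a full spec, here is a sketch in Python that tests two properties of any sort function: the output is in order, and it is a permutation of the input. The helper names (`is_sorted`, `check_sort`) are made up for this example, and randomized testing like this is evidence, not a proof:

```python
import random
from collections import Counter

def is_sorted(xs) -> bool:
    """True if every element is <= its successor."""
    return all(xs[i] <= xs[i + 1] for i in range(len(xs) - 1))

def check_sort(sort_fn, trials: int = 200, seed: int = 0):
    """Return the first input that violates a property, or None if none is found."""
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    for _ in range(trials):
        data = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        out = list(sort_fn(list(data)))
        # Property 1: ordered. Property 2: same multiset of elements as the input.
        if not is_sorted(out) or Counter(out) != Counter(data):
            return data
    return None

print(check_sort(sorted))                       # built-in sort: no counterexample expected
print(check_sort(lambda xs: sorted(set(xs))))   # drops duplicates, so property 2 fails
```

Neither property needed a description of how the sort works internally, which is the point being made: a narrow, checkable property is much cheaper to state than a complete specification.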


> echo "hello world"

Congratulations, you just launched all the world's nuclear missiles.

This is to spec since you didn't provide one and we just fed the teletype output into the 'arm and launch' module of the missiles.


What you’re saying is equivalent to throwing out all of mathematics because of the incompleteness theorem and starting to pray to fried egg jellyfish on a full moon.


No, that's what OP is saying about LLMs.



