I don't think "results don't match promises" is the same as "not knowing how to use it". I've been using Claude and OpenAI's latest models for the past two weeks now (probably moving at about 1000 lines of code a day, which is what I can comfortably review), and it makes subtle, hard-to-find mistakes all over the place. Or it just misunderstands well-known design patterns, or does something boneheaded. I'm fine with this! But that's because I'm asking it to write code that I could write myself, and I'm actually reading it. This whole "it can build a whole company for me and I don't even look at it!" is overhype.
Prompting LLMs for code simply takes more than a couple of weeks to learn.
It takes time to get an intuition for the kinds of problems a model has seen in pre-training, what environments it faced in RL, and what bizarre biases and blind spots it has. Learning to Google was hard, learning to use other people's libraries was hard, and this is on par with those skills at least.
If there is a well-known design pattern you know, that's a great thing to shout out. Knowing what to add to the context takes time and taste. If you are asking for pieces so large that you can't trust them, ask for smaller pieces and their composition. It's a force multiplier, and your taste for abstractions as a programmer is one of the factors.
In early Usenet/forum days, the XY problem described users asking for implementation details of their X solution to the Y problem, rather than asking how to solve Y. In LLM prompting, people fall into the opposite trap. They have an X implementation they want to see, and rather than ask for it, they describe the Y problem and expect the LLM to arrive at the same X solution. Just ask for the implementation you want.
Asking bots to ask bots seems to be another skill as well.
Let me clarify, I've been using the latest models for the last two weeks, but I've been using AI for about a year now. I know how to prompt. I don't know why people think it's an amazing skill, it's not much different from writing a good ticket.
Writing a good ticket is not a common skill. IMO it seems deceptively easy but usually requires years of experience to understand what to include and express it in the most concise yet unambiguous terms possible for the intended audience.
Do you use an agent harness to have it review code for you before you do?
If not, you don't know how to use it efficiently.
A large part of using AI efficiently is to significantly lower that review burden by having it do far more of the verification and cleanup itself before you even look at it.
This is correct, but part of the issue is that it significantly increases token usage costs. Some companies are doing:
- PRD and spec fulfillment review
- code review + correction loops
- security review + corrections
- addl. test coverage and tidying
- addl. type checks and tidying
- addl. lint checks and tidying
- maybe more I haven't listed
And these are run after each commit, so you can only imagine the costs per engineer doing this 10, 20, 50+ times per day depending on how much work they're knocking out.
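For concreteness, that kind of post-commit loop can be sketched as a script that runs each check and collects the failures. The check names and commands below are harmless stand-ins run via the current interpreter, not any particular company's pipeline; a real setup would invoke a linter, a type checker, the test suite, a security scanner, and so on.

```python
# Sketch of a post-commit verification loop. The commands are placeholder
# stand-ins; a real harness would shell out to actual tooling.
import subprocess
import sys

def run_checks(checks):
    """Run each (name, argv) pair; return the names of the checks that failed."""
    failures = []
    for name, argv in checks:
        result = subprocess.run(argv, capture_output=True)
        if result.returncode != 0:
            failures.append(name)
    return failures

checks = [
    ("lint",  [sys.executable, "-c", "pass"]),                     # stand-in: passes
    ("types", [sys.executable, "-c", "pass"]),                     # stand-in: passes
    ("tests", [sys.executable, "-c", "import sys; sys.exit(1)"]),  # stand-in: fails
]

print(run_checks(checks))  # ['tests']
```

Each of those checks costs tokens only if an agent is doing the reviewing; the point of the cost concern above is that several of these stages are themselves LLM passes, run on every commit.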
I think if you were to scale that kind of usage across a reasonable team size, costs would start to add up fast, possibly beyond the cost of paying another engineer every year, especially if a lot of your teammates are new to AI or aren't using it efficiently. Of course, it all depends on the appetite of the company.
The other constraint: for those being laid off (perhaps as part of cost reductions to fund an AI budget for a smaller team), engineers who want to expand their skill set and practice this level of usage and efficiency effectively can't do so on their own funding, which makes it harder to find employment as expectations rise.
Prior to AI entering the fray, software development was largely free for everyone, allowing anyone with enough time and motivation to build the skills towards gainful employment. As AI becomes more prevalent and expectations around how it's used become higher, fewer and fewer applicants will be able to claim they have the experience necessary because it was out of reach due to costs.
> Do you use an agent harness to have it review code for you before you do?
Right now you need to be Uncle Moneybags to do this in your personal life.
If you're lucky, your employer is footing the bill but otherwise... Ugh. It's like converting your app running perfectly fine on a cheap VPS to AWS Lambda. In theory, it's fine but in reality the next bill you get could make you faint.
It's down to how much you value your time. If you value your time low enough, it doesn't pay to have AI take over. If you value it high enough, it does.
I have it run tests and every few days I ask it to do a code quality analysis check on the codebase.
I'm unconvinced AI reviewing AI is the answer here, because all LLMs share the same flaws. To me, the harness/guardrails for AI should be different technologies that work differently and in a more formal sense: static code analysis, linters, tests, etc.
(Linting has actually been, by far, the BEST code quality enforcers for the agents I've run so far, and it's a lot cheaper and more configurable than running more agents.)
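As a minimal illustration of that kind of deterministic gate, here is a sketch that rejects generated code before anyone (human or agent) looks at it further. Python's own compile step stands in for a real linter here purely to keep the example self-contained; an actual harness would shell out to a linter or type checker and feed the failures back for another pass.

```python
# Minimal sketch of gating generated code on a deterministic check.
# compile() is a cheap stand-in for a lint pass: does the code even parse?

def passes_static_check(source: str) -> bool:
    """Return True if the generated source at least compiles."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:
        return False

good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b)\n    return a + b\n"   # missing colon

print(passes_static_check(good))  # True
print(passes_static_check(bad))   # False
```

Unlike an LLM reviewer, a check like this is deterministic, nearly free to run, and can't share the generator's blind spots.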
If you know good architecture and you are testing as you go, I would say it is probably pretty damn close to being able to build a company without looking at the code. Not without "risk", but definitely doable and plausible.
My current project that I started this weekend is a Rust client-server game with the client compiled to WebAssembly.
I do these projects without reading the code at all as a way to gauge what I can possibly do with AI without reading code, purely operating as a PM with technical intuition and architectural opinions.
So far Opus 4.6 has been capable of building it all out. I have to catch issues and I have asked it for refactoring analysis to see if it could optimize the file structure/components, but I haven't read the code at all.
At work I certainly read all the code. But I would recommend people try to build something non-trivial without looking at the code. It does take skill, though, so maybe start small and build up intuition for where these models have issues. I think you'll be surprised how much your technical intuition can scale even when you are not looking at the code.
That is why I said "risk". Though the models are pretty good "if" you ask for security audits. Notice I didn't say you could do it without technical knowledge right now; you need to know to ask for a security review.
I have friends in security at major platforms who are impressed by the security reviews from the SOTA models. Certainly better than the average bootstrapped founder.
Simultaneous turn-based top-down car combat where you design the cars first. Inspired by Car Wars, but taking advantage of computers, so spline-based path planning and a much more complicated way of calculating armor penetration and damage.
> and it makes subtle hard-to-find mistakes all over the place.
I agree. I'm constantly correcting the code it generates. But then, I do the same for humans when I review their PRs, and the LLM generated the code in a hundredth of the time (or whatever figure you prefer).
And yet, this is exactly what my last job's engineering & product leadership did with their CEO at the helm, before they laid me off.
They vibe-coded a complete rewrite of their products in a few months without any human review. Hundreds of thousands of LOC. I feel sorry for the remaining engineers, who now have to learn everything that was just generated and is already in front of customers.