The key flaw with that study is that the participants had almost no experience with LLMs. It takes time to become genuinely productive: to learn what works and what doesn't, where the models are likely to fail, how to prompt them to get good results, and so on.
In my experience it is more about the prompt usually being somewhat deficient, and mostly about needing to one-shot it if you are to be productive. It does not excuse you from somehow providing the information necessary to complete the project; the difference is that, unlike a human programmer, it will not come back next week with a list of questions after it has finally started. You include what you want, or it makes something up.

Wait, so LLMs, which are proclaimed to be such a useful tool to everyone, and with which we're just supposed to be able to talk in natural language, need to be prompted in a very specific way?
That sounds like they aren't actually good at natural language, if you need to be a "prompt engineer" to get them to do what you want.
Charles Babbage said:
On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
[emphasis added]

A randomized controlled trial published by the nonprofit research organization METR in July 2025 found that experienced open-source developers actually took 19 percent longer to complete tasks when using AI tools, despite believing they were working faster.
This means the AI coding agents periodically “forget” a large portion of what they are doing every time this compression happens, but unlike older LLM-based systems, they aren’t completely clueless about what has transpired and can rapidly re-orient themselves by reading existing code, written notes left in files, change logs, and so on.
Compare vs the literally nothing that existed before, not the perfection you seem to be expecting.

Wait, so LLMs, which are proclaimed to be such a useful tool to everyone, and with which we're just supposed to be able to talk in natural language, need to be prompted in a very specific way?
That sounds like they aren't actually good at natural language, if you need to be a "prompt engineer" to get them to do what you want.
Let's just be clear here: you have to do the same things with humans to get them to do what you want. You can't be vague; otherwise you will get something that sort of, kind of, is what you asked for, but not really. Also, if you ask 10 humans to do something, you will get 10 different things back.

Wait, so LLMs, which are proclaimed to be such a useful tool to everyone, and with which we're just supposed to be able to talk in natural language, need to be prompted in a very specific way?
That sounds like they aren't actually good at natural language, if you need to be a "prompt engineer" to get them to do what you want.
Wait, so LLMs, which are proclaimed to be such a useful tool to everyone, and with which we're just supposed to be able to talk in natural language, need to be prompted in a very specific way?
You are an expert software engineer tasked with debugging a critical issue. Your goal is to identify and fix the underlying cause of the problem, not just patch the symptoms.
Here is the context:
- Bug description: [Describe the bug symptoms and expected behavior]
- Code snippet(s): [Provide the relevant code]
- Error message/Stack trace (if any): [Provide the error details]
- Failing test case(s) (if any): [Provide specific test cases that fail]
Follow these specific instructions in order:
1. Root Cause Analysis:
* DO NOT propose any code changes yet.
* Describe precisely what goes wrong and what the first domino to fall is.
* Apply the "5 Whys" technique in your reasoning: starting from the symptom, ask "Why?" five times to drill down to the fundamental cause.
* Clearly state the single, definitive underlying reason for the bug.
2. Proposed Solution & Rationale:
* Based only on the root cause identified above, propose a specific code change to address it.
* Provide the entire corrected code snippet(s), ready to be copied and pasted, without any placeholders or extra comments.
* Explain the rationale: Why your change solves the issue at the root cause level, and how it avoids just a superficial fix.
* Confirm that your fix has no negative side effects or backward compatibility issues.
3. Refinement Constraints:
* Do not include extra explanations like "Here is the JSON output" or "Note:".
* Do not be speculative; state exactly where the problem lies.
* Focus strictly on the provided code and context; do not suggest general solutions.
Compare that with the typical prompt: "I get the following error on line ... here is the code, please fix it."
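A template like the one above is usually filled in programmatically rather than retyped each time. Here is a minimal sketch; the `build_debug_prompt` helper, the field names, and the abbreviated template text are my own illustration, not any vendor's API:

```python
# Sketch: assemble a structured debugging prompt from its parts.
# Helper name, fields, and template wording are illustrative only.

DEBUG_TEMPLATE = """You are an expert software engineer tasked with debugging a critical issue.

Here is the context:
- Bug description: {bug}
- Code snippet(s): {code}
- Error message/Stack trace (if any): {error}
- Failing test case(s) (if any): {tests}

Start with root cause analysis, then propose a fix with rationale."""

def build_debug_prompt(bug: str, code: str, error: str = "none", tests: str = "none") -> str:
    """Fill the template, refusing empty bug descriptions so the
    model never has to guess what 'broken' means."""
    if not bug.strip():
        raise ValueError("a bug description is required")
    return DEBUG_TEMPLATE.format(bug=bug, code=code, error=error, tests=tests)

prompt = build_debug_prompt(
    bug="sort_users() returns names in reverse alphabetical order",
    code="def sort_users(users): return sorted(users, reverse=True)",
    error="AssertionError in test_sort_users",
)
```

The point is not the helper itself; it is that the lazy one-liner and the structured prompt differ only in how much context they hand over.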
What was that Christopher Nolan movie about the guy trying to solve his wife's murder after losing the ability to form new long-term memories? You know, where he had to write down (or tattoo) important things for, uh, context. It sounds a lot like this.
The obvious utility of giving you something you potentially don’t really understand, can’t entirely trust and have no visibility of where it’s come from.

Compare vs the literally nothing that existed before, not the perfection you seem to be expecting.
I'm not super bullish on LLMs beyond obvious utility, but there is obvious utility.
the article said:
What people call “vibe coding”—creating AI-generated code without understanding what it’s doing—is clearly dangerous for production work. Shipping code you didn’t write yourself in a production environment is risky because it could introduce security issues or other bugs or begin gathering technical debt that could snowball over time.
My experience has been that if you already know what you're doing with the code, these new tools can be useful in some situations. But if you have no idea what you actually need it to do (e.g. the code is throwing a weird error that you can't figure out), you'd better strap in, because you're going down the rabbit hole of wrong answers.
The fact is that there is this weird expectation that LLMs will do it perfectly the first time regardless of what you told them. That's not the case, and being able to tell them exactly what you want, in a way that they understand, has been called "prompt engineering" (which is a dumb name and a dumb thing, if you ask me, but whatever...).

Wait, so LLMs, which are proclaimed to be such a useful tool to everyone, and with which we're just supposed to be able to talk in natural language, need to be prompted in a very specific way?
That sounds like they aren't actually good at natural language, if you need to be a "prompt engineer" to get them to do what you want.
But how is that any different from working with a human? Think about working with a summer intern who has no built-up knowledge of your codebase or even your tool pipeline. Or a business user who asks for a new feature but can't express it until they see a working sample.

Wait, so LLMs, which are proclaimed to be such a useful tool to everyone, and with which we're just supposed to be able to talk in natural language, need to be prompted in a very specific way?
That sounds like they aren't actually good at natural language, if you need to be a "prompt engineer" to get them to do what you want.
Newer versions improved significantly. A trick that works very well: download the GDScript docs and put them in a project in the web app, then tell Claude that if it's not sure it can RAG from the documentation. It significantly improves the output quality.

My experience has been with Claude and Godot GDScript. A year or two ago Claude had a lot of trouble with GDScript, providing Python code and then apologizing, and getting confused about versions of GDScript. Claude improved remarkably, although it may still have GDScript version problems; I have not verified that for a few months. My GDScript abilities are pretty good, maybe a low 4 out of 5. Claude was a tremendous help, but always in a conversation (using the pay version). I tried to be as prescriptive as possible in my requests, and then negotiate around misunderstandings or non-optimal tactics. All in all it was well worth the money.
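The "RAG from the documentation" trick can be approximated even without embeddings: score doc chunks by keyword overlap with the question and paste the best ones into the prompt. A deliberately naive sketch, with invented example doc snippets (real setups use embedding search, not word counting):

```python
# Naive retrieval sketch: rank documentation chunks by how many of
# their words also appear in the question, then prepend the winners
# to the prompt. The doc snippets below are made-up examples.
import re

def score(chunk: str, query: str) -> int:
    """Count chunk words that also occur in the query."""
    query_words = set(re.findall(r"\w+", query.lower()))
    return sum(1 for w in re.findall(r"\w+", chunk.lower()) if w in query_words)

def top_chunks(doc_chunks, query, k=2):
    """Return the k highest-scoring chunks for this query."""
    return sorted(doc_chunks, key=lambda c: score(c, query), reverse=True)[:k]

docs = [
    "Signals in GDScript are declared with the 'signal' keyword.",
    "GDScript arrays support push_back() and pop_back().",
    "Nodes are added to the scene tree with add_child().",
]
context = top_chunks(docs, "how do I declare a signal in GDScript?")
prompt = "Use this documentation if unsure:\n" + "\n".join(context)
```

Word overlap is a crude relevance signal, but it illustrates why handing the model the right slice of documentation beats hoping it memorized the right GDScript version.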
Sculptor.
Google Translate does not have a compiler that kicks back errors if the translation is wrong. Coding does. Coding is also used to generate output that can be compared directly against expected output to find errors, even if the compilation is correct. The big thing you're missing is that for translation, the translated text IS the desired output, whereas for coding, the AI-generated code is just the intermediate product. It's the OUTPUT of the code that is desired, and that output can be checked to determine the correctness of the code in a way that does not work for translated text.

IMO the only thing to remember is that an LLM is just a translation tool.
If you can break down your problem into a translation problem, then you can generally manipulate an LLM to be about as useful as Google Translate, i.e. better than nothing.
Google Translate needs a LOT of help to do an adequate job. For example, if your target language requires the speaker's gender as added context, then to get a correct translation you may need to prefix "I am happy" with "I am a man" or "I am a woman" to give the model the correct context. If you don't know these details, your translations will be wrong, often in ways that will cause substantial problems.
Coding agents pretend to be smarter than this, but they're not, nor can they be. They are merely translating prompts into responses. If they "ask you a follow-up question" that is a trained response to a specific type of prompt. They cannot expand beyond their programming; they do not understand anything. They are translators.
Once you understand this, you can sometimes use them effectively within the limited scopes where they might be helpful, i.e. you will learn when you need to prefix a prompt with a piece of context.
Maybe more importantly, you can better see through a lot of the sillier claims in this space, and understand why most enterprises have consistently struggled to find any of the promised value from so-called "AI" tools.
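The "prefix a prompt with a piece of context" idea above can be sketched as a tiny helper. The `prefix_context` function is my own illustration, not any vendor's API:

```python
# Sketch: an LLM used as a translator needs the same explicit context
# Google Translate does. prefix_context() is a hypothetical helper.

def prefix_context(task: str, context: dict) -> str:
    """Prepend known facts so the model doesn't have to guess them."""
    facts = "\n".join(f"- {key}: {value}" for key, value in context.items())
    return f"Context:\n{facts}\n\nTask: {task}"

msg = prefix_context(
    "Translate 'I am happy' into French.",
    {"speaker gender": "female"},  # French marks this: 'heureuse' vs 'heureux'
)
```

Like the "I am a woman" prefix for Google Translate, the helper just makes the missing context explicit instead of letting the model pick a default silently.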
Well, they’re misadvertised as intuitive tools that give great results with no effort.

Wait, so LLMs, which are proclaimed to be such a useful tool to everyone, and with which we're just supposed to be able to talk in natural language, need to be prompted in a very specific way?
That sounds like they aren't actually good at natural language, if you need to be a "prompt engineer" to get them to do what you want.
"Incredibly capable" is an interesting description when you're also describing how it falls apart after passing trivial complexity. In my experience it works only when it's small enough to fit in context, and similar enough to implementations in the training data. Beyond that it rapidly gets increasingly unreliable until it hits a wall and fails to make any progress entirely. As-is it still takes a developer to take that mental load, and I'm not convinced doing so actually saves time.I have been doing a bunch of sandbox experimenting with Claude code and it's incredibly capable. It's also very likely to have badly flawed and bugged result as the prompts or codebase get larger and/or less organized. I think the concept of this being powerful has been proven but implementation is kind of a can of worms still.
I've been operating a Claude Code generated internal project for months. It consists of about 40k lines of backend code and 10k lines of frontend code, assembled painstakingly over many, many sessions of reinitialized context windows, careful session-handoff documents, lots of architecture documents, lots of style documents... and lots of debugging sessions. The code, so far, has been rock stable.

"Incredibly capable" is an interesting description when you're also describing how it falls apart past trivial complexity. In my experience it works only when the task is small enough to fit in context and similar enough to implementations in the training data. Beyond that it rapidly gets increasingly unreliable until it hits a wall and fails to make any progress at all. As-is, it still takes a developer to carry that mental load, and I'm not convinced doing so actually saves time.
This is also very likely a fundamental problem. Getting past it would require a mental model of the domain and of the existing software it's building upon; LLMs can't build that by themselves, and despite several years of burning mountains of cash, they haven't qualitatively improved in this respect.
Current AI is at its best doing search and generating small snippets of logic: limited context needed, easily described and understood, and much faster than fighting through the SEO jungle to get to the relevant documentation. This level of assistance also doesn't push you out of the details of the code, so you don't lose the memorization and understanding of the domain that writing it yourself creates.
The enormous disconnect between what's being promised by vendors and media, and where AI actually shines in development is fascinating.
This definitely requires a developer mindset, and some serious knowledge and understanding of the tech involved. It's what I meant by the developer carrying the mental load.

I've been operating a Claude Code generated internal project for months. It consists of about 40k lines of backend code and 10k lines of frontend code, assembled painstakingly over many, many sessions of reinitialized context windows, careful session-handoff documents, lots of architecture documents, lots of style documents... and lots of debugging sessions. The code, so far, has been rock stable.
Could a professional software engineer do it faster and more efficiently than me + Claude Code? It's very possible. Could I have done it faster? Absolutely not. The only language I know well is Python, which was absolutely not suited for this application.
If you provide planning documentation, architecture documentation, coding style documentation, and keep track of the project progress and coordinate handoffs between agent sessions, LLMs absolutely CAN work well enough outside of a limited context window. Whether all of that is worth it is up to the individual, but generally those sorts of documents are mostly good practice for large projects anyways.
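The handoff discipline described above can be as lightweight as generating a short note at the end of each session for the next one to read first. A sketch; the section names and format are my own convention, not a Claude Code feature:

```python
# Sketch: render a plain-text session-handoff note so the next agent
# session can re-orient after its context window is reset. The three
# section names are an invented convention, not a tool requirement.

def handoff_note(done, next_steps, gotchas):
    """Render the sections a fresh session needs to read first."""
    sections = [("Done", done), ("Next steps", next_steps), ("Gotchas", gotchas)]
    lines = ["SESSION HANDOFF"]
    for title, items in sections:
        lines.append(f"{title}:")
        lines.extend(f"  - {item}" for item in items)
    return "\n".join(lines)

note = handoff_note(
    done=["auth endpoints", "DB migrations 001-004"],
    next_steps=["wire auth into the frontend"],
    gotchas=["tests assume Postgres 15, not SQLite"],
)
```

Saved to a file the agent is told to read on startup, a note like this plays the same role as the tattoos in the movie: durable context that survives the memory wipe.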
This is one of my main beefs with GenAI: it's a tool that's most effective when used in ways that run counter to human nature. For instance, it's a reasonably safe tool for many uses when limited to things the human user already knows, or can thoroughly fact-check. But it's pushed as great for doing things you otherwise couldn't do, and saving time by making it unnecessary to dig up a bunch of facts or learn something specific. Some of the most-hyped (and most-pushed) uses are for either going beyond what the user knows, or saving time and boredom (which are both big parts of fact-checking something that sounds totally plausible but will occasionally be totally plausible BS).

<...> my takeaway so far is that as long as you already know what you're doing — or you already know most of what you're doing — these can be amazing tools <...>
Having the AI code things you don't understand is incredibly easy and also super dangerous <...>
I'm not a software engineer. We don't have a software engineer, and my other responsibilities preclude me from dedicating time to being one. One benefit (or detriment, depending on your point of view) is that this sort of agentic software development doesn't require 100% attention; more like 20%. I can be working on my other tasks, check in, add some context, suggest a prompt, and then send it on its merry way while I go back to my other tasks.

This definitely requires a developer mindset, and some serious knowledge and understanding of the tech involved. It's what I meant by the developer carrying the mental load.
Medium/long term, I wonder if you would not be better off learning how to translate that understanding into code directly (potentially by using AI at a smaller scale) instead of relying on AI to do it for you. AI is a service running at huge losses, so it is liable to either disappear or get enshittified over time. Alternatively, potential breakthroughs in its abilities could make these sorts of specific workarounds for its current flaws superfluous, so personal skills seem to me like the safer investment either way.
That's not dissimilar to working with software developers a lot of the time. Yes, there's natural language, but there's also the very precise language of specifying expected outcomes, acceptance criteria, deliverables, and desired/undesired behaviours.

Wait, so LLMs, which are proclaimed to be such a useful tool to everyone, and with which we're just supposed to be able to talk in natural language, need to be prompted in a very specific way?
That sounds like they aren't actually good at natural language, if you need to be a "prompt engineer" to get them to do what you want.
This is a fair take. With a few of the big AI companies looking to go public next year, we'll see enshittification increase as they start trying to pump the numbers.

AI is a service running at huge losses, so it is liable to either disappear or get enshittified over time.