The key flaw with that study is that the participants had almost no experience with LLMs. It takes time to become genuinely productive: to learn what works and what doesn't, where the models are likely to fail, how to prompt them to get good results, and so on.
In my experience it is more about the prompt usually being somewhat deficient, and mostly about needing to one-shot it if you are to be productive. It does not excuse you from somehow providing the information necessary to complete the project; the difference is that, unlike a human programmer, it will not come back next week with a list of questions after it has finally started. You include what you want, or it makes something up.

Wait, so LLMs, which are proclaimed to be such a useful tool to everyone, and with which we're just supposed to be able to talk in natural language, need to be prompted in a very specific way?
That sounds like they aren't actually good at natural language, if you need to be a "prompt engineer" to get them to do what you want.
Charles Babbage said:
On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
[emphasis added]

A randomized controlled trial published by the nonprofit research organization METR in July 2025 found that experienced open-source developers actually took 19 percent longer to complete tasks when using AI tools, despite believing they were working faster.
This means the AI coding agents periodically “forget” a large portion of what they are doing every time this compression happens, but unlike older LLM-based systems, they aren’t completely clueless about what has transpired and can rapidly re-orient themselves by reading existing code, written notes left in files, change logs, and so on.
Compare vs the literally nothing that existed before, not the perfection you seem to be expecting.

Wait, so LLMs, which are proclaimed to be such a useful tool to everyone, and with which we're just supposed to be able to talk in natural language, need to be prompted in a very specific way?
That sounds like they aren't actually good at natural language, if you need to be a "prompt engineer" to get them to do what you want.
Let's just be clear here: you have to do the same things with humans to get them to do what you want. You can't be vague; otherwise you will get something that sort of, kind of, is what you asked for, but not really. Also, if you ask 10 humans to do something, you will get 10 different things back.

Wait, so LLMs, which are proclaimed to be such a useful tool to everyone, and with which we're just supposed to be able to talk in natural language, need to be prompted in a very specific way?
That sounds like they aren't actually good at natural language, if you need to be a "prompt engineer" to get them to do what you want.
Wait, so LLMs, which are proclaimed to be such a useful tool to everyone, and with which we're just supposed to be able to talk in natural language, need to be prompted in a very specific way?
You are an expert software engineer tasked with debugging a critical issue. Your goal is to identify and fix the underlying cause of the problem, not just patch the symptoms.
Here is the context:
- Bug description: [Describe the bug symptoms and expected behavior]
- Code snippet(s): [Provide the relevant code]
- Error message/Stack trace (if any): [Provide the error details]
- Failing test case(s) (if any): [Provide specific test cases that fail]
Follow these specific instructions in order:
1. Root Cause Analysis:
* DO NOT propose any code changes yet.
* Describe precisely what goes wrong and what the first domino to fall is.
* Apply the "5 Whys" technique in your reasoning: starting from the symptom, ask "Why?" five times to drill down to the fundamental cause.
* Clearly state the single, definitive underlying reason for the bug.
2. Proposed Solution & Rationale:
* Based only on the root cause identified above, propose a specific code change to address it.
* Provide the entire corrected code snippet(s), ready to be copied and pasted, without any placeholders or extra comments.
* Explain the rationale: Why your change solves the issue at the root cause level, and how it avoids just a superficial fix.
* Confirm that your fix has no negative side effects or backward compatibility issues.
3. Refinement Constraints:
* Do not include extra explanations like "Here is the JSON output" or "Note:".
* Do not be speculative; state exactly where the problem lies.
* Focus strictly on the provided code and context; do not suggest general solutions.
Compare that with the typical prompt: "I get the following error on line ... here is the code, please fix it."
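A template like the one above is usually filled in programmatically rather than retyped each time. Here is a minimal sketch; the `build_debug_prompt` helper, the field names, and the abbreviated template text are my own illustration, not any vendor's API:

```python
# Sketch: assemble a structured debugging prompt from its parts.
# Helper name, fields, and template wording are illustrative only.

DEBUG_TEMPLATE = """You are an expert software engineer tasked with debugging a critical issue.

Here is the context:
- Bug description: {bug}
- Code snippet(s): {code}
- Error message/Stack trace (if any): {error}
- Failing test case(s) (if any): {tests}

Start with root cause analysis, then propose a fix with rationale."""

def build_debug_prompt(bug: str, code: str, error: str = "none", tests: str = "none") -> str:
    """Fill the template, refusing empty bug descriptions so the
    model never has to guess what 'broken' means."""
    if not bug.strip():
        raise ValueError("a bug description is required")
    return DEBUG_TEMPLATE.format(bug=bug, code=code, error=error, tests=tests)

prompt = build_debug_prompt(
    bug="sort_users() returns names in reverse alphabetical order",
    code="def sort_users(users): return sorted(users, reverse=True)",
    error="AssertionError in test_sort_users",
)
```

The point is not the helper itself; it is that the lazy one-liner and the structured prompt differ only in how much context they hand over.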
What was that Christopher Nolan movie about the guy trying to solve his wife's murder after losing the ability to form new long-term memories? You know, where he had to write down (or tattoo) important things for, uh, context. It sounds a lot like this.
The obvious utility of giving you something you potentially don’t really understand, can’t entirely trust and have no visibility of where it’s come from.

Compare vs the literally nothing that existed before, not the perfection you seem to be expecting.
I'm not super bullish on LLMs beyond obvious utility, but there is obvious utility.
the article said:
What people call “vibe coding”—creating AI-generated code without understanding what it’s doing—is clearly dangerous for production work. Shipping code you didn’t write yourself in a production environment is risky because it could introduce security issues or other bugs or begin gathering technical debt that could snowball over time.
My experience has been that if you already know what you're doing with the code, these new tools can be useful in some situations. But if you have no idea what you actually need it to do (e.g. the code is throwing a weird error that you can't figure out), you'd better strap in, because you're going down the rabbit hole of wrong answers.
The fact is that there is this weird expectation that LLMs will do it perfectly the first time regardless of what you told them. That's not the case, and being able to tell them exactly what you want, in a way that they understand, has been called "prompt engineering" (which is a dumb name and a dumb thing, if you ask me, but whatever...).

Wait, so LLMs, which are proclaimed to be such a useful tool to everyone, and with which we're just supposed to be able to talk in natural language, need to be prompted in a very specific way?
That sounds like they aren't actually good at natural language, if you need to be a "prompt engineer" to get them to do what you want.
But how is that any different from working with a human? Think about working with a summer intern who has no built-up knowledge of your codebase or even your tool pipeline. Or a business user who asks for a new feature but can't express it until they see a working sample.

Wait, so LLMs, which are proclaimed to be such a useful tool to everyone, and with which we're just supposed to be able to talk in natural language, need to be prompted in a very specific way?
That sounds like they aren't actually good at natural language, if you need to be a "prompt engineer" to get them to do what you want.
Newer versions improved significantly. A trick that works very well: download the GDScript docs and put them in a project in the web app, then tell Claude that if it's not sure it can RAG from the documentation. It significantly improves the output quality.

My experience has been with Claude and Godot GDScript. A year or two ago Claude had a lot of trouble with GDScript, providing Python code and then apologizing, and getting confused about versions of GDScript. Claude improved remarkably, although it may still have GDScript version problems; I have not verified that for a few months. My GDScript abilities are pretty good, maybe a low 4 out of 5. Claude was a tremendous help, but always in a conversation (using the pay version). I tried to be as prescriptive as possible in my requests, and then negotiate around misunderstandings or non-optimal tactics. All in all it was well worth the money.
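The "RAG from the documentation" trick can be approximated even without embeddings: score doc chunks by keyword overlap with the question and paste the best ones into the prompt. A deliberately naive sketch, with invented example doc snippets (real setups use embedding search, not word counting):

```python
# Naive retrieval sketch: rank documentation chunks by how many of
# their words also appear in the question, then prepend the winners
# to the prompt. The doc snippets below are made-up examples.
import re

def score(chunk: str, query: str) -> int:
    """Count chunk words that also occur in the query."""
    query_words = set(re.findall(r"\w+", query.lower()))
    return sum(1 for w in re.findall(r"\w+", chunk.lower()) if w in query_words)

def top_chunks(doc_chunks, query, k=2):
    """Return the k highest-scoring chunks for this query."""
    return sorted(doc_chunks, key=lambda c: score(c, query), reverse=True)[:k]

docs = [
    "Signals in GDScript are declared with the 'signal' keyword.",
    "GDScript arrays support push_back() and pop_back().",
    "Nodes are added to the scene tree with add_child().",
]
context = top_chunks(docs, "how do I declare a signal in GDScript?")
prompt = "Use this documentation if unsure:\n" + "\n".join(context)
```

Word overlap is a crude relevance signal, but it illustrates why handing the model the right slice of documentation beats hoping it memorized the right GDScript version.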
Sculptor.
Google Translate does not have a compiler that kicks back errors if the translation is wrong. Coding does. Coding is also used to generate output that can be compared directly against expected output to find errors, even if the compilation is correct. The big thing you're missing is that for translation, the translated text IS the desired output, whereas for coding, the AI-generated code is just the intermediate product. It's the OUTPUT of the code that is desired, and that output can be checked to determine the correctness of the code in a way that does not work for translated text.

IMO the only thing to remember is that an LLM is just a translation tool.
If you can break down your problem into a translation problem, then you can generally manipulate an LLM to be about as useful as Google Translate, i.e. better than nothing.
Google Translate needs a LOT of help to do an adequate job. For example, if your target language requires the speaker's gender as added context, then to get a correct translation you may need to prefix "I am happy" with "I am a man" or "I am a woman" to give the model the correct context. If you don't know these details, your translations will be wrong, often in ways that will cause substantial problems.
Coding agents pretend to be smarter than this, but they're not, nor can they be. They are merely translating prompts into responses. If they "ask you a follow-up question" that is a trained response to a specific type of prompt. They cannot expand beyond their programming; they do not understand anything. They are translators.
Once you understand this, you can sometimes use them effectively within the limited scopes where they might be helpful, i.e. you will learn when you need to prefix a prompt with a piece of context.
Maybe more importantly, you can better see through a lot of the sillier claims in this space, and understand why most enterprises have consistently struggled to find any of the promised value from so-called "AI" tools.
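The "prefix a prompt with a piece of context" idea above can be sketched as a tiny helper. The `prefix_context` function is my own illustration, not any vendor's API:

```python
# Sketch: an LLM used as a translator needs the same explicit context
# Google Translate does. prefix_context() is a hypothetical helper.

def prefix_context(task: str, context: dict) -> str:
    """Prepend known facts so the model doesn't have to guess them."""
    facts = "\n".join(f"- {key}: {value}" for key, value in context.items())
    return f"Context:\n{facts}\n\nTask: {task}"

msg = prefix_context(
    "Translate 'I am happy' into French.",
    {"speaker gender": "female"},  # French marks this: 'heureuse' vs 'heureux'
)
```

Like the "I am a woman" prefix for Google Translate, the helper just makes the missing context explicit instead of letting the model pick a default silently.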
Well, they’re misadvertised as intuitive tools that give great results with no effort.

Wait, so LLMs, which are proclaimed to be such a useful tool to everyone, and with which we're just supposed to be able to talk in natural language, need to be prompted in a very specific way?
That sounds like they aren't actually good at natural language, if you need to be a "prompt engineer" to get them to do what you want.
"Incredibly capable" is an interesting description when you're also describing how it falls apart after passing trivial complexity. In my experience it works only when it's small enough to fit in context, and similar enough to implementations in the training data. Beyond that it rapidly gets increasingly unreliable until it hits a wall and fails to make any progress entirely. As-is it still takes a developer to take that mental load, and I'm not convinced doing so actually saves time.I have been doing a bunch of sandbox experimenting with Claude code and it's incredibly capable. It's also very likely to have badly flawed and bugged result as the prompts or codebase get larger and/or less organized. I think the concept of this being powerful has been proven but implementation is kind of a can of worms still.
I've been operating a Claude Code generated internal project for months. It consists of about 40k lines of backend code and 10k lines of frontend code, assembled painstakingly over many, many sessions of reinitialized context windows, careful session-handoff documents, lots of architecture documents, lots of style documents... and lots of debugging sessions. The code, so far, has been rock stable.

"Incredibly capable" is an interesting description when you're also describing how it falls apart past trivial complexity. In my experience it works only when the task is small enough to fit in context and similar enough to implementations in the training data. Beyond that it rapidly gets increasingly unreliable until it hits a wall and fails to make any progress at all. As-is, it still takes a developer to carry that mental load, and I'm not convinced doing so actually saves time.
This is also very likely a fundamental problem. Getting past it would require a mental model of the domain and of the existing software it's building upon; LLMs can't build that by themselves, and despite several years of burning mountains of cash, they haven't qualitatively improved in this respect.
Current AI is at its best doing search and generating small snippets of logic: limited context needed, easily described and understood, and much faster than fighting through the SEO jungle to get to the relevant documentation. This level of assistance also doesn't push you out of the details of the code, so you don't lose the memorization and understanding of the domain that writing it yourself creates.
The enormous disconnect between what's being promised by vendors and media, and where AI actually shines in development is fascinating.
This definitely requires a developer mindset, and some serious knowledge and understanding of the tech involved. It's what I meant by the developer carrying the mental load.

I've been operating a Claude Code generated internal project for months. It consists of about 40k lines of backend code and 10k lines of frontend code, assembled painstakingly over many, many sessions of reinitialized context windows, careful session-handoff documents, lots of architecture documents, lots of style documents... and lots of debugging sessions. The code, so far, has been rock stable.
Could a professional software engineer do it faster and more efficiently than me + Claude Code? It's very possible. Could I have done it faster? Absolutely not. The only language I know well is Python, which was absolutely not suited for this application.
If you provide planning documentation, architecture documentation, coding style documentation, and keep track of the project progress and coordinate handoffs between agent sessions, LLMs absolutely CAN work well enough outside of a limited context window. Whether all of that is worth it is up to the individual, but generally those sorts of documents are mostly good practice for large projects anyways.
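The handoff discipline described above can be as lightweight as generating a short note at the end of each session for the next one to read first. A sketch; the section names and format are my own convention, not a Claude Code feature:

```python
# Sketch: render a plain-text session-handoff note so the next agent
# session can re-orient after its context window is reset. The three
# section names are an invented convention, not a tool requirement.

def handoff_note(done, next_steps, gotchas):
    """Render the sections a fresh session needs to read first."""
    sections = [("Done", done), ("Next steps", next_steps), ("Gotchas", gotchas)]
    lines = ["SESSION HANDOFF"]
    for title, items in sections:
        lines.append(f"{title}:")
        lines.extend(f"  - {item}" for item in items)
    return "\n".join(lines)

note = handoff_note(
    done=["auth endpoints", "DB migrations 001-004"],
    next_steps=["wire auth into the frontend"],
    gotchas=["tests assume Postgres 15, not SQLite"],
)
```

Saved to a file the agent is told to read on startup, a note like this plays the same role as the tattoos in the movie: durable context that survives the memory wipe.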
This is one of my main beefs with GenAI: it's a tool that's most effective when used in ways that run counter to human nature. For instance, it's a reasonably safe tool for many uses when limited to things the human user already knows, or can thoroughly fact-check. But it's pushed as great for doing things you otherwise couldn't do, and saving time by making it unnecessary to dig up a bunch of facts or learn something specific. Some of the most-hyped (and most-pushed) uses are for either going beyond what the user knows, or saving time and boredom (which are both big parts of fact-checking something that sounds totally plausible but will occasionally be totally plausible BS).

<...> my takeaway so far is that as long as you already know what you're doing — or you already know most of what you're doing — these can be amazing tools <...>
Having the AI code things you don't understand is incredibly easy and also super dangerous <...>
I'm not a software engineer. We don't have a software engineer, and my other responsibilities preclude me from dedicating time to being one. One benefit (or detriment, depending on your point of view) is that this sort of agentic software development doesn't require 100% attention; more like 20%. I can be working on my other tasks, check in, add some context, suggest a prompt, and then send it on its merry way while I go back to my other tasks.

This definitely requires a developer mindset, and some serious knowledge and understanding of the tech involved. It's what I meant by the developer carrying the mental load.
Medium/long term, I wonder if you would not be better off learning how to translate that understanding into code directly (potentially by using AI at a smaller scale) instead of relying on AI to do it for you. AI is a service running at huge losses, so it is liable to either disappear or get enshittified over time. Alternatively, potential breakthroughs in its abilities could make these sorts of specific workarounds for its current flaws superfluous, so personal skills seem to me like the safer investment either way.
That's not dissimilar to working with software developers a lot of the time. Yes, there's natural language, but there's also the very precise language of specifying expected outcomes, acceptance criteria, deliverables, and desired/undesired behaviours.

Wait, so LLMs, which are proclaimed to be such a useful tool to everyone, and with which we're just supposed to be able to talk in natural language, need to be prompted in a very specific way?
That sounds like they aren't actually good at natural language, if you need to be a "prompt engineer" to get them to do what you want.
This is a fair take. With a few of the big AI companies looking to go public next year, we'll see enshittification increase as they start trying to pump the numbers.

AI is a service running at huge losses, so it is liable to either disappear or get enshittified over time.