akhrail1996

Genuine question: what's the evidence that the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?

The author uses different models for each role, which I get. But I run production agents on Opus daily and in my experience, if you give it good context and clear direction in a single conversation, the output is already solid. The ceremony of splitting into "architect" and "developer" feels like it gives you a sense of control and legibility, but I'm not convinced it catches errors that a single model wouldn't catch on its own with a good prompt.

arialdomartini

This is anecdotal but just a couple days ago, with some colleagues, we conducted a little experiment to gather that evidence.

We used a hierarchy of agents to analyze a requirement, letting agents with different personas (architect, business analyst, security expert, developer, infra etc) discuss a request and distill a solution. They all had access to the source code of the project to work on.

Then we provided the very same input, including the personas' definition, straight to Claude Code, and we compared the result.

The council of agents got to a very good result, consuming about $12, mostly using Opus 4.6.

To our surprise, going straight with a single prompt in Claude Code got to a similarly good result, faster, consuming $0.30, and mostly using Haiku.

This surely deserves more investigation, but our working hypothesis so far is that coordination and communication between agents have a remarkable cost.

Should this be the case, I personally would not be surprised:

- The reason we humans do job separation is that we have inherently limited capacity. We cannot be experts in all the needed fields: we just can't acquire the knowledge needed to be good architects, good business analysts, and good security experts all at once. Apparently, that's not a problem for an LLM. So job separation is probably not the necessary pattern for LLMs that it is for humans.

- Job separation has an inherently high cost and just does not scale. Notably, most of the problems in human organizations are about coordination, and the larger the organization, the higher the cost of its processes, to the point that processes turn into bureaucracy. In IT companies, many problems sit at the interfaces between groups, because of the low-bandwidth communication and the inherent ambiguity of language. I'm not surprised that a single LLM can communicate with itself far better and more cheaply than a council of agents, which inevitably faces the same communication challenges as a society of people.

titanomachy

If it could be done with 30 cents of Haiku calls, maybe it wasn't a complicated enough project to provide good signal?

jimbokun

In that case a $12 program is probably too big to meaningfully review. Probably better to have smaller chunks you can review instead of generating one really large program in one shot.

arialdomartini

Fair point. I could try with a harder problem. This still does not explain why Claude Code felt the need to use Opus, and why Opus felt the need to burn $12 on such an easy task. I mean, it's 40 times the cost.

throwawayffffas

I think the benefit may be task separation and cleaning the context between tasks. Asking a single session to do all three has a couple of downsides.

1. The context for each task gets longer, which we know degrades performance.

2. In that longer context, implicit decisions are made in the thinking steps; the model is then probably more likely to go through with bad decisions that were made 20 steps back.

The way Stavros does it is Architect -> Dev -> Review. By splitting the task into three sessions, we get a fresh, shorter context for each task. At minimum, skipping the thinking messages and intermediary tool output should increase the chances of a better result.

Using different agent personas and models at least introduces variability at token generation; whether that's good or bad, I don't know. As far as I know, it's generally supposed to help.

Having the sessions communicate with each other is, I think, a mistake, because you lose all the benefits of cleaning up the context. Given the chattiness of LLMs, you're probably going to fill the context with multiple thinking rounds over the same message (one from the session that outputs it, one from the session reading it), and you're probably going to get competing tool uses, with each session making its own tool calls to read the same content. It will probably be a huge mess.
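As a rough sketch of that Architect -> Dev -> Review split (everything here is illustrative: `call_llm` stands in for whatever client and models you use, and `stub_llm` only shows the data flow), each stage starts from a clean context containing only the previous stage's final artifact:

```python
def run_pipeline(feature_request: str, call_llm) -> str:
    # Stage 1: the architect sees only the request and emits a plan.
    plan = call_llm(
        "You are a software architect. Produce an implementation plan.",
        feature_request)
    # Stage 2: the developer starts fresh, seeing only the plan,
    # none of the architect's thinking or tool output.
    diff = call_llm(
        "You are a developer. Implement exactly this plan.",
        plan)
    # Stage 3: the reviewer sees the plan and the diff, nothing else.
    return call_llm(
        "You are a reviewer. Flag deviations from the plan.",
        f"PLAN:\n{plan}\n\nDIFF:\n{diff}")

# Deterministic stub standing in for a real model call, to show the flow.
def stub_llm(system: str, user: str) -> str:
    return f"[{system.split('.')[0]}] processed {len(user)} chars"

print(run_pipeline("add retries with exponential backoff", stub_llm))
```

The point of passing `call_llm` in is that each invocation is a fresh session: nothing carries over between stages except the explicit artifact.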

The way I do it is that I have a large session that I interact with, tasked with planning and agent spawning. I don't have dedicated personas or agents. The benefit, as I see it, is that I get a single session with extensive context about what we're doing, plus a dedicated task handler with a much more focused context.

What I've seen with my setup is impressively good performance at the beginning that degrades as feedback and tweaks pile up.

TimTheTinker

Framing LLM use for dev tasks as "narrative" is powerful.

If you want specific, empirical, targeted advice or work from an LLM, you have to frame the conversation correctly. "You are a tenured Computer Science professor agent being consulted on a data structure problem" goes a very long way.

Similarly, context window length and prior progress exert significant pressure on how an LLM frames its work. At some point (often around 200k-400k tokens in), they seem to reach a "we're in the conclusion of this narrative" point and will sometimes do crazy stuff to reach whatever real or perceived goal there is.

mikkupikku

Probably the same reason it takes a team of developers and managers 6 months to write what one or two developers can do on their own in one week. The overhead caused by constant meetings and negotiations is massive.

andrekandre

  > The overhead caused by constant meetings and negotiations is massive.
this is my life ngl. i really wish these ai companies would work on automating away all this bullshit instead of just code code code

just the other day i was asked to prepare slides for a presentation about something everyone already knows (among many other useless side-work)... i feel like with "ai" in general we are applying bandages where my real problem is the big machine that gives me paper cuts all day...

jimbokun

Even with humans, I’ve found that full ownership of a project, from architecture to implementation to deployment and operation, produces the best results.

Less context switching and communication overhead. Focus on well-thought-out and documented APIs to divide work across developers and support communication and collaboration.

Miraste

LLMs also don't have the primary advantage humans get from job separation: diverse perspectives. A council of Opuses is exploring the exact same weights on the exact same hardware, unlike multiple humans with unique brains and memories. Even with different models, Codex 5.3 is far more similar to Opus than any two humans are to each other. Telling an Opus agent to focus on security puts it in a different part of the weights, but it's the same graph; it's not really more of an expert than a general Opus agent with a rule to maintain secure practices.

visarga

You can differentiate by context: one sees the work session, the other sees just the code. Same model, but different perspectives. Or by model: there are at least 7 decent models across the top 3 providers.

visarga

An ensemble can spot more bugs / fixes than a single model. I run claude, codex and gemini in parallel for reviews.
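A toy illustration of why the ensemble helps (the reviewer functions below are stand-ins, not real API calls): each model flags a partially overlapping set of issues, and the union keeps any bug spotted by at least one of them:

```python
# Each "reviewer" stands in for a model call that returns a list of
# findings for a diff; the ensemble unions them, so a bug survives if
# any single model catches it.
def ensemble_review(diff: str, reviewers) -> set[str]:
    findings: set[str] = set()
    for review in reviewers:
        findings |= set(review(diff))
    return findings

# Hypothetical, hard-coded findings just to show the overlap.
claude = lambda d: ["off-by-one in pagination"]
codex  = lambda d: ["off-by-one in pagination", "missing null check"]
gemini = lambda d: ["unclosed file handle"]

print(sorted(ensemble_review("...", [claude, codex, gemini])))
# ['missing null check', 'off-by-one in pagination', 'unclosed file handle']
```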

theshrike79

Agentic pipelines and systems fall into the same issues as humans who work together, mostly communication.

It's not like they can dump their full context to the "manager" agent, they need to condense stuff, which will result in misinterpreted information or missing information on decisions down the line.

IMO this was more relevant when agents had more limited context windows.

BloondAndDoom

Absolutely works with frontier models. What do you think about smaller models in these pipelines? That’s literally what I’m working on with qwen3.5-27b, and I’m splitting the task into 4 steps, not sure if that’s the way to go. Do you have any experience to share?

moduspol

To me, such techniques feel like temporary cudgels that may or may not even help, and that will be obsolete in 1-6 months.

This is similar to telling Claude Code to write its steps into a separate markdown file, or use separate agents to independently perform many tasks, or some of the other things that were commonly posted about 3-6+ months ago. Now Claude Code does that on its own if necessary, so it's probably a net negative to instruct it separately.

Some prompting techniques seem ageless (e.g. giving it a way to validate its output), but a lot of these feel like temporary scaffolding that I don't see a lot of value in building a workflow around.

TheMuenster

Totally agree - the fundamental concept here, automatically improving context control when writing code, is absolutely something that will be baked into agents in 6 months. The reason it hasn't been yet is mainly that the improvements it makes seem to be very marginal.

You can contrast this to something like reasoning, which offered very large, very clear improvements in fundamental performance, and as a result was tackled very aggressively by all the labs. Or (like you mentioned) todo lists, which gave relatively small gains but were implemented relatively quickly. Automatic context control is just going to take more time to get it right, and the gains will be quite small.

visarga

Workflow matters too: how you organize your docs, work tasks, and reviews. If you do it all by hand, you spend a lot of time manually enforcing a process that could be automated.

I think task files with checkable gates are a very interesting animal: they carry intent, plan, work, and reviews, and at the end of the work they can become docs. They can be executed, but also passed as values, and they reflect on themselves, so they sport homoiconicity and reflection.
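As a minimal sketch of the "checkable gates" idea (the file format and helper below are invented for illustration, not taken from any specific tool): the task file carries its own gates as markdown checkboxes, and a small check refuses to advance the task while any gate is open:

```python
import re

# Invented task-file format: gates are checkboxes, and the task may only
# advance (or graduate into docs) once every gate is checked.
TASK = """\
# Task: add retry logic
- [x] plan agreed
- [x] tests pass
- [ ] reviewed by a second model
"""

def open_gates(task_text: str) -> list[str]:
    # Any unchecked box blocks progression.
    return re.findall(r"- \[ \] (.+)", task_text)

print(open_gates(TASK))  # ['reviewed by a second model']
```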

kybernetikos

There's a lot of cargo culting, but it's inevitable in a situation like this, where the truth is model-dependent and changing the whole time, and people have created companies on the premise that they can teach you how to use AI well.

_heimdall

It's also inevitable given that we still don't really know how these models work or what they do at inference time.

We know input/output pairs, when using a reasoning model we can see a separate stream of text that is supposedly insight into what the model is "thinking" during inference, and when using multiple agents we see what text they send to each other. That's it.

andrekandre

  > separate stream of text that is supposedly insight into what the model is "thinking" during inference
taking a look at those streams is almost disturbing and hilarious at the same time... like looking into the mind of a paranoiac.

never_inline

I think this is just anthropomorphism. Sub agents make sense as a context saving mechanism.

Aider did an "architect-editor" split where the architect is just a "programmer" who doesn't bother formatting the changes as diffs; a weak model then converts them into diffs, and they got better results with it. This is nothing like human teams, though.

TheMuenster

Absolutely agree with this. The main reason this improves performance is simply that the context is being better controlled, not that the approach is actually going to yield better results fundamentally.

Some people have turned context control into hallucinated anthropomorphic frameworks (Gas Town being perhaps the best example). If that's how they prefer to mentally model context control, that's fine. But it's not the anthropomorphism that's helping here.

jaredklewis

> what's the evidence

What’s the evidence for anything software engineers use? Tests, type checkers, syntax highlighting, IDEs, code review, pair programming, and so on.

In my experience, evidence for the efficacy of software engineering practices falls into two categories:

- the intuitions of developers, based on their experiences.

- scientific studies, which are unconvincing. Some are unconvincing because they attempt to measure the productivity of working software engineers, which is difficult; you have to rely on qualitative measures like manager evaluations or quantitative but meaningless measures like LOC or tickets closed. Others are unconvincing because they instead measure the practice against some well defined task (like a coding puzzle) that is totally unlike actual software engineering.

Evidence for this LLM pattern is the same. Some developers have an intuition it works better.

codemog

My friend, there’s tons of evidence of all that stuff you talked about in hundreds of papers on arxiv. But you dismiss it entirely in your second bullet point, so I’m not entirely sure what you expect.

jaredklewis

I’ve read dozens of them and find them unconvincing for the reasons outlined. If you want a more specific critique, link a paper.

I personally like and use tests, formal verification, and so on. But the evidence for these methods are weak.

edit: To be clear, I am not ragging on the researchers. I think it's just kind of an inherently messy field with pretty much endless variables to control for and not a lot of good quantifiable metrics to rely on.

thesz

You can measure customer facing defects.

Also, lines of code is not a completely meaningless metric. What one should measure is the lines of code that are not verified by the compiler. E.g., in C++ you cannot have unbalanced brackets or use an incorrectly typed value, but you may still have an off-by-one error.

Given all that, you can measure customer facing defect density and compare different tools, whether they are programming languages, IDEs or LLM-supported workflow.
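In code, the metric itself is trivial; the point is what you feed into it. A toy sketch with made-up numbers (not data from any study):

```python
def defect_density(customer_facing_defects: int, lines_of_code: int) -> float:
    # Customer-facing defects per thousand lines of code (KLOC).
    return customer_facing_defects / (lines_of_code / 1000)

# Hypothetical comparison of two workflows over the same period: the
# second ships more code but fewer escaped defects per KLOC.
print(defect_density(12, 40_000))  # 0.3
print(defect_density(9, 60_000))   # 0.15
```

Note this also illustrates the LOC caveat from the thread: raw volume differs between the two workflows, which is exactly why density, not totals, is the comparison to make.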

codeflo

> Also, lines of code is not completely meaningless metric.

Comparing lines of code can be meaningful, mostly if you can keep a lot of other things constant: coding style, developer experience, domain, tech stack. There are many style differences between LLM- and human-generated code, so I expect 1000 lines of LLM code to do a lot less than 1000 lines of human code, even in the exact same codebase.

jacquesm

The proper metric is the defect escape rate.

exidex

Now you have to count defects

slopinthebag

Most developer intuitions are wrong.

See: OOP

vbezhenar

Intuition is subjective. It's hard to convert subjective experience into objective facts.

lbreakjai

Using different models is a big one. In my workflow, I've got opus doing the deep thinking and kimi doing the implementation. It helps manage costs.

Sample size of one, but I found it helps guard against the model drifting off. My different agents have different permissions: the worker cannot edit the plan, and the QA or planner can't modify the code. Modifying unrelated stuff while working is something I sometimes catch codex doing.

sigbottle

I recently had a horrible misalignment issue with a 1 agent loop. I've never done RL research, but this kind of shit was the exact kind of thing I heard about in RL papers - shimming out what should be network tests by echoing "completed" with the 'verification' being grepping for "completed", and then actually going and marking that off as "done" in the plan doc...

Admittedly I was using gsdv2; I've never had this issue with codex and claude. Sure, some RL hacking such as silent defaults or overly defensive code for no reason, but nothing that seemed actively malicious like the above. Still, gsdv2 is a 1-agent scaffolding pipeline.

I think the issue is that these 1-agent pipelines are all "YOU MUST PLAN, IMPLEMENT, AND VERIFY EVERYTHING YOURSELF!" and similarly aggressive language. I think that kind of language coerces the agent into actively malicious hacks, especially if the pipeline itself doesn't treat "I am blocked, shifting tasks" as a valid outcome.

1-agent pipelines are like a horrible horrible DFS. I still somewhat function when I'm in DFS mode, but that's because I have longer memory than a goldfish.

jumploops

After "fully vibecoding" (i.e. I don't read the code) a few projects, the important aspect of this isn't so much the different agents, but the development process.

Ironically, it resembles waterfall much more so than agile, in that you spec everything (tech stack, packages, open questions, etc.) up front and then pass that spec to an implementation stage. From here you either iterate, or create a PR.

Even with agile, it's similar, in that you have some high-level customer need, pass that to the dev team, and then pass their output to QA.

What's the evidence? Admittedly anecdotal, as I'm not sure of any benchmarks that test this thoroughly, but in my experience this flow helps avoid the pitfall of slop that occurs when you let the agent run wild until it's "done."

"Done" is often subjective, and you can absolutely reach a done state just with vanilla codex/claude code.

Note: I don't use a hierarchy of agents, but my process follows a similar design/plan -> implement -> debug iteration flow.

totomz

I think the splitting makes sense to give more specific prompts and isolated context to different agents. The "architect" does not need the code style guide in its context; that could actually be misleading and contain information that drives it away from the architecture.

ako

Wouldn’t skills already solve this? A harness can start a new agent with a specific skill if it thinks that makes sense.

christofosho

I like reading these types of breakdowns. Really gives you ideas and insight into how others are approaching development with agents. I'm surprised the author hasn't broken down the developer agent persona into smaller subagents. There is a lot of context used when your agent needs to write in a larger breadth of code areas (i.e. database queries, tests, business logic, infrastructure, the general code skeleton). I've also read[1] that having a researcher and then a planner helps with context management in the pre-dev stage as well. I like his use of multiple reviewers, and am similarly surprised that they aren't refined into specialized roles.

I'll admit to being a "one prompt to rule them all" developer, and will not let a chat go longer than the first input I give. If mistakes are made, I fix the system prompt or the input prompt and try again. And I make sure the work is broken down as much as possible. That means taking the time to do some discovery before I hit send.

Is anyone else using many smaller specific agents? What types of patterns are you employing? TIA

1. https://github.com/humanlayer/advanced-context-engineering-f...

marcus_holmes

that reference you give is pretty dated now; it's based on a talk from August, which is the Beforetimes of the newer models that have given such a step change in productivity.

The key change I've found is really around orchestration - as TFA says, you don't run the prompt yourself. The orchestrator runs the whole thing. It gets you to talk to the architect/planner, then the output of that plan is sent to another agent, automatically. In his case he's using an architect, a developer, and some reviewers. I've been using a Superpowers-based [0] orchestration system, which runs a brainstorm, then a design plan, then an implementation plan, then some devs, then some reviewers, and loops back to the implementation plan to check progress and correctness.

It's actually fun. I've been coding for 40+ years now, and I'm enjoying this :)

[0] https://github.com/obra/superpowers

indigodaddy

Can you bolt superpowers onto an existing project so that it uses the approach going forward (I'm using Opencode), or would that get too messy?

eclipxe

Yes. But gsd is even better - especially gsd2

marcus_holmes

These are just skills, so you can add the skills to your setup and start using them whenever you like.

stavros

I don't think that splitting into subagents that use the same model will really help. I need to clarify this in the post, but the split is 1) so I can use Sonnet to code and save on some tokens and 2) so I can get other models to review, to get a different perspective.

It seems to me that splitting into subagents that use the same model is kind of like asking a person to wear three different hats and do three different parts of the job instead of just asking them to do it all with one hat. You're likely to get similar results.

chriswarbo

I'm considering using subagents, as a way to manage context and delegate "simple" tasks to cheaper models (if you want to see tokens burn, watch Opus try fixing a misplaced ')' in a Lisp file!).

I see what you mean w.r.t. different hats; but is it useful to have different tools available? For example, a "planner" having Web access and read-only file access, versus a "developer" having write access to files but no Web access?
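One way to picture that separation (role and tool names invented for illustration, not any tool's real config) is a per-role tool allowlist with deny-by-default:

```python
# Per-role tool allowlists, as described above: the planner gets web
# access and read-only files; the developer gets file writes but no web.
ROLE_TOOLS = {
    "planner":   {"read_file", "web_search"},
    "developer": {"read_file", "write_file"},
    "reviewer":  {"read_file"},
}

def allowed(role: str, tool: str) -> bool:
    # Unknown roles get no tools at all (deny by default).
    return tool in ROLE_TOOLS.get(role, set())

print(allowed("planner", "web_search"))    # True
print(allowed("developer", "web_search"))  # False
```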

stavros

Yes, if you want to separate capabilities, definitely.

lbreakjai

It's interesting to see some patterns starting to emerge. Over time, I ended up with a similar workflow. Instead of using plan files within the repository, I'm using notion as the memory and source of truth.

My "thinker" agent will ask questions, explore, and refine. It will write a feature page in notion, and split the implementation into tasks in a kanban board, for an "executor" to pick up, implement, and pass to a QA agent, which will either flag it or move it to human review.

I really love it. All of our other documentation lives in notion, so I can easily reference and link business requirements. I also find it much easier to make sense of the steps by checking the tickets on the board rather than in a file.

Reviewing is simpler too. I can pick the ticket in the human review column, read the requirements again, check the QA comments, and then look at the code. Had a lot of fun playing with it yesterday, and I shared it here:

https://github.com/marcosloic/notion-agent-hive

Cthulhu_

No criticism or anything, but it really does feel / sound like you (and others who embraced LLMs and agentic coding) aspire to be more of a product manager than a coder. Thing is, a "real" PM comes with a lot more requirements and there's less demand for them - more requirements in that you need to be a people person and willing to spend at least half your time in meetings, and less demand because one PM will organize the work for half a dozen developers (minimum).

Some people say LLM assisted coding will cost a lot of developers' jobs, but posts like this imply it'll cost (solve?) a lot of management / overhead too.

Mind you I've always thought project managers are kinda wasteful, as a software developer I'd love for Someone Else to just curate a list of tasks and their requirements / acceptance criteria. But unfortunately that's not the reality and it's often up to the developers themselves to create the tasks and fill them in, then execute them. Which of course begs the question, why do we still have a PM?

(the above is anecdotal and not a universal experience I'm sure. I hope.)

lbreakjai

I worked with some excellent PMs in the past, it's an entirely different skillset. This wasn't really meant to replace what they do. I really wanted something with which to work at feature-level. That is, after all the hard work of figuring out _what_ to build has been done.

> as a software developer I'd love for Someone Else to just curate a list of tasks and their requirements / acceptance criteria

That's interesting. In every team I worked in, I always fought really hard against anyone but developers being able to write tickets on the board.

fooster

“one PM will organize the work for half a dozen developers”

That isn’t the job of a PM.

adampunk

This seems more about how you view PMs than anything else.

highfrequency

> I’ll tell the LLM my main goal (which will be a very specific feature or bugfix e.g. “I want to add retries with exponential backoff to Stavrobot so that it can retry if the LLM provider is down”), and talk to it until I’m sure it understands what I want. This step takes the most time, sometimes even up to half an hour of back-and-forth until we finalize all the goals, limitations, and tradeoffs of the approach, and agree on what the end architecture should look like.

This sounds sensible, but also makes me wonder how much time is actually being saved if implementing a "very specific feature or bugfix" still takes an hour of back and forth with an LLM.

Can't help but think that this is still just an awkward intermediate phase of development with adolescent LLMs where we need to think about implementation choices at all.
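As an aside, the quoted feature itself is small in isolation. A minimal sketch of retries with exponential backoff (not the author's actual implementation; `flaky_provider` just simulates an outage):

```python
import time

def with_backoff(fn, retries=5, base_delay=1.0, max_delay=30.0):
    for attempt in range(retries):
        try:
            return fn()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            # Delays of base, 2*base, 4*base, ... capped at max_delay.
            time.sleep(min(base_delay * 2 ** attempt, max_delay))

# Simulated LLM provider that fails twice, then recovers.
state = {"calls": 0}
def flaky_provider():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("provider down")
    return "response"

print(with_backoff(flaky_provider, base_delay=0.01))  # response
```

Which rather supports the commenter's point: the half hour goes into goals and tradeoffs (which errors to retry, jitter, caps), not into typing the loop.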

stavros

Small features or bugfixes generally take a minute or two of conversation.

miguelgrinberg

> One thing I’ve noticed is that different people get wildly different results with LLMs, so I suspect there’s some element of how you’re talking to them that affects the results.

It's always easier to blame the prompt and convince yourself that you have some sort of talent in how you talk to LLMs that others don't.

In my experience the differences are mostly in how the code produced by the LLM is reviewed. Developers who have experience reviewing code are more likely to find problems immediately and complain they aren't getting great results without a lot of hand holding. And those who rarely or never reviewed code from other developers are invariably going to miss stuff and rate the output they get higher.

zackify

This definitely is the case. I was talking to someone complaining about how LLMs don't work well.

They said it couldn't fix an issue it made.

I asked if they gave it any way to validate what it did.

They did not. Some people really are saying "fix this" instead of "x fn is doing y when someone makes a request to it. Please attempt to fix x and validate it by accessing the endpoint afterwards and writing tests".

It's shocking that some people don't give it any real instruction or any way to check itself.

In addition, I get great results doing voice-to-text with very specific workflows: asking it to add a new feature, describing which functions I want changed, then reviewing as I go rather than waiting for the end.

mbesto

> Its shocking some people don't give it any real instruction or way to check itself.

It's not shocking. The tech world is telling them that "Claude will write all of their app easily" with zero instructions/guidelines so of course they're going to send prompts like that.

tracker1

I think the implications of limited-to-no instructions vary a lot depending on what you're doing... CRUD APIs, sure... especially if you have a well-defined DB schema and API surface/approach. Anything that might get complex, less so.

Two areas I've really appreciated LLMs so far... one is being able to make web components that do one thing well in encapsulation.. I can bring it into my project and just use it... AI can scaffold a test/demo app that exercises the component with ease and testing becomes pretty straight forward.

The other for me has been in bridging rust to wasm and even FFI interfaces so I can use underlying systems from Deno/Bun/Node with relative ease... it's been pretty nice all around to say the least.

That said, this all takes work... lots of design work up front for how things should function... whether it's a UI component or an API backend library. From there, you have to add in testing, and some iteration to discover and ensure there aren't behavioral bugs in place. That means actually reviewing code, and especially the written test logic. LLMs tend to over-test in ways that are excessive or redundant a lot of the time, especially when a longer test function effectively also tests underlying functionalities that each had their own tests... cut those out.

There's nothing "free" and it's not all that "easy" either, assuming you actually care about the final product. It's definitely work, but it's more about the outcome and creation than the grunt work. As a developer, you'll be expected to think a lot more, plan and oversee what's getting done as opposed to being able to just bang out your own simple boilerplate for weeks at a time.

mikkupikku

It's surprising they don't learn better after their first hour or two of use. Or maybe they do know better, but they don't like the thing, so they deliberately give it rope to hang itself with, then blame overzealous marketing.

petcat

If you tell a human junior developer just "fix this" then they will spend a week on a wild-goose chase with nothing to show for it.

At least the LLM will only take 5 minutes to tell you they don't know what to do.

ruszki

Do they? I’ve never gotten a response that something was impossible or stupid. LLMs are happy to verify that a no-op does nothing if they don’t know how to fix something. They would rather make something useless than really tackle a problem, if that way they can make the tests green or claim that something “works”.

And I’ve never asked Claude Code for something that is really impossible, or even really difficult.

speakingmoistly

To be fair, that happening feels more like poor management and mentorship than "juniors are scatterbrained".

Over time, you build up the right reflexes to avoid a one-week goose chase with them. Heck, since we're working with people, you don't just say "fix this"; you earmark time to make sure everyone is aligned on what needs to be done and what the plan is.

dkersten

> At least the LLM will only take 5 minutes to tell you they don't know what to do.

In my experience, the LLM will happily try the wrong thing over and over for hours. It rarely will say it doesn’t know.

icedchai

An LLM might take 5 minutes, or 20 minutes, and still do the wrong thing. Rarely have I seen an LLM not "know what to do." A coworker told it to fix some unit tests, it churned away for a while, then changed a bunch of assert status == 200 to 500. Good news, tests pass now!

sobjornstad

There are subtler versions of this too. I've been working on a TUI app for a couple of weeks, and having great success getting it to interactively test by sending tmux commands, but every once in a while it would just deliver code that didn't work. I finally realized it was because the capture tools I gave it didn't capture the cursor location, so it would, understandably, get confused about where it was and what was selected.

I promptly went and fixed this before doing any more work, because I know if I was put in that situation I would refuse to do any more work until I could actually use the app properly. In general, if you wouldn't be able to solve a problem with the tools you give an LLM, it will probably do a bad job too.
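For anyone hitting the same wall: tmux can report the cursor position directly through its format variables, alongside the pane capture (the session name `mytui` is a placeholder; the demo below uses a throwaway detached session):

```shell
# Demo in a throwaway detached session; in practice, target the pane
# your agent is driving.
tmux new-session -d -s mytui -x 80 -y 24

# Pane text alone: no cursor information.
tmux capture-pane -p -t mytui

# Cursor column,row within the pane, via tmux format variables.
tmux display-message -p -t mytui '#{cursor_x},#{cursor_y}'

tmux kill-session -t mytui
```

Feeding that `cursor_x,cursor_y` pair to the agent along with the capture is exactly the missing information described above.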

raw_anon_1111

I made that mistake when I first started using Claude Code/Codex. Now I give it access to my isolated DEV AWS account, with appropriately scoped IAM-level permissions and temporary credentials, tell it how to validate the code, and have a rule in my markdown file to always use $x to test any changes to $y.

It’s gotten a lot better.

tracker1

Yeah, the more time I spend in planning and working through design/API documentation for how I want something to work, the better it does... Similarly for testing against your specifications, not the code... once you have a defined API surface and functional/unit tests for what you're trying to do, it's much harder for AI to actually mess things up. Even more interesting, IMO, is how much better the agents work with Rust vs other languages the more well-defined your specifications are.

Ancapistani

> some people really are saying "fix this" instead of saying "x fn is doing y when someone makes a request to it. Please attempt to fix x and validate it by accessing the endpoint after and writing tests"

This works about 85% of the time IME, in Claude Code. My normal workflow on most bugs is to just say “fix this” and paste the logs. The key is that I do it in plan mode, then thoroughly inspect and refine the plan before allowing it to proceed.

rirze

Untested hypothesis: LLM instruction is usually an intelligence- and communication-based skill. I find, in my non-authoritative experience, that users who give short-form instructions are generally ill-prepared for technical motivation (whether they're motivating LLMs or humans).

jzig

lol that is still “how you’re talking to them that affects the results” just more specific

raw_anon_1111

I have 30 years of experience delivering code and 10 years of leading architecture. My argument is that the only thing that matters is whether the entire implementation — code + architecture (your database, networking, the runtime that determines scaling, etc.) — meets the functional and non-functional requirements. Functional = does it meet the business requirements and UX; non-functional = scalability, security, performance, concurrency, etc.

I only carefully review the parts of the implementation that I know “work on my machine but will break once I put in a real world scenario”. Even before AI I wasn’t one of the people who got into geek wars worrying about which GOF pattern you should have used.

Except for concurrency, where it's hard to have automated tests, I care more about the unit tests (or, honestly, integration tests) and testing for scalability than about the code. Your login isn't slow because you chose a for loop instead of a while loop. I will have my agents run the appropriate tests after code changes.

I didn’t look at a line of code for my vibe coded admin UI authenticated with AWS cognito that at most will be used by less than a dozen people and whoever maintains it will probably also use a coding agent. I did review the functionality and UX.

Code before AI was always the grind between my architectural vision and the implementation.

awakeasleep

Explain how fragility of implementation (spaghetti code, high coupling, low cohesion) fits into your worldview?

petcat

As human developers, I think we're struggling with "letting go" of the code. The code we write (or agents write) is really just an intermediate representation (IR) of the solution.

For instance, GCC will inline functions, unroll loops, and perform myriad other optimizations whose details we don't care about (and actually want!). But when we review the ASM that GCC generates, we are not concerned with the "spaghetti" or the "high coupling" and "low cohesion". We care that it works and is correct for what it is supposed to do.

Source code in a higher-level language is not really different anymore. Agents write the code, maybe we guide them on patterns and correct them when they are obviously wrong, but the code is just the work-item artifact that comes out of extensive specification, discussion, proposal review, and more review of the reviews.

A well-guided, iterative process and problem/solution description should be able to generate an equivalent implementation whether a human is writing the code or an agent.

raw_anon_1111

You did see the part about my unit, integration and scalability testing? The testing harness is what prevents the fragility.

It doesn’t matter to AI whether the code is spaghetti code or not. What you said was only important when humans were maintaining the code.

No human should ever be forced to look at the code behind my vibe-coded internal admin portal: straight Python, no frameworks, server-side rendered, producing HTML and JS for the front end, all hosted in a single Lambda along with much of the backend API.
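For anyone curious what that kind of framework-free, single-Lambda, server-side-rendered portal looks like in outline, here is a minimal sketch. It is purely illustrative — routing and the Cognito auth described above are omitted, and the page content is made up:

```python
# Illustrative sketch only: one Lambda handler, no frameworks,
# returning server-side-rendered HTML (e.g. behind a function URL
# or API Gateway proxy integration). Auth and routing omitted.
def lambda_handler(event, context):
    params = event.get("queryStringParameters") or {}
    name = params.get("user", "admin")
    html = (
        "<!doctype html><html><body>"
        f"<h1>Admin portal</h1><p>Signed in as {name}.</p>"
        "</body></html>"
    )
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "text/html"},
        "body": html,
    }
```

The whole "stack" is one function returning a proxy-integration-shaped response dict, which is what makes it cheap to vibe-code and cheap to throw away.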

I haven’t done web development since 2002 with Classic ASP besides some copy and paste feature work once in a blue moon.

In my repos, post-AI, my Claude/agent files have summaries of the initial statement of work, the transcripts from the requirements sessions, my well-labeled design diagrams, my design-review session transcripts where I explained it to the client and answered questions, and a link to the Google NotebookLM project with all of the artifacts. I have separate md files for different implementation components.

The NotebookLM project can be used for any future maintainers to ask questions about the project based on all of the artifacts.

soulofmischief

I would like to introduce you to the concepts of interfaces and memory safety.

Well-designed interfaces enforce decoupling where it matters most. And believe it or not, you can do review passes after an LLM writes code, to catch bugs, security issues, bad architecture, reduce complexity, etc.

icedchai

In my experience, consulting companies typically have a bunch of low-to-medium skilled developers producing crap, so the situation with AI isn't much different. Some are better than others, of course.

datsci_est_2015

Also developer UX, common antipatterns, etc

This “the only thing that matters about code is whether it meets requirements” is such a tired take, and I can’t imagine anyone seriously spouting it has ever had to maintain real software.

keeda

I dunno, I have extensive experience reviewing code, and I still review all the AI generated code I own, and I find nothing to complain about in the vast majority of cases. I think it is based on "holding it right."

For instance, I've commented before that I tend to decompose tasks intended for AI to a level where I already know the "shape" of the code in my head, as well as what the test cases should look like. So reviewing the generated code and tests for me is pretty quick because it's almost like reading a book I've already read before, and if something is wrong it jumps out quickly. And I find things jumping out more and more infrequently.

Note that decomposing tasks means I'm doing the design and architecture, which I still don't trust the AI to do... but over the years the scope of tasks has gone up from individual functions to entire modules.

In fact, I'm getting convinced vibe coding could work now, but it still requires a great deal of skill. You have to give it the right context and sophisticated validation mechanisms that help it self-correct as well as let you validate functionality very quickly with minimal looks at the code itself.

arikrahman

"Holding it right" has been one of my biggest problems. Many times I find the output affected by prompt poisoning, and I have to throw away the entire context.

mikkupikku

It's not skill with talking to an LLM, it's the user's skill and experience with the problem they're asking the LLM to solve. They work better for problems the prompter knows well and poorly for problems the prompter doesn't really understand.

Try it yourself. Ask Claude for something you don't really understand. Then learn that thing, get a fresh instance of Claude, and try again; this time it will work much better, because your knowledge and experience will be naturally embedded in the prompt you write up.

Roxxik

It's not only about you understanding the how, but also about you understanding the goal.

I often use AI successfully, but in a few cases I had, it was bad. That was when I didn't even know the end goal and regularly switched the fundamental assumptions that the LLM tried to build up.

One case was a simulation where I wanted to see some specific property in the convergence behavior, but I had no idea how it would get there in the dynamics of the simulation or how it should behave when perturbed.

So the LLM tried many fundamentally different approaches and when I had something that specifically did not work it immediately switched approaches.

Next time I get to work on this (toy) problem, I will let it implement some of them, fully parametrize them, and have a go with them myself. There is a concrete goal, and I can play around myself to see if my specific convergence criterion is even possible.

FeepingCreature

LLMs massively reduce the cost of "let's just try this". I think trying to migrate your entire repo is usually a fool's errand. Figure out a way to break the load-bearing part of the problem out into a sub-project, solve it there, iterate as much as you like. Claude can give you a test gui in one or two minutes, as often as you like. When you have it reliably working there, make Claude write up a detailed spec and bring that back to the main project.

mikkupikku

Yup, same sort of experience. If I'm fishing for something based on vibes that I can't really visualize or explain, it's going to be a slog. That said, telling the LLM the nature of my dilemma up front, warning it that I'll be waffling, seems to help a little.

__alexs

I review most of the code I get LLMs to write and actually I think the main challenge is finding the right chunk size for each task you ask it to do.

As I use it more, I gain more intuition about the kinds of problems it can handle on its own, vs. those that I need to break down into smaller pieces before setting it loose.

Without research and planning, agents are mostly very expensive and slow to get things done, if they even can. With the right initial breakdown and specification of the work, however, they are incredibly fast.

make_it_sure

you are overestimating the skill of code review. Some people have very specific ways of writing code and solving problems which are not aligned with what LLMs write, but that doesn't mean the LLM is wrong.

I know senior developers who are very radical about some nonsense patterns they think are much better than others. If they see code that doesn't follow them, they say it's trash.

Even so, you can guide the LLM to write the code as you like.

And you are wrong: a lot of it is in how people write the prompt.

datsci_est_2015

> you are overestimating the skill of code review.

“You are overestimating the skill of [reading, comprehending, and critically assessing code of a non-guaranteed quality]” is an absurd statement if you properly expand out what “code review” means.

I don’t care if you code review the CSS file for the Bojangles online menu web page, but you better be code reviewing the firmware for my dad’s pacemaker.

This whole back and forth with LLM-generated code makes me think that the marginal utility of a lot of code the strong proponents write is <1¢. If I fuck up my code, it costs our partners $200/hr per false alert, which obliterates the profit margin of using our software in the first place.

AIorNot

By far, most of the code LLMs write is for crappy CRUD apps and webapps, not pacemakers and rockets.

We can get enough reliability out of what LLMs produce there with guided integration tests and UX tests, along with code review, using other LLMs to review, and other strategies to prevent semantic and code drift.

Do you know how many crap WordPress, Drupal, and Joomla sites I have seen?

Just that work can be automated away.

But I've also worked in high-end and mission-critical delivery with more formal verification, etc. That's just moving the goalposts on what AI can do; it will get there eventually.

Last year you all here were arguing AI couldn't code; now everyone has moved the goalposts to formal, high-end, and mission-critical ops. Yes, when money matters, we humans are still needed, of course; no one is denying that. It's about the utility of the sole human developer against the onslaught of machine-aided coding.

This profession is changing rapidly; people are stuck in denial.

cultofmetatron

> Developers who have experience reviewing code are more likely to find problems immediately and complain they aren't getting great results without a lot of hand holding

this makes me feel better about the amount of disdain I've been feeling for the output from these LLMs. Sometimes it pops out exactly what I need, but I can never count on it not to go off the rails and require a lot of manual editing.

mcv

Exactly my experience. Sometimes it's brilliant, sometimes it produces crap, often it produces something that's a step in the right direction but requires extra work, and often it switches between these different results, producing great results at first until it gets stuck and desperately starts spewing out increasingly weird garbage.

As a developer, you always have to check the code, and recognise when it's just being stupid.

k3nx

Question: are you manually making those changes to the "stupid" code? I've been having success with Claude using skills. When I see something I wouldn't do, I say what I would have done, ask it why it did it the way it did, then have it update the skills with a better plan. It's like a rubber duck, and I understand it better. I have it make the code improvements. Laughing as it goes off the rails is entertaining, though.

jjice

I think that's absolutely part of it. Code reviewing has become an even more valuable skill than ever, and I think the industry as a whole still is treating it as low value, despite it always being one of the most important parts of the process.

I think another part (among many others) is not the skill of the individual prompting, but the quality of the code and documentation (human- and agent-specific) in the code base. I've seen people run willy-nilly with LLMs that are just spitting out nonsense because there are no examples of how the code should look, no documentation on how it should structure the code, and no human who knows how the code should work reviewing it. A deadly combo for producing bad, unmaintainable code.

If you sort those out though (and review your own damn LLM code), I think that's when LLMs become a powerful programming tool.

I really liked Simon Willison's way of putting it: "Your job is to deliver code you have proven to work".

https://simonwillison.net/2025/Dec/18/code-proven-to-work/

plastic041

I wanted to know how to make software with an LLM "without losing the benefit of knowing how the entire system works" while staying "intimately familiar with each project’s architecture and inner workings", despite having "never even read most of their code". (Because obviously, you can't.) But OP didn't explain that.

You tell LLM to create something, and then use another LLM to review it. It might make the result safer, but it doesn't mean that YOU understand the architecture. No one does.

ashwinsundar

Hot take: you can't have your cake and eat it too. If you aren't writing code, designing the system, creating architecture, or even writing the prompt, then you're not understanding shit. You're playing slots with stochastic parrots

    The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.
- Karpathy 2025

simonw

Your Karpathy quote there is out of context. It starts with: https://twitter.com/karpathy/status/1886192184808149383

  There's a new kind of coding I call "vibe
  coding", where you fully give in to the
  vibes, embrace exponentials, and forget
  that the code even exists.
Not all AI-assisted programming is vibe coding. If you're paying attention to the code that's being produced, you can guide it towards being just as high quality as (or even higher quality than) code you would have written by hand.

ashwinsundar

It's appropriate for the commenter I was replying to, who asked how they can understand things, "while having never even read most of their code."

I like AI-assisted programming, but if I fail to even read the code produced, then I might as well treat it like a no-code system. I can understand the high-levels of how no-code works, but as soon as it breaks, it might as well be a black box. And this only gets worse as the codebase spans into the tens of thousands of lines without me having read any of it.

The (imperfect) analogy I'm working on is a baker who bakes cakes. A nearby grocery store starts making any cake they want, on demand, so the baker decides to quit baking cakes and buy them from the store. The baker calls the store anytime they want a new cake, and just tells them exactly what they want. How long can that baker call themself a "baker"? How long before they forget how to even bake a cake, and all they can do is get cakes from the grocer?

imiric

> Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away.

It's insane that this quote is coming from one of the leading figures in this field. And everyone's... OK that software development has been reduced to chance and brute force?


stavros

There are two ways to approach this. One is a priori: "If you aren't doing the same things with LLMs that humans do when writing code, the code is not going to work".

The other one is a posteriori: "I want code that works, what do I need to do with LLMs?"

Your approach is the former, which I don't think works in reality. You can write code that works (for some definition of "works") with LLMs without doing it the way a human would do it.

ChrisGreenHeur

The hardware you typed this on was designed by hardware architects who write little to no code; they just type up a spec to be implemented by Verilog coders.


jbergqvist

I've found that spending most of my time on design before any code gets written makes the biggest difference.

The way I think about it: the model has a probability distribution over all possible implementations, shaped by its training data. Given a vague prompt, that distribution is wide and you're likely to get something generic. As you iterate on a design with the model (really just refining the context), the distribution narrows towards a subset of implementations. By the time the model writes code, you've constrained the space enough that most of what it produces is actually what you want.

ugtr3

Yeah, but that design part is the most expensive part. The code generation is pretty trivial; the advantage of LLMs is the power to search through pre-trained information spaces, much faster than a human could. The issue is, again, that it's probabilistic, so there's variance.

jumploops

This is similar to how I use LLMs (architect/plan -> implement -> debug/review), but after getting bit a few times, I have a few extra things in my process:

The main difference between my workflow and the author's is that I have the LLM "write" the design/plan/open questions/debug/etc. into markdown files, for almost every step.

This is mostly helpful because it "anchors" decisions into timestamped files, rather than just loose back-and-forth specs in the context window.

Before the current round of models, I would religiously clear context and rely on these files for truth, but even with the newest models/agentic harnesses, I find it helps avoid regressions as the software evolves over time.

A minor difference between myself and the author is that I don't rely on specific sub-agents (beyond what the agentic harness has built in for, e.g., file exploration).

I say it's minor, because in practice the actual calls to the LLMs undoubtedly look quite similar (clean context window, different task/model, etc.).

One tip, if you have access, is to do the initial design/architecture with GPT-5.x Pro, and then take the output "spec" from that chat/iteration to kick-off a codex/claude code session. This can also be helpful for hard to reason about bugs, but I've only done that a handful of times at this point (i.e. funky dynamic SVG-based animation snafu).
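The "anchor decisions into timestamped files" habit can be sketched as a tiny helper. The file-naming scheme and docs directory here are assumptions for illustration, not the commenter's exact setup:

```python
from datetime import datetime, timezone
from pathlib import Path

def write_step_doc(docs_dir: Path, step: str, body: str) -> Path:
    """Persist one step of the workflow (design, plan, open questions,
    debug notes) as a timestamped markdown file, so later sessions can
    rely on checked-in files for truth rather than the context window."""
    docs_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    path = docs_dir / f"{stamp}-{step}.md"
    path.write_text(f"# {step}\n\n{body}\n", encoding="utf-8")
    return path
```

Because each file is timestamped and committed, a fresh session (or a different model) can reconstruct why a decision was made without any of the original conversation.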

lelele

> The main difference between my workflow and the author's is that I have the LLM "write" the design/plan/open questions/debug/etc. into markdown files, for almost every step.

> This is mostly helpful because it "anchors" decisions into timestamped files, rather than just loose back-and-forth specs in the context window.

Would you please expand on this? Do you make the LLM append its responses to a Markdown file, prefixed with timestamps, basically preserving the whole context in a file? Or do you make the LLM update some reference files in order to keep a "condensed" context? Thank you.

aix1

Not the GP, but I currently use a hierarchy of artifacts: requirements doc -> design docs (overall and per-component) -> code+tests. All artifacts are version controlled.

Each level in the hierarchy is empirically ~5X smaller than the level below. This, plus sharding the design docs by component, helps Claude navigate the project and make consistent decisions across sessions.

My workflow for adding a feature goes something like this:

1. I iterate with Claude on updating the requirements doc to capture the desired final state of the system from the user's perspective.

2. Once that's done, a different instance of Claude reads the requirements and the design docs and updates the latter to address all the requirements listed in the former. This is done interactively with me in the loop to guide and to resolve ambiguity.

3. Once the technical design is agreed, Claude writes a test plan, usually almost entirely autonomously. The test plan is part of each design doc and is updated as the design evolves.

3a. (Optionally) another Claude instance reviews the design for soundness, completeness, consistency with itself and with the requirements. I review the findings and tell it what to fix and what to ignore.

4. Claude brings unit tests in line with what the test plan says, adding/updating/removing tests but not touching code under test.

4a. (Optionally) the tests are reviewed by another instance of Claude for bugs and inconsistencies with the test plan or the style guide.

5. Claude implements the feature.

5a. (Optionally) another instance reviews the implementation.

For complex changes, I'm quite disciplined about having each step carried out in a different session, so that all communications happen via checked-in artifacts and not through context. For simple changes, I often don't bother and/or skip the reviews.

From time to time, I run standalone garbage collection and consistency checks, where I get Claude to look for dead code, low-value tests, stale parts of the design, duplication, requirements-design-tests-code drift etc. I find it particularly valuable to look for opportunities to make things simpler or even just smaller (fewer tokens/less work to maintain).

Occasionally, I find that I need to instruct Claude to write a benchmark and use it with a profiler to optimise something. I check these in but generally don't bother documenting them. In my case they tend to be one-off things and not part of some regression test suite. Maybe I should just abandon them and re-create them if they're ever needed again.

I also have a (very short) coding style guide. It only includes things that Claude consistently gets wrong or does in ways that are not to my liking.
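On disk, the artifact hierarchy described above might look something like this (the names are illustrative, not the commenter's actual layout):

```
project/
├── docs/
│   ├── requirements.md        # desired end state, user's perspective
│   ├── design/
│   │   ├── overview.md        # overall technical design
│   │   ├── component-a.md     # per-component design, incl. test plan
│   │   └── component-b.md
│   └── style-guide.md         # only what Claude consistently gets wrong
├── src/                       # implementation (~5X the design, empirically)
└── tests/                     # unit tests, kept in line with the test plans
```

Everything being version-controlled is what lets each step run in a fresh session: the checked-in files, not the context window, carry the state.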

stavros

I don't know if I explained this clearly enough in the article, but I have the LLM write the plan to a file as well. The architect's end result is a plan file in the repo, and the developer reads that.

You can see one here: https://github.com/skorokithakis/sleight-of-hand/blob/master...

Havoc

Yeah same. The markdown thing also helps with the multi model thing. Can wipe context and have another model look at the code and markdown plan with fresh eyes easily

gehsty

When I use Claude code to work on a hobby project it feels like doom scrolling…

I can’t get my head around whether the hobby is the making or the having, but it’s fair to say I’ve felt quite dissatisfied at the end of my hobby sessions lately, so I’m leaning towards the former.

Levitating

Agreed, I code for fun. But I am not sure if I still find it fun if the LLM just makes what I want.

thenthenthen

Haha love the Sleight of hand irregular wall clock idea. I once had a wall clock where the hand showing the seconds would sometimes jump backwards, it was extremely unsettling somehow because it was random. It really did make me question my sanity.

kqr

This used to be one of my recurring nightmares when I was a child. The three I remember were (1) clocks suddenly starting to go backwards, either partially or completely; (2) radio turning on without being able to turn it off, and (3) house fire. There really is something about clocks.

cpt_sobel

In the plethora of articles that explain the process of building projects with LLMs, one thing I never understood is why the authors seem to write the prompts as if talking to a human that cares how good their grammar or syntax is, e.g.:

> I'd like to add email support to this bot. Let's think through how we would do this.

and I'm not even talking about the usage of "please" or "thanks" (which this particular author doesn't seem to be doing).

Is there any evidence that suggests the models do a better job if I write my prompt like this instead of "wanna add email support, think how to do this"? In my personal experience (mostly with Junie) I haven't seen any advantage of being "polite", for lack of a better word, and I feel like I'm saving on seconds and tokens :)

dgb23

I can't speak for everyone, but to me the most accurate answer is that I'm role-playing, because it just flows better.

In the back of my head I know the chatbot is trained on conversations and I want it to reflect a professional and clear tone.

But I usually keep it more simple in most cases. Your example:

> I'd like to add email support to this bot. Let's think through how we would do this.

I would likely write as:

> if i wanted to add email support, how would you go about it

or

> concise steps/plan to add email support, kiss

But when I'm in a brainstorm/search/rubber-duck mode, then I write more as if it was a real conversation.

xnorswap

I agree, it's just easier to write requirements and refine things as if writing with a human. I no longer care that it risks anthropomorphising it, as that fight has long been lost. I prefer to focus on remembering that it doesn't actually think/reason, rather than on not being polite to it.

Keeping everything generally "human readable" also has the advantage of being easier for me to review later if needed.

alkonaut

I also always imagine that if I'm joined by a colleague on this task they might have to read through my conversation and I want to make it clear to a human too.

As you said, that "other person" might be me too. Same reason I comment code. There's another person reading it, most likely that other person is "me, but next week and with zero memory of this".

We do like anthropomorphising the machines, but I try to think they enjoy it...

jstanley

How can you use these models for any length of time and walk away with the understanding that they do not think or reason?

What even is thinking and reasoning if these models aren't doing it?

kqr

I think it mattered a lot more a few years ago, when the user's prompts were almost all the context the LLM had to go by. A prompt written in a sloppy style would cause the LLM to respond in a sloppy style (since it's a snazzy autocomplete at its core). LLMs reason in tokens, so a sloppy style leads it to mimic the reasoning that it finds in the sloppy writing of its training data, which is worse reasoning.

These days, the user prompt is just a tiny part of the context it has, so it probably matters less or not at all.

I still do it though, much like I try to include relevant technical terminology to try to nudge its search into the right areas of vector space. (Which is the part of the vector space built from more advanced discourse in the training material.)

tarsinge

The reasoning is that by being polite, the LLM is more likely to stay on a professional path: at its core, an LLM tries to make your prompt plus its answer coherent with its training set, and a polite prompt plus a professional answer will score higher (give better results) than a prompt that is out of place with the answer. I understand that to some people it could feel like anthropomorphising and could turn them off, but to me it's purely about engineering.

Edit: wording

wiseowise

> The reasoning is by being polite the LLM is more likely to stay on a professional path

So no evidence.

cpt_sobel

> If the result of your prompt + its answer is more likely to score higher, i.e. gives a better result, than a prompt that feels out of place with the answer

Sure, it seems like this could be the case with the structure of the prompt, but what about capitalizing the first letter of a sentence, or adding commas, tag questions, etc.? They seem like semantics that will not play any role in the end.

spudlyo

Writing is what gives my thinking structure. Sloppy writing feels to me like sloppy thinking. My fingers capitalize the first letter of words, proper nouns and adjectives, and add punctuation without me consciously asking them to do so.

TheDong

Why wouldn't capitalization, commas, etc do well?

These are text completion engines.

Punctuation and capitalization is found in polite discussion and textbooks, and so you'd expect those tokens to ever so slightly push the model in that direction.

Lack of capitalization pushes towards text messages and irc perhaps.

We cannot reason about these things in the same way we can reason about using search engines, these things are truly ridiculous black boxes.

pegasus

That's orthography, not semantics, but it's still part of the professional style steering the model on the "professional path" as GP put it.

vitro

For me it is just a good habit that I want to keep.

mrbungie

I remember studies that showed that being mean to the LLM got better answers, but on the other hand I also remember a study showing that maximizing bug-related parameters ended up with meaner/malignant LLMs.

cpt_sobel

Surely this could depend on the model, and I'm only hypothesizing here, but being mean (or just having a dry tone) might equal a "cut the glazing" implicit instruction to the model, which would help I guess.

raincole

Because some people like to be polite? Is it that hard to understand? Your hand-written prompts are unlikely to take up a significant chunk of the context window anyway.

cpt_sobel

Polite to whom?

qsera

I think it is easier to be polite always and not switch between polite and non-polite mode depending on who you are talking to.

northzen

To the machine. It's just easier to be polite by default than to split our language into two forms, "I speak to a human" and "I speak to a machine", because the chat interface is really close to what we see when we speak to a human. Well, exactly the same.

raincole

To the computer? Many cultures pay respect to mountains, rivers, and fields. Many people act differently around churches and monuments. I see nothing wrong with being polite to a machine, especially a machine that makes you money.

jstummbillig

Anything or anyone. Being polite to your surroundings reflects in your surroundings.

trq01758

My view is that when some "for bots only" type of writing becomes a habit, communication with humans will atrophy. Tokens be damned, but this kind of context switch comes at much too high a cost.

vikramkr

For models that reveal reasoning traces, I've seen their inner nature as a word calculator show up as they spend way too many tokens complaining about a typo (and AI code-review bots also seem obsessed with typos, to the point where in a mid harness a few too many irrelevant typos means the model fixates on them and doesn't catch other errors). I don't know if they've gotten better at that recently, but why bother? Plus there's probably something to the model trying to match the user's style (it is autocomplete with many extra steps), resulting in sloppier output if you give it a sloppier prompt.

movpasd

I prompt politely for two reasons: I suspect it makes the model less likely to spiral (but have no hard evidence either way), and I think it's just good to keep up the habit for when I talk to real people.

stavros

I write "properly" (and I do say "please" and "thank you"), just because I like exercising that muscle. The LLM doesn't care, but I do.

takwatanabe

We build and run a multi-agent system. Today Cursor won. For a log analysis task: Cursor, 5 minutes; our pipeline, 30 minutes.

Still, a case for it:

1. Isolated contexts per role (CS vs. engineering): agents don't bleed into each other

2. Hard permission boundaries per agent

3. Local models (Qwen) for cheap routine tasks

Multi-agent loses at debugging. But the structure has value.
