My AI Adoption Journey

libraryofbabel

This is such a lovely balanced thoughtful refreshingly hype-free post to read. 2025 really was the year when things shifted and many first-rate developers (often previously AI skeptics, as Mitchell was) found the tools had actually got good enough that they could incorporate AI agents into their workflows.

It's a shame that AI coding tools have become such a polarizing issue among developers. I understand the reasons, but I wish there had been a smoother path to this future. The early LLMs like GPT-3 could sort of code enough for it to look like there was a lot of potential, and so there was a lot of hype to drum up investment and a lot of promises made that weren't really viable with the tech as it was then. This created a large number of AI skeptics (of whom I was one, for a while) and a whole bunch of cynicism and suspicion and resistance amongst a large swathe of developers. But could it have been different? It seems a lot of transformative new tech is fated to evolve this way. Early aircraft were extremely unreliable and dangerous and not yet worthy of the promises being made about them, but eventually with enough evolution and lessons learned we got the Douglas DC-3, and then in the end the 747.

If you're a developer who still doesn't believe that AI tools are useful, I would recommend you go read Mitchell's post, and give Claude Code a trial run like he did. Try and forget about the annoying hype and the vibe-coding influencers and the noise and just treat it like any new tool you might put through its paces. There are many important conversations about AI to be had, it has plenty of downsides, but a proper discussion begins with close engagement with the tools.

keyle

Architects went from drawing everything on paper, to using CAD products over a generation. That's a lot of years! They're still called architects.

Our tooling just had a refresh in less than 3 years and it leaves heads spinning. People are confused, fighting for or against it. Torn even between 2025 to 2026. I know I was.

People need a way to describe it from 'agentic coding' to 'vibe coding' to 'modern AI assisted stack'.

We don't call architects 'vibe architects' even though they copy-paste 4/5th of your next house and use a library of things in their work!

We don't call builders 'vibe builders' for using earth-moving machines instead of a shovel...

When was the last time you reviewed the machine code produced by a compiler? ...

The real issue this industry is facing, is the phenomenal speed of change. But what are we really doing? That's right, programming.

atomicnumber3

"When was the last time you reviewed the machine code produced by a compiler?"

Compilers will produce working output given working input literally 100% of my time in my career. I've never personally found a compiler bug.

Meanwhile AI can't be trusted to give me a recipe for potato soup. That is to say, I would under no circumstances blindly follow the output of an LLM I asked to make soup. While I have, every day of my life, gladly sent all of the compiler output to the CPU without ever checking it.

The compiler metaphor is simply incorrect and people trying to say LLMs compile English into code insult compiler devs and English speakers alike.

LiamPowell

> Compilers will produce working output given working input literally 100% of my time in my career.

In my experience this isn't true. People just assume their code is wrong and mess with it until they inadvertently do something that works around the bug. I've personally reported 17 bugs in GCC over the last 2 years and there are currently 1241 open wrong-code bugs.

Here's an example of a simple to understand bug (not mine) in the C frontend that has existed since GCC 4.7: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105180

rootnod3

Absolutely this. I am tired of that trope.

Or the argument that "well, at some point we can come up with a prompt language that does exactly what you want and you just give it a detailed spec." A detailed spec is called code. It's the most round-about way to make a programming language that even then is still not deterministic at best.

andai

This is obviously beside the point but I did blindly follow a wiener schnitzel recipe ChatGPT made me and cooked for a whole crew. It turned out great. I think I got lucky though, the next day I absolutely massacred the pancakes.

bostik

Everything more complex than a hello-world has bugs. Compiler bugs are uncommon, but not that uncommon. (I must have debugged a few ICEs in my career, but luckily have had more skilled people to rely on when code generation itself was wrong.)

Compilers aren't even that bad. The stack goes much deeper and during your career you may be (un)lucky enough to find yourself far below compilers: https://bostik.iki.fi/aivoituksia/random/developer-debugging...

NB. I've been to vfs/fs depths. A coworker relied on an oscilloscope quite frequently.

pcl

“I've never personally found a compiler bug.”

I remember the time I spent hours debugging a feature that worked on Solaris and Windows but failed to produce the right results on SGI. Turns out the SGI C++ compiler silently ignored the `throw` keyword! Just didn’t emit an opcode at all! Or maybe it wrote a NOP.

All I’m saying is, compilers aren’t perfect.

I agree about determinism though. And I mitigate that concern by prompting AI assistants to write code that solves a problem, instead of just asking for a new and potentially different answer every time I execute the app.

idopmstuff

> Meanwhile AI can't be trusted to give me a recipe for potato soup.

This just isn't true any more. Outside of work, my most common use case for LLMs is probably cooking. I used to frequently second guess them, but no longer - in my experience SOTA models are totally reliable for producing good recipes.

I recognize that at a higher level we're still talking about probabilistic recipe generation vs. deterministic compiler output, but at this point it's nonetheless just inaccurate to act as though LLMs can't be trusted with simple (e.g. potato soup recipe) tasks.

bayindirh

Compilers and processors are deterministic by design. LLMs are non-deterministic by design.

It's not apples vs. oranges. They are literally opposite of each other.

anematode

I'm trying to track down a GCC miscompilation right now ;)

allworms

> We don't call architects 'vibe architects' even though they copy-paste 4/5th of your next house and use a library of things in their work!

> We don't call builders 'vibe builders' for using earth-moving machines instead of a shovel...

> When was the last time you reviewed the machine code produced by a compiler?

Sure, because those are categorically different. You are describing shortcuts of two classes: boilerplate (library of things) and (deterministic/intentional) automation. Vibe coding doesn't use either of those things. The LLM agents involved might use them, but the vibe coder doesn't.

Vibe coding is delegation, which is a completely different class of shortcut or "tool" use. If an architect delegates all their work to interns, directs outcomes based on whims not principles, and doesn't actually know what the interns are delivering, yeah, I think it would be fair to call them a vibe architect.

We didn't have that term before, so we usually just call those people "arrogant pricks" or "terrible bosses". I'm not super familiar but I feel like Steve Jobs was pretty famously that way - thus if he was an engineer, he was a vibe engineer. But don't let this last point detract from the message, which is that you're describing things which are not really even similar to vibe coding.

tjr

Delegation, yes.

I do not see LLM coding as another step up on the ladder of programming abstraction.

If your project is in, say, Python, then by using LLMs, you are not writing software in English; you are having an LLM write software for you in Python.

This is much more like delegation of work to someone else, than it is another layer in the machine-code/assembly/C/Python sort of hierarchy.

In my regular day job, I am a project manager. I find LLM coding to be effectively project management. As a project manager, I am free to dive down to whatever level of technical detail I want, but by and large, it is others on the team who actually write the software. If I assign a task, I don't say "I wrote that code", because I didn't; someone else did, even if I directed it.

And then, project management, delegating to the team, is most certainly nondeterministic behavior. Any programmer on the team might come up with a different solution, each of which works. The same programmer might come up with more than one solution, all of which work.

I don't expect the programmers to be deterministic. I do expect the compiler to be deterministic.

djhn

I think you are right in placing emphasis on delegation.

There’s been a hypothesis floating around that I find appealing. Seemingly you can identify two distinct groups of experienced engineers. Manager, delegator, or team lead style senior engineers are broadly pro-AI. The craftsman, wizard, artist, IC style senior engineers are broadly anti-AI.

But coming back to architects, or most professional services and academia to be honest, I do think the term vibe architect as you define it is exactly how the industry works. An underclass of underpaid interns and juniors do the work, hoping to climb higher and position themselves towards the top of the ponzi-like pyramid scheme.

tehnub

Totally on point, except I'm pretty sure Jobs was not like that. From what I've read he'd be more of a hands on "agentic engineer". Baby-sitting his engineers and designers and steering them.

barrenko

Architects still need to learn to draw manually quite well to pass exams and stuff.

dns_snek

> We don't call architects 'vibe architects' even though they copy-paste 4/5th of your next house and use a library of things in their work!

Architect's copy-pasting is equivalent to a software developer reusing a tried and tested code library. Generating or writing new code is fundamentally different and not at all comparable.

> We don't call builders 'vibe builders' for using earth-moving machines instead of a shovel...

We would call them "vibe builders" if their machines threw bricks around randomly and the builders focused all of their time on engineering complex scaffolding around the machines to get the bricks flying roughly in the right direction.

But we don't because their machines, like our compilers and linters, do one job and they do it predictably. Most trades spend obscene amounts of money on tools that produce repeatable results.

> That's a lot of years! They're still called architects.

Because they still architect, they don't subcontract their core duties to architecture students overseas and just sign their name under it.

I find it fitting and amusing that people who are uncritical towards the quality of LLM-generated work seem to make the same sorts of reasoning errors that LLMs do. Something about blind spots?

Applejinx

Very likely, yes. One day we'll have a clearer understanding of how minds generalize concepts into well-trodden paths even when they're erroneous, and it'll probably shed a lot of light onto concepts like addiction.

AlotOfReading

Don't take this as criticizing LLMs as a whole, but architects also don't call themselves engineers. Engineers are an entirely distinct set of roles that among other things validate the plan in its totality, not only the "new" 1/5th. Our job spans both of these.

"Architect" is actually a whole career progression of people with different responsibilities. The bottom rung used to be the draftsmen, people usually without formal education who did the actual drawing. Then you had the juniors, mid-levels, seniors, principals, and partners who each oversaw different aspects. The architects with their name on the building were already issuing high level guidance before the transition instead of doing their own drawings.

> When was the last time you reviewed the machine code produced by a compiler?

Last week, to sanity check some code written by an LLM.

throwup238

> Engineers are an entirely distinct set of roles that among other things validate the plan in its totality, not only the "new" 1/5th. Our job spans both of these.

Where this analogy breaks down is that the work you’re describing is done by Professional Engineers that have strict licensing and are (criminally) liable for the end result of the plans they approve.

That is an entirely different role from the army of civil, mechanical, and electrical engineers (some who are PEs and some who are not) who do most of the work for the principal engineer/designated engineer/engineer of record, that have to trust building codes and tools like FEA/FEM that then get final approval from the most senior PE. I don’t think the analogy works, as software engineers rarely report to that kind of hierarchy. Architects of Record on construction projects are usually licensed with their own licensing organization too, with layers of licensed and unlicensed people working for them.

rhubarbtree

Reasoning by analogy is usually a bad idea, and nowhere is this worse than talking about software development.

It’s just not analogous to architecture, or cooking, or engineering. Software development is just its own thing. So you can’t use analogy to get yourself anywhere with a hint of rigour.

The problem is, AI is generating code that may be buggy, insecure, and unmaintainable. We have as a community spent decades trying to avoid producing that kind of code. And now we are being told that productivity gains mean we should abandon those goals and accept poor quality, as evidenced by MoltBook’s security problems.

It’s a weird cognitive dissonance and it’s still not clear how this gets resolved.

Applejinx

Now then, Moltbook is a pathological case. Either it remains a pathological case or our whole technological world is gonna stumble HARD as all the fundamental things collapse.

I prefer to think Moltbook is a pathological case and unrepresentative, but I've also been rethinking a sort of game idea from computer-based to entirely paper/card based (tariffs be damned) specifically for this reason. I wish to make things that people will have even in the event that all these nice blinky screens are ruined and go dark.

samiv

It's not about the tooling, it's about the reasoning. An architect copy-pasting existing blueprints is still in charge and has to decide what to copy-paste and where. Same as a programmer slapping a bunch of code together, plumbing libraries or writing fresh code. They are the ones who drive the logical reasoning and the building process.

The ai tooling reverses this where the thinking is outsourced to the machine and the user is borderline nothing more than a spectator, an observer and a rubber stamp on top.

Anyone who is in this position seriously needs to think about their value added. How do they plan to justify their position and salary to the capital class? If the machine is doing the work for you, why would anyone pay you as much as they do when they can just replace you with someone cheaper, ideally with no one for maximum profit?

Everyone is now in a competition not only against each other but also against the machine. And any specialized expert-knowledge moat that you've built over decades of hard work is about to evaporate.

This is the real pressing issue.

And the only way you can justify your value added, your position, your salary is to be able to undermine the AI, find flaws in its output and reasoning. After all if/when it becomes flawless you have no purpose to the capital class!

radarsat1

> The ai tooling reverses this where the thinking is outsourced to the machine and the user is borderline nothing more than a spectator, an observer and a rubber stamp on top.

I find it a bit rare that this is the case though. Usually I have to carefully review what it's doing and guide it. Either by specific suggestions, or by specific tests, etc. I treat it as a "code writer" that doesn't necessarily understand the big picture. So I expect it to fuck up, and correcting it feels far less frustrating if you consider it a tool you are driving rather than letting it drive you. It's great when it gets things right but even then it's you that is confirming this.

moregrist

> When was the last time you reviewed the machine code produced by a compiler? ...

Any time I’m doing serious optimization or knee-deep in debugging something where the bug emerged at -O2 but not at -O0.

Sometimes just for fun to see what the compiler is doing in its optimization passes.

You severely limit what you can do and what you can learn if you never peek underneath.
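
For anyone who hasn't peeked before, here is a rough sketch of one way to compare what the compiler emits at two optimization levels. It assumes `gcc` is on your PATH; `asm_command`, `asm_diff` and the file name `bug.c` are names made up for this illustration, not part of any tool.

```python
import difflib
import shutil
import subprocess


def asm_command(source_path, opt_level):
    """Build a gcc command that writes assembly for source_path to stdout."""
    return ["gcc", "-S", f"-O{opt_level}", "-o", "-", source_path]


def asm_diff(source_path, low=0, high=2):
    """Return a unified diff of the assembly at two optimization levels."""
    if shutil.which("gcc") is None:
        raise RuntimeError("gcc not found on PATH")
    outputs = []
    for level in (low, high):
        result = subprocess.run(
            asm_command(source_path, level),
            capture_output=True,
            text=True,
            check=True,
        )
        outputs.append(result.stdout.splitlines())
    diff = difflib.unified_diff(
        outputs[0],
        outputs[1],
        fromfile=f"-O{low}",
        tofile=f"-O{high}",
        lineterm="",
    )
    return "\n".join(diff)
```

Calling `print(asm_diff("bug.c"))` on a small reproducer is usually enough to see which loads, stores or branches disappeared at the higher level.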

borroka

Architects went from drawing everything on paper to using CAD, not over a generation, but over a few years, after CAD and computers got good enough.

It therefore depends on where we place the discovery/availability of the product. If we place it at the time of prototype production (in the early 1960s for CAD), it took a generation (20-30 years), since by the early and mid-1990s, all professionals were already using CAD.

But if we place it at the time when CAD and personal computers became available to the general public (e.g., mid-1980s), it took no more than 5-10 years. I attended a technical school in the 1990s, and we started with hand drawing in the first two years and used CAD systems in the remaining three years of school.

The same can be said for AI. If we place the beginning of AI in the mid-1980s, the wider adoption of AI took more than a generation. If we place it at the time OpenAI developed GPT, it took 5-10 years.

datsci_est_2015

I skimmed over it, and didn’t find any discussion of:

  - Pull requests
  - Merge requests
  - Code review

I feel like I’m taking crazy pills. Are SWE supposed to move away from code review, one of the core activities for the profession? Code review is as fundamental for SWE as double entry is for accounting.

Yes, we know that functional code can get generated at incredible speeds. Yes, we know that apps and what not can be bootstrapped from nothing by “agentic coding”.

We need to read this code, right? How can I deliver code to my company without security and reliability guarantees that, at their core, come from me knowing what I’m delivering line-by-line?

bthornbury

Either really comprehensive tests (that you read) or read it. Usually I find you can skim most of it, but in core sections like billing or something you gotta really review it. The models still make mistakes.

mattmanser

You can't skim over AI code.

For even mid-level tasks it will make bad assumptions, like sorting orders or timezone conversions.

Basic stuff really.

You've probably got a load of ticking time bomb bugs if you've just been skimming it.
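
To make that concrete, here is a small made-up example (not taken from any real review) where sorting and timezones interact. The order names and timestamps are hypothetical. Sorting the raw strings looks fine at a glance, but only the version that parses the offsets puts the orders in true chronological order.

```python
from datetime import datetime

# Hypothetical order timestamps as ISO 8601 strings with UTC offsets.
orders = [
    ("A", "2025-01-01T23:00:00-05:00"),  # 2025-01-02 04:00 UTC
    ("B", "2025-01-02T01:00:00+09:00"),  # 2025-01-01 16:00 UTC
]


def sort_by_text(rows):
    """Plausible-looking but wrong: compares the raw strings."""
    return [name for name, _ in sorted(rows, key=lambda row: row[1])]


def sort_by_instant(rows):
    """Parses each timestamp so the UTC offset is taken into account."""
    return [
        name
        for name, _ in sorted(rows, key=lambda row: datetime.fromisoformat(row[1]))
    ]


print(sort_by_text(orders))     # ['A', 'B']
print(sort_by_instant(orders))  # ['B', 'A']
```

Both functions are two lines long and both "work" on data from a single timezone, which is exactly why this sort of thing slips past a skim.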

QuiEgo

You read it. You now have an infinite army of overconfident slightly drunken new college grads to throw at any problem.

Sometimes you’re gonna want to slowly back away from them and write things yourself. Sometimes you can farm out work to them.

Code review their work as you would anyone else’s, in fact more so.

My rule of thumb has been that it takes one senior engineer for every 4 new grads to mentor them and code review their work. Or put another way, bringing on a new grad gets you +1 output at the cost of -0.25 of a senior.

Also, there are some tasks you just can’t give new college grads.

Same dynamic seems to be shaping up here. Except the AI juniors are cheap and work 24*7 and (currently) have no hope of growing into seniors.

kaibee

> Same dynamic seems to be shaping up here. Except the AI juniors are cheap and work 24*7 and (currently) have no hope of growing into seniors.

Each individual trained model... sure. But otoh you can look at it as a very wide junior with "infinite (only limited by your budget)" willpower. Sure, three years ago they were GPT-3.5, basically useless. And now they're Opus 4.6. I wonder what the next few years will bring.

saghm

I've only recently started trying out using LLMs to help me write code (as in, within the last two weeks), and the workflow that makes the most sense to me is to not let the LLM anywhere close to PRs/MRs/CRs, or even version control at all. I've found it useful to give it a fairly constrained task (something that might be a 100-200 line modification of my current code), literally watch the output of Claude Code's "thinking" as it goes to potentially interrupt it if it's going down the wrong path or if it gives me a better idea, wait for it to present the code, and then read through all of it to make sure it's what I want. After making whatever small changes I might want, I commit, and then move onto the next thing.

So far, this has pretty much all been for personal side projects outside of work, so there is no code review, but approaching it from the standpoint that the goal is to produce the same code and version control history I would want if I created it by hand, and just using the LLM as a way of automating the typing, I've been pretty surprised that it's already been a net gain in efficiency for a lot of things I've been working on.

Ideally, the code I'm generating shouldn't be distinguishable from what I'm already writing, because I would change it if I saw that it was. At that point, either it's high-quality enough to be merged, or it's not and should be rejected, and that's already how things work in the first place. If someone makes an MR that their coworkers find sloppy and annoying to review, there needs to be pushback, and how it was generated should be irrelevant if everyone is on the same page about where the bar for quality is and is acting in good faith. (If you're working in an environment where there's no bandwidth to care about quality or people are acting in bad faith, LLM code will probably not be much of an improvement, but you're also probably going to have a bad time regardless, and unfortunately I don't think there's a magic bullet for fixing that).

AloysB

Give it a read, he briefly mentions how he uses it for PR triage and resolving GH issues.

He doesn't go into detail, but there is a bit:

> Issue and PR triage/review. Agents are good at using gh (GitHub CLI), so I manually scripted a quick way to spin up a bunch in parallel to triage issues. I would NOT allow agents to respond, I just wanted reports the next day to try to guide me towards high value or low effort tasks.

> More specifically, I would start each day by taking the results of my prior night's triage agents, filter them manually to find the issues that an agent will almost certainly solve well, and then keep them going in the background (one at a time, not in parallel).

This is a short excerpt, this article is worth reading. Very grounded and balanced.
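
For the curious, the quoted triage setup could look roughly like the sketch below. To be clear, this is my guess at the shape of it, not Mitchell's script: `triage_prompt`, `triage_all` and the `run_agent` callable are hypothetical stand-ins for whatever agent you launch, and the only real tool involved is `gh issue list` with its `--state`, `--limit` and `--json` options.

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor


def list_open_issues(limit=20):
    """Fetch open issues for the current repo via the GitHub CLI."""
    result = subprocess.run(
        ["gh", "issue", "list", "--state", "open",
         "--limit", str(limit), "--json", "number,title"],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def triage_prompt(issue):
    """Build a read-only triage request for one issue (wording is made up)."""
    return (
        f"Read issue #{issue['number']} ({issue['title']}). "
        "Do NOT comment or respond on GitHub. "
        "Write a short report: likely cause, estimated effort, "
        "and whether an agent could probably fix it."
    )


def triage_all(issues, run_agent, workers=4):
    """Run one agent per issue in parallel and collect the reports."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        reports = pool.map(lambda issue: run_agent(triage_prompt(issue)), issues)
        return {issue["number"]: report for issue, report in zip(issues, reports)}
```

The filtering step he describes stays manual: you read the reports the next morning and pick which issues to hand to a single background agent.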

datsci_est_2015

Okay I think this somewhat answers my question. Is this individual a solo developer? “Triaging GitHub issues” sounds a bit like open source solo developer.

Guess I’m just desperate for an article about how organizations are actually speeding up development using agentic AI. Like very practical articles about how existing development processes have been adjusted to facilitate agentic AI.

I remain unconvinced that agentic AI scales beyond solo development, where the individual is liable for the output of the agents. More precisely, I can use agentic AI to write my code, but at the end of the day when I submit it to my org it’s my responsibility to understand it, and guarantee (according to my personal expertise) its security and reliability.

Conversely, I would fire (read: reprimand) someone so fast if I found out they submitted code that created a vulnerability that they would have reasonably caught if they weren’t being reckless with code submission speed, LLM or not.

AI will not revolutionize SWE until it revolutionizes our processes. It will definitely speed us up (I have definitely become faster), but faster != revolution.

Quarrelsome

we're talking about _this_ post? He specifically said he only runs one agent, so sure he probably reviews the code or as he stated finds means of auto-verifying what the agent does (giving the agent a way to self-verify as part of its loop).

eikenberry

The primary point behind code reviews is to let the author know that someone else will look at their code. They are a psychological tool that, AFAIK, doesn't work well with the AI models. If the code is important enough that you want to review it then you should probably be using a different, more interactive flow.

Mitchell talks about this in a roundabout way... in the "Reproduce your own work" section he obviously reviewed that code as that was the point. In the "End-of-day agents" section he talks about what he found them good for (so far). He previously wrote about how he preferred an interactive style and this article aligns with that with his progress understanding how code agents can be useful.

tptacek

So read the code.

datsci_est_2015

Cool, code review continues to be one of the biggest bottlenecks in our org, with or without agentic AI pumping out 1k LOC per hour.

IhateAI

[flagged]

codyb

I think this is the crux of why, when used as an enhancement to solo productivity, you'll have a pretty strict upper bound on productivity gains given that it takes experienced engineers to review code that goes out at scale.

That being said, software quality seems to be decreasing, or maybe it's just cause I use a lot of software in a somewhat locked down state with adblockers and the rest.

Although, that wouldn't explain just how badly they've murdered the once lovely iTunes (now Apple Music) user interface. (And why does CMD-C not pick up anything 15% of the time I use it lately...)

Anyways, digressions aside... the complexity in software development is generally in the organizational side. You have actual users, and then you have people who talk to those users and try to see what they like and don't like in order to distill that into product requirements which then have to be architected, and coordinated (both huge time sinks) across several teams.

Even if you cut out 100% of the development time, you'd still be left with 80% of the timeline.

Over time though... you'll probably see people doing what I do all day (which is move around among many repositories (although I've yet to use the AI much, got my Cursor license recently and am gonna spin up some POCs that I want to see soon)), enabled by their use of AI to quickly grasp what's happening in the repo, and the appropriate places to make changes.

Enabling developers to complete features from tip to tail across deep, many-pronged service architectures could bring project time down drastically and bring project management and cross-team coordination costs down tremendously.

Similarly, in big companies, the hand is often barely aware at best of the foot. And space exploration is a serious challenge. Often folk know exactly one step away, and rely on well established async communication channels which also only know one step further. Principal engineers seem to know large amounts about finite spaces and are often in the dark small hops away to things like the internal tooling for the systems they're maintaining (and often not particularly great at coming in to new spaces and thinking with the same perspective... no we don't need individual micro services for every 12 request a month admin api group we want to set up).

Once systems can take a feature proposal and lay out concrete plans which each little kingdom can give a thumbs up or thumbs down to for further modifications, you can again reduce exploration, coordination, and architecture time down.

Sadly, seems like User Experience design is an often terribly neglected part of our profession. I love the memes about an engineer building the perfect interface like a water pitcher only for the person to position it weirdly in order to get a pour out of the fill hole or something. Lemme guess how many users you actually talked to (often zero), and how many layers of distillation occurred before you received a micro picture feature request that ends up being built and taking input from engineers with no macro understanding of a user's actual needs, or day to day.

And who often are much more interested in perfecting some little algorithm than thinking about enabling others.

So my money is on money flowing to...

  - People who can actually verify system integrity, and can fight fires and bugs (but a lot of bug fixing will eventually become prompting?)
  - Multi-talented individuals who can, say, interact with users well enough to understand their needs as well as do a decent job verifying system architecture and security

It's outside of coding where I haven't seen much... I guess people use it to more quickly scaffold up expense reports, or generate mocks. So, lots of white collar stuff. But... it's not like the experience of shopping at the supermarket has changed, or going to the movies, or much of anything else.

svilen_dobrev

let me ask a stupid/still-ignorant question - about repeatability.

If one asks this generator/assistant same request/thing, within same initial contexts, 10 times, would it generate same result ? in different sessions and all that.

because.. if not, then it's for once-off things only..
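
A toy sketch of where the variation comes from, under heavy assumptions: the three "tokens" and their scores are made up, and real inference stacks have other sources of run-to-run differences beyond sampling. The point is only that always picking the top-scoring token repeats exactly, while sampling from the same scores with different random seeds does not.

```python
import math
import random


def softmax(logits, temperature):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_greedy(tokens, logits):
    """Always take the highest-scoring token."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return tokens[best]


def pick_sampled(tokens, logits, temperature, rng):
    """Draw one token at random, weighted by its probability."""
    return rng.choices(tokens, weights=softmax(logits, temperature), k=1)[0]


tokens = ["for", "while", "map"]
logits = [2.0, 1.5, 0.5]  # made-up scores for illustration

greedy_runs = {pick_greedy(tokens, logits) for _ in range(10)}
sampled_runs = {
    pick_sampled(tokens, logits, 1.0, random.Random(seed)) for seed in range(10)
}

print(greedy_runs)   # {'for'}
print(sampled_runs)  # more than one distinct token across the ten seeds
```

Fixing the seed makes the sampled version repeatable too, which is roughly the knob people mean when they talk about reproducible generations, where the provider exposes one.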

Robin_Message

If I asked you for the same thing 10 times, wiping your memory each time, would you generate the same result?

And why does it matter anyway? If the code passes the tests and you like the look of it, it's good. It doesn't need to be existentially complicated.

lins1909

A pretty bad comparison. If I gave you the correct answer once, it's unlikely that I'll give you a wrong answer the next time. Also, aren't computers supposed to be more reliable than us? If I'm going to use a tool that behaves just like humans, why not just use my brain instead?

beoberha

Your sentiment resonates with me a lot. I wonder what we’ll consider the inflection point 10 years from now. It seemed like the zeitgeist was screaming about scaling limits and running out of training data, then we got Claude Code, Sonnet 4.5, then Opus 4.5 and no one’s looked back since.

libraryofbabel

I wonder too. It might be that progress on the underlying models is going to plateau, or it might be that we haven't yet reached what in retrospect will be the biggest inflection point. Technological developments can seem to make sense in hindsight as a story of continuous progress when the dust has settled and we can write and tell the history, but when you go back and look at the full range of voices in the historical sources you realize just how deeply nothing was clear to anyone at all at the time it was happening because everyone was hurtling into the unknown future with a fog of war in front of them. In 1910 I'd say it would have been perfectly reasonable to predict airplanes would remain a terrifying curiosity reserved for daredevils only (and people did); or conversely, in the 1960s a lot of commentators thought that the future of passenger air travel in the 70s and 80s would be supersonic jets. I keep this in mind and don't really pay too much attention to over-confident predictions about the technological future.

tmtvl

I will give Claude Code a trial run if I can run it locally without an internet connection. AI companies have procured so much training data through illegal means you have to be insane to trust them in even the smallest amount.

wiether

You can run OpenCode in a container restricted to local network only and communicating with local/self-hosted models.

Claude Code is linked to Anthropic's hosted models so you can't achieve this.

fullstackchris

this is such a strawman argument. what are they going to take from you? your triple for loop? they literally own the weights for a neural net that scores 77% on SWE. they don't need, nor care about, your code

alternatex

They are trained on our code. Perhaps not if you don't have any of it open sourced, but it's so jarring to see someone say they don't care about our code.

zamadatix

Should AI tools use memory safe tabs or spaces for indentation? :)

It is a shame it's become such a polarized topic. Things which actually work fine get immediately bashed by large crowds at the same time things that are really not there get voted to the moon by extremely eager folks. A few years from now I expect I'll be thinking "man, there was some really good stuff I missed out on because the discussions about it were so polarized at the time. I'm glad that has cleared up significantly!"

majormajor

GPT-4 showed the potential but the automated workflows (context management, loops, test-running) and pure execution speed to handle all that "reasoning"/workflows (remember watching characters pop in slowly in GPT-4 streaming API response calls) are gamechangers.

The workflow automation and better (and model-directed) context management are all obvious in retrospect but a lot of people (like myself) were instead focused on IDE integration and such vs `grep` and the like. Maybe multi-agent with task boards is the next thing, but it feels like that might also start to outrun the ability to sensibly design and test new features for non-greenfield/non-port projects. Who knows yet.

I think it's still very valuable for someone to dig in to the underlying models periodically (insomuch as the APIs even expose the same level of raw stuff anymore) to get a feeling for what's reliable to one-shot vs what's easily correctable by a "ran the tests, saw it was wrong, fixed it" loop. If you don't have a good sense of that, it's easy to get overambitious and end up with something you don't like if you're the sort of person who cares at all about what the code looks like.

a456463

It is perfectly valid that this issue is polarizing. On the one hand we have blind cargo culters and on the other hand we have "luddites". Being in one or the other "tribe" is cause for getting insulted or called out, because the cargo culters want everyone to do what they are doing, just like the RTO crowd. The skeptics want to take a more reasonable pace. One side says we are done, this is the future; the other side doesn't see the same results happening for them, but the cargo culters think in absolutes: it is all or nothing. All these other posts waxing and waning and insulting the skeptics are frankly insulting.

mjr00

> Break down sessions into separate clear, actionable tasks. Don't try to "draw the owl" in one mega session.

This is the key one I think. At one extreme you can tell an agent "write a for loop that iterates over the variable `numbers` and computes the sum" and they'll do this successfully, but the scope is so small there's not much point in using an LLM. On the other extreme you can tell an agent "make me an app that's Facebook for dogs" and it'll make so many assumptions about the architecture, code and product that there's no chance it produces anything useful beyond a cool prototype to show mom and dad.

A lot of successful LLM adoption for code is finding this sweet spot. Overly specific instructions don't make you feel productive, and with overly broad instructions you end up redoing too much of the work.

sho_hn

This is actually an aspect of using AI tools I really enjoy: Forming an educated intuition about what the tool is good at, and tastefully framing and scoping the tasks I give it to get better results.

It cognitively feels very similar to other classic programming activities, like modularization at any level from architecture to code units/functions, thoughtfully choosing how to lay out and chunk things. It's always been one of the things that make programming pleasurable for me, and some of that feeling returns when slicing up tasks for agents.

bandrami

"Become better at intuiting the behavior of this non-deterministic black box oracle maintained by a third party" just isn't a strong professional development sell for me, personally. If the future of writing software is chasing what a model trainer has done with no ability to actually change that myself I don't think that's going to be interesting to nearly as many people.

mjr00

It sounds like you're talking more about "vibe coding" i.e. just using LLMs without inspecting the output. That's neither what the article nor the people to whom you're replying are saying. You can (and should) heavily review and edit LLM generated code. You have the full ability to change it yourself, because the code is just there and can be edited!

dcre

I think this is underrating the role of intuition in working effectively with deterministic but very complex software systems like operating systems and compilers. Determinism is a red herring.

chii

Whether it's interesting or not is irrelevant to whether it produces usable output that could be economically valuable.

allenu

I agree that framing and scoping tasks is becoming a real joy. The great thing about this strategy is there's a point at which you can scope something small enough that it's hard for the AI to get it wrong and it's easy enough for you as a human to comprehend what it's done and verify that it's correct.

I'm starting to think of projects now as a tree structure where the overall architecture of the system is the main trunk and from there you have the sub-modules, and eventually you get to implementations of functions and classes. The goal of the human in working with the coding agent is to have full editorial control of the main trunk and main sub-modules and delegate as much of the smaller branches as possible.

Sometimes you're still working out the higher-level architecture, too, and you can use the agent to prototype the smaller bits and pieces which will inform the decisions you make about how the higher-level stuff should operate.

audience_mem

[Edit: I may have been replying to another comment in my head as now I re-read it and I'm not sure I've said the same thing as you have. Oh well.]

I agree. This is how I see it too. It's more like a shortcut to an end result that's very similar to (or much better than) what I would've reached through typing it myself.

The other day I did realise that I'm using my experience to steer it away from bad decisions a lot more than I noticed. It feels like it does all the real work, but I have to remember it's my/our (decades of) experience writing code playing a part also.

I'm genuinely confused when people come in at this point and say that it's impossible to do this and produce good output and end results.

meowface

I feel the same, but, also, within like three years this might look very different. Maybe you'll give the full end-to-end goal upfront and it just polls you when it needs clarification or wants to suggest alternatives, and it self-manages cleanly self-delegating.

Or maybe something quite different but where these early era agentic tooling strategies still become either unneeded or even actively detrimental.

zxor

> it just polls you when it needs clarification

I think anyone who has worked on a serious software project would say, this means it would be polling you constantly.

Even if we posit that an LLM is equivalent to a human, humans constantly clarify requirements/architecture. IMO on both of those fronts the correct path often reveals itself over time, rather than being knowable from the start.

So in this scenario it seems like you'd be dealing with constant pings and need to really make sure your understanding of the project is growing with the LLM's development efforts as well.

To me this seems like the best-case of the current technology, the models have been getting better and better at doing what you tell it in small chunks but you still need to be deciding what it should be doing. These chunks don't feel as though they're getting bigger unless you're willing to accept slop.

mapontosevenths

> Break down sessions into separate clear, actionable tasks.

What this misses, of course, is that you can just have the agent do this too. Agents are great at making project plans, especially if you give them a template to follow.

Vinnl

It sounds to me like the goal there is to spell out everything you don't want the agent to make assumptions about. If you let the agent make the plan, it'll still make those assumptions for you.

swordsith

If you've got a plan for the plan, what else could you possibly need!

mlrtime

You joke, but the more I iterate on a plan before any code, the more successful the first pass is.

1) Tell claude my idea with as much as I know, ask it to ask me questions. This could go on for a few rounds. (Opus)

2) Run a validate skill on the plan, reviewer with a different prompt (Opus)

3) codex reviews the plan, always finds a few small items after the above 2.

4) claude opus implements in 1 shot, usually 99% accurate, then I manually test.

If I stay on target with those steps I always have good outcomes, but it is time consuming.

apercu

I actually enjoy writing specifications. So much so that I made it a large part of my consulting work for a huge part of my career. So it makes sense that working with Gen-AI that way is enjoyable for me.

The more detailed I am in breaking down chunks, the easier it is for me to verify and the more likely I am going to get output that isn't 30% wrong.

iamacyborg

> On the other extreme you can tell an agent "make me an app that's Facebook for dogs" and it'll make so many assumptions about the architecture, code and product that there's no chance it produces anything useful beyond a cool prototype to show mom and dad.

Amusingly, this was my experience in giving Lovable a shot. The onboarding process was literally just setting me up for failure by asking me to describe the detailed app I was attempting to build.

Taking it piece by piece in Claude Code has been significantly more successful.

oulipo2

Exactly. The LLMs are quite good at "code inpainting", eg "give me the outline/constraints/rules and I'll fill-in the blanks"

But not so good at making (robust) new features out of the blue

jedbrooke

so many times I catch myself asking a coding agent e.g. “please print the output” and it will update the file with “print(output)”.

Maybe there’s something about not having to context switch between natural language and code that just makes it _feel_ easier sometimes

kcorbitt

And lately, the sweet spot has been moving upwards every 6-8 weeks with the model release cycle.

EastLondonCoder

This matches my experience, especially "don’t draw the owl" and the harness-engineering idea.

The failure mode I kept hitting wasn’t just "it makes mistakes", it was drift: it can stay locally plausible while slowly walking away from the real constraints of the repo. The output still sounds confident, so you don’t notice until you run into reality (tests, runtime behaviour, perf, ops, UX).

What ended up working for me was treating chat as where I shape the plan (tradeoffs, invariants, failure modes) and treating the agent as something that does narrow, reviewable diffs against that plan. The human job stays very boring: run it, verify it, and decide what’s actually acceptable. That separation is what made it click for me.

Once I got that loop stable, it stopped being a toy and started being a lever. I’ve shipped real features this way across a few projects (a git-like tool for heavy media projects, a ticketing/payment flow with real users, a local-first genealogy tool, and a small CMS/publishing pipeline). The common thread is the same: small diffs, fast verification, and continuously tightening the harness so the agent can’t drift unnoticed.
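One way the "small diffs" part could be enforced mechanically is a size gate on whatever the agent proposes. This is only a sketch of the idea in Python: the function names and the 200-line limit are hypothetical, not something from the post, and the header check is deliberately naive.

```python
def changed_lines(diff_text: str) -> int:
    """Count added/removed lines in a unified diff, skipping file headers."""
    count = 0
    for line in diff_text.splitlines():
        # Simplification: a removed line whose content starts with "--"
        # would also look like a header and be skipped here.
        if line.startswith(("+++", "---")):
            continue
        if line.startswith(("+", "-")):
            count += 1
    return count


def small_enough(diff_text: str, limit: int = 200) -> bool:
    """Hypothetical gate: reject diffs too large to review comfortably."""
    return changed_lines(diff_text) <= limit


sample = """--- a/app.py
+++ b/app.py
@@ -1,3 +1,3 @@
 import os
-print("old")
+print("new")
"""
print(changed_lines(sample))  # 2
print(small_enough(sample))   # True
```

A check like this only catches size, not drift itself; tests and manual verification still do the real work.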

protocolture

>The failure mode I kept hitting wasn’t just "it makes mistakes", it was drift: it can stay locally plausible while slowly walking away from the real constraints of the repo. The output still sounds confident, so you don’t notice until you run into reality (tests, runtime behaviour, perf, ops, UX).

Yeah, I would get patterns where initial prototypes were promising, then we developed something that was 90% of the way to the design goals, and then as we tried to push in the last 10%, drift would start breaking down, or even just forgetting, the 90%.

So I would get to 90% and basically start a new project with that as the baseline to add to.

ricardobeat

No harm meant, but your writing is very reminiscent of an LLM. It is great actually, there is just something about it - "it wasn't.. it was", "it stopped being.. and started". Claude and ChatGPT seem to love these juxtapositions. The triplets on every other sentence. I think you are a couple em-dashes away from being accused of being a bot.

These patterns seem to be picking up speed in the general population; makes the human race seem quite easily hackable.

pixl97

>makes the human race seem quite easily hackable.

If the human race were not hackable then society would not exist, we'd be the unchanging crocodiles of the last few hundred million years.

Have you ever found yourself speaking a meme? Had a catchy tune repeating in your head? Started spouting nation state level propaganda? Found yourself in a crowd trying to burn a witch at the stake?

Hacking the flow of human thought isn't that hard, especially across populations. Hacking any one particular human's thoughts is harder unless you have a lot of information on them.

direwolf20

How do I hack the human population to give me money, and simultaneously, hack law enforcement to not arrest me?

bdangubic

This is the most common answer from people that are rocking and rolling with AI tools, but I cannot help but wonder how this is different from how we should have built software all along. I know I have been (after 10+ years…)

EastLondonCoder

I think you are right, the secret is that there is no secret. The projects I have been involved with that have been most successful used these techniques. I also think experience helps, because you develop a sense that very quickly tells you if the model wants to go in a wonky direction and what a good spec looks like.

With where the models are right now you still need a human in the loop to make sure you end up with code you (and your organisation) actually understand. The bottleneck has gone from writing code to reading code.

sksisksbbs

> The bottleneck has gone from writing code to reading code.

This has always been the bottleneck. Reviewing code is much harder and gets worse results than writing it, which is why reviewing AI code is not very efficient. The time required to understand code far outstrips the time to type it.

Most devs don’t do thorough reviews. Check the variable names seem ok, make sure there’s no obvious typos, ask for a comment and call it good. For a trusted teammate this is actually ok and why they’re so valuable! For an AI, it’s a slot machine and trusting it is equivalent to letting your coworkers/users do your job so you can personally move faster.

miyuru

This is what I experienced as well.

These are some tricks I use now.

1. Write a generic prompt about the project and software versions and keep it in the folder. (I think this is getting pushed as SKILLS.md now)

2. In the prompt add instructions to add comments on changes. Since our main job is to validate and fix any issues, it makes that easier.

3. Find the best model for the specific workflow. For example, these days I find that Gemini Pro is good for HTML UI stuff, while Claude Sonnet is good for Python code. (This is why subagents are getting popular)

apitman

Would love to hear more about your genealogy app.

senko

For those wondering how that looks in practice, here's one of OP's past blog posts describing a coding session to implement a non-trivial feature: https://mitchellh.com/writing/non-trivial-vibing (covered on HN here: https://news.ycombinator.com/item?id=45549434)

kyoji

This was a great post, one of the best I've seen on this topic at HN.

But why is the cost never discussed or disclosed in these conversations? I feel like I'm going crazy, there is so much written extolling the virtues of these tools but with no mention of what it costs to run them now. It will surely only get more expensive from here!

wtetzner

> But why is the cost never discussed or disclosed in these conversations?

And not just the monetary cost of accessing the tools, but the amount of time it takes to actually get good results out. I strongly suspect that even though it feels more productive, in many cases things just take longer than they would if done manually.

I think there are really good uses for LLMs, but I also think that people are likely using them in ways that feel useful, but end up being more costly than not.

quarkz14

Indeed, most of us are probably limited by what our companies let us use. Not to mention that not everyone can afford to use AI tooling on their own time without thinking about the cost, assuming you want to build something your company can't claim as its own IP.

lysace

The current realistic lower bound for actual work is the $100/€90/month Claude Max ("5x") plan. It allows roughly enough usage for a typical working month (4.25 x 40-50h). "Single-threaded", interactive usage with normal human breaks, sort of.
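As a back-of-the-envelope check, here is that arithmetic as a few lines of Python. It only uses the figures above ($100/month, 4.25 weeks, 40-50h weeks); the function name is just for illustration, and it says nothing about whether the quota actually lasts that long for a given workload.

```python
def cost_per_hour(monthly_price: float, weeks_per_month: float, hours_per_week: float) -> float:
    """Effective price per working hour for a flat monthly plan."""
    return monthly_price / (weeks_per_month * hours_per_week)


# Figures from the comment above: $100/month, 4.25 weeks per month, 40-50h weeks.
low = cost_per_hour(100, 4.25, 50)   # more hours used, so cheaper per hour
high = cost_per_hour(100, 4.25, 40)
print(f"${low:.2f} to ${high:.2f} per working hour")  # $0.47 to $0.59 per working hour
```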

There are two usage quota windows to be aware of: 5h and 7d. I use https://github.com/richhickson/claudecodeusage (Mac) to keep track of the status. It shows green/yellow/red and a percentage in the menu bar.

mi_lk

Is there any guidance on when the API vs. a subscription is the better deal?

fusslo

the first time I did work as the article suggests I used my monthly allowance in a day.

Apparently out of 3-5k people with access to our AI tools, there's fewer than a handful of us REALLY using it. Most are asking questions in the chatbot style.

Anyway, I had to ask my manager, the AI architect, and the Tooling Manager for approval to increase my quota.

I asked everyone in the chain how many equivalent dollars I am allocated, and how much the increase was, and no one could tell me.

mbesto

Honestly, the costs are so minimal and vary wildly relative to the cost of a developer that it's frankly not worth the discussion...yet. The reality is the standard deviation of cost is going to oscillate until there is a common agreed upon way to use these tools.

lysace

Yes, but the lack of clear pricing probably makes people think it's more expensive than it actually is. (It did so to me.)

There is nothing quantifiable here: https://claude.com/pricing

Pro: "Everything in Free, plus: More usage"

Max: "Choose 5x or 20x more usage than Pro"

Wow, 5x or 20x more of "more". That's some masterful communication right there.

bsder

> Honestly, the costs are so minimal and vary wildly relative to the cost of a developer that it's frankly not worth the discussion...yet

Is it? Sure, the chatbot style maxes at $200/month. I consider that ... not unreasonable ... for a professional tool. It doesn't make me happy, but it's not horrific.

The article, however, explicitly pans the chatbot style and is extolling the API style being accessed constantly by agents, and that has no upper bound. Roughly $10-ish per Megatokens. $10-ish per 1K web searches. etc.

This doesn't sound "minimal" to me. This sounds like every single "task" I kick off is $10. And it can kick those tasks and costs off very quickly in an automated fashion. It doesn't take many of those tasks before I'm paying more than an actual full developer.
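To put rough numbers on that, a minimal sketch in Python. It assumes a flat $10 per million tokens (the ballpark above, not an actual price list) and about a million tokens per task, which is what "$10 per task" implies; the helper names are made up for illustration.

```python
PRICE_PER_MTOK = 10.0  # assumed flat rate, dollars per million tokens


def task_cost(tokens: int, price_per_mtok: float = PRICE_PER_MTOK) -> float:
    """Dollar cost of one task that consumes the given number of tokens."""
    return tokens / 1_000_000 * price_per_mtok


def tasks_until_budget(budget: float, tokens_per_task: int) -> int:
    """How many whole tasks fit into a dollar budget."""
    return int(budget // task_cost(tokens_per_task))


print(task_cost(1_000_000))                # 10.0
print(tasks_until_budget(200, 1_000_000))  # 20
```

Under those assumptions, 20 automated tasks already match the $200/month subscription ceiling.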

Ref: https://claude.com/pricing#api

sho_hn

Much more pragmatic and less performative than other posts hitting frontpage. Good article.

alterom

Finally, a step-by-step guide for even the skeptics to try to see what spot the LLM tools have in their workflows, without hype or magic like "I vibe-coded an entire OS, and you can too!"

noisy_boy

I still use the chatbot but like to do it outside-in. Provide what I need, and instruct it to not write any code except the api (signatures of classes, interfaces, hierarchy, essential methods etc). We keep iterating about this until it looks good - still no real code. Then I ask it to do a fresh review of the broad outline, any issues it foresees etc. Then I ask it to write some demonstrator test cases to see how ergonomic and testable the code is - we fine tune the apis but nothing is fleshed out yet. Once this is done, we are done with the most time consuming phase.

After that is basically just asking it to flesh out the layers starting from zero dependencies to arriving at the top of the castle. Even if we have any complexities within the pieces or the implementation is not exactly as per my liking, the issues are localised - I can dive in and handle it myself (most of the time, I don't need to).

I feel like this approach works very well for me because I end up with a mental model of how things are connected, since most of my time was spent on that model.
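A minimal sketch of what that first phase might look like, in Python. The `RateLimiter` example and every name in it are hypothetical; the point is only that signatures and a demonstrator test exist before any real implementation does.

```python
from typing import Protocol


class RateLimiter(Protocol):
    """API outline only: the signature is agreed on, the body comes later."""

    def allow(self, key: str) -> bool: ...


class AlwaysAllow:
    """Throwaway placeholder so the demonstrator test can run today."""

    def allow(self, key: str) -> bool:
        return True


def demo_allows_first_request(limiter: RateLimiter) -> bool:
    """Demonstrator test: checks how ergonomic the API is to call."""
    return limiter.allow("user-1")


print(demo_allows_first_request(AlwaysAllow()))  # True
```

Once the outline feels right, the placeholder gets replaced layer by layer while the demonstrator tests stay the same.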

scarrilho

With so much noise in the AI world and constant model updates (just today GPT-5.3-Codex and Claude Opus 4.6 were announced), this was a really refreshing read. It’s easy to relate to his phased approach to finding real value in tooling and not just hype. There are solid insights and practical tips here. I’m increasingly convinced that the best way not to get overwhelmed is to set clear expectations for what you want to achieve with AI and tailor how you use it to work for you, rather than trying to chase every new headline. Very refreshing.

keyle

It's amusing how everyone seems to be going through the same journey.

I do run multiple models at once now. On different parts of the code base.

I focus solely on the less boring tasks for myself and outsource all of the slam dunks, then review. I often use another model to validate the previous model's work while doing so myself.

I do git reset still quite often but I find more ways to not get to that point by knowing the tools better and better.

Autocompleting our brains! What a crazy time.

i_love_retros

How much does it cost per day to have all these agents running on your computer?

Is your company paying for it or you?

What is your process if the agent writes a piece of code, let's say a really complex recursive function, and you aren't confident you could have come up with the same solution? Do you still submit it?

paracyst

The guy who wrote the post is a billionaire

gh0stcat

I thought this was a joke ie you need to be a billionaire to be able to use agents like this, but you are correct.

I think we need to stop listening to billionaires. The article is well thought out and well written, but his perspective is entirely biased by never having to think about money at all... all of this stuff is incredibly expensive.

i_love_retros

Billionaires also tend to have a vested interest in the tech being hyped and adopted, after all one doesn't become a billionaire without investments.

eikenberry

Is he? Source? I didn't think he made that much from the Hashicorp sale.

i_love_retros

Oh, never heard of him!

sublimefire

Very much the same experience. But it does not talk much about the project setup and its influence on session success. In narrowly scoped projects it works really well, especially when tests are easy to execute. I found that this approach melts down when facing enterprise software with large repositories and unconventional layouts. Then you need to do a bunch of context management upfront, and write verbose instructions for evaluations. But we know that what it needs is a refactor, that's all.

And the post touches on a next type of a problem, how to plan far ahead of time to utilise agents when you are away. It is a difficult problem but IMO we’re going in a direction of having some sort of shared “templated plans”/workflows and budgeted/throttled task execution to achieve that. It is like you want to give a little world to explore so that it does not stop early, like a little game to play, then you come back in the morning and check how far it went.

hollowturtle

I don't understand how Agents make you feel productive. Single/multiple agents reading specs, specs often produced with agents themselves and iterated over time with a human in the loop, a lot of reviewing of giant gibberish specs. Never had a clear spec in my life. Then all the dancing for this apparently new paradigm of not reviewing code but verifying behaviour, and so many other things. All of this to me is a total UNproductive mess. I have used Cursor autocomplete from day one to this day. I was super productive before LLMs, I'm more productive now, I'm capable, I have experience, the product is hard to maintain but customers are happy, management is happy. So I can't really relate anymore to many of the programmers out there, and that's sad. I can count on my hands the devs I can talk to that have hard skills and know-how to share instead of astroturfing about AI Agents

wiether

> Never had a clear spec in my life.

To me part of our job has always been about translating garbage/missing specs into something actionable.

Working with agents doesn't change this, and that's why until PM/business people are able to come up with actual specs, they'll still need their translators.

Furthermore, just because the global spec is garbage doesn't mean that you, as a dev, won't come up with clear specs to solve technical issues related to the overall feature asked for by stakeholders.

One funny thing I see though, in the AI presentations done for non-technical people, is the advice: "be as thorough as possible when describing what you expect the agent to solve!". And I'm like: "yeah, that's what devs have been asking for since forever...".

hollowturtle

With "Never had a clear spec in my life" what I mean is also that I don't know how something should come out till I'm actually doing it. Writing code for me leads to discovery. I don't know what to produce till I see it in the surrounding context, like what a function should accept, for example a ref or a copy. Only at that point do I have the proper intuition to make a decision that has to be supported long term. I don't want cheap code now, I want a solid feature working tomorrow and, hopefully, not to touch it for a long time

maqnius

In my real life bubble, AI isn't a big deal either, at least for programmers. They tend to be very sceptical about it for many reasons, perceived productivity being only one of them. So, I guess it's much less of a thing than you would expect from media coverage and certain internet communities.

hollowturtle

Are you hiring?

maqnius

Open to applications I would say, but not completely remote. So unless you're a Python or C/C++ dev living in NRW, Germany..

elAhmo

> Never had a clear spec in my life.

Just because you haven't or you work in a particular way, doesn't mean everyone does things the same way.

Likewise, on your last point, just because someone is using AI in their work, doesn't mean they don't have hard skills and know-how. Author of this article Mitchell is a great example of that - someone who proved to be able to produce great software and, when talking about individuals who made a dent in the industry, definitely had/has an impactful career.

hollowturtle

Never mentioned Mitchell. I'm speaking generally; 95% of the industry is not Mitchell

elAhmo

Well, you are commenting on a post he wrote.
