LLMs work best when the user defines their acceptance criteria first

blog.katanaquant.com

Daily Digest email

Get the top HN stories in your inbox every day.

pornel

Their default solution is to keep digging. It has a compounding effect of generating more and more code.

If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

If you tell them the code is slow, they'll try to add optimized fast paths (more code), specialized routines (more code), custom data structures (even more code). And then add fractally more code to patch up all the problems that code has created.

If you complain it's buggy, you can have 10 bespoke tests for every bug. Plus a new mocking framework created every time the last one turns out to be unfit for purpose.

If you ask to unify the duplication, it'll say "No problem, here's a brand new metamock abstract adapter framework that has a superset of all feature sets, plus two new metamock drivers for the older and the newer code! Let me know if you want me to write tests for the new adapters."

unlikelytomato

This is why I'm confused when people say it isn't ready to replace most of the programmer workforce.

sonofhans

I love that you’re getting straightforward replies to this absolutely sick burn. The blade is so sharp that some people aren’t even feeling it.

Foobar8568

LLM code is higher quality than any codes I have seen in my 20 years in F500. So yeah you need to "guide" it, and ensure that it will not bypass all the security guidance for ex...But at least you are in control, although the cognitive load is much higher as well than just "blind trust of what is delivered".

But I can see the carnage with offshoring+LLM, or "most employees", including so call software engineer + LLM.

_0ffh

Huh, that explains a lot about the F500, and their buzzword slogans like "culture of excellence".

LLM code is still mostly absurdly bad, unless you tell it in painstaking detail what to do and what to avoid, and never ask it to do a bigger job at a time than a single function or very small class.

Edit: I'll admit though that the detailed explanation is often still much less work than typing everything yourself. But it is a showstopper for autonomous "agentic coding".

112233

Uhuh. Let me present you Rudolph. For the next 15 minutes, he will paste pieces of top rated SO answers and top starred GH repos. Then he will suffer complete amnesia. He might not understand your question or remember what he just did, but the code he pastes is higher quality than any codes you have seen in your 20 years in F500! For 20$ a month, he's all yours, he just needs a 4 hour break every 5 hours. But he runs on money, like gumball machine, so you can wake him with a donation. Oh, you are responsible for giving him precise instructions, that he often ignores in favour of other instructions from uncle Sam. No, you can't see them.

thesz

  > LLM code is higher quality than any codes I have seen in my 20 years in F500.

"Any codes"?

m3kw9

Offshoring pretty much guarantees a couple vibe coders will be there to operate

mettamage

Giving it prompts of the Shannon project helps for security

queenkjuul

You've worked at some shitty places. Nothing I've seen from Claude matches even my worst coworker (and my last job was an F500)

empath75

If you a) know what you are doing and b) know what an llm is capable of doing, c) can manage multiple llm agents at a time, you can be unbelievably productive. Those skills I think are less common than people assume.

You need to be technical, have good communication skills, have big picture vision, be organized, etc. If you are a staff level engineer, you basically feel like you don’t need anyone else.

OTOH i have been seeing even fairly technical engineering managers struggle because they can’t get the LLMs to execute because they don’t know how to ask it what to do.

jurgenburgen

> can manage multiple llm agents at a time

How is that supposed to work? Humans are notoriously poor at multi-tasking. If you spend all day context switching between agents you’re going to have a bad time.

awinter-py

it's like that '11 rules for showrunning' doc where you need to operate at a level where you understand the product being made, and the people making it, and their capabilities, in order to make things come out well without touching them directly.

(https://okbjgm.weebly.com/uploads/3/1/5/0/31506003/11_laws_o...)

if you can do every job + parallelize + read fast, and you are only limited by the time it takes to type, claude is remarkable. I'm not superhuman in those ways but in the small domains where I am it has helped a lot; in other domains it has ramped me to 'working prototype' 10x faster than I could have alone, but the quality of output seems questionable and I'm not smart enough to improve it

danparsonson

Yeah that describes most legacy codebases I've worked on XD

lwansbrough

For me, I'll do the engineering work of designing a system, then give it the specific designs and constraints. I'll let it plan out the implementation, then I give it notes if it varies in ways I didn't expect. Once we agree on a solution, that's when I set it free. The frontier models usually do a pretty good job with this work flow at this point.

m3kw9

That’s vibe coding and you won’t read more than 20% of the code written that way. You really can’t build complex software that way

iLoveOncall

Really? Because this perfectly explains why it will never replace them: it needs an exact language listing everything required to function as you expect it.

You need code to get it to generate proper code.

abm53

I think GP was a joke about the ability of a typical programmer.

I certainly read it as one and found it funny.

YesBox

Heh, people like to have someone else to blame.

jaredklewis

This comment is why I read HN

stingraycharles

> If you ask to unify the duplication, it'll say "No problem, here's a brand new metamock abstract adapter framework that has a superset of all feature sets, plus two new metamock drivers for the older and the newer code! Let me know if you want me to write tests for the new adapters."

Nevermind the fact that it only migrated 3 out of 5 duplicated sections, and hasn’t deleted any now-dead code.

Mavvie

Sounds like my coworkers.

GeoAtreides

people also piss in rivers, yet dumping raw sewage by million m^3 in the same rivers is generally (less so in uk) frowned upon...

lelanthran

Maybe, but I'd bet a large sum of money that each of your coworkers aren't turning out this drivel at a rate of 3kLoC per hour.

Can you imagine working with someone who produces 100k lines of unmaintainable code in a single sprint?

This is your future.

Foobar8568

That's the reality nobody really wants to say.

0x008

The problem is that you are looking at the code. /s

marginalia_nu

My sense is that the code generation is fast, but then you always need to spend several hours making sure the implementation is appropriate, correct, well tested, based on correct assumptions, and doesn't introduce technical debt.

You need to do this when coding manually as well, but the speed at which AI tools can output bad code means it's so much more important.

ehnto

Well when you write it manually you are doing the review and sanity checking in real time. For some tasks, not all but definitely difficult tasks, the sanity checking is actually the whole task. The code was never the hard part, so I am much more interested in the evolving of AIs real world problem solving skills over code problems.

I think programming is giving people a false impression on how intelligent the models are, programmers are meant to be smart right so being able to code means the AI must be super smart. But programmers also put a huge amount of their output online for free, unlike most disciplines, and it's all text based. When it comes to problem solving I still see them regularly confused by simple stuff, having to reset context to try and straighten it out. It's not a general purpose human replacement just yet.

LPisGood

And it’s slower to review because you didn’t do the hard part of understanding the code as it was being written.

Implicated

You're holding it wrong.

Set the boundaries and guidelines before it starts working. Don't leave it space to do things you don't understand.

ie: enforce conventions, set specific and measurable/verifiable goals, define skeletons of the resulting solutions if you want/can.

To give an example. I do a lot of image similarity stuff and I wanted to test the Redis VectorSet stuff when it was still in beta and the PHP extension for redis (the fastest one, which is written in C and is a proper language extension not a runtime lib) didn't support the new commands. I cloned the repo, fired up claude code and pointed it to a local copy of the Redis VectorSet documentation I put in the directory root telling it I wanted it to update the extension to provide support for the new commands I would want/need to handle VectorSets. This was, idk, maybe a year ago. So not even Opus. It nailed it. But I chickened out about pushing that into a production environment, so I then told it to just write me a PHP run time client that mirrors the functionality of Predis (pure-php implementation of redis client) but does so via shell commands executed by php (lmao, I know).

Define the boundaries, give it guard rails, use design patterns and examples (where possible) that can be used as reference.

xeromal

The same as asking one of your JRs to do something except now it follows instructions a little bit better. Coding has never been about line generation and now you can POC something in a few hours instead of a few days / weeks to see if an idea is dumb.

theshrike79

"Several hours"? How big are your change sets?

If a human dropped a PR on me that took "several hours" to go through (10k+ lines or non-trivial changes), I'd jump in my car and drive to the office just to specifically slap them on the back of the head ffs.

marginalia_nu

This was like 1K LOC? It's not the review that was slow, but the wrestling with the model to get the code to not suck.

vannevar

I'd highly recommend working top down, getting it to outline a sane architecture before it starts coding. Then if one of the modules starts getting fouled up, start with a clean sheet context (for that module) incorporating any cautions or lessons learned from the bad experience. LLMs are not yet good at working and reworking the same code, for the reasons you outline. But they are pretty good at a "Groundhog Day" approach of going through the implementation process over and over until they get it right.

coolius

+1 if you are vibe coding projects from scratch. if the architecture you specify doesn't make sense, the llm will start struggling, the only way out of their misery is mocking tests. the good thing is that a complete rewrite with proper architecture and lessons learned is now totally affordable.

disgruntledphd2

I think the best thing about LLMs is how incredibly easy they make it to build one to throw away.

I've definitely built the same thing a few times, getting incrementally better designs each time.

Implicated

Not trying to be snarky, with all due respect... this is a skill issue.

It's a tool. It's a wildly effective and capable tool. I don't know how or why I have such a wildly different experience than so many that describe their experiences in a similar manner... but... nearly every time I come to the same conclusion that the input determines the output.

> If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

Yes, when the prompt/instructions are overly broad and there's no set of guardrails or guidelines that indicate how things should be done... this will happen. If you're not using planning mode, skill issue. You have to get all this stuff wrapped up and sorted before the implementation begins. If the implementation ends up being done in a "not-so-great" approach - that's on you.

> If you tell them the code is slow

Whew. Ok. You don't tell it the code is slow. Do you tell your coworker "Hey, your code is slow" and expect great results? You ask it to benchmark the code and then you ask it how it might be optimized. Then you discuss those options with it (this is where you do the part from the previous paragraph, where you direct the approach so it doesn't do "no-so-great approach") until you get to a point where you like the approach and the model has shown it understands what's going on.

Then you accept the plan and let the model start work. At this point you should have essentially directed the approach and ensured that it's not doing anything stupid. It will then just execute, it'll stay within the parameters/bounds of the plan you established (unless you take it off the rails with a bunch of open ended feedback like telling it that it's buggy instead of being specific about bugs and how you expect them to be resolved).

> you can have 10 bespoke tests for every bug. Plus a new mocking framework created every time the last one turns out to be unfit for purpose.

This is an area I will agree that the models are wildly inept. Someone needs to study what it is about tests and testing environments and mocking things that just makes these things go off the rails. The solution to this is the same as the solution to the issue of it keeping digging or chasing it's tail in circles... Early in the prompt/conversation/message that sets the approach/intent/task you state your expectations for the final result. Define the output early, then describe/provide context/etc. The earlier in the prompt/conversation the "requirements" are set the more sticky they'll be.

And this is exactly the same for the tests. Either write your own tests and have the models build the feature from the test or have the model build the tests first as part of the planned output and then fill in the functionality from the pre-defined test. Be very specific about how your testing system/environment is setup and any time you run into an issue testing related have the model make a note about that and the solution in a TESTING.md document. In your AGENTS.md or CLAUDE.md or whatever indicate that if the model is working with tests it should refer to the TESTING.md document for notes about the testing setup.

Personally, I focus on the functionality, get things integrated and working to the point I'm ready to push it to a staging or production (yolo) environment and _then_ have the model analyze that working system/solution/feature/whatever and write tests. Generally my notes on the testing environment to the model are something along the lines of a paragraph describing the basic testing flow/process/framework in use and how I'd like things to work.

The more you stick to convention the better off you'll be. And use planning mode.

riffraff

> Whew. Ok. You don't tell it the code is slow. Do you tell your coworker "Hey, your code is slow" and expect great results?

Yes? Why don't you?

They are capable people that just didn't notice something, id I notice some telemetry and tell them "hey this is slow" they are expected to understand the reason(s).

Implicated

So, you observed some telemetry - which would have been some sort of specific metric, right? Wouldn't you communicate that to them as well, not just "it's slow"?

"Hey, I saw that metric A was reporting 40% slower, are you aware already or have any ideas as to what might be causing that?"

Those two approaches are going to produce rather distinctly different results whether you're speaking to a human or typing to a GPU.

bryanrasmussen

Yeah if my co-worker can't start figuring out why the code is slow, with a reasonable reference to what the code in question is, that is a knock against their skills. I would actually expect some ideas as to what the problem is just off the top of their heads, but that the coding agent can't do that isn't a hit against it specifically, this is now a good part of what needs to be done differently.

The suggestion to tell the agent to do performance analysis of the part of the code you think is problematic, and offer suggestions for improvements seems like the proper way to talk to a machine, whereas "hey your code is slow" feels like the proper way to talk to a human.

crazygringo

...no?

"Your code is slow" is essentially meaningless.

A normal human conversation would specify which code/tasks/etc., how long it's currently taking, how much faster it needs to be, and why. And then potentially a much longer conversation about the tradeoffs involved in making in faster. E.g. a new index on the database that will make it gigabytes larger, a lookup table that will take up a ton more memory, etc. Does the feature itself need to be changed to be less capable in order to achieve the speed requirements?

If someone told me "hey your code is slow" and walked away, I'd just laugh, I think. It's not a serious or actionable statement.

zabzonk

Well, I would say something like "We seem to be having some performance issues the business has noticed in the XYZ stuff. Shall we sit down together and see if we can work out if we can improve things?"

scotty79

There was a 20+ person team of well paid, smart (mostly Java) programmers that dealt for months with slow application they were building, that everyone knew was slow. I nagged them for weeks to set up indexes even for small, 100 row tables. Once they did things started running orders of magnitude faster.

Your expectations for people (and LLMs) are way too high.

pornel

My comment was a summary of the situation, not literal prompts I use. I absolutely realize the work needs to be adequately described and agents must be steered in the right direction. The results also vary greatly depending on the task and the model, so devs see different rates of success.

On non-trivial tasks (like adding a new index type to a db engine, not oneshotting a landing page) I find that the time and effort required to guide an LLM and review its work can exceed the effort of implementing the code myself. Figuring out exactly what to do and how to do it is the hard part of the task. I don't find LLMs helpful in that phase - their assessments and plans are shallow and naive. They can create todo lists that seemingly check off every box, but miss the forest for the trees (and it's an extra work for me to spot these problems).

Sometimes the obvious algorithm isn't the right one, or it turns out that the requirements were wrong. When I implement it myself, I have all the details in my head, so I can discover dead-ends and immediately backtrack. But when LLM is doing the implementation, it takes much more time to spot problems in the mountains of code, and even more effort to tell when it's a genuinely a wrong approach or merely poor execution.

If I feed it what I know before solving the problem myself, I just won't know all the gotchas yet myself. I can research the problem and think about it really hard in detail to give bulletproof guidance, but that's just programming without the typing.

And that's when the models actually behave sensibly. A lot of the time they go off the rails and I feel like a babysitter instructing them "no, don't eat the crayons!", and it's my skill issue for not knowing I must have "NO eating crayons" in AGENTS.md.

queenkjuul

Don't worry, Claude ignores my CLAUDE.md and eats crayons anyway

brabel

Great answer, and the reason some people have bad experiences is actually patently clear: they don’t work with the AI as a partner, but as a slave. But even for them, AI is getting better at automatically entering planning mode, asking for clarification (what exactly is slow, can you elaborate?), saying some idea is actually bad (I got that a few times), and so on… essentially, the AI is starting to force people to work as a partner and give it proper information, not just tell them “it’s broken, fix it” like they used to do on StackOverflow.

girvo

I absolutely tell a coworker their code is slow and expect them to fix it…

Bayko

I too tell my boss to promote me and expect him to do so.

otabdeveloper4

It is not a tool. It is an oracle.

It can be a tool, for specific niche problems: summarization, extraction, source-to-source translation -- if post-trained properly.

But that isn't what y'all are doing, you're engaging in "replace all the meatsacks AGI ftw" nonsense.

Implicated

If I was on the "replace all the meatsacks AGI ftw" team then I would have referred to it as an oracle, by your own logic, wouldn't I have?

It's a tool. It's good for some things, not for others. Use the right tool for the job and know the job well enough to know which tools apply to which tasks.

More than anything it's a learning tool. It's also wildly effective at writing code, too. But, man... the things that it makes available to the curious mind are rather unreal.

I used it to help me turn a cat exercise wheel (think huge hamster wheel) into a generator that produces enough power to charge a battery that powers an ESP32 powered "CYD" touchscreen LCD that also utilizes a hall effect sensor to monitor, log and display the RPMs and "speed" (given we know the wheel circumference) in real time as well as historically.

I didn't know anything about all this stuff before I started. I didn't AGI myself here. I used a learning tool.

But keep up with your schtick if that's what you want to do.

raincole

> Do you tell your coworker "Hey, your code is slow" and expect great results? You ask it to benchmark the code and then you ask it how it might be optimized.

...Really? I think 'hey we have a lot of customers reporting the app is laggy when they do X, could you take a look' is a very reasonable thing to tell your coworker who implemented X.

hkonte

[dead]

5o1ecist

[dead]

joquarky

Don't let it deteriorate so far that it can't recover in one session.

Perform regular sessions dedicated to cleaning up tech debt (including docs).

MattGaiser

> If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

Are you using plan mode? I used to experience the do a poor approach and dig issue, but with planning that seems to have gone away?

bryanrasmussen

maybe there should be an LLM trained on a corpus of a deletions and cleanup of code.

krackers

I'm guessing there's a very strong prior to "just keep generating more tokens" as opposed to deleting code that needs to be overcome. Maybe this is done already but since every git project comes with its own history, you could take a notable open-source project (like LLVM) and then do RL training against against each individual patch committed.

movedx01

Perhaps the problem is that you RL on one patch a time, failing to capture the overarching long term theme, an architecture change being introduced gradually over many months, that exists in the maintainer’s mental model but not really explicitly in diffs.

bryanrasmussen

right, it would have to a specialized tool that you used to do analysis of codebase every now and then, or parts that you thought should be cleaned up.

Obviously there is a just keep generating more tokens bias in software management, since so many developer metrics over the years do various lines of code style analysis on things.

But just as experience and managerial programs have over time developed to say this is a bad bias for ranking devs, it should be clear it is a bad bias for LLMs to have.

ashdksnndck

I think this is in the training data since they use commit data from repos, but I imagine code deletions are rarer than they should be in the real data as well.

bryanrasmussen

deleting and code cleanup is perhaps more an expression of seniority, and personal preferences. Maybe there should be the same kind style transfer with code that you see with graphical generative AI, "rewrite this code path in the style of Donald Knuth"

treetalker

This is my experience with how LLMs "draft" legal arguments: at first glance, it's plausible — but may be, and often is, invalid, unsound, and/or ill-advised.

The catch is that many judges lack the time, energy, or willingness to not only read the documents in detail, but also roll up their sleeves and dig into the arguments and cited authorities. (Some lack the skills, but those are extreme cases.) So the plausible argument (improperly and unfortunately) carries the day.

LLM use in litigation drafting is thus akin to insurgent/guerilla warfare: it take little time, energy, or thinking to create, yet orders of magnitude more to analyze and refute. (It's a species of Brandolini's Law / The Bullshit Asymmetry Principle.) Thus justice suffers.

I imagine that this is analogous to the cognitive, technical, and "sub-optimal code" debt that LLM-produced code is generating and foisting upon future developers who will have to unravel it.

roarcher

> LLM use in litigation drafting is thus akin to insurgent/guerilla warfare: it take little time, energy, or thinking to create, yet orders of magnitude more to analyze and refute.

The same goes for coding. I have coworkers who use it to generate entire PRs. They can crank out two thousand lines of code that includes tests "proving" that it works, but may or may not actually be nonsense, in minutes. And then some poor bastard like me has to spend half a day reviewing it.

When code is written by a human that I know and trust, I can assume that they at least made reasonable, if not always correct, decisions. I can't assume that with AI, so I have to scrutinize every single line. And when it inevitably turns out that the AI has come up with some ass-backwards architecture, the burden is on me to understand it and explain why it's wrong and how to fix it to the "developer" who hasn't bothered to even read his own PR.

I'm seriously considering proposing that if you use AI to generate a PR at my company, the story points get credited to the reviewer.

patrakov

Evil voice: "I don't mind not getting credits for the story points. The story was AI-generated anyway."

theshrike79

If code smells like LLM, then you walk to said coworker and ask them to explain it for you. Play dumb if necessary.

Or you use YOUR LLM to review the PR :D

...and wtf, you get "credited" story points for finishing tasks? That sounds completely insane.

roarcher

> you get "credited" story points for finishing tasks? That sounds completely insane.

Developers' names are attached to stories, and stories have points on them. Why is that insane, and how does your company track who did what?

I propose that the name on the story should be that of the reviewer since they did the work.

undefined

[deleted]

deaux

> This is my experience with how LLMs "draft" legal arguments: at first glance, it's plausible — but may be, and often is, invalid, unsound, and/or ill-advised.

Correct, and this of course extends past just laws, into the whole scope of rules and regulations described in human languages. It will by its nature imply things that aren't explicitly stated nor can be derived with certainty, just because they're very plausible. And those implications can be wrong.

Now I've had decent success with having LLMs then review these LLM-generated texts to flag such occurences where things aren't directly supported by the source material. But human review is still necessary.

The cases I've been dealing with are also based on relatively small sets of regulations compared the scope of the law involved with many legal cases. So I imagine that in the domain you're working on, much more needs flagging.

FpUser

>" justice suffers"

Possible. It also suffers when majority simply can not afford proper representation

basch

"Reasoning" needs to go back to the drawing board.

Reasonable tasks need to be converted into formal logic, calculated and computed like a standard evaluation, and then translated back into english or language of choice.

LLMs are being used to think when really they should be the interpret and render steps with something more deterministic in the middle.

Translate -> Reason -> Store to Database. Rinse Repeat. Now the context can call from the database of facts.

otterley

As an attorney, I’m interested in this theory. Do you have any examples that illustrate the phenomenon you describe?

treetalker

Since it's too late to edit my other reply — here is a description of a recent case involving several of the categories I mentioned: https://reason.com/volokh/2026/03/06/california-appeals-cour...

And you will find many more by reading (or subscribing to the RSS feed of) the Volokh Conspiracy blog's "AI in Court" tag: https://reason.com/category/law/ai-in-court/

treetalker

Sure, but which part: opposing counsel using LLMs; opposing counsel simply using bullshit asymmetry to befuddle (nothing new); or judges not always reading and looking deeply into the arguments and authorities (also nothing new)?

If the first category, there have been plenty of examples that have even made their way onto the HN front page in the last half year or so. There have even been instances of judges using LLMs to draft orders containing confabulated authorities.

otterley

> opposing counsel simply using bullshit asymmetry to befuddle (nothing new)

This one in particular.

undefined

[deleted]

grey-area

This is a fascinating look into code generated by an LLM that is correct in one sense (passes tests) but doesn't meet requirements (painfully slow). Doesn't use is_ipk to identify primary keys, uses fsync on every statement. The problem with larger projects like this even if you are competent is that there are just too many lines of code to read it properly and understand it all. Bravo to the author for taking the time to read this project, most people never will (clearly including the author of it).

I find LLMs at present work best as autocomplete -

The chunks of code are small and can be carefully reviewed at the point of writing

Claude normally gets it right (though sometimes horribly wrong) - this is easier to catch in autocomplete

That way they mostly work as designed and the burden on humans is completely manageable, plus you end up with a good understanding of the code generated. They make mistakes I'd say 30% of the time or so when autocompleting, which is significant (mistakes not necessarily being bugs but ugly code, slow code, duplicate code or incorrect code.

Having the AI produce the majority of the code (in chats or with agents) takes lots of time to plan and babysit, and is harder to review, maintain and diagnose; it doesn't seem like much of a performance boost, unless you're producing code that is already in the training data and just want to ignore the licensing of the original code.

theshrike79

> This is a fascinating look into code generated by an LLM that is correct in one sense (passes tests) but doesn't meet requirements (painfully slow).

Why isn't requirements testing automated? Benchmarking the speed isn't rocket science. At worst a nightly build should run a benchmark and log it so you can find any anomalies.

KatanaLarp

[dead]

consumer451

Nitpick/question: the "LLM" is what you get via raw API call, correct?

If you are using an LLM via a harness like claude.ai, chatgpt.com, Claude Code, Windsurf, Cursor, Excel Claude plug-in, etc... then you are not using an LLM, you are using something more, correct?

An example I keep hearing is "LLMs have no memory/understanding of time so ___" - but, agents have various levels of memory.

I keep trying to explain this in meetings, and in rando comments. If I am not way off-base here, then what should be the term, or terms, be? LLM-based agents?

dragonwriter

> Nit pick/question: The LLM is what you get via raw API call, correct?

You always need a harness of some kind to interact with an LLM. Normal web APIs (especially for hosted commercial systems) wrapped around LLMs are non-minimal harnesses, that have built in tools, interpretation of tool calls, application of what is exposed in local toolchains as “prompt templates” to transform the context structure in the API call into a prompt (in some cases even supporting managing some of the conversation state that is used to construct the prompt on the backend.)

> If you are using an LLM via a harness like claude.ai, chatgpt.com, Claude Code, Windsurf, Cursor, Excel Claude plug-in, etc... then you are not using an LLM, you are using something more, correct?

You are essentially always using something more than an LLM (unless “you” are the person writing the whole software stack, and the only thing you are consuming is the model weights, or arguably a truly minimal harness that just takes setting and a prompt that is not transformed in any way before tokenization, and returns the result after no transformations or filtering other than mapping back from tokens to text.)

But, yes, if you are using an elaborate frontend of the type you enumerate (whether web or CLI or something else), you are probably using substantially more stuff on top of the LLM than if you are using the providers web API.

consumer451

In meetings, I try to explain the roles of system prompts, agentic loops, tool calls, etc in the products I create, to the stakeholders.

However, they just look at the whole thing as "the LLM," which carries specific baggage. If we could all spread the knowledge of what is actually going on to the wider public, it would make my meetings easier, and prevent many very smart folks who are not practitioners from saying inaccurate stuff.

staplers

  If we could all spread the knowledge of what is actually going on to the wider public, it would make my meetings easier, and prevent very smart folks from outside the field from saying dumb-sounding stuff.

This is an example of why LLMs won't displace engineers as severely as many think. There are very old solved processes and hyper-efficient ways of building things in the real world that still require a level of understanding many simply don't care or want to achieve.

xlth

You're not off-base at all. The way I think about it:

- LLM = the model itself (stateless, no tools, just text in/text out) - LLM + system prompt + conversation history = chatbot (what most people interact with via ChatGPT, Claude, etc.) - LLM + tools + memory + orchestration = agent (can take actions, persist state, use APIs)

When someone says "LLMs have no memory" they're correct about the raw model, but Claude Code or Cursor are agents - they have context, tool access, and can maintain state across interactions.

The industry seems to be settling on "agentic system" or just "agent" for that last category, and "chatbot" or "assistant" for the middle one. The confusion comes from product names (ChatGPT, Claude) blurring these boundaries - people say "LLM" when they mean the whole stack.

simonw

I like to use the term "coding agents" for LLM harnesses that have the ability to directly execute code.

This is an important distinction because if they can execute the code they can test it themselves and iterate on it until it works.

The ChatGPT and Claude chatbot consumer apps do actually have this ability now so they technically class as "coding agents", but Claude Code and Codex CLI are more obvious examples as that's their key defining feature, not a hidden capability that many people haven't spotted yet.

alexhans

> The vibes are not enough. Define what correct means. Then measure.

Pretty much. I've been advocating this for a while. For automation you need intent, and for comparison you need measurement. Blast radius/risk profile is also important to understand how much you need to cover upfront.

The Author mentions evaluations, which in this context are often called AI evals [1] and one thing I'd love to see is those evals become a common language of actually provable user stories instead of there being a disconnect between different types of roles, e.g. a scientist, a business guy and a software developer.

The more we can speak a common language and easily write and maintain these no matter which background we have, the easier it'll be to collaborate and empower people and to move fast without losing control.

- [1] https://ai-evals.io/ (or the practical repo: https://github.com/Alexhans/eval-ception )

D-Machine

This article is great. And the blog-article headline is interesting, but wrong. LLM's don't in general write plausible code (as a rule) either.

They just write code that is (semantically) similar to code (clusters) seen in its training data, and which haven't been fenced off by RLHF / RLVR.

This isn't that hard to remember, and is a correct enough simplification of what generative LLMs actually do, without resorting to simplistic or incorrect metaphors.

ozozozd

Exactly. It’s also easy to find yourself in the out-of-distribution territory. Just ask for some tree-sitter queries and watch Gemini 3, Opus 4.5 and GLM 5 hallucinate new directives.

ehnto

I think this could be the key difference in how people are experiencing the tools. Using Claude in industries full of proprietary code is a totally different experience to writing some React components, or framework code in C#, PHP or Java. It's shockingly good at the later, but as you get into proprietary frameworks or newer problem domains it feels like AI in 2023 again, even with the benefit of the agentic harnesses and context augments like memory etc.

2god3

You’ve hit the nail on the head.

I characterise llm’s as being black boxes that are filled with a dense pool of digital resources - that with the correct prompt you can draw out a mix of resources to produce an output.

But if the mix of resources you need isn’t there - it won’t work. This isn’t limited to just text. This also applies with video models - llms work better for prompts in which you are trying to get material that is widely available on the internet.

empath75

I think in the long term, if an LLM can’t use a tool, people won’t stop using LLM’s, they’ll stop using the tool.

We are building everything right now with LLM agents as a primary user in mind and one of our principles is “hallucination driven development”. If LLMs hallucinate an interface to your product regularly, that is a desire path and you should create that interface.

simianwords

Any example of how I can get it to hallucinate?

kubb

IIRC, the most code in its training data is Python. Closely followed by Web technologies (HTML, JS/TS, CSS). This corresponds to the most abundant developers. Many of them dedicated their entire careers to one technology.

We stubbornly use the same language to refer to all software development, regardless of the task being solved. This lets us all be a part of the same community, but is also a source of misunderstanding.

Some of us are prone to not thinking about things in terms of what they are, and taking the shortcut of looking at industry leaders to tell us what we should think.

These guys consistently, in lockstep, talk about intelligent agents solving development tasks. Predominately using the same abstract language that gives us an illusion of unity. This is bound to make those of us solving the common problems believe that the industry is done.

jmull

> They just write code that is (semantically) similar to code (clusters) seen in its training data, and which haven't been fenced off by RLHF / RLVR.

"Plausible" sounds like the right word to me. (It would be a mistake to digress into these features of LLMs in an article where it isn't needed.)

HarHarVeryFunny

I agree - I took "plausible" here to mean plausible-looking, no different than similar-looking.

The trouble of course is that similar/plausible isn't good enough unless the LLM has seen enough similar-but-different training samples to refine it's notion of similarity to the point where it captures the differences that are critical in a given case.

I'd rather just characterize it as a lack of reasoning, since "add more data" can't be the solution to a world full of infinite variety. You can keep playing whack a mole to add more data to fix each failure, and I suppose it's an interesting experiment to see how far that will get you, but in the end the LLM is always going to be brittle and susceptible to stupid failure cases if it doesn't have the reasoning capability to fully analyze problems it was not trained on.

comex

Based on a search, the SQLite reimplementation in question is Frankensqlite, featured on Hacker News a few days ago (but flagged):

https://news.ycombinator.com/item?id=47176209

scotty79

Flagging on HN is getting insane.

flerchin

Yes plausible text prediction is exactly what it is. However, I wonder if the author included benchmarking in their prompt. It's not exactly fair to keep hidden requirements.

g947o

Attributing these to "hidden requirements" is a slippery slope.

My own experience using Claude Code and similar tools tells me that "hidden requirements" could include:

* Make sure DESIGN.md is up to date

* Write/update tests after changing source, and make sure they pass

* Add integration test, not only unit tests that mock everything

* Don't refactor code that is unrelated to the current task

...

These are not even project/language specific instructions. They are usually considered common sense/good practice in software engineering, yet I sometimes had to almost beg coding agents to follow them. (You want to know how many times I have to emphasize don't use "any" in a TypeScript codebase?)

People should just admit it's a limitation of these coding tools, and we can still have a meaningful discussion.

grey-area

The training data is full of ‘any’ so you will keep getting ‘any’ because that is the code the models have seen.

An interesting example of the training data overriding the context.

theshrike79

Then you add a biome rule to say "no any ever" and the LLM will fix it before claiming the job is done.

flerchin

Yeah I agree generally that the most banal things must be specified, but I do think that a single sentence in the prompt "Performance should be equivalent" would likely have yielded better results.

seanmcdirmid

Ok, I’ll bite: how is that different from humans?

strken

Human behaviour is goal-directed because humans have executive function. When you turn off executive function by going to sleep, your brain will spit out dreams. Dream logic is famous for being plausible but unhinged.

I have the feeling that LLMs are effectively running on dream logic, and everything we've done to make them reason properly is insufficient to bring them up to human level.

seanmcdirmid

Isn’t a modern LLM with thinking tokens fairly goal directed? But yes, we hallucinate in our sleep while LLMs will hallucinate details if the prompt isn’t grounded enough.

zarzavat

The thing about dream logic is that it can be a completely rational series of steps, but there's usually a giant plot hole which you only realise the second you wake up.

This definitely matches my experience of talking to AI agents and chatbots. They can be extremely knowledgeable on arcane matters yet need to have obvious (to humans) assumptions pointed out to them, since they only have book smarts and not street smarts.

tovej

Assuming this is not a rhetorical question: no, it is not. The only "goal" is to maximize plausibility.

tsunamifury

It’s amazing how much you get wrong here. As LLM attention layers are stacked goal functions.

What they lack is multi turn long walk goal functions — which is being solved to some degree by agents.

strken

I don't argue that thinking and attention are missing. I argue that they are trying to do the job of human executive function but aren't as good at it.

nemo44x

LLMs are literally goal machines. It’s all they do. So it’s important that you input specific goals for them to work towards. It’s also why logically you want to break the problem into many small problems with concrete goals.

andai

Do you only mean instruct-tuned LLMs? Or the base (pretrained) model too?

satvikpendem

A prompt for an LLM is also a goal direction and it'll produce code towards that goal. In the end, it's the human directing it, and the AI is a tool whose code needs review, same as it always has been.

basch

Id argue humans have some sort of parallelness going on that machines dont yet. Thoughts happening at multiple abstraction levels simultaneously. As I am doing something, I am also running the continuous improvement cycle in my head, at all four steps concurrently. Is this working, is this the right direction, does this validate?

You could build layers and layers of LLMs watching the output of each others thoughts and offering different commentary as they go, folding all the thoughts back together at the end. Currently, a group of agents acts more like a discussion than something somewhat omnipotent or omnitemporal.

whoamii

Some of my best code comes from my dreams though.

spiderfarmer

And yet LLM’s are incredibly useful as they are right now.

strken

And yet they're going to be better in a decade, which will require understanding why they aren't perfect today.

apical_dendrite

The volume is different. Someone submitted a PR this week that was 3800 lines of shell script. Most of it was crap and none of it should have been in shell script. He's submitting PRs with thousands of lines of code every day. He has no idea how any of it actually works, and it completely overwhelms my ability to review.

Sure, he could have submitted a ill-considered 3800 line PR five years ago, but it would have taken him at least a week and there probably would have been opportunities to submit smaller chunks along the way or discuss the approach.

switchbak

It’s harder when the person doing what you describe has the ability to have you fired. Power asymmetry + irresponsible AI use + no accountability = a recipe for a code base going right to hell in a few months.

I think we’re going to see a lot of the systems we depend on fail a lot more often. You’d often see an ATM or flight staus screen have a BSOD - I think we’re going to see that kind of thing everywhere soon.

satvikpendem

Just block that user, that seems to be the way.

somewhereoutth

Humans have a 'world model' beyond the syntax - for code, an idea of what the code should do and how it does it. Of course, some humans are better than others at this, they are recognized as good programmers.

satvikpendem

Papers show that AI also has a world model, so I don't think that's the right distinction.

tovej

Could you please cite these papers. If by AI you mean LLMs, that is not supported by what I know. If you mean a theoretical world-model-based AI, that's just a tautological statement.

detourdog

What I'm surprises me about the current development environment is the acceleration of technical debt. When I was developing my skills the nagging feeling that I didn't quite understand the technology was a big dark cloud. I felt this clopud was technical debt. This was always what I was working against.

I see current expectations that technical debt doesn't matter. The current tools embrace superficial understand. These tools to paper over the debt. There is no need for deeper understanding of the problem or solution. The tools take care of it behind the scenes.

wood_spirit

It’s not. LLMs are just averaging their internet snapshot, after all.

But people want an AI that is objective and right. HN is where people who know the distinction hang out, but it’s not what the layperson things they are getting when they use this miraculous super hyped tool that everybody is raving about?

mrwh

The etiquette, even at the bigtech place I work, has changed so quickly. The idea that it would be _embarrassing_ to send a code review with obvious or even subtle errors is disappearing. More work is being put on the reviewer. Which might even be fine if we made the further change that _credit goes to the reviewer_. But if anything we're heading in the opposite direction, lines of code pumped out as the criterion of success. It's like a car company that touts how _much_ gas its cars use, not how little.

wood_spirit

Review is usually delegated to an AI too

satvikpendem

By now, a few years after ChatGPT released, I don't think anyone is thinking AI is objective and right, all users have seen at least one instance of hallucination and simply being wrong.

wood_spirit

Sorry I can think of so many counter examples. I also detect a lot of “well it hallucinates about subject X (that the person knows well, so can spot the hallucination)” but continue to trust it on subjects Y and Z (which the person knows less well so can’t spot the hallucinations).

YMMV.

seanmcdirmid

There are a lot of binary thinkers on HN, but they shouldn’t make up a majority.

rDr4g0n

It's much easier to fire an employee which produces low quality/effort work than to convince leadership to fire Claude.

satvikpendem

You can fire employees who don't review code generated though, because ultimately it's their responsibility to own their code, whether they hand wrote it or an LLM did.

It seems to me that it's all a matter of company culture, as it has always been, not AI. Those that tolerate bad code will continue to tolerate it, at their peril.

andai

It writes statistically represented code, which is why (unless instructed otherwise) everything defaults to enterprisey, OOP, "I installed 10 trendy dependencies, please hire me" type code.

Shyaamal11

One thing I’ve noticed while working with data/AI workflows is that the “acceptance criteria first” idea applies even more strongly once you move beyond code generation into data pipelines and analytics.

LLMs can generate queries, transformations, or even Spark jobs that look reasonable but if the underlying data contracts, schema expectations, or evaluation criteria aren’t defined, you end up with something that looks correct but is semantically wrong.

In practice, the teams that get the most value from AI-assisted development tend to have: clearly defined datasets reproducible data pipelines well-defined outputs / metrics Once those pieces are in place, AI becomes much more useful because it’s operating inside a structured system instead of guessing context. That’s also why there’s been a lot of interest lately in lakehouse-style platforms that combine data engineering, analytics, and AI workflows in one place (e.g. platforms like IOMETE).

When the data layer is structured and reproducible, AI tooling becomes far more reliable. Curious if others here have seen the same pattern when using LLMs for data engineering or analytics work.

siliconc0w

Just a recent anecdote, I asked the newest Codex to create a UI element that would persist its value on change. I'm using Datastar and have the manual saved on-disk and linked from the AGENTS.md. It's a simple html element with an annotation, a new backend route, and updating a data model. And there are even examples of this elsewhere in the page/app.

I've asked it to do why harder things so I thought it'd easily one-shot this but for some reason it absolutely ate it on this task. I tried to re-prompt it several times but it kept digging a hole for itself, adding more and more in-line javascript and backend code (and not even cleaning up the old code).

It's hard to appreciate how unintuitive the failure modes are. It can do things probably only a handful of specialists can do but it can also critical fail on what is a straightforward junior programming task.

undefined

[deleted]

Daily Digest email

Get the top HN stories in your inbox every day.