roadbuster
WhyNotHugo
This is why you write the tests first and then the code. Especially when fixing bugs, since you can be sure that the test properly fails when the bug is present.
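As a concrete sketch of the test-first bug fix described above (the function and the bug here are hypothetical, purely for illustration): write the regression test before the fix, confirm it fails against the buggy version, then make it pass.

```python
# Hypothetical example: average() used to raise ZeroDivisionError on an
# empty list. The regression test is written first, so you can watch it
# fail against the buggy code before applying the fix.

def average(xs):
    """Mean of xs; defined as 0.0 for an empty list (the fix)."""
    if not xs:          # guard added by the fix; the bug was its absence
        return 0.0
    return sum(xs) / len(xs)

def test_average_of_empty_list_is_zero():
    # Against the buggy version this raised ZeroDivisionError,
    # proving the test actually exercises the bug.
    assert average([]) == 0.0

def test_average_of_nonempty_list():
    assert average([2, 4, 6]) == 4.0
```

Running the test suite against the pre-fix code is the step that guarantees the test is capable of catching the bug at all.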
pmontra
When fixing bugs, yes. When designing an app, not so much, because you realize many unexpected things while writing the code and seeing how it behaves. Often the original test code would test something that is never built. It's obvious for integration tests, but it happens for tests of API calls and even for unit tests. One could start writing unit tests for a module or class and eventually realize that it must be implemented in a totally different way. I prefer experimenting with the implementation and writing tests only once it settles down into something that I'm confident will go to production.
Karrot_Kream
Where I'm at currently (which may change) is that I lay down the foundation of the program and its initial tests first. That initial bit is completely manual. Then, when I'm happy that the program is sufficiently "built up", I let the LLM go crazy. I still audit the tests, though personally auditing tests is the part of programming I like the least. This also largely preserves the initial architectural patterns that I set, so it's just much easier to read the LLM's code.
In a team setting I try to do the same thing and invite team members to start writing the initial code by hand only. I suspect if an urgent deliverable comes up though, I will be flexible on some of my ideas.
koonsolo
> When fixing bugs, yes.
One thing I want to mention here is that you should try to write a test that not only prevents this bug, but also similar bugs.
In our own codebase we saw that regression on fixed bugs is very rare, so writing a specific test for each fix isn't the best way to spend your resources. Writing a broad test, when possible, is.
Not sure how LLMs would handle that case and come up with a proper test.
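The "broad test" idea above can be sketched as a lightweight property check: instead of pinning the exact input that triggered one bug, sweep a whole class of inputs (stdlib only here; a library like Hypothesis automates the input generation). The function under test is hypothetical.

```python
# Broad regression test: check an invariant over many inputs rather
# than one test per fixed bug.

def normalize_whitespace(s: str) -> str:
    """Collapse runs of whitespace to single spaces and trim the ends."""
    return " ".join(s.split())

def test_normalization_is_idempotent():
    # Property: normalizing twice never changes the result. Any future
    # bug that breaks idempotence is caught, not just the original one.
    cases = ["", " ", "a  b", "\t a \n b ", "x" * 100, " \r\n\t "]
    for s in cases:
        once = normalize_whitespace(s)
        assert normalize_whitespace(once) == once

def test_no_leading_or_trailing_space():
    for s in ["  a", "a  ", "  a  b  "]:
        out = normalize_whitespace(s)
        assert out == out.strip()
```

One property test like this subsumes many single-input regression tests, which is the resource trade-off the comment is pointing at.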
boristane007
[dead]
pipecmd
I'd argue the AI writing the tests shouldn't even know about the implementation at all. You only want to pass it the interface (or function signatures) together with javadocs/docstrings/equivalent.
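A sketch of what that blinded test-writer would actually be handed: the signature and docstring with the body withheld (all names here are hypothetical).

```python
# What the test-writing model sees: interface and contract only.
# The implementation is deliberately stubbed out so the generated
# tests can only encode the documented behaviour, not the code.

def apply_discount(price_cents: int, percent: float) -> int:
    """Return price_cents reduced by percent (0-100), rounded down.

    Raises ValueError if percent is outside [0, 100] or if
    price_cents is negative.
    """
    ...  # body withheld from the test writer
```

The coding model then gets the same interface plus the generated tests, or just the interface, depending on how strictly you want to separate the two.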
GuB-42
I don't think it addresses the problem.
Writing the tests first and then writing code to pass the tests is no better than writing the code first and then writing tests that pass. What matters is that the code and the tests are written independently, from the specs, not from one another.
I think it is better not to have access to the tests when first writing the code, so as to make sure you code to the specs and not to the tests, since something may be lost in translation. That means I have a preference for code first, but the ideal case would be for different people to do the two in parallel.
Anyway, regarding AI: if an AI writes both the tests and the code, it will make sure they match no matter which comes first. It may even go back and forth between the tests and the code, but that doesn't mean it is correct.
9rx
Tests are your spec. You write them first because that is the stage when you are still figuring out what you need to write.
Although TDD says that you should write only one test before implementing it, which encourages spec writing to be an iterative process.
Writing the spec after implementation means that you are likely to have forgotten the nuance that went into what you created. That is why specs are written first. Then the nuance is captured up front as it comes to mind.
usefulcat
Agreed 1000%. But that can be a lot of work; creating a good set of tests takes nearly as much effort as implementing the thing being tested, and often more.
When LLMs can assist with writing useful tests before having seen any implementation, then I’ll be properly impressed.
byzantinegene
From experience, AI is bad at TDD. It can infer tests based on written code, but it is bad at writing generalised tests unless a clear requirement is given, so you, the engineer, are doing most of the work anyway.
dustingetz
try this for a UI
porphyra
At my job we have a requirement for 100% test coverage. So everyone just uses AI to generate 10,000 line files of unit tests and nobody can verify anything.
harimau777
Exactly! It's frustrating how much developers get blamed for the outcomes of incompetent management.
philipallstar
> everyone just uses AI to generate 10,000 line files of unit tests and nobody can verify anything
This is not a guaranteed outcome of requiring 100% coverage. Not that that's a good requirement, but responding badly to a bad requirement is just as bad.
IAmGraydon
Yeah this is the exact kind of ridiculousness I've noticed as well - everything that comes out of an LLM is optimized to give you what you want to hear, not what's correct.
littlestymaar
A long time ago in France, the mainstream view among computer people was that code and compute weren't what mattered when dealing with computers; it is information that matters, and how you process it in a sensible way (hence the French name for computer science: informatique. And also the word for computer, “ordinateur”, literally: that which sets things in order).
As a result, computer science students were taught a lot (too much for most people's taste, it seems) about data modeling and not much about code itself, which was viewed as mundane and uninteresting, until the US hacker culture finally took over in the late 2000s.
Turns out that the French were just right too early, like with the Minitel.
msh
"Computer science is no more about computers than astronomy is about telescopes." -Dijkstra
scotty79
> The LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code.
I always felt like that's the main issue with unit testing. That's why I used it very rarely.
Maybe keeping the tests in a separate module, not letting the agent see the source while writing tests, and not letting the agent see the tests while writing the implementation would help? They could just share the API and the spec.
And in case of tests failing another agent with full context could decide if the fix should be delegated to coding agent or to testing agent.
Herring
> At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
Obvious question: why not? Let’s say you have competent devs, fair assumption. Maybe it’s because they don’t have enough time for solid QA? Lots of places are feature factories. In my personal projects I have more lines of code doing testing than implementation.
sarchertech
It’s because people will do what they’re incentivized to do. And if no one cares about anything but whether the next feature goes out the door, that’s what programmers will focus on.
Honestly I think the other thing that is happening is that a lot of people who know better are keeping their mouths shut and waiting for things to blow up.
We’re at the very peak of the hype cycle right now, so it’s very hard to push back and tell people that maybe they should slow down and make sure they understand what the system is actually doing and what it should be doing.
shigawire
Or if you say we should slow down your competence is questioned by others who are going very fast (and likely making mistakes we won't find until later).
And there is an element of uncertainty. Am I just bad at using these new tools? To some degree probably, but does that mean I'm totally wrong and we should be going this fast?
citizenpaul
oh yeah, let them dig a hole and charge sweet consultant rates to fix it. then the healing can begin
harimau777
Developers aren't given time to test and aren't rewarded if they do, but management will rain down hellfire upon their heads if they don't churn out code quickly enough.
ojo-rojo
How about a subsequent review where a separate agent analyzes the original issue and resultant code and approves it if the code meets the intent of the issue. The principle being to keep an eye out for manual work that you can describe well enough to offload.
Depending on your success rate with agents, you can have one that validates multiple criteria or separate agents for different review criteria.
g947o
You are fighting nondeterministic behavior with more nondeterministic behavior, or in other words, fighting probability with probability. That doesn't necessarily make things any better.
pyridines
In my experience, an agent with "fresh eyes", i.e., without the context of being told what to write and writing it, does have a different perspective and is able to be more critical. Chatbots tend to take the entire previous conversational history as a sort of canonical truth, so removing it seems to get rid of any bias the agent has towards the decisions that were made while writing the code.
I know I'm psychologizing the agent. I can't explain it in a different way.
hex4def6
Aren't human coders also nondeterministic?
Assigning different agents to have different focuses has worked for me. Especially when you task a code reviewer agent with the goal of critically examining the code. The results will normally be much better than asking the coder agent who will assure you it's "fully tested and production ready"
tbossanova
Probably true
(Sorry.)
samrus
Slop on slop. Who watches the watchmen?
cmrdporcupine
My only hope is that all of this push leads in the end to the adoption of more formal verification languages and tools.
If people are having to specify things in TLA+ etc -- even with the help of an LLM to write that spec -- they will then have something they can point the LLM at in order for it to verify its output and assumptions.
madrox
I encourage everyone to RTFA and not just respond to the headline. This really is a glimpse into where the future is going.
I've been saying "the last job to be automated will be QA" and it feels more true every day. It's one thing to be a product engineer in this era. It's another to be working at the level the author is, where code needs to be verifiable. However, once people stop vibing apps and start vibing kernels, it really does fundamentally change the game.
I also have another saying: "any sufficiently advanced agent is indistinguishable from a DSL." I hadn't considered Lean in this equation, but I put these two ideas together and I feel like we're approaching some world where Lean eats the entire agentic framework stack and the entire operating system disappears.
If you're thinking about building something today that will still be relevant in 10 years, this is insightful.
fmbb
There are still no successful, useful vibe-coded apps. Kernels are pretty far away, I think.
bonoboTP
This is a very strange statement. People don't always announce when they use AI for writing their software since it's a controversial topic. And it's a sliding scale. I'm pretty sure a large fraction of new software has some AI involved in its development.
qsera
> new software has some AI involved in its development.
A large part of it is probably just using it as a better search. Like "How do I define a new data type in go?".
dehrmann
Apps are a strange measure because there aren't really any new, groundbreaking ones. PCs and smartphones have mostly done what people have wanted them to do for a while.
shimman
There are plenty of ground breaking apps but they aren't making billions of advertising revenue, nor do they have large numbers. I honestly think torrent applications (and most peer to peer type of stuff) are very cool and very useful for small-medium groups but it'll never scale to a billion user thing.
Do agree it's a weird metric to have, but I can't think of a better one outside of "business", and that still seems like a poor rubric because the vast majority of people care about things that aren't businesses. And if this "life-altering" technology basically amounts to creating digital slaves, then maybe we as a species shouldn't explore the stars.
tempaccount5050
I think this might miss the point. We put off upgrading to a new RMM at work because I was able to hack together some dashboards in a couple of days. It's not novel and it does exactly what we need it to do, no more. We don't need to pay thousands of dollars a month for the bloated SolarWinds stack. We aren't saving lives, we're saving PDFs, so any arguments about five 9s and maintainability are irrelevant.

LLMs are going to give us on-demand, one-off software. I think the SaaS market is terrified right now because for decades they've gouged customers for continual bloat and lock-in that we can now escape from. In a single day I was able to build an RMM that fits our needs exactly. We don't need to hire anyone to maintain it because it's simple, like most business applications should be, but SV needs to keep complicating its offerings with bloat to justify crazy monthly costs for what should have been a one-time purchase from the start. SV shot itself in the face with AI.
theshrike79
Define "successful"?
Does it need to be HN-popular or a household name? Be in the news?
Or something that saves 50% of time by automating inane manual work from a team?
raincole
Name 3 apps that are:
1. widely considered successful
2. made by humans from scratch in 2025
It looks like humans and AI are on par in this realm.
GoatInGrey
To be fair, Claude Code is vibe-coded. It's a terrible piece of software from an engineering (and often usability) standpoint, and the problems run deeper than just the choice of JavaScript. But it is good enough for people to get what they want out of it.
bunderbunder
But also, based on what I have heard of their headcount, they are not necessarily saving any money by vibecoding it - it seems like their productivity per programmer is still well within the historical range.
That isn’t necessarily a hit against them - they make an LLM coding tool and they should absolutely be dogfooding it as hard as they can. They need to be the ones to figure out how to achieve this sought-after productivity boost. But so far it seems to me like AI coding is more similar to past trends in industry practice (OOP, Scrum, TDD, whatever) than it is different in the only way that’s ever been particularly noteworthy to me: it massively changes where people spend their time, without necessarily living up to the hype about how much gets done in that time.
jmathai
> But it is good enough for people to get what they want out of it.
This is the ONLY point of software unless you’re doing it for fun.
gwern
> I encourage everyone to RTFA and not just respond to the headline.
This is an example of an article which 'buries the lede'†.
It should have started with the announcement of the new zlib autoformalization (!) https://leodemoura.github.io/blog/2026/02/28/when-ai-writes-... to get you excited.
Then it should have talked about the rest - instead of starting with rather graceless and ugly LLM-written generic prose about AI topics that to many readers is already tiresomely familiar and doubtless was tldr for even the readers who aren't repelled automatically by that.
† or in my terms, fails to 'make you care': https://gwern.net/blog/2026/make-me-care
bwestergard
I am as enthusiastic about formal methods as the next guy, but I very much doubt any LLM-based technique will make it economical to write a substantial fraction of application software in Lean. The LLM can play a powerful heuristic role in searching for proof-bearing code in areas where there is good training data. Unfortunately those areas are few and far between.
Moreover, humans will still need to read even rigorously proved code if only to suss out performance issues. And training people to read Lean will continue to be costly.
Though, as the OP says, this is a very exciting time for developing provably correct systems programming.
zozbot234
LLMs are writing non-trivial math proofs in Lean, and software proofs tend to be individually easier than proofs in math, just more tedious because there's so much more of them in any non-trivial development.
Some performance issues (asymptotics) can be addressed via proof, others are routinely verified by benchmarking.
madrox
This assumes everything about current capabilities stay static, and it wasn't long ago before LLMs couldn't do math. Many were predicting the genAI hype had peaked this time last year.
If you want it to be a question of economics, I think the answer is in whether this approach is more economical than the alternative, which is having people run this substrate. There's a lot of enthusiasm here and you can't deny there has been progress.
I wouldn't be so quick to doubt. It costs nothing to be optimistic.
candiddevmike
> and it wasn't long ago before LLMs couldn't do math
They still can't do math.
charlieflowers
> "any sufficiently advanced agent is indistinguishable from a DSL."
I don't quite follow but I'd love to hear more about that.
madrox
If you give an agent a task, the typical agentic pattern is that it calls tools in some non-deterministic loop, feeding the tool output back into the LLM, until it deems the task complete. The LLM internalizes an algorithm.
Another way of doing it is the agent just writes an algorithm to perform the task and runs it. In this world, tools are just APIs and the agent has to think through its entire process end to end before it even begins and account for all cases.
Only the latter is Turing complete, but the former approaches the latter as it improves.
esafak
charlieflowers
No, I get the Clarke reference. But how is an agent a DSL?
thinkling
My read was roughly that agents require constraining scaffolding (CLAUDE.md) and careful phrasing (prompt engineering) which together is vaguely like working in a DSL?
jpollock
If the LLM is able to code it, there is enough training data that you might be better off in a different language that removes the boilerplate.
kubanczyk
> RTFA
Sigh. Is there an LLM solution for an HN reader to filter out all top-level commenters who haven't RTFA? I don't need the (micro-)shitstorms these people spawn, even if the general HN algo scores them as "interesting".
wyum
I believe there is a Verification Complexity Barrier
As you add components to a system, the time it takes to verify that the components work together increases superlinearly.
At a certain point, the verification complexity takes off. You literally run out of time to verify everything.
AI coding agents hit this barrier faster than ever, because of how quickly they can generate components (and how poorly they manage complexity).
I think verification is now the problem of agentic software engineering. I think formal methods will help, but I don't see how they will apply to messy situations like end-to-end UI testing or interactions between the system and the real world.
I posted more detailed thoughts on X: https://x.com/i/status/2027771813346820349
codethief
Hi William, thank you for the interesting post!
> At a certain point, the verification complexity takes off. You literally run out of time to verify everything.
Could you elaborate on this? Your post makes it sound as if the verification complexity diverges as the number of components n approaches a certain finite value n_0, but that seems unlikely to me. If, in contrast, the verification complexity remains finite at n_0, then verification should still be possible in finite time, shouldn't it? Yes, it might be a considerable amount of time, but I assume your theorem doesn't predict lower bounds for the involved constants?
Either way, this entire discussion assumes n will increase as more and more software gets written by AI. Couldn't it also be the opposite, though? AI might also lead us to removing unnecessarily complex dependencies from our software supply chain or stripping them down to the few features we need.
wyum
Thank you for reading and the very thoughtful observations.
>> At a certain point, the verification complexity takes off. You literally run out of time to verify everything.

> Could you elaborate on this?
I plan to publish a thorough post with an interactive model. Whether human or AI, you are capacity constrained, and I glossed over `C` (capacity within a given timeframe) in the X post.
You are correct that verification complexity remains finite at n_0. The barrier is practical: n_0 is where V(n) exceeds your available capacity C. If V(n) = n^(1+k), then n_0 = C^(1/(1+k)). Doubling your capacity doesn't double n_0. It increases by a factor of 2^(1/(1+k)), which is always less than 2.
So the barrier always exists for, say, a given "dev year" or "token budget," and the cost to push it further grows superlinearly. It's not absolutely immovable, but moving it gets progressively harder. That's what I mean by "literally run out of time." At any given capacity, there is a finite n beyond which complete verification is not possible. Expanding capacity buys diminishing returns.
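The diminishing-returns claim above is easy to check numerically. A minimal sketch, using the same assumed model V(n) = n^(1+k) with an illustrative exponent k = 0.5 (the exponent is a free parameter, not something from the post):

```python
# With verification cost V(n) = n**(1 + k) and capacity C, the largest
# fully verifiable system size is n0 = C**(1 / (1 + k)). Doubling C
# multiplies n0 by only 2**(1 / (1 + k)), which is always less than 2.

def n0(capacity: float, k: float) -> float:
    """Largest n such that V(n) = n**(1 + k) <= capacity."""
    return capacity ** (1 / (1 + k))

k = 0.5  # assumed superlinearity exponent, for illustration
for C in (100, 200, 400, 800):
    print(f"C = {C:4d}  ->  n0 = {n0(C, k):.1f}")
# Each doubling of C grows n0 by a constant factor of 2**(1/1.5),
# i.e. roughly 1.59x rather than 2x.
```

The same shape holds for any k > 0; larger k just makes the returns diminish faster.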
> Either way, this entire discussion assumes n will increase as more and more software gets written by AI. Couldn't it also be the opposite, though?
You are getting at my core motivation for exploring this question.
Verification requires a definition of "done" and I wonder if it will ever be possible (or desirable) for AI to define done on its own, let alone verify it and simplify software based on its understanding of our needs.
You make a great point that we are not required to add more components and "go right" along the curve. We can choose to simplify, and that is absolutely the right takeaway. AI has made many people believe that by generating more code at a faster pace they are accomplishing more. But that's not how software productivity should be judged.
To answer your question about assumptions, while AI can certainly be prompted to help reduce n or k in isolated cases where "done" is very clear, I don't think it's realistic to expect this in aggregate for complex systems where "done" is subjective and dynamic.
I'm speaking mainly in the context of commercial software dev here, informed by my lived experience building hundreds of apps. I often say software projects have a fractal complexity. We're constantly identifying new needs and broader scope the deeper we go, not to mention pivots and specific customer asks. You rarely get to stand still.
I don't mean to be pessimistic, but my hunch is that complexity growth outpaces the rate of simplification in almost every software project. This model attempts to explain why that is so. And notably, simplification itself requires verification and so it is in a sense part of the verification cost, too.
kaicianflone
Thank you for the post. It's a good read. I'm working on governance/validation layers for n-LLM systems and making them observable, so your comments on runaway AIs resonated with me. My research points me to reputation- and stake-based consensus mechanisms as the validation layer, either pre-inference or pre-execution, and the time to verify decisions can be skipped with enough "decision liquidity" via reputation alone, aka decision precedence.
thorn
Very deep post on the problem. AI seems to worsen the issue of software correctness and given the nature of business, it won’t be ever solved.
neya
The fundamental problem is that the verification loop for the average developer is grounded not in tests but in results. Write code, reload browser, check output. Does it work the way I want? Good. We're done here.
Not: write code, write tests, ensure all test cases are covered. Now imagine such a flaky foundation being used to build on top of even more untested code. That's how bad-quality software (that's usually unfixable without a major rewrite) is born.
Also, most vibe-coders don't have enough experience/knowledge to figure out what is wrong with the code generated by the AI. For that, you need to know more than the AI and have a strong foundation in the domain you're working in. Here is an example: you ask the AI to write the code for a comment form. It generates the backend and frontend code for you (let's say React/Svelte/Vue/whatever). The vibe-coder sees the UI - most likely written in Tailwind CSS - and thinks "wow, that looks really good!" and clicks approve. However, an experienced person might notice the form does not have CSRF protection in place. The vibe-coder might not even be aware of the concept of CSRF (let alone the OWASP Top 10 security risks).
Hence, the fundamental problem is not knowing about the domain more than the AI to pick up the flaw. Unless this fundamental problem is solved - which I don't think it will anytime soon because everyone can generate code + UI these days, I don't see a solution to the verification problem.
However, this is good news for consultants and the like, because it creates more work down the line to fix the vibe-coded mess after they got hacked the very next day, and we can charge them a rush fee on top of it, too. So it's not all that bad.
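The CSRF gap described in the comment above can be sketched in a few lines. This is a framework-free, stdlib-only illustration of the synchronizer-token idea (in a real app you would rely on your framework's CSRF middleware, not hand-roll this):

```python
# Minimal synchronizer-token sketch: bind a token to the session with
# an HMAC so a cross-site attacker, who cannot read the token, cannot
# forge a valid POST. This is the check the vibe-coded form omits.
import hashlib
import hmac
import secrets

SECRET = secrets.token_bytes(32)  # per-deployment server secret

def issue_csrf_token(session_id: str) -> str:
    """Token embedded in the comment form for this session."""
    return hmac.new(SECRET, session_id.encode(), hashlib.sha256).hexdigest()

def is_valid_csrf(session_id: str, token: str) -> bool:
    """Check a submitted token; compare_digest avoids timing leaks."""
    expected = issue_csrf_token(session_id)
    return hmac.compare_digest(expected, token)

# A POST handler would reject the submission unless is_valid_csrf()
# passes for the requester's session.
```

The point of the example is only that the missing check is invisible in the rendered UI, which is exactly why "it looks good, approve" fails as a review process.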
Spivak
My impression has been with frontend tests is they come in two flavors: extremely hard to write to the point of impracticality, and useless. Most orgs settle for the latter and end up testing easily testable things like the final rendered dom and rely on human qa for all the hard bits like "does the page actually look like it's supposed to, does interacting with the UI elements behave correctly, and do the flows actually work."
All largely stemming from the fact that tests can't meaningfully see and interact with the page like the end user will.
neya
> extremely hard to write to the point of impracticality, and useless

> Most orgs settle for the latter and end up testing easily testable things like the final rendered dom and rely on human qa for all the hard bits like "does the page actually look like it's supposed to
Not disagreeing with you here, but what ends up happening is that the frontend works flawlessly in the browser/device being tested but has tons of bugs in the others. The best examples of this are most banking apps, corporate portals, etc. Honestly, you can get away without writing tests for the frontend. But the backend? That directly affects the security of the software, and you cannot afford to skip important tests there, at least.
krackers
>can't meaningfully see and interact with the page like the end user will
Isn't this a great use case for LLM tests? Have a "computer use agent" and then describe the parameters of the test as "load the page, then navigate to bar, expect foo to happen". You don't need the LLM to generate a test using puppeteer or whatever which is coupled to the specific dom, you just describe what should happen.
orbital-decay
Depends. Does it represent end users well enough? Does it hit the same edge cases as a million users would (especially considering poor variety of heavily post-trained models)? Does it generalize?
8note
> computer use agent
They aren't good enough yet at all.
I got an agent to use the Windows UIA with success for a feedback loop, and it got the code from not working very well to basically done overnight, but without the MCP having good feedback and tagged/ID'ed buttons and so on, the computer use was just garbage.
globular-toast
This is actually how a lot of software is written, sadly. I used to call it "trial and error programming". I've observed people writing C++ who do not have a mental model of memory but just try stuff until it compiles. For some classes of software, like games, that is acceptable, but for others it would be horrifying.
Now, it is actually completely possible to write UI code without any unit tests in a completely safe way. You use the functional core, imperative shell approach. When all your domain logic is in a fully tested, functional core, you can just go ahead and write "what works" in a thin UI shell. Good luck getting an LLM to rigidly conform to such an architecture, though.
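The functional-core / imperative-shell split described above can be sketched in a few lines (the pagination domain here is hypothetical, just to make the split concrete):

```python
# Functional core, imperative shell: all decisions live in pure,
# easily unit-tested functions; the shell just wires them to I/O
# and can safely be written "what works" style.

# --- functional core: pure, fully testable ---
def next_page(current: int, total: int, direction: int) -> int:
    """Pure pagination logic: step by direction, clamped to [0, total-1]."""
    return max(0, min(total - 1, current + direction))

# --- imperative shell: thin, no logic worth unit-testing ---
def on_next_clicked(state: dict) -> None:
    state["page"] = next_page(state["page"], state["total"], +1)
    print(f"rendering page {state['page']}")  # stand-in for real UI code

state = {"page": 0, "total": 3}
on_next_clicked(state)
```

Because every branch lives in `next_page`, the shell contains nothing that needs a unit test, which is what makes the untested "trial and error" UI layer tolerable.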
muraiki
The article says that AWS's Cedar authorization policy engine is written in Lean, but it's actually written in Dafny. Writing Dafny is a lot closer to writing "normal" code rather than the proofs you see in Lean. As a non-mathematician I gave up pretty early in the Lean tutorial, while in a recent prototype I learned enough Dafny to be semi-confident in reviewing Claude's Dafny code in about half a day.
The Dafny code formed a security kernel at the core of a service, enforcing invariants like that an audit log must always be written to prior to a mutating operation being performed. Of course I still had bugs, usually from specification problems (poor spec / design) or Claude not taking the proof far enough (proving only for one of a number of related types, which could also have been a specification problem on my part).
In the end I realized I'm writing a bunch of I/O-bound glue code, and plain ol' test-driven development was fine enough for my threat model. I can review Python code more quickly and accurately than Dafny (or the Go code it eventually had to link to), so I'm back to optimizing for humans again...
dranov
Cedar used to be written in Dafny, but AWS abandoned that implementation and rewrote it in Lean.
https://aws.amazon.com/blogs/opensource/lean-into-verified-s...
muraiki
Oh whoops, thank you for the correction! I didn't realize that.
cpeterso
Looks like LLMs also find Dafny easier to write than Lean. This study, “A benchmark for vericoding: formally verified program synthesis”, reports:
> We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications … We find vericoding success rates of 27% in Lean, 44% in Verus/Rust and 82% in Dafny using off-the-shelf LLMs.
nextos
Not surprising, as Dafny is a bit less expressive (refinement instead of dependent types) and therefore easier to write. IMHO, it hits a very nice sweet spot. The disadvantage of Dafny is the lack of manual tactics to prove things when SAT/SMT automation fails. But this is getting fixed.
JuniperMesos
There's multiple Lean tutorials, some of which are more mathy than others. One of the things I like about Lean is precisely that it's an ordinary, Haskell-style functional programming language in addition to having all the Curry-Howard-isomorphism-based mathematical proof machinery. You can write `cat` in Lean.
maltalex
Maybe I'm missing something, but isn't this the same as writing code, but with extra steps?
Currently, engineers work with loose specifications, which they translate into code. With the proposed approach, they would need to first convert those specifications into a formally verifiable form before using LLMs to generate the implementation.
But to be production-ready, that spec would have to cover all possible use-cases, edge cases, error handling, performance targets, security and privacy controls, etc. That sounds awfully close to being an actual implementation, only in a different language.
hwayne
The key bit is that specifications don't need to be "obviously computable", so they can be a lot simpler than the code that implements them. Consider the property "if some function has a reference to a value, that value will not change unless that function explicitly changes it". It's simple enough to express, but to implement it Rust needs the borrow checker, which is a pretty heavy piece of engineering. And proving the implementation actually guarantees that property isn't easy, either!
undefined
nextos
Formal specifications can be easier to write than code. Take refinement types in Dafny, for example. Because they are higher-level, you don't need to bother with tedious implementation details. By shifting our focus to formal specifications rather than manual coding, we could possibly leverage automation to not only develop software faster but also achieve a higher degree of assurance.
Of course, this remains largely theoretical for now, but it is an exciting possibility. Note high-level specifications often overlook performance issues, but they are likely sufficient for most scenarios. Regardless, we have formal development methodologies able to decompose problems to an arbitrary level of granularity since the 1990s, all while preserving correctness. It is likely that many of these ideas will be revisited soon.
_pdp_
I think the issue goes even deeper than verification. Verification is technically possible. You could, in theory, build a C compiler or a browser and use existing tests to confirm it works.
The harder problem is discovery: how do you build something entirely new, something that has no existing test suite to validate against?
Verification works because someone has already defined what "correct" looks like. There is possibly a spec, or a reference implementation, or a set of expected behaviours. The system just has to match them.
But truly novel creation does not have ground truth to compare against and no predefined finish line. You are not just solving a problem. You are figuring out what the problem even is.
Avshalom
Well that's a problem the software industry has been building for itself for decades.
Software has, since at least the adoption of "agile", created an industry culture of not just refusing to build to specs but of insisting that specs are impossible to get from a customer.
daveguy
Agile hasn't been insisting that specs are impossible to get from a customer. They have been insisting that getting specs from a customer is best performed as a dynamic process. In my opinion, that's one of agile's most significant contributions. It lines up with a learning process that doesn't assume the programmer or the customer knows the best course ahead of time.
bunderbunder
I have found that it works well as an open-endedly dynamic process when you are doing the kind of work that the people who came up with Scrum did as their bread and butter: limited-term contract jobs that were small enough to be handled by a single pizza-sized team and whose design challenges mostly don’t stray too far outside the Cynefin clear domain.
The less any of those applies, the more costly it is to figure it out as you go along, because accounting for design changes can become something of a game of crack the whip. Iterative design is still important under such circumstances, but it may need to be a more thoughtful form of iteration that’s actively mindful about which kinds of design decisions should be front-loaded and which ones can be delayed.
skydhash
And good luck when getting misaligned specs (communication issues on the customer side, docs that are not aligned with the product, ...). Drafting specs and investigating failures will require both a diplomat hat and a detective hat. Maybe with the developer hat, we will get DDD being meaningful again.
user3939382
I don’t want to put words in your mouth but I think I agree. It’s called requirements engineering. It’s hard, but it’s possible, and waterfall works fine for many domains. I see Agile teams burning resources doing the same thing 2-3x or sprinting their way into major, costly architectural mistakes that would have been easily avoided by upfront planning and specs.
bigfishrunning
I always try to get the customer to provide specs, and failing that, to agree to specs before we start working. It's usually very difficult.
pydry
Agile is a pretty badly defined beast at the best of times, but even the most twisted interpretation doesn't mean that. It's mainly just a rejection of BDUF (big design up front).
galaxyLogic
There are two ways of thinking about tests:
A) They let you verify that implementation passes its spec, more or less.
B) They are a (trustworthy) description of how the system behaves; they allow you to understand the system better.
randusername
> Most people think of verification as a cost, a tax on development, justified only for safety-critical systems. That framing is outdated.
> The value is not in the verification workforce. It is in what verified delivery enables
These takes misrepresent safety-critical software verification. The verification workforce is indispensable. Whereas the software engineers are usually highly specialized, the verification team owns the domain and project knowledge to make sure the whole thing integrates. They consult on requirements, anticipate oversights, build infrastructure and validate it, and execute the verification.
So when the business needs a go / no-go call, they ask the verification lead.
Automation and proofs have their place, but I don't see how verification without human accountability is workable.
voxleone
What’s striking here is the convergence on a minimal axiomatic kernel (Lean) as the only scalable way to guarantee coherent reasoning. Some of us working on foundational physics are exploring the same methodological principle. In the “Functional Universe” framework[0], for example, we start from a small set of axioms and attempt to derive physical structure from that base.
The domains are different, but the strategy is similar: don’t rely on heuristics or empirical patching; define a small trusted core of axioms and generate coherent structure compositionally from there.
SurvivorForge
The uncomfortable truth is that most teams were already bad at verification before AI entered the picture. The difference is that AI-generated code comes in faster, so the verification bottleneck becomes painfully obvious. I think the real opportunity here is that AI forcing us to think harder about specifications and correctness might actually improve the quality of human-written code too — it's making us confront the fact that "it works on my machine" was never real verification.
phyzix5761
I found my best use of AI for writing code is with a TDD approach. I write 1 test, watch it fail, let the AI implement the solution, and then if I like it I commit. Write the next test and repeat.
It keeps me in the loop, I'm testing actual functionality rather than code, and my code is always in a state where I can open a PR and merge it back to main.
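That loop can be sketched in Rust (the `slugify` function and its test are invented for illustration): the test is written first, fails, and only then is the implementation filled in and committed.

```rust
// Step 2: the implementation, written only after the test below failed.
fn slugify(s: &str) -> String {
    s.trim().to_lowercase().replace(' ', "-")
}

// Step 1: the test, written first. `cargo test` fails until `slugify` exists.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn spaces_become_dashes() {
        assert_eq!(slugify("  Hello World "), "hello-world");
    }
}
```

Because each cycle ends green, the branch is always mergeable.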
faitswulff
Someone actually did fuzz the Claude C compiler:
> The Claude C Compiler illustrates the other side: it optimizes for
> passing tests, not for correctness. It hard-codes values to satisfy
> the test suite. It will not generalize.
This is one of the pain points I am suffering at work: coworkers ask coding agents to generate some code, and then to generate test coverage for that code. The LLM happily churns out unit tests which simply reinforce the existing behaviour of the code. At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
The icing on the cake is that LLMs are producing so much code that humans are just rubber stamping all of it. Off to merge and build it goes.
I have no constructive recommendations; I feel the industry will keep their foot on the pedal until something catastrophic happens.
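The failure mode above can be made concrete (a hypothetical `discounted_price` and an invented 20%-off business rule, purely for illustration): a test generated *from* buggy code simply freezes the bug in place.

```rust
// The "business rule" is: price minus 20%. The code gets it wrong.
fn discounted_price(price: f64) -> f64 {
    price * 0.2 // bug: returns the discount amount, not the discounted price
}

fn main() {
    // A test derived from the code's current behaviour passes happily:
    assert_eq!(discounted_price(100.0), 20.0);
    // A test derived from the business rule would instead demand:
    // assert_eq!(discounted_price(100.0), 80.0); // fails against the code above
}
```

Both tests are "green coverage" to a reviewer who only rubber-stamps; only the second one verifies anything.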