Brian Lovin / Hacker News Daily Digest

Get the top HN stories in your inbox every day.

haolez

> Notice the language: “deeply”, “in great details”, “intricacies”, “go through everything”. This isn’t fluff. Without these words, Claude will skim. It’ll read a file, see what a function does at the signature level, and move on. You need to signal that surface-level reading is not acceptable.

This makes no sense to my intuition of how an LLM works. It's not that I don't believe this works, but my mental model doesn't capture why asking the model to read the content "more deeply" will have any impact on whatever output the LLM generates.

nostrademons

It's the attention mechanism at work, along with a fair bit of Internet one-upmanship. The LLM has ingested all of the text on the Internet, as well as GitHub code repositories, pull requests, StackOverflow posts, code reviews, mailing lists, etc. In a number of those content sources, there will be people saying "Actually, if you go into the details of..." or "If you look at the intricacies of the problem" or "If you understood the problem deeply", followed by a very deep, expert-level explication of exactly what you should've done differently. You want the model to use the code in the correction, not the one in the original StackOverflow question.

Same reason that "Pretend you are an MIT professor" or "You are a leading Python expert" or similar works in prompts. It tells the model to pay attention to the part of the corpus that has those terms, weighting them more highly than all the other programming samples that it's run across.
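
A trivial sketch of the mechanism being described (illustrative only; `with_persona` is a made-up helper, and the persona string is the entire trick):

    def with_persona(task: str, persona: str | None = None) -> str:
        """Prepend an expert-persona framing to a task prompt.

        The persona adds no new facts; it only signals which register
        of the training corpus the model should imitate.
        """
        if persona is None:
            return task
        return f"You are {persona}.\n\n{task}"

    task = "Review this function and explain what it does."
    print(with_persona(task))
    print(with_persona(task, "a leading Python expert doing a deep code review"))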

manmal

I don't think this is a result of the base training data ("the internet"). It's a post-training behavior, created during reinforcement learning. Codex has a totally different behavior in that regard. Codex by default reads a lot of potentially relevant files before it goes and writes files.

Maybe you remember that, without reinforcement learning, the models of 2019 just completed the sentences you gave them. There were no tool calls like reading files. Tool-calling behavior is company-specific and highly tuned to their harnesses. How often they call a tool is not part of the base training data.

spagettnet

Modern LLMs are certainly fine-tuned on data that includes examples of tool use, mostly the tools built into their respective harnesses, but also external/mock tools so they don't overfit on only using the toolset they expect to see in their harnesses.

xscott

Of course I can't be certain, but I think the "mixture of experts" design plays into it too. Metaphorically, there's a mid-level manager who looks at your prompt and tries to decide which experts it should be sent to. If he thinks you won't notice, he saves money by sending it to the undergraduate intern.

Just a theory.

victorbjorklund

Notice that MoE isn't different experts for different types of problems. It's per token, and not really connected to problem type.

So if you send Python code, the first token in a function can go to one expert, the second to another expert, and so on.
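
A toy sketch of that per-token routing (random weights, purely illustrative): the gate scores each token's hidden vector against every expert and keeps the top-k, so neighboring tokens in the same Python snippet can easily land on different experts.

    import numpy as np

    rng = np.random.default_rng(0)
    n_experts, d_model, top_k = 8, 16, 2
    W_gate = rng.standard_normal((d_model, n_experts))  # router ("gate") weights

    def route(token_vec: np.ndarray) -> list[int]:
        """Pick the top-k experts for one token's hidden vector."""
        scores = token_vec @ W_gate  # one score per expert
        return np.argsort(scores)[-top_k:].tolist()

    # Three consecutive tokens from the same prompt, routed independently:
    for i, tok in enumerate(rng.standard_normal((3, d_model))):
        print(f"token {i} -> experts {route(tok)}")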

aakresearch

This is a very useful take, thank you. Really helped me to adjust my mental model without "anthropomorphising" the machinery. Upvoted.

If I may, I would re-phrase/expand your last sentence in a way that makes it even more useful for me, personally. Maybe it could help other people too. I think it is fair to say that in the presence of hints like "Pretend you are X" or "Take a deeper look", the inference mechanism (driven by its training weights, and now influenced by those hints via "attention math") is not "satisfied" until it pulls more relevant tokens into the "working context" ("more" and "relevant" being modulated by the particular hint).

r0b05

This is such a good explanation. Thanks

hbarka

>> Same reason that "Pretend you are an MIT professor" or "You are a leading Python expert" or similar works in prompts.

This pretend-you-are-a-[persona] is cargo cult prompting at this point. The persona framing is just decoration.

A brief purpose statement describing what the skill [skill.md] does is more honest and just as effective.

rescbr

I think it does more harm than good on recent models. The LLM has to override its system prompt to role-play, wasting context and computing cycles instead of working on the task.

LEDThereBeLight

It’s not cargo culting, it does make a difference and there are papers on arxiv discussing it. The trouble is that it’s hard to tell whether it’ll help or hurt - telling it to act as an expert in one field may improve your result, or may make it lose some of the other perspectives it has which might be more important for solving the problem.

dakolli

You will never convince me that this isn't confirmation bias, or the equivalent of a slot machine player thinking the order in which they push buttons impacts the output, or some other gambler-esque superstition.

These tools are literally designed to make people behave like gamblers. And it's working, except the house in this case takes the money you give them and lights it on fire.

nubg

Your ignorance is my opportunity. May I ask which markets you are developing for?

FuckButtons

That’s because it’s superstition.

Unless someone can come up with some kind of rigorous statistics on what the effect of this kind of priming is it seems no better than claiming that sacrificing your first born will please the sun god into giving us a bountiful harvest next year.

Sure, maybe this supposed deity really is this insecure and needs a jolly good pep talk every time he wakes up. Or maybe you're just suffering from the magical thinking that your incantations had any effect on the random-variable word machine.

The thing is, you could actually prove it: it's an optimization problem, you have a model, you can generate the statistics. But no one, as far as I can tell, has been terribly forthcoming with that, either because those that have tried have decided to keep their magic spells secret, or because it doesn't really work.

If it did work, well, the oldest trick in computer science is writing compilers; I suppose we will just have to write an English-to-pedantry compiler.

stingraycharles

I actually have a prompt optimizer skill that does exactly this.

https://github.com/solatis/claude-config

It's based entirely on academic research, and a LOT of research has been done in this area.

One of the papers you may be interested in is on "emotion prompting": e.g. "it is super important for me that you do X" actually works.

“Large Language Models Understand and Can be Enhanced by Emotional Stimuli”

https://arxiv.org/abs/2307.11760

bavell

Thanks for sharing! I've been gravitating towards this sort of workflow already - just seems like the right approach for these tools.

majormajor

> If it did work, well, the oldest trick in computer science is writing compilers; I suppose we will just have to write an English-to-pedantry compiler.

"Add tests to this function" for GPT-3.5-era models was much less effective than "you are a senior engineer. add tests for this function. as a good engineer, you should follow the patterns used in these other three function+test examples, using this framework and mocking lib." In today's tools, "add tests to this function" results in a bunch of initial steps to look in common places to see if that additional context already exists, and then pull it in based on what it finds. You can see it in the output the tools spit out while "thinking."

So I'm 90% sure this is already happening on some level.

GrinningFool

But can you see the difference if you only include "you are a senior engineer"? It seems like the comparison you're making is between "write the tests" and "write the tests following these patterns using these examples. Also btw you’re an expert. "

FuckButtons

Today's LLMs have had a tonne of deep RL using git histories from more software projects than you've ever even heard of. Given the latency of a response, I doubt there's any intermediate preprocessing; it's just what the model has been trained to do.

imiric

> That’s because it’s superstition.

This field is full of it. Practices are promoted by those who tie their personal or commercial brand to it for increased exposure, and adopted by those who are easily influenced and don't bother verifying if they actually work.

This is why we see a new Markdown format every week, "skills", "benchmarks", and other useless ideas, practices, and measurements. Consider just how many "how I use AI" articles are created and promoted. Most of the field runs on anecdata.

It's not until someone actually takes the time to evaluate some of these memes that they find little to no practical value in them.[1]

[1]: https://news.ycombinator.com/item?id=47034087

oblio

> This field is full of it. Practices are promoted by those who tie their personal or commercial brand to it for increased exposure, and adopted by those who are easily influenced and don't bother verifying if they actually work.

Oh, the blasphemy!

So, like VB, PHP, JavaScript, MySQL, Mongo, etc? :-)

onion2k

> I suppose we will just have to write an English-to-pedantry compiler.

A common technique is to prompt your chosen AI to write a longer prompt that gets it to do what you want. It's used a lot in image generation. This is called "prompt enhancing".
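
A minimal sketch of that pattern (`complete` is a stand-in for whatever model client you use, and the enhancer instruction is invented):

    def complete(prompt: str) -> str:
        """Stand-in for your LLM client call."""
        raise NotImplementedError("wire up your model client here")

    ENHANCER = (
        "Rewrite the following image prompt into a detailed one: add subject, "
        "style, lighting, and composition. Output only the rewritten prompt.\n\n{p}"
    )

    def enhance(short_prompt: str) -> str:
        # One extra LLM call to expand a terse prompt before the real generation call.
        return complete(ENHANCER.format(p=short_prompt))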

rzmmm

I think "understand this directory deeply" just gives more focus for the instruction. So it's like "burn more tokens for this phase than you normally would".

jcdavis

It's a wild time to be in software development. Nobody(1) actually knows what causes LLMs to do certain things; we just pray the prompt moves the probabilities the right way enough that it mostly does what we want. This used to be a field that prided itself on deterministic behavior and reproducibility.

Now? We have AGENTS.md files that look like a parent talking to a child with all the bold all-caps, double emphasis, just praying that's enough to be sure they run the commands you want them to be running

(1 Outside of some core ML developers at the big model companies)

harrall

It’s like playing a fretless instrument to me.

I practice playing songs by ear, and after 2 weeks my brain has developed an inference model of where my fingers should go to hit any given pitch.

Do I have any idea how my brain’s model works? No! But it tickles a different part of my brain and I like it.

klipt

Sufficiently advanced technology has become like magic: you have to prompt the electronic genie with the right words or it will twist your wishes.

silversmith

Light some incense, and you too can be a dystopian space tech support, today! Praise Omnissiah!

chickensong

For Claude at least, the more recent guidance from Anthropic is to not yell at it. Just clear, calm, and concise instructions.

glerk

Yep, with Claude saying "please" and "thank you" actually works. If you build rapport with Claude, you get rewarded with intuition and creativity. Codex, on the other hand, you have to slap it around like a slave golem, and it will do exactly what you tell it to do, no more, no less.

joshmn

Sometimes I daydream about people screaming at their LLM as if it was a TV they were playing video games on.

trueno

wait, seriously? lmfao

that's hilarious. i definitely treat claude like shit and i've noticed the falloff in results.

if there's a source for that i'd love to read about it.

scuff3d

How anybody can read stuff like this and still take all this seriously is beyond me. This is becoming the engineering equivalent of astrology.

energy123

Anthropic recommends doing magic invocations: https://simonwillison.net/2025/Apr/19/claude-code-best-pract...

It's easy to know why they work. The magic invocation increases test-time compute (easy to verify yourself - try!). And an increase in test-time compute is demonstrated to increase answer correctness (see any benchmark).

It might surprise you to know that the only difference between GPT 5.2-low and GPT 5.2-xhigh is one of these magic invocations. But that's not supposed to be public knowledge.

gehsty

I think this was more of a thing on older models. Since I started using Opus 4.5 I have not felt the need to do this.

cloudbonsai

The evolution of software engineering is fascinating to me. We started by coding in thin wrappers over machine code and then moved on to higher-level abstractions. Now, we've reached the point where we discuss how we should talk to a mystical genie in a box.

I'm not being sarcastic. This is absolutely incredible.

intrasight

And I've been at it long enough to go through that whole progression. Actually, from the earlier step of writing machine code. It's been, and continues to be, a fun journey, which is why I'm still working.

yawnr

Nice to hear someone say it. Like what are we even doing? It's exhausting.

sumedh

We have tests and benchmarks to measure it though.

fragmede

Feel free to run your own tests and see if the magic phrases do or do not influence the output. Have it make a Todo webapp with and without those phrases and see what happens!

scuff3d

That's not how it works. It's not on everyone else to prove claims false, it's on you (or the people who argue any of this had a measurable impact) to prove it actually works. I've seen a bunch of articles like this, and more comments. Nobody I've ever seen has produced any kind of measurable metrics of quality based on one approach vs another. It's all just vibes.

Without something quantifiable, it's not much better than someone who always wears the same jersey when their favorite team plays, and swears they play better because of it.

hashmap

these sort-of-lies might help:

think of the latent space inside the model like a topological map, and when you give it a prompt, you're dropping a ball at a certain point above the ground, and gravity pulls it along the surface until it settles.

caveat though, that's nice per-token, but the signal gets messed up by picking a token from a distribution, so with each token you're regenerating and re-distorting the signal. leaning on language that places that ball deep in a region you want to be in makes it less likely that those distortions will kick it out of the basin or valley you want to end up in.

if the response you get is 1000 tokens long, the initial trajectory needed to survive 1000 probabilistic filters to get there.

or maybe none of that is right lol but thinking that it is has worked for me, which has been good enough
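
The "each token re-distorts the signal" part is ordinary temperature sampling; a toy sketch (made-up logits) of why a sharply peaked distribution survives repeated draws while a flattened one wanders:

    import numpy as np

    rng = np.random.default_rng(1)

    def sample(logits: np.ndarray, temperature: float) -> int:
        """Draw one token id from softmax(logits / T)."""
        z = logits / temperature
        p = np.exp(z - z.max())
        p /= p.sum()
        return int(rng.choice(len(p), p=p))

    logits = np.array([2.0, 1.5, 0.2, -1.0])  # one next-token distribution
    for T in (0.2, 1.0, 2.0):
        draws = [sample(logits, T) for _ in range(20)]
        print(f"T={T}: {draws}")  # low T stays in the basin; high T kicks the ball out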

noduerme

Hah! Reading this, my mind inverted it a bit, and I realized ... it's like the claw machine theory of gradient descent. Do you drop the claw into the deepest part of the pile, or where there's the thinnest layer, the best chance of grabbing something specific? Everyone in every bar has a theory about claw machines. But the really funny thing that unites LLMs with claw machines is that the biggest question is always whether they dropped the ball on purpose.

The claw machine is also a sort-of-lie, of course. Its main appeal is that it offers the illusion of control. As a former designer and coder of online slot machines... I could totally spin off into pages on this analogy, about how that illusion gets you to keep pulling the lever... but the topographic rendition you gave is sort of priceless when you start making the comparison.

basch

My mental model for them is plinko boards. Your prompt changes the spacing between the nails to increase the probability in certain directions as your chip falls down.

hashmap

i literally suggested this metaphor yesterday to someone trying to get agents to do stuff they wanted: they had to set up their guardrails in a way that lets the agents do what they're good at, and you'll get better results because you're not sitting there looking at them.

i think once you start seeing that the behavior falls right out of the geometry, you just start looking at stuff like that. still funny though.

Betelbuddy

It's very logical and pretty obvious when you do code generation. If you ask the same model to generate code by starting with:

- You are a Python Developer... or

- You are a Professional Python Developer... or

- You are one of the world's most renowned Python Experts, with several books written on the subject, and 15 years of experience in creating highly reliable production-quality code...

You will notice a clear improvement in the quality of the generated artifacts.

gehsty

Do you think that Anthropic don't include things like this in their harness / system prompts? I feel like this kind of prompt is unnecessary with Opus 4.5 onwards, obviously based on my own experience (I used to do this; on switching to Opus I stopped, and have implemented more complex problems, more successfully).

I am having the most success describing what I want as humanly as possible, describing outcomes clearly, making sure the plan is good and clearing context before implementing.

hu3

Maybe, but forcing code generation in a certain way could ruin hello worlds and simpler code generation.

Sometimes the user just wants something simple instead of enterprise grade.

obiefernandez

My colleague swears by his DHH claude skill https://danieltenner.com/dhh-is-immortal-and-costs-200-m/

bavell

Haha, this reminds me of all the stable diffusion "in the style of X artist" incantations.

haolez

That's different. You are pulling the model, semantically, closer to the problem domain you want it to attack.

That's very different from "think deeper". I'm just curious about this case in specific :)

argee

I don't know about some of those "incantations", but it's pretty clear that an LLM can respond to "generate twenty sentences" vs. "generate one word". That means you can indeed coax it into more verbosity ("in great detail"), and that can help align the output by having more relevant context (inserting irrelevant context or something entirely improbable into LLM output and forcing it to continue from there makes it clear how detrimental that can be).

Of course, that doesn't mean it'll definitely be better, but if you're making an LLM chain it seems prudent to preserve whatever info you can at each step.

fragmede

Yeah, it's definitely a strange new world we're in, where I have to "trick" the computer into cooperating. The other day I told Claude "Yes you can", and it went off and did something it just said it couldn't do!

bpodgursky

You bumped the token predictor into the latent space where it knew what it was doing : )

itypecode

Solid dad move. XD

wilkystyle

Is parenting making us better at prompt engineering, or is it the other way around?

optimalsolver

The little language model that could.

computomatic

If I say “you are our domain expert for X, plan this task out in great detail” to a human engineer when delegating a task, 9 times out of 10 they will do a more thorough job. It’s not that this is voodoo that unlocks some secret part of their brain. It simply establishes my expectations and they act accordingly.

To the extent that LLMs mimic human behaviour, it shouldn’t be a surprise that setting clear expectations works there too.

sparin9

I think the real value here isn’t “planning vs not planning,” it’s forcing the model to surface its assumptions before they harden into code.

LLMs don’t usually fail at syntax. They fail at invisible assumptions about architecture, constraints, invariants, etc. A written plan becomes a debugging surface for those assumptions.

maxnevermind

Yeap, I recently came to the realization that it is useful to think about LLMs as assumption engines. They have trillions of assumptions and fill the gaps when they see the need. As I understand it, the assumptions are based on industry standards; if those deviate from what you are trying to build, then you might start having problems. Like when you try to implement a solution which is not "googlable": the LLM will assume some standard way to do it and will keep pushing it. Then you have to provide more context, but if you have to spend too much time on providing the context, you might not save that much time in the end.

remify

Sub-agents also help a lot in that regard. Have an agent do the planning, have an implementation agent write the code, and have another one do the review. Clear responsibilities help a lot.

There's also blue team / red team, which works.

The idea is always the same: help the LLM reason properly with fewer, clearer instructions.

jalopy

This sounds very promising. Any link to more details?

hinkley

A huge part of getting autonomy as a human is demonstrating that you can be trusted to police your own decisions up to a point that other people can reason about. Some people get more autonomy than others because they can be trusted with more things.

All of these models are kinda toys as long as you have to manually send a minder in to deal with their bullshit. If we can do it via agents, then the vendors can bake it in, and they haven't. Which is just another judgement call about how much autonomy you give to someone who clearly isn't policing their own decisions and thus is untrustworthy.

If we're at the start of the Trough of Disillusionment now, which maybe we are and maybe we aren't, that'll be part of the rebound that typically follows the trough. But the Trough is also typically the end of the mountains of VC cash, so the cost per use goes up, which can trigger aftershocks.

vincentvandeth

This approach sounds clean in theory, but in production you're building a black box. When your planning agent hands off to an implementation agent and that hands off to a review agent — where did the bug originate? Which agent's context was polluted? Good luck tracing that. I went the opposite direction: single agent per task, strict quality gates between steps, full execution logs. No sub-agents. Every decision is traceable to one context window. The governance layer (PR gates, staged rollouts, acceptance criteria) does the work that people expect sub-agents to do — but with actual observability.

After 6 months in production and 1100+ learned patterns: fewer moving parts, better debugging, more reliable output. Built a full production crawler this way — 26 extractors, 405 tests — without sub-agents. Orchestrator acts as gatekeeper that redispatches uncompleted work.

gck1

> Every decision is traceable to one context window

There are no models that can do all the mentioned steps in a single usable context window. This is why subagents or multi-agent orchestrators exist in the first place.

antonvs

Since the phases are sequential, what’s the benefit of a sub agent vs just sequential prompts to the same agent? Just orchestration?

edmundsauto

Context pollution, I think. Just because something is sequential in a context file doesn’t mean it’ll happen sequentially, but if you use subagents there is a separation of concerns. I also feel like one bloated context window feels a little sloppy in the execution (and costs more in tokens).

YMMV, I’m still figuring this stuff out

drivebyhooting

This runs counter to the advice in the fine article: one long continuous session building context.

synergy20

I think Claude Code is doing this in the background now.

vagab0nd

I recently learned a trick to improve an LLM's thinking (maybe it's well known?):

Requesting { "output": "x" } consistently fails, despite detailed instructions.

Changing to requesting { "output": "x", "reasoning": "y" } produces the desired outcome.
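
A minimal sketch of the two request shapes (field names from the comment above; one added assumption is putting "reasoning" before "output", since tokens are generated strictly left to right):

    import json

    task = "Classify the sentiment of: 'the update broke everything again'"

    # Shape that tends to fail: the model must commit to an answer immediately.
    prompt_v1 = task + '\nRespond with JSON: {"output": "<label>"}'

    # Shape that tends to work: the reasoning field buys the model tokens to
    # "think" before it commits to the answer.
    prompt_v2 = task + '\nRespond with JSON: {"reasoning": "<why>", "output": "<label>"}'

    def parse_output(response_text: str) -> str:
        return json.loads(response_text)["output"]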

asdxrfx

It's also great to describe the full use-case flow in the instructions, so you can be confident the LLM won't do some stupid thing on its own.

maccard

> LLMs don’t usually fail at syntax?

Really? My experience has been that it's incredibly easy to get them stuck in a loop on a hallucinated API and burn through credits before I've even noticed what it's done. I have a small Rust project that stores stuff on disk that I wanted to add an S3 backend to - Claude Code burned through my $20 in a loop in about 30 minutes, without any awareness of what it was doing, on a very simple syntax issue.

kertoip_1

Might depend on the language used. From my experience, Claude Sonnet indeed never makes any syntax mistakes in JS/TS/C#, but those are popular languages with lots of training data.

hun3

Except that merely surfacing them changes their behavior, like how you add that one printf() call and now your heisenbug is suddenly nonexistent

MagicMoonlight

Did you just write this with ChatGPT?

zenoprax

I've never seen an LLM use "etc" but the rest gives a strong "it's not just X, it's Y" vibe.

I really hope the fine-tuning of our slop detectors can help with misinformation and bullshit detection.

brandall10

I go a bit further than this and have had great success with 3 doc types and 2 skills:

- Specs: these are generally static, but updatable as the project evolves. And they're broken out to an index file that gives a project overview, a high-level arch file, and files for all the main modules. Roughly ~1k lines of spec for 10k lines of code, and try to limit any particular spec file to 300 lines. I'm intimately familiar with every single line in these.

- Plans: these are the output of a planning session with an LLM. They point to the associated specs. These tend to be 100-300 lines and 3 to 5 phases.

- Working memory files: I use both a status.md (3-5 items per phase, roughly 30 lines overall), which points to the latest plan, and a project_status (100-200 lines), which tracks the current state of the project and is instructed to compact past efforts to keep it lean.

- A planner skill I use w/ Gemini Pro to generate new plans. It essentially explains the specs/plans dichotomy and the role of the status files, and is told to review everything in the pertinent areas of code and give me a handful of high-level next features to address based on shortfalls in the specs or things noted in the project_status file. Based on what it presents, I select a feature or improvement to generate. Then it proceeds to generate a plan, updates a clean status.md that points to the plan, and adjusts project_status based on the state of the prior completed plan.

- An implementer skill in Codex that goes to town on a plan file. It's fairly simple, it just looks at status.md, which points to the plan, and of course the plan points to the relevant specs so it loads up context pretty efficiently.

I've tried the two main spec generation libraries, which were way overblown, and then I gave superpowers a shot... which was fine, but still too much. The above is all homegrown, and I've had much better success because it keeps the context lean and focused.

And I'm only on the $20 plans for Codex/Gemini vs. spending $100/month on CC for the half year prior, and I move quicker, with no stall-outs due to token consumption, which was regularly happening w/ CC by the 5th day. Codex rarely dips below 70% available context when it puts up a PR after an execution run. Roughly 4/5 PRs are without issue, which is flipped against what I experienced with CC even when only using planning mode.
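
A sketch of the layout being described (file names invented; the sizes are the ones from the comment):

    docs/
      specs/
        index.md          # project overview
        architecture.md   # high-level arch
        module_foo.md     # one file per main module, ~300 lines max
      plans/
        plan_0042_foo.md  # output of a planning session: 100-300 lines, 3-5 phases
      status.md           # 3-5 items per phase, ~30 lines, points at the latest plan
      project_status.md   # 100-200 lines, compacted history of past efforts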

jcurbo

This is pretty much my approach. I started with some spec files for a project I'm working on right now, based on some academic papers I've written. I ended up going back and forth with Claude, building plans, pushing info back into the specs, expanding that out and I ended up with multiple spec/architecture/module documents. I got to the point where I ended up building my own system (using claude) to capture and generate artifacts, in more of a systems engineering style (e.g. following IEEE standards for conops, requirement documents, software definitions, test plans...). I don't use that for session-level planning; Claude's tools work fine for that. (I like superpowers, so far. It hasn't seemed too much)

I have found it to work very well with Claude by giving it context and guardrails. Basically I just tell it "follow the guidance docs" and it does. Couple that with intense testing and self-feedback mechanisms and you can easily keep Claude on track.

I have had the same experience with Codex and Claude as you in terms of token usage. But I haven't been happy with my Codex usage; Claude just feels like it's doing more of what I want in the way I want.

brandall10

IME, Claude is more powerful, but Codex follows instructions better. So the more precise the context, the better results you'll get with Codex.

Claude OTOH works better with ambiguity, but it also tends to stray a bit off spec in subtle ways. I always had to take more corrective action w/ the PRs it produced.

That said, I haven't used CC in 3 months and the latest models may be better.

gck1

This looks very similar to what I'm doing. Few questions:

- How do you address spec drift? A new feature can easily affect 2 or 3 specs. Do you update them manually? Is a new feature part of a new spec, or do you update the spec and then plan based on the spec changes?

- How do you address plan drift? A plan may change as the implementer surfaces some issues with the spec, for example.

brandall10

- Whenever I have a change to suggest, I ask Gemini to review my docs/specs folder. I then describe the change I'm thinking of and ask it to modify the specs as it sees fit. I review those changes, ask questions or make suggestions/corrections, rinse/repeat until I'm satisfied. This tends to take about 5-6 iterations, esp. if the agent is adding or suggesting things I hadn't considered and want to dig in deeper on.

- I don't update plans from the past - any work that supersedes work from an earlier plan is simply a new plan. If during creation of a new plan I review the plan and decide I want to do something else that requires a spec update, I trash the plan, do the spec update, and rerun plan generation. Past plans can of course point to divergent specs, but that's not something I care about much, as plans are a self-contained enough story of the work that was done.

r1290

Looks good. Question - is it always better to use a monorepo in this new AI world? Vs breaking your app into separate repos? At my company we have like 6 repos all separate nextjs apps for the same user base. Trying to consolidate to one as it should make life easier overall.

throwup238

It really depends but there’s nothing stopping you from just creating a separate folder with the cloned repositories (or worktrees) that you need and having a root CLAUDE.md file that explains the directory structure and referencing the individual repo CLAUDE.md files.

oa335

Just put all the repos in one directory yourself. In my experience that works pretty well.

chickensong

AI is happy to work with any directory you tell it to. Agent files can be applied anywhere.

zmmmmm

I actually don't really like a few things about this approach.

First, the "big bang" approach of writing it all at once. You are going to end up with thousands of lines of code that were monolithically produced. I think it is much better to have it write the plan and formulate it as sensible technical steps that can be completed one at a time. Then you can work through them. I get that this is not very "vibe"-ish, but that is kind of the point. I want the AI to help me get to the same point I would be at anyway, with produced code AND an understanding of it, just to accelerate that process. I'm not really interested in just generating thousands of lines of code that nobody understands.

Second, the author keeps referring to adjusting the behaviour but never incorporating that into long-lived guidance. To me, integral to the planning process is building an overarching knowledge base. Every time you're telling it there's something wrong, you need to tell it to update the knowledge base about why, so it doesn't do it again.

Finally, no mention of tests? Just quick checks? To me, you have to end up with comprehensive tests. Maybe to the author it goes without saying, but I find it integral to build this into the planning. Certain stages will want certain types of tests: sometimes in advance of the code (TDD style), other times built alongside it or after.

It's definitely going to be interesting to see how software methodology evolves to incorporate AI support and where it ultimately lands.

girvo

The article's approach matches mine, but I've learned from exactly the things you're pointing out.

I get the PLAN.md (or equivalent) separated into "phases" or stages, then carefully prompt it (because Claude and Codex both love to "keep going") to only implement that stage and update the PLAN.md.

Tests are crucial too, and form another part of the plan really. Though my current workflow begins to build them later in the process than I would prefer...
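
A minimal skeleton of that phased PLAN.md shape (structure illustrative, not anyone's exact format):

    # PLAN: <feature name>

    ## Phase 1: refactor seams          [done]
    - extract the interface, keep behavior identical

    ## Phase 2: new implementation      [in progress]
    - implement behind the interface, with tests
    - STOP after this phase: update the statuses above, then wait for review

    ## Phase 3: migration + cleanup     [not started]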

alexrezvov

Cool, the idea of leaving comments directly in the plan never even occurred to me, even though it really is the obvious thing to do.

Do you mark up and then save your comments in any way, and have you tried keeping them so you can review the rules and requirements later?

red_hare

I use Claude Code for lecture prep.

I craft a detailed and ordered set of lecture notes in a Quarto file and then have a dedicated claude code skill for translating those notes into Slidev slides, in the style that I like.

Once that's done, much like the author, I go through the slides and make commented annotations like "this should be broken into two slides" or "this should be a side-by-side" or "use your generate clipart skill to throw an image here alongside these bullets" and "pull in the code example from ../examples/foo." It works brilliantly.

And then I do one final pass of tweaking after that's done.

But yeah, annotations are super powerful. Token distance in-context and all that jazz.

saxelsen

Can I ask how you annotate the feedback for it? Just with inline comments like `# This should be changed to X`?

The author mentions annotations but doesn't go into detail about how to feed the annotations to Claude.

red_hare

Slidev is markdown, so I do it in HTML comments. Usually something like:

    <!-- TODOCLAUDE: Split this into a two-cols-title, divide the examples between -->

or

    <!-- TODOCLAUDE: Use clipart skill to make an image for this slide -->

And then, when I finish annotating, I just say: "Address all the TODOCLAUDEs".

danyim

Thanks for sharing this method - such an elegant way to add annotations to generated specs!

malshe

Quarto can be used to output slides in various formats (Powerpoint, beamer for pdf, revealjs for HTML, etc.). I wonder why you use Slidev as you can just ask Claude Code to create another Quarto document.

sidpatil

It looks like Slidev is designed for presentations about software development, judging from its feature set. Quarto is more general-purpose. (That's not to say Quarto can't support the same features, but currently it doesn't.)

I'm not affiliated with Slidev. I was just curious.

ramoz

Is your skill open source?

red_hare

Not yet... but also I'm not sure it makes a lot of sense to be open source. It's super specific to how I like to build slide decks and to my personal lecture style.

But it's not hard to build one. The key for me was describing, in great detail:

1. How I want it to read the source material (e.g., H1 means new section, H2 means at least one slide, a link to an example means I want code in the slide)

2. How to connect material to layouts (e.g., "comparison between two ideas should be a two-cols-title," "walkthrough of code should be two-cols with code on right," "learning objectives should be side-title align:left," "recall should be side-title align:right")

Then the workflow is:

1. Give all those details and have it do a first pass.

2. Give tons of feedback.

3. At the end of the session, ask it to "make a skill."

4. Manually edit the skill so that you're happy with the examples.

mvkel

> the workflow I’ve settled into is radically different from what most people do with AI coding tools

This looks exactly like what Anthropic recommends as the best practice for using Claude Code. Textbook.

It also exposes a major downside of this approach: if you don't plan perfectly, you'll have to start over from scratch if anything goes wrong.

I've found a much better approach in doing a design -> plan -> execute in batches, where the plan is no more than 1,500 lines, used as a proxy for complexity.

My 30,000 LOC app has about 100,000 lines of plan behind it. Can't build something that big as a one-shot.

onion2k

> if you don't plan perfectly, you'll have to start over from scratch if anything goes wrong

This is my experience too, but it's pushed me to make much smaller plans and to commit things to a feature branch far more atomically so I can revert a step to the previous commit, or bin the entire feature by going back to main. I do this far more now than I ever did when I was writing the code by hand.

This is how developers should work regardless of how the code is being developed. I think this is a small but very real way AI has actually made me a better developer (unless I stop doing it when I don't use AI... not tried that yet.)

solarkraft

I do this too. Relatively small changes, atomic commits with extensive reasoning in the message (keeps important context around). This is a best practice anyway, but it used to be an excruciating amount of effort. Now it's easy!

Except that I'm still struggling with the LLM understanding its audience / the context of its utterances. Very often, after a correction, it will focus a lot on the correction itself, making for weird-sounding/confusing statements in commit messages and comments.

mnicky

> Very often, after a correction, it will focus a lot on the correction itself making for weird-sounding/confusing statements in commit messages and comments.

I've experienced that too. Usually when I request a correction, I add something like "Include only production-level comments (not changes)". Recently I also added a special instruction for this to CLAUDE.md.

sixtyj

LLMs are really eager to start coding (as interns are eager to start working), so the sentence “don’t implement yet” has to be used very often at the beginning of any project.

onion2k

Most LLM apps have a 'plan' or 'ask' mode for that.

jerryharri

We're learning the lessons of Agile all over again.

intrasight

We're learning how to be an engineer all over again.

The author's process is super close to what we were taught in Engineering 101, 40 years ago.

mattmanser

Developers should work by wasting lots of time making the wrong thing?

I bet if they did a time and motion study on this approach they'd find the classic:

"Thinks they're more productive, AI has actually made them less productive"

But lots of lovely dopamine from this false progress that gets thrown away!

onion2k

> Developers should work by wasting lots of time making the wrong thing?

Yes. In fact, that's not emphatic enough: HELL YES!

More specifically, developers should experiment. They should test their hypothesis. They should try out ideas by designing a solution and creating a proof of concept, then throw that away and build a proper version based on what they learned.

If your approach to building something is to implement the first idea you have and move on, then you are going to waste so much more time later refactoring things to fix architecture that paints you into corners, reimplementing things that didn't work for future use cases, fixing edge cases that you hadn't considered, and just paying off a mountain of tech debt.

I'd actually go so far as to say that if you aren't experimenting and throwing away solutions that don't quite work then you're only amassing tech debt and you're not really building anything that will last. If it does it's through luck rather than skill.

Also, this has nothing to do with AI. Developers should be working this way even if they handcraft their artisanal code carefully in vi.

abustamam

> Developers should work by wasting lots of time making the wrong thing?

Yes? I can't even count how many times I worked on something my company deemed was valuable only for it to be deprecated or thrown away soon after. Or, how many times I solved a problem but apparently misunderstood the specs slightly and had to redo it. Or how many times we've had to refactor our code because scope increased. In fact, the very existence of the concepts of refactoring and tech debt proves that devs often spend a lot of time making the "wrong" thing.

Is it a waste? No, it solved the problem as understood at the time. And we learned stuff along the way.

chickensong

> design -> plan -> execute in batches

This is the way for me as well. Have a high-level master design and plan, but break it apart into phases that are manageable. One-shotting anything beyond a todo list and expecting decent quality is still a pipe dream.

dbbk

This is actually embarrassing. His "radically different" workflow is... using the built-in Plan mode that they recommend you use? What?

sidpatil

It's not, to be fair.

> I use my own `.md` plan files rather than Claude Code’s built-in plan mode. The built-in plan mode sucks.

mvkel

From Claude docs: Planning is most useful when you’re uncertain about the approach, when the change modifies multiple files, or when you’re unfamiliar with the code being modified. If this isn't true, skip the plan.

oblio

Can you easily version their plans using git?

dbbk

"Write plan to the plans folder in the project"

zozbot234

> if you don't plan perfectly, you'll have to start over from scratch if anything goes wrong.

You just revert what the AI agent changed and revise/iterate on the previous step - no need to start over. This can of course involve restricting the work to a smaller change so that the agent isn't overwhelmed by complexity.

AstroBen

100,000 lines is approx. one million words. The average person reads at 250wpm. The entire thing would take 66 hours just to read, assuming you were approaching it like a fiction book, not thinking anything over

dakolli

wtf, why would you write 100k lines of plan to produce 30k loc.. JUST WRITE THE CODE!!!

oblio

That's not (or should not be) what's happening.

They write a short high level plan (let's say 200 words). The plan asks the agent to write a more detailed implementation plan (written by the LLM, let's say 2000-5000 words).

They read this plan and adjust as needed, even sending it to the agent for re-dos.

Once the implementation plan is done, they ask the agent to write the actual code changes.

Then they review that and ask for fixes, adjustments, etc.

This can be comparable to writing the code yourself but also leaves a detailed trail of what was done and why, which I basically NEVER see in human generated code.

That alone is worth gold, by itself.

And on top of that, if you're using an unknown platform or stack, it's basically a rocket ship. You bootstrap much faster. Of course, stay on top of the architecture, do controlled changes, learn about the platform as you go, etc.

abustamam

I take this concept and I meta-prompt it even more.

I have a road map (AI generated, of course) for a side project I'm toying around with to experiment with LLM-driven development. I read the road map and I understand and approve it. Then, using some skills I found on skills.sh and slightly modified, my workflow is as such:

1. Brainstorm the next slice

It suggests a few items from the road map that should be worked on, with some high level methodology to implement. It asks me what the scope ought to be and what invariants ought to be considered. I ask it what tradeoffs could be, why, and what it recommends, given the product constraints. I approve a given slice of work.

NB: this is the part I learn the most from. I ask it why X process would be better than Y process given the constraints and it either corrects itself or it explains why. "Why use an outbox pattern? What other patterns could we use and why aren't they the right fit?"

2. Generate slice

After I approve what to work on next, it generates a high level overview of the slice, including files touched, saved in a MD file that is persisted. I read through the slice, ensure that it is indeed working on what I expect it to be working on, and that it's not scope creeping or undermining scope, and I approve it. It then makes a plan based off of this.

3. Generate plan

It writes a rather lengthy plan, with discrete task bullets at the top. Beneath, each step has to-dos for the llm to follow, such as generating tests, running migrations, etc, with commit messages for each step. I glance through this for any potential red flags.

4. Execute

This part is self explanatory. It reads the plan and does its thing.

I've been extremely happy with this workflow. I'll probably write a blog post about it at some point.

NobleLie

Yep, with a human in the loop to process these larger sprawling plan docs (inflated iteratively with the designer's intent).

Some get deleted from the repo, others archived, others merged or referenced elsewhere. It's kind of organic.

dakolli

[flagged]

Bishonen88

They didn't write 100k plan lines. The LLM did (99.9% of it, at least, or more). Writing 30k lines by hand would take weeks if not months. LLMs do it in an afternoon.

AstroBen

Just reading that plan would take weeks or months

dakolli

And my weeks or months of work beats an LLM's 10/10 times. There are no shortcuts in life.

elAhmo

How can you know that a 100k-line plan is not just slop?

Just because a plan is elaborate doesn't mean it makes sense.

EastLondonCoder

I don’t use plan.md docs either, but I recognise the underlying idea: you need a way to keep agent output constrained by reality.

My workflow is more like scaffold -> thin vertical slices -> machine-checkable semantics -> repeat.

Concrete example: I built and shipped a live ticketing system for my club (Kolibri Tickets). It’s not a toy: real payments (Stripe), email delivery, ticket verification at the door, frontend + backend, migrations, idempotency edges, etc. It’s running and taking money.

The reason this works with AI isn’t that the model “codes fast”. It’s that the workflow moves the bottleneck from “typing” to “verification”, and then engineers the verification loop:

  - keep the spine runnable early (end-to-end scaffold)

  - add one thin slice at a time (don't let it touch 15 files speculatively)

  - force checkable artifacts (tests/fixtures/types/state-machine semantics where it matters)

  - treat refactors as normal, because the harness makes them safe

If you run it open-loop (prompt -> giant diff -> read/debug), you get the "illusion of velocity" people complain about. If you run it closed-loop (scaffold + constraints + verifiers), you can actually ship faster because you're not paying the integration cost repeatedly.

Plan docs are one way to create shared state and prevent drift. A runnable scaffold + verification harness is another.

aitchnyu

Now that code is cheap, I ensured my side project has unit/integration tests (I will enforce 100% coverage), Playwright tests, static typing (it's in Python), and scripts for all tasks. I will learn mutation testing too (yes, it's overkill). Now my agent works up to 1 hour in loops and emits concise code I don't have to edit much.

EastLondonCoder

Totally get it, and I think we’re describing the same control loop from different angles.

Where I differ slightly is: “100% coverage” can turn into productivity theatre. It’s a metric that’s easy to optimize while missing the thing you actually care about: do we have machine-checkable invariants at the points where drift is expensive?

The harness that’s paid off for me (on a live payments system) is:

  - thin vertical slice first (end-to-end runnable, even if ugly)

  - tests at the seams (payments, emails, ticket verification / idempotency)

  - state-machine semantics where concurrency/ordering matters

  - unit tests as supporting beams, not wallpaper

Then refactors become routine, because the tests will make breakage explicit.

So yes: "code is cheap" -> increase verification. Just be careful not to replace engineering judgement with an easily gamed proxy.

turingsroot

I've been teaching AI coding tool workshops for the past year and this planning-first approach is by far the most reliable pattern I've seen across skill levels.

The key insight that most people miss: this isn't a new workflow invented for AI - it's how good senior engineers already work. You read the code deeply, write a design doc, get buy-in, then implement. The AI just makes the implementation phase dramatically faster.

What I've found interesting is that the people who struggle most with AI coding tools are often junior devs who never developed the habit of planning before coding. They jump straight to "build me X" and get frustrated when the output is a mess. Meanwhile, engineers with 10+ years of experience who are used to writing design docs and reviewing code pick it up almost instantly - because the hard part was always the planning, not the typing.

One addition I'd make to this workflow: version your research.md and plan.md files in git alongside your code. They become incredibly valuable documentation for future maintainers (including future-you) trying to understand why certain architectural decisions were made.

hghbbjh

> it's how good senior engineers already work

The other trick all good ones I’ve worked with converged on: it’s quicker to write code than review it (if we’re being thorough). Agents have some areas where they can really shine (boilerplate you should maybe have automated already being one), but most of their speed comes from passing the quality checking to your users or coworkers.

Juniors and other humans are valuable because eventually I trust them enough to not review their work. I don’t know if LLMs can ever get here for serious industries.

__mharrison__

I teach a lot of folks who "aren't software engineers" but are sitting in front of Jupyter all day writing code.

Covertly teaching software engineering best practices is super relevant. I've also found testing skills sorely lacking and even more important in AI driven development.

nikolay

Well, that's already done by Amazon's Kiro [0], Google's Antigravity [1], GitHub's Spec Kit [2], and OpenSpec [3]!

[0]: https://kiro.dev/

[1]: https://antigravity.google/

[2]: https://github.github.com/spec-kit/

[3]: https://openspec.dev/

duttish

This is quite close to what I've arrived at, but with two modifications

1) Anything larger I work on in layers of docs: architecture and requirements -> design -> implementation plan -> code. Partly it helps me think and nail the larger things first, and partly it helps Claude. I iterate on each level until I'm satisfied.

2) When doing reviews of each doc, I sometimes restart the session and clear context; it often finds new issues and things to clear up before starting the next phase.

stevendaniels

When I use other models to review plans (e.g. Opus 4.x with Gemini 3 or Codex 5.x), they often surface different issues than the model that wrote the plan.

colinhb

Quoting the article:

> One trick I use constantly: for well-contained features where I’ve seen a good implementation in an open source repo, I’ll share that code as a reference alongside the plan request. If I want to add sortable IDs, I paste the ID generation code from a project that does it well and say “this is how they do sortable IDs, write a plan.md explaining how we can adopt a similar approach.” Claude works dramatically better when it has a concrete reference implementation to work from rather than designing from scratch.

Licensing apparently means nothing.

Ripped off in the training data, ripped off in the prompt.

larard

That is the exact passage I found so shocking - if one finds the code in an open source repo, is it really acceptable to pass it through Claude Code as some sort of license filter and make it proprietary?

On the other hand, next time OSX/Windows/etc is leaked, one could feed it through this very same license filter. What is sauce for the goose is sauce for the gander.

miohtama

Concepts are not copyrightable.

colinhb

The article isn’t describing someone who learned the concept of sortable IDs and then wrote their own implementation.

It describes copying and pasting actual code from one project into a prompt so a language model can reproduce it in another project.

It’s a mechanical transformation of someone else’s copyrighted expression (their code) laundered through a statistical model instead of a human copyist.

layer8

“Mechanical” is doing some heavy lifting here. If a human does the same, reimplement the code in their own style for their particular context, it doesn’t violate copyright. Having the LLM see the original code doesn’t automatically make its output a plagiarism.
