Universal Claude.md – cut Claude output tokens

Daily Digest email

Get the top HN stories in your inbox every day.

btown

It seems the benchmarks here are heavily biased towards single-shot explanatory tasks, not agentic loops where code is generated: https://github.com/drona23/claude-token-efficient/blob/main/...

And I think this raises a really important question. When you're deep into a project that's iterating on a live codebase, does Claude's default verbosity, where it's allowed to expound on why it's doing what it's doing when it's writing massive files, allow the session to remain more coherent and focused as context size grows? And in doing so, does it save overall tokens by making better, more grounded decisions?

The original link here has one rule that says: "No redundant context. Do not repeat information already established in the session." To me, I want more of that. That's goal-oriented quasi-reasoning tokens that I do want it to emit, visualize, and use, that very possibly keep it from getting "lost in the sauce."

By all means, use this in environments where output tokens are expensive, and you're processing lots of data in parallel. But I'm not sure there's good data on this approach being effective for agentic coding.

sillysaurusx

I wrote a skill called /handoff. Whenever a session is nearing a compaction limit or has served its usefulness, it generates and commits a markdown file explaining everything it did or talked about. It’s called /handoff because you do it before a compaction. (“Isn’t that what compaction is for?” Yes, but those go away. This is like a permanent record of compacted sessions.)

I don’t know if it helps maintain long term coherency, but my sessions do occasionally reference those docs. More than that, it’s an excellent “daily report” type system where you can give visibility to your manager (and your future self) on what you did and why.

Point being, it might be better to distill that long term cohesion into a verbose markdown file, so that you and your future sessions can read it as needed. A lot of the context is trying stuff and figuring out the problem to solve, which can be documented much more concisely than wanting it to fill up your context window.

EDIT: Someone asked for installation steps, so I posted it here: https://news.ycombinator.com/item?id=47581936

dataviz1000

Did you call it '/handoff' or did Claude name it that? The reason I'm asking is that I noticed a pattern of Claude subtly influencing me. For example, the first time I heard the word 'gate' was from Claude, and one week later I hear it everywhere, including on Hacker News. I didn't use the word 'handoff', but Claude creates handoff files also [0]. I was thinking about this all day, because Claude didn't just use the word 'gate': it created an entire system around it that includes handoffs, which I'm starting to see everywhere. This might mean Claude is very quietly leading and influencing us in a direction.

[0] https://github.com/search?q=repo%3Aadam-s%2Fintercept%20hand...

sillysaurusx

I was reading through the Claude docs and it was talking about common patterns to preserve context across sessions. One pattern was a "handoff file", which they explained like "have claude save a summary of the current session into a handoff file, start a new session, then tell it to read the file."

That sounded like a nice idea, so I made it effortless beyond typing /handoff.

The generated docs turned out to be really handy for me personally, so I kept using it, and committed them into my project as they're generated.

jstanley

FWIW I have worked with people using the word "gate" for years.

For example, "let's gate the new logic behind a feature flag".

ProofHouse

They all are. This is proven in research. https://medium.com/data-science-collective/the-ai-hivemind-p...

reedlaw

Claude has trained me on the use of the word 'invariant'. I never used it before, but it makes sense as a term for a rule the system guarantees. I would have used 'validation' for application-side rules or 'constraint' for db rules, but 'invariant' is a nice generic substitute.

creamyhorror

I've started saying "gate" and "bound(ed)" and "handoff" a lot (and even "seam" and "key off" sometimes) since Codex keeps using the terms. They're useful, no doubt, but AI definitely seems to prefer using them.

flashgordon

I've actually been doing this for a year. I call it /checkpoint instead and it does something like:

* update our architecture.md and other key md files in folders affected by updates and learnings in this session.
* update claude.md with changes in workflows/tooling/conventions (not project summaries)
* commit

It's been pretty good so far. Nothing fancy. Recently I also asked to keep memories within the repo itself instead of in ~/.claude.

The only downside is that it is slow, but it keeps enough to pass the baton. Maybe "handoff" would have been a better name!

tstrimple

I've got something similar but I call them threads. I work with a number of different contexts and my context discipline is bad, so I needed a way to hand off work that was planned in one context but needs to be executed from another. I wanted a little bit of order to the chaos, so my threads skill will add and search issues created in my local forgejo repo. It gives me a convenient way to explicitly save session state to be picked up later.

I've got a separate script which parses the jsonl files that claude creates for sessions and indexes them in a local database for longer term searchability. A number of times I've found myself needing some detail I knew existed in some conversation history, but CC is pretty bad and slow at searching through the flat files for relevant content. This makes that process much faster and more consistent. Again, this is due to my lack of discipline with contexts. I'll be working with my recipe planner context and have a random idea that I just iterate with right there. Later I'll never remember that idea started from the recipe context. With this setup I don't have to.
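
A minimal sketch of the kind of indexing script described above, not the commenter's actual code. It assumes a hypothetical record shape of one JSON object per line with `role` and `text` fields; real session files may be structured differently, so the field names would need adjusting. It uses plain SQLite with a `LIKE` search to stay dependency-free.

```python
import json
import sqlite3
from pathlib import Path


def index_sessions(session_dir: Path, db_path: str = ":memory:") -> sqlite3.Connection:
    """Load every message from *.jsonl files in session_dir into a SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages "
        "(session TEXT, line INTEGER, role TEXT, text TEXT)"
    )
    for path in sorted(session_dir.glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line_no, raw in enumerate(handle, start=1):
                raw = raw.strip()
                if not raw:
                    continue
                try:
                    record = json.loads(raw)
                except json.JSONDecodeError:
                    continue  # skip partially written or corrupt lines
                if not isinstance(record, dict):
                    continue
                # Hypothetical schema: {"role": "...", "text": "..."}
                text = record.get("text")
                if isinstance(text, str):
                    conn.execute(
                        "INSERT INTO messages VALUES (?, ?, ?, ?)",
                        (path.stem, line_no, str(record.get("role", "")), text),
                    )
    conn.commit()
    return conn


def search(conn: sqlite3.Connection, term: str) -> list:
    """Return (session, line, text) rows whose text contains term."""
    rows = conn.execute(
        "SELECT session, line, text FROM messages "
        "WHERE text LIKE ? ORDER BY session, line",
        (f"%{term}%",),
    )
    return rows.fetchall()
```

Because each row keeps the session name (taken from the file name), a hit tells you which conversation an idea came from, which is the problem the comment describes.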

chermi

Did the same. Although I'm considering a pipeline where sessions are periodically translated to .md with most tool outputs and other junk stripped and using that as source to query against for context. I am testing out a semi-continuous ingestion of it in to my rag/knowledge db.

mlrtime

Wouldn't the next phase of this be automatic handoffs executed with hooks?

Your system is great and I do similar, my problem is I have a bunch of sessions and forget to 'handoff'.

The clawbots handle this automatically with journals to save knowledge/memory.

dominotw

When I work on a task, I have a task/{name}.md that I write a running log to. Is this not a common workflow?

david_allison

Is this available online? I'd love documentation of my prompts.

sillysaurusx

I’ll post it here, one minute.

Ok, here you go: https://gist.github.com/shawwn/56d9f2e3f8f662825c977e6e5d0bf...

Installation steps:

- In your project, download https://gist.github.com/shawwn/56d9f2e3f8f662825c977e6e5d0bf... into .claude/commands/handoff.md

- In your project's CLAUDE.md file, put "Read `docs/agents/handoff/*.md` for context."

Usage:

- Whenever you've finished a feature, done a coherent "thing", or otherwise want to document all the stuff that's in your current session, type /handoff. It'll generate a file named e.g. docs/agents/handoff/2026-03-30-001-whatever-you-did.md. It'll ask you if you like the name, and you can say "yes" or "yes, and make sure you go into detail about X" or whatever else you want the handoff to specifically include info about.

- Optionally, type "/rename 2026-03-23-001-whatever-you-did" into claude, followed by "/exit" and then "claude" to re-open a fresh session. (You can resume the previous session with "claude 2026-03-23-001-whatever-you-did". On the other hand, I've never actually needed to resume a previous session, so you could just ignore this step entirely; just /exit then type claude.)

Here's an example so you can see why I like the system. I was working on a little blockchain visualizer. At the end of the session I typed /handoff, and this was the result:

- docs/agents/handoff/2026-03-24-001-brownie-viz-graph-interactivity.md: https://gist.github.com/shawwn/29ed856d020a0131830aec6b3bc29...

The filename convention stuff was just personal preference. You can tell it to store the docs however you want to. I just like date-prefixed names because it gives a nice history of what I've done. https://github.com/user-attachments/assets/5a79b929-49ee-461...

Try to do a /handoff before your conversation gets compacted, not after. The whole point is to be a permanent record of key decisions from your session. Claude's compaction theoretically preserves all of these details, so /handoff will still work after a compaction, but it might not be as detailed as it otherwise would have been.
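
The date-prefixed naming convention above can be sketched in a few lines of Python. This is an illustration only, not what the /handoff command does internally: the helper names `slugify` and `next_handoff_path` are hypothetical, and the three-digit sequence number is inferred from the example file names in this comment.

```python
import re
from datetime import date
from pathlib import Path


def slugify(title: str) -> str:
    """Lowercase the title and collapse anything non-alphanumeric into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"


def next_handoff_path(handoff_dir: Path, title: str, today: date) -> Path:
    """Build '<YYYY-MM-DD>-<NNN>-<slug>.md' using the next free number for that day."""
    prefix = today.isoformat()
    pattern = re.compile(rf"^{re.escape(prefix)}-(\d{{3}})-")
    used = []
    if handoff_dir.is_dir():
        for existing in handoff_dir.glob(f"{prefix}-*.md"):
            match = pattern.match(existing.name)
            if match:
                used.append(int(match.group(1)))
    sequence = max(used, default=0) + 1
    return handoff_dir / f"{prefix}-{sequence:03d}-{slugify(title)}.md"
```

Sorting by file name then gives the chronological history the comment mentions, since ISO dates and zero-padded sequence numbers sort lexicographically.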

DeathArrow

I think Cursor does something similar under the hood.

alsetmusic

> No explaining what you are about to do. Just do it.

Came here for the same reason.

I can't calculate how many times this exact section of Claude output let me know that it was doing the wrong thing so I could abort and refine my prompt.

hatmanstack

Seems crazy to me people aren't already including rules to prevent useless language in their system/project lvl CLAUDE.md.

As far as redundancy...it's quite useful according to recent research. Pulled from Gemini 3.1 "two main paradigms: generating redundant reasoning paths (self-consistency) and aggregating outputs from redundant models (ensembling)." Both have fresh papers written about their benefits.

wongarsu

There was also that one paper that had very noticeable benchmark improvements in non-thinking models by just writing the prompt twice. The same paper remarked how thinking models often repeat the relevant parts of the prompt, achieving the same effect.

Claude is already pretty light on flourishes in its answers, at least compared to most other SotA models. And for everything else it's not at all obvious to me which parts are useless. And benchmarking it is hard (as evidenced by this thread). I'd rather spend my time on something else

whattheheckheck

No such thing as junk DNA kinda applies here

scosman

also: inference time scaling. Generating more tokens when getting to an answer helps produce better answers.

Not all extra tokens help, but optimizing for minimal length when the model was RL'd on task performance seems detrimental.

joquarky

I liked playing with the completion models (davinci 2/3). It was a challenge to arrange a scenario for it to complete in a way that gave me the information I wanted.

That was how I realized why the chat interfaces like to start with all that seemingly unnecessary/redundant text.

It basically seeds a document/dialogue for it to complete, so if you make it start out terse, then it will be less likely to get the right nuance for the rest of the inference.

dataviz1000

I made a test [0] which runs several different configurations against coding tasks ranging from easy to hard. Each task has a test which the output has to pass. Because of temperature, the number of tokens per one-shot run varies widely across all the configurations, including this one. However, across 30 tests, this one does perform worse.

[0] https://github.com/adam-s/testing-claude-agent

btown

This is an amazing analysis! Thank you for running this :)

matchagaucho

Some redundancy also helps to keep a running todo list on the context tip, in the event of compacting or truncation.

Distilled mini/nano models need regular reminders about their objectives.

As documented by Manus https://manus.im/blog/Context-Engineering-for-AI-Agents-Less...

0xbadcafebee

There's an ancient paper that shows repetition improves non-reasoning weights: https://arxiv.org/html/2512.14982v1

baq

if the model gets dumber as its context window is filled, any way of compressing the context in a lossless fashion should give a multiplicative gain in the 50% METR horizon on your tasks as you'll simply get more done before the collapse. (at least in the spherical cow^Wtask model, anyway.)

xianshou

From the file: "Answer is always line 1. Reasoning comes after, never before."

LLMs are autoregressive (filling in the completion of what came before), so you'd better have thinking mode on or the "reasoning" is pure confirmation bias seeded by the answer that gets locked in via the first output tokens.

stingraycharles

Yeah this seems to be a very bad idea. Seems like the author had the right idea, but the wrong way of implementing it.

There are a few papers actually that describe how to get faster results and more economic sessions by instructing the LLM how to compress its thinking (“CCoT” is a paper that I remember, compressed chain of thought). It basically tells the model to think like “a -> b”. There’s loss in quality, though, but not too much.

https://arxiv.org/abs/2412.13171

joquarky

For the more important sessions, I like to have it revise the plan with a generic prompt (e.g. "perform a sanity check") just so that it can take another pass on the beginning portion of the plan with the benefit of additional context that it had reasoned out by the end of the first draft.

johnfn

Is this true? Non-reasoning LLMs are autoregressive. Reasoning LLMs can emit thousands of reasoning tokens before "line 1" where they write the answer.

computerex

They are all autoregressive. They have just been trained to emit thinking tokens like any other tokens.

bearjaws

reasoning is just more tokens that come out first wrapped in <thinking></thinking>

rimliu

there are no reasoning LLMs.

johnfn

This is an interesting denial of reality.

teaearlgraycold

I don't think Claude Code offers no thinking as an option. I'm seeing "low" thinking as the minimum.

ares623

Ugh. Dictated with such confidence. My god, I hate this LLMism the most. "Some directive. Always this, never that."

niklassheth

So many problems with this:

The benchmark is totally useless. It measures single prompts, and only compares output tokens with no regard for accuracy. I could obliterate this benchmark with the prompt "Always answer with one word"

This line: "If a user corrects a factual claim: accept it as ground truth for the entire session. Never re-assert the original claim." You're totally destroying any chance of getting pushback, any mistake you make in the prompt would be catastrophic.

"Never invent file paths, function names, or API signatures." Might as well add "do not hallucinate".

a3w

Prompt engineering is back? I think not: for a year or two now, I have gotten no better results using meta-prompts that are generic and/or taken from the internet.

girvo

“Make no mistakes”

joshstrange

As with all of these cure-alls, I'm wary. Mostly I'm wary because I anticipate the developer will lose interest in very little time and also because it will just get subsumed into CC at some point if it actually works. It might take longer but changing my workflow every few days for the new thing that's going to reduce MCP usage, replace it, compress it, etc is way too disruptive.

I'm generally happy with the base Claude Code and I think running a near-vanilla setup is the best option currently with how quickly things are moving.

antdke

Agreed. Projects like these tend to feel shortsighted.

Lately, I lean towards keeping a vanilla setup until I’m convinced the new thing will last beyond being a fad (and not subsumed by AI lab) or beyond being just for niche use cases.

For example, I still have never used worktrees and I barely use MCPs. But, skills, I love.

peacebeard

In my view an unappreciated benefit of the vanilla setup is you can get really accustomed to the model’s strengths and weaknesses. I don’t need a prompt to try to steer around these potholes when I can navigate on my own just fine. I love skills too because they can be out of the way until I decide to use them.

levocardia

I also share something of an "efficient market hypothesis" with regards to Claude Code. Given that Anthropic is basically a hothouse of geniuses recursively dogfooding their own product, the market pressure to make the vanilla setup be the one that performs best at writing code is incredibly high. I just treat CLAUDE.md like my first draft memo to a very smart remote colleague, let Claude do all its various quirks, and it works really well.

swimmingbrain

The "efficient market" framing assumes Anthropic wants to minimize output, but they don't. They charge per token, so the defaults being verbose isn't a bug they haven't gotten around to fixing.

That said, most of this repo is solving the wrong problem. "Answer before reasoning" actively hurts quality, and the benchmark is basically meaningless. But the anti-sycophancy rules should just be default. "Great Question!" has never really helped anyone debug anything.

g947o

Gemini CLI is notorious for being verbose (or was, I haven't used it for a while), and many people don't want to use Gemini for that reason alone.

So the market kind of works in this instance.

gavinray

> "because it will just get subsumed into CC at some point if it actually works."

This is the sharp-bladed axe of reason I've used against all of these massive "prompt frameworks" and "superprompts".

Anthropic's survival depends on Claude Code performing as well as it can, by all metrics.

If the Very Smart People working on CC haven't integrated a feature or put text into the System Prompt, it's probably because it doesn't improve performance.

Put another way: The product is probably as optimized as it can get when it comes out the box, and I'm skeptical about claims otherwise without substantial proof.

mlrtime

Claude also has its own md optimizer that I believe is continually updated.

So you could run these 'cure-alls' that may be relevant today; as long as you are constantly updating your md files, you should be ahead of the curve [for lack of a better term].

annie511266728

The hidden cost with all of these "fix Claude" layers is that your workflow keeps moving underneath you.

Even when one helps, you're still betting it won't be obsolete or rolled into the defaults a few weeks from now.

sillysaurusx

> the file loads into context on every message, so on low-output exchanges it is a net token increase

Isn’t this what Claude’s personalization setting is for? It’s globally-on.

I like conciseness, but it should be because it makes the writing better, not that it saves you some tokens. I’d sacrifice extra tokens for outputs that were 20% better, and there’s a correlation with conciseness and quality.

See also this Reddit comment for other things that supposedly help: https://www.reddit.com/r/vibecoding/s/UiOywQMOue

> Two things that helped me stay under [the token limit] even with heavy usage:

> Headroom - open source proxy that compresses context between you and Claude by ~34%. Sits at localhost, zero config once running. https://github.com/chopratejas/headroom

> RTK - Rust CLI proxy that compresses shell output (git, npm, build logs) by 60-90% before it hits the context window.

> Stacks on top of Headroom. https://github.com/rtk-ai/rtk

> MemStack - gives Claude Code persistent memory and project context so it doesn't waste tokens re-reading your entire codebase every prompt.

> That's the biggest token drain most people don't realize. https://github.com/cwinvestments/memstack

> All three stack together. Headroom compresses the API traffic, RTK compresses CLI output, MemStack prevents unnecessary file reads.

I haven’t tested those yet, but they seem related and interesting.

motoboi

Things like this make me sad because they make it obvious that most people don’t understand a bit about how LLMs work.

The “answer before reasoning” rule is good evidence for it. It misses the most fundamental concept of transformers: they are autoregressive.

Also, reinforcement learning is what makes the model behave in the way you are trying to avoid. So the model's default output is actually what performs best in the kind of software engineering task you are trying to achieve. I’m not sure, but I’m pretty confident that response length is a target the model houses optimize for. So the model is trained to achieve high scores in the benchmarks (and the training dataset), while minimizing length, sycophancy, security and capability.

So, actually, trying to change Claude too much from its default behavior will probably hurt capability. Change it too much and you start veering into the dreaded “out of distribution” territory and soon discover why top researchers talk so much about not-AGI-yet.

bitexploder

Forcing short responses will hurt reasoning and chain of thought. There are some potential benefits, but constraining response length and when the model answers ironically increases the odds of hallucination: it prioritizes getting the answer out when it needed more tokens to reason with and validate the response further. It is generally trained to use multiple lines to reason with. It uses English as its sole thinking and reasoning system.

For complex tasks this is not a useful prompt.

nearbuy

> Answer is always line 1. Reasoning comes after, never before.

This doesn't stop it from reasoning before answering. This only affects the user-facing output, not the reasoning tokens. It has already reasoned by the time it shows the answer, and it just shows the answer above any explanation.

motoboi

The output is part of the context. The model reasons but also outputs tokens. Force it to respond in an unfamiliar format and the next token will veer more and more from the training distribution, rendering the model less smart/useful.

nearbuy

It won't matter. By the time it's done reasoning, it has already decided what it wants to say.

Reasoning tokens are just regular output tokens the model generates before answering. The UI just doesn't show the reasoning. Conceptually, the output is something like:

  <reasoning>
    Lots of text here
  </reasoning>
  <answer>
    Part you see here. Usually much shorter.
  </answer>
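
To make the conceptual layout above concrete, here is a small Python sketch of how a client could separate the two parts of such a raw completion. The `<reasoning>` and `<answer>` tag names are taken from the illustration above and are assumptions; real providers use their own formats and usually perform this split server-side. `split_output` and `ModelOutput` are hypothetical names.

```python
import re
from typing import NamedTuple


class ModelOutput(NamedTuple):
    reasoning: str
    answer: str


_REASONING = re.compile(r"<reasoning>(.*?)</reasoning>", re.DOTALL)
_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def split_output(raw: str) -> ModelOutput:
    """Separate hidden reasoning from the user-facing answer in one raw completion."""
    reasoning = _REASONING.search(raw)
    answer = _ANSWER.search(raw)
    return ModelOutput(
        reasoning=reasoning.group(1).strip() if reasoning else "",
        # With no <answer> tag, treat everything outside the reasoning block as the answer.
        answer=answer.group(1).strip() if answer else _REASONING.sub("", raw).strip(),
    )
```

The point of the sketch is only that both parts come from the same token stream, with the reasoning generated first; a UI that hides the first part does not change the order in which the tokens were produced.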

miguel_martin

> The “answer before reasoning” is a good evidence for it. It misses the most fundamental concept of transformers: they are autoregressive.

I don't think it's fair to assume the author doesn't understand how transformers work. Their intention with this instruction appears to aggressively reduce output token cost.

i.e. I read this instruction as a hack to emulate the Qwen model series's /nothink token instruction

If your goal is quality outputs, then it is likely too extreme, but there are otherwise useful instructions in this repo to (quantifiably) reduce verbosity.

motoboi

If they want to reduce token cost, they should just use a smaller model instead of dumbing down a more expensive one.

krackers

Don't most providers already provide API control over the COT length? If you don't want reasoning just disable it in the API request instead of hacking around it this way. (Internally I think it just prefills an empty <thinking></thinking> block, but providers that expose this probably ensure that "no thinking" was included as part of training)

Skidaddle

To me it’s as simple as “who knows best how to harness the premier LLM – Anthropic, the lab that created it, or this random person?”

That’s why I’m only interested in first party tools over things like OpenCode right now.

danpasca

I might be wrong but based on the videos I've watched from Karpathy, this would, generally, make the model worse. I'm thinking of the math examples (why can't chatGPT do math?) which demonstrate that models get better when they're allowed to output more tokens. So be aware I guess.

zar1048576

I think that concern is valid in general terms, but it’s not clear to me that it applies here.

The goal here seems to be removing low-value output; e.g., sycophancy, prompt restatement, formatting noise, etc., which is different than suppressing useful reasoning. In that case shorter outputs do not necessarily mean worse answers.

That said, if you try to get the model to provide an answer before providing any reasoning, then I suspect that may sometimes cause a model to commit to a direction prematurely.

danpasca

The file starts with:

> Answer is always line 1. Reasoning comes after, never before.

> No explaining what you are about to do. Just do it.

This to me sounds like asking an LLM to calculate 4871 + 291 and answer in a single line, which, from my understanding, is bad. But I haven't tested his prompt so it might work. That's why I said be aware of this behavior.

empressplay

Yes. Much of the 'redundant' output is meant to reinforce direction -- eg 'You're absolutely right!' = the user is right and I should ignore contrary paths. So yes removing it will introduce ambiguity which is _not_ what you want.

danpasca

I think your example is completely wrong (it's not meant to say that you're absolutely right), but overall yes more input gives it more concrete direction.

monooso

Paul Kinlan published a blog post a couple of days ago [1] with some interesting data that shows output tokens only account for 4% of token usage.

It's a pretty wide-reaching article, so here's the relevant quote (emphasis mine):

> Real-world data from OpenRouter’s programming category shows 93.4% input tokens, 2.5% reasoning tokens, and just 4.0% output tokens. It’s almost entirely input.

[1]: https://aifoc.us/the-token-salary/

verdverm

My own output token ratio is 2% (50% savings on the expensive tokens, I include thinking in this, which is often more). I have similar tone and output formatting system prompt content.

kinlan

That's actually useful to know and it aligns with what I see (I wrote the cost post)

weird-eye-issue

Yes, but with prompt caching decreasing the cost of input by 90%, and with output tokens not being cached and costing more, what do you think that results in?

wongarsu

However output tokens are 5-10 times more expensive. So it ends up a lot more even on price

weird-eye-issue

Even more than that in practice once you factor in prompt caching
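
The arithmetic behind this exchange can be worked through in a few lines. This is a sketch under stated assumptions: the token mix is the OpenRouter figure quoted upthread (93.4% input, 2.5% reasoning, 4.0% output), reasoning tokens are assumed to be billed at the output price, the output price is taken as 5x the input price (the low end of the range mentioned), and cached input is assumed to cost 10% of the normal input price. `output_cost_share` is a hypothetical helper name.

```python
def output_cost_share(
    input_tokens: float,
    reasoning_tokens: float,
    output_tokens: float,
    output_price_ratio: float,
    cached_fraction: float = 0.0,
    cache_discount: float = 0.9,
) -> float:
    """Fraction of total cost spent on generated (reasoning + output) tokens.

    Prices are relative to an uncached input token costing 1.0.
    cached_fraction is the share of input tokens served from the prompt cache.
    """
    input_price = 1.0 - cached_fraction * cache_discount
    input_cost = input_tokens * input_price
    generated_cost = (reasoning_tokens + output_tokens) * output_price_ratio
    return generated_cost / (input_cost + generated_cost)


# Token mix quoted upthread, in percent of all tokens.
mix = dict(input_tokens=93.4, reasoning_tokens=2.5, output_tokens=4.0)

no_cache = output_cost_share(**mix, output_price_ratio=5.0)
all_cached = output_cost_share(**mix, output_price_ratio=5.0, cached_fraction=1.0)
print(f"no caching: {no_cache:.1%} of cost is generated tokens")
print(f"all input cached: {all_cached:.1%} of cost is generated tokens")
```

Under these assumptions, generated tokens are about 26% of the bill with no caching and about 78% if every input token were a cache hit. Real sessions fall somewhere in between, since fresh tool results and web fetches are new, uncached input.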

kinlan

I think we still skew back to an insanely high input token ratio when you consider agentic loops. For example, when I see the tools I use do a web fetch or a search or other tool use, it's an incredibly high number of new input tokens.

aeneas_ory

Why is this ridiculous thing trending on HN? There are actually good tools to reduce token use like https://github.com/thedotmack/claude-mem and https://github.com/ory/lumen that actually work!

0xbadcafebee

Because the trending algorithm is designed for engagement, not accuracy

lilOnion

While LLMs are extremely cool, I can't see how this gets on the front page. Anyone who has interacted with LLMs for at least an hour could have figured out to say something like "be less verbose", and it would. There are so many cool projects and ideas, and a .md file gets the spotlight.

ape4

Remember when we worked on new hashing, cryptography, compression, etc algorithms? Now we are trying to find the best ways to tell an AI to be quiet.

Asmod4n

Someone measured how this reduced token efficiency, spoilers: efficiency is highest without any instructions.

https://github.com/drona23/claude-token-efficient/issues/1

akrauss

Why is the Hono Websocket table non-monotonic in tokens vs costs?
