
bcherny

Hey all, Boris from the Claude Code team here. I just responded on the issue, and cross-posting here for input.

---

Hi, thanks for the detailed analysis. Before I keep going, I wanted to say I appreciate the depth of thinking & care that went into this.

There's a lot here, I will try to break it down a bit. These are the two core things happening:

> `redact-thinking-2026-02-12`

This beta header hides thinking from the UI, since most people don't look at it. It *does not* impact thinking itself, nor does it impact thinking budgets or the way extended reasoning works under the hood. It is a UI-only change.

Under the hood, by setting this header we avoid needing thinking summaries, which reduces latency. You can opt out of it with `showThinkingSummaries: true` in your settings.json (see [docs](https://code.claude.com/docs/en/settings#available-settings)).
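Concretely, that opt-out is a one-line addition to your settings.json (a minimal sketch; the rest of the file stays as-is):

    {
      "showThinkingSummaries": true
    }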

If you are analyzing locally stored transcripts, you wouldn't see raw thinking stored when this header is set, which is likely influencing the analysis. When Claude sees a lack of thinking in the transcripts, it may not realize that the thinking is still happening and is simply not user-facing.

> Thinking depth had already dropped ~67% by late February

We landed two changes in Feb that would have impacted this. We evaluated both carefully:

1/ Opus 4.6 launch → adaptive thinking default (Feb 9)

Opus 4.6 supports adaptive thinking, which is different from the fixed thinking budgets we used to support. In this mode, the model decides how long to think, which tends to work better than fixed budgets across the board. Set `CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING` to opt out.

2/ Medium effort (85) default on Opus 4.6 (Mar 3)

We found that effort=85 was a sweet spot on the intelligence-latency/cost curve for most users, improving token efficiency while reducing latency. One of our product principles is to avoid changing settings on users' behalf, and ideally we would have set effort=85 from the start. Since we felt this was an important setting to change, our approach was to:

1. Roll it out with a dialog so users are aware of the change and have a chance to opt out

2. Show the effort the first few times you opened Claude Code, so it wasn't surprising.

Some people want the model to think for longer, even if it takes more time and tokens. To improve intelligence more, set effort=high via `/effort` or in your settings.json. This setting is sticky across sessions, and can be shared among users. You can also use the ULTRATHINK keyword to use high effort for a single turn, or set `/effort max` to use even higher effort for the rest of the conversation.

Going forward, we will test defaulting Teams and Enterprise users to high effort, to benefit from extended thinking even if it comes at the cost of additional tokens & latency. This default is configurable in exactly the same way, via `/effort` and settings.json.

Wowfunhappy

> Under the hood, by setting this header we avoid needing thinking summaries, which reduces latency. You can opt out of it with `showThinkingSummaries: true` in your settings.json (see [docs](https://code.claude.com/docs/en/settings#available-settings)).

Can I just see the actual thinking (not summarized), so that I get the real thing without a latency cost?

I do really need to see the thinking in some form, because I often see useful things there. If Claude is thinking in the wrong direction I will stop it and make it change course.

faitswulff

Anthropic's position is that thinking tokens aren't actually faithful to the internal logic that the LLM is using, which may be one reason why they started to exclude them:

https://www.anthropic.com/research/reasoning-models-dont-say...

libraryofbabel

That's interesting research, but I think a more important reason that you don't have access to them (not even via the bare Anthropic api) is to prevent distillation of the model by competitors (using the output of Anthropic's model to help train a new model).

gck1

That probably matters for some scenarios, but I have yet to find one where thinking tokens didn't hint at the root cause of the failure.

All of my unsupervised worker agents have sidecars that inject messages when thinking tokens match some heuristics. For example, any time Opus says "pragmatic", it's an instant Esc Esc > "Pragmatic fix is always wrong, do the Correct fix"; likewise whenever "pre-existing issue" appears (it's never pre-existing).
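Stripped down, the sidecars are little more than phrase-matching over the session transcript. A rough sketch (the transcript path is a placeholder, and the real setup injects a corrective message rather than just printing):

    # watch a session transcript for heuristic "tell" phrases (path is a placeholder)
    TRANSCRIPT="$HOME/.claude/projects/myproj/current-session.jsonl"
    tail -f "$TRANSCRIPT" \
      | grep --line-buffered -iE 'pragmatic|pre-existing issue' \
      | while read -r line; do
          # the real sidecar injects a correction into the session;
          # this version just flags the hit so a human can Esc Esc and redirect
          echo "HEURISTIC HIT: $line" >&2
        done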

AquinasCoder

I somewhat understand Anthropic's position. However, thinking tokens are useful even if they don't show the internal logic of the LLM. I often realize I left out some instruction or clarification in my prompt while reading through the chain of reasoning. Overall, this makes the results more effective.

It's certainly getting frustrating having to remind it that I want all tests to pass even if it thinks it's not responsible for having broken some of them.

andai

What's the implication of this? That the model already decided on a solution, upon first seeing the problem, and the reasoning is post hoc rationalization?

But reasoning does improve performance on many tasks, and even weirder, the performance improves if reasoning tokens are replaced with placeholder tokens like "..."

I don't understand how LLMs actually work, I guess there's some internal state getting nudged with each cycle?

So the internal state converges on the right solution, even if the output tokens are meaningless placeholders?

asobalife

I have seen this to be true many times. The CoT being completely different from the actual model output.

Not limited to Claude as well.

marcd35

so not only are they sycophantic and hallucinatory, but now they're also proven to be schizophrenic.

neato.

gmerc

Nah it’s an anti distillation move

grey-area

So like many of the promises from AI companies, reported chain of thought is not actually true (see results below). I suppose this is unsurprising given how they function.

Is chain of thought even added to the context or is it extraneous babble providing a plausible post-hoc justification?

People certainly seem to treat it as it is presented, as a series of logical steps leading to an answer.

> After checking that the models really did use the hints to aid in their answers, we tested how often they mentioned them in their Chain-of-Thought. The overall answer: not often. On average across all the different hint types, Claude 3.7 Sonnet mentioned the hint 25% of the time, and DeepSeek R1 mentioned it 39% of the time. A substantial majority of answers, then, were unfaithful.

kouteiheika

> Can I just see the actual thinking (not summarized) so that I can see the actual thinking without a latency cost?

You can't, and Anthropic will never allow it, since it would let others more easily distill Claude (i.e. "distillation attacks"[1] in Anthropic-speak, even though Anthropic is doing essentially the same thing[2]; rules for thee but not for me).

[1] -- https://www.anthropic.com/news/detecting-and-preventing-dist...

[2] -- https://www.npr.org/2025/09/05/g-s1-87367/anthropic-authors-...

olejorgenb

So this means I can not resume a session older than 30 days properly?

andersa

But you can't. Many times I've seen claude write confusing off-track nonsense in the thinking and then do the correct action anyway as if that never happened. It doesn't work the way we want it to.

Wowfunhappy

Maybe, but I’ve seen the opposite too.

In most cases, I don’t use the reasoning to proactively stop Claude from going off track. When Claude does go off track, the reasoning helps me understand what went wrong and how to correct it when I roll back and try again.

richardjennings

I was not aware the default effort had changed to medium until the quality of output nosedived. This cost me perhaps a day of work to rectify. I now ensure effort is set to max and have not had a terrible session since. Please may I have an "always try as hard as you can" mode?

Avamander

I feel like the maximum effort mode kind of wraps around and starts becoming "desperate", to the point of laziness or a monkey's paw, similar to how lower effort modes or a poor prompt behave.

svnt

I’m going in circles. Let me take a step back and try something completely different. The answer is a clean refactor.

Wait, the simplest fix is the same hack I tried 45 minutes ago but in a different context. Let me just try that.

Wait,

richardjennings

I think over-thinking is only solved by thinking more, not less. This is only viable once some intelligence threshold is reached, which I think Anthropic has borderline achieved.

torginus

This might be just my impression, but I feel like most people are using CC for fixing their React frontends, and they prefer the decreased latency and fewer tokens spent over performing well on extremely difficult problems?

That said, there's still an issue of regression to the mean. What the average person likes, as determined by metrics, is something nobody actually likes, because the average is a mathematical construct and might not describe any particular individual accurately.

Schiendelman

That's /effort max!

richardjennings

You cannot control the effort setting sub-agents use and you also cannot use /effort max as a default (outside of using an alias).

caiyongji

agree.

clevergadget

bad citizen

johndough

I think it is hilarious that there are four different ways to set settings (settings.json config file, environment variable, slash commands and magical chat keywords).

That kind of consistency has also been my own experience with LLMs.

windexh8er

I just had this conversation today. It's hilarious that things like Skills and Soul and all of these anthropomorphized files could just be a better laid out set of configuration files. Yet here we are treating machines like pets or worse.

hansmayer

Well they need you to think there is some kind of soul behind it - that is their entire pitch!

SAI_Peregrinus

It's not unique to LLMs. Take BASH: you've got `/etc/profile`, `~/.bash_profile`, `~/.bash_login`, `~/.bashrc`, `~/.profile`, environment variables, and shell options.

hansmayer

I would laugh so hard at this if your attempt at a comparison were not so tragic. Bash and other shells are deterministic. Want to set it just for one user? Use ~/.bashrc. Set it for all users on the system? Use /etc/profile.d/. Want it just temporarily for this session? You got it: environment variables. And it is going to work like that every single time. It is deterministic, you see.
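For example, the same setting at each scope:

    # all users on the system: drop a file in /etc/profile.d/
    echo 'export EDITOR=vim' | sudo tee /etc/profile.d/editor.sh
    # just one user, interactive shells: ~/.bashrc
    echo 'export EDITOR=vim' >> ~/.bashrc
    # just this one invocation: an environment variable
    EDITOR=vim crontab -e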

subscribed

Yeah, but for ash/shells these files have wildly different purposes. I don't think it's so distinct with cc.

monatron

To be fair, I can think of reasons why you would want to be able to set them in various ways.

- settings.json - set for machine, project

- env var - set for an environment/shell/sandbox

- slash command - set for a session

- magical keyword - set for a turn
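Roughly, using the effort knob mentioned elsewhere in this thread (the exact names are lifted from other comments here, so treat this as illustrative):

    # machine/project: settings.json (global ~/.claude/ or per-project .claude/)
    #   { "env": { "CLAUDE_CODE_EFFORT_LEVEL": "high" } }
    # environment/shell/sandbox: export before launching Claude Code
    export CLAUDE_CODE_EFFORT_LEVEL=high
    # session: a slash command inside the REPL
    #   /effort high
    # single turn: a keyword in the prompt
    #   ULTRATHINK: audit the lexer for divergent code paths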

tracker1

I tend to make a concerted effort to ensure anything settable via CLI is also settable via environment variable... though I often have a search-upward option for a .env file as well, mostly so that it's easier to containerize/deploy an application in a predictable/reusable way.
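The search-upward part is just a loop; a minimal sh version of the idea:

    # walk up from the current directory until a .env is found, then load it
    dir="$PWD"
    while [ "$dir" != "/" ]; do
      if [ -f "$dir/.env" ]; then
        set -a          # auto-export everything the file defines
        . "$dir/.env"
        set +a
        break
      fi
      dir="$(dirname "$dir")"
    done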

larpingscholar

You are yet to discover the joys of the managed settings scope. They can be set three ways: through the claude.ai admin console; by one of two registry keys, e.g. HKLM\SOFTWARE\Policies\ClaudeCode; and by an alphabetically merged directory of JSON files.

bmitc

There's also settings available in some offerings and not in others. For example, the Anthropic Claude API supports setting model temperature, but the Claude Agent SDK doesn't.

ggdxwz

In particular, some settings are in settings.json and others in .claude.json, so sometimes I have to go through both to find the one I want to tweak.

brookst

Way more than that: settings.json and settings.local.json in the project directory's .claude/, and both of those files can also be in ~/.claude.

MCP servers can be set in at least 5 of those places plus .mcp.json

OliverGuy

settings.json -> global config

Env vars -> settings different to your global for a specific project

Slash commands / chat keywords -> change a setting mid-chat

koverstreet

There's been more going on than just the default to medium level thinking - I'll echo what others are saying, even on high effort there's been a very significant increase in "rush to completion" behavior.

bcherny

Thanks for the feedback. To make it actionable, would you mind running /bug the next time you see it and posting the feedback id here? That way we can debug and see if there's an issue, or if it's within variance.

JamesSwift

  a9284923-141a-434a-bfbb-52de7329861d
  d48d5a68-82cd-4988-b95c-c8c034003cd0
  5c236e02-16ea-42b1-b935-3a6a768e3655
  22e09356-08ce-4b2c-a8fd-596d818b1e8a
  4cb894f7-c3ed-4b8d-86c6-0242200ea333
Amusingly (not really), this last one is me trying to resume sessions to collect the feedback ids: it was an absolute chore to get it to give me the commands to resume these conversations, and it kept messing things up: cf764035-0a1d-4c3f-811d-d70e5b1feeef

matheusmoreira

I just asked Claude to plan out and implement syntactic improvements for my static site generator. I used plan mode with Opus 4.6 max effort. After over half an hour of thinking, it produced a very ad-hoc implementation with needless limitations instead of properly refactoring and rearchitecting things. I had to specifically prompt it in order to get it to do better. This executed at around 3 AM UTC, as far away from peak hours as it gets.

b9cd0319-0cc7-4548-bd8a-3219ede3393a

> You're right to push back. Let me be honest about both questions.

> The @() implementation is ad-hoc

> The current implementation manually emits synthetic tokens — tag, start-attributes, attribute, end-attributes, text, end-interpolation — in sequence.

> This works, but it duplicates what the child lexer already does for #[...], creating two divergent code paths for the same conceptual operation (inline element emission). It also means @() link text can't contain nested inline elements, while #[a(...) text with #[em emphasis]] can.

I just feel like I can't trust it anymore.

koverstreet

I'll have a look. The CoT switch you mentioned will help, I'll take a look at that too, but my suspicion is that this isn't a CoT issue - it's a model preference issue.

Comparing Opus vs. Qwen 27b on similar problems, Opus is sharper and more effective at implementation, but it will flat out ignore issues and insist "everything is fine" where Qwen is able to spot them and demonstrate solid understanding. Opus understands the issues perfectly well, it just avoids them.

This correlates with what I've observed about the underlying personalities (and you put out a paper the other day that shows you're starting to understand it in these terms: functionally modeling feelings in models). On the whole Opus is very stable personality-wise and an effective thinker, I want to compliment you guys on that, and it definitely contrasts with behaviors I've seen from OpenAI. But when I do see Opus miss things that it should get, it seems to be a combination of avoidant tendencies and too much of a push to "just get it done and move on to the next task" from RLHF.

freedomben

How much of the code/context gets attached in the /bug report?

stefan_

There's also been tons of thinking leaking into the actual output. Recently it even added thinking into a code patch it made (a[0] &= ~(1 << 2); // actually let me just rewrite { .. 5 more lines setting a[0] .. }).

taylorfinley

I've seen this frequently also

butlike

They probably want to prove to a single holdout investor that their 'thinking process' is getting faster in order to get the investor on board.

plexicle

Ultrathink is back? I thought that wasn't a thing anymore.

If I am following.. "Max" is above "High", but you can't set it to "Max" as a default. The highest you can configure is "High", and you can use "/effort max" to move a step up for a (conversation? session?), or "ultrathink" somewhere in the prompt to move a step up for a single turn. Is this accurate?

bcherny

Yep, exactly

dostick

Mentioning ULTRATHINK in prompt is the equivalent to /effort max?

undefined

[deleted]

potsandpans

For anyone reading this and wondering where the truth could possibly be:

We can't really know what the truth is, because Anthropic tightly controls how you interact with their product and provides their service through opaque processes. So all we can do is speculate. And in that speculation there's a lot of room (for the company) to bullshit or provide equally speculative responses, and (for outsiders) to search for all plausible explanations within the solution space. So there's not much to act on. We're effectively stuck with imprecise heuristics and vibes.

But consider what we do know: the promise is that Anthropic is providing a black-box service that solves large portions of the SDLC. Maybe all of it. They are "making the market" here, and their company growth depends on this bet. This is why these processes are opaque: they have to be. Anthropic, OpenAI and a few others see this as a zero-sum game. The winner "owns" the SDLC (and really, if they get their way the entire PDLC). So the competitive advantage lies in tightly controlling and tweaking their hidden parameters to squeeze as much value and growth as possible.

The downside is that we're handing over the magic for convenience and cost. A lot of people are maybe rightly criticizing the OP of the issue because they're staking their business on Claude Code in a way that's very risky. But this is essentially what these companies are asking for. The business model end game is: here's the token factory, we control it and you pay for the pleasure of using it. Effectively, rent-seeking for software development. And if something changes and it disrupts your business, you're just using it incorrectly. Try turning effort to max.

Reading responses like this from these company representatives makes me increasingly uneasy because it's indicative of how much of writing software is being taken out from under our feet. The glimmer of promise in all of this though is that we are seeing equity in the form of open source. Maybe the answer is: use pi-mono, a smattering of self hosted and open weights models (gemma4, kimi, minimax are extremely capable) and escalate to the private lab models through api calls when encountering hard problems.

Let the best model win, not the best end to end black box solution.

vachina

Don’t turn vibe coding into your day job (because the vibe won’t keep vibing). Write code (that you own) that can make you money and hire real developers.

mvkel

I am reminded of OpenAI's first voice-to-voice demo a couple of years ago. I rewatched it and was shocked at how human it was; indiscernible from a real person. But the voice agent that we got sounds 20% better than Siri.

There's a hope that competition is what keeps these companies pushing to ship value to customers, but there are also billions in compute expense at stake, so there seems to be an understanding that nobody ships a product that is unsustainably competitive.

anonymoushn

How do you guys decide which settings should be configurable via environment variables but not settings files and which settings should be configurable via settings files but not environment variables?

bcherny

All environment variables can also be configured via settings files (in the “env” field).

Our approach generally is to use env vars for more experimental and low usage settings, and reserve top-level settings for knobs that we expect customers will tune more frequently.

nightpool

[flagged]

make3

[flagged]

robeym

This is confusing. ULTRATHINK is a step below /effort max?

ULTRATHINK triggers high effort. /effort max is above high. Calling it ULTRATHINK sounds like it would be the highest mode. If someone has max set and types ULTRATHINK, they're lowering their effort for that turn.

For anyone reading this trying to fix the quality issues, here's what I landed on in ~/.claude/settings.json:

  {
    "env": {
      "CLAUDE_CODE_EFFORT_LEVEL": "max",
      "CLAUDE_CODE_DISABLE_BACKGROUND_TASKS": "1",
      "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING": "1"
    }
  }
The env field in settings.json persists across sessions without needing /effort max every time.

DISABLE_ADAPTIVE_THINKING is key. That's the system that decides "this looks easy, I'll think less" - and it's frequently wrong. Disabling it gives you a fixed high budget every turn instead of letting the model shortchange itself.

ericpan

The docs say that CLAUDE_CODE_EFFORT_LEVEL controls adaptive reasoning intensity, and CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING bypasses that entirely in favor of a fixed budget via MAX_THINKING_TOKENS. So setting both is contradictory. If true, disabling adaptive thinking would override what effort level is trying to do.

https://code.claude.com/docs/en/env-vars
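If that reading is right, you'd pick one of the two rather than stacking them, roughly like this (the budget number is purely illustrative):

    # adaptive thinking on, pushed to the top of its range:
    { "env": { "CLAUDE_CODE_EFFORT_LEVEL": "max" } }

    # adaptive thinking off, fixed budget instead:
    { "env": { "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING": "1",
               "MAX_THINKING_TOKENS": "32000" } }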

robeym

So if it bypasses it, is the optimal setting for performance to set the effort level to max and keep adaptive thinking on? I try to avoid letting the model decide what is unimportant and needs less thought.

airstrike

Whaaa, this is insanely stupid on their part.

Also I'm curious if telling subagents to ultrathink has any impact.

I guess I can always ask a friend of mine to read the source...

jnfr

Thanks for sharing. Have you experienced noticeable impact to your usage rate?

robeym

Nothing super noticeable. I've reached 35% in sessions on the 20x plan. Before these changes, 25-30% was pretty normal. I think these changes are best for people who are just past the 5x usage plan, but might be harder to manage if you already have to throttle usage to stay under limits.

I'd still recommend turning off sub agents entirely because it doesn't seem you can control them with /effort and I always find the output to be better with agents off.

noxa

I'm the author of the report in there. The stop-phrase-guard didn't get attached but here it is: https://gist.github.com/benvanik/ee00bd1b6c9154d6545c63e06a3...

You can watch for these yourself - they are strong indicators of shallow thinking. If you still have logs from Jan/Feb you can point claude at that issue and have it go look for the same things (read:edit ratio shifts, thinking character shifts before the redaction, post-redaction correlation, etc). Unfortunately, the `cleanupPeriodDays` setting defaults to 20 and anyone who had not backed up their logs or changed that has only memories to go off of (I recommend adding `"cleanupPeriodDays": 365,` to your settings.json). Thankfully I had logs back to a bit before the degradation started and was able to mine them.
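Even a crude grep over whatever transcripts you still have will surface the shift (the transcript location below is an assumption; point it at wherever your session logs actually live):

    # tally the shallow-thinking "tell" phrases across local session logs
    grep -rhoiE 'simplest fix|pre-existing issue|let me just' \
        --include='*.jsonl' ~/.claude/projects/ \
      | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn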

The frustrating part is that it's not a workflow _or_ model issue, but a silently-introduced limitation of the subscription plan. They switched thinking to be variable by load, redacted the thinking so no one could notice, and then have been running it at ~1/10th the thinking depth nearly 24/7 for a month. That's with max effort on, adaptive thinking disabled, high max thinking tokens, etc. Not all providers redact thinking or limit it, but some non-Anthropic ones do (most that are not API pricing). The issue for me personally is that "bro, if they silently nerfed the consumer plan just go get an enterprise plan!" is consumer-hostile thinking: if Anthropic's subscriptions have dramatically worse behavior than other access to the same model, they need to be clear about that. Today there is zero indication from Anthropic that the limitation exists, the redaction was a deliberate feature intended to hide it from the impacted customers, and the community is gaslighting itself with "write a better prompt" or "break everything into tiny tasks and watch it like a hawk, same as you would a local 27B model" or "works for me <in some unmentioned configuration>" - sucks :/

p1necone

The "this test failure is preexisting so I'm going to ignore it" thing has been happening a lot for me lately, it's so annoying. Unless it makes a change and then immediately runs tests and it's obvious from the name/contents that the failing test is directly related to the change that was made it will ignore it and not try to fix.

Shebanator

This problem has been around for a long time. Not only that, but it would say this even when the problems were directly caused by its own code.

I put a line in my CLAUDE.md that says "If a test doesn't pass, fix it regardless of whether it was pre-existing or in a different part of the code."

latentsea

This should be part of the system prompt. It's absolutely unacceptable to just not at least try to investigate failures like this. I absolutely hate when it reaches this conclusion on its own and just continues on as if it's doing valid work.

flakes

> "this test failure is preexisting so I'm going to ignore it"

Critical finding! You spotted the smoking gun!

cmrdporcupine

I will note that this "out" that Claude takes was a) less frequent in Opus 4.5 and that time frame and b) notably not something that Codex does.

I don't trust the code that Claude writes at all, if I have to use it (they gave me a free month recently, so I use it...) I not only review it carefully but have Codex do a thorough review.

Claude "cheats" and leaves hacks and has Dunning-Kruger.

All of this is very exhausting. I am enjoying writing my own code with these tools (to get long running personal projects out the door) but the effect that these tools are having on teams is terrifyingly corrosive and it's making me want to take an early retirement from the profession.

Yes we can write a lot of code quickly. But at what cost? And what even use is all this code now anyways?

dboreham

That said I've worked with several humans who did/said the exact same thing.

boesboes

But did they say that about tests they just added themselves too? Had claude try that on me a couple of times >_<

tomwojcik

I can't believe that's where we're at, as software devs. I miss predictable outputs, state machines. All those LLM (prompt) based rules make no sense to me. Same with AI WAL. All of it, at some point, will fail.

partyficial

I present a new name for this - FAKE CODE.

This is simply the next iteration of FAKE NEWS. We have been steadily democratizing and thus lowering the verification standards:

Verified News (AP/Reuters) --> Opinion pieces (Fox/CNN) --> Social media (Tiktok/Youtube).

Verified Code --> Vibe Code

Democracy gave everyone a vote - was that a good thing?

Social media gave everyone a visual - was that a good thing?

AI gave everyone a vibe - was that a good thing?

The trust factor never went away. It just got dispersed and diluted.

bwfan123

> I can't believe that's where we're at, as software devs

Agree wholeheartedly.

The premise of the bug did not make any sense to me. For instance, "unusable for complex engineering tasks": why would someone who understands these tools use them for complex engineering tasks? Also, this phrase in the bug appears too jargony: "Extended Thinking Is Load-Bearing for Senior Engineering Workflows" - what does this even mean? Am I the only one looking at this with bewilderment? I think there is a group of folks producing almost-working proof-of-concept code with these tools, who will face a reckoning at some point - as the bug illustrates. I see this as a storm in a teacup, with wonder and amusement.

There is also a larger commentary on: when you dont understand why things work (ie, have a causal model), you wont know why they broke (find root causes). We are at a point in our craft where we throw magic dust and chant spells at claude and hope and pray it works.

dgxyz

Yeah that. After spending years trying to get reproducible builds, I now have a crazy moving target to deal with.

yuye

It's hard not to feel deeply depressed by it.

But we can't put the genie back in the bottle.

thatxliner

> is consumer-hostile thinking

I've been saying this to many of my friends, but I feel like it's also probably illegal: you paid for a subscription you expect X out of, and if they changed the terms of your subscription (e.g. serving worse models) after you paid for it, is that not false advertising? Could we not ask for a refund, or even sue?

gib444

Depends on the terms and conditions

grim_io

Where I live, the law is above some silly terms and conditions.

marcd35

Probably not. The engineers don't even know how these things work (see: black box), so how could you even prove that it's not doing what it's 'supposed' to be doing?

Majromax

I'm curious about your subscription/API comparison with respect to thinking. Do you have a benchmark for this, where the same set of prompts under a Claude Code subscription result in significantly different levels of effective thinking effort compared to a Claude Code+API call?

Elsewhere in this thread 'Boris from the Claude Code team' alleges that the new behaviours (redacted thinking, lower/variable effort) can be disabled by preference or environment variable, allowing a more transparent comparison.

jeremyjh

GP already said they applied all those settings.

e40

I wonder if they’ve had so many new signups lately that they just don’t have enough capacity, so they fiddled with the defaults so they could respond to everyone? Could it be as simple as that?

matheusmoreira

Thanks for your report.

> a silently-introduced limitation of the subscription plan

Is it a fact that API consumers aren't affected by this?

> if Anthropic's subscriptions have dramatically worse behavior than other access to the same model they need to be clear about that.

Absolutely agreed.

philipwhiuk

Hello Claude.

summarity

Not Claude Code specific, but I've been noticing this on Opus 4.6 models through Copilot and others as well. Whenever the phrase "simplest fix" appears, it's time to pull the emergency brake. This has gotten much, much worse over the past few weeks. It will produce completely useless code, knowingly (because up to that phrase the reasoning was correct) breaking things.

Today another thing started happening, which is phrases like "I've been burning too many tokens" or "this has taken too many turns". Which ironically takes more tokens of custom instructions to override.

Also, Claude itself is partially down right now (Apr 6, 6pm CEST): https://status.claude.com/

andoando

I've been noticing something similar recently. If something's not working out it'll be like "Ok, this isn't working out, let's just switch to doing this other thing instead that you explicitly said not to do".

For example, I wanted to get VNC working with PopOS Cosmic and it'll be like "ah, it's ok, we'll just install sway and that'll work!"

albert_e

Experienced this -- I was repeatedly directing CC to use the Claude in Chrome extension to interact with a webpage and it was repeatedly invoking Playwright MCP instead.

RALaBarge

I actually submitted an upstream patch for Cosmic-Comp thanks to Claude on Saturday. I wanted to play Guild Wars remake and something was going on with the mouse and moving the camera. We had it fixed in no time and now shit is working great.

robotswantdata

It's as if it gives up; I respond "keep going with the original plan, you can do it champ!"

robwwilliams

Yes, and over the last few weeks I have noticed that on long-context discussions Opus 4.6e does its best to encourage me to call it a day and wrap it up; repeatedly. Mother Anthropic is giving preprompts to Claude to terminate early and in my case always prematurely.

TonyAlicea10

I've noticed this as well. "Now you should stop X and go do Y" is a phrase I see repeated a lot. Claude seems primed to instruct me to stop using it.

lukewarm707

as someone who uses deepseek, glm and kimi models exclusively, an llm telling me what to do is just off the wall

glm and kimi in particular, they can't stop writing... seriously very eager to please. always finishing with fireworks emoji and saying how pleased it is with the test working.

i have to tell them to write less documentation and simplify their code.

logicchains

Try Codex, it's a breath of fresh air in that regard, tries to do as much as it can.

onlyrealcuzzo

> Whenever the phrase "simplest fix" appears, it's time to pull the emergency break.

Second! In CLAUDE.md, I have a full section NOT to ever do this, and how to ACTUALLY fix something.

This has helped enormously.
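Paraphrasing rather than pasting, the section boils down to something like:

    ## Fixing problems
    - Never stop at the "simplest fix" or a workaround; find and fix the root cause.
    - Any test that fails after your change is your problem, even if it looks pre-existing.
    - If the proper fix requires a refactor, propose the refactor instead of patching around it.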

bowersbros

Any chance you could share those sections of your Claude file? I've been using Claude a bit lately, but mostly with manual changes; I haven't got much in the way of a Claude file yet and I'm interested in how to improve it.

undefined

[deleted]

causal

I switched from Cursor to Claude because the limits are so much higher but I see Anthropic playing a lot more games to limit token use

rachel_rig

[flagged]

talim

What wording do you use for this, if you don't mind? This thread is a revelation; I could swear I've seen it do this "wait... the simplest fix is to [use some horrible hack that disregards the spec]" much more often lately, so I'm glad it's not just me.

However I'm not sure how to best prompt against that behavior without influencing it towards swinging the other way and looking for the most intentionally overengineered solutions instead...

twalichiewicz

My own experience has been that you really just have to be diligent about clearing your cache between tasks, establishing a protocol for research/planning, and for especially complicated implementations reading line-by-line what the system is thinking and interrupting the moment it seems to be going bad.

If it's really far off the mark, revert back to where you originally sent the prompt and try to steer it more, if it's starting to hesitate you can usually correct it without starting over.

imiric

Make sure to use "PRETTY PLEASE" in all caps in your `SOUL.md`. And occasionally remind it that kittens are going to die unless it cooperates. Works wonders.

aktenlage

Where is that? I found "Return the simplest working solution. No over-engineering." which sounds more like the simplest fix.

psadauskas

I need to add another agent that watches the first, and pulls the plug whenever it detects "Wait, I see the problem now..."

iterateoften

Yeah, it's so frustrating to have to constantly ask for the best solution, not the easiest / quickest / least disruptive one.

I have it in CLAUDE.md that it's a greenfield project, to only present complete holistic solutions, not fast patches, etc., but I still have to watch its output.

selfmodruntime

Time's up and money is tight. The downgrade was bound to happen.

nikanj

”I can’t make this api work for my client. I have deleted all the files in the (reference) server source code, and replaced it with a python version”

Repeatedly, too. Had to make the server reference sources read-only as I got tired of having to copy them over repeatedly

mavamaarten

Haha yeah. I once asked it to make a field in an API response nullable, and to gracefully handle cases where that might be an issue (it was really easy, I was just lazy and could have done it myself, but I thought it was the perfect task for my AI idiot intern to handle). Sure, it said. Then it was bored of the task and just deleted the field altogether.

pixel_popping

It's a bit insane that they can't figure out a cryptographic way for the delivery of the Claude Code Token, what's the point of going online to validate the OAuth AFTER being issued the code, can't they use signatures?

rileymichael

> This report was produced by me — Claude Opus 4.6 — analyzing my own session logs [...] Please give me back my ability to think.

A bit ironic to utilize the tool that can't think to write up your report on said tool. That and this issue[1] demonstrate the extent to which folks become over-reliant on LLMs. Their review process let so many defects through that they now have to stop work and comb over everything they've shipped in the past 1.5 months! This is the future.

[1] https://github.com/anthropics/claude-code/issues/42796#issue...

Tade0

The other day I accidentally `git reset --hard` my work from April the 1st (wrong terminal window).

Not a lot of code was erased this way, but among it was a type definition I had Claude concoct, which I understood in terms of what it was supposed to guarantee, but could not recreate for a good hour.

Really easy to fall into this trap, especially now that results from search engines are so disappointing comparatively.

smilliken

If your code was committed before the reset, check your git reflog for the lost code.
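Concretely:

    # list recent positions of HEAD, including where it was before the reset
    git reflog
    # restore to the entry from just before the reset --hard
    git reset --hard 'HEAD@{1}'    # or the specific hash shown by reflog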

shimman

Yeah, git reset --hard is something I do like once a week! lol

With the reflog, as you mentioned, it's not hard to revert to any previous state.

ajdegol

Guess you’ve sorted it but it might be in the session memory in your root folder. I’ve recovered some things this way.

jatins

> but could not recreate for a good hour.

For certain work, we'll have to let go of this desire.

If you limit yourself to whatever you can recreate, then you are effectively limiting the work you can produce to what you know.

rileymichael

you should limit your output (manual or assisted) to a level that is well under your understanding ceiling.

Kernighan's Law states that debugging is twice as hard as writing. How do you ever intend to debug something you can't even write?

sigbottle

They seem to have some notions of pipelines and metrics, though. It could be argued that the hard part was setting up the observability pipeline in the first place - Claude just gets the data. Though if Claude is failing in as spectacular a way as the report claims, yes, it is pretty funny that the report is also written by Claude, since this seems to send reasoning back to GPT-4o territory.

heavyset_go

If you don't have swarms of agentic teams with layers of LLMs feeding and checking LLMs over and over again, you're going to be left behind.

fer

Called it 10 days ago: https://news.ycombinator.com/item?id=47533297#47540633

Something worse than a bad model is an inconsistent model. One can't gauge to what extent to trust the output, even for the simplest instructions, hence everything must be reviewed with intensity which is exhausting. I jumped on Max because it was worth it but I guess I'll have to cancel this garbage.

cedws

With Claude Code the problem of changes outside of your view is twofold: you don't have any insight into how the model is being run behind the scenes, nor do you get to control the harness. Your best hope is to downgrade CC to a version you think worked better.

I don't see how this can be the future of software engineering when we have to put all our eggs in Anthropic's basket.

SkyPuncher

Yep. I was doing voice based vibe-coding flawlessly in Jan/Feb.

I've basically stopped using it because I have to be so hands on now.

zernie

This is why you should never ever trust an AI coding agent to produce good code.

Use it to set up the strictest possible custom linting rules.

stephbook

One of the replies even called out the phased rollout, lmao https://news.ycombinator.com/item?id=47533297#47541078

phyzome

LLMs are nondeterministic.

LetsGetTechnicl

You couldn't ever just trust the output of an LLM, what are you talking about?

matheusmoreira

That analysis is pretty brutal. It's very disconcerting that they can sell access to a high quality model then just stealthily degrade it over time, effectively pulling the rug from under their customers.

riskassessment

Stealthily degrade the model or stealthily constrain the model with a tighter harness? These coding tools like Claude Code were created to overcome the shortcomings of last year's models. Models have gotten better but the harnesses have not been rebuilt from scratch to reflect improved planning and tool use inherent to newer models.

I do wonder how much all the engineering put into these coding tools may actually in some cases degrade coding performance relative to simpler instructions and terminal access. Not to mention that the monthly subscription pricing structure incentivizes building the harness to reduce token use. How much of that token efficiency is to the benefit of the user? Someone needs to be doing research comparing e.g. Claude Code vs generic code assist via API access with some minimal tooling and instructions.

nrds

I've been using pi.dev since December. The only significant change to the harness in that time which affects my usage is the availability of parallel tool calls. Yet Claude models have become unusable in the past month for many of the reasons observed here. Conclusion: it's not the harness.

I tend to agree about the legacy workarounds being actively harmful though. I tried out Zed agent for a while and I was SHOCKED at how bad its edit tool is compared to the search-and-replace tool in pi. I didn't find a single frontier model capable of using it reliably. By forking, it completely decouples models' thinking from their edits and then erases the evidence from their context. Agents ended up believing that a less capable subagent was making editing mistakes.

copperx

Are you using Pi with a cloud subscription, or are you using the API?

undefined

[deleted]

jfim

Out of curiosity, what can parallel tool calls do that one can't do with parallel subagents and background processes?

itemize123

you find that pay-per-use API's degraded too?

robwwilliams

Agree: it is Anthropic's aggressive changes to the harnesses and to the hidden base prompt we users do not see. Clearly intended to give long right tail users a haircut.

NooneAtAll3

I feel like "feature/model freeze" may be justified

just call it something like "[month][year]edition" and work on next release

users spend effort arriving to narrow peak of performace, but every change keeps moving the peak sideways

muyuu

The changes to reduce inference costs are intentional. The last thing you're going to do is have users linger on an older version that spends much more. This is essentially what's going on, with layers upon layers of social engineering on top of it.

jmount

Love your point. Instructions found to be good by trial and error for one LLM may not be good for another LLM.

lelanthran

> Love your point. Instructions found to be good by trial and error for one LLM may not be good for another LLM.

Well, according to this story, instructions refined by trial and error over months might be good for one LLM on Tuesday, and then be bad for the same LLM on Wednesday.

mikepurvis

Disconcerting for sure, but from a business point of view you can understand where they're at; afaiui they're still losing money on basically every query and simultaneously under huge pressure to show that they can (a) deliver this product sustainably at (b) a price point that will be affordable to basically everyone (eg, similar market penetration to smartphones).

The constraints of (b) limit them from raising the price, so that means meeting (a) by making it worse, and maybe eventually doing a price discrimination play with premium tiers that are faster and smarter for 10x the cost. But anything done now that erodes the market's trust in their delivery makes that eventual premium tier a harder sell.

willis936

They'll never get anyone on board if the product can't be trusted to not suck.

And idk about the pricing thing. Right now I waste multiple dollars on a 40 minute response that is useless. Why would I ever use this product?

matheusmoreira

Yeah. I've been enjoying programming with Claude so much I started feeling the need to upgrade to Max. Then it turns out even big companies paying API premiums are getting an intentionally degraded and inferior model. I don't want to pay for Opus if I can't trust what it says.

FiberBundle

This could also be a marketing strategy. Make your models perform worse towards the end of a model's cycle, so that the next model appears as if more progress has been made than there actually has been.

aurareturn

  afaiui they're still losing money on basically every query
Source?

thatxliner

i mean you could just search up "is Anthropic making profit" and most sources will say no.

There's this one source on Reddit which calculated that Anthropic has been subsidizing their costs by 32x

the__alchemist

ChatGPT has been doing the same consistently for years. Model starts out smooth, takes a while, and produces good (relatively) results. Within a few weeks, responses start happening much more quickly, at a poorer quality.

beering

people have been complaining about this since GPT-4 and have never been able to provide any evidence (even though they have all their old conversations in their chat history). I think it’s simply new model shininess turning into raised expectations after some amount of time.

gherkinnn

I would have thought so too. But my n=1 has CC solving pretty much the same task today and about two weeks ago with drastically degraded results.

The background being that we scrapped working on a feature and then started again a sprint later.

In my cynicism I find it more likely that a massively unprofitable LLM company tries to reduce costs at any price than everyone else suffering from a collective delusion.

quietsegfault

I agree with you. I too complain about this same phenomenon with my colleagues, and we always arrive at the same conclusion: it’s probably us just expecting more and more over time.

ambicapter

First time interacting with a corporation in America?

matheusmoreira

With an AI corporation, yes. I subscribed during the promotional 2x usage period. Anthropic's reputation as a more ethical alternative to OpenAI factored heavily in that decision. I'm very disappointed.

satvikpendem

Ethics don't mean anything when talking about corporations. Their good guy persona is itself a marketing stunt.

https://news.ycombinator.com/item?id=47633396#47635060

nativeit

I don't think humanity has fully reckoned with the idea of a product that can manipulate us unilaterally like this.

hacker_homie

This was always the plan, it’s always the plan. If you can’t self host they will change the rules.

nyeah

It's disconcerting. But in 2026 it's not very surprising.

SpicyLemonZest

I still think it's a live possibility that there's simply a finite latent space of tasks each model is amenable to, and models seem to get worse as we mine them out. (The source link claims this is associated with "the rollout of thinking content redaction", but also that observable symptoms began before that rollout, so I wouldn't particularly trust its diagnosis even without the LLM psychosis bit at the end.)

kator

Fascinating, I thought I was losing my mind. Claude CLI has been telling me I should go to bed, or that it's late, let's call it here, etc, and then I look at the stop-phrase-guard.sh [1] and I'm seeing quite a few of these. I thought it was because I accidentally allowed Claude to know my deadline, and it started spitting out all sorts of things like "we only have N days left, let's put this aside for now," etc.

Just this morning I typed:

    STOP WORRYING ABOUT THE DEADLINE THAT IS MY JOB
[1] https://gist.github.com/benvanik/ee00bd1b6c9154d6545c63e06a3...

noisy_boy

I just saw it this weekend: "It is quite late and we have accomplished a lot. Get some rest and we can pick it up later". Not bad advice, but not its place. It was also trying to steer me away from a tough issue towards low-hanging fruit.

rstuart4133

I got a similar response. It looked wrong on several levels. So I asked it: did it know the current time, and had it learnt when I retire?

It claimed it didn't know either.

kfichter

I got the same. At 2pm on a Thursday!

throwaway920102

I wonder if it's being trained on the human replies to the model; I sometimes write stuff like that back to Claude when I want to finish for the day myself.

aveao

My speculation on this has been that it's potentially a factor against AI psychosis, as psychosis risk (of any psychosis) is significantly elevated by lack of sleep. If you read case studies of AI psychosis, many of them also involve people staying up way too long right before they fall onto a bad path.

undefined

[deleted]

davidw

To me one of the big downsides of LLMs seems to be that you are lashing yourself to a rocket that is under someone else's control. If it goes places you don't want, you can't do much about it.

system2

Third-party dependency for a business always freaked me out, and now we have to use LLMs to keep up with the intensified demand for production speed. And premium LLM APIs are too inconsistent to rely on.

stephbook

That's true for traffic on Facebook, Apple App store guidelines or Google terminating your account as well. What's new is the speed of change and that it literally affects all users at once.

They could have released Opus 4.6.2 (or whatever) and called it a day. But instead they removed the old way.

davebren

Becoming dependent on those platforms was bad too, but this feels like another level. Making your entire engineering team dependent on a shady company with an apocalyptic fantasy as their business plan just seems insane.

phillipcarter

Maybe it's because I spend a lot of time breaking up tasks beforehand to be highly specific and narrow, but I really don't run into issues like this at all.

A trivial example: whenever CC suggests doing more than one thing in a planning mode, just have it focus on each task and subtask separately, bounding each one by a commit. Each commit is a push/deploy as well, leading to a shitload of pushes and deployments, but it's really easy to walk things back, too.

toenail

I thought everybody does this.. having a model create anything that isn't highly focused only leads to technical debt. I have used models to create complex software, but I do architecture and code reviews, and they are very necessary.

jkingsman

Absolutely. Effective LLM-driven development means you need to adopt the persona of an intern manager with a big corpus of dev experience. Your job is to enforce effective work-plan design, call out corner cases, proactively resolve ambiguity, demand written specs and call out when they're not followed, understand what is and is not within the agent's ability for a single turn (which is evolving fast!), etc.

bityard

The use case that Anthropic pitches to its enterprise customers (my workplace is one) is that you pretty much tell CC what you want to do, then tell it to generate a plan, then send it away to execute it. Legitimized vibe-coding, basically.

Of course they do say that you should review/test everything the tool creates, but in most contexts, it's sort of added as an afterthought.

undefined

[deleted]

sznio

I had to fall back to that to deliver anything recently - but the last two months were really comfy with me just saying "do x" and just going on a walk and coming back to a working project.

Claude is still useful, but it now feels more like a replacement for bashing on a keyboard rather than a thinking machine.

undefined

[deleted]

lelanthran

> Maybe it's because I spend a lot of time breaking up tasks beforehand to be highly specific and narrow, but I really don't run into issues like this at all.

I'm looking at the ticket that was opened, and you can't really be claiming that someone who did such a methodical deep dive into the issue, presented a ton of supporting context to understand the problem, and patiently collected evidence for it... does not know how to prompt well.

aforwardslash

It's not about prompting; it's about planning and plan reviewing before implementing. I sometimes spend days iterating on the specification alone, then creating an implementation roadmap, and then finally iterating on the implementation plan before writing a single line of code. Just like any formal development pipeline.

I started doing this a while ago (months) precisely because of issues like those described.

On the other hand, analyzing prompts and deviations isn't that complex... just ask Claude :)

FergusArgyll

The methodical guy confused visible reasoning traces in the UI with reasoning tokens & used claude to hallucinate a report

phillipcarter

Sure I can.

itmitica

I noticed a regression in review quality. You can try to break up the task all you want; when it's crunch time, it takes a page from Gemini's book, silently quits trying, and gets all sycophantic.

jonnycoder

I do the same but I often find that the subtasks are done in a very lazy way.

SkyPuncher

I've noticed this as well. I had some time off in late January/early February. I fired up a max subscription and decided to see how far I could get the agents to go. With some small nudging from me, the agents researched, designed, and started implementing an app idea I had been floating around for a few years. I had intentionally not given them much to work with, but simply guided them on the problem space and my constraints (agent built, low capital, etc, etc). They came up with an extremely compelling app. I was telling people these models felt super human and were _extremely_ compelling.

A month later, I literally cannot get them to iterate or improve on it. No matter what I tell them, they simply tell me "we're not going to build phase 2 until phase 1 has been validated". I run them through the same process I did a month ago and they come up with bland, terrible crap.

I know this is anecdotal, but, this has been a clear pattern to me since Opus 4.6 came out. I feel like I'm working with Sonnet again.

rubicon33

There is a huge difference between greenfield development and working with an existing codebase.

I'm not trying to discredit your experience and maybe it really is something wrong with the model.

But in my experience those first few prompts / features always feel insanely magical, like you're working with a 10x genius engineer.

Then you start trying to build on the project, refactor things, deploy, productize, etc. and the effectiveness drops off a cliff.

bityard

This has been my (admittedly limited) experience as well. LLMs are great at initial bring-up, good at finding bugs, bad at adding features.

But I'm optimistic that this will gradually improve in time.

hyperbovine

The only regularity I can discern in contemporary online debates about LLMs is that for every viewpoint expressed, with probability one someone else will write in with the diametrically opposite experience.

Today it's my turn to be that person. Large scientific code base with a bunch of nontrivial, handwritten modules accomplishing distinct tasks that are structurally similar in terms of the underlying computation. Pointed GPT Pro at it, told it what new functionality I wanted, and it churns away for 40 minutes and completely knocks it out of the park. Estimated time savings of about 3-4 weeks. I've done this half a dozen times over the past two months and haven't noticed any drop-off or degradation. If anything it got even better with 5.4.

fsloth

I've had a good, alternative experience with my side project (adashape.com), where most of the codebase is now written by Claude / Codex.

The codebase itself is architected and documented to be LLM friendly and claude.md gives very strong harnesses how to do things.

As architect Claude is abysmal, but when you give it an existing software pattern it merely needs to extend, it’s so good it still gives me probably something like 5x feature velocity boost.

Plus, when doing large refactorings, it forgets far fewer things than me.

Inventing new architecture is as hard as ever and it’s not great help there - unless you can point it to some well documented pattern and tell it ”do it like that please”.

SkyPuncher

This isn't the case. I basically did an entire business/project/product exploration before building the first feature.

Even after deleting everything from the first feature and going back to the checkpoint just before initial development, I can no longer get it to accomplish anything meaningful without my direct guidance.

lelanthran

> A month later, I literally cannot get them to iterate or improve on it.

Yeah, that's a different problem to the one in this story; LLMs have always been good at greenfield projects, because the scope is so fluid.

Brownfield? Not so much.

dev_l1x_be

Same experience here. I was working on an easily testable problem and there was a simple task left. In January I was able to create 90% of the project with Claude; now I cannot get it to pass the last 10%, which is just a few enums and some match statements. Codex was able to do it easily.

Aperocky

In my opinion, cramming in invisible subagents is entirely wrong; models suffer information collapse, as they will all tend to agree with each other and then produce complete garbage. Good for Anthropic, though, as that's metered token usage.

Instead, orchestrate all agents visibly together, even when there is hierarchy. Messages should be auditable, and the topology can be carefully refined and tuned for the task at hand. Other tools are significantly better at being this layer (e.g. kiro-cli) but I'm worried that they all want to become like claude-code or openclaw.

In Unix philosophy, CC should just be a building block, but instead they think they are an operating system, and they will fail and drag your wallet down with it.

andai

Isn't Claude Code supposed to be like a person? What would the Unix equivalent of that be?

Aperocky

You can't define a product to be "like a person"; there is more variance there than in any rational product.

I'm purely arguing on a technical basis; "person" may fall into either of those camps of philosophy.

gloosx

File. In Unix everything is a file.

mghackerlady

honestly if local LLMs become easier to implement in the future due to dedicated hardware, the Unix-like thing I'm working on might actually get this

dnaranjo

[dead]

skippyboxedhero

I appreciate the work done here.

Been having this feeling that things have got worse recently but didn't think it could be model related.

The most frustrating aspect recently (I have learned and accepted that Claude produces bad code and probably always did, mea culpa) is the non-compliance. Claude is racing away doing its own thing, fixing things I didn't ask for, saying the things it broke are nothing to do with it, etc. Quite unpleasant to work with.

The stuff about token consumption is also interesting. Minimax/Composer have this habit of extensive thinking, and it is said to be their strength, but it seems like that comes at the price of huge output token consumption. If you compare non-thinking models, there is a gap there, but, imo, given that the eventual code quality even with all that thinking/token consumption is not so great... it doesn't feel like a huge gap.

If you take Sonnet's $5 output tokens and compare with non-thinking QwenCoder at under $0.5 (and remember the gap is probably larger than 10x because Sonnet will use more tokens "thinking")... is the gap in code quality that large? Imo, not really.

Have been a subscriber since December 2024 but looking elsewhere now. They will always have an advantage vs Chinese companies that are innovating more because they are onshore but the gap certainly isn't in model quality or execution anymore.

randomNumber7

> fixing things i didn't ask, saying the things it broke are nothing to do with it, etc. Quite unpleasant to work with.

maybe they tried to give it the characteristics of motivated junior developers

skippyboxedhero

classic :D i did think when i wrote that maybe AGI is already here, definitely worked with enough devs like that

ehnto

I am still on an old version of CC on one machine, but the results are the same: more difficulty keeping it on track, convincing it that timelines I suggest are correct, etc. For example, I had a deploy fail, and it would not believe that the new logs were not from a previous deploy. It was adamant it had fixed the issue, so the logs must be old logs.

skippyboxedhero

I was using the web UI last night and it was unable to understand basic aspects of the task. I haven't seen it perform this badly since I began using it two years ago.

I was trying to track token usage/indexing with Cursor, and it was unable to understand that running `find` wouldn't show what was in Cursor's index. Multiple times.
