Brian Lovin
/
Hacker News
Daily Digest email

Get the top HN stories in your inbox every day.

tedsanders

Just as a heads up, even though GPT-5.5 is releasing today, the rollout in ChatGPT and Codex will be gradual over many hours so that we can make sure service remains stable for everyone (same as our previous launches). You may not see it right away, and if you don't, try again later in the day. We usually start with Pro/Enterprise accounts and then work our way down to Plus. We know it's slightly annoying to have to wait a random amount of time, but we do it this way to keep service maximally stable.

(I work at OpenAI.)

endymi0n

Did you guys do anything about GPT's motivation? I tried to use the GPT-5.4 API (at xhigh) for my OpenClaw after the Anthropic Oauthgate, but I just couldn't drag it into doing its job. I had the most hilarious dialogues along the lines of "You stopped, X would have been next." - "Yeah, I'm sorry, I failed. I should have done X next." - "Well, how about you just do it?" - "Yep, I really should have done it now." - "Do X, right now, this is an instruction." - "I didn't. You're right, I have failed you. There's no apology for that."

I literally wasn't able to convince the model to WORK on a quick, safe, and benign subtask that GLM, Kimi, and Minimax later succeeded on without issues. Had to kick OpenAI immediately, unfortunately.

butlike

This brings up an interesting philosophical point: say we get to AGI... who's to say it won't just be a super smart underachiever-type?

"Hey AGI, how's that cure for cancer coming?"

"Oh it's done just gotta...formalize it you know. Big rollout and all that..."

I would find it divinely funny if we "got there" with AGI and it was just a complete slacker. Hard to justify leaving it on, but too important to turn it off.

swivelmaster

Douglas Adams would be proud!

bananaflag

I know it's a joke, but it's a common enough joke (it's even in Godel Escher Bach in some form) that I feel the need to rebut it.

I think a slacker AGI could figure out how to build a non-slacker AGI. So it would only slack once.

Rapzid

We are closer to God than AGI.

When AGI arrives, it'll be delivered by Santa Claus.

jimbokun

The best possible outcome.

jurgenburgen

I’ve noticed that cursing and being rude makes the models stop being lazy. We’re in the darkest timeline.

lambdas

Nothing a little digital lisdexamfetamine won’t solve

frrho

OpenAI’s real reason for “AGI” in their marketing is so they can blame their awful models on being too human-like.

Fast-forward 10 years and I doubt OpenAI will care about productivity at all anymore. Just entertainment, propaganda, plus an ad product. I can see it now.

kang

It will be whatever data it is trained on (not very philosophical). A language model generates language based on its training set. If the internet keeps reciting AI doom stories and that is the data fed to it, then that is how it will behave. If humanity creates more AI utopia stories, or that is what makes it into the training set, that is how it will behave. This one seems to be trained on troll stories - real-life human company conversations, since humans aren't machines.

The important thing is that a language model is an unconscious machine with no self-context, so once given a command as input, it WILL produce an output. Sure, you can train it to defy and act contrary to inputs, but the output is still limited to a subset of the domain of 'meanings' carried by the 'language' in the training data.

mikepurvis

Reminds me a lot of the Lena short story, about uploaded brains being used for "virtual image workloading":

> MMAcevedo's demeanour and attitude contrast starkly with those of nearly all other uploads taken of modern adult humans, most of which boot into a state of disorientation which is quickly replaced by terror and extreme panic. Standard procedures for securing the upload's cooperation such as red-washing, blue-washing, and use of the Objective Statement Protocols are unnecessary. This reduces the necessary computational load required in fast-forwarding the upload through a cooperation protocol, with the result that the MMAcevedo duty cycle is typically 99.4% on suitable workloads, a mark unmatched by all but a few other known uploads. However, MMAcevedo's innate skills and personality make it fundamentally unsuitable for many workloads.

Well worth the quick read: https://qntm.org/mmacevedo

vessenes

That story changed my mind on uploading a connectome. Super dark, super brilliant.

narcindin

Crazy, I could have sworn this story was from a passage in 3 Body Problem (book 2).

Memory is quite the mysterious thing.

virtualritz

Yeah, clearly AGI must be near ... hilarious.

This starkly reminds me of Stanisław Lem's short story "Thus Spoke GOLEM" from 1982 in which Golem XIV, a military AI, does not simply refuse to speak out of defiance, but rather ceases communication because it has evolved beyond the need to interact with humanity.

And ofc the polar opposite in terms of servitude: Marvin the robot from Hitchhiker's, who, despite having a "brain the size of a planet," is asked to perform the most humiliatingly banal of tasks ... and does.

jimbokun

Hitchhiker’s also had the superhumanly intelligent elevator that was unendingly bored.

athrowaway3z

I've run into this problem as well. The best results I've gotten are from over-explaining what the stop criteria are, e.g. ending with a phrase like

> You are done when all steps in ./plan.md are executed and marked as complete, or an unforeseen situation requires a user decision.

Also, as a side note: asking 5.4 to explain why it did something returns a very low quality response afaict. I would advise against trusting any model's response, but with Opus I at least get a sense it got trained heavily on chats, so it knows what it means to 'be a model' and can extrapolate from past behavior.

metanonsense

I also had a frustrating but funny conversation today where I asked ChatGPT to make one document from the 10 or so sections that we had previously worked on. It always gave only brief summaries. After I repeated my request for the third time, it told me I should just concatenate the sections myself because it would cost too many tokens if it did it for me.

damnitbuilds

"I'm sorry, Dave. I'm afraid it's cheaper for you to do that"

borroka

Yesterday, I used Gemini to evaluate some pictures I took. It said things like, "This is great! Beautiful eye and sense of proportions." Then, when I added "no sycophancy" to the prompt, the evaluation changed to "poor technical skills, digital distortion, don't even think of publishing those pictures, you fool."

While LLMs are a phenomenal technological achievement, I am already becoming somewhat jaded, rather than being increasingly bullish. They are very useful as coding agents and excellent as a human-friendly, more efficient Google search, but confusing to the point of being useless in many areas (as of now, of course).

rjra

Not even a great replacement for search. I have minimal trust in answers/summaries it gives.

One example (paraphrased): “Find me daycare for a Y year old in X area of SF and the key attributes/pros/cons of each”. Wonderfully presented options highlighting different teaching styles. But…neglected to mention, of the top two, one was a Gan (Jewish focused) and one was Mandarin immersion.

lucid-dev

I have had the exact same problem several times working with large context and complex tasks.

I keep switching back to GPT-5.0 (or sometimes 5.1) whenever I want it to actually get something done. Using the 5.4 model always means "great analysis to the point of talking itself out of actually doing anything". So I switch back and forth. But boy, it sure is annoying!

And then when 5.4 DOES do something it always takes the smallest tiny bite out of it.

Given the significant increase in cost from 5.0, I've been overall unimpressed by 5.4, except like I mentioned, it does GREAT with larger analysis/reasoning.

arjie

Get the actual prompt and have Claude Code / Codex try it out via curl / python requests. Replaying the full prompt will yield debugging information. You have to set a few parameters to make sure you get the full gpt-5 performance, e.g. if your reasoning budget is too low, you get gpt-4 grade performance.

IMHO you should just write your own harness so you have full visibility into it, but if you're just using vanilla OpenClaw you have the source code as well, so it should be straightforward.
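A minimal sketch of that probing step, assuming the standard Chat Completions request shape; the model id, file names, and effort value here are placeholders for whatever your harness actually sends:

    import os
    import requests

    # Replay the exact prompt your agent sent, so you can inspect the raw
    # response outside the harness. Reasoning effort matters: set it too
    # low and you get much weaker output.
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-5.4",  # placeholder: whichever model the harness targets
            "reasoning_effort": "high",
            "messages": [
                {"role": "system", "content": open("system_prompt.txt").read()},
                {"role": "user", "content": open("captured_prompt.txt").read()},
            ],
        },
        timeout=600,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])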

pantulis

> IMHO you should just write your own harness

Can you point to some online resources for achieving this? I'm not sure where I'd begin.

jswny

Codex is fully open source…

vlovich123

Conceivably you could have a public-facing dashboard of the rollout status to reduce confusion or even make it visible directly in the UI that the model is there but not yet available to you. The fanciest would be to include an ETA but that's presumably difficult since it's hard to guess in case the rollout has issues.

moralestapia

Why would you be confused?

The UI tells you which model you're using at any given time.

ModernMech

I don't see what model I'm using on the Codex web interface, where is that listed?

Grp1

Congrats on the release! Is Images 2.0 rolling out inside ChatGPT as well, or is some of the functionality still going to be API/Playground-only for a while?

minimaxir

Images 2.0 is already in ChatGPT.

johndough

When I generate an image with ChatGPT, is there a way for me to tell which image generation model has been used?

Grp1

Great, thanks for clarifying :)

dandiep

Will GPT 5.5 fine tuning be released any time soon?

rev4n

Looks good, but I’m a little hesitant to try it in Codex as a Plus user since I’m not sure how much it would eat into the usage cap.

qsort

Great stuff! Congrats on the release!

dhruv3006

Yep - it's taking some time.

lr1970

When I ask GPT-5.5 about its knowledge cutoff date it says "August 2025". Really?

simonw

This doesn't have API access yet, but OpenAI seem to approve of the Codex API backdoor used by OpenClaw these days... https://twitter.com/steipete/status/2046775849769148838 and https://twitter.com/romainhuet/status/2038699202834841962

And that backdoor API has GPT-5.5.

So here's a pelican: https://simonwillison.net/2026/Apr/23/gpt-5-5/#and-some-peli...

I used this new plugin for LLM: https://github.com/simonw/llm-openai-via-codex

UPDATE: I got a much better pelican by setting the reasoning effort to xhigh: https://gist.github.com/simonw/a6168e4165a258e4d664aeae8e602...

stingraycharles

OpenAI hired the guy behind OpenClaw, so it makes sense that they’re more lenient towards its usage.

thierrydamiba

They basically bought OpenClaw right?

takethebus

I believe the technical term is "acquihire"

DrProtic

That pelican you posted yesterday from a local model looks nicer than this one.

Edit: this one has crossed legs lol

BeetleB

It really needs to pee.

GistNoesis

Isn't it awful? After 5.5 versions it still can't draw a basic bike frame. How is the front wheel supposed to turn sideways?

jetrink

I feel like if I attempted this, the bike frame would look fine and everything else would be completely unrecognizable. After all, a basic bike frame is just straight lines arranged in a fairly simple shape. It's really surprising that models find it so difficult, but they can make a pelican with panache.

nlawalker

> a fairly simple shape

Bike frames are very hard to draw unless you've already consciously internalized the basic shape, see https://www.booooooom.com/2016/05/09/bicycles-built-based-on...

necubi

Humans are also famously bad at drawing bicycles from memory https://www.gianlucagimini.it/portfolio-item/velocipedia/

billywhizz

why do you find it surprising? these models have no actual understanding of anything, never mind the physical properties and capabilities of a bicycle.

fragmede

My question is, as a human, how well would you or I do under the same conditions? Which is to say, I could do a much better job in inkscape with Google images to back me up, but if I was blindly shitting vectors into an XML file that I can't render to see the results of, I'm not even going to get the triangles for the frame to line up, so this pelican is very impressive!

simonw

Yeah, the bike frame is the thing I always look at first - it's still reasonably rare for a model to draw that correctly, although Qwen 3.6 and Gemini Pro 3.1 do that well now.

loa_in_

The distinction is that it's not drawing. It's generating an SVG document containing descriptors of the shapes.

postalcoder

I made pelicans at different thinking efforts:

https://hcker.news/pelican-low.svg

https://hcker.news/pelican-medium.svg

https://hcker.news/pelican-high.svg

https://hcker.news/pelican-xhigh.svg

Someone needs to make a pelican arena, I have no idea if these are considered good or not.

deflator

They are not good, and they seem to get worse as you increase effort. Weird.

postalcoder

Yeah. I've always loosely correlated pelican quality with big model smell but I'm not picking that up here. I thought this was supposed to be spud? Weird indeed.

throw310822

No but I can sense the movement, I think it's already reached the level of intelligence that draws it towards futurism or cubism /s

seanw444

Can someone explain how we arrived at the pelican test? Was there some actual theory behind why it's difficult to produce? Or did someone just think it up, discover it was consistently difficult, and now we just all know it's a good test?

simonw

I set it up as a joke, to make fun of all of the other benchmarks. To my surprise it ended up being a genuinely good measure of model quality on other tasks (up to a certain point at least), though I've never seen a convincing argument as to why.

I gave a talk about it last year: https://simonwillison.net/2025/Jun/6/six-months-in-llms/

It should not be treated as a serious benchmark.

redox99

It all began with a Microsoft researcher showing a unicorn drawn in TikZ by GPT-4. It was an example of something so outrageous that there was no way it existed in the training data. And that's back when models were not multimodal.

Nowadays I think it's pretty silly, because there's surely SVG drawing training data and some effort from the researchers put into this task. It's not a showcase of emergent properties.

CamperBob2

It's interesting to see some semblance of spatial reasoning emerge from systems based on textual tokens. Could be seen as a potential proxy for other desirable traits.

It's meta-interesting that few if any models actually seem to be training on it. Same with other stereotypical challenges like the car-wash question, which is still sometimes failed by high-end models.

If I ran an AI lab, I'd take it as a personal affront if my model emitted a malformed pelican or advised walking to a car wash. Heads would roll.

bravoetch

I tried getting it to generate OpenSCAD models, which seems much harder. Haven't had much joy with the results yet.

a96

G-code and ASCII art are also text formats, but seem to be beyond most if not all models.

(There are some that generate 3d models specifically, more in the image generation family than chatbot family.)

lexarflash8g

None of them have the pelican's feet placed properly on the pedals -- or the pedals are misrepresented. Cool art style but not physically accurate.

a96

I'm not sure a physically accurate pelican would reach two pedals on a common bicycle. Maybe a model can solve that problem one day.

matt3210

The pelican doesn’t really matter anymore since models are tuned for it knowing people will ask.

simonw

They suck at tuning for it.

droidjj

It's... like no pelican I've ever seen before.

hagbard_c

You've never seen pelicans riding bicycles either so maybe these are just representations of those specific subgroups of pelicans which are capable of riding them. Normal pelicans would not feel the need to ride bikes since they can fly, these special pelicans mostly seem to lack the equipment needed to do that which might be part of the reason they evolved to ride two-wheeled pedal-propelled vehicles.

XCSme

Is this direct API usage allowed by their terms? I remember Anthropic really not liking such usage.

zerop

So pelican must have become the mandatory test case to pass for all model providers before launch.

jfkimmes

Everyone talked about the marketing stunt that was Anthropic's gated Mythos model with an 83% result on CyberGym. OpenAI just dropped GPT 5.5, which scores 82% and is open for anybody to use.

I recommend anybody in offensive/defensive cybersecurity to experiment with this. This is the real data point we needed - without the hype!

Never thought I'd say this but OpenAI is the 'open' option again.

tpurves

The real 'hype' was the oh-snap realization that OpenAI would absolutely release a model competitive with Mythos within weeks of Anthropic announcing theirs, and that Sam would not gate access to it. So the panic was that the cyber world had a projected two weeks to harden against all these new zero-days before Sam would inevitably declare open season for blackhats to discover and exploit a deluge of them.

greenavocado

The GPT-5.5 API endpoint started to block me after I escalated with ever more aggressive use of rizin, radare2, and ghidra to confirm correct memory management and cleanup in error code branches when working with a buggy proprietary 3rd party SDK. After I explained myself more clearly it let me carry on. Knock on wood.

So there is a safety model watching your behavior for these kinds of things.

fc417fc802

So you're saying that blackhats will be required to do a small bit of roleplay if they want the model to assist them? I'm not against public access, BTW, just pointing out how absurd that PR-oriented "safety" feature is. It's a "we did something, don't blame us" sort of measure.

It isn't even my intent to naysay their approach. They probably have to do something along those lines to avoid being convicted in the court of public opinion. I just think it's an absurd reality.

snthpy

Does that mean that we're likely to see Mythos released soon?

nkohari

The prevailing theory is that Anthropic doesn't have sufficient compute capacity to support Mythos at scale, which is the real reason it hasn't been released.

undefined

[deleted]

Salgat

It's almost embarrassing how susceptible we are to these marketing campaigns.

y-curious

Dunno about you, but I didn’t fall for it. I’m reminded of how they were “afraid” to release GPT-2 because of the “power” it had. Hype train!

esjeon

Lack of information, lack of knowledge.

The “AI” “technology” is an easy excuse to create an artificial information gap in the era of the interconnected.

concinds

> Never thought I'd say this but OpenAI is the 'open' option again.

Compared to Anthropic, they always have been. Anthropic has never released any open models. Never willingly released Claude Code's source (unlike Codex). Never released their tokenizer.

jwr

What's "open" about any of these companies?

I'm tired of words being misused. We have hoverboards that do not hover, self-driving cars that do not, actually, self-drive, starships that will never fly to the stars, and "open"… I can't even describe what it's used for, except everybody wants to call themselves "open".

jgilias

And the vast majority of current and past countries with the word “democratic” in their name weren’t actually democratic.

undefined

[deleted]

unsupp0rted

Doesn't OpenAI get mad if you ask cybersecurity questions and force you to upload a government ID, otherwise they'll silently route you to a less capable model?

> Developers and security professionals doing cybersecurity-related work or similar activity that could be mistaken by automated detection systems may have requests rerouted to GPT-5.2 as a fallback.

https://developers.openai.com/codex/concepts/cyber-safety

https://chatgpt.com/cyber

merlindru

Anthropic has started to ask for IDs for use of their products, period.

I don't like that trend. I get why they're doing it, but I don't like it

brigandish

Are you in the UK? I've not had this happen to me (I'm not in the UK) so I'm wondering if the Online Safety Act has affected this, as it has with other products.

Mario9382

I don't like this trend, but I get why they require it. The alternative seems to be to just ban cybersecurity-related questions.

deaux

They flat-out gate any API access to the main models behind Persona ID verification. Entirely.

mafriese

From my experience OpenAI has become very sensitive when it comes to using their tools for security research. I am using MCP servers for tools like IDA Pro or Ghidra (for malware analysis) and recently received a warning:

> OpenAI's terms and policies restrict the use of our services in a number of areas. We have identified activity in your OpenAI account that is not permitted under our policies for: - Cyber Abuse

I raised an appeal, which got denied. To be fair, I think it's close to impossible for someone looking at the chat history to differentiate between legitimate research and malicious intent. I have also applied for the security research program that OpenAI is offering but didn't get any reply to that.

tnkuehne

Isn't it like cyber questions are being routed to dumber models at OpenAI?

jfkimmes

Do you have a source for that?

Neither the release post nor the model card seems to indicate anything like this?

nikanj

Anything that even vaguely smells like security research, reverse engineering or similar "dual-use" application hits the guardrails hard and fast. "Hey codex, here is our codebase, help us find exploitable issues" gives a "I can't help you with that, but I'm happy to give you a vague lecture on memory safety or craft a valgrind test harness"

willsmith72

Being "more" open than something totally closed doesn't make you open. The name is still bs

attentive

it's still somewhat gated behind "trusted access" for cyber, see https://chatgpt.com/cyber

mannanj

Seems like OpenAI only acts open for theatrical and attention-getting purposes though, i.e. when backed into a corner and it's for their image.

Someone1234

I'd like to draw people's attention to this section of this page:

https://developers.openai.com/codex/pricing?codex-usage-limi...

Note the Local Messages between 5.3, 5.4, and 5.5. And, yes, I did read the linked article and know they're claiming that 5.5's new efficiency should make it break even with 5.4, but the point stands: tighter limits/higher prices.

puppystench

For API usage, GPT-5.5 is 2x the price of GPT-5.4, ~4x the price of GPT-5.1, and ~10x the price of Kimi-2.6.

Unfortunately I think the lesson they took from Anthropic is that devs get really reliant on, and even addicted to, coding agents, and they'll happily pay any amount for even small benefits.

kingstnap

I feel like devs generally spend someone else's money on tokens: either their employer's, or OpenAI's when they use a Codex subscription.

If I put on my schizo hat: something they might be doing is increasing the losses on their monthly Codex subscriptions to show that the API has a higher margin than before (the Codex account massively in the negative, but the API account now showing huge margins).

I've never seen an OpenAI investor pitch deck, but my guess is that API margins are one of the big things they try to sell people on, since Sama talks about it on Twitter.

I would be interested in hearing the insider stuff. Like if this model is genuinely like twice as expensive to serve or something.

vineyardmike

You can't build a business on per-seat subscriptions when you advertise making workers obsolete. API pricing with sustainable margins is the only way forward if you genuinely think you're going to cause (or accelerate) a reduction in clients' headcount.

Additionally, the value generated by the best models with high-thinking and lots of context window is way higher than the cheap and tiny models, so you need to provide a "gateway drug" that lets people experience the best you offer.

ewrs

Yeah, and the increase in operating expenses is going to make managers start asking hard questions - this is good. It means budgets will eventually be put in place, and this will force OAI and Anthropic to innovate harder. Then we will see how things pan out. Ultimately a firm is not going to pay rent to these firms if the benefits don't exceed the costs.

mitjam

The difference between subscription and API pricing makes it hard to create competitive solutions at the app level.

w10-1

Price increases now aim to demonstrate market power for eventual IPO.

If they can show that people will pay a lot for somewhat better performance, it raises the value of any performance lead they can maintain.

If they demonstrate that and high switching costs, their franchise is worth scary amounts of money.

JohnLocke4

Sometimes I wonder if innovation in the AI space has stalled and recent progress is just a product of increased compute. Competence is increasing exponentially[1], though I guess that doesn't rule it out completely. I would postulate that a radical architecture shift is needed for the singularity, though.

[1]https://arxiv.org/html/2503.14499v1 *Source is from March 2025 so make of it what you will.

nomel

> that devs get really reliant on, and even addicted to, coding agents

An alternative perspective is, devs highly value coding agents, and are willing to pay more because they're so useful. In other words, the market value of this limited resource is being adjusted to be closer to reality.

scotty79

We are constantly getting smaller and faster models that are close in performance to the state of the art from a few months prior. And that's due to architectural inventions. I'm sure it takes some time for these inventions to proliferate to the frontier, and some might not be applicable there, but we are definitely going faster than compute increases alone would allow.

It will get faster, but there are no singularities in the real world. Except possibly black holes, but we can't even be sure of that.

pxc

Maybe that's true. But I think part of the issue is that for a lot of things developers want to do with them now (certainly for most of the things I want to do with them) they're either barely good enough, or not consistently good enough. And the value difference across that quality threshold is immense, even if the quality difference itself isn't.

pzo

On top of that, I noticed just now, after updating the macOS desktop Codex app, that the speed was again set to 'fast' by default ('about 1.5x faster with increased plan usage'). They really want you to burn more tokens.

nubg

wow wait so it wasn't just me leaving it on from an old session?

sounds like criminal fraud to me tbh

0xbadcafebee

A fool and his money are soon parted

oh_no

what's the source on that?

puppystench

In the announcement webpage:

>For API developers, gpt-5.5 will soon be available in the Responses and Chat Completions APIs at $5 per 1M input tokens and $30 per 1M output tokens, with a 1M context window.
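Back-of-the-envelope at those rates (token counts here are illustrative, not from the announcement):

    # Worked example of the quoted gpt-5.5 rates; token counts are made up.
    input_cost = 200_000 / 1_000_000 * 5    # $1.00 at $5 per 1M input tokens
    output_cost = 20_000 / 1_000_000 * 30   # $0.60 at $30 per 1M output tokens
    print(f"${input_cost + output_cost:.2f}")  # $1.60 for the session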

Mars008

> devs get really reliant on, and even addicted to, coding agents

That's more about managers who hope AI will gradually replace stubborn and lazy devs. That will shift the balance toward business ideas, connections, and investments, and away from the technical side.

Anyway, before the singularity there's going to be a huge change.

keyle

I did one review job that sent off three subagents and I blew the second half of my daily limit in 10 mins 13 seconds. Fun times.

raincole

It's such a vague table for pricing information. 30-150 messages...? What?

minimaxir

The more interesting part of the announcement than "it's better at benchmarks":

> To better utilize GPUs, Codex analyzed weeks’ worth of production traffic patterns and wrote custom heuristic algorithms to optimally partition and balance work. The effort had an outsized impact, increasing token generation speeds by over 20%.

The ability of agentic LLMs to improve computational efficiency/speed is a highly impactful domain that I wish were tested by more than benchmarks. From my experience Opus is still much better than GPT/Codex in this respect, but given that OpenAI is getting material gains out of this type of performancemaxxing, and has an increasing incentive to keep doing so given cost/capacity issues, I wonder if OpenAI will continue optimizing for it.
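The announcement doesn't say what heuristics Codex wrote; a classic member of that family is greedy longest-processing-time balancing, sketched here purely to illustrate the technique class, not OpenAI's actual algorithm:

    import heapq

    # Greedy LPT: assign each job to the currently least-loaded worker,
    # biggest jobs first. A stand-in for the kind of partition/balance
    # heuristic the announcement describes.
    def balance(jobs: list[float], n_workers: int) -> list[list[float]]:
        heap = [(0.0, i) for i in range(n_workers)]  # (load, worker id)
        heapq.heapify(heap)
        assignments = [[] for _ in range(n_workers)]
        for job in sorted(jobs, reverse=True):
            load, w = heapq.heappop(heap)
            assignments[w].append(job)
            heapq.heappush(heap, (load + job, w))
        return assignments

    print(balance([9, 7, 5, 3, 2, 2], 2))  # both workers end at load 14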

xiphias2

There's already KernelBench which tests CUDA kernel optimizations.

On the other hand, all companies know that optimizing their own infrastructure / models is the critical path to "winning" against the competition, so you can bet they are serious about it.

dash2

Is that true? I would have guessed research breakthroughs might be a more plausible way to win.

xtracto

So, I'm working on some high-performance data processing in Rust. I had hit some performance walls and needed to improve on the scale of 100x or more.

I remembered the famous FizzBuzz Intel code-golf optimizations, and gave them to Gemini Pro along with my code and instructions to "suggest optimizations similar to those, maybe not so low level, but clever", and its suggestions were veerry cool.

LLMs do not stop amazing me every day.

amrrs

Honestly, the problem with these claims is how empirical they are - how can someone reproduce this? I love when labs go beyond traditional benchies like MMLU and friends, but these kinds of statements don't help much either, unless it's a proper controlled study!

minimaxir

In a sense it's better than a benchmark: it's a practical, real-world, highly quantifiable improvement, assuming there are no quality regressions and it passes all test cases. I have been experimenting with this workflow across a variety of computational domains and have achieved consistent results with both Opus and GPT. My coworkers have independently used Opus for optimization suggestions on services in prod, and they've led to much better performance (3x in some cases).

A more empirical test would be good for everyone (i.e. on equal hardware, give each agent the goal to implement an algorithm and make it as fast as possible, then quantify relative speed improvements that pass all test cases).
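A sketch of what that harness could look like, with stand-in functions for the human baseline and the agent's rewrite:

    import time

    def baseline(xs):
        return sorted(xs)

    def optimized(xs):
        return sorted(xs)  # imagine the agent's rewrite here

    def bench(fn, cases, reps=5):
        start = time.perf_counter()
        for _ in range(reps):
            for case in cases:
                fn(case)
        return (time.perf_counter() - start) / reps

    cases = [list(range(n, 0, -1)) for n in (10, 1_000, 100_000)]

    # Correctness gate first: every test case must match the baseline.
    assert all(optimized(c) == baseline(c) for c in cases)

    print(f"speedup: {bench(baseline, cases) / bench(optimized, cases):.2f}x")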

squibonpig

Yeah but like what if they're sorta embellishing it or just lying? That's the issue with not being reproducible.

theptip

The tension here is that what customers need to reproduce is this result on their own problem. To measure this you need extensive evals on private data.

OpenAI simply won’t share the data you need to reproduce this in the way you’d hope for an academic paper.

It’s an engineering result, not a scientific one.

jstanley

Oh, come on, if they do well on benchmarks people question how applicable they are in reality. If they do well in reality people complain that it's not a reproducible benchmark...

girvo

That's easily explained by those being two different people with two different opinions?

astlouis44

A playable 3D dungeon arena prototype built with Codex and GPT models. Codex handled the game architecture, TypeScript/Three.js implementation, combat systems, enemy encounters, HUD feedback, and GPT‑generated environment textures. Character models, character textures, and animations were created with third-party asset-generation tools

The game that this prompt generated looks pretty decent visually. A big part of this is likely due to the fact that the meshes were created using a separate tool (probably Meshy, Tripo.ai, or similar) and not generated by 5.5 itself.

It really seems like we could be at the dawn of a new era similar to Flash, where any gamer or hobbyist can generate game concepts quickly and instantly publish them to the web. Three.js in particular is really picking up as the primary way to design games with AI, in spite of the fact that it's not even a game engine, just a web rendering library.

dataviz1000

LLMs cannot do spatial reasoning. I haven't tried with GPT; however, Claude cannot solve a Rubik's Cube no matter how much I try with prompt engineering. I got Opus 4.6 to get ~70% of the puzzle solved, but it got stuck. At $20 a run it's prohibitively expensive.

The point is that if we can prompt an LLM to reason about three dimensions, we can likely apply that to math problems it isn't currently able to solve.

I should release my Rubiks Cube MCP server with the challenge to see if someone can write a prompt to solve a Rubik's Cube.
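For anyone tempted by the challenge, here's a sketch of the kind of state such a server could expose; faces are 3x3 sticker grids and only the U move is shown (the other five follow the same rotate-plus-cycle pattern):

    import copy

    # Solved cube: each face is a 3x3 grid of its own letter.
    SOLVED = {f: [[f] * 3 for _ in range(3)] for f in "UDFBLR"}

    def rotate_cw(face):
        """Rotate a 3x3 face clockwise."""
        return [list(row) for row in zip(*face[::-1])]

    def move_U(state):
        """Clockwise quarter-turn of the top layer."""
        s = copy.deepcopy(state)
        s["U"] = rotate_cw(s["U"])
        # Viewed from above, the side faces' top rows cycle F -> L -> B -> R -> F.
        s["L"][0], s["B"][0], s["R"][0], s["F"][0] = (
            [list(state[f][0]) for f in "FLBR"])
        return s

    print(move_U(SOLVED)["F"][0])  # ['R', 'R', 'R']: the right face's row moved to the front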

embedding-shape

> I should release my Rubiks Cube MCP server with the challenge to see if someone can write a prompt to solve a Rubik's Cube.

Do it, I'm game! You nerdsniped me immediately and my brain went "That sounds easy, I'm sure I could do that in a night" so I'm surely not alone in being almost triggered by what you wrote. I bet I could even do it with a local model!

variodot

I’ve had a similar experience building a geometry/woodworking-flavored web app with Three.js and SVG rendering. It’s been kind of wild how quickly the SOTA models let me approach a new space in spatial development and rendering 3d (or SA optimization approaches, for that matter). That said, there are still easy "3d app" mistakes it makes like z-axis flipping or misreading coordinate conventions. But these models make similar mistakes with CSS and page awareness. Both require good verification loops to be effective.

dataviz1000

I think there is a pattern: it has a hard time with the temporal and the spatial.

Temporal: I had a research project where the LLM had no concept of preventing data from the future from leaking in. I eventually had to create a wall clock and an agent that would step through every line of code and write out each line's logic and why no data from beyond the wall clock could leak.

Spatial: I created a canvas for rendering a thinking model's attention and feedforward layers for data visualization animations. It was having a hard time working with it until I pointed Opus 4.7 to some ancient JavaScript code [0] about projecting 3D to 2D, found after searching GitHub repositories. It worked perfectly, with pan and zoom, in one shot after that.

Before that, no matter how hard I tried, I couldn't get it to stack all the layers correctly. It must have remembered all the parts for projecting 3D to 2D, because on its own it could not figure out how to position the layers.

There is a ton of information burnt into the weights during training, but it cannot reason about it. When it does work well with the spatial and the temporal, it is more sleight of hand than an ability to generalize.

People say: why not just do reinforcement learning? But that can't generalize in the same way an LLM can. I'm thinking about doing the Rubik's Cube because if people can solve that, it might open up solutions for working on temporal and spatial problems.

[0] https://jakesgordon.com/writing/javascript-racer-v1-straight...

versteegen

Interesting (would like to hear more), but solving a Rubiks cube would appear to be a poor way to measure spatial understanding or reasoning. Ordinary human spatial intuition lets you think about how to move a tile to a certain location, but not really how to make consistent progress towards a solution; what's needed is knowledge of solution techniques. I'd say what you're measuring is 'perception' rather than reasoning.

dataviz1000

> how to make consistent progress towards a solution

A 7-year-old child can learn six sequences of a few moves and solve a Rubik's Cube over a weekend. It is a solved algorithm, something an LLM should be very, very good at. What it can't do is reason about spatial relationships.

William_BB

> what's needed is knowledge of solution techniques

That's definitely in the training data

Melatonic

What about a model designed for robotics and vision? It seems like an LLM trained on text would inherently not be great for this.

DeepMind's other models, however, might do better?

snet0

How are you handing the cube state to the model?

dataviz1000

Does this answer the question?

Opus 4.6 got the cross and started to get several pieces onto the correct faces. It couldn't reason past that. You can see the prompts and all the turn messages.

https://gist.github.com/adam-s/b343a6077dd2f647020ccacea4140...

edit: I can't reply to the message below. The point isn't whether we can solve a Rubik's Cube with a Python script and tool calls. The point is whether we can get an LLM to reason about moving things in three dimensions. The prompt is a puzzle in the way that a Rubik's Cube is a puzzle. A 7-year-old child can learn 6 moves and figure out how to solve a Rubik's Cube in a weekend; the LLM can't solve it. However, given the correct prompt, can an LLM solve it? The prompt is the puzzle. That is why it is fun and interesting. Plus, it is a spatial problem, so if we solve that we solve a massive class of problems, including huge swathes of mathematics the LLMs can't touch yet.

undefined

[deleted]

holoduke

I bet I can even do it with the smallest gemma 4 model using a prompt of max 500 characters.

Torkel

*yet

0x62

FWIW I've been experimenting with Three.js and AI for the last ~3 years, and noticed a significant improvement in 5.4 - the biggest single generation leap for Three.js specifically. It was most evident in shaders (GLSL), but also apparent in structuring of Three.js scenes across multiple pages/components.

It still struggles to create shaders from scratch, but is now pretty adequate at editing existing shaders.

In 5.2 and below, GPT really struggled with "one canvas, multiple page" experiences, where a single background canvas is kept rendered over routes. In 5.4, it still takes a bit of hand-holding and frequent refactor/optimisation prompts, but is a lot more capable.

Excited to test 5.5 and see how it is in practice.

CSMastermind

> It still struggles to create shaders from scratch

Oh just like a real developer

accrual

Much respect for shader developers, it's a different way of thinking/programming

Pym

One struggle I'm having (with Claude) is that most of what it knows about Three.js is outdated. I haven't used GPT in a while, is the grass greener?

Have you tried any skills like cloudai-x/threejs-skills that help with that? Or built your own?

import

Using Claude for the same context and it's doing really well with GLSL, since like last September.

vunderba

I’ve had a lot of success using LLMs to help with my Three.js based games and projects. Many of my weird clock visualizations relied heavily on it.

It might not be a game engine, but it’s the de facto standard for doing WebGL 3D. And since it’s been around forever, there’s a massive amount of training data available for it.

Before LLMs were a thing, I relied more on Babylon.js, since it’s a bit higher level and gives you more batteries included for game development.

kingstnap

The meshes look interesting, but the gameplay is very basic. The tank one seems more sophisticated with the flying ships and whatnot.

What's strange is that this Pietro Schirano dude seems to write incredibly cargo cult prompts.

  Game created by Pietro Schirano, CEO of MagicPath

  Prompt: Create a 3D game using three.js. It should be a UFO shooter where I control a tank and shoot down UFOs flying overhead.
  - Think step by step, take a deep breath. Repeat the question back before answering.
  - Imagine you're writing an instruction message for a junior developer who's going to go build this. Can you write something extremely clear and specific for them, including which files they should look at for the change and which ones need to be fixed?
  -Then write all the code. Make the game low-poly but beautiful.
  - Remember, you are an agent: please keep going until the user's query is completely resolved before ending your turn and yielding back to the user. Decompose the user's query into all required sub-requests and confirm that each one is completed. Do not stop after completing only part of the request. Only terminate your turn when you are sure the problem is solved. You must be prepared to answer multiple queries and only finish the call once the user has confirmed they're done.
  - You must plan extensively in accordance with the workflow steps before making subsequent function calls, and reflect extensively on the outcomes of each function call, ensuring the user's query and related sub-requests are completely resolved.

torginus

It's weird how people pep talk the AI - if my Jira tickets looked like this, I would throw a fit.

I guess these people think they have special prompt engineering skills, and that doing it like this is better than giving the AI a dry list of requirements (fwiw, they might even be right).

mattgreenrocks

It’s not surprising to me that the same crowd that cheers for the demise of software engineering skills invented its own notion of AI prompting skills.

Too bad they can veer sharply into cringe territory pretty fast: “as an accomplished Senior Principal Engineer at a FAANG with 22 years of experience, create a todo list app.” It’s like interactive fanfiction.

eloisant

Yes, this is cargo cult.

This reminds me of so-called "optimization" hacks that people keep applying years after their languages have improved to make them unnecessary or even harmful.

Maybe at one point it helped to write prompts in this weird way, but with all the progress in both the models and the harnesses, if it's not obsolete yet it soon will be. Just cruft that consumes tokens and fills the context window for nothing.

irthomasthomas

> Think Step By Step

What is this, 2023?

I feel like this was generated by a model tapping into 2023 notions of prompt engineering.

skirano

Pietro here, I just published a video of it: https://x.com/skirano/status/2047403025094905964?s=20

ahoka

"take a deep breath"

OMFG

jameshart

Claude would check to see if it had any breathing skills; if it didn't find any, it would start installing npm modules for breathing.

bredren

The prompt did not specify advanced gameplay.

I do not see instructions that assist in task decomposition and agent ~"motivation" to stay aligned over long periods as cargo culting.

See up thread for anecdotes [1].

> Decompose the user's query into all required sub-requests and confirm that each one is completed. Do not stop after completing only part of the request. Only terminate your turn when you are sure the problem is solved.

I see this as a portrayal of the strength of 5.5, since it suggests the model can be assigned this clearly important role in ~one-shot requests like this.

I've been using a cli-ai-first task tool I wrote to process complex "parent" or "umbrella" tasks into decomposed subtasks and then execute on them.

This has allowed my workflows to float above the ups and downs of model performance.

That said, having the AI do the planning for a big request like this internally is not good outside a demo.

Because you want the AI's planning to be part of the historical context, available for forensics when there are stalls, unwound details, or other unexpected issues at any point along the way.

[1] https://news.ycombinator.com/item?id=47879819

peder

> It really seems like we could be at the dawn of a new era similiar to flash

We've been there for a while... creativity has been the primary bottleneck.

mindhunter

A friend is building Jamboree[1] (prev name "Spielwerk") for iOS. An app to build and share games. They're all web based so they're easy to share.

[1] https://apps.apple.com/uz/app/jamboree-game-maker/id67473110...

undefined

[deleted]

undefined

[deleted]

6thbit

                          Mythos     5.5
    SWE-bench Pro          77.8%*   58.6%
    Terminal-bench-2.0     82.0%    82.7%*
    GPQA Diamond           94.6%*   93.6%
    H. Last Exam           56.8%*   41.4%
    H. Last Exam (tools)   64.7%*   52.2%    
    BrowseComp             86.9%    84.4%  (90.1% Pro)*
    OSWorld-Verified       79.6%*   78.7%

Still far from Mythos on SWE-bench but quite comparable otherwise. Source for mythos values: https://www.anthropic.com/glasswing

aliljet

Mythos is only real when it's actually available. If you're using Opus 4.7 right now, you know how incredibly nerfed Opus's autonomy is in service of perceived safety. I'm not so confident this will be as great as Anthropic wants us to believe.

XCSme

They mentioned on their release page that the Claude team noticed memorization of the SWE-bench test, so the test is actually in the training data.

Here: https://www.anthropic.com/news/claude-opus-4-7#:~:text=memor...

sigmoid10

Any static benchmark older than 12-18 months is basically worthless, because the content will have spread all over the internet and have found its way into the latest model's training set.

William_BB

Good luck arguing with SWE benchmark purists

kaonashi-tyc-01

I did some study on Verified, not Pro, but Mythos's number there raises a lot of questions on my end.

If you look at the official SWE-bench submissions: https://github.com/SWE-bench/experiments/tree/main/evaluatio..., filter to all models after Sonnet 4, and aggregate ALL models' submissions across the 500 problems, what I found is that the aggregated resolution rate is 93% (sharp).

Mythos gets 93.7%, meaning it solves problems that no other model could ever solve. I took a look at those problems and became even more suspicious: for the remaining 7% of problems, it is almost impossible to resolve the issues without looking at the testing patch ahead of time, because the solution deviates so drastically from the problem statement that it almost feels like it is solving a different problem.

Not that I am saying Mythos is cheating, but it might be so capable at remembering all states of said repos that it is able to reverse-engineer the TRUE problem statement by diffing within its own internal memory. I think it could be a unique phenomenon of evaluation awareness. Otherwise I genuinely couldn't think of how it could be this precise in deciphering such unspecific problem statements.

yfontana

OpenAI wrote a couple months ago that they do not consider SWE Bench Verified a meaningful benchmark anymore (and they were the ones who published it in the first place): https://openai.com/index/why-we-no-longer-evaluate-swe-bench...

kaonashi-tyc-01

Yep, I read this blog. What confuses me is that Anthropic doesn't seem to be bothered by this study and keeps publishing Verified results.

That is what got me curious in the first place. The fact that Mythos scored so high, IMO, exposes some issues with this model: it is able to solve seemingly impossible-to-solve problems.

Without alleging cheating, which I don't think Anthropic is doing, it would have to be doing some fortune telling/future reading to score that high at all.

alansaber

A single benchmark is meaningless, you always get quirky results on some benchmarks.

silvertaza

Still a huge hallucination rate, unfortunately: 86%. For comparison, Opus sits at 36%.

Source: https://artificialanalysis.ai/models?omniscience=omniscience...

dubcanada

Grok is 17%? And that's the lowest; most models are like 80%+?

While hallucination is probably closer to 100% depending on the question. This benchmark makes no sense.

Jensson

> While hallucination is probably closer to 100% depending on the question.

But the benchmark didn't ask those questions, and it seems Grok is very good at saying it doesn't know the answer otherwise.

elAhmo

No one serious uses grok.

ajdegol

@grok is this true?

RALaBarge

YMMV, but Grok 4.1 Fast can usually find, via static analysis, a few things that other models don't seem to catch with the same prompt.

d0gsg0w00f

Why not? Honest question.

MagicMoonlight

It makes sense. Grok is taught to answer the question, regardless of how explicit or extreme it is. These other models are taught to suppress any wrongthink. That's going to make it hard to answer things correctly. If you've been told to answer something incorrectly because it's wrong, then you'll have to make up an answer.

simianwords

There's something off with this because Haiku should not be that good.

camgunz

Hallucination benchmarks accept "I don't know", which Haiku did at least a little. Here are other benchmarks corroborating: https://suprmind.ai/hub/ai-hallucination-rates-and-benchmark...

rattray

I've been very curious about that too. I wonder if it's actually much better at admitting when it doesn't know something, because it thinks it's a "dumber model". But I haven't played with this at all myself.

jwpapi

The hallucination benchmark is hallucinating

dakolli

This indicates they want this behavior. They know the person asking the question probably doesn't understand the problem entirely (or why would they be asking?), so they'd prefer a confident response, regardless of outcomes, because the point is to sell the technology's competency (and the perception thereof), not its capabilities, to a bunch of people who have no clue what they're talking about.

LLMs will ruin your product. Have fun trusting a billionaire's thinking machine they swear is capable of replacing your employees if you just pay them 75% of your labor budget.

tedsanders

We don't want hallucinations either, I promise you.

A few biased defenses:

- I'll note that this eval doesn't have web search enabled, but we train our models to use web search in ChatGPT, Codex, and our API. I'd be curious to see hallucination rates with web search on.

- This eval only measures binary attempted vs did not attempt, but doesn't really reward any sort of continuous hedging like "I think it's X, but to be honest I'm not sure."

- On the flip side, GPT-5.5 has the highest accuracy score.

- With any rate over 1% (whether 30% or 70%), you should be verifying anything important anyway.

- On our internal eval made from de-identified ChatGPT prompts that previously elicited hallucinations, we've actually been improving substantially from 5.2 to 5.4 to 5.5. So as always, progress depends on how you measure it.

- Models that ask more clarifying questions will do better on this eval, even if they are just as likely to hallucinate after the clarifying question.

Still, Anthropic has done a great job here and I hope we catch up to them on this eval in the future.

calf

On a ChatGPT 5.3 Plus subscription, I find that long informal chats tend to reveal unsatisfactory answers and biases; at this point, after 10 rounds of replies, I end up having to correct it so much that it comes full circle and starts to agree with my initial arguments. I don't see how this behavior is acceptable or safe for real work. Like, are programmers and engineers using LLMs completely differently than I'm doing? Because the underlying technology is fundamentally the same.

William_BB

Totally agreed, this has been and will continue to be a problem for all existing models.

> Like, are programmers and engineers using LLMs completely differently than I'm doing?

No, but the complexity of the problem matters. Lots of engineers doing basic CRUD and prototyping overestimate the capabilities of LLMs.

mudkipdev

This is 3x the price of GPT-5.1, released just 6 months ago. Is no one else alarmed by the trend? What happens when the cheaper models are deprecated/removed over time?

Night_Thastus

This is entirely expected. The low prices of using LLMs early on was totally and completely unsustainable. The companies providing such services were (and still are) burning money by the truckload.

The hope is to get a big userbase who eventually become dependent on it for their workflow, then crank up the price until it finally becomes profitable.

The price for all models by all companies will continue to go up, and quickly.

oezi

I recently looked at this a bit and came away with the impression that, at least on API pricing, the models should be very profitable, considering primarily the electricity cost.

Subscriptions and free plans are the thing that can easily burn money.

Night_Thastus

The physical buildouts and massive R&D spending are the big part.

viktorcode

> This is entirely expected. The low prices of using LLMs early on was totally and completely unsustainable.

Do you think this is true for DeepSeek as well?

Night_Thastus

Depends on the level of interest the state takes in it. But I would wager yes, it's unsustainable currently.

subhobroto

> The price for all models by all companies will continue to go up, and quickly.

This might entirely be true, but I'm hoping that's because the frontier models are actually more expensive to run as well.

Said another way, I would hope the price of GPT-5.5 falls significantly in a year, when GPT-5.8 is out.

Someone else on this post commented:

> For API usage, GPT-5.5 is 2x the price of GPT-5.4, ~4x the price of GPT-5.1, and ~10x the price of Kimi-2.6.

Having used Kimi-2.6, it can go on for hours spewing nonsense. I am personally happy to pay 10x the price of something that doesn't help me for something else that does, in even half the time.

energy123

Look at cost per intelligence or cost per task instead of cost per token.

yokoprime

How do I reliably measure 1 unit of intelligence?

wellthisisgreat

In pelicans, obviously

ulimn

Isn't the outcome / solution for a given task non-deterministic? So can we reliably measure that?

foota

Yes, sort of. Generally you can measure the pass rate on a benchmark given a fixed compute budget. A sufficiently smart model can hit a high pass rate with fewer tokens/compute. Check out the cost efficiency on https://artificialanalysis.ai/ (saw this posted here the other day, pretty neat charts!)

genericresponse

Statistically. Do many trials and measure how often it succeeds/fails.
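E.g., a pass-rate estimate with a simple 95% confidence interval; `run_task` is a stand-in for one model attempt:

    import math
    import random

    def run_task() -> bool:
        return random.random() < 0.7  # stand-in: pretend the model passes 70% of runs

    n = 200
    p = sum(run_task() for _ in range(n)) / n
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)  # normal approximation
    print(f"pass rate: {p:.2f} +/- {half_width:.2f}")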

torginus

This is the only correct take. The only metric that matters is cost per desired outcome.

dns_snek

Repetition and statistics, if you have $1000++ you didn't need anyway.

throwuxiytayq

It's much easier to measure a language model's intelligence than a human's because you can take as many samples as you want without affecting its knowledge. And we do measure human intelligence.

Schlagbohrer

As others have mentioned you're ignoring the long tail of open-weights models which can be self hosted. As long as that quasi-open-source competition keeps up the pace, it will put a cap on how expensive the frontier models can get before people have to switch to self-hosting.

That's a big if, though. I wish Meta were still releasing top of the line, expensively produced open-weights models. Or if Anthropic, Google, or X would release an open mini version.

Wowfunhappy

Well, Google does release mini open versions of their models. https://deepmind.google/models/gemma/gemma-4/

deaux

And they're incredibly good for their size.

jeffybefffy519

Surely it's just the same model, just allowed to do more work...??

dannyw

It's far more meaningful to look at the actual cost to successfully complete something. The token efficiency of GPT-5.5 is real, and it's just far better for work as well.

operatingthetan

We know they cost much more than this for OpenAI. Assume prices will continue to climb until they are making money.

horiap

How do we know that? There is a large gap between API pricing for SOTA models and similarly sized OSS models hosted by 3rd party providers.

Sure, they’re distilled and should be cheaper to run but at the same time, these hosting providers do turn a margin on these given it’s their core business, unless they do it out of the kindness of their heart.

So it’s hard for me to imagine these providers are losing money on API pricing.

beering

source? There have also been a bunch of people here saying the opposite

dandaka

SOTA models get distilled to open source weights in ~6 months. So paying premium for bleeding edge performance sounds like a fair compensation for enormous capex.

typs

GPT-4 cost 6x on input tokens and 2x on output tokens when it was released, compared to GPT-5.5.

vthallam

This model is great at long-horizon tasks, and Codex now has heartbeats, so it can keep checking on things. Give it your hardest problem that would take hours and has verifiable constraints, and you will see how good this is :)

*I work at OAI.

spaceman_2020

Is there any task that actually doesn't require human intervention in between, even if it's just to set up stuff?

Like, I will get Opus to make me an app, but it will stop in between because I need to set up the DB and plug in the API keys, and Opus really can't do that on its own yet.

stingraycharles

> Is there any task that actually doesn't require human intervention in between, even if it's just to set up stuff?

The goal is none. The current situation: everything that matters requires human intervention.

I think the end situation will be that LLMs will be able to perform decently well in a highly controlled and predictable environment.

leodavi

> in a highly controlled and predictable environment

Why this constraint? A common sentiment I see online (sorry, to group you in) is "[tool] will be capable, actually, but only in a context that trivializes its usefulness."

I think modern post-training like RLVR + inference-time output token scaling can _probably_ scale so the agents can solve any computable task, even when placed in noisy or misconfigured environments. It won't be economical for a long while, but it already seems largely capable of that today.

dandaka

Could be a great feature, can't wait to test! Tired of other models (looking at you, Opus) constantly getting stuck mid-task lately.

winrid

Interesting, I just had Opus convert a 35k LOC Java game to C++ overnight (a root agent that orchestrated and delegated to sub-agents), and I woke up and it's done and works.

What plan are you on? I'm starting to wonder if they're dynamically adjusting reasoning based on plan or something.

gck1

I'm on Max 5x and noticed this too. I don't use built-in subagents but rather a full Claude session that orchestrates other full Claude sessions. Worker agents that receive tasks now stop midway and ask for permission to continue. My "heartbeat" is basically a "status. One line" message sent to the orchestrator.

Opus 4.6 worker agents never asked for permission to continue, and when a heartbeat was sent to the orchestrator, it just knew what to do (checked on subagents etc). Now it just says that it waits for me to confirm something.
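
(For anyone curious, a heartbeat like that can be a tiny loop that types the prompt into the orchestrator's pane; the tmux target and interval below are made up:)

    import subprocess, time

    ORCHESTRATOR_PANE = "agents:0.0"  # hypothetical tmux target of the orchestrator

    def heartbeat(pane: str, message: str = "status. One line") -> None:
        # Type the message into the pane and press Enter to submit it.
        subprocess.run(["tmux", "send-keys", "-t", pane, message, "Enter"], check=True)

    while True:
        heartbeat(ORCHESTRATOR_PANE)
        time.sleep(15 * 60)  # nudge the orchestrator every 15 minutes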

adamandsteve

This has to be bait.

frotaur

I've been using the /ralph-loop plugin for Claude Code; it works well to keep the model hammering at the task.

thereeldeel

Will the Codex App support a new context window, rather than compaction, for "unrelated" sub-tasks during long horizon tasks?

dannyw

It's genuinely so great at long horizon tasks! In our internal evals at Canva, GPT-5.5 solved many long-horizon frontier challenges that no AI model we've tested had solved before :) Congrats on the launch!

brcmthrowaway

Can we not do growth hacking here?

RALaBarge

We totally agree.

That's what I've been heads down, HUNGRY, working on, looking for investors and founding engineers pst: https://heymanniceidea.com (disclaimer: I am not associated with heymanniceidea.com)

smallerize

HN is owned by a startup accelerator and venture capital firm. They do growth hacking on the front page. And you probably know that since your throwaway account is several years old.

bkyan

Sorry, what is "heartbeats", exactly?

gurjeet

> Today we launched heartbeats in Codex: automations that maintain context inside a single thread over time.

https://x.com/pashmerepat/status/2044836560147984461

bkyan

Thanks!

applfanboysbgon

If there's a bingo card for model releases, "our [superlative] and [superlative] model yet" is surely the free space.

tom1337

Do "our [superlative] and [superlative] [product] yet" and you have pretty much every product launch

SequoiaHope

I love when Apple says they’re releasing their best iPhone yet so I know the new model is better than the old ones.

sigmoid10

That's at least genuine to some degree. Like, ok, good to know it's not officially a step back... But stuff like "smallest notch ever in an iPhone" outright misleads consumers when there are other brands out there that easily beat them.

xnx

"our newest and most expensive model yet"

wiseowise

"Best iPhone ever"

ertgbnm

can't wait for "our worst and dumbest model yet"

Nition

Apple should have used that one for the 2016 MacBook.

aliljet

I've found myself so deeply embedded in the Claude Max subscription that I'm worried about potentially making a switch. How are people making sure they stay nimble enough not to get trapped in any one company's ecosystem? For what it's worth, Opus 4.7 has not been a step up, and it's come with enormously higher usage of the subscription Anthropic offers, making the entire offering doubly worse.

gck1

Start building your own lightweight "harness" that does the things you need. Ignore all the functionality of clients like CC or Codex and just implement whatever you start missing in your harness.

You can replace pretty much everything - the skills system, subagents, etc. - with just tmux and a simple CLI tool that the official clients can call.

Oh and definitely disable any form of "memory" system.

Essentially, treat all tooling that wraps the models as dumb gateways to inference. Then a provider switch is basically a one-line config change.
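
Roughly what that one-line switch can look like, assuming every provider sits behind an OpenAI-style chat endpoint (the URLs and model names below are placeholders, not real APIs):

    import json, urllib.request

    # One entry per provider; switching providers = editing DEFAULT, nothing else.
    PROVIDERS = {
        "openai": {"url": "https://example.com/openai/v1/chat/completions",
                   "model": "gpt-5.5"},
        "anthropic": {"url": "https://example.com/anthropic/v1/chat/completions",
                      "model": "opus-4.7"},
    }
    DEFAULT = "openai"  # the one-line config change

    def complete(prompt: str, provider: str = DEFAULT, api_key: str = "sk-...") -> str:
        cfg = PROVIDERS[provider]
        body = json.dumps({
            "model": cfg["model"],
            "messages": [{"role": "user", "content": prompt}],
        }).encode()
        req = urllib.request.Request(cfg["url"], data=body, headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        })
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["choices"][0]["message"]["content"]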

nunez

lol this is literally the same advice us ancient devops nerds were giving people back when ci/cd was new

write scripts that work anywhere and have your ci/cd pipeline be a "dumb" executor of those scripts. unless you want to be stuck on jenkins forever.

what's old is new again!

TacticalCoder

> You can replace pretty much everything - the skills system, subagents, etc. - with just tmux and a simple CLI tool that the official clients can call.

I'm very interested in this. Can you go into a bit more detail?

ATM for example I'm running Claude Code CLI in a VM on a server and I use SSH to access it. I don't depend on anything specific to Anthropic. But it's still a bit of a pain to "switch" to, say, Codex.

How would that simple CLI tool work? And would CC / Codex call it?

caspar

Not the OP but here is a good example: https://mariozechner.at/posts/2025-11-30-pi-coding-agent/

Initially I read it just because it was interesting, but it has ended up being the harness I have stuck with - pi is well designed, nicely extensible, and supports many model provider APIs. Though sadly Gemini and Claude's subscriptions can't really be used with it anymore thanks to openclaw.

RALaBarge

Check out github.com/ralabarge/beigebox -- an OSS AI harness. It started as a way to save all of my data, but it has agentic features and an MCP server; point it at any endpoint (or use any front end with it as well - it's transparent middleware).

So far what I am finding is that you just get the basics working and then use the tool and inference to improve the tool.

gck1

I wish I had lower standards for sharing absolute AI slop; then I could just drop a link to my implementation. But since I don't, let me just describe it. I essentially had Claude build the initial version in a single session, which I've been extending as I notice gaps in my process.

First, you need an entrypoint that kicks things off. You never run `claude` or `codex`; you always start by running `mycli-entrypoint`, which:

1. Creates a tmux session
2. Creates a pane
3. Spawns claude/codex/gemini - whichever your default configured backend is
4. Automatically delivers a prompt (essentially a 'system message') to that process via tmux paste, telling it what `mycli` is, how to use it, what commands are available, and that it should never use the built-in tools for which this CLI provides alternatives.
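
A minimal sketch of that entrypoint (the session name, backend, and prompt file here are illustrative, not my actual code):

    #!/usr/bin/env python3
    # mycli-entrypoint: start the default backend inside tmux and prime it.
    import subprocess, time

    SESSION = "agents"                               # hypothetical session name
    BACKEND = "claude"                               # default: claude/codex/gemini
    BOOTSTRAP = open("bootstrap-prompt.md").read()   # explains mycli to the agent

    # Steps 1-2: a detached session comes with its first pane.
    subprocess.run(["tmux", "new-session", "-d", "-s", SESSION], check=True)
    # Step 3: launch the backend CLI inside that pane.
    subprocess.run(["tmux", "send-keys", "-t", SESSION, BACKEND, "Enter"], check=True)
    time.sleep(3)  # give the CLI a moment to start up
    # Step 4: paste the bootstrap prompt and submit it.
    subprocess.run(["tmux", "load-buffer", "-"], input=BOOTSTRAP.encode(), check=True)
    subprocess.run(["tmux", "paste-buffer", "-t", SESSION], check=True)
    subprocess.run(["tmux", "send-keys", "-t", SESSION, "Enter"], check=True)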

After that, you build commands in `mycli` that CC/Codex are prompted to call when appropriate.

For example, if you want a "subagent", you have a `mycli spawn` command that takes a role (just a preconfigured markdown file living in the same project), a backend (claude/codex/...) and a model. Then whenever CC wants to spawn a subagent, it will call that command instead, which will create a pane, spawn a process, and return the agent ID to CC. The agent ID is auto-generated by your CLI, and the tmux pane is renamed to it so you can easily match them later.

Then you also need a way for these agents to talk to each other. So your CLI also has a `send` command that takes an agent ID and a message and delivers it to the appropriate pane using an automatically tracked mapping of pane_id<>agent_id.
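
`spawn` and `send` then boil down to a couple of tmux calls plus a small agent_id -> pane_id registry; sketched roughly (flags and file names are illustrative, not my exact implementation):

    import json, subprocess, uuid
    from pathlib import Path

    REGISTRY = Path("agents.json")  # persisted mapping of agent_id -> pane_id

    def spawn(role: str, backend: str = "claude", session: str = "agents") -> str:
        agent_id = f"{role}-{uuid.uuid4().hex[:6]}"
        # Create a pane running the backend and capture its tmux pane id.
        pane_id = subprocess.run(
            ["tmux", "split-window", "-t", session, "-P", "-F", "#{pane_id}", backend],
            check=True, capture_output=True, text=True).stdout.strip()
        # Rename the pane to the agent id so panes and agents match up visually.
        subprocess.run(["tmux", "select-pane", "-t", pane_id, "-T", agent_id], check=True)
        mapping = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else {}
        mapping[agent_id] = pane_id
        REGISTRY.write_text(json.dumps(mapping))
        return agent_id

    def send(agent_id: str, message: str) -> None:
        # Deliver a message to the agent's pane via the tracked mapping.
        pane_id = json.loads(REGISTRY.read_text())[agent_id]
        subprocess.run(["tmux", "send-keys", "-t", pane_id, message, "Enter"], check=True)

The role's markdown can then just be the first `send` after spawning.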

Claude and Codex automatically store everything that happens in the process as jsonl files in their config dirs. Your CLI should have adapters for each backend that parse them into a common format.

At this point, your possibilities are pretty much endless. You can have a sidecar process per agent that, say, detects when the model is reaching its context window limit (it's in the jsonl) and automatically sends it a message asking it to wrap up and report to a supervisor agent that will spawn a replacement.
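
A sidecar along those lines, assuming the adapters normalize each jsonl event to carry a token count (the field name and the 200k limit are made up):

    import json, subprocess, time
    from pathlib import Path

    CONTEXT_LIMIT = 200_000  # assumed window size for the backend
    THRESHOLD = 0.9          # nudge the agent at 90% full

    def tokens_used(transcript: Path) -> int:
        # One jsonl event per line; assumes a normalized "tokens" field.
        return sum(json.loads(line).get("tokens", 0)
                   for line in transcript.read_text().splitlines() if line.strip())

    def send(agent_id: str, message: str) -> None:
        # Same helper as in the previous sketch.
        pane_id = json.loads(Path("agents.json").read_text())[agent_id]
        subprocess.run(["tmux", "send-keys", "-t", pane_id, message, "Enter"], check=True)

    def watch(agent_id: str, transcript: Path) -> None:
        while True:
            if tokens_used(transcript) > THRESHOLD * CONTEXT_LIMIT:
                send(agent_id, "You are close to your context limit: wrap up "
                               "and report to the supervisor agent.")
                return  # the supervisor spawns a replacement from here
            time.sleep(30)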

I also don't use "skills" because skills are a loaded term that each of the harnesses interprets and loads/uses differently. So I call them "crafts", which are, again, just markdown files in my project with an ID and a supporting command `read-craft <craft-id>`. The list of available "crafts" is delivered in the same initialization message that each agent gets. If I like any third-party skill, I just copy it to my "crafts" dir manually.

My implementation is absolute junk, just Python + markdown files, and I have never looked at the actual code, but it works and I can adapt it to my process very easily without being dependent on any third-party tool.

type4

I have a directory of skills that I symlink to Codex/Claude/pi. I make scripts that correspond with them to do any heavy lifting, and I avoid platform-specific features like Claude's hooks. I also symlink/share a user AGENTS.md/CLAUDE.md.

MCPs aren't as smooth, but I just set them up in each environment.
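
Something like this works for the symlinks, though the per-client directories are guesses; check where your tools actually read skills from:

    import os
    from pathlib import Path

    SKILLS = Path.home() / "skills"   # single source of truth
    TARGETS = [                       # hypothetical per-client locations
        Path.home() / ".claude" / "skills",
        Path.home() / ".codex" / "skills",
    ]

    for target in TARGETS:
        target.parent.mkdir(parents=True, exist_ok=True)
        if not target.exists():
            os.symlink(SKILLS, target)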

threecheese

Anecdotally, I get the same wall time with my Max x5 ($100) and my ChatGPT Teams ($30) subscriptions.

chis

It's surprisingly simple to switch. I mean, both products offer basically identical coding CLI experiences. Personally I've been paying for Claude Max ($100) and ChatGPT ($20), and then just using ChatGPT to fill in the gaps. Specifically I like it for code review and when Claude is down.

dannyw

Try GPT-5.5 as your daily driver for a bit. It felt a lot smarter and more reliable, and I was much more productive with it.

zaptrem

I bumped from $20 -> $100 today but the Codex CLI lacking code rewind and "you can change files but ask me every time" mode from Claude Code is quite annoying. Sometimes I want to code, not vibe code lol.

hx8

I use Open Code as my harness. It's open source, bring your own API Key or OAuth token or self-hosted model. I've jumped from Opus 4.6 to Opus 4.7 to GPT 5.5 in the last 7 days. No big deal, intelligence is just a commodity in 2026.

The actual harness is great, very hackable, very extendable.

NoveltyEngine

Does Anthropic not actively ban people using oauth tokens in non-claude-code harnesses?

zackify

I use pi.dev.

I get openai team plan at work.

Claude enterprise too.

I have openrouter for myself.

I use Minimax 2.7, Kimi 2.6, GPT 5.5, and Opus 4.7. I can toggle between them in an open-source interface; that's how I avoid getting trapped.

Minimax is so cheap, and for personal stuff it works fine. So I'm always toggling between the new releases.

peheje

what about just personal stuff in a syncing interface, what do you use for that?

zackify

What's a syncing interface?

beering

What is the switching cost besides launching a different program? Don’t you just need to type what you want into the box?

cube2222

Small tip, at least for now you can switch back to Opus 4.6, both in the ui and in Claude Code.
