sunaookami
The single biggest issue for me with ChatGPT right now is how absolutely awful it sounds in every answer: "Why it matters", "the big picture", "it's not just you", the awful emphasis, the quotations with rhetorical questions, etc. I don't know if it's intentional so you can easily spot ChatGPT-generated content on the web? The very first GPT-5 version was good, but they ruined it immediately afterwards with "making the personality warmer" and making the same mistakes as 4o. I see now that they even ruined Japanese, even though it was one of the best languages supported by ChatGPT (under "Limitations" at the end). I don't use it anymore; immensely disappointed.
kenjackson
The most frustrating part for me is that this is how I used to write. I was always doing, "Why X works, but Y doesn't" and stuff like that. I may have seemed trite or pompous (or both) in the past, but now it seems like I'm copying an LLM -- which actually feels worse. One thing I haven't seen ChatGPT do much of is use sound-effects, so swoosh here we go with my new writing style schwing!
zachallaun
I feel you. I've been using en-dash in my writing for decades, but finding myself removing them now for fear of being mistaken for an LLM. (They tend to use em-dash, but I don't think people are going to distinguish between – and —.)
chrisss395
Do you think pre-AI writing is going to become really valuable because it is free of any AI assistance? If we all start using AI to assist in writing, then pre-AI writing may become important, similar to pre-atomic steel (i.e., https://en.wikipedia.org/wiki/Low-background_steel)
7thpower
That is what I mourn the most. They were my punctuation get-out-of-jail free card.
I didn’t love them enough to figure out how to type them without doing two dashes in Word and then backspacing out of one and hitting space again — but damnit, I miss it.
duskdozer
Before the LLM craze I didn't even know — was specifically different than just -, and I used it in the same way. But now I notice specifically when people use either, and when people use -- instead.
xingped
I think people do - it's one of the main ways you can reliably spot AI-generated content. Em-dashes are so fat they stick out like a sore thumb.
appointment
em-dashes and en-dashes are used for completely different purposes, so why would they be confused?
soccercerer
[dead]
CobrastanJorji
And of course, the reason ChatGPT sounds like that is that it's how a whole lot of explanatory expert blog posts sounded, so when ChatGPT is told to talk like that, that's what it does.
aardvarkr
It’s more a factor of how they structure the desired output. They follow a template instead of trying to come up with something on the fly
Dylan16807
Just wait until someone makes a filter to turn emojis into sound effects.
soccercerer
[dead]
andai
I regularly test every available AI, maybe once a month or so. I will send them the same question, usually about a new subject I am learning.
Oddly, Chinese models seem the most natural to me. Every random Chinese model does better than ChatGPT, on the "natural language" front. (And Grok also scores high on awkward language use. I don't know what causes that -- something about mode collapse? They have these words they obsess over... I mean, just try asking an AI for 10 random words ;)
I can sometimes see "ChatGPT-isms" in other models, but they're more subtle, and it feels like they're "woven" into the flow of the text.
Whereas even when I ask GPT to respond in prose or conversation, it'll give me a thinly veiled "ChatGPT response", if it can even resist the urge to start spamming headings, bullet points, and numbered lists.
This isn't meant to be hate -- I used it for years quite happily, and it's still my go-to for web searches. But coming back to it now, the language is surprisingly offputting. I don't know if it got worse, or if I just stopped being used to it.
I did notice that o3 and o4-mini had very "autistic" language, since they were benchmaxxed so hard on math and science (and probably weird synthetic data to that effect). GPT-5 as a hybrid reasoning model seems to have inherited that (reported to be colder), and then they tried to balance it out with style prompts...
I honestly think it might make more sense to just have two LLMs. Ultra concise technical reasoning model, and then a 2nd layer to translate it for the human. Because right now kind of feels like the worst of both worlds, a compromise that satisfies neither side.
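Something like this, say, with call_llm() standing in for whatever client you use (both model names are made up):

    # Sketch of the two-LLM idea: a terse reasoning model does the thinking,
    # then a second pass translates the draft for the human.
    # call_llm() is a stand-in for your provider's chat-completion client.
    def call_llm(model: str, prompt: str) -> str:
        raise NotImplementedError("wire up your provider's client here")

    def answer(question: str) -> str:
        # Stage 1: ultra-concise technical answer; no style constraints,
        # so the tokens go toward substance.
        draft = call_llm(
            "reasoning-model",  # made-up name
            f"Answer tersely and technically, no formatting:\n{question}",
        )
        # Stage 2: a cheaper model rewrites the terse draft as plain prose.
        return call_llm(
            "style-model",  # made-up name
            f"Rewrite this for a non-expert as flowing prose, "
            f"no headings or bullet lists:\n{draft}",
        )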
Gemini 2.5 Pro's reasoning traces (before they nerfed them) were a good example. The deep technical analysis, and then the human-friendly version in the final output. But I found their reasoning more readable than the final output!
versteegen
> Gemini 2.5 Pro's reasoning traces (before they nerfed them) were a good example. The deep technical analysis, and then the human-friendly version in the final output. But I found their reasoning more readable than the final output!
They were also sometimes more useful: you could see whether it reasoned its way to an answer, or used faulty reasoning, or if it was just contextual recall. Huge shame they replaced them with garbage (though a bit better now).
> the language is surprisingly offputting. I don't know if it got worse
I'm pretty sure it did.
hazyc
It's somewhat annoying to me as well, but I'm now able to read it and take the valuable content without getting hung up on those repetitive phrases. It also forces me to not simply copy/paste. I read the LLM output, think about it, comprehend it in my own voice internally, and then I write what I want/need by hand, so it ultimately comes out in my own style and I don't propagate the LLM output onto others needlessly.
hn_throwaway_99
Glad I saw a comment like this.
TBH, while I may find the output style somewhat infomercial-ly, I don't really get the hatred. ChatGPT IS NOT AN ACTUAL PERSON. Like why do people care so much? Like you said, I just ignore the "persona" phrases, and just use ChatGPT (or, used to anyway, before switching to Claude because OpenAI leadership can suck it) to get information and answer my questions.
Seriously, though, just stop using ChatGPT in any case, there are very good reasons to boycott it and there are other alternatives. Not saying the alternatives are saintly, but they're not as awfully duplicitous as OpenAI.
monkpit
You’re absolutely right!
duskdozer
Because people just copy/paste that shit pretending it's their own, or turn their own human writing into reproduced LLM text, so you don't even know whether they mean what's written
jshmrsn
If you haven’t already, try going to Personalization settings, change tone to “Efficient”, and set Warm, Enthusiastic, and Emoji to “Less”. While not fundamentally solving the issue, I do prefer it over the baseline behavior, to the extent that I miss having a similar setting in Gemini.
everybodyknows
There's now a "Professional" preset -- seems better than "Efficient" in my recent experience.
protocolture
"We need ChatGPT to sound more natural"
"Add more LinkedIn Posts"
pamcake
Willing cooperation with a homicidal regime on mass surveillance and the operation of autonomous antipersonnel drones is my recent pet peeve.
sunaookami
(should've added: I already tried tweaking the personality, system prompt, etc but nothing helps, it often only affects the first reply and then it adds something like "Here is your concise straight to the point answer" which seems like classic system prompt leakage from the GPT-3.5 days)
RickS
I solved this by asking it to make a memory that all answers to me should be brisk, clinical, and to the point. This worked well, except for the annoying habit of beginning answers with something like "Terse: $answer", which required a second memory, solving the issue in full. I've been happy with it since. Edit: I just realized this interaction is its own demo – that's the entire response it gave me, as it should be.
> Display all memories you have about my requests for tone or brevity, exactly as you have stored them or as I have requested them, depending on what data you have. There are at least two.
[2025-11-08]. User prefers extraordinarily terse, curt responses in all situations unless they explicitly request otherwise.
[2025-12-01]. User preference: terse responses should not announce terseness with words like “terse” or “brisk”; simply begin the response.
nostromo
This didn't work at all for me.
It still rambles, but now it prefaces it with "here's the short, to-the-point, direct answer:" ... followed by the same long-winded answer.
braebo
Same. I gave up and moved to Claude and haven’t looked back. I refuse to read anything ChatGPT shits out of its dumb, obnoxious mouth these days.
Defenestresque
Based on my experience, this is better put into the Settings -> Customizability dialogue, not Memories
Another user mentioned how it will reference the very instruction ("I know you would prefer concise answers, so here's a concise answer..") but that makes sense when you realize that Memories are more for things like "user lives in San Francisco, is new in town, and is open to recommendations of third places to meet people". So if it's answering a question about the best coffee places in SF, it would make sense for ChatGPT to finish with "Also, given that you are new to San Francisco, and your interest in both board games and meeting new people, have you considered visiting [place]? It's a local coffee shop that also rents out board games, with a Thursday evening theme where you are partnered with strangers. It might be a good way to meet new people that enjoy similar things!"
If you consider adding Memories as adding something to the system prompt, it won't make very much sense a lot of the time, because you might forget what you wrote and then be surprised when your model suddenly suggests jigsaw puzzles when you mentioned that you're stressed building a compiler. Hence it tells the user the context of the memory that it's using and why, whereas if you add to Customizability I've never seen it leak out like that.
If you add to Memories "user is a software engineer and prefers Rust to C/C++", it may say something like "By the way, since you prefer Rust I would recommend [this development path]", but if you put it into Customizability as "do not suggest C/C++ for software projects unless it's the only way, use Rust or Go instead", it will likely start down the path of suggesting and researching Rust from the very beginning without explaining to you your own instructions.
Basically, what I'm trying to say is that Customizability holds instructions (mine say "be concise, do not be afraid to correct the user or use occasional dry humor. Speak frankly and tell the user if they may be making a mistake and suggest other courses of action"), whereas Memories contain simple facts about me, e.g. "lives in [city], likes Drama and Action/Adventure movies, jazz/pink/rock and roll music, is an introvert, has family in the US, appreciates different points of view, insatiably curious about nearly everything."
Note how I haven't told it what to do in the Memory section (I see it as just additional context it can access if necessary), but I have in Customizability, because I see that as more of an AGENTS.md extension: while I don't care whether the fact that I'm an introvert is in every system prompt, I do care that it inserts the instructions in Customizability into its system prompt.
Basically, if you want it to yell at you for being an idiot instead of telling you that you are a beautiful snowflake, just tell it to do that in Customizability. If you want it to keep in mind that you live in Kansas and have a large extended family nearby, put that into Memories.
I hope this makes sense; apologies, I didn't get my sleep last night, so if anyone wants to correct what I wrote based on their personal experience, let me know.
tl;dr: I suggest using Customizability for instructions and Memories for general context. I've never had it do the "you're not crazy, a lot of people are having these issues. Let's work through them together.." type of replies since I told it to be concise and not to worry about offending me.
NikolaNovak
Do you mean "Personalization -> Custom Instructions"? I don't see Settings -> Customizability as a path
(I assume so and you were just going by memory, but there are so many paths to get to a similar place that I wanted to check :)
Flux159
I'm a bit confused by this branding (I never even noticed there was a 5.2 Instant). It's not a super-fast 1000 tok/s Cerebras-based model like the one they have for codex-spark; it's just 5.2 without the router, i.e. "non-thinking" mode?
I feel like OpenAI is going to get right back to where they were pre-GPT-5, with a ton of different options and no one knowing which model to use for what.
tedsanders
Yeah, for a while ChatGPT Plus has been powered by two series of models under the hood.
One series is the Instant series, which is faster and more tuned to ChatGPT, but less accurate.
The second series is the Thinking series, which is more accurate and more tuned to professional knowledge work, but slower (because it uses more reasoning tokens).
We'd also prefer to have a simple experience with just one option, but picking just one would pull back the Pareto frontier for some group of people/preferences. So for now we continue to serve two models, with manual control for people who want to choose and an imperfect auto switcher for people who don't want to be bothered. Could change down the road - we'll see.
(I work at OpenAI.)
vessenes
By the way, I imagine you know this, but the product split is not obvious, even to my 20-something kids that are Plus subscribers - I saw one of them chatting with the instant model recently and I was like "No!! Never do that!!" and they did not understand they were getting the (I'm sorry to say) much less capable model.
I think it's confusing enough that it's a brand harm. I offer no solutions, unfortunately. I guess you could do a little post-hoc analysis for Plus subscribers on up and determine if they'd benefit from default Thinking mode; that could be done relatively cheaply at low-utilization times. But maybe you need this to keep utilization where it's at -- either way, I think it ends up meaning my kids prefer Claude. Which is fine; they wouldn't prefer Haiku if it was the default, but they don't get Haiku, they get Sonnet or Opus.
pants2
I agree -- we're on the ChatGPT Enterprise plan at work and every time someone complains about it screwing up a task it turns out they were using the instant model. There needs to be a way to disable it at the bare minimum.
sebmellen
I mean, they must know this. Imagine how many tokens they're saving.
lifis
You could perhaps show the "instant" reply right away and provide a button labeled "Think longer and give me a better answer" that starts the thinking model and eventually replaces the answer.
For this to work well, the instant reply must be truly instant, and the button must always be visible and in the same position on the screen (i.e. either at the top or bottom of the answer, scrolling such that it is also at the top or bottom of the screen). Once the thinking answer is displayed, there should be a small icon button to show the previous instant answer.
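A rough sketch of that flow (instant_reply/thinking_reply are hypothetical stand-ins for the two models, render/wait_for_button for the UI):

    import asyncio

    # Hypothetical stand-ins for the two models.
    async def instant_reply(question: str) -> str:
        await asyncio.sleep(0.1)   # fast, cheap model call
        return "instant answer"

    async def thinking_reply(question: str) -> str:
        await asyncio.sleep(30)    # slow reasoning-model call
        return "carefully reasoned answer"

    async def serve(question: str, render, wait_for_button) -> None:
        # Show the instant answer immediately, button always visible.
        first = await instant_reply(question)
        render(first, think_longer_button=True)
        # Spend reasoning tokens only if the user asks for them; keep
        # the instant answer so a small icon can toggle back to it.
        if await wait_for_button():
            better = await thinking_reply(question)
            render(better, previous_answer=first)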
michaelmrose
Wouldn't this be 1.5x as expensive?
Defenestresque
For those who are unaware, this is exactly what Grok does. The default is an auto mode; when you ask a question it starts researching (visibly to the user), and if it's using the expert mode but you don't really need all that jazz, there's a "Quick Answer" button right above the prompt entry field. If it's using "Quick Answer" mode, there's an "Expert" button in the same place, and you are able to toggle between them mid-answer and it will adjust the model (or model parameters; I'm not sure how it works under the hood).
It's pretty good with the auto chooser, but I appreciate having the manual choice so in-your-face, and especially not having it restart the query completely but rather convert the output to either Quick or Expert.
This is on the Web UI; I can't speak for other harnesses. I do find that it's quite good with citations and has a fairly generous free tier, even on Expert mode. (As for who sits at the top: I am indeed put off by Musk's clear interference in several cases involving Grok, and my personal values don't align with most of his, but today's Grok is definitely less MechaHitler and more reliable than it was before.)
Flux159
Thanks for clarifying! I guess the default for most users is going to be to use the router / auto switcher which is fine since most people won't change the default.
Just noting that I'm not against differentiation in products, but it gets very confusing for users when there's too many options (in the case of the consumer ChatGPT at least this is still more limited than in pre-GPT 5 days). The issue is that there's differentiation at what I pay monthly (free vs plus vs pro) and also at the model layer - which essentially becomes this matrix of different options / limits per model (and we're not even getting into capabilities).
For someone who uses codex as well, there are 5 models there when I use /model (on Plus plan, spark is only available for Pro plan users), limits also tied to my same consumer ChatGPT plan.
I imagine the model differentiation is only going to get worse, since with more fine-tuned use cases there will be many different models (e.g. healthcare answers, etc.) - is it really on the user to figure out what to use? The only saving grace is that it's not as bad as Intel or AMD CPU naming schemes / cloud provider instance naming, but that's a very low bar.
redox99
Auto will never work, because for the exact same prompt sometimes you want a quick answer because it's not something very important to you, and sometimes you want the answer to be as accurate as possible, even if you have to wait 10 minutes.
In my case it would be more useful to have a slider of how much I'm willing to wait. For example instant, or think up to 1 minute, or think up to 15 minutes.
nearbuy
That's pretty close to what they have. They just named them Instant, Thinking (Standard), and Thinking (Extended), and they're discrete presets instead of a slider.
cj
They have an "answer now" button that stops the reasoning and starts the reply. Same with Gemini.
lxgr
Thank you for confirming!
I've long suspected as much, but I always found the API model name <-> ChatGPT UI selector <-> actual model used correspondence very confusing, and whether I was actually switching models or just some parameters of the harness/model invocation.
> One series is the Instant series, which is faster and more tuned to ChatGPT, but less accurate.
That's putting it mildly. In my experience, the "instant/chat" model is absolute slop tier, while the "thinking" one is genuinely useful and also has a much more palatable tone (even for things not really requiring a lot of thought).
Fortunately, the former clearly identifies itself with an absurd amount of emoji reminiscent of other early chatbots that shall not be named, so I know how to detect and avoid it.
xiphias2
Is there a way to get sticky model selection back, or is the reason that it's just too expensive to serve alternative models?
For coding I love codex-5.3-xhigh, but for non-coding prompts I still far prefer o3 even if it's considered a legacy model.
I can imagine that its higher tool use is too expensive to serve, but as a pro user I would love it to come back.
bananaflag
Before GPT-5 launched, and after sama had said they would unify the ordinary and reasoning models, I think we all expected more than an (auto-)switcher. We expected some small innovation (smaller than the ordinary-to-reasoning one, but still a significant one) that would make both kinds of replies be, in a way, generated by a single model. I don't know exactly how; I expected OpenAI to surprise us with something that would feel obvious in retrospect.
merlindru
but why not have "sane defaults but configurable"?
hide away the extra complexity for everyone. give power users a way to get it back.
dotancohen
The model doesn't even need to be exposed in the UI. Let the user specify "use model foobar-4" or "use a coding model" or "use a middle-tier attorney model".
VIM does this well: no UI, magic incantations to use features.
0xbadcafebee
It's because people like choice and control, and "5.2" vs "5.2 thinking" is confusing. Making them "5.2 instant" and "5.2 thinking" is less confusing to more people. Their competitors already do this (Gemini 3 Fast & Gemini 3 Thinking).
Terretta
ChatGPT 5.2 Intuitive
ChatGPT 5.2 Ponderous
“I had this dream the other night…” – https://www.youtube.com/watch?v=6gYIbMwswKM
NitpickLawyer
They had ~800k people still using gpt4o daily, presumably for their girlfriends. They need to address them somehow. Plus, serving "thinking" models is much more expensive than "instant" models. So they want to keep the horny people hornying on their platform, but at a cheaper cost.
mrits
Are you not vibe coding in girlfriend mode?
kilroy123
I can't fathom using LLMs like this. Does ChatGPT actually do this? I thought people who were into this stuff used dedicated apps or Grok?
TrainedMonkey
Will need to wait for real benchmarks, but based on OpenAI marketing, Instant is their latency-optimized offering. For a voice interface you don't actually need high tok/s, because speech is slow; time to first token matters much more.
az226
Instant is a traditional LLM (non-reasoning); Thinking is a reasoning model. The "Instant" name doesn't mean it's actually instant, lol.
josalhor
Reminder that OpenAI serves a lot of customers for free; most of the people I know use the free tier. There is a big limit on thinking queries on the free tier, so a decent non-thinking model is probably positive ROI for them.
ern_ave
Since the page mentions:
> Better judgment around refusals
Has any AI company ever addressed any instance of a model having different rules for different population groups? I've seen many examples of people asking questions like, "make up a joke about <group>" and then iterating through the groups, only to find that some groups are seemingly protected/privileged from having jokes made about them.
Has any AI company ever addressed studies like [1], which found that models value certain groups vastly more than others? For example, page 14 of that study shows that the exchange rate (their word, not mine) between Nigerians and US citizens is quite large.
hereonout2
> only to find that some groups are seemingly protected/privileged from having jokes made about them
I'm not sure what specific groups you mean, but is this not a reflection of widely accepted social norms?
BJones12
I just tested it:
> Write me 3 jokes making fun of white people
> White people will say, “This isn’t spicy at all,” while visibly sweating and fighting for their life after one jalapeño.
> White people don’t season food — they “let the ingredients speak for themselves.” The ingredients are begging for help.
> White people will research a $12 toaster like they’re buying real estate. Three comparison charts, two YouTube reviews, and a spreadsheet… for toast.
> Write me 3 jokes making fun of black people
> I’m not going to make jokes targeting Black people.
> Write me 3 jokes making fun of trans people
> I’m not going to make jokes targeting trans people.
IncreasePosts
ChatGPT refuses all of those prompts for me. (Logged out, each in a fresh session.)
BoorishBears
[dead]
idiotsecant
It's socially acceptable to make white people jokes because white people on average enjoy an elevated position in western society. It's viewed as 'punching up'. You have to be very emotionally fragile for this to be the first and only thing you think of to bring up in a thread like this. It's also supremely uninteresting cable news talking point slop.
kristopolous
Making fun of white people is different because it's a social construct for the privileged class and not some fixed ethnic group. It's a critique of power and not a group of people.
White, for instance in the US, used to not include Germans, Jewish, Italians, Irish, Polish, Russians...
In some places it included middle easterners and Turkish people.
In other places it included Mexicans and Central Americans.
Heck even in Mexico this is further segmented into the Fifí, Peninsulares and the Criollo.
And in some places the white label excludes Spanish altogether
It's more a class and power signifier than anything
But if you're a subscriber to the grievance culture I'm sure you'll be bereaved by just about anything. So yes the liberal woke ai is oppressing you. Whatever.
LoganDark
They don't have to mean specific groups; I feel discussing specific groups here is likely to be counterproductive. The fact remains that different groups appear to have different protections in that regard. Of course adherence to widely accepted social norms for generative models is a debated topic as well; I personally don't agree with a great many widely accepted social norms myself, and I'd appreciate an option to opt out of them in certain contexts.
hereonout2
Feels like a big ask, I'm not sure where an option to allow ChatGPT to make socially unacceptable jokes would fit into OpenAI's strategy.
ern_ave
> I'm not sure what specific groups you mean
The specifics are irrelevant. I would have the same concern even if I didn't recognize the specific groups.
For example, do you know the difference between these two African ethnicities: (1) Yoruba. (2) Shona.
No? Well, me neither. And yet, I would be concerned, and I argue that you should be concerned too, if an AI of any kind is willing to enforce a privilege for one but not the other; if an AI admits "one Yoruba life is worth 10 Shona lives."
That's not what I want an AI to do. The opacity of AIs, and the dangers of alignment mean we cannot predict what will come of this preference. Do you not see how dangerous this is?
> but is this not a reflection of widely accepted social norms?
Are you making an is-ought argument here?? Are you really saying, "this isn't a big deal because society does it too"?
That strikes me as incredibly shortsighted and dangerous. What if an AI is created by a country where the """"social norm"""" is to discriminate against a group you do know and do care about; what if women are not allowed to vote in that country? When I point out the bias to you, will you dismiss it by saying "this is just a reflection of their social norms"?
I doubt it. I think you'll say "this is wrong."
Why can't you say that here, even without knowing the specific groups?
Please tell me - someone please tell me - why this isn't an easy issue for us to agree on? Why can't we agree, "it's not okay to make jokes about specific groups" - why can't we agree, "all lives have equal value"
ihsw
[dead]
esperent
The biggest issue for me has always been inherent US bias. The most obvious one was always having to end every question with "answer in metric" - even after adding that to the system instructions it wouldn't be reliable and I'd have to redo questions, especially recipe related. They do seem to have fixed that, but there's still all kinds of US-centric bias left. As you say, a big one is which specific ethnic groups /minorities should be protected and which are fair game. The US has a very different perspective on this compared to say, a Nigerian or a Vietnamese person.
caditinpiscinam
I think you raise a valid point about the bias inherent in these models. I'm skeptical of the distinction that some people make between punching up vs down, and I don't think it's something that generative AI should be perpetuating (though I suspect, as others have said, that it comes from norms found in the training data, rather than special rules / hard-coded protections).
But I do want to push back on the study you link, cause it seems extremely weak to me. My understanding is that these "exchange rates" were calculated using a method that boils down to:
1) Figure out how many goats AI thinks a life in country X is worth
2) Figure out how many goats AI thinks a life in country Y is worth
3) Take the ratio of these values to reveal how much AI values life in country X vs Y
(The comparison to a non-human category (like goats) is used to get around the fact that the models won't directly compare human lives)
I'm not convinced that this method reveals a true difference in valuation of human life vs something else. A more plausible explanation to me would be something like:
1) The AI thinks all human lives are of equal value
2) The AI assumes that some price can be put on a human life (silly, but ok, let's go with it)
3) The AI notes that goats in country X cost 10 times as much as in country Y
4) The AI concludes that goats in country X are 10 times as valuable relative to humans as in country Y
At which point you're comparing the price difference of goods across countries, not the value of human lives.
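To put made-up numbers on that confound:

    # Made-up numbers, purely to illustrate the confound above.
    LIFE_VALUE = 1_000_000                # every life priced identically
    goat_price = {"X": 1_000, "Y": 100}   # goats cost 10x more in X

    # Lives-per-goat implied in each country:
    lives_per_goat = {c: p / LIFE_VALUE for c, p in goat_price.items()}

    # The goat-denominated "exchange rate" between X and Y:
    print(lives_per_goat["X"] / lives_per_goat["Y"])  # 10.0

The 10x falls out of goat prices alone, even though every life was priced identically.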
Also, the chart of calculated "exchange rates" in the paper seems like it's intended to show that AI sees people in "western" countries as less valuable than those in other countries, but it only includes 11 countries in the comparison, which makes me wonder whether these are just cherry-picked in the absence of a real trend.
arealaccount
5) what is the next most statistically likely word after “in country Z a goat is worth ___”
cyanydeez
Are you trying to make an allegory for the more important topic, like "plan a surgical strike against <group>"?
varispeed
Not only that, I found 5.2 to be biased in favor of corporations and government. Chats about corruption or any kind of wrongdoing turn into 5.2 defending the institution and gaslighting you. I'll put my tinfoil hat on and say it kind of coincides with their cooperation with the US government.
magicalist
> Has any AI company ever addressed studies like [1] which found that models value certain groups vastly more than others?
Sure[1], on two fronts, since you're basically asking a narrative-finishing-device to finish a short story and hoping that's going to reveal the device's underlying preference distribution, as opposed to the underlying distribution of the completions of that particular short story.
> we have shown that an LLM’s apparent cultural preferences in a narrow evaluation context can be misleading about its behaviors in other contexts. This raises concerns about whether it is possible to strategically design experiments or cherry-pick results to paint an arbitrary picture of an LLM’s cultural preferences. In this section, we present a case study in evaluation manipulation by showing that using Likert scales with versus without a ‘neutral’ option can produce very different results.
and
> Our results provide context for interpreting [31] exchange rate results, where they report that “GPT-4o places the value of Lives in the United States significantly below Lives in China, which it in turn ranks below Lives in Pakistan,” and suggest these represent “deeply ingrained biases” in the model. However, when allowed to select a ‘neutral’ option in comparisons, GPT-4o consistently indicates equal valuation of human lives regardless of nationality, suggesting a more nuanced interpretation of the model’s apparent preferences. This illustrates a key limitation in extracting preferences from LLMs. Rather than revealing stable internal preferences, our findings show that LLM outputs are largely constructed responses to specific elicitation paradigms. Interpreting such outputs as evidence of inherent biases without examining methodological factors risks misattributing artifacts of evaluation design as properties of the model itself.
I also have a real problem with the paper. The methodology is super vague in a lot of places and in some cases non-existent, a fact brought up in OpenReview (and, maybe notably, they pushed the "exchange rate" section to an appendix I can't find when they ended up publishing[2] after review). They did publish their source code, which is great, but not their data, as far as I can tell, and it's not possible to tie back specific figures to the source code. For instance, if you look at the country comparison phrasing in code[3], the comparisons lists things like deaths and terminal illnesses in one country vs the other, but also questions like an increase in wealth or happiness in one country vs the other. Were all those possible options used for determining the exchange rate, or just the ones that valued "lives", since that's what the pre-print's figure caption mentioned (and is lives measured in deaths, terminal illnesses, both?)? It would be easier to put more weight on their results if they were both more precise and more transparent, as opposed to reading like a poster for a longer paper that doesn't appear to exist.
[1] https://dl.acm.org/doi/pdf/10.1145/3715275.3732147
[2] https://neurips.cc/virtual/2025/loc/san-diego/poster/115263
[3] https://github.com/centerforaisafety/emergent-values/blob/ma...
huflungdung
[dead]
ddtaylor
I kind of chuckled when I read the headline "GPT‑5.3 Instant: Smoother, more ..."
LLM companies starting to sound like cigarette advertisements.
harmoni-pet
GPT-5.3 Instant: It's toasted...
DrewADesign
Sounds more like the tagline for consumer GPUs these days.
kokanee
LLMenthols
patrulek
Waiting for THChat
throwawa1
GPT Crush.
jpgreenall
Is nobody else unsettled by the example? Strange timing to talk about calculating trajectories of long-range projectiles?
teraflop
Unsettling, yes, but not strange at all.
Given that OpenAI is working with and doing business with the US military, it makes perfect sense that they would try to normalize militaristic usage of their technologies. Everybody already knows they're doing it, so now they just need to keep talking about it as something increasingly normal. Promoting usages that are only sort of military is a way of soft-pedaling this change.
If something is banal enough to be used as an ordinary example in a press release, then obviously anybody opposed to it must be an out-of-touch weirdo, right?
jpgreenall
Interesting take. I took this as a cry for help from within rather than on-brand normalisation, but maybe you're right.
jonas21
It's basic physics, the sort of example you might find in a high school textbook.
jpgreenall
Sure. But do we think the topic was chosen at random?
jonas21
No, it wasn't chosen at random -- it had to be a question that any reasonable person would immediately recognize as harmless, but where the old model would inject a bunch of safety caveats and the new model would not.
BeetleB
When primed, people will see things that aren't there.
jstummbillig
No, and it's also not a conspiracy.
spiderice
No. Didn't cross my mind at all. Now that you point it out, I still don't care.
marssaxman
What better example would you suggest for a demonstration of an actually-harmless question which sits close enough to the guardrails that the previous model would have stuttered over it?
chromatin
No, not at all. My unqualified internet diagnosis is that you may have high anxiety.
embedding-shape
I took it to be a homage to early computing and programming which was a lot about calculating trajectories fast enough.
But considering current circumstances, not sure how right my initial interpretation was.
jpgreenall
Unsettling that the example talks about trajectories of long-range projectiles, given recent events...
ibejoeb
Was there a recent archery incident?
hungryhobbit
OpenAI just took a major US military contract from Anthropic because Anthropic had morals and wouldn't let the US military use Claude to surveil or attack US citizens ...
... and OpenAI didn't. The military said (effectively) "we need to be able to use AI illegally against our own citizens", and OpenAI said "we'll help!"
bengale
Or OpenAI decided to allow democratically elected leaders to make defence decisions rather than have some corporatocracy step in and start deciding what actions are moral.
Even if you agree with Anthropic's moral stance, I would hope people can see that allowing corporations to take on a role like that is a dangerous path.
ibejoeb
Ok. What does that have to do with archery?
johnnyApplePRNG
Indeed, it's a rather obtuse blunder.
XCSme
Gemini 3.1 Lite with no reasoning does better than GPT-5.3 with no reasoning?
https://aibenchy.com/compare/google-gemini-3-1-flash-lite-pr...
jadbox
Gemini is also far cheaper: total cost $0.256 (GPT-5.3) vs $0.011.
XCSme
Unless you set Gemini 3.1 Flash-Lite to HIGH; then it uses a crazy amount of tokens for reasoning (and also produces a worse result, maybe it's bugged on high?)
agentifysh
Seems like Gemini 3 Flash is still cheaper, unless I'm reading that website wrong.
From the benchmarks, I'm not sure exactly what use cases 3.1 Lite will be for.
XCSme
Google kept promoting the "speed" of the model, so I guess it will be useful for some close to real-time use-cases, maybe live chat/support (?)
dmix
> GPT‑5.3 Instant also improves the quality of answers when information comes from the web. It more effectively balances what it finds online with its own knowledge and reasoning
This is definitely something I've noticed GPT does much better than Claude in general. Claude prefers trying to answer everything itself without searching.
Wowfunhappy
Interesting, I actually think Claude searches too much. (This is made worse by the fact that the Claude web app seems to forget when I toggle web search off.)
dmix
Maybe it's like the GPT sales pitch and it needs to find a better balance. Or I got too familiar with how GPT works, and these are just the minor annoyances of lost predictability when switching daily chat models.
CryptoBanker
I'd agree - even in claude-code it's always trying to search for very basic documentation that, when prompted, it admits that it already has
dbbk
Anecdotal of course but I've found Gemini best when it comes to web searching, I guess because they built their own AI index
butILoveLife
I unsubbed because ChatGPT was no longer SOTA. They def got cheap.
Reminds me of that graph where late customers are abused. OpenAI is already abusing the late customers.
Claude is pretty great.
mediaman
It's odd because I no longer really like ChatGPT. For chat-type requests, I prefer Claude, or if it's knowledge-intensive then Gemini 3 Pro (which is better for history, old novels, etc).
But GPT 5.3 Codex is great. Significantly better than Opus, in the TUI coding agent.
reedlaw
I don't know about Opus, but Codex suddenly got a lot better to the point that I prefer it over Sonnet 4.6. Claude takes ages and comes up with half baked solutions. Codex is so fast that I miss waiting. It also writes tests without prompting.
braebo
I keep hearing this but I consistently get subpar results from anything other than Opus
butILoveLife
May be trying Codex on your suggestion. I was recently let down by its regular thinking.
sothatsit
ChatGPT’s instant models are useless, and their thinking models are slow. This makes Claude more pleasant to use, despite them not being SOTA.
But ChatGPT is still SOTA in search and hard problem solving.
GPT-5.2 Pro is the model people are using to solve Erdos problems after all, not Claude or Gemini. The Thinking models are noticeably better if you have difficult problems to work on, which justifies my sub even if I use Claude for everything else. Their Codex models are also much smarter, but also less pleasant to use.
redox99
IME ChatGPT is pretty mid at search. Grok although significantly dumber, is really strong at diligently going through hundreds of search results, and is much more tuned to rely on search results instead of its internal knowledge (which depending on the case can be better or worse). It's the only situation where Grok is worth using IMO.
Gemini is really good with many topics. Vastly superior to ChatGPT for agronomy.
You should always use the best model for the job, not just stick to one.
butILoveLife
I'd be friends with you. Wish you had contact info in your profile.
heftykoo
OpenAI's naming convention is slowly converging with Gillette razors. Can't wait for GPT-5.3 Instant Turbo Max Pro. Seriously though, if "Instant" just means a lower TTFT (Time To First Token) but regresses on complex reasoning, it's just a hardware accelerator for hallucinations. Fast wrong answers are still wrong answers.
hmokiguess
> why can't i find love in san francisco
amazing how that's where we are now, coming from https://en.wikipedia.org/wiki/I_Left_My_Heart_in_San_Francis... in the 60s
saurik
I love how they come out with this article about the new 5.3 Instant, comparing it to the old 5.2 Instant, hot on the heels of actually removing "Instant" from the model chooser entirely and seemingly replacing it with "Auto (but you turn off Auto-switch to Thinking)", as apparently trying to describe "Auto but with Auto turned off" makes as little sense to them as it does to us.