Tiberium
benjiro29
GLM 5.2 Max = Opus 4.8 Max in thinking behavior. The thinking chain is so similar, and so is the amount of token usage on the output.
If you want reasonable token usage, you need to run GLM 5.2 at High. There is little drop in quality from Max to High (for most tasks), and it cuts token usage by 2 to 2.5x. GLM 5.2 Max is really something you only need for complex tasks.
In essence, GLM 5.2 is Opus 4.8's little brother, at a way, WAY cheaper price.
There has been really no training on Opus models going on, really, none i tell you! /sarcasm
matheusmoreira
> GLM 5.2 Max = Opus 4.8 Max in thinking behavior
This is insane! I can't wait until technology progresses to the point we can run these things on consumer hardware!
harshit119
This is quite evident for personal AI, but for general intelligence, given current scaling laws and how models keep getting better with more parameters, the paths certainly do not converge. Personal AI today is more deprived of context than of token quality. Having an on-system knowledge base paired with Gemma works well to a large extent.
chartpath
Are there any indications that this will be possible? Consumer hardware will continue getting better but I can't see 512GB RAM in a MacBook Pro any time soon. I'm hoping linear attention techniques plus MoE will make breakthroughs in size/compression and throughput.
muyuu
you need 8 x 96GB Blackwell or equivalent
so around US$150k which is Small/Medium-Enterprise territory already, but who knows when it will hit "reasonable" home consumer territory
I think there's hope future generations of unified memory machines may get this sort of memory availability when new fabs open in the next couple of years and then ramp up production for a few years afterwards - that makes ~2030s credible at this point, but nobody can really predict the market that far ahead
vitalyan123
distillation of thinking models is not particularly effective - both "Open"AI and Misanthropic don't show you the real chain of thought, only its severely downscaled version. both do everything in their power to combat such outrageous copyright infringement, so the bulk of unethically scraped data the Chinese have is from several generations ago.
nyrikki
It is quite likely that the intermediate tokens don’t have ‘semantic import’[0]
There are methods like Habitual Reasoning Distillation or Inverted Reasoning Traces [1] that can help.
While there are reasons to hide the intermediate tokens from an IP protection standpoint, there is also a need to hide more effective and efficient generation that doesn’t fit the R1 claims of an aha moment, which have been debunked but remain a consumer expectation.
While hidden intermediate tokens do increase the difficulty, that is not a firm barrier in itself, especially as they are billed, which gives away information about their length.
duskdozer
>such outrageous copyright infringement
Sarcasm, considering the source of their own training data?
Bolwin
For Claude models at least, you can tell them to just manually think in the output and it works fine. I do it regularly because for creative writing and summarization, they seem to believe they don't need to think at all, and get way worse results.
overfeed
FYI: model outputs are not protected by copyright.
kmeisthax
Chinese distillation attacks are about as unethical as Robin Hood stealing from the rich to give to the poor. The real unethical scraping was done by Anthropic to train Claude.
To be clear, if Anthropic was using totally licensed data, I'd be sympathetic to these claims. But if you're going to pirate the world's creativity you'd better be willing to gimme dat shit for free[0].
[0] As said by Hungry Santa.
mirekrusin
I don’t understand why there isn’t public dataset for reasoning that can be improved by humans/llms like Wikipedia (ie with auto judging contributions etc).
ComputerGuru
Supposedly there are “jailbreaks” that expose considerably more of the thinking traces.
maxdo
looking at the score this is rather a gemini 3.5 flash competitor, yes, for cheaper, but distance to opus and fable is as big as their price diff.
FooBarWidget
With such ridiculously long thinking traces I'm surprised max outperforms high. After all, performance falls off a hill after a certain amount of context, and long thinking traces can fill that up really quickly.
alexjplant
> It seems to really be a nice step-up and is getting quite close to the frontier.
IMHO it's already surpassed them. I vastly prefer my personal GLM and OpenCode setup to the Claude Code and Opus one that I have to use at work. The former makes way fewer StackOverflow brogrammer-tier mistakes and is considerably better at following instructions. The harness UX is also vastly superior as it doesn't ignore, randomly change, or incorrectly report settings.
Maybe it's the harness and I'd have even greater success with OpenCode and Anthropic, but I think it safe to say that Anthropic's moat is evaporating.
vorticalbox
This is a problem I find with Opus: it will spend so long thinking, then go “but wait what if”
To the point where I stop it and simply tell it to “start writing code you can work it out as you go along”
Seems writer's block also affects LLMs
robertkarl
https://arxiv.org/abs/2606.00206
In this paper they nerf an LLM's ability to emit waffling thinking tokens like "wait", "but", "alternatively", and the models (they're old, small models in the paper) terminate reasoning faster and perform better. I bet Anthropic is tuning this on their backend.
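For readers who want to see the general idea, here is a minimal sketch of one generic way to discourage specific tokens at decoding time: adding a negative bias to their logits before picking the next token. This is not necessarily how the paper does it, and the toy vocabulary, the logits, and the names `apply_logit_bias` and `penalized` are all made up for illustration.

```python
import math

def apply_logit_bias(logits, penalized_ids, bias=-5.0):
    """Return a copy of logits with `bias` added to each penalized token id."""
    return [x + bias if i in penalized_ids else x for i, x in enumerate(logits)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary and logits; both are invented for this example.
vocab = ["wait", "but", "so", "therefore"]
logits = [2.0, 1.5, 1.0, 0.5]
penalized = {vocab.index("wait"), vocab.index("but")}

before = softmax(logits)
after = softmax(apply_logit_bias(logits, penalized))

# Greedy pick before and after the bias is applied.
best_before = vocab[before.index(max(before))]
best_after = vocab[after.index(max(after))]
print(best_before, "->", best_after)
```

A bias only makes the penalized tokens less likely, it does not forbid them; a hard ban would set their logits to negative infinity instead.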
addandsubtract
Didn't they originally introduce those tokens to make the models smarter by second guessing their "thoughts"?
orbital-decay
I imagine Anthropic would rather train a small control model instead of resorting to sampling hacks
meatmanek
This is super cool. Do you know if any of the inference backends (llama.cpp, vllm, etc) support this technique?
giancarlostoro
I usually have Claude build a plan first, then I put it into an XML file it updates with phases, usually we talk about some of those tasks, and then once it's good and I like it, I have Claude implement the plan.
Another thing I tell Claude to do is to not guess, but look at documentation, it messes up a lot less, might use some tokens reading docs, but at least it has a higher success rate code wise.
xstas1
XML??
mikeocool
Seriously. Whenever I read the thinking output I get mad and turn down effort to medium or low.
Just output the code and we’ll work through it!
I feel similarly about having codex review claude’s plans. I don’t think I’ve ever seen it catch a major issue. It just points out things that would have inevitably been addressed during implementation anyway.
SubiculumCode
A lot of times this is how humans work. Just start 'putting words on paper', 'think by doing', etc. sometimes it's more efficient to see why something won't work after writing a bit of it, and sometimes you get lucky and it works right off the bat
drob518
Qwen is notorious for this, too. It’ll sometimes spin in a long loop of “But wait…” paragraphs.
epolanski
Fable was 20 times worse on that.
It's clear it was the vibe coding model, as like no other model before, fully turned you into his assistant instead of the other way around.
RyanHamilton
Could it be possible, these firms are optimizing for two things: a) Better performance. b) Gathering data from you to further improve performance later. I've also found the huge amount of planning rather than iteration frustrating. I've felt like I'm teaching a junior!
thinkingtoilet
I've been having success with Opus but you REALLY have to tame it. Long prompts that list what files to look at, relationships between entities, etc... I went from regularly hitting my daily limit to almost never hitting it. Oh, and also I was being lazy with small changes and stopping that helped a lot too. As you said, it gets in these loops where it's just churning and if you don't stop it it can go on for way too long.
h14h
Hopefully the recent work Moonshot did with Kimi K2.7 Code trickles in to the other open-model labs.
Per AA, while K2.7 Code is roughly on par w/ K2.6 in terms of intelligence, it uses half the output tokens to get there.
bertili
This is GLM 5.2 Max. GLM 5.2 High uses less than half[1] the tokens.
HWR_14
I thought you could not compare tokens across models because their cost and speed were so different between models.
robmccoll
That's interesting. I gave nearly the same task to Gemma4 31b as a test yesterday. Write a symbolic math engine in Typescript that can perform evaluation and simple expression reductions over +-/*(). It performed the task correctly with minimal reasoning - much fewer reasoning tokens than output tokens.
gbingles
Tbh, so what? I googled "symbolic math engine in Typescript that can perform evaluation and simple expression reductions over +-/*()" and got what looks to be viable answers without using any AI model at all. Reciting well established things from memory isn't terribly interesting. Show it a novel codebase and have it implement something within it.
SubiculumCode
TBH, while your point is a fair one, your attitude is off-putting and needlessly condescending.
drob518
So, a natural question would be why a model would ever get it wrong?
xyzsparetimexyz
Reminiscent of https://en.wikipedia.org/wiki/Portia_(spider)
kristopolous
I have a script that ranks these based on codingindex from Artificial Analysis.
All it does is pull a JSON from their main table page and parse out the fields I care about (coding).
There used to be a mailing list associated with it but eh ... there wasn't much interest. I use the script every day though.
Current partial output
score age size name
47.1 58 large Kimi K2.6
47.5 54 large DeepSeek V4 Pro (Reasoning, Max Effort)
47.5 70 - Muse Spark
47.6 132 - Claude Opus 4.6 (Non-reasoning, High Effort)
47.8 205 - Claude Opus 4.5 (Reasoning)
48.1 132 - Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
48.6 55 - GPT-5.5 (Non-reasoning)
48.7 188 - GPT-5.2 (xhigh)
50.1 29 - Qwen3.7 Max
50.7 1 large GLM-5.2 (max)
50.9 120 - Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
51.5 92 - GPT-5.4 mini (xhigh)
52.1 55 - GPT-5.5 (low)
52.5 62 - Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
53.1 132 - GPT-5.3 Codex (xhigh)
53.1 62 - Claude Opus 4.7 (Non-reasoning, High Effort)
55.5 118 - Gemini 3.1 Pro Preview
56.2 55 - GPT-5.5 (medium)
56.7 20 - Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
57.2 104 - GPT-5.4 (xhigh)
58.5 55 - GPT-5.5 (high)
59.1 55 - GPT-5.5 (xhigh)
62 8 - Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
To see everything, run it like so $ curl day50.dev/art-analysis.sh | bash
The repo: https://github.com/day50-dev/aa-eval-email
some key takeaways:
* open models are on about a 4-7 month lag right now depending on how you want to measure it
* if this keeps up, you might see an open-weights model doing claude fable 5 level work before the new year.
if people sign up for the free mailing list (that just does this) I'll go and put it back on ... emails when new model evals drop - it was pretty useful.
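A rough sketch of the ranking step described above, for anyone who would rather read code than pipe a script into bash. It is not the actual script: the record layout and the field names `coding_index`, `released`, and `size` are assumptions, and the sample records are placeholders rather than real scores.

```python
from datetime import date

def rank_by_coding_score(records, today, best_first=False):
    """Return (score, age_days, size, name) rows sorted by coding score."""
    rows = [
        (r["coding_index"], (today - r["released"]).days, r.get("size", "-"), r["name"])
        for r in records
        if r.get("coding_index") is not None  # skip models with no coding score
    ]
    return sorted(rows, key=lambda row: row[0], reverse=best_first)

def format_rows(rows):
    """Render rows in the same space-separated layout as the output above."""
    lines = ["score age size name"]
    for score, age, size, name in rows:
        lines.append(f"{score:.1f} {age} {size} {name}")
    return "\n".join(lines)

# Placeholder records; the schema is hypothetical.
sample = [
    {"name": "example-model-a", "coding_index": 50.7, "released": date(2026, 1, 1), "size": "large"},
    {"name": "example-model-b", "coding_index": 62.0, "released": date(2025, 12, 25), "size": "-"},
    {"name": "example-model-c", "coding_index": None, "released": date(2025, 11, 1)},
]

rows = rank_by_coding_score(sample, today=date(2026, 1, 2))
print(format_rows(rows))
```

The default is ascending so the best score ends up on the last line of terminal output; passing `best_first=True` flips the order.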
papersail
score age size name
62.0 8 - Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
59.1 55 - GPT-5.5 (xhigh)
58.5 55 - GPT-5.5 (high)
57.2 104 - GPT-5.4 (xhigh)
56.7 20 - Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
56.2 55 - GPT-5.5 (medium)
55.5 118 - Gemini 3.1 Pro Preview
53.1 132 - GPT-5.3 Codex (xhigh)
53.1 62 - Claude Opus 4.7 (Non-reasoning, High Effort)
52.5 62 - Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
52.1 55 - GPT-5.5 (low)
51.5 92 - GPT-5.4 mini (xhigh)
50.9 120 - Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
50.7 1 large GLM-5.2 (max)
50.1 29 - Qwen3.7 Max
48.7 188 - GPT-5.2 (xhigh)
48.6 55 - GPT-5.5 (Non-reasoning)
48.1 132 - Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
47.8 205 - Claude Opus 4.5 (Reasoning)
christoff12
Lol thank you for sorting.
Are the scores here normalized such that each point difference is equidistant?
papersail
rank score age size name
1 62.0 8 - Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
2 59.1 55 - GPT-5.5 (xhigh)
3 58.5 55 - GPT-5.5 (high)
4 57.2 104 - GPT-5.4 (xhigh)
5 56.7 20 - Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
6 55.5 118 - Gemini 3.1 Pro Preview
7 53.1 62 - Claude Opus 4.7 (Non-reasoning, High Effort)
8 53.1 132 - GPT-5.3 Codex (xhigh)
9 52.5 62 - Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
10 51.5 92 - GPT-5.4 mini (xhigh)
11 50.9 120 - Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
12 50.7 1 large GLM-5.2 (max)
13 50.1 29 - Qwen3.7 Max
14 48.7 188 - GPT-5.2 (xhigh)
15 48.1 132 - Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
16 47.8 205 - Claude Opus 4.5 (Reasoning)
17 47.6 132 - Claude Opus 4.6 (Non-reasoning, High Effort)
18 47.5 70 - Muse Spark
19 47.5 54 large DeepSeek V4 Pro (Reasoning, Max Effort)
20 47.1 58 large Kimi K2.6
21 47.1 29 - Gemini 3.5 Flash (minimal)
22 46.7 449 - Gemini 2.5 Pro Preview (Mar' 25)
23 46.5 211 - Gemini 3 Pro Preview (high)
24 46.5 16 - Qwen3.7 Plus
25 46.4 120 - Claude Sonnet 4.6 (Non-reasoning, High Effort)
26 45.6 5 large Kimi K2.7 Code
27 45.6 104 - GPT-5.4 (low)
28 45.5 56 large MiMo-V2.5-Pro
29 45.1 43 - GPT-5.5 Instant (May 2026)
30 45.0 29 - Gemini 3.5 Flash (high)
31 44.9 58 - Qwen3.6 Max Preview
32 44.7 216 - GPT-5.1 (high)
33 44.2 188 - GPT-5.2 (medium)
34 44.2 126 large GLM-5 (Reasoning)
35 43.9 92 - GPT-5.4 nano (xhigh)
36 43.4 71 large GLM-5.1 (Reasoning)
37 43.4 16 large MiniMax-M3
38 43.2 54 large DeepSeek V4 Pro (Reasoning, High Effort)
39 43.0 188 - GPT-5.2 Codex (xhigh)
40 42.9 76 - Qwen3.6 Plus
41 42.9 205 - Claude Opus 4.5 (Non-reasoning)
42 42.6 182 - Gemini 3 Flash Preview (Reasoning)
43 42.2 99 - Grok 4.20 0309 (Reasoning)
44 42.1 56 large MiMo-V2.5
45 41.9 91 large MiniMax-M2.7
46 41.4 91 - MiMo-V2-Pro
47 41.3 121 large Qwen3.5 397B A17B (Reasoning)
48 41.0 48 - Grok 4.3 (high)
49 40.5 71 - Grok 4.20 0309 v2 (Reasoning)
50 40.5 342 - Grok 4
51 39.8 54 large DeepSeek V4 Flash (Reasoning, High Effort)
A longer curated list based on kristopolous’ list, with more models included. For each model, I kept only the two highest-scoring entries. I used DeepSeek V4 Flash as the cutoff, since I consider it the lowest acceptable model that is still locally deployable.
matheusmoreira
These results are amazing! I can't believe an open weight model rivals Opus 4.6, my most used model!
cmrdporcupine
My observations:
Surprised to see MiniMax M3 so low on that list, not really my experience, I found it smarter than Gemini for a lot of things, that's for sure.
Also surprised to see Gemini 3.1 ranked that high there. It remains IMHO blatantly incompetent for tool use even in their own harnesses, so I can only assume this benchmark isn't ranking workflow things very high. Gemini can write code just fine. It just can't work well as an agent.
GLM 5.2 and Qwen3.7 max were from my experience fairly expensive to use on a per token price and hard to argue in favour of when the SOTA coding plans have a fixed price that makes them potentially more cost effective. (Yes I know z.ai has a coding plan but I've heard reliability nightmare stories, and it's not very cheap)
DeepSeek is clearly the best value for $$. With the right harness and prompting.
tcp_handshaker
Short comments...
- GPT 5.5 consistently the best, an opinion that gets me constant downvotes here by the Anthropic Marketeer strike force...
- China is going to eat the US lunch on AI
- What have European universities and companies been doing? It's as if, in a parallel past/future, Nikola Tesla and Edison had created flying Cyberpunk machines while European researchers were getting together to request EU funds to investigate how to breed faster horses.
- If Zuckerberg could be fired, after spending a total of $235 billion on AI and having NOTHING to show for...should he be fired?
Certhas
None of these models come from universities, European or otherwise.
Mistral is clearly currently not competing for Frontier Model. Whether this is due to a lack of VC Funds or a lack of technical ability or the former arising from the latter would be interesting to know.
The top models are from startups. Among the FAANG only Google managed to get a Frontier model, and they literally invented the architecture and have more money than they can possibly spend to throw at the problem. Facebook shows that even ungodly amounts of money don't get you there though.
So why did no EU based Startups succeed while two US start ups succeeded? I agree that that's a very important question the EU should ask. The Internet revolution was driven by US companies, and now AI will be as well, with Chinese Open Weights mixed in. The EU consistently can not turn its considerable economic output into fast moving tech firms.
marcus_cemes
To be honest, living in Switzerland and speaking with peers, we're just exhausted by the constant AI hype. For a lot of us, the fact that Europe isn't frantically trying to scrape the entire internet and every book in existence for the next massive model isn't a bad thing. The big players are doing their thing, like with the nuclear arms race. We regulate a lot, too much a lot of the time, but sometimes that trickles down to other places too. A lot was done right, imo.
ETH Zurich and EPFL universities recently put out an open model called Apertus (was on the HN front page a few months back), it's not a frontier model, but they built it properly regarding copyright and data transparency.
It might look a bit slow or old-fashioned, but focusing on doing things ethically and legally feels like a much better path than just joining the race to scrape everything.
wunderlotus
> - If Zuckerberg could be fired, after spending a total of $235 billion on AI and having NOTHING to show for...should he be fired?
Yes, if the premise was true but it’s not.
kristopolous
They did muse spark ... it's not garbage.
Also what are they building it for? I'd think it's to serve ads better or something like that. Maybe Muse Spark fits facebook's needs perfectly...
applicative
> China is going to eat the US lunch on AI
They will forever have superior weights?
ricardobayes
Well Europe is famously a laggard when it comes to new tech - in parts of Switzerland, two horses were required to be mounted in front to carry cars up until 1925. UK required a person to walk in front of a car and wave a red flag.
JKCalhoun
"…Anthropic Marketeer strike force…"
Might also just be the result of "good will" (that the company has deftly fostered). Other companies might learn from Anthropic in that regard.
cmrdporcupine
I also get the downvotes for the GPT thing, and agree with you about 5.5's quality, but TBH I don't think it's Anthropic marketing so much as three other things:
1. SamA and his company have a well-deserved bad reputation and Anthropic got some early good PR for basically not being SamA.
2. Claude Code got early head space, Boris and crew basically "invented" this kind of agent, and so has first mover advantage despite its known reliability and cost issues.
3. Most people I talk to haven't even tried Codex for some reason
Also it's uncool to complain about downvotes.
bel8
you left some models out like DeepSeek and Kimi, for example.
kristopolous
It was a truncated output from the script to demonstrate what it does ...
If you really want to see all of them:
Or run the script
ashenke
Because it's not in the top 20 in their benchmark, it's at #23
sosodev
Note that AA's coding index is only made up of two benchmarks: Terminal-Bench Hard and SciCode. I'm skeptical that it makes a good coding index. It ranks Gemma 4 31B above Deepseek V4 Flash. Having used both of those models for a broad variety of coding tasks I would choose Deepseek every day.
alecco
Consider using decrementing score order (best on top)
kristopolous
then I'd have to scroll up over 500 lines after running it every time to see what I care about.
But if that's your thing, here you go: https://github.com/day50-dev/aa-eval-email/commit/1853be6461...
add an argument (any argument) and it will be sorted as you specified. It just works as a toggle flipping the order ... so literally any string will do.
The original link has been updated accordingly with the new code.
datadrivenangel
Have it print paginated or just top 10?
bodhi_mind
Cool project! Side note: Kind of a bad practice imo to ask people to blindly execute bash from an unknown source.
slig
Thanks for sharing. I'm curious: why didn't you sort with the score descending?
kristopolous
Because it's currently 511 lines. Why would I want to scroll up to see the stuff I care about? Don't you want the relevant stuff to be right there in front of you?
duckmysick
I do and that's why I pipe the output to `head -n 20` or use `LIMIT 20` in SQL.
That aside, this is a good script you're running. Thanks.
fridder
Not OP but if you run this from the CLI it does make the ordering make a little more sense
snsnbsne
Because programmers can’t figure out how to have a CLI that prints in a normal order, with the newest stuff on top instead of on the bottom.
Setup a fresh new large monitor. Open CLI. Run command. Watch output at the bottom of your screen. Keep watching the bottom of your screen for the rest of the day.
Sure you can tile windows and it helps but come on. Just have the command/input section in the bottom and the “output” on top. Keep the command bit on the bottom.
jarjoura
Seems legit. My experiments with GLM-5.2 so far have resulted in strange hallucinations in the tiniest of places. Like a wrong variable name.
It seems like it's up for the task of complex code, but those little paper-cuts are scary to me. I wouldn't trust this model for anything remotely serious.
unrvl22
Why aren't more people talking about this? It's literally Opus 4.7 quality at stupid prices. I know providers who are offering this at unlimited tokens for $50 a month. Some are even offering API rates 3x lower than the official ZAI API rates, which are already like 10x cheaper than Opus. (Crof and Umans btw)
This is a huge blow to Anthropic/OpenAI/Google and a massive win for the rest of the world. The official API prices and speeds mean nothing for open source models.
CuriouslyC
Be careful about unofficial providers, a lot of them misconfigure models or stealth quantize them. For a while the difference between Kimi on the official API and most third party providers was 20-40%.
thehamkercat
Kimi K2 had a vendor verifier: https://github.com/MoonshotAI/K2-Vendor-Verifier
(there's a table which shows comparison between vendors)
Also, it seems there's a general one as well (for all kimi models?): https://github.com/MoonshotAI/Kimi-Vendor-Verifier
cedws
OpenRouter should be penalising or banning for this.
kilroy123
This is my biggest complaint about OpenRouter and I'm a fan. Might be pretty tough at scale?
orbital-decay
They have an "exacto" category with providers they supposedly verified
alecco
Would that align with their VC-backed incentives?
unrvl22
the 2 I mentioned both have a fairly large following, who run benchmarks and absolutely will spot issues.
stanac
> Some are even offering API rates at 3x lower than the official ZAI api rates
Looking at openrouter [1], some of the cheaper offerings are for quantized models. Not sure how much intelligence is lost in quantization. And they are not 3 times cheaper. Where did you find 3x lower prices for APIs? I am considering skipping open router and using them directly for that price.
edit:
I see, croft [2] 8bit for $0.50/$0.08/$2.20
scrlk
IME, unquantised -> FP8 is pretty much lossless. What matters more is having an unquantized KV cache - using an FP8 KV cache can result in a significant drop in quality.
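To put a rough number on "pretty much lossless", here is a simplified sketch that rounds values to 3 explicit mantissa bits, as in the E4M3 flavour of FP8. It ignores the format's limited exponent range, subnormals, and the per-tensor scaling that real deployments use, so it only illustrates per-value rounding error and says nothing about model quality. The function name and the sample weights are made up.

```python
import math

def round_to_mantissa_bits(x, explicit_bits=3):
    """Round x to the nearest value with `explicit_bits` stored mantissa bits.

    Simplification: exponent range and subnormals are ignored.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)               # x == m * 2**e with 0.5 <= |m| < 1
    steps = 2 ** (explicit_bits + 1)   # +1 for the implicit leading bit
    return math.ldexp(round(m * steps) / steps, e)

# Invented sample weights.
weights = [0.3, -0.7, 1.0, 0.011, 2.5]
quantized = [round_to_mantissa_bits(w) for w in weights]
rel_errors = [abs(q - w) / abs(w) for w, q in zip(weights, quantized)]
worst = max(rel_errors)

for w, q, err in zip(weights, quantized, rel_errors):
    print(f"{w:>7} -> {q:<12} relative error {err:.2%}")
```

Under these assumptions the relative error per value is bounded by 1/16, so the rounding is lossy in the strict sense even if the effect on outputs is small in practice.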
johnnyApplePRNG
>unquantised -> FP8 is pretty much lossless
Claude Shannon is rolling in his grave.
osti
The official API is FP8, which should imply that it's lossless.
ComputerGuru
Do infra providers reveal that level of implementation detail?
benjiro29
Neuralwatt ... When you reverse calculate the actual energy usage / price on a token basis, the gap is large.
I do not have GLM 5.2 numbers because the whole default max setting is overkill. But GLM 5.1 numbers had it at 12x cheaper than API rates. And about 2.5x more tokens vs zai's own subscription service.
Yes, it's FP8 but let's be honest, do we know for sure that even zai runs at FP16? I learned a long time ago with Claude and Codex how much cheating happens on model levels, even from the big boys.
spelk
Please correct me if you have contradicting data but: Neuralwatt's price per token vs price for energy comparison doesn't seem to take into account the cost savings from cache hits that other providers offer on pure token rates. The comparison seems to assume every input token is a cache miss.
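As a worked example of why the cache-hit assumption matters, the blended input price is just a weighted average of the cached and uncached rates. The prices below are invented round numbers in dollars per million input tokens, not any provider's actual rates, and `blended_input_price` is a hypothetical helper.

```python
def blended_input_price(miss_price, hit_price, hit_rate):
    """Average price per input token given the fraction served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_price + (1.0 - hit_rate) * miss_price

# Made-up prices, dollars per million input tokens.
miss_price = 1.00
hit_price = 0.10

prices = {}
for hit_rate in (0.0, 0.5, 0.9):
    prices[hit_rate] = blended_input_price(miss_price, hit_price, hit_rate)
    print(f"hit rate {hit_rate:.0%}: ${prices[hit_rate]:.2f} per million input tokens")
```

With these made-up numbers, a 90% hit rate brings the effective input price down to about a fifth of the all-miss price, which is the gap a comparison would miss if it treated every input token as a cache miss.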
On top of that, the cloud offering doesn't seem that well-run, they randomly blocked a colleague's API key for a couple days without any heads up, had a weird rate limiting bug and they have been deprecating models without redirects with very short notice, all while taking weeks to onboard new models. I assume some of these problems would be addressed if we had an SLA/enterprise contract.
It's a promising idea though. They offer a $5 trial credit (with an aggressive rate limit) though so no harm in trying it out.
Schiendelman
To answer the question in your first sentence - because it's VERY computationally (ha) expensive as a human being to keep up with all the options. It's also very hard to figure out how to run a model like this. There's no installer. If you really really care, which 99% of people do not, you have to google a guide, and then find out it's out of date...
I've tried a number of these, and the learning curve is very steep compared to "install Claude Code and pay $100/mo". There is no way saving me $50/month matters compared to figuring that out.
andai
But it just works with Claude Code? They have a guide on their website.
https://docs.z.ai/devpack/tool/claude
Here's my setup. I add this to my .bashrc
export ZAI_API_KEY="your_key_here"
alias claudez='ANTHROPIC_AUTH_TOKEN="$ZAI_API_KEY" ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic" ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5.2[1m]" ANTHROPIC_DEFAULT_SONNET_MODEL="glm-4.7" ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-4.7" claude'
Then I just run claudez
pro tip the same thing works with deepseek https://api-docs.deepseek.com/guides/anthropic_api
Even more pro tip: Claude Code can set this up for you haha
Schiendelman
Sure, I'm not saying I, a software engineer, cannot do this. I'm saying it's significant onboarding friction.
Unless this were a massive differentiator, people aren't going to be "talking about it" the way GP suggests!
chillfox
install opencode, then either pay $10 for their plan, or add an openrouter api key.
gerryf2
I agree with this.
I'd pay for an out of the box solution. i.e. an Installer with updates
re-thc
> There's no installer.
There's ZCode (https://zcode.z.ai). Which is like the Codex App.
That's as "easy" as it is for non-devs that you're complaining about.
Schiendelman
I'm not complaining about anything. I'm answering a question.
qingcharles
How does it compare to OpenCode? I already have too many LLM CLIs installed :(
CamperBob2
> It's also very hard to figure out how to run a model like this. There's no installer.
Yes, there is. It's called Claude Code. Point it at the HuggingFace URL and say "Download these weights and build whatever is needed to run them, then test the model."
PoignardAzur
I really miss the time when people thought that the idea of someone telling an un-sandboxed AI "do whatever is needed to X" was unrealistically stupid.
cedws
In my org everyone is extremely Claude-pilled to the point you’d think it’s the only LLM that exists, purely because it caters to non-engineers within enterprises.
sinatra
I've tried Chinese open models a few times before. They were fine, but they didn't come close to the benchmarks they were claiming.
Now, maybe GLM 5.2 is close to Opus 4.7, but I don't wanna keep checking them and keep finding that they're still benchmaxing and aren't at GPT (my choice) or Opus level. The boy who cried wolf, I guess.
enraged_camel
Yes, my experience has been the same as yours. I find that the performance of open models is quite acceptable, even good, at one-off questions or small tasks. But they are quite unreliable at long horizon goals.
unrvl22
I cancelled my claude sub after realizing I can burn 300m tokens a day of this quality, for $50 a month.
spelk
Which coding plan are you using? How are you finding it?
embedding-shape
> Why aren't more people talking about this?
Wasn't this released like 2 days ago? Everyone is still evaluating and playing around with it, and things like the submission are just starting to come out. Give it some days at least before jumping to conclusions, ideally weeks.
shostack
Which of those providers are:
1. Keeping your data private and in the US
2. Not training on it
3. Not quantizing the model
4. Offering reasonable latency and rate limits
mrngld
Artificial Analysis coding benchmark shows GLM5.1 on high pretty close to GPT5.5 xhigh in cost to run, with GPT5.5 on medium significantly less expensive. Compared to GPT5.5 medium GLM5.1xhigh is twice the cost and half the intelligence. They don't have GLM5.2 on there yet, but that'd be a big gap to bridge.
https://artificialanalysis.ai/agents/coding-agents?coding-ag...
I thought I was "holding it wrong" until DeepSWE came along -- personally it seems to match my own experiences pretty well. Really makes me wonder how legitimate some of the internet noise is about open models. There's surely some use cases for them, not everything needs the absolute frontier (GPT5.5 on low is awesome), but if you want to be near the frontier everyone needs to be honest about the fact that we're only talking about Opus, Fable, GPT5.5.
undecidabot
It got 46.2 on DeepSWE in Z.ai's own run[1]. That would put it between Opus 4.7 xhigh and Opus 4.8 medium.
mrngld
If that ends up being true, GPT5.5 at 70 (and presumably Fable a bit ahead of that) is still in a different league, which was partly my point. To listen to online chatter, GLM5.2 is a tectonic shift in the landscape. In reality, it's just interesting. Probably safe to bet once the DeepSWE benches all get fully updated it won't even be on the pareto frontier.
I'm not accusing anyone specifically, but I've noticed Chinese bots swamping certain YouTube channels that, for example, cover US defense industry news. They'll downplay any and all technical advances, play up China's dominance, US cowardice, etc. All very transparent. I suspect some of the online conversation about open Chinese models is driven by that. How often do you see people talking about Mistral or Trinity? Never. Because they don't play that game.
osti
There are definitely some Chinese bots + actual people (imagine that!) who like to talk up Chinese models, I'm one of them but I like to find out how good these models really are before saying anything.
GLM definitely isn't opus level yet but it's for sure good. I think it lacks some knowledge (when coding) that the frontier models possess, which is expected given that the model is probably quite small when compared to the frontier.
But people don't say much about Mistral, probably because they are nowhere as good.. And they don't have large population behind them to actually use them.
lukewarm707
with open models you can get a subscription with privacy, at the same cost as codex.
openai, google and anthropic subscriptions are not available with privacy.
looking at the link there it's interesting that going from cursor cli to codex cli takes gpt 5.5 from 7th to 3rd. but they didn't do open models in codex.
so, hard to say it's for sure a model benchmark. maybe open models are just shit at swe agent harness...it's not the most parsimonious explanation though.
vadansky
> with open models you can get a subscription with privacy
Unless you're running it locally, aren't you just trusting some other entity?
lukewarm707
correct, you are trusting another entity.
however, the legal terms are different. openai reads your data. they store it for 30 days, but of course once it hits the disk it can be kept for as long as needed in a civil case like nyt v openai.
the same for google and anthropic. so, it's not always nice if someone is paid to read your data for safety. people upload sensitive matters, personal videos and so on.
i wouldn't prioritise it myself but you can also know that the data will all come out in discovery if you are in a legal issue. maybe that's not important, but people thought it did matter to give some protections to patient records, legal advice and therapy. you upload that to gpt and it goes into discovery.
conception
While true, there are laws about actually doing what you say you are doing, especially in certain regulated environments. If you are in the same country as the entity you are trusting, you usually have some form of recourse if they don't live up to that trust.
yieldcrv
right, and on prem being an option is a godsend, however you manage to do it
it's not a recommendation, it's an option. if you don't have capital then it doesn't apply to you and move on. before, it wasn't an option even for people with capital.
come back in a few years when it's more accessible
additionally I like that there are providers with faster special purpose processors for faster tokens/sec, all at different pricing strategies
so just pick something that matches your personal risk tolerance
ttul
DeepSWE “feels” like the right benchmark in comparison to Artificial Analysis indices and other coding benchmarks. And by their metrics, GPT-5.5 is still king in token efficiency, speed, and overall intelligence per dollar.
Fable 5 is cool and all, but we have not yet seen GPT-5.6.
slagfart
GLM5.2 isn't even on this benchmark
cmrdporcupine
I gave GLM 5.2 a spin on openrouter yesterday and it was mostly fine but it racked up $5 in token use in 30 minutes of (relatively slow) work.
It's easily 4x the cost of DeepSeek V4 but I didn't actually feel the results were that much better. I had GPT 5.5 in Codex review it after it was done and there was plenty of slop to go around.
Having better luck with MiniMax M3, from a cost/benefit ratio.
pjerem
I really like DeepSeek V4 Pro. It's pretty smart and I get so much usage out of it on a $20 Ollama cloud plan.
With a good harness, that's my favorite model for any personal project. I use Opus 4.8 at work because i don't have to pay for it and of course I love it, but DeepSeek is like 80% there for one tenth of the price.
zooming
Try MiMo-2.5, I'm having astonishing success with it in opencode for cents per day. Not even the pro model.
spelk
I've found MiMo-2.5 is fun for front-end design since you can use its multimodal capabilities: drop in a screenshot of whatever it produced and have it correct the result for you.
re-thc
> I had GPT 5.5 in Codex review it after it was done and there was plenty of slop to go around.
GPT can find fault in everything and anything including its own work.
gbingles
AI review generally will find fault in anything. Any non-trivial code has multiple solutions with different tradeoffs. Any code can be over-engineered for theoretical edge cases and future use cases you don't need. No matter which solution you pick you can always at a minimum say that some alternative just looks and reads better.
Code is somewhat artistic. If you don't have well defined standards and priorities, the AI review cycle can spiral infinitely figuratively debating what makes art good, and your code will be no better for it.
cmrdporcupine
That's what I love about it, and I wish I could find an open model that was as diligent.
Somehow it's just way more careful than the others, and also much better at empirical verification of its hypotheses, writing tests, etc. I am assuming a lot of RL was done on that kind of flow, and on seeking out negative cases, failure points, race conditions.
simonw
I was surprised that GLM 5.1/5.2 are not vision models - they are text input only.
That's actually pretty uncommon these days. All of the OpenAI/Anthropic/Gemini models accept images, and so do the other leading open weight families - Gemma 4, Qwen 3.6, Kimi 2.x.
In GLM's case image input would be useful because it's a model that scores very highly for tasks like web design, but without image input it can't take a screenshot and output HTML+CSS.
Don't get me wrong, GLM is a phenomenal model, but the image thing is a bit of a gap.
0xbadcafebee
Configure a subagent in your coding harness to spin up a new sub-session with any vision model for those tasks and feed the result back to the main model. No need for "one model that does everything"
ricardobeat
That doesn’t work well in a lot of scenarios. The text LLM doesn’t know what to look for in an image before it sees a description, you might need multiple rounds of back and forth.
jarjoura
Vision decoding outside of the latent space of the model is lossy, but claude opus's vision isn't that great outside of UI screenshots. I mean it works in a pinch. At least in my testing, if you're looking at non-UI images, there are better image-to-text models that can turn them into very precise documents that any LLM can easily parse.
WASDx
Are you suggesting it should summarize the image in text or generate it in HTML or something else?
abby3010
Agreed, that's actually one step that will make people adopt it widely for customer facing AI Agent!
_pdp_
I don't see this being such a big gap. There are some use-cases for sure, but apart from UX/UI work it is not really needed. Besides, none of the frontier models can replicate actual images; they can only approximate them, at least in my own experience.
simonw
One of my tests for a new model is dumping in a screenshot of a web page and seeing if it can recreate it from scratch in HTML and CSS.
Even the local models I run on my Mac are getting surprisingly good at that now.
kamranjon
a pretty fun and quick test i do with vision models is to screenshot the hackernews homepage and ask the model to return a json representation of the screenshot - qwen 3.5 0.8b did surprisingly well at this.
tiahura
Using llms to generate docx. Being able to rasterize and review is an important part of the process.
x3cca
I've been using Google AI Studio as a free vision bridge. Gemma 31B is dummy capable at vision and at 1500 rpd it's basically unlimited.
ashenke
I had the same reaction with DeepSeek V4! It would be more useful as a vision model
CuriouslyC
I've been playing with this model a fair amount over the last 24 hours, and I can confirm it's quite capable, while being a little bit verbose (I've seen it reconsider things 3-4 times in thinking traces before deciding on a path forward), and not being quite as good as GPT5.5 at working through complex abstract requirements.
Honestly it's good enough that I feel comfortable recommending a Z.AI sub + a $20/mo OpenAI sub for all but the most AI pilled multi-orchestrators, or the die hard Claude fans. GLM writing + GPT reviewing/debugging feels pretty unlimited and minimally worse than just doing everything in GPT with the $200/mo plan.
Havoc
> while being a little bit verbose
Discovered today that they set reasoning effort to max by default. So that’s probably why
sdesol
> GLM writing
This is honestly what I care about the most now, which is how well they can write. I think we have reached a point now where, if you know how to program, you can provide enough information for the models to pretty much do what you need.
What they still struggle immensely with is the writing which has too many nuances but they are truly getting better.
andai
This is my workflow. And then once a day I copy paste the code into the free Claude Sonnet so it comes out actually readable.
igravious
After having got a taste of Fable 5 for me Opus 4.8 doesn't cut it any more -- and I don't know how to put this, I don't know if it's just me, but its rhetorical flourishes are starting to really grate on me, never mind that it is at times deliberately weasel-wordy and economical with the truth until pressed. Opus 4.8 is definitely a stronger coding agent than DeepSeek 4.0 or Kimi 2.7, succeeding where they flounder and fail, but its way of expressing itself conversationally is making me reconsider my subscription …
elwebmaster
You are not alone. How about GPT 5.5? Does it come close to Fable 5?
theplumber
GPT 5.5 xhigh is smarter than Fable, but Fable, like Opus 4.8, is faster and seems more “agentic”. It’s easy to test this. Build a fairly complex piece of software with Claude (Opus or Fable).
Review the commits with both Claude and GPT 5.5 xhigh. You can see that Fable is still sloppier compared to GPT. You can test it the other way around as well (drive the dev with GPT and review with GPT and Claude). You get the same result. Claude does have an edge, though, and that’s in building more beautiful user interfaces.
fragmede
5.5 is pretty good. It's no Fable though. It is definitely better than opus tho.
CubsFan1060
Knowing very little about how to run these, how close are we to medium or larger businesses starting to buy hardware to run models like this to keep the models local?
It’s expensive, and not as capable as the frontier models, but would have some pretty big benefits around privacy and agency.
wongarsu
I know of multiple businesses in Europe that have been doing that for a while with 70B models, and are upgrading hardware to run the new crop of 700B-1T models (really started around Kimi K2, but buying and hosting that kind of hardware takes time)
Not everyone is willing (or even legally able) to send their trade secrets to OpenAI or Anthropic
user43928
While certainly there are such cases with trade secrets, it's worth noting that even large banks typically have a provider like Azure or AWS onboarded.
There they can deploy these models while using the existing legal frameworks.
CubsFan1060
What kind of hardware/price does it take to run those?
bitmasher9
Nvidia will sell you an entire server rack ready for inference. Or maybe you can roll out your own Blackwell based system.
We’re approaching a world where running a primer frontier model is possible on a workstation, probably will have something under $30k that looks like a desktop for Nvidia’s next generation. It sounds expensive, until you look at your Anthropic bill.
It’s similar unit economics to cloud computing for the open models. You can save a ton on the expenses by buying the hardware, but it requires a lot of in-house expertise, and you get the most value if you keep the system operating around the clock. The big kink is that open models are usually 2 quarters behind frontier, and your competitors are probably trying to get access to Mythos.
wongarsu
For an 8-bit quant (what people call "near lossless") you are looking at something like 4xMI350X, which comes out to about $150k after adding the rest of the server. More if you go with Nvidia instead of AMD. More if you want more than maybe 8x concurrency
But prices are changing rapidly, and not for the better
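For anyone wanting to sanity-check sizing like this, here is a rough back-of-the-envelope sketch. It only counts weight memory plus a flat overhead for KV cache and activations. The 20% overhead and the 288 GB per accelerator are assumptions picked for illustration, not vendor-verified figures, and the ~750B parameter count is the one mentioned elsewhere in this thread.

```python
import math


def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # params_billions * 1e9 params * (bits / 8) bytes each, divided by 1e9 bytes per GB
    return params_billions * bits_per_param / 8


def min_accelerators(
    params_billions: float,
    bits_per_param: int,
    gb_per_accelerator: float,
    overhead_fraction: float = 0.2,  # assumed headroom for KV cache and activations
) -> int:
    """Smallest number of accelerators whose combined memory fits the model."""
    needed_gb = weight_memory_gb(params_billions, bits_per_param) * (1 + overhead_fraction)
    return math.ceil(needed_gb / gb_per_accelerator)


if __name__ == "__main__":
    # ~750B parameters at 8 bits per weight, on hypothetical 288 GB accelerators.
    print(weight_memory_gb(750, 8))       # weights alone, in GB
    print(min_accelerators(750, 8, 288))  # -> 4
```

Higher concurrency means more KV cache per request, so the overhead fraction is the number to raise first if you expect many simultaneous users.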
MikhailTal
This is not a new situation. This was also happening when good vision models like AlexNet were coming through, especially for OCR. Companies had a choice between cloud or self-hosting with GPUs. But it turns out the problem is usage patterns.
Your usage will peak during work hours in certain timezones (even if you are a huge multinational company, most of your engineers/users tend to be in only a few locations), so you have a bunch of GPUs doing nothing the rest of the day. Especially with latency-sensitive stuff, this is a decades-old tradeoff problem; it's not unique to LLMs.
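The utilization point can be put into numbers with a small sketch. It spreads a fixed hardware price over only the hours the machine is actually busy. The $150k price is the figure quoted in this thread; the 3-year lifetime and the weekly busy hours are assumptions, and power, hosting and staff costs are deliberately left out.

```python
def cost_per_busy_hour(
    hardware_usd: float,
    lifetime_years: float,
    busy_hours_per_week: float,
) -> float:
    """Hardware cost divided by the hours the system does useful work."""
    total_busy_hours = lifetime_years * 52 * busy_hours_per_week
    return hardware_usd / total_busy_hours


if __name__ == "__main__":
    office_hours = cost_per_busy_hour(150_000, 3, 40)       # busy only 8h x 5 days
    around_the_clock = cost_per_busy_hour(150_000, 3, 168)  # busy 24h x 7 days
    print(f"office hours only: ${office_hours:.2f} per busy hour")
    print(f"around the clock:  ${around_the_clock:.2f} per busy hour")
    print(f"ratio: {office_hours / around_the_clock:.1f}x")
```

Under these assumptions the same box costs roughly four times more per useful hour when it only serves one timezone's working day, which is the gap a cloud provider can close by pooling customers across timezones.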
Havoc
It’s a ~750B model so still a hell of a lot of VRAM
Would need to be a pretty determined medium biz
moffkalast
So far there seems to be one major use-case for complete privacy, and that is legal work. You don't need top of the line models to search vast amounts of text in discovery and it needs to be completely confidential. There's quite a few lawyers over on r/localllama showing off their multi-GPU builds. Coincidentally they also have the vast funding required for it.
petesergeant
Unless you have genuine national security concerns, you’d be better off just negotiating a commercial agreement with privacy protections with a couple of existing vendors.
CubsFan1060
I think that's true until it isn't, which may end up being the problem. Fable/Mythos doesn't fall under the ZDR agreements with Anthropic. And I'm curious if others will follow suit.
tancop
if you can afford the investment you get stable low costs for years with better security (at least if your cyber team is good). it's even better in regulated industries where some vendors might add a premium for hipaa/soc/pci dss compliance to the point it's a lot cheaper to self host. for a smaller business it's not worth it and you should just use a hosted open model.
petesergeant
> to the point it's a lot cheaper to self host
I'm pretty skeptical, especially given typical utilization patterns. Do you have numbers, or is this just vibes?
re-thc
> how close are we to medium or larger businesses starting to buy hardware to run models like this to keep the models local?
Years.
Even Microsoft said they don't have enough for GitHub and need to call Amazon.
Getting even a few at decent prices is hard. Unless the shortages ease...
gauravvij137
They've come along pretty far now.
I remember when there was hype around GLM 5 reaching great heights on benchmarks but eventually failing on practical coding and reasoning tasks. I guess this time the hype is real.
tensegrist
> On the Intelligence vs. Cost per Task Pareto Frontier: GLM-5.2 is on the Pareto frontier of the Intelligence vs Cost per Task chart, with the lowest cost per task among models at its intelligence level. GLM-5.2 costs ~$0.46 per task, compared to GLM-5.1 ($0.25), Kimi K2.6 ($0.31), MiniMax-M3 ($0.18) and DeepSeek V4 Pro (max, $0.05)
am i missing something?
OtherShrezzing
I think they’ve just picked poor peer examples. Instead of choosing other models near 5.2 on the intelligence scale, they’ve picked some open models from further down the scale.
acchow
pareto frontier does not mean cheapest.
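To make the distinction concrete, here is a minimal sketch of a Pareto-frontier check over (cost, score) pairs. The costs mirror the per-task figures in the quote above, but the model labels A to E and all the scores are invented placeholders, not real benchmark results.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    cost: float   # USD per task, lower is better
    score: float  # intelligence score, higher is better (placeholder values)


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Keep every model that no other model beats on both cost and score."""
    frontier = []
    for m in models:
        dominated = any(
            o.cost <= m.cost
            and o.score >= m.score
            and (o.cost < m.cost or o.score > m.score)
            for o in models
        )
        if not dominated:
            frontier.append(m)
    return sorted(frontier, key=lambda m: m.cost)


if __name__ == "__main__":
    models = [
        Model("A", 0.46, 50),  # most expensive, but nothing else scores as high
        Model("B", 0.25, 40),
        Model("C", 0.31, 38),  # dominated by B: B is cheaper and scores higher
        Model("D", 0.18, 35),  # dominated by E: E is cheaper and scores higher
        Model("E", 0.05, 36),
    ]
    print([m.name for m in pareto_frontier(models)])  # -> ['E', 'B', 'A']
```

With these made-up scores the most expensive model still sits on the frontier, because being on it only requires that no cheaper model matches its score.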
xiaoyu2006
Some models are heavily subsidized. Total params & active params are a better measure of inference cost.
simianwords
No models are subsidised -- there are lots of third party hosting services that will still run at breakeven/profit. (except Deepseek after discount)
stymaar
> No models are subsidised
We have no proof in either direction; it's not like we have access to their detailed financial numbers.
And the pricing itself muddies the water, as input tokens that are already in the KV cache are practically free for the provider, whereas other tokens are expensive. So they could still make money overall thanks to people having multi-turn conversations (and as such, paying multiple times for the same token), but lose money on the actual compute done.
> there are lots of third party hosting services that will still run at breakeven/profit.
How can you be sure that they are making profit directly from token price, and are not billing at marginal cost (i.e. electricity price, without counting the cost of the GPUs) and aiming to make a profit later on from the valuable training data that they are collecting in the process?
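A rough sketch of the multi-turn effect described above: each turn re-sends the whole history as input, so billed input tokens grow much faster than the tokens that are actually new. It assumes an idealized cache where everything from earlier turns is a hit, and that every input token is billed at the same price, which does not hold for providers that discount cached input. The turn sizes are made up.

```python
def input_token_totals(tokens_added_per_turn: list[int]) -> tuple[int, int]:
    """Return (billed_input_tokens, fresh_input_tokens) for one conversation.

    tokens_added_per_turn[i] is how many tokens turn i adds to the context
    (the new user message plus the previous assistant reply).
    """
    billed = 0   # what the user pays for: the full history on every turn
    fresh = 0    # what misses the cache: only the newly added tokens
    history = 0
    for added in tokens_added_per_turn:
        history += added
        billed += history
        fresh += added
    return billed, fresh


if __name__ == "__main__":
    billed, fresh = input_token_totals([1000] * 10)
    print(billed, fresh)  # -> 55000 10000
```

In this toy case the user is billed for 5.5x more input tokens than the provider has to process from scratch, which is why per-token prices alone say little about margins.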
SwellJoe
I added it to my benchmark based on Mythos-reported bugs, and it's better than GLM 5.1, but still behind several other models, maybe most directly comparable to Qwen 3.7 Max. But, several other open models, including small self-hostable ones (Gemma 4 and Qwen 3.6), found the same number of bugs, 3 of 9. Though it also gets partial credit for reporting one bug in the right spot, but kinda misunderstanding the bug. I also added Kimi K2.7-code in the same run, and it did poorly, consistent with 2.6 performance. Anyway, there are better, cheaper, models on this particular benchmark.
https://swelljoe.com/post/will-it-mythos/
(This small benchmark doesn't prove anything. It's a limited data set and each model only gets one shot at each file in the corpus. But, I find it useful for quickly sussing out if a model can reason about pretty complicated problems in code.)
gertlabs
GLM 5.2 is the first model we've tested that is unambiguously on par with, or better than, Opus 4.6 (although, as usual, we score GLM 5.2 and most other Chinese models a bit lower than most other benchmarks, which have more vulnerable test methodologies).
Data at https://gertlabs.com/rankings
nsoonhui
I really have to take your score with a grain of salt because Opus 4.5 does better than Opus 4.6
gertlabs
They're within confidence intervals of each other, but remember how much discussion there was that Opus 4.6 had been nerfed in March. We averaged samples over the entire lifetime of Opus 4.6, which likely served many different underlying checkpoints. Even the best version of Opus 4.6 was hardly an upgrade.
We find a lot of interesting anomalies with our benchmark that hold up under large sample sizes.
wongarsu
It's also third best overall on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, GPT 5.5 or Fable.
That's the one benchmark that allows LLMs to answer "I don't know" and punishes them for trying to bullshit their way through the questions
It seems to really be a nice step-up and is getting quite close to the frontier. I wish they'd start focusing on the reasoning efficiency now, though. I have a simple (relatively) test task to evaluate LLMs: writing a simple math evaluator library in Nim (it's about 400-600 lines total max), and GLM 5.2 (xhigh which maps to max effort) spent over 15 minutes (!) reasoning, spending about 45k tokens, before it finally wrote the first file.
I know it's hard to improve on that, but now that their models are good enough at raw intelligence, I think this should become a higher priority task.
Currently on https://artificialanalysis.ai/#output-tokens GPT 5.5 xhigh spends 16k tokens total on average, GPT 5.5 high is 10k, Fable 5 33k, Opus 4.8 41k, GLM 5.2 is 42k. GPT 5.5 is extremely reasoning efficient.
Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.