cultofmetatron

I seriously don't get all this big hullabaloo about one shot prompting.

By definition, a single prompt won't constitute the complexity of a software project. Ergo, what you'll get is a series of assumptions made by the model based on preexisting code in its training corpus.

I'd rather see a coding agent that can follow steps in a plan file to a T while following guardrails and adhering to the proper coding conventions in the human reviewed spec.

I'd rather see performance in agent loops against human defined objectives where it can be verified to stick to defined guardrails and continue without drift till its objectives are complete.

I'd also like to see it identify bugs and potential performance increases by identifying existing code and suggesting refactors based on context it can pick up about the particular use case you are trying to create.

These are way more valuable metrics than "hey build X"

post-it

The streetlight effect:

> A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, "this is where the light is"

All of your suggestions are better but they're hard, so someone casually evaluating an AI isn't going to do them.

sanderjd

Sure, for casual evaluation, I agree. But are there serious analyses that are evaluating this kind of thing? I mean, these are the kinds of things I evaluate in my own work when a new model comes out, or when I'm evaluating a harness. But this is all very ad hoc and intuitional. I'd love to start bringing rigor to it, but I haven't found much prior art on this. In another thread someone said that's because it's probably impossible to do this rigorously because too much of it is subjective. And that does match my intuition. But I continue to suspect that intuition is wrong.

jerf

It's hard to bring much rigor to it. I'm not saying impossible, but it's not like it's completely obvious how to do it and people are just too lazy. Intrinsically, if I'm going to test a back-and-forth with a model I have a human in the loop making frequent decisions. Did the model fail or succeed at whatever rate it did that because of the model or the human? Did the testing protocols capture the actual problem, e.g., maybe if the model was given some particular bit of information that a normal human would have given it it would have done much better or worse, but the testing protocol in the interests of "rigor" excluded the human in the loop from doing it. Is the human going to be willing to sit down and do the same task 25 times, refreshing the model from scratch each time for a "valid" test? Can you get the same human to analyze every model in the test? Is their 10th pass of the problem an invalid test because you can't as easily erase the human's knowledge of the previous 9 tests? What do you do with a model that succeeds wildly 75% of the time and spins off into a loop the other 25%? Is that loop real or, again, did your "rigorous" testing protocol prevent the human from saving the model from the loop like any developer would?

And so on and so forth. Again, I'm not saying this is impossible but I am saying that if you tried to do it, and you got the money, and you built the test, and got the human subjects clearance, and you ignored that during the process of all that at least one more frontier model would come out, you can count on HN anklebiting your "rigorous" study even so, and probably being correct about a lot of the issues it could have because it would take several iterations of this to build a reasonable protocol... at which point it would quite possibly also be obsoleted by progress again.

abhgh

You usually see this kind of analysis in conference papers, esp. if they have a datasets track. The NeurIPS Datasets & Benchmarks (D&B) track is a good example. But you will have to monitor the proceedings yourself closely - there is little chance of being accidentally exposed to them, because most blogs, announcements and popular media only mention a handful of the popular ones, e.g., Tau^2. For ex., across the years 2022, 2023 and 2024, 900+ papers were accepted in the D&B track [1] - of course, not all of them are LLM-related. I find them interesting because they often focus on specific system behaviors, and like you said, study them scientifically, so you can draw authoritative conclusions (or at least know specifically what part of a model's behavior you now know about, and what parts you don't).

[1] https://blog.neurips.cc/2025/09/30/reflecting-on-the-2025-re...

nileshtrivedi

You might like SWE-WebDevBench, which tries to do this kind of comprehensive eval for webapp development. https://webdevbench.com/

newaccountman2

[flagged]

blanched

This kind of hamfisted snark tends to make people take the actual and justified criticism of police less seriously.

post-it

It could be a taxi driver if you like. Or an anarchist passing by on xir way to a protest.

layer8

…in the US.

echelon

The minute an open model breaks through and beats Claude Opus/Fable, it's over.

There are far more opportunities that can be served when the world's intellectuals have the raw weights and can fine tune, splice, distill, and reapply.

Imagine having raw unfettered access to Fable. It can be refit to structural biology. It can be fine tuned on the repo for smaller context requirements. It can be run cheaper and air gapped.

The world wants this.

barrenko

As crazy as this sounds, and as much as I don't want to believe it myself, I think we're still underestimating LLMs, and we're gonna get to that point pretty soon.

digitaltrees

I don’t think we need them. I think the models we have are good enough. It’s the orchestration layer that makes the biggest difference at this point. The open source models we have are capable of calling tools and the work is getting them to be capable enough to know which tools to call and what to do in response.

I think we are leaving the mainframe era of AI and entering the PC era already. If there wasn’t a RAM shortage and we all had 2TB of RAM and GPUs we would all have large local models or personal APIs serving our teams.

That’s why all the labs are moving to the App layer and moving away from being the API for intelligence like they were originally.

jupr

The world does want this. Opus capabilities, in a box, securely tunneled to my family and me, utilizing the resources I already have available to me, which are energy + network.

rdsubhas

IMHO, it's not the oneshotting.

It's the "starting from empty slate" greenfield that's the real problem.

We used to make fun of Engineers who follow a README on a framework, test it on an empty project, and say "this framework is the best for our 10 year running production app". Greenfield mentality is always the solution to all problems and problem to all solutions.

One should still measure oneshotting, it's an important self-measurement metric - but against an established, large codebase.

keheliya

There are upcoming benchmarks aimed at measuring the ability to work with brownfield tasks. (Of course, benchmarks can be gamed, but they are still better than unrealistic toy tasks that earlier generations of benchmarks used. Frontier labs are yet to use them in their tech reports or marketing material, though.:-)

* SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios https://arxiv.org/abs/2512.18470

* SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration https://arxiv.org/abs/2603.03823

dluxem

I think this (for me at least) is the biggest pain point. Use styles and practices from this existing code base, even if they aren't documented explicitly in AGENTS.md or something. If we're importing a library somewhere that does what the agent is doing, reuse that same library - don't choose another one. If we have a pattern for unit tests, follow the same style. Etc. etc.

That issue, and the issue of "aesthetics", are the biggest complaints I have today. I don't know exactly how to define aesthetics, but it's when AI is making decisions that no experienced developer or designer would. They may be functionally correct but "ugly" to another developer or an end user.

An example is a case I ran into yesterday: parsing a config, then failing and logging on a configuration error. It logged the specific item where the config was invalid, but not the group or any notion of where in the config the error was. Of course, specific item names could be duplicated in different parts of the config. It's small, but correcting these minor things takes time, and they are the types of decisions no one who had any experience writing code and debugging a config problem would have made. This was Opus 4.8/max too.
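
To make the config complaint concrete, here is a minimal Python sketch of error reporting that carries the full path to the bad item. It is not the code from that session: the `validate_config` helper, the sample config, and the "null means invalid" rule are all made up for illustration.

```python
def validate_config(node, path="config"):
    """Return error strings, each prefixed with the full path to the bad item."""
    errors = []
    if isinstance(node, dict):
        for key, value in node.items():
            errors.extend(validate_config(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            errors.extend(validate_config(value, f"{path}[{index}]"))
    elif node is None:
        # Hypothetical rule for illustration: a null leaf counts as invalid.
        errors.append(f"{path}: value is missing")
    return errors


# The item names "timeout" and "host" appear in both groups, so the bare
# item name alone would not tell you which one is broken.
config = {
    "database": {"timeout": 30, "host": None},
    "cache": {"timeout": None, "host": "localhost"},
}

for error in validate_config(config):
    print(error)
```

Threading the path through the recursion costs one extra argument, and it is the difference between "host is missing" and knowing which of the two `host` entries to fix.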

bluGill

At least they did some analysis. I've seen a couple of AI slop "X is the best tool for the job" claims that didn't even try it. (Worse, we are already using Qt, which has a tool for the job, and the Qt tool works with the rest of the Qt ecosystem, unlike whatever AI told them)

gertlabs

One-shot performance often translates to the most difficult problems a model will be able to understand. We run an evaluation that tests both agentic and one-shot performance, and we find that Chinese models are almost universally very good at using tools and a harness to iterate towards a better solution, whereas their initial response ranks relatively low.

Compare that to Gemini models, which have impressive fluid intelligence on the first response, but fail to call tools or explore correctly which limits their usefulness for agentic coding.

Neither will be great for coding in a computational chemistry repo for different reasons, but the model with strong one-shot performance will be less likely to make subtle errors indicative of poor understanding, so we weight both capabilities into their final score.

The latest Anthropic and OpenAI models excel in both domains.

Data at https://gertlabs.com/rankings

mycall

> The latest Anthropic and OpenAI models excel in both domains.

Is that because OpenAI models are not a single model but a cluster of models which specialize in different domains?

gertlabs

By domain, I really meant "tool calling" and "one-shot fluid intelligence"

Anthropic models were the original leaders in tool calling and agentic work, even when other models felt significantly smarter (Claude Sonnet 3.5 vs Gemini 2.5 Pro, for example). OpenAI models were the opposite, starting smart (more correct solutions on the first try) and getting better at exploring and iterating with tools in 2026. The latest releases (Opus 4.5+ and GPT 5.4+) excel at both.

hnfong

It's a proxy for what you actually want to measure.

Note that after the model generated a bunch of (intermediary) code, they still have to have it tested and get bugs fixed (via the agent/harness). In this "one shot" you still have agent loops against human defined objectives.

And these toy examples give some insight as to how the model performs. If the test were "here's some code written by $corp, please take these tickets and work on them" it may be a "real" example but nobody would be able to make sense of actually how "hard" it is, or how "well" the model did the job, besides the workers already familiar with the context.

At least everyone knows what a 3D game is.

bluGill

As someone who works at $corp - there is a massive difference in tickets. I've seen "The is not spelled 'teh'", and I've seen some other service is writing to memory causing a crash in my service (the latter took months to track down since our code was correct and nothing gives a hint of where to look). Both problems are important to fix, but the first is so simple I don't care how good AI is (the hard part is getting it through the process)

ulrikrasmussen

I guess the experiment is interesting to determine if a model can produce something subjectively valued as "good" based on fairly vague and open-ended specifications. The benchmark is not to determine if the output fits the input, but whether the output is internally consistent: it's a game, but does it behave as one would expect that any game behaves? Does it end when you reach the goal, do you die when hitting the spikes, are there weird edge cases in behavior when you move around?

I think however that they should have used the same harness and also repeated the experiment a few times to judge the variance in results.
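
As a rough sketch of what "repeat it and judge the variance" could look like, assuming each run gets a numeric score from some rubric: the scores below are invented purely for illustration and are not from the article.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) for scores from repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return statistics.mean(scores), statistics.stdev(scores)


# Made-up rubric scores (0-10) for five repeated runs of the same prompt.
runs = {
    "model_a": [7, 9, 8, 4, 9],
    "model_b": [7, 7, 8, 7, 7],
}

for name, scores in runs.items():
    mean, spread = summarize_runs(scores)
    print(f"{name}: mean={mean:.2f} stdev={spread:.2f}")
```

With numbers like these the two means are close while the spread differs a lot, which is exactly what a single run of each model cannot show.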

pu_pe

It's true that no one is trying to one shot anything serious right now, but it's still an important metric. Claude Code and Opus really took off when they improved the harnessing enough that it would self-correct many of its mistakes without needing user input. In fact I think long-term autonomy (in the range of several hours) and self-correcting is going to be where we see most improvements in coming years.

bogtog

> In fact I think long-term autonomy (in the range of several hours) and self-correcting is going to be where we see most improvements in coming years.

Right, model intelligence defines the scope of things they can one shot

I also suspect that users naturally calibrate to a model's useful scope, gradually getting positive/negative feedback and gradually making their requests bigger/smaller than before

dakolli

It won't happen, it's all a money grab.

OtomotO

I think that LLMs will stay, but I also think we've plateaued and that big companies will fail and fall and we will have another years long "halt" of any real advancements coming to the public.

Similar to how ML was all the hype about 12 years ago and then it submerged again for a couple of years.

somenameforme

Unless I'm missing something, the prompt he gave must have been fairly detailed because both games are basically identical.

But for a more practical issue, the ultimate goal of LLMs is to replace software engineers, or at least enable everybody to become a software engineer, to use a more up-beat phrasing that's no less accurate. And so an LLM's ability to reliably construct something from a poorly defined, contradictory, or otherwise flawed prompt, while accurately inferring intent is probably the first finish line.

metadat

More likely is the models were trained on similar data.

throwawayffffas

It's the same prompt but they also gave them the same asset pack.

johnfn

I feel like on HN there is an endless cycle:

- Vibes are too subjective, I want an actual A/B test!

- An A/B test is too limited, I want a benchmark! (You are here.)

- Those benchmarks never seem to be reliable, I just go on vibes.

meander_water

> So we ran it head-to-head against Claude Opus 4.8: same one-shot prompt, build a 3D platformer in raw WebGL from scratch

Running a single one-shot prompt is not a benchmark, nor is it representative of any sort of real-world usage.

Most agent usage is collaborative so you need to test things like reliability (when I delegate a task, does it complete it without making up test results for e.g.) and steerability (does it obey my instructions or does it just do what it thinks is best).

jameswhitford

Hi, I am the author, I completely agree! I set out to run a vibe test on this one, not a benchmark, the real benchmarks are listed. My test shows what the models can do when both tasked with a long-running, technically difficult, one-shot task.

I think the test you describe (collaborative, task delegation, task completion, TTD, steerability) is a great format for a future test that I will definitely try out.

wongarsu

Tbf, most of the "real benchmarks" have issues that are just as bad. Assessing LLM performance is just hard

oceansky

And personal too. Different engineers are using them for different use cases.

meander_water

Thanks, I didn't mean to be brusque, but I have seen a lot of these vibe tests lately that come to grand conclusions like "X model is better than Y" from the result of a single prompt.

Appreciate you sharing the results of your tests though!

jameswhitford

I appreciate the feedback!

ramraj07

The important point is that your benchmark is pretty much irrelevant for the actual usage. Thus whatever conclusion you draw is not just irrelevant but misleading.

esperent

On the other hand, I did just leave my pi agent running GPT 5.5 overnight on a clearly defined, long running task. It's been running about 10 hours now and it's mostly done. So this kind of use case is also valid.

Thinking about it, I would say that the majority of agentic work I do, by a long shot, is subagents which are launched from the main session, using a prompt of its choosing. Those could be considered short versions of these fully autonomous tasks.

jameswhitford

Yes, part of the reason I chose the one-shot test was really to test long-running tasks. A lot of people seem to be experimenting with this format, for example in the now trending loop-writing workflows. And really I am interested in diving into the murky waters of these novel workflows.

thunspa

Care to share more about your pi setup? I've recently started using it (after long-time Claude Code work) and was wondering how you'd achieve these long-running tasks. Do you allow it to spawn sub-agents? Thank you!

esperent

My pi usage over the past ~5 months went roughly like this:

* Install pi and a bunch of extensions from their package repo

* Realize that all the packages (with a few exceptions) are massively overcomplicated and vibe coded

* Ask pi to rebuild a very simple version of the packages I used. So e.g. subagents - all the default subagent extensions are massively complicated with named agents, recursion, communication. I made one that stripped all that out.

* Then whenever I hit an annoyance, spin up a parallel session and fix it.

It's less work than it appears because I have ~5 extensions: hooks, subagents, background processes, a custom footer, a loop command... Maybe that's it. Within a couple of days you can have a setup pretty close to Claude Code but with a fraction of the base context use. After gradual improvements over a few weeks/months you'll have a system far better, tuned to your exact preference.

Of course, just like Linux or any other highly tunable system equally important is having the restraint to not spend all your time tuning it. I've definitely had a couple of days where I was bored with my real work and did that, but whatever, it beats browsing reddit.

As for getting long running tasks, I set a looping message every ~20m and tell the agent to strictly track progress in a session doc, then reread and continue after each compaction.
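
A minimal sketch of the session-doc half of that setup, under stated assumptions: the checklist format, the file name, and the `record_progress` / `build_reminder` helpers are hypothetical, and nothing here is pi's actual extension API. Sending the message every ~20m is left to whatever harness is in use.

```python
import tempfile
from pathlib import Path


def record_progress(doc, step, done):
    """Append one checklist line to the session doc."""
    mark = "x" if done else " "
    with doc.open("a", encoding="utf-8") as handle:
        handle.write(f"- [{mark}] {step}\n")


def build_reminder(doc):
    """Build the periodic message from whatever the doc currently says."""
    lines = doc.read_text(encoding="utf-8").splitlines() if doc.exists() else []
    # "- [ ] " is six characters, so the step text starts at index 6.
    pending = [line[6:] for line in lines if line.startswith("- [ ] ")]
    if not pending:
        return "All tracked steps are done. Verify the result and stop."
    return f"Reread {doc.name}, then continue with: {pending[0]}"


with tempfile.TemporaryDirectory() as tmp:
    session_doc = Path(tmp) / "session.md"
    record_progress(session_doc, "port the parser", done=True)
    record_progress(session_doc, "port the renderer", done=False)
    print(build_reminder(session_doc))
```

The point of keeping state in a file rather than in the conversation is that it survives compaction: the reminder is rebuilt from the doc each time, not from the agent's memory.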

segmondy

One shot prompt means you give the model an input and you get an output, done. This was not a one shot prompt, but an agentic task, as shown by the tool calls.

ritzaco

sure that's why we look at a mix of formal benchmarks, one longer analysis of a side-by-side, and various other people who we trust to form an opinion, all covered in the article - not intended to be a formal benchmark, there are enough of those.

patates

Then maybe you should add that caveat emptor to the article?

You make a very strong claim at the end that the hype is mostly real, and making it clear to what extent your claim holds should help the reader.

unliftedq

Totally agree, a single one-shot prompt can't prove anything.

habosa

At work we use Anthropic models and have basically no limits. So I am very familiar with what Opus can do. I also see the bills, I know what it costs.

At home I make a point of trying other models / tools on my side projects. So I've been using OpenCode and trying tons of models via OpenRouter. I tried Kimi, Deepseek, MiMo, etc.

GLM 5.2 is a _major_ step up from every other non-GPT/Claude/Gemini model I've tried. It's not as good as latest Claude Opus, but it feels every bit as good as Opus from ~4 months ago at a fraction of the price.

To me this model is the "it just works" moment for open weights models. We had this for closed weights models in late 2025 when Opus 4.5 landed. This is the same feeling I'm having with GLM 5.2. It's 90% as good as what I get from Anthropic for 1/5th of the cost and without any concern of lock-in.

cromka

And you can use it in complete privacy if you so need.

xlii

I've been checking out GLM 5.2 on some projects and few thoughts on it:

- it takes its sweet time to get code rolling, not the fastest model by any means

- it strays a lot during discovery/planning but then corrects

- it's not steering friendly, as it hallucinates things that it doesn't follow later on

- its output is quite good

A sample use case: I was optimizing rendering on a Swift+Zig codebase. It choked on 5k data entries.

GLM 5.2 spent 20 minutes building the benchmarks and getting data out, which made me frustrated so I blocked non-editing tool access and went AFK, after approx. 30 minutes I found that it used already-made benchmarks and some "conclusions" to optimize 3 choke points. Output pointed that it couldn't validate suspicions and asked for more data.

Implementation worked well, it was idiomatic and non-intrusive. I would even say that it was more idiomatic than GPT 5.5 effects on same repo.

I would opt in to using it more, BUT GPT usually completes the same requests 5x faster.

GLM 5.2 was the spark for preparing and running inside isolated containers with JJ workspaces (so that multiple can be run in parallel).

jeremyjh

It's also nice that you can see its entire reasoning trace. I can see it going off the rails - or see something I forgot to tell it - and stop and correct it. Or I'll learn WHY it made the choice it did and not have to question it after.

jauntywundrkind

Strong agree! I deeply appreciate this aspect of GLM. Watching it think & being able to nudge early is incredibly useful. Being able to point at bad assumptions is incredibly useful. Watching what it's seeing is super informative.

It's always a shock to me how opaque most other models are!

It's also pretty resilient when you inject something while it's working: it doesn't go off course, or it gets back on track after, which I appreciate.

Sanzig

> It's always a shock to me how opaque most other models are!

This is (unfortunately) by design. The proprietary models hide their reasoning traces so they can't be used for model distillation. Sometimes even when they do show reasoning, it isn't the model's real trace - IIRC, someone was able to demonstrate that Opus' reasoning is usually a summary made with Haiku behind the scenes.

trollbridge

I used it the other day for something of low importance that other models simply weren't figuring out and I didn't want to burn up Opus 4.8 on. (It had to do with overriding left-click on a macOS menu bar and then making Ctrl+click or right click bring up the menu like left-click normally does, and doing all this conditionally.)

Switched the model to GLM-5.2 halfway in the middle of a troubleshooting session (didn't even bother to reprompt, just changed it in the middle of its reasoning), gave it a few minutes, problem fixed. This is with the subscription based allocation on OpenCode Go, where a problem like this would completely burn up my Opus for the current 5 hours or even the current week.

nijave

>it takes its sweet time to get code rolling, not the fastest model by any means

Which provider are you using? I got a z.ai Lite Coding Plan and it's my understanding z.ai is on the slower side of providers and the Lite plan gets lower priority on top of that. In the api key console, it shows dipping below 60 tok/sec which is quite slow.

xlii

I have Max access from a friend. It's not about token generation but time-to-first-edit. It tends to think 3-10 minutes before that.

Oras

Also pricing, I wanted to give it a try, but when pricing is only 30% cheaper than Opus, I wouldn't go for it with these issues.

nijave

z.ai coding plan is a fairly decent deal at ~$16/mon USD considering it's supposed to have a fair bit more usage than the comparable $20/mon Claude plan. On the other hand, z.ai seems a bit on the slower side for raw model tok/sec throughput.

chpatrick

Its pricing is a lot cheaper if you can run it yourself.

nijave

Not this one. It's a SOTA-class model >800Gi VRAM required at fp8

jeremyjh

What?

It is less than 20% of the cost of Opus at API rates. 1.40/4.40 vs 5/25.

cmrdporcupine

Not when you factor in token efficiency. It burns a lot more tokens to do the same job, so when I compared to GPT5.5 I was frankly not really much ahead, and with weaker thinking.

Maybe makes sense if you have z.AI's (not greatly priced) subscription plan, but it's not competitive against an OpenAI or Anthropic monthly coding subscription plan. I burned through almost $10 worth of tokens just doing an hour of work.

Imanari

This mirrors my experience. I have been using it in Pi. It is smart and output is good but it is not efficient in getting there.

ju-st

which thinking level? max or high?

faxmeyourcode

I feel like another comparison worth looking at is purely cost.

Capability per dollar is something I care about:

    Opus API    $5/$25
    Sonnet API  $5/$15
    Haiku API   $1/$5

    GLM 5.2 API $1.4/$4.4

So you're really getting near Opus level capability for the price of Haiku.

cmrdporcupine

Not really, GLM uses more tokens to get work done.

throwawayffffas

In the article, they claim GLM used almost half the tokens 131,000, and the cost is about a quarter. For the cost to be the same GLM would have to use 4-5 times more tokens.
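
The "4-5 times" figure can be checked with a quick sketch using the per-million-token prices quoted elsewhere in this thread (1.40/4.40 for GLM 5.2 vs 5/25 for Opus). The function name and the assumption that both models see the same input/output mix are mine, not the article's.

```python
def break_even_multiplier(output_share, cheap=(1.4, 4.4), pricey=(5.0, 25.0)):
    """How many times more tokens the cheaper model can use before costing the same.

    output_share is the fraction of tokens that are output tokens; both models
    are assumed to have the same mix. Prices are (input, output) in USD per
    million tokens, taken from the figures quoted in this thread.
    """
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    blended_cheap = (1 - output_share) * cheap[0] + output_share * cheap[1]
    blended_pricey = (1 - output_share) * pricey[0] + output_share * pricey[1]
    return blended_pricey / blended_cheap


for share in (0.0, 0.25, 0.5, 1.0):
    multiplier = break_even_multiplier(share)
    print(f"output share {share:.2f}: break-even at {multiplier:.2f}x tokens")
```

Under those prices the break-even sits between roughly 3.6x (all input) and 5.7x (all output), so the extra token usage would have to be large before the cheaper model stops being cheaper.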

dymk

By how much? At least TFA provided numbers for one example, and they disagree with you (by a lot).

wiremine

I ran a fairly large experiment last week, and the token usage wasn't bad at all. What sorts of use cases are you seeing large token usage by GLM 5.2?

w0m

> are you seeing large token usage by GLM 5.2

the statement isn't "GLM 5.2 has large token usage", it's "GLM 5.2 has large token usage vs modern Opus".

I haven't used it, but this wouldn't surprise me. I see ~30% lower token usage for better results with Opus 4.8 vs 4.6 (and i had great results with 4.6)

lukaslalinsky

I was never able to get these models to collaborate with me the way Opus does. I'm probably an outlier, I don't one-shot projects, I don't vibe code. I basically use LLMs as if I were working with a coworker, a fairly smart one, but with short memory and often missing the big picture. Sometimes I can delegate more, sometimes less, but I know I always have to stay on top of what's happening, because it WILL create a mess when it hits something hard. With the Anthropic models, this kind of cooperation is easy (with the exception of Opus 4.6, which was bad for some reason).

Terretta

> Opus 4.6 which was bad for some reason

If I recall, that model had a couple issues. One was the issue of being monkeyed with, for which they gave everyone credits.

The other feature/bug, depending on your POV, was being Anthropic's least personable release, not papering over everything with self help guru therapy language.

Opus 4.6 didn't LARP. It was more direct, less fussy, less discussy, and very much less "wait, one more thing" within a couple edits after embarking on what should have been the spec, than 4.7 or 4.8 are.

When in engineer brain mode, working as you describe (good old fashioned XP-style staff engineer pair programming with a language-savvy mentee not yet full-stack or system wise), I found the clearer I was about my goal and the better I could express it, the more often I'd get an expanded clarified response I could then iterate to steer for ever tighter cleaner more specified responses, then let it go build the whole thing without it agonizing and waffling.

The next two releases regressed on that dimension, wanting to figuratively "sit with" every decision and re-validate spiritual alignment along the way, no matter how clearly expressed.

Curiously to me, Fable seemed to hit the best of both worlds, I had the highest commit per turn with Fable, approaching 73%, where I'm usually under 17% of LOC written being good enough to commit, usually taking 9 - 11 turns to get the code where I'm comfortable with it.

Thanks to this, Fable cost more, but actually cost less, if that makes sense.

Arguably, Fable, and 4.6, played more outcome-correctness oriented than journey-experience oriented. It's easy to see how this could happen with human reinforced learning if not all judges are staff or principal engineer level, or constitution values are more Portlandia than Finlandia.

ANTHROP\C needs to balance these at the constitution level:

“We will work in a humane and thoughtful way, but production is the final judge. We will listen to people, but we will not let discussion replace decision. We will value craft, but not at the expense of usefulness. We will move fast, but not by hiding risk. We will measure outcomes, but not pretend that everything important is easy to measure.”

lukaslalinsky

I considered Opus 4.5 to be the peak for a while. Opus 4.6 tended to overthink, and generally get lost in thinking. I asked something and Claude Code would just spin for 15 minutes. And it was not the harness, if I changed the model to 4.5, it was fine again. So I skipped the following releases. I've been working with Opus 4.8 the last weeks and while I don't like how talkative it is, it is fine to work with interactively. I've also used Fable for the few days it was available, and indeed, that was a model worth using for my use case. To the point, but still very interactive.

x312

A lot of open weight models don't understand intent well, they'll overfixate on a word in the prompt or just go off the rails trying to do too much work.

GLM-5.2 actually has really good intent understanding though, on par with GPT-5.5 and Opus from my experience.

lukaslalinsky

I'll have to try it. I was using earlier GLM models, including 5.1, and was always disappointed.

therealdrag0

What do they do instead of collaborating?

jameson

> Opus 4.8 built in Claude Code; GLM-5.2 built in Pi over OpenRouter.

It would be more interesting and accurate to see the comparison on the same harness if the intent is to compare the frontier models.

Pi is relatively new and does not have many features built-in compared to Claude Code. It was built this way intentionally, as Pi's goal is not to ship a bloated built-in set of tools most don't use but to allow users to customize it to fit their needs -- similar to Neovim vs an IDE.

The end-user "vibe coding" experience is *heavily* swayed by the harness because the prompt effectively drives how a model outputs an answer.

postatic

I've signed up with Ollama to experiment with these open source models. For the past 3 months, it's just been experimenting, trying it out. GLM is the first model that I am using on a daily basis to do my coding work (as well as using Claude). It's good - I've been maxing out my Ollama usage limits everyday :)

jameswhitford

Cool to hear, what kind of tasks have you been using GLM for? And what other models have you found useful through Ollama?

toddmorey

I’m actually amazed at the output since GLM doesn’t have eyes. If GLM 5.2 costs 1/5 as much, seems like it could be set up to reach out to a multimodal model for vision tasks when required. Closer to parity but probably still significantly cheaper.

horsawlarway

I'm also very impressed at the output given the lack of image support.

They picked a task that heavily favors a model that can do multi-modal with images, and GLM still came within striking distance.

What I'm hearing from this article is that the next generation of open models that includes better multi-modal support are basically no-brainers for adoption.

Seems like a HUGE win for Z.ai and open models in general here.

killingtime74

Yes, it could just make one call to a multimodal llm to describe the scene

ulrikrasmussen

> Through an API it costs a fraction of Opus, and you can run it yourself for free if you have the hardware.

I haven't been keeping up on hardware costs for state of the art LLM inference, but this remark made me ask myself how many readers of the article would actually be able to run this model on hardware they own. How much would it cost to acquire such a setup?

trollbridge

GLM-5.2 performing like it would from a good provider - 8x B200s, so $450k. (No personal experience here)

GLM-5.2, severely quantised, 512GB Mac Studio, somewhere between $10k-$35k for a used M3. Or run it on a CPU with 768GB of RAM by getting an old PowerEdge with DDR4 for around $5,000.

Qwen-3.6-35b-q6 runs well on an RTX 5090 ($4000 + cost of a PC), and runs mediocrely on an Intel Arc B70 ($1000 + cost of a PC, plus lots of fiddling to get the setup to work right).
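As a rough sanity check on why a 35B model at q6 fits on that card, here is a minimal sketch of the usual back-of-the-envelope estimate: parameters times bits per weight, divided by 8. It assumes a nominal 6 bits per weight (real quantisation formats carry some overhead) and counts weights only, so the KV cache and activations come on top. The function name is made up for illustration.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes.

    Ignores KV cache, activations, and per-block quantisation overhead,
    so treat the result as a lower bound.
    """
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# 35B parameters at a nominal 6 bits per weight (assumption, see above).
qwen_gb = weight_memory_gb(35, 6)
print(f"{qwen_gb:.2f} GB of weights vs 32 GB of VRAM on an RTX 5090")
```

That leaves a few GB of headroom for context on a 32 GB card, which is consistent with "runs well" but not with very long contexts.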

Gemma is a good candidate for the cheaper stuff, but I lack personal experience with using it locally

trollbridge

For anyone reading this, GLM-5.2 is actually a lot more accessible than that on the 1 or 2 bit quantised models - see https://unsloth.ai/docs/models/glm-5.2

Basically a 1x 24GB GPU (32GB would be better) plus 256GB of free system RAM, or a 256GB unified memory machine (like a Mac).

Kind of shocked they got the results they did.

jack_pp

Framing local LLMs as free is stupid. Paying 100+ months' worth of API costs up front isn't free in the slightest. And it will be slower than non-local, your hardware will be outdated in 12 months, and it probably won't be able to run SOTA at anywhere near non-local speed in 20 months at most.
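The "100+ months" figure is just hardware cost divided by monthly API spend. A minimal sketch, assuming the $10k low-end Mac Studio estimate from upthread and a hypothetical $100/month API bill (that spend is an assumption, not a number from the article). It ignores electricity, resale value, and the speed difference.

```python
def break_even_months(hardware_cost: float, monthly_api_spend: float) -> float:
    """Months of API usage that the up-front hardware cost would have paid for."""
    if monthly_api_spend <= 0:
        raise ValueError("monthly_api_spend must be positive")
    return hardware_cost / monthly_api_spend


# $10k hardware (low end of the upthread estimate), $100/month API (hypothetical).
months = break_even_months(10_000, 100)
print(f"Break-even after about {months:.0f} months")
```

Heavier API users break even much sooner, which is why the answer depends so strongly on utilisation.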

ulrikrasmussen

Yeah, it glosses over a gigantic capital expenditure. It's sort of like saying that an open source modern CPU architecture allows you to build your own CPU "for free" (provided that you own and operate a fab).

cicko

True. But there are other meanings of "free". I.e. nobody can say "from now on you no longer have access to model X because you're an asshole"

trollbridge

Some obvious examples of why you'd want to spend the capital on this: you're building some kind of autonomous system that needs to periodically be offline, or you need complete confidentiality about what you're using the model for, etc.

To be cost effective with inference providers, you have to find some way to be using it 24/7.

Der_Einzige

The ecosystem for inference is centralized around a few core projects, i.e. vLLM, SGLang, and llama.cpp.

If they decided to collude, they could absolutely say "from now on you no longer have access to model X because you're an asshole"

The commercial inference offerings are also downstream of one of those 3 projects (or TRT-LLM if they're Nvidia). It would impact Ollama, Fireworks, Together, and everyone else.

Don't tempt fate.

throwaway219450

Hardware outdated in 12 months is FUD. What that would mean in practice would be either affordable consumer GPUs with > 32GB of VRAM, which doesn't look like it's going to happen, or unified memory systems with much higher bandwidth. That also seems unlikely.

You're better off setting a budget and buying the best machine you can afford in that range, or picking a VRAM target and accepting the class of models you can run on it. Those models will almost certainly improve over time and your skills will adapt to the limitations. Hardware is so valuable right now that it's not even likely to be a significant loss if you had to sell.

Right now I think 24 GB is probably the best bang for your buck (used 3090), because you also get a high end gaming/gpgpu device which is nice anyway. 32 GB you can do with AMD or Intel, but NVIDIA is megabucks and at this point you're really paying for RAM. Unfortunately the ship has sailed on "reasonably" priced RTX 6000s, which at one point were about $7k and are being listed at $10k++.

yieldcrv

and I think asking for a whole dissertation on the hardware demands every single time is stupid. the point is that even people and organizations with capital couldn't do this before either.

if it doesn't apply to you then just come back in a couple years and see what the situation is then. 1 million context window, 1 million tiny layers to fit in 4gb RAM at a time, with 256gb of fast unified RAM in every consumer device? Or a different concept entirely

in the meantime, z.ai probably doesn't reply to US subpoenas so you can shift all your incriminating conversations over to that and use GLM anyway. who cares if the Party trains on your data and steals your IP and ignores you for legal matters, when the alternative in the US is just a thin corporate layer and party who steals your IP and will snitch on you for legal matters.

bestouff

The price of a small house.

crimsoneer

Practically nobody.

0xbadcafebee

Flaws in this test setup:

  - A zero-shot prompt, run once (in total)
  - No planning run (which improves output)
  - Different coding harnesses & system prompts
  - Unknown provider for GLM (there are 15 different GLM-5.2 providers with varying quality & latency)
  - No documentation of thinking effort level
  - No vision model supplement (you can provide a subagent w/a vision model)

You can't take this comparison seriously. There were many different variables, no control, no repeat test. It's as useful a comparison as picking a random tweet with both models' names.
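For the "run once" and "no repeat test" points, a minimal sketch of what a repeated comparison could look like: run the same prompt several times per model and report the spread, not a single sample. `run_once` is a hypothetical callable standing in for "run the harness and score the result"; `fake_run` and its scores are made up purely so the snippet runs.

```python
import statistics
from typing import Callable


def summarize_trials(run_once: Callable[[int], float], n_trials: int = 5) -> dict:
    """Run the same task n_trials times and summarise the scores."""
    scores = [run_once(seed) for seed in range(n_trials)]
    return {
        "n": n_trials,
        "mean": statistics.mean(scores),
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(scores) if n_trials > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


def fake_run(seed: int) -> float:
    # Placeholder scores (hypothetical); a real version would call the model.
    return [0.6, 0.8, 0.7, 0.9, 0.5][seed % 5]


summary = summarize_trials(fake_run, n_trials=5)
print(summary)
```

If the min-to-max range within one model is wider than the gap between the two models, a single run tells you very little.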

xg15

So GLM emits fewer tokens and does fewer tool calls, but still takes over twice as long to complete.

Can someone explain to me where that time usage is coming from if not from the model operation itself?

Are the individual tool calls more complex and take more time to complete? Or is the rate of tok/s lower because the model does more compute per token?
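One way to frame the question: wall-clock time is roughly output tokens divided by generation speed, plus time spent waiting on tools. A minimal sketch with entirely hypothetical numbers (none of them come from the article) showing how a model can emit fewer tokens and still take over twice as long if its provider serves it at a lower tok/s.

```python
def wall_time_seconds(
    output_tokens: int, tokens_per_second: float, tool_seconds: float = 0.0
) -> float:
    """Rough wall-clock estimate: generation time plus time waiting on tools."""
    return output_tokens / tokens_per_second + tool_seconds


# Hypothetical figures for illustration only.
fast = wall_time_seconds(100_000, 80.0, tool_seconds=120.0)  # more tokens, fast serving
slow = wall_time_seconds(60_000, 20.0, tool_seconds=90.0)    # fewer tokens, slow serving

print(f"fast: {fast:.0f}s, slow: {slow:.0f}s, ratio: {slow / fast:.2f}x")
```

This simple model leaves out prompt processing and queueing latency, both of which also depend on the provider.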

iagooar

I have noticed that Opus and GPT 5.5 are very good at adjusting their thinking / reasoning intensity depending on the task at hand, something the open weights models are still not as good at.

In addition to that, some of the open weights models like GLM 5.2 or DeepSeek v4 Pro tend to be MUCH slower when generating tokens, which contributes to the perceived slowness. Although I wouldn't call models like GLM 5.2 slow by any means, e.g. it is currently one of the fastest models inside Notion today.

twobitshifter

Probably the data center where the model is running more than anything. Another option is if Opus is using anything like a Mixture of Experts approach, in which case the amount of the model loaded in memory at one time could be smaller than GLM.

radu_floricica

Could just be infra. I'm betting Anthropic is much better prepared.
