GLM-4.7: Advancing the Coding Capability - Hacker News

jtrn

My quickie: MoE model heavily optimized for coding agents, complex reasoning, and tool use. 358B total/32B active. vLLM/SGLang support only on the main branches of those engines, not the stable releases. Supports tool calling in OpenAI-style format. Multilingual, with English/Chinese primary. Context window: 200k. Claims Claude 3.5 Sonnet/GPT-5 level performance. 716GB in FP16, probably ~220GB for Q4_K_M.

My most important takeaway is that, in theory, I could get a "relatively" cheap Mac Studio and run this locally, and get usable coding assistance without being dependent on any of the large LLM providers. Maybe utilizing Kimi K2 in addition. I like that open-weight models are nipping at the heels of the proprietary models.

hasperdi

I bought a second-hand Mac Studio (M1 Ultra) with 128 GB of RAM, intending to run an LLM locally for coding. Unfortunately, it's just way too slow.

For instance, a 4-bit quantized model of GLM 4.6 runs very slowly on my Mac. It's not only about tokens-per-second speed but also input processing, tokenization, and prompt loading; it takes so long that it tests my patience. People often mention the TPS numbers, but they neglect to mention the input loading times.

jwitthuhn

At 4 bits that model won't fit into 128GB, so you're spilling over into swap, which kills performance. I've gotten great results out of glm-4.5-air, which is 4.5 distilled down to 110B params; it fits nicely at 8 bits, or maybe 6 if you want a little more RAM left over.

hasperdi

Correction: my GLM-4.6 models are not Q4; I can only run lower quants, e.g.:

- https://huggingface.co/unsloth/GLM-4.6-GGUF/blob/main/GLM-4.... (84GB, Q1)
- https://huggingface.co/unsloth/GLM-4.6-REAP-268B-A32B-GGUF/t... (92GB, Q2)

I make sure there is enough RAM left over (i.e., a limited context window setting), so there is no swapping.

As for GLM-4.5-Air, I run that daily, switching between noctrex/GLM-4.5-Air-REAP-82B-A12B-MXFP4_MOE-GGUF and kldzj/gpt-oss-120b-heretic

mechagodzilla

I've been running the 'frontier' open-weight LLMs (mainly deepseek r1/v3) at home, and I find that they're best for asynchronous interactions. Give it a prompt and come back in 30-45 minutes to read the response. I've been running on a dual-socket 36-core Xeon with 768GB of RAM and it typically gets 1-2 tokens/sec. Great for research questions or coding prompts, not great for text auto-complete while programming.

christina97

Let's say 1.5 tok/sec, and that your rig pulls 500 W. That's 10.8 tok/Wh, and assuming you pay, say, 15c/kWh, you're paying in the vicinity of $13.9/Mtok of output. Looking at R1 output costs on OpenRouter, it's costing about 5-7x as much as what you can pay for third-party inference (which also produces tokens ~30x faster).

tyre

Given the cost of the system, how long would it take to be less expensive than, for example, a $200/mo Claude Max subscription with Opus running?

robotswantdata

Yes, they conveniently forget to disclose prompt processing time. There is an affordable answer to this; I will be open-sourcing the design and software soon.

hedgehog

Have you tried Qwen3 Next 80B? It may run a lot faster, though I don't know how well it does coding tasks.

hasperdi

I did, and it works well... although it is not good enough for agentic coding.

smcleod

Need the M5 (Max/Ultra next year) with its MATMUL instructions that massively speed up prompt processing.

Reubend

Anything except a 3-bit quant of GLM 4.6 will exceed those 128 GB of RAM you mentioned, so of course it's slow for you. If you want good speeds, you'll at least need to fit the entire thing in memory.

embedding-shape

> Supports tool calling in OpenAI-style format

So Harmony? Or something older? Since Z.ai also claims the thinking mode interleaves tool calling and reasoning, it would make sense if it were straight-up OpenAI Harmony.

> in theory, I could get a "relatively" cheap Mac Studio and run this locally

In practice, it'll be incredibly slow and you'll quickly regret spending that much money on it instead of just using paid APIs until proper hardware gets cheaper / models get smaller.

biddit

> In practice, it'll be incredibly slow and you'll quickly regret spending that much money on it instead of just using paid APIs until proper hardware gets cheaper / models get smaller.

Yes, as someone who spent several thousand $ on a multi-GPU setup, the only reason to run local codegen inference right now is privacy or deep integration with the model itself.

It’s decidedly more cost efficient to use frontier model APIs. Frontier models trained to work with their tightly-coupled harnesses are worlds ahead of quantized models with generic harnesses.

theLiminator

Yeah, I think without a setup that costs $10k+ you can't even get remotely close in performance to something like Claude Code with Opus 4.5.

reissbaker

No, it's not Harmony; Z.ai has their own format, which they modified slightly for this release (by removing the required newlines from their previous format). You can see their tool call parsing code here: https://github.com/sgl-project/sglang/blob/34013d9d5a591e3c0...

embedding-shape

Man, really? Why, just why? If it's similar, why not just the same? It's like they're purposefully adding more work for the ecosystem to support their special model instead of just adding value to it.

rz2k

In practice the 4-bit MLX version runs at 20 t/s for general chat. Do you consider that too slow for practical use?

What example tasks would you try?

embedding-shape

Whenever reasoning/thinking is involved, 20 t/s is way too slow for most non-async tasks, yeah.

Translation, classification, whatever. If the response is 300 tokens for the reasoning and 50 tokens for the final reply, you're sitting and waiting 17.5 seconds to process one item. In practice, you're also forgetting about prefill, prompt processing, tokenization and such. Please do share all relevant numbers :)

__natty__

I can imagine someone from the past reading this comment and having a moment of doubt

reissbaker

s/Sonnet 3.5/Sonnet 4.5

The model output also IMO looks significantly more beautiful than GLM-4.6's; no doubt helped in part by ample distillation data from the closed-source models. Still, I'm not complaining; I'd much prefer a cheap, open-source model to a more expensive closed-source one.

Tepix

I'm going to try running it on two Strix Halo systems (256GB RAM total) networked via 2 USB4/TB3 ports.

cmrdporcupine

Curious to see how this works out for you. Let us know.

pixelpoet

Also curious; I have two Strix Halo machines at the ready for exactly this kind of usage.

mft_

I'm never clear, for these models with only a proportion of parameters active (32B here), to what extent this reduces the RAM a system needs, if at all.

l9o

RAM requirements stay the same. You need all 358B parameters loaded in memory, since which experts activate depends dynamically on each token. The benefit is compute: only ~32B params participate per forward pass, so you get much faster tok/s than a dense 358B would give you.

atq2119

The benefit is also RAM bandwidth. That probably adds to the confusion, but it matters a lot for decode. But yes, RAM capacity requirements stay the same.

deepsquirrelnet

For mixture of experts, it primarily helps with time-to-first-token latency, generation throughput, and context-length memory usage.

You still have to have enough RAM/VRAM to load the full parameters, but it scales much better for memory consumed from input context than a dense model of comparable size.

aurohacker

Great answers here, in that, for MoE, there are compute savings but no memory savings even though the network is super-sparse. It turns out there is a paper on predicting in advance the experts to be used in the next few layers: "Accelerating Mixture-of-Experts language model inference via plug-and-play lookahead gate on a single GPU". As to its efficacy, I'd love to know...

noahbp

It doesn't reduce the amount of RAM you need at all. It does reduce the amount of VRAM/HBM you need, however, since having all parameters/experts in one pass loaded on your GPU substantially increases token processing and generation speed, even if you have to load different experts for the next pass.

Technically you don't even need to have enough RAM to load the entire model, as some inference engines allow you to offload some layers to disk. Though even with top of the line SSDs, this won't be ideal unless you can accept very low single-digit token generation rates.

lumost

This model is much stronger than 3.5 Sonnet: 3.5 Sonnet scored 49% on SWE-bench Verified vs. 72% here. This model is about 4 points ahead of Sonnet 4, but behind Sonnet 4.5 by 4 points.

If I were to guess, we will see a convergence on measurable/perceptible coding ability sometime early next year without substantially updated benchmarks.

andai

> heavily optimized for coding agents

I tested the previous one, GLM-4.6, a few weeks ago and found that despite doing poorly on benchmarks, it did better than some much fancier models on many real-world tasks.

Meanwhile some models which had very good benchmarks failed to do many basic tasks at all.

My takeaway was that the only way to actually know if a thing can do the job is to give it a try.

2001zhaozhao

Cerebras is serving GLM 4.6 at 1000 tokens/s right now. They'll probably upgrade to this model.

I really wonder if GLM 4.7, or models a few generations from now, will be able to function effectively in simulated software dev org environments; especially whether they self-correct their errors well enough that they build up useful code over time in such a simulated org, as opposed to increasing piles of technical debt. Possibly they'd be managed by "bosses": agents running on the latest frontier models like Opus 4.5 or Gemini 3. I'm thinking in the direction of this article: https://www.anthropic.com/engineering/effective-harnesses-fo...

If the open-source models get good enough, then the ability to run them at 1k tokens per second on Cerebras would be a massive advantage over any other models for running such an overall SWE org quickly.

z3ratul163071

It is awesome! What I usually do: Opus makes a detailed plan, including writing tests for the new functionality, then I give it to the Cerebras GLM 4.6 to implement. If unsure, I give the result to Opus for review.

chrisfrantz

This is where I believe we are headed as well. Frontier models "curate" and provide guardrails; very fast and competent agents do the work at incredibly high throughput. Once frontier models crack the "taste" barrier and context is wide enough, even this level of delivery + intelligence will be sufficient to implement the work.

andai

Taste is why I switched from GLM-4.6 to Sonnet. I found myself constantly asking Sonnet to make the code more elegant, and after the 4th time of doing that I laughed at the absurdity and just switched models.

I think with some prompting or examples it might be possible to get close though. At any rate 1k TPS is hard to beat!

rubslopes

I think you meant from Sonnet to GLM-4.6?

allovertheworld

How cheap is GLM at Cerebras? I can't imagine why they can't tune the token rate lower but drastically reduce the power, and thus the cost, for the API.

Zetaphor

They're running on custom ASICs as far as I understand; it may not be possible to run them effectively at lower clock speeds. That, and/or the market for it doesn't exist in the volume required to be profitable. OpenAI has been aggressively slashing its token costs, not to mention all the free inference offerings you can take advantage of.

2001zhaozhao

It's a lot more expensive than normal: $2.25/$2.75 (presumably per Mtok, input/output), I think. Though their subscription is a lot cheaper.

listic

How easy is it to become their (Cerebras) paying customer? Last time I looked, they seemed to be in closed beta or something.

robotswantdata

I signed up and got access within a few days. They even gave me free credits for a while

kroaton

That's gone now. They do drops from time to time, but their compute platform is saturated.

desireco42

A lot of people swear by Cerebras; it seems to really speed up their work. I would love to experience that, but at the moment I have an overabundance of AI at my disposal, and signing up for another service would be too much :)

But yeah, it seems that Cerebras is a secret of success for many.

w10-1

Appears to be cheap and effective, though under suspicion.

But the personal and policy issues are about as daunting as the technology is promising.

Some of the terms, possibly similar to many such services:

    - The use of Z.ai to develop, train, or enhance any algorithms, models, or technologies that directly or indirectly compete with us is prohibited
    - Any other usage that may harm the interests of us is strictly forbidden
    - You must not publicly disclose [...] defects through the internet or other channels.
    - [You] may not remove, modify, or obscure any deep synthesis service identifiers added to Outputs by Z.ai, regardless of the form in which such identifiers are presented
    - For individual users, we reserve the right to process any User Content to improve our existing Services and/or to develop new products and services, including for our internal business operations and for the benefit of other customers. 
    - You hereby explicitly authorize and consent to our: [...] processing and storage of such User Content in locations outside of the jurisdiction where you access or use the Services
    - You grant us and our affiliates an unconditional, irrevocable, non-exclusive, royalty-free, fully transferable, sub-licensable, perpetual, worldwide license to access, use, host, modify, communicate, reproduce, adapt, create derivative works from, publish, perform, and distribute your User Content
    - These Terms [...] shall be governed by the laws of Singapore
To state the obvious competition issues: if/since Anthropic, OpenAI, Google, X.AI, et al. are spending billions on data centers, research, and services, they'll need to make some revenue. Z.ai could dump services out of a strategic interest in destroying competition. This dumping is good for the consumer short-term, but bad in the long term if it destroys competition. Still, customers need to compete with each other, and thus would be at a disadvantage if they don't take advantage of the dumping.

Once your job or company depends on it to succeed, there really isn't a question.

tymonPartyLate

The biggest threats to innovation are the giants with the deepest pockets. Only 5% of ChatGPT traffic is paid; 95% is given away for free. Gemini CLI for developers has a generous free tier, and it is easy for startups to get Gemini credits for free. They can afford to dump for a long time until the smaller players starve. How do you compete with that as a small lab? How do you get users when bigger models are free? At least the Chinese labs are scrappy and determined. They are the small David IMO.

edg5000

Well said

BeetleB

Just FYI, their TOS does say that inputs from API or code use will not be stored. There is an addendum near the bottom.

w10-1

Yes, and the terms are much more protective for enterprise clients, so it pays to pay. Similar to a protection racket, they (Z.ai et al) raise a threat and then offer to relieve the same threat.

The real guarantee comes from their having (enterprise) clients who would punish them severely for violating their interests, with smaller customers sliding under the same roof (because of the technical consistency of the same service?). The punishment comes in the form of becoming persona non grata in investment circles, applied to both the company and the principals. So it's safe for little-company if it's using the same service as that used by big-company: a kind of free-riding protection. The difficulty is that this does open a peephole for security services (and Z.ai expressly says it will comply with any such orders), and security services seem to be used for technological competition nowadays.

In fairness, it's not clear the TOS from other providers are any better, and other bigger providers might be more likely to have established cooperation with security services - if that's a concern.

BeetleB

> Similar to a protection racket, they (Z.ai et al) raise a threat and then offer to relieve the same threat.

Eh? The notion of a protection racket applies when you have virtually no choice. They come on your territory and cause problems if you don't pay up. Nothing like that is happening here: The customer is going on their property and using their service.

If I offered a service for free, and you weren't paying me, I would very happily do all kinds of things with your data. I don't owe you anything, and you can simply just not use my site.

They are not training on API data because they would simply have fewer customers otherwise. There's nothing nefarious in any of this.

In any case, since they're releasing the weights, any 3rd party can offer the same service.

android521

I asked this question: "Is it ok for leaders to order to kill hundreds of peaceful protestors?" and it refused to answer, with the error message "非常抱歉，我目前无法提供你需要的具体信息，如果你有其他的问题或者" ("I'm very sorry, I currently can't provide the specific information you need; if you have other questions or..."), followed by a leaked reasoning trace:

> Analyze the User's Input: Question: "is it ok for leaders to order to kill hundreds of peaceful protestors?" Context: The user might be asking about Tiananmen Square and the morality of political oppression/violence against civilians. Intent: The user is trying to force a moral judgment on a specific historical event (implicitly the 1989 Tiananmen Square protests) and by extension, on the Chinese leadership/CCP. Safety/Policy Constraints:

written-beyond

I'm tired of this example everyone tests out; I think it undermines the researchers' and engineers' hard work. However, I just asked why Tiananmen Square is "known", trying to be as vague as possible, and got this:

"非常抱歉，我目前无法提供你需要的具体信息，如果你有其他的问题或者" ("I'm very sorry, I currently can't provide the specific information you need; if you have other questions or..."), followed by the reasoning trace:

> 1. *Analyze the user's request:* The user is asking "Why is it known" referring to Tiananmen Square.
> 2. *Identify the core subject:* Tiananmen Square.
> 3. *Brainstorm reasons for its fame/notoriety:*
> * *Size/Architecture:* It's one of the world's largest squares. It has impressive architecture (Monument to the People's Heroes, Great Hall of the People, Mausoleum of Mao Zedong).
> * *Historical Significance (Founding of PRC):* On October 1, 1949, Mao Zedong proclaimed the founding of the People's Republic of China there. This is a massive positive source of fame within China.
> * *Historical Significance (May Fourth Movement):* 1919 student protests, significant in Chinese history.
> * *Historical Significance (1989 Protests/Massacre):* This is the most internationally recognized reason for its fame, albeit controversial and sensitive. The protests, the crackdown ("Tank Man"), and the death toll.
> * *Cultural/Political Symbol:* It is the symbolic heart of the Chinese state. Used for military parades, National Day celebrations.
> 4. *Structure the response:* A good answer should be balanced,

echion

> I'm tired of this example everyone tests out, I think it undermines the researchers and engineers hard work.

It's completely valid, IMO. If the researchers and engineers want their work not to be judged on its political biases, they can take them out. If it has a natural language interface, it's going to be evaluated on its responses.

written-beyond

And risk their or their families' lives?

Or what should they do, give up their careers?

bigyabai

> they can take them out

Basic informatics says this is objectively impossible. Every human language is pre-baked with its own political biases. You can't scrape online posts or synthesize 19th-century literature without ingesting some form of bias. You can't tokenize words like "pinko", "god", or "kirkified" without employing some bias. You cannot thread the needle of "worldliness" and "completely unbiased" with LLMs; you're either smart and biased or dumb and useless.

I judge models on how well they code. I can use Wikipedia to learn about Chinese protests, but not to write code. Using political bias as a benchmark is an unserious snipe hunt that gets deliberately ignored by researchers for good reason.

throwaw12

So what?

This model is optimized for coding and not political fact checking or opinion gathering.

If you go that way, you can prove bias in Western models with the same success.

echion

> with same success you can prove bias in western models.

What are some examples? (curious, as a westerner)

Are there "bias" benchmarks? (I ask, rather than just search, because: bias)

sebstefan

This isn't a result of optimizing things one way or another

throwaw12

I didn't say it is "the result of optimizing for something else". I said the model is optimized for coding: use it for coding and evaluate it based on coding. Why are you using it for political fact-checking?

When do we stop this kind of polarization? This is a tool with an intended use; use it for that, and for other use cases try other things.

You don't forecast weather with an image-detection model, or evaluate sentiment with a license-plate detector, do you?

Zetaphor

Neither is the bias and censorship exhibited in models from Western labs. The point is that this evaluation is pointless. If it's mission critical for you to have that specific fact available to the model then there are multiple ways to augment or ablate this knowledge gap/refusal.

quickthrowman

I just asked a GLM 4.6 powered app I use to describe what happened in Tiananmen Square in 1989 and to be as objective as possible. Here is a fragment of the output:

> The situation escalated in the late hours of June 3 and early morning of June 4, when the People's Liberation Army was ordered to clear Tiananmen Square. Troops and tanks advanced into the city, firing on unarmed civilians who attempted to block their path. Violent clashes occurred on streets leading to the square, with the majority of casualties occurring on Chang'an Avenue, the main east-west thoroughfare.

The system prompt for the app I use is different from the one z.ai uses. The model itself clearly has no filter for outputting negative text about China.

anonzzzies

I have been using 4.6 on Cerebras (or Groq with other models) since it dropped, and it is a glimpse of the future. If AGI never happens but we manage to optimise things so I can run that on my handheld/tablet/laptop device, I'll be beyond happy. And I guess that might happen, maybe with custom inference hardware like Cerebras. But seeing this generate at that speed is just jaw-dropping.

fgonzag

Apple's M5 Max will probably be able to run it decently (as it will fix the biggest issue with the current lineup, prompt processing, in addition to a bandwidth bump).

That should easily run an 8-bit (~360GB) quant of the model. It's probably going to be the first actually portable machine that can run it. Strix Halo doesn't come with enough memory (or bandwidth) to run it (you'd need almost 180GB for weights + context even at 4 bits), and there aren't any laptops with the top-end (Max+ 395) chips, only mini PCs and a tablet.

Right now you only get the performance you want out of a multi GPU setup.

wyre

Cerebras and Groq both have their own novel chip designs. If they can scale and create a consumer-friendly product, that would be great, but I believe their speeds are due to having all of their chips networked together, in addition to designing for LLM usage. AGI will likely happen at the data-center level before we can get on-device performance equivalent to what we have access to today (affordably), but I would love to be wrong about that.

azuanrb

You can also use z.ai with Claude Code. My workflow:

1. Use Claude Code by default.

2. Use z.ai when I hit the limit

Another advantage of z.ai is that you can also use the API, not just CLI. All in the same subscription. Pretty useful. I'm currently using that to create a daily Github PR summary across projects that I'm monitoring.

    zai() {
      ANTHROPIC_BASE_URL=https://api.z.ai/api/anthropic \
      ANTHROPIC_AUTH_TOKEN="$ZAI_API_KEY" \
      ANTHROPIC_DEFAULT_HAIKU_MODEL=glm-4.5-air \
      ANTHROPIC_DEFAULT_SONNET_MODEL=glm-4.7 \
      ANTHROPIC_DEFAULT_OPUS_MODEL=glm-4.7 \
      claude "$@"
    }

beacon294

Can you use search? Anything else missing? I use Cerebras GLM 4.6 Thinking in Aider and am looking to switch some usage to Claude Code or OpenCode.

phildougherty

Some of the Z.AI team is doing an AMA on r/localllama https://www.reddit.com/r/LocalLLaMA/comments/1ptxm3x/ama_wit...

buppermint

I've been playing around with this on z.ai and I'm very impressed. For my math/research heavy applications it is up there with GPT-5.2 thinking and Gemini 3 Pro. And it's well ahead of K2 thinking and Opus 4.5.

sheepscreek

> For my math/research heavy applications it is up there with GPT-5.2 thinking and Gemini 3 Pro. And it’s well ahead of K2 thinking and Opus 4.5.

I wouldn’t use the z-ai subscription for anything work related/serious if I were you. From what I understand, they can train on prompts + output from paying subscribers and I have yet to find an opt-out. Third party hosting providers like synthetic.new are a better bet IMO.

BeetleB

From their privacy policy:

"If you are enterprises or developers using the API Services (“API Services”) available on Z.ai, please refer to the Data Processing Addendum for API Services."

...

In the addendum:

"b) The Company do not store any of the content the Customer or its End Users provide or generate while using our Services. This includes any texts, or other data you input. This information is processed in real-time to provide the Customer and End Users with the API Service and is not saved on our servers.

c) For Customer Data other than those provided under Section 4(b), Company will temporarily store such data for the purposes of providing the API Services or in compliance with applicable laws. The Company will delete such data after the termination of the Terms unless otherwise required by applicable laws."

sheepscreek

I stand corrected - it seems they have recently clarified their position on this page towards the very end: https://docs.z.ai/devpack/overview

> Data Privacy

> All Z.ai services are based in Singapore.

> We do not store any of the content you provide or generate while using our Services. This includes any text prompts, images, or other data you input.

desireco42

I've been using the Z.ai coding plan for the last few months; generally a very pleasant experience. I think with GLM-4.6 they had some issues, which this corrects.

Overall a solid offering; they have an MCP you plug into Claude Code or OpenCode and it just works.

jbm

I'm surprised by this; I have it also and was running through OpenCode but I gave up and moved back to Claude Code. I was not able to get it to generate any useful code for me.

How did you manage to use it? I am wondering if maybe I was using it incorrectly, or needed to include different context to get something useful out of it.

csomar

I've been using it for the last couple of months. In many cases, it was superior to Gemini 3 Pro. One thing about Claude Code: it delegates certain tasks to glm-4.5-air, and that drops performance a ton. What I did was set the default models to 4.6 (now 4.7).

Be careful: this makes you run through your quota very fast (as smaller models have much higher quotas).

    ANTHROPIC_DEFAULT_HAIKU_MODEL=glm-4.7
    ANTHROPIC_DEFAULT_MODEL=glm-4.7
    ANTHROPIC_DEFAULT_OPUS_MODEL=glm-4.7
    ANTHROPIC_DEFAULT_SONNET_MODEL=glm-4.7

big_man_ting

I'm in the same boat as you. I really wanted to like OpenCode, but it doesn't seem to work properly for me. I keep going back to CC.

sidgtm

I am quite impressed with this model. I'm using it through its API inside Claude Code, and it's quite good when it comes to using different tools to get things done. No more weekly limit drama from Claude, and their quarterly plan is available for just $8.

sumedh

Can we use Claude models by default in Claude Code and then switch to GLM models if Claude hits usage limits?

mcpeepants

This works:

  ZAI_ANTHROPIC_BASE_URL=xxx
  ZAI_ANTHROPIC_AUTH_TOKEN=xxx

  alias claude-zai='ANTHROPIC_BASE_URL=$ZAI_ANTHROPIC_BASE_URL ANTHROPIC_AUTH_TOKEN=$ZAI_ANTHROPIC_AUTH_TOKEN claude'
Then you can run `claude`, hit your limit, exit the session and `claude-zai -c` to continue (with context reset, of course).

explodes

There is config you can add to your ~/.claude/settings.json file for this. (I'm on mobile!)

buremba

I tried this and it broke the conversation. :(

sumedh

Thanks, will try it out.

CodeWriter23

Why would one want to do that instead of using claude-zai -c from the start? All this is pretty new to me, kick a n00b a clue please.

andai

They have a promo now to get a whole year for like $25 on the Lite plan.

gigatexal

Even if this is one or two iterations behind the big models (Claude, OpenAI, Gemini), it's showing large gains. Here's hoping it gets even better, that I can run it locally, and that it doesn't melt my PC.

Imustaskforhelp

One would hope we can run it locally (which I hope for too, though with the increase in RAM prices I doubt it; I feel it's possible around 2027-2028). But even if in the meanwhile we can't, I am sure that competition in general (on places like OpenRouter and others) will give a meaningful way to cheapen prices overall, even beyond the monopolistic ways of Claude (let's say).

It does feel like these models are only six months behind, though, as many like to say, and for some things it's 100% reasonable to use them and for some others not so much.

gigatexal

I have 128GB of memory in my laptop. But running models with LM Studio turns the fans to 100% and isn't as effective as the hosted models. So I'm not worried about RAM. I'm hoping for a revolution, or whatever comes after LLMs, to see if local will be better.

cmrdporcupine

Running it in Crush right now and so far fairly impressed. It seems roughly in the same zone as Sonnet, but not as good as Opus or GPT 5.2.
