gertlabs
neonstatic
I have very mixed feelings about that model. I want to like it. It's very fast and seems to be fit for many uses. I strongly dislike its "personality", but it responds well to system prompts.
Unfortunately, my experience with it as a coding assistant is very poor. It doesn't understand libraries it seems to know about, it doesn't see the root causes of problems I want it to solve, and it refuses to use MCP tools even when asked. It also has a very strong fixation on the concept of time: anything past January 2025, which I think is its knowledge cutoff, it will label "science fiction" or "their fantasy" and role-play from there.
seemaze
That's funny, it failed my usual ‘hello world’ benchmark for LLMs:
“Write a single file web page that implements a 1 dimensional bin fitting calculator using the best fit decreasing algorithm. Allow the user to input bin size, item size, and item quantity.”
Qwen3.5, Nematron, Step 3.5, and gpt-oss all passed on the first go.
datadrivenangel
It's a very good open-weights model! Notably, I found it makes more dumb coding mistakes than GPT-OSS on my M5, but it's fairly close overall.
prettyblocks
For me, the vision/OCR is much better than other models in its weight class.
iknowstuff
Gemma 31B scoring below 26B-A4B?
gertlabs
In one-shot coding, surprisingly, yes, and by a decent amount. And it isn't a sample-size issue. In agentic, no: https://gertlabs.com/?agentic=agentic
My early takeaway is that Gemma 26B-A4B is the best tuned out of the bunch, but being small and with few active params, it's severely constrained by context (large inputs and tasks with large required outputs tank Gemma 26B's performance). We're working on a clean visualization for this; the data is there.
It's not uncommon for a sub-release of a model to show improvements across the board on its model card, but actually have mixed real performance compared to its predecessor (sometimes even being worse on average).
adrian_b
In early tests the performance of gemma-4-31B was affected by tokenizer bugs in many of the existing backends, like llama.cpp, which were later corrected by their maintainers.
Moreover, tool invocation had problems that were later corrected by Google in an updated chat template.
So any early benchmarks that showed the dense model as inferior to the MoE model are likely flawed and should be repeated after updating both the inference backend and the model.
All benchmarks that I have seen after the bugs were fixed have shown the dense model as clearly superior in quality, even if much slower.
mhitza
> The finding I did not expect: model quality matters more than token speed for agentic coding.
I'm really surprised that this wasn't obvious.
Also, instead of limiting the context size to something like 32k, you can offload the MoE weights to the CPU with --cpu-moe, at the cost of roughly halving token generation speed.
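For anyone who hasn't tried it, a minimal sketch with llama-server (the GGUF filename is just a placeholder for whatever quant you grabbed):

  # dense/attention layers go to the GPU (-ngl), expert weights stay in system RAM (--cpu-moe)
  llama-server -m gemma-4-26b-a4b-Q4_K_M.gguf -c 131072 -ngl 99 --cpu-moe

If memory allows, I believe --n-cpu-moe N lets you keep only the experts of the first N layers on the CPU instead of all of them.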
triceratops
Why would token speed matter for anything other than getting work done faster? It's in the name - "speed".
dminik
This would be true if the models were capable of always completing the tasks. But, since their failure rate is fairly high, going in a wrong direction for longer could mean that you take more time than a faster model, where you can spot it going wrong earlier.
dangoodmanUT
Yeah, it’s like drinking coffee when being really tired. You’re still tired, just “faster”, it’s a weird sensation.
kingstnap
It's even more strange how it's not obvious to someone who uses codex extensively daily.
The rate limiting step is the LLM going down stupid rabbit holes or overthinking hard and getting decision paralysis.
The only time raw speed really matters is if you are trying to add many many lines of new code. But if you are doing that at token limiting rates you are going to be approaching the singularity of AI slop codebase in no time.
adam_patarino
[dead]
tuzemec
I'm currently experimenting with running google/gemma-4-26b-a4b with LM Studio (https://lmstudio.ai/) and Opencode on an M3 Ultra with 48GB RAM. And it seems to be working. I had to increase the context size to 65536 so the prompts from Opencode would work, but no other problems so far.
I tried running the same on an M3 Max with less memory, but couldn't increase the context size enough to be useful with Opencode.
It's also easy to integrate it with Zed via ACP. For now it's mostly simple code review tasks and generating small front-end related code snippets.
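If anyone wants to set that up headlessly instead of clicking through the LM Studio UI, something like this should work with the lms CLI (I'm going from memory on the flag name, and the model key is whatever LM Studio shows for your download); Opencode then just talks to the local OpenAI-compatible server:

  # start LM Studio's local server and load the model with a bigger context window
  lms server start
  lms load google/gemma-4-26b-a4b --context-length 65536
  # point Opencode (or any OpenAI-compatible client) at http://localhost:1234/v1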
usagisushi
I have a similar setup. It might be worth checking out pi-coding-agent [0].
The system prompt and tools have very little overhead (<2k tokens), making the prefill latency feel noticeably snappier compared to Opencode.
[0] https://www.npmjs.com/package/@mariozechner/pi-coding-agent#...
jtbaker
Just set up Pi after listening to Mario's talk at AIE Europe[0] and have solid initial impressions! Especially on limited hardware like a MB Air, it seems a lot more resource-efficient.
tuzemec
Thanks! I just ran a quick test with pi, and it's working a bit faster.
theshrike79
Pi is _really_ good for personal stuff, but since it lacks every single safety measure imaginable, it's not really something one can deploy in a corporate environment :D
rsolva
I run this model on my AMD RX7900XTX with 24GB VRAM, with up to 4 concurrent chats and a 512K context window in total. It is very fast (~100 t/s), feels instant and very capable, and I've been using Claude Code less and less these days.
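For reference, a multi-slot setup like that with llama-server looks roughly like the following (the filename is a placeholder, and --parallel splits the total context budget across the slots):

  # 4 concurrent slots sharing a 512K total KV cache, all layers on the 24GB card
  llama-server -m gemma-4-26b-a4b-Q4_K_M.gguf -c 524288 --parallel 4 -ngl 99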
jwr
I do the same thing on a MacBook Pro with an M4 Max and 64GB. I had problems until the most recent LM Studio update (0.4.11+1): tool calling didn't work correctly.
Now both codex and opencode seem to work.
declan_roberts
Which do you prefer? And which LM Studio API works best for these tools?
jwr
I use the OpenAI API for everything. I think codex is more polished, but I don't really prefer anything: I haven't used them enough. I mostly use Claude Code.
davidwritesbugs
I did the same with the MLX version on an M1 MacBook, using LM Studio integrated into Xcode. I had to up the context size. I ran it against a very modest iOS codebase and it didn't do well; it just petered out at one point. Odd. It's a pretty good chatbot, and maybe it'll work against other code, but it wasn't useful with Xcode for me.
ozgrakkurt
Not sure if you already tried but both GLM Flash and Qwen models are much better than Gemma for that in my experience.
I am using a 24GB GPU so it might be different in your case, but I doubt it.
qingcharles
I spun up a GPU on Runpod and tried the 31B at full precision and it was really impressive. I'm now using it via the Google API, which gives you 1500 requests a day for free, IIRC.
hak8or
Be very careful about using Google's APIs as a consumer; they have poor rate limiting and ineffective anomaly protection.
I (a hobbyist running a small side project for a dollar or two a month in normal usage, so my account is marked as "individual") got hit with a ~$17,000 bill from Google Cloud because a key got leaked or my homelab got compromised (or some combination), and the attacker consumed tens of thousands of dollars in Gemini usage in only a few hours. It wasn't even the same Google project as my side project; it was another one that hadn't seen activity in over a year.
Google refuses to apply any adjustments. Their billing specialist even mixed up my account with someone else's, refuses to provide further information on why the adjustments are being rejected, refuses any escalation, etc. I already filed a complaint with the FTC and the NYS Attorney General, but the rep couldn't care less.
My gripe is not that the key was potentially leaked or compromised and I now have to pay for a very expensive "you messed up" mistake. It's that they let an API key rack up tens of thousands of dollars in maybe 4 hours with usage patterns that looked nothing like mine (model selection, generating text vs. images, volume of calls, likely a different IP and user agent, and so on). That's just predatory behavior toward an account marked as individual/consumer (not a business).
qingcharles
Agree totally. I'm super paranoid and anxious about this issue. I've seen too many horror stories posted on Reddit. I did set alarms at $10 a day on the account, but those are only alarms, and the bill could be thousands over before I see them.
I think Google did finally implement hard limits this month and I need to go and find that setting, but it's useless if, like you say, they have shitty rate limiting and measurement so that you're way over the limit before they stop you.
smrtinsert
GGUF or MLX? Edit: just tried a community MLX build, and LM Studio said it didn't support loading it yet.
fortyseven
I've been VERY impressed with Gemma4 (26B at the moment). It's the first time I've been able to use OpenCode via a llama.cpp server reliably and actually get shit done.
In fact, I started using it as a coding partner while learning how to use the Godot game engine (and some custom 'skills' I pulled together from the official docs). I purposely avoided Claude and friends entirely, and just used Gemma4 locally this week... and it's really helped me figure out not just coding issues I was encountering, but also helped me sift through the documentation quite readily. I never felt like I needed to give in and use Claude.
Very, very pleased.
logicallee
Thanks for sharing that. What kind of hardware are you running this on?
fortyseven
4090, 128GB of RAM (bought long before you'd have to take out a loan). I'm fairly sure it would run just fine on a 3090.
Thanks to the settings suggestions in the article, I was able to squeeze in the 31B model. Still testing, but it's a real tight fit in 24GB of VRAM. A bit slower, too, but usable. Not sure I'm seeing much of a quality boost yet, but I'm still testing.
hnrodey
Probably a silly/obvious suggestion but are you using onboard GPU for display out?
segmondy
"The reason I had not done this before is that local models could not call tools. "
Rubbish, we have been calling tools locally for 2 years, and it's very false that gemma3 scored under 7% in tool calling. Hell, I was getting at least 75% tool calling with llama3.3
StrLght
I was also surprised by this sentence. It sounds like this is the author's first attempt at running models locally.
Or maybe the author has been running heavily quantized small models all that time. The Gemma 4 GGUF he's using is Q4 and only 16 GB; in my experience quants like this tend to perform much worse.
nphard85
To be fair, the author does mention the huge difference between Gemma 3 and Gemma 4 on the Tau function-calling benchmark.
girvo
This entire article reads like AI slop anyway.
I also recommend anyone with a GB10 device to go try out the spark-vllm-docker setup, and check the Nvidia GB10 forums for the recently released optimised Qwen 3.5 122B A10B setup: 50tk/s is quite impressive for a decent local model!
egorfine
Related: I upgraded my M4 Pro 24GB to an M5 Pro 48GB yesterday. The same Gemma 4 MoE model (Q4) runs about 8x more t/s on the M5 Pro and loads 2x faster from disk to memory.
Gonna run some more tests later today.
Confiks
> The same Gemma 4 MoE model (Q4)
As you have so much RAM I would suggest running Q8_0 directly. It's not slower (perhaps except for the initial model load), and might even be faster, while being almost identical in quality to the original model.
And just to be sure: you are running the MLX version, right? The mlx-community quantization seemed to be broken when I tried it last week (it spat out garbage), so I downloaded the unsloth version instead. That too was broken in mlx-lm (it crashed), but has since been fixed on the main branch of https://github.com/ml-explore/mlx-lm.
I unfortunately only have 16 GiB of RAM on a MacBook M1, but I just tried to run the Q8_0 GGUF version on a 2023 AMD Framework 13 with 64 GiB RAM using just the CPU, and that works surprisingly well, generating tokens much faster than I can read them. The prompt cache is also very useful to quickly insert a large system prompt or a file to datamine, although there are probably better ways to do that than manually through a script.
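The prompt cache bit, if you're on the llama.cpp side, is just llama-cli's --prompt-cache flag; a minimal sketch with placeholder filenames:

  # the first run processes the long prefix and saves its KV state to prompt-cache.bin;
  # later runs that start with the same prefix reload it instead of re-processing it
  llama-cli -m gemma-4-26b-a4b-Q8_0.gguf -f prompt_with_system_and_question.txt --prompt-cache prompt-cache.bin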
egorfine
> As you have so much RAM I would suggest running Q8_0 directly
On the 48GB Mac, absolutely. The 24GB one cannot run Q8, hence the comparison.
> And just to be sure: you're are running the MLX version, right?
Nah, not yet. I have only tested in LM Studio and they don't have MLX versions recommended yet.
> but has since been fixed on the main branch
That's good to know, I will play around with it.
egorfine
> That too was broken in mlx-lm (it crashed), but has since been fixed on the main branch
Unfortunately, I've had zero success running Gemma with the mlx-lm main branch. Can you point me to the right way to do it? I have zero experience with mlx-lm.
Confiks
Create a venv (python3 -m venv venv), and run:
> ./venv/bin/pip3 install git+https://github.com/ml-explore/mlx-lm.git
> ./venv/bin/mlx_lm.generate --model "$MODEL" --temp 1.0 --top-p 0.95 --top-k 64 --max-tokens 128000 --prompt "Hello world"
Where $MODEL is an unsloth model like:
- unsloth/gemma-4-E4B-it-UD-MLX-4bit
- unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit
minimaxir
Gemma 4 is not supported by the MLX engine yet.
Confiks
It is, as I'm running it; it has been added this week. As I said I'm running the main version from Github and doing nothing special, see: https://news.ycombinator.com/item?id=47761308
zihotki
For coding, it makes no sense to use any quantization worse than Q6_K, in my experience. More heavily quantized models make more mistakes; that can still be fine for text processing, but not for coding.
segmondy
I don't think most people realize that. Quality of tokens beats quantity of tokens. I always tell folks to go for as high a quant as you can, and only go lower if you just don't have the memory capacity.
hmokiguess
What do you mean by that? I'm not sure I understood what you said.
m348e912
AI models like gemma4 are available in different quant "sizes"; think of it as an image available at various compression levels.
The best image is the largest, takes up the most memory when loading, and while it is large and looks the best, it uses up much of your system resources.
On the other end of the spectrum there is a smaller much more compressed version of that same image. It loads quickly, uses less resources, but is lacking detail and clarity of the original image.
AI models are similar in that fashion, and the parent poster is suggesting you use the largest version of the AI model your system can support, even if it runs a little slower than you like.
stavros
Better to go for a less-quantized model, even if it's slower, than a faster, more heavily quantized one.
shaz0x
On mobile the Q4 vs Q6 tradeoff flips. Gemma 4 E2B at Q4_K_M barely fits in RAM on a 6GB Android, so Q6 isn't on the table. In practice the Q4 hit shows up in tool-call reliability more than general reasoning, which is usually fine for a constrained skill surface.
meander_water
I would have liked to see quality comparisons between the different quantization methods (Q4_K_M, Q8_0, Q6_K) rather than just tok/s.
dajonker
I don't really have the hardware to try it out, but I'm curious to see how Qwen3.5 stacks up against Gemma 4 in a comparison like this. Especially this model, which was fine-tuned to be good at tool calling and has more than 500k downloads as of this moment: https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-...
notpublic
Jackrong has published the finetuning steps here. It seems to be quite thorough with notebooks etc. I am going through it myself now...
mapontosevenths
I'm just some guy on hackernews, but I actually did try this on my DGX Spark. I went back to Gemma 4 after a few rounds. My orchestration model kept having to send the Qwen model back to fix mistakes that Gemma wouldn't have made. I wound up with less working code per hour due to the mistakes.
Technically, I use OpenWebUI with Ollama, so I used the weights below, but it should be the same.
https://ollama.com/kwangsuklee/Qwen3.5-27B-Claude-4.6-Opus-R...
estimator7292
I'd be super interested to hear about your workflow with OpenWebUI. I haven't figured out how to use it for anything other than the basic chatbot UI. I haven't been able to hook anything else into it
mapontosevenths
What I said above was a bit confused. What I've actually done is connect OpenCode and OpenWebUI both to Ollama. I just use OpenWebUI to manage the models and for testing/etc. Once you have it working it's very nice. You can pull a new model just by typing the name and waiting while it downloads, etc.
Connecting Ollama to OpenCode and OpenWebUI is relatively trivial. In OpenWebUI there's a nice GUI. In OpenCode, you just edit ~/.config/opencode/opencode.json to look something like this. The model names have to match the ones you see in OpenWebUI, but the friendly "name" key can be whatever you need to be able to recognize it.
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama",
      "options": {
        "baseURL": "http://localhost:11434/v1"
      },
      "models": {
        "qwen3.5:122b": {
          "name": "Qwen 3.5 122b"
        },
        "qwen3-coder:30b": {
          "name": "Qwen 3 Coder"
        },
        "gemma4:26b": {
          "name": "Gemma 4"
        }
      }
    }
  }
}
anana_
It's rather surprising that a solo dev with rather humble resources can squeeze more performance out of a model than a frontier lab. I'm skeptical of claims that such a fine-tuned model is "better" -- maybe on certain benchmarks, but overall?
FYI the latest iteration of that finetune is here: https://huggingface.co/Jackrong/Qwopus3.5-27B-v3
1dom
I feel that's a little bit misleading.
That link doesn't have much affiliation with Qwen or anyone who produces/trained the Qwen models. That doesn't mean it's not good or safe, but it seems quite subjective to suggest it's the latest or greatest Qwen iteration.
I can see huggingface turning into the same poisoned watering-hole as NPM if people fall into the same habits of dropping links and context like that.
anana_
I'm not saying it's the latest Qwen iteration - that would be Qwen3.6.
I'm saying it's the latest iteration of the finetuned model mentioned in the parent comment.
I'm also not suggesting that it's "the latest and greatest" anything. In fact, I think it's rather clear that I'm suggesting the opposite? As in - how can a small fine tune produce better results than a frontier lab's work?
NitpickLawyer
> can squeeze more performance out of a model with rather humble resources vs a frontier lab.
That's the idea behind distillation. They are finetuning it on traces produced by opus. This is poor man's distillation (and the least efficient) and it still works unreasonably well for what it costs.
adam_patarino
[dead]
2001zhaozhao
I think it might be a good idea to make some kind of local-first harness that is designed to fully saturate some local hardware churning experiments on Gemma 4 (or another local model) 24/7 and only occasionally calls Claude Opus for big architectural decisions and hard-to-fix bugs.
Something like:
* Human + Claude Opus sets up project direction and identifies research experiments that can be performed by a local model
* Gemma 4 on local hardware autonomously performs smaller research experiments / POCs, including autonomous testing and validation steps that burn a lot of tokens but can convincingly prove that the POC works. This is automatically scheduled to fully utilize the local hardware. There might even be a prioritization system to make these POC experiments only run when there's no more urgent request on the local hardware. The local model has an option to call Opus if it's truly stuck on a task.
* Once an approach is proven through the experimentation, the human works with Opus to implement it into the main project from scratch
If you can get a complex harness to work on models of this weight class paired with the right local hardware (maybe your old gaming GPU plus 32GB of RAM), you can churn through millions of output tokens a day (and probably ~100 million input tokens, though the vast majority are cached). The main cost advantage compared to cloud models is actually that you have total control over prompt caching locally, which makes it basically free, whereas most API providers for small LLM models charge full price for input tokens even if the prompt is repeated exactly across every request.
rahimnathwani
Nico Bailon (author of many extensions for Mario's Pi coding agent) has something that might be a good starting point for this:
blackmanta
With an Nvidia Spark or a 128GB+ memory machine, you can get a good speed-up on the 31B model if you use the 26B MoE as a draft model. It uses more memory, but I've seen acceptance rates around 70%+ using Q8 on both models.
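For anyone who wants to try it, the llama-server flags look roughly like this (filenames are placeholders; the draft and target need compatible tokenizers, which the two Gemma 4 models presumably share given that this works at all):

  # 31B dense as the target, 26B MoE as the draft; --draft-max caps the tokens speculated per step
  llama-server -m gemma-4-31b-Q8_0.gguf -md gemma-4-26b-a4b-Q8_0.gguf --draft-max 16 -ngl 99 -ngld 99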
foobar10000
1 token ahead or 2?
It's interesting - imo we'll soon have draft models specifically post-trained for denser, more complicated models. I wouldn't be surprised if diffusion models made a comeback for this: they can draft many tokens at once, and learning curves seem to top out at 90+% match for auto-regressive ones, so it's quite interesting.
electroglyph
flow matching is making some strides right now, too
taf2
I did this with qwen 3.5 - tool calling was the biggest issue, but for getting it to work with vllm and mlx I just asked codex to help. The bulk of the time was waiting on downloads. For vllm, it created a proxy service to translate some codex idioms to vllm and vice versa. In practice I got good results on my first prompt, but follow-up questions would usually fail due to the model's trouble with tool calling - I need to try again with gemma4.
Gemma 4 26B really is an outlier in its weight class.
In our little-known, difficult-to-game benchmarks, it scored about as well as GPT 5.2 and Gemini 3 Pro Preview on one-shot coding problems. It had me re-reviewing our entire benchmarking methodology.
But it struggled in the other two sections of our benchmark: agentic coding and non-coding decision making. Tool use, iterative refinement, managing large contexts, and reasoning outside of coding brought the scores back down to reality. It actually performed worse when it had to use tools and a custom harness to write code for an eval vs getting the chance to one-shot it. No doubt it's been overfit on common harnesses and agentic benchmarks. But the main problem is likely scaling context on small models.
Still, an incredible model, and incredible speed on an M-series MacBook. Benchmarks at https://gertlabs.com