simjnd

I'm not sure what people are on in the comments. It doesn't beat the other models, but it sure competes despite its size.

GLM 5.1 is an excellent model, but even at Q4 you're looking at ~400GB. Kimi K2.5 is really good too, and at Q4 quantization you're looking at almost ~600GB.

This model? You can run it at Q4 with 70GB of VRAM. This is approaching consumer level territory (you can get a Mac Studio with 128GB of RAM for ~3500 USD).

For the Claude-pilled people, I don't know if you only run Opus but when I was on the Pro plan Sonnet was already extremely capable. This beats the latest Sonnet while running locally, without anyone charging you extra for having HERMES.md in your repo, or locking you out of your account on a whim.

Mistral has never been competitive at the frontier, but maybe that is not what we need from them. Having Pareto models that get you 80% of the frontier at 20% of the cost/size sounds really good to me.

Aurornis

> This model? You can run it at Q4 with 70GB of VRAM. This is approaching consumer level territory (you can get a Mac Studio with 128GB of RAM for ~3500 USD).

The one thing I would want everyone curious about local LLMs to know is that being able to run a model and being able to run a model fast are two very different thresholds. You can get these models to run on a 128GB Mac, but you first need to check whether Q4 retains enough quality (models have different sensitivities to quantization) and how fast it runs.

For running async work and background tasks the prompt processing and token generation speeds matter less, but a lot of Mac Studio buyers have discovered the hard way that it's not going to be as responsive as working with a model hosted in the cloud on proper hardware.

For most people without hard requirements for on-site processing, the best use case for this model would be going through one of the OpenRouter hosted providers for it and paying by token.

> This beats the latest Sonnet while running locally

Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.

nijave

>Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.

This has been my experience as well. I've been testing an agent built with Strands Agents, which receives a load balancer latency alert and is expected to query logs with AWS Athena (Trino), then drill down with Datadog spans/traces to find the root cause. Admittedly, "devops" domain knowledge is important here.
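
Roughly, the harness looks like the sketch below (simplified; the tool names and prompts here are placeholders rather than the actual agent):

    from strands import Agent, tool
    from strands.models import BedrockModel

    @tool
    def query_athena(sql: str) -> str:
        """Run a query against the ALB access logs in Athena (Trino) and return rows as text."""
        return "stub"  # the real tool wraps boto3 Athena start_query_execution / get_query_results

    @tool
    def get_datadog_traces(service: str, start: str, end: str) -> str:
        """Fetch Datadog spans/traces for a service and time window."""
        return "stub"  # the real tool calls the Datadog API

    agent = Agent(
        model=BedrockModel(model_id="us.anthropic.claude-sonnet-4-5-20250929-v1:0"),
        tools=[query_athena, get_datadog_traces],
        system_prompt=(
            "You are an SRE agent. Given a load balancer latency alert, query the ALB "
            "logs in Athena, drill down with Datadog traces, and report the root cause."
        ),
    )

    agent("ALB p99 latency breached 2s on prod-web between 14:05 and 14:20 UTC. Find the root cause.")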

My notes so far:

"us.anthropic.claude-sonnet-4-6" # working, good results

"us.anthropic.claude-sonnet-4-20250514-v1:0" # has problems following the prompt instructions

"us.anthropic.claude-sonnet-4-5-20250929-v1:0" # working, good results

"us.anthropic.claude-opus-4-5-20251101-v1:0"

"us.anthropic.claude-opus-4-6-v1" # best results, slower, more expensive

"amazon.nova-pro-v1:0" # completely fails

"openai.gpt-oss-120b-1:0" # tool calling broken

"zai.glm-5" # seems to work pretty well, a little slow, more expensive than Sonnet

"minimax.minimax-m2.5" # didn't diagnose correctly

"zai.glm-4.7" # good results but high tool call count, more expensive than Sonnet

"mistral.mistral-large-3-675b-instruct" # misdiagnosed--somehow claimed a Prometheus scrape issue was involved

"moonshotai.kimi-k2.5" # identified the right endpoints but interpreted trace data/root cause incorrectly

"moonshot.kimi-k2-thinking" # identified endpoint, 1 correct root cause, 1 missing index hallucination

All models were run on AWS Bedrock. I let Claude Code w/ Opus 4.7 iterate on the agent prompt but didn't try to optimize per model. Really, the only thing that came close to Sonnet 4.5 was GLM-5. The real kicker is, Sonnet is also the cheapest since it supports prompt caching.

The Kimi ones were close to working but didn't quite hit the mark.

pbgcp2026

" it supports prompt caching" May I ask if you checked that? I use "{"cachePoint": { "type": "default" }" and I found 2 things: * 1) even if stated in the Doco, Bedrock Converse API does not allow 1hr expiry time, only 5m - gives error when attempted; * 2) Bedrock Converse API does accept up to 4 cachePoint's but does NOT cache and returns zeroes. LOL. It was confirmed by some other people on Github. (Note: VertexAI does cache properly reducing the bill drastically, so I use Vertex instead of OpenRouter.)

simjnd

> The one thing I would want everyone curious about local LLMs to know is that being able to run a model and being able to run a model fast are two very different thresholds. You can get these models to run on a 128GB Mac, but you first need to check whether Q4 retains enough quality (models have different sensitivities to quantization) and how fast it runs.

Very valid. This is an active area of research, and there are a lot of options to try out already today.

- People have successfully used TurboQuant to quantize model weights (TQ3_4S), not just the context KV, to achieve smaller sizes than Q4 (~3.5 bpw) with much better PPL and faster decoding.

- Importance-weighted quantization (e.g. IQ4) also provides way better PPL, KLD, etc. at the same size as a Q4.

- DFlash (block diffusion for speculative decoding) needs a good drafting model compatible with the big model, but can provide an uplift of up to 5x in decoding (although usually in the 2-2.5x range).

- Forcing a model's thinking to obey a simple grammar has been shown to improve results with drastically lower thinking output (faster effective result generation) although that has been more impactful on smaller models.

We should be skeptical, but it's definitely trending in the right direction and I wouldn't be surprised if we are indeed able to run it at acceptable speeds.

> Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.

This hasn't been my experience. After Anthropic started their shenanigans I switched to exclusively using open-weights models via OpenRouter and OpenCode, and I can't really tell a difference (for better or for worse).

tredre3

> - Importance-weighted quantization (e.g. IQ4) also provides way better PPL, KLD, etc. at the same size as a Q4.

All the Q quants from big quant providers are importance-weighted (imatrix) nowadays.

The main (possibly only?) difference between Q and IQ today is that IQ uses a lookup table to achieve better compression. That is also why IQ suffers more when it can't fully fit into VRAM.

It's important to teach people the distinction and not perpetuate wrong assumptions of the past. If one needs/wants static quants, ignoring IQ_ isn't enough.

sroussey

Super interesting!

> - People have successfully used TurboQuant to quantize model weights (TQ3_4S), not just the context KV, to achieve smaller sizes than Q4 (~3.5 bpw) with much better PPL and faster decoding.

Where can I find more info on this? I’d like to convert models to onnx this way.

> - Importance-weighted quantization (e.g. IQ4) also provides way better PPL, KLD, etc. at the same size as a Q4.

Where can I find more info on this? I’d like to convert models to onnx this way.

The most difficult environment for small models is in the browser. Would be great to push the SOTA in that environment.

parsimo2010

> being able to run a model and being able to run a model fast are two very different thresholds

Specifically speaking, on my Strix Halo machine with a (theoretical) memory bandwidth of 256 GB/s, a 70 GB model can't generate faster than 256/70 ≈ 3.65 t/s. The logic here is that a dense model must do a full read of the weights for each token. So even if the GPU can keep up, the memory bandwidth is limiting.

A Mac M5 Pro has more bandwidth at 307 GB/s, but that's only a little better.
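
Quick script version of that back-of-the-envelope (upper bounds only; KV-cache reads and overhead push real numbers lower):

    # Dense-model decode ceiling: every generated token streams all the weights through
    # memory once, so tokens/s <= bandwidth / model size.
    MODEL_GB = 70  # ~Q4 quant of this model

    for name, bw_gb_s in [("Strix Halo", 256), ("M5 Pro", 307), ("M3 Ultra", 820)]:
        print(f"{name}: <= {bw_gb_s / MODEL_GB:.1f} t/s")

    # Strix Halo: <= 3.7 t/s
    # M5 Pro: <= 4.4 t/s
    # M3 Ultra: <= 11.7 t/s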

This thing is going to be slow on consumer hardware. Maybe that is useful for someone, but I probably prefer a faster model in most cases even if the model isn't quite as smart. Qwen3.6 35B-A3B generates about 50 t/s on my machine, so it can make mistakes, be corrected, and try again in the same time that this model would still be thinking about its first response.

zozbot234

Recent models support multi-token prediction, which can guess multiple future tokens in a single decode step (using some subset of the model itself, not a separate drafting model) and then verify them all at once. It's an emerging feature still (not widely supported) and it's only useful for speeding up highly predictable token runs, but it's one way to do better in practice than the common-sense theoretical limit might suggest.
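
A toy sketch of the draft-and-verify idea (pure pseudo-logic, not any particular library's API):

    # Draft k tokens cheaply (an MTP head or self-drafting), then check them all with a
    # single full forward pass of the big model and keep the matching prefix. One
    # full-weight pass can therefore yield several tokens when the text is predictable.
    def speculative_step(context, draft_next, verify_batch, k=4):
        draft, ctx = [], list(context)
        for _ in range(k):
            t = draft_next(ctx)                # cheap guess
            draft.append(t)
            ctx.append(t)

        target = verify_batch(context, draft)  # big model's token at each drafted position

        accepted = []
        for guess, truth in zip(draft, target):
            accepted.append(truth)             # the big model's choice is always safe to keep
            if guess != truth:
                break                          # first mismatch invalidates the rest of the draft
        return accepted                        # 1..k tokens per full pass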

zozbot234

Cloud hardware is not inherently more "proper" than what's being proposed here; there's nothing wrong per se about targeting slower inference speeds in an on-prem single-user context.

Aurornis

> Cloud hardware is not inherently more "proper" than what's being proposed here

Cloud hardware can run the original model. Quantization will reduce quality. The quality drop to Q4 is not trivial.

Cloud hardware is also massively faster in time to first token and token generation speed.

> there's nothing wrong per se about targeting slower inference speeds in a local single-user context.

If that's what the user wants and expects, then it's fine.

Most people working interactively with an LLM would suffer from slower turns.

cbg0

Quantization for some models can be very detrimental, and their quality can drop considerably from the posted benchmarks (which are probably at bf16). This is why having considerable RAM can be important.

notatoad

being able to run a model fast is definitely more useful, but being able to run a model slowly for free is still super useful. agentic workflows are maturing all the time.

yes, if i'm directly interacting with the LLM, i want it to be reasonably fast. but lately i've been queueing up a bunch of things when i go for lunch, or leaving things running when i go home at the end of the day. and claude doesn't keep working on that all night, it runs for an hour or so, gets to a point where it needs more input from me, and gives me some stuff to review in the morning. that could run 16x slower and still be just as useful for me.

Computer0

Sure, but for a casual conversational use case I have not found speed to be a huge barrier. I chatted with a 100B model running on DDR5 only, on a plane recently, and it was fine. It's mainly that I cannot do data classification and coding tasks in a timely manner.

gregsadetsky

I didn't know about HERMES.md ... (??) - for others who are curious, I found information here: https://github.com/anthropics/claude-code/issues/53262

gnulinux

This GitHub thread is incredible, thanks for sharing. This link should be its own HN submission.

giancarlostoro

That is insane. If you billed me an extra $200 for a bug in your system I'd flat out cancel my subscription. If you're not going to credit that back to me, you don't deserve any more of my money. I'm a Claude-first guy, but if you're going to bill me incorrectly, that's on you: own it, fix it.

xcrjm

They did credit it back to him. There's a comment in the linked issue.

giancarlostoro

> For the Claude-pilled people, I don't know if you only run Opus but when I was on the Pro plan Sonnet was already extremely capable.

Before February I was able to use Opus on High exclusively on my Max plan, no problem. Now I've shifted to just using Sonnet on High and yeah, it's pretty capable. I love that, Claude-pilled. ;)

simjnd

Yeah I love Claude, amazing models. Anthropic has very quickly burned most of the goodwill I had for it so I still ended up cancelling my subscription.

WhitneyLand

> This beats the latest Sonnet while running locally

Not really.

- The benchmarks are based on F8_E4M3 and you’re not running that on any Mac.

- Sonnet has a 1M token context window. This is 256k but again you’re probably not even getting that locally.

- Sonnet is fast over the wire. This is going to be much slower.

trvz

> Sonnet is fast over the wire.

Except when it's unavailable. For sovereignty, the downsides are worth it to some.

trueno

The benchmarks we're using to measure LLMs don't do the question justice when everyone's mental benchmark is simply "is it going to feel like using Claude", and the answer is still no. The entire LLM space is stuffed with tons of crazy datapoints and vernacular that barely paint the picture of the mental benchmark everyone is after.

I too am desperate to just sever ties with these big providers. Fingers crossed we get there within the constraints of local hardware, even if that means spending 3-5k; I just want off this wild ride.

varispeed

Not sure the 1M token window is meaningful with Sonnet/Opus. The models go dumb quickly as context increases, making them unusable (that is, if you get routed to actual Opus; otherwise they are just dumb regardless of context window).

ksubedi

Let's not forget Qwen 35B A3B MoE. It gets better performance than this in all the metrics for a fraction of the memory / compute footprint.

Sad to see all the non-Chinese open-source models being at least one generation behind.

simjnd

Qwen3.6 27B is even more impressive IMO. Dense so it doesn't run as fast but it's so good.

trueno

I'm kinda torn on which to download. I have the headroom to run either; mostly I just want the occasional "do a coding thing I'm too lazy to do".

UncleOxidant

Yeah, you can run it locally if you have enough VRAM, but the reports trickling in are saying about 3 tok/sec. That was on a Strix Halo box, which definitely has the needed VRAM but isn't going to have as high memory bandwidth as a GPU card. It's going to be similar on a Mac - that's the dilemma... the unified-memory machines have the VRAM, but the bandwidth isn't great for running dense models. A dense model of this size is only going to be (usefully) runnable by the very few people with multiple GPU cards whose memory adds up to about 70GB.

simjnd

I don't think this is quite correct: a Strix Halo box usually has 256 GB/s of memory bandwidth, while an M5 Max has 614 GB/s and an M3 Ultra (there is no M4 or M5 Ultra) has 820 GB/s. It's still not GDDR or HBM territory, but it's significantly faster.

That's the edge of Apple Silicon for AI. When they scale up the chip they add more memory controllers which adds more channels and more bandwidth.

But yeah in the end it's still going to be only a handful of people that can run it.

What I meant is that I think researching and developing smaller, more powerful models is more interesting than chasing the next 3T-parameter model while burning through VC money and squeezing your customer base more and more aggressively.

YetAnotherNick

It has a similar SWE-bench score to Qwen 3.6 27B [1]. No one is comparing it to the frontier.

[1]: There is no other common benchmark in the blog.

simjnd

That's more a testament to how good Qwen3.6 27B is (it really is great) than to how bad this one is, IMO. Gemma 4 31B was already good, but Qwen3.6 27B is incredible for its size.

reissbaker

Good models vs bad models are relative: if this had been released in 2020 it would have been earth-shattering. But releasing a model today that's only on par with open-source dense models a quarter of the size, and soundly beaten by open-source MoEs with active param counts a quarter of the size, is kind of a flop. The niche for this is basically no one. It'll run at near-zero TPS for the few local-model aficionados with enough hardware to try it out, and it's lower throughput and lower quality for people trying to use it at scale.

I'm rooting for Mistral, I want them to release good models. This just isn't one. It's a little sad since they once were so prominent for open-source.

Who knows — if they have the compute to train this, they have the compute to train an MoE that's 3-4T total params with 128B active. Maybe they'll make a comeback (although using Llama 2 attention is... not promising). I hope they do.

2ndorderthought

The point is it's open-weight and tiny compared to a lot of its competitors. 4 GPUs for world-class performance - sweet!

vessenes

As always, rooting for these guys — model and national diversity is great. This looks like a solid foundation to build on; hopefully the 3.6/3.7 will dial in more gains. From the computer-use benchmarks it looks like their vision pipeline could use improvement, but that's just speculation.

The different results on some benchmarks give the vibe that this is truly an independently trained model, not just exfiltrated frontier logs, which I think is also really important - having a different weight architecture in a given model seems like a benefit on its own when viewed from a global systems-architecture perspective.

antirez

The problem with this model is that DeepSeek v4 Flash runs quite well quantized to 2 bits (see https://github.com/antirez/llama.cpp-deepseek-v4-flash), at 30 t/s generation and 400 t/s prefill on an M3 Ultra (and not too much slower on a 128GB MacBook Pro M3 Max). It works as a good coding agent with opencode/pi, tool calling is very reliable, and so forth. All this at a speed that a 120B dense model can never achieve. So this model has to compete not just with models that fit in the same size 4-bit quantized, but with an 86GB GGUF file of DeepSeek v4 Flash, and that is not very easy to win in practical terms for local inference.

Note: I have more uncommitted speed improvements in my tree that I'll push soon, the current tree could be a little bit slower but not much, still super usable.

I don't understand one thing about Mistral, which I'm a fan of, being in Europe: they opened the open-weights MoE show with Mixtral. Why are they now releasing dense models of significant size? This way you don't compete in any credible space, neither local inference nor remote inference, since the model is far from SOTA and not cheap to serve. So why are they training such big dense models? Dense models have a place in the few-tens-of-billions parameter range, as Qwen 3.6 27B shows, but if you go 5 times that, it is no longer a fit, unless your capabilities crush anything else requiring the same VRAM, which is not the case.

zozbot234

Your GitHub link only says "The model quantized in this way behaves very very well in the chat, frontier-model vibes, but it was not extensively tested." This is hardly relevant to how it behaves in agentic workflows; we're aware of how often they degrade severely with Q2 quantization. If this quantized Flash can keep up reasonable quality and performance at larger context lengths (which seems to be a key feature of the V4 series), it could be a very reasonable competitor to models in the same weight class, like Qwen 3 Coder-Next 80B.

antirez

Nope, it works great with opencode as an agent; you can build a game or things like that. It works. The trick is a mix of the quantization I used, which is very asymmetric, and the fact that, I guess, DeepSeek v4 Flash tolerates extreme quantization better than anything I have seen in the past.

What I used was: up/gate of the routed experts at IQ2_XXS, out at Q2_K, then routing, projections, and shared experts quantized to Q8. The trick is that the very sensitive parts are a small fraction of the weights, and they are kept at very high quality.

Mashimo

Compared to all the other hosted LLMs that I have tested, Mistral seems to be the only one with rather strict CSP headers. When you ask it to create a website with some JavaScript library, it will not preview, even though Le Chat offers a canvas mode.

Sometimes when a new release comes around from any provider I just want to test it a bit on the web, without paying and without using an agent harness.

Why are they like this ;_;

Edit: Christ on a bike it's bad at drawing SVGs https://chat.mistral.ai/chat/23214adb-5530-4af9-bb47-90f5219...

SyneRyder

> Edit: Christ on a bike it's bad at drawing SVGs

On the bike would be an improvement. Geez.

I know SVGs may not be the best benchmark, but that matches my experience of trying to run a (previous) Mistral model in Mistral Vibe, asking it to help me configure an MCP server in Vibe. It confidently explained that MCP is the MineCraft Protocol and then began a search of my computer looking for Minecraft binaries.

2ndorderthought

I have never wanted, needed or hoped to draw svgs with an LLM. All of the models suck at it, some are just more fun or something.

Mashimo

I can't speak for what you consider sucking, but there is a significant difference between Mistral and Kimi or Gemini. I find the others to be usable for my needs.

2ndorderthought

I agree there is a difference, but does that translate to anything? They're not the same operations used to write code, and it's kind of useless. I wouldn't waste my power bill ensuring a model I was releasing was good at it.

andai

Claude volunteered this the other day:

https://iili.io/BsfyNXR.jpg

(I think the hair was unintentional, but it is impossible to be sure.)

ffsm8

Are those cherries overlaid on top of boobs, with a bus to the side driving towards a rock? Scnr

deferredgrant

Mistral continuing to ship credible models is good for the market. Buyers need more than a two-company choice if they want pricing and deployment leverage.

postalcoder

This Mistral release really reminds you of the gap between the frontier labs and everyone else.

Pre-agent, there wasn't always an obvious difference between models. Various models had their charms. Nowadays, I don't want to entertain anything less than the frontier models. The difference in capability is enormous and choosing anything less has a real cost in terms of productivity.

I've been a big fan of the smaller labs like Mistral and especially Cohere but it's been a while since I've been excited by a release by either company.

That said, I'm using Mistral's Voxtral realtime daily – it's great.

deaux

Can't agree at all. The productivity gap just 1 year ago was much larger for frontier vs non-frontier models. Let alone 2 years ago.

2ndorderthought

Same. The gap is almost paper thin for anyone who hasn't gone full uninformed vibe coding.

postalcoder

When I was thinking pre-agentic, I was actually thinking more pre-"coding seen as the main use case for these models".

deaux

Coding has always been the main real-world business use case, since day one. There has been no point since the very first public availability of GPT-3.5 in November 2022 when it wasn't.

A lot of us have been doing agentic coding for almost 2 years now, since mid-2024. I have. The productivity gap of "best vs 2nd vs 3rd best model" was biggest back then and has slowly been shrinking ever since.

onlyrealcuzzo

> Pre-agent, there wasn't always an obvious difference between models. Various models had their charms. Nowadays, I don't want to entertain anything less than the frontier models. The difference in capability is enormous and choosing anything less has a real cost in terms of productivity.

It's just apples to oranges.

There is not a clear, across the board, winner on non-agentic tasks between Gemini, ChatGPT, and Claude - the simple chatbot interface.

But Claude Code is substantially better than Codex which itself is notably better than Gemini-cli.

In this vein, it should not be surprising that Claude Code is way better than non-frontier models for agentic coding... It's substantially better than other frontier models at specialized agentic tasks.

philipbjorge

I've been comparing Claude Code and Codex extensively side by side over the past couple of weeks with my favorite prompting framework, superpowers…

From my perspective, Claude Code is decidedly not better than Codex. They’re slightly different and work better together. I would have no issues dropping CC entirely and using codex 100%.

If you’re working off of “defaults”, in other words no custom prompting, Claude Code does perform a lot better out of the box. I think this matters, but if you’re a professional software developer, I’d make the case that you should be owning your tools and moving beyond the baked in prompts.

postalcoder

I think there's a fair amount of evidence that the heavy harnesses actually drag down performance compared to bare harnesses.

nothinkjustai

CC is not better than Codex, nor is it better than OpenCode, Crush, Pi etc…

locknitpicker

> Pre-agent, there wasn't always an obvious difference between models. Various models had their charms. Nowadays, I don't want to entertain anything less than the frontier models.

This is a very naive and misguided opinion. In most tasks, including complex coding tasks, you can hardly tell the difference between a frontier model and something like GPT-4.1. You need to really focus on areas such as context window, tool calling and specific aspects of reasoning steps to start noticing differences. To make matters worse, frontier models are taking a brute-force approach to results, which ends up making them far more expensive to run, both in terms of what shows up on your invoice and how much longer you have to wait to get any semblance of output.

And I won't even go into the topic of local models.

postalcoder

> You need to really focus on areas such as context window, tool calling and specific aspects of reasoning steps to start noticing differences.

This is like saying "the current models and the old models are the same if you ignore every important advance they've made"

locknitpicker

> This is like saying "the current models and the old models are the same if you ignore every important advance they've made"

Please go ahead and list the single most important advance a frontier model has over, say, gpt4.1.

Reasoning is one of the main features, and in practice all it does is waste compute to rewrite your original prompt. See how GPT 5.4 burns through compute by running additional prompts where it acts like your own interpreter, with lengthy reasoning prompts on "the user is asking for (...)" as if you are completely unable to stitch together a usable prompt. That's your frontier model.

seb_lz

I'm using mistral-medium-2508 for some text transformation operations. It's giving me better results than mistral-large for my use cases. Looking forward to testing this new model, although I'm not sure it's really meant to replace the previous medium model, since it's a lot more expensive and presented more as a coding / agentic model (mistral-medium-2508 was priced at $0.40/$2 per 1M tokens; mistral-medium-3.5 is $1.50/$7.50).

hulk-konen

I actually use Mistral Large to go through some large text chunks (in production). It gives about the same level of results as Sonnet, while being 90% cheaper. Definitely wouldn't use it for coding, but for this text-analyzing task it has been great. Much better than all the latest Chinese models, for example.

So I was waiting for this release and it's... 5x more expensive than the latest Mistral Large. So now I'm worried they'll pull the plug on the cheap Large when their releases roll over to that one.

barrell

Yeah I use Mistral Large for a lot of formatting work. For this one use case of mine, it outperforms frontier models by a significant margin. I've found tons of use cases for mistral small as well.

I'd love to use Mistral for more tasks, but Mistral Large doesn't quite cut it for all tasks. So on the one hand, I'm excited there is another model, and presumably more performant based on the price? But the fact it's a "Medium" and 5x the price of the Large definitely concerns me.

The entire release is also about Vibe Coding, and so I'm not even sure if this model is applicable outside of coding, or even worth testing.

zozbot234

Why does this matter if the model is open? It can be offered by competitive third-party providers, there's no rug pull.

hulk-konen

Right now it's really not offered by third parties. I found it via a single provider (BitDeer). I'm not sure I'd trust them with my customers' data. Also, considering the model is getting a bit old, I wouldn't expect them to keep offering it forever.

Anyhow, competition is fierce. I'll have some model I can use in the future, even if it's not dirt cheap like current Mistral Large is.

mark_l_watson

I like the idea of Mistral, but the last time I evaluated Mistral Vibe it was really nice for $15/month yet not as effective as Gemini Plus with AntiGravity and gemini-cli. I am currently running Gemini Ultra on a 3-month 'special deal', and AntiGravity with Opus 4.7 tokens is pretty much fantastic.

That said, when I stop spending money on Gemini Ultra, I will give Mistral Vibe another 1-month test.

I like the entire business model and vibe of Mistral so much more than OpenAI/Anthropic/Google but I also have stuff to get done. I am curious if Mistral Vibe for $15/month is a stable business model (i.e., can they make a profit).

danelski

How do you feel about the responsiveness of gemini-cli? I tried it on a paid plan and the 10-minute hang-ups (per step, not the whole plan execution) really break the illusion of performance gains, unless you run it in the background and do something else in the meantime. It's more noticeable when Americans are awake.

mark_l_watson

It is usually fast, but if gemini-cli or any other coding agent is sluggish, I quit using it for a while.

amunozo

I'm testing it right now and it seems very buggy and unstable, just like before.

mtct88

It's okay, nothing exceptional, but any news about non-US, non-Chinese models is still good news.

pb7

This is the bar for Europe, huh?

deaux

Where are the competitive models from Singapore, Japan, Taiwan, Korea, Russia, Canada, India, the UK? From anywhere that isn't China or the US?

There are none. Mistral Small 4 is Pareto-competitive in its pricing bracket at $0.15/$0.60; at worst it's second to Gemma 4 26B A4B. The above countries have never had a model that is even close to being so.

This particular Mistral Medium looks to be uncompetitive at that pricing. I'm surprised it's so expensive given its size. Wonder if we'll see other providers offer it for cheaper.

But that doesn't mean Mistral has never produced anything useful.

johndough

> Korea

EXAONE from LG AI Research https://huggingface.co/LGAI-EXAONE

They had one of the best small models a few months ago and they released a new model just last week.

There's also HyperCLOVA X (haven't tested it, but maybe it is also good) https://huggingface.co/naver-hyperclovax

> India

India has the Sarvam model series, which admittedly are not SotA, but they have pretty good voice capabilities https://huggingface.co/sarvamai

The UAE (not part of the list above) also has a few noteworthy models: https://huggingface.co/tiiuae

argsnd

DeepMind, which is headquartered in London, probably had a significant role in the development of the Gemini and Gemma models.

Yes, it might be a problem that the UK allows companies like this to be bought up by foreign companies.

pama

What does Pareto competitive mean here? Look at the pricing of the V4-flash model: https://api-docs.deepseek.com/quick_start/pricing

class4behavior

Although the Manus decision might change things for AI, Singapore-washing is quite rampant among Chinese companies, so I wouldn't call this place of origin an alternative market.

amunozo

This is the bar for anybody that's not the frontier labs.

locknitpicker

> This is the bar for Europe, huh?

A few months ago China was being criticized left and right for how it was somehow not able to compete, and once DeepSeek showed up, all the hatred shifted to how China was actually competing but exploiting unfair competitive advantages.

Funny how that works.

Also, aren't the likes of OpenAI burning through over $2 of investment for each $1 of revenue?

pb7

[flagged]

Matl

I mean, at least we're not melting the planet trying to predict the next token that sounds about right.

pb7

Europeans use AI as much as anyone else.

wg0

[flagged]

gadders

[flagged]

saulapremium

[flagged]

pb7

The fact that this comment is still up hours later but my comment below participating in the discussion got flagged should tell one everything they need to know about the intellectual rigor here.

minimaxir

It's funny that 128B is now considered Medium. I remember back in the day when 355M parameters was considered medium with GPT-2.

speedgoose

And GPT-2 1.5B was considered too dangerous to release.

They were perhaps right.

Matl

Considered that by OpenAI for marketing purposes, that is.

But yes, perhaps it would have been better for all of us if they hadn't.

refulgentis

In lockstep over the past month, a subset of people, un-labelable, unprompted, share this train of thought:

- Mythos wasn't released widely.

- But Anthropic shared info on it and said it was dangerous.

- Anthropic is a company.

- Companies like money.

- Therefore Mythos is marketing hype.

- Remember GPT-2? That also wasn't released. They said it was dangerous.

- But, GPT-3, GPT-4, GPT-5, etc. were released.

- Therefore GPT-2 being dangerous was marketing hype.

I've seen the idea that GPT-2 not being released was marketing hype at least 6 times since Mythos was shared.

It's Not Even Wrong, in the Pauli sense: they weren't selling anything! They weren't raising funding! What were they marketing!?

And there's a lot more elided from history, ex. they didn't have an API yet.

GPT-3 was released a year or two later, and it did have an API. But no one used it; it wasn't good enough yet. And they did treat it as dangerous: it was wildly over-the-top manually monitored for anything resembling not-intended use. I got permanently suspended for using the word "twink".

maelito

Given what Vibe already did in the previous versions with codestral-v2, that's great news. Keep up the good work! I don't want to depend on the world's two hungry superpowers.

andhuman

The Vibe CLI is really bad on Windows. Sure, they don't officially support it, so I can't blame them, but FYI for anyone wanting to try it: it can't get find-and-replace right.
