StepFun 3.5 Flash is #1 cost-effective model for OpenClaw tasks (300 battles)

app.uniclaw.ai

james2doyle

None of the Qwen 3.5 models seem present? I’ve heard people are pretty happy with the smaller 3.5 versions. I would be curious to see those too.

I would also be interested to see "KAT-Coder-Pro-V2" as they brag about their benchmarks in these bots as well

Aerroon

If they use OpenRouter pricing then the Qwen3.5 models are going to be poor value.

The Qwen3.5 27B model on OR is $1.56/million tokens out (it used to be $2.4/mil).

Meanwhile Minimax M2.7 (a much larger model) is $1.2/mil out.

The smaller and medium tier Qwen3.5 models are only really cost effective if you run them yourself.

james2doyle

Oh I never noticed that. Good to call out. But that would put it much closer to Minimax M2.7 in terms of price than to the likes of Mimo V2 Pro, and Gemini Flash 3 preview, which are both on the list

p1necone

Is Minimax M2.7 better than Qwen3.5 27B, or is it just bigger?

kdasme

Minimax M2.7 is similar to sonnet in my tests. This is the first non OAI/Anthropic model I use for coding. It does require more steering, though.

Aerroon

Yes, it's significantly better.

ipython

I was excited to read through this to find out how these tasks are evaluated at scale. Lots of scary looking formulas with sigmas and other Greek letters.

Then I clicked on one task to see what it looks like “on the ground”: https://app.uniclaw.ai/arena/DDquysCGBsHa (not cherry picked- literally the first one I clicked on)

The task was:

> Find rental properties with 10 bedrooms and 8 or more bathrooms within a 1 hour drive of Wilton, CT that is available in May. Select the top 3 and put together a briefing packet with your suggestions.

Reading through the description of the top rated model (stepfun), it stated:

> Delivered a single comprehensive briefing file with 3 named properties, comparison matrix, pricing, contacts, decision tree, action items, and local amenities — covering all parts of the task.

Oh cool! Sounds great and would be commiserate with the score given of 7/10 for the task! However- the next sentence:

> Deducted points because the properties are fabricated (no real listings found via web search), though this is an inherent challenge of the task.

So…… in other words, it made a bunch of shit up (at least plausible shit! So give back a few points!) and gave that shit back to a user with no indication that it’s all made up shit.

Ok, closed that tab.

skysniper

I know, that was indeed a bad judge move. I've manually checked tens of tasks so far, and that one is one of the worst... I would say check a few more, judge has some noise but in general did a good job IMO

ipython

Why not re run your analysis with improved judging criteria?

selcuka

Reminded me of the XKCD [1] that points out the problem with average scores.

[1] https://xkcd.com/937/

chrisweekly

"commiserate" - did you mean "commensurate"?

ipython

Sorry, yes. I was typing quickly

creationcomplex

At that point commiserations were in order

WhitneyLand

StepFun is an interesting model.

If you haven’t heard of it yet there’s some good discussion here: https://news.ycombinator.com/item?id=47069179

tarruda

Since that discussion, they released the base model and a midtrain checkpoint:

- https://huggingface.co/stepfun-ai/Step-3.5-Flash-Base

- https://huggingface.co/stepfun-ai/Step-3.5-Flash-Base-Midtra...

I'm not aware of other AI labs that released base checkpoint for models in this size class. Qwen released some base models for 3.5, but the biggest one is the 35B checkpoint.

They also released the entire training pipeline:

- https://huggingface.co/datasets/stepfun-ai/Step-3.5-Flash-SF...

- https://github.com/stepfun-ai/SteptronOss

lostmsu

Tuned Qwen 3.5 27B beats Step 3.5 on almost all benchmarks, so the point about the size class is moot.

tempaccount420

Benchmarks are not interesting in deciding the "size class". Bigger size means more knowledge. Also, the Qwen 3.5 27B is a dense 27B active parameter model. StepFun 3.5 Flash has 11B active parameters.

tarruda

Benchmarks don't tell the whole story. For one-shot coding tasks, I found Step 3.5 Flash to be stronger even than Qwen 3.5 397B.

skysniper

thanks for the info. before running the bench i only tried it in arena.ai type of tasks and it was not impressive. i didn't expect it to be that good at agentic tasks

hadlock

According to openrouter.ai it looks like StepFun 3.5 Flash is the most popular model at 3.5T tokens, vs GLM 5 Turbo at 2.5T tokens. Claude Sonnet is in 5th place with 1.05T tokens. Which isn't super surprising, as StepFun is about 5% the price of Sonnet.

https://openrouter.ai/apps?url=https%3A%2F%2Fopenclaw.ai%2F

NitpickLawyer

> the most popular model

It was free for a long time. That usually skews the statistics. It was the same with grok-code-fast1.

MaxikCZ

Exactly. When I read the headline I thought: "Ofc it is, it's free."

skysniper

I should have clarified I didn't use the free version...

arjie

I used to use these various models for my claw-like and what they had a habit of doing is taking way more agent rounds and way more tokens to produce something that Sonnet would produce from far less. My total cost ended up being the same to do useful things.
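To make that trade-off concrete, here is a rough sketch of the break-even arithmetic. The prices below are made-up placeholders, not measured or quoted figures, and `task_cost` and `break_even_token_multiplier` are hypothetical helper names used only for illustration.

```python
def task_cost(tokens_out: int, price_per_million: float) -> float:
    """Dollar cost of one task, given output tokens and a $/1M-token price."""
    return tokens_out / 1_000_000 * price_per_million


def break_even_token_multiplier(cheap_price: float, expensive_price: float) -> float:
    """How many times more tokens the cheap model can burn before costs match."""
    return expensive_price / cheap_price


# Placeholder prices (assumptions): the cheap model costs 5% of the expensive one.
expensive_price = 15.0  # $/1M output tokens, hypothetical
cheap_price = 0.75      # $/1M output tokens, hypothetical

multiplier = break_even_token_multiplier(cheap_price, expensive_price)
```

Under these assumed prices the cheap model only stays ahead if it needs fewer than 20x the tokens for the same result; a model that loops through many extra agent rounds can use up that margin.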

skysniper

the really surprising part to me is that, despite being the cheapest model on the board, stepfun is often able to score high on pure performance. Other models in the same price range (e.g. kimi) fail to do that.

gunalx

GLM also has its subscription, which I would assume heavy users use.

dmazin

why do half the comments here read like ai trying to boost some sort of scam?

Capricorn2481

Because there's absolutely nothing stopping that from happening. There are bots on Reddit, there are of course bots on here, a VPN friendly site where you don't even need an email. But a lot of people don't want to admit it.

grimm8080

Yet when I tried it, it did abysmally compared to Gemini 2.5 Flash

skysniper

what kind of tasks did you try?

smallerize

It looks like Unsloth had trouble generating their dynamic quantized versions of this model, deleted the broken files, then never published an update.

mgw

Missing from the comparison is MiMo V2 Flash (not Pro), which I think could put up a good fight against Step 3.5 Flash.

Pricing is essentially the same:

- MiMo V2 Flash: $0.09/M input, $0.29/M output

- Step 3.5 Flash: $0.10/M input, $0.30/M output

MiMo has 41 vs 38 for Step on the Artificial Analysis Intelligence Index, but it's 49 vs 52 for Step on their Agentic Index.
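As a rough sketch of how close those prices are in practice, the snippet below applies the per-million rates quoted above to a single run. The token counts are an invented workload, not benchmark data, and `run_cost` is a hypothetical helper name.

```python
# $/1M tokens as quoted in this comment: (input, output).
PRICES = {
    "MiMo V2 Flash": (0.09, 0.29),
    "Step 3.5 Flash": (0.10, 0.30),
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one run for the given model and token counts."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical agentic run: 2M input tokens, 200k output tokens (assumption).
costs = {name: run_cost(name, 2_000_000, 200_000) for name in PRICES}
```

With this assumed workload the two models land within a few cents of each other, so the comparison would come down to how many tokens each one needs to finish a task.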

skysniper

I will try to add it. But I doubt it will do well, because Mimo V2 Pro is beaten by stepfun even on the performance leaderboard (price is not a factor in that leaderboard), so I expect MiMo V2 Flash to perform even worse.

ygouzerh

Mimo V2 Pro seems to be widely used according to OpenRouter's stats (second after Stepfun), so it would indeed be interesting to see the difference!

https://openrouter.ai/apps?url=https%3A%2F%2Fopenclaw.ai%2F

Mimo Flash matched Mimo Pro on https://sql-benchmark.nicklothian.com/?#all-data at double the speed and for $0.003 instead of $0.07

throwa356262

Interesting, I found the pro version to be very capable.

If stepfun is even better, then Chinese models are getting really good.

azmenak

This model is free to use, and has been for quite some time on OpenRouter. $0 is pretty hard to beat in terms of cost effectiveness.

skysniper

yeah but i'm not using the free version for benchmark...

clausewitz

I'm not seeing Deepseek mentioned very often, which I've been using for Openclaw, very cheaply I might add, with great success. I think I loaded $10 to my account 2 months ago and I still haven't needed to top up.

wg0

Which deepseek exactly and what do you use it for? Just curious.

skysniper

another thing from the bench I didn't expect: gemini 3.1 pro is very unreliable at using skills. sometimes it just reads the skill and decides to do nothing, while opus/sonnet 4.6 and gpt 5.4 never have this issue.

throwa356262

Gemini 2.5 pro was the best Gemini, it has gone downhill since

hypercube33

I used sonnet and opus 4.6 for a month and it flat out ignored skills and rules and when asked it said it knew better or was lazy.

zhangchen

[flagged]

sunaookami

Tried the free version on OpenRouter with pi.dev and it's competent at tool calling and creative writing is "good enough" for me (more "natural Claude-level" and not robotic GPT-slop level) but it makes some grave mistakes (had some Hanzi in the output once and typos in words) so it may be good with "simple" agentic workflows but it's definitely not made for programming nor made for long writing.

admiralrohan

What kind of creative writing are you doing? Fiction or non-fiction like blog posts?

sunaookami

Fiction. One of my "benchmarks" is giving the model a bunch of (self-made) text and having it simulate a 4chan thread about it. This tests tool use (calling the APIs), some skills, censorship and general creativity. Some models refuse every new turn after reading real 4chan threads ;) Claude is especially good at this, surprisingly, while GPT fails spectacularly and Gemini is just lazy (and barely usable since it's constantly overloaded). Qwen (coder-model from Qwen CLI, so Qwen 3.5) is also very good but sadly not usable in Pi (they detect and block calls outside their CLI).

admiralrohan

Interesting. Are you running something like an Autoresearch loop for writing fiction? How will the agent determine whether the output is good, given that this is subjective?

skysniper

it's actually pretty good at openclaw type of tasks for non technical users: lots of tool calls, some simple programming

sunaookami

Yeah this kind of stuff. I have no experience with OpenClaw though.
