
extr

Nice release. Part of the problem right now with OSS models (at least for enterprise users) is the diversity of offerings in terms of:

- Speed

- Cost

- Reliability

- Feature Parity (eg: context caching)

- Performance (What quant level is being used...really?)

- Host region/data privacy guarantees

- LTS

And that's not even including the decision of what model you want to use!

Realistically if you want to use an OSS model instead of the big 3, you're faced with evaluating models/providers across all these axes, which can require a fair amount of expertise to discern. You may even have to write your own custom evaluations. Meanwhile Anthropic/OAI/Google "just work" and you get what it says on the tin, to the best of their ability. Even if they're more expensive (and they're not that much more expensive), you are basically paying for the privilege of "we'll handle everything for you".

I think until providers start standardizing OSS offerings, we're going to continue to exist in this in-between world where OSS models theoretically are at performance parity with closed source, but in practice aren't really even in the running for serious large scale deployments.
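
The evaluation problem above can be made concrete with a toy scoring harness. Everything here is hypothetical (provider names, numbers, weights); a real eval needs task-specific benchmarks, not a weighted checklist, but it shows the shape of the comparison:

```python
# Toy multi-axis scoring of OSS model hosts. All names/weights are
# illustrative, not real providers or real prices.
from dataclasses import dataclass

@dataclass
class ProviderEval:
    name: str
    tokens_per_sec: float    # speed
    usd_per_mtok: float      # cost, USD per million tokens
    uptime_pct: float        # reliability
    has_context_caching: bool  # feature parity
    quant_bits: float        # performance proxy: advertised quantization
    in_region: bool          # data residency

def score(p: ProviderEval) -> float:
    # Arbitrary weights, purely for illustration.
    s = 0.0
    s += min(p.tokens_per_sec / 100, 1.0) * 2   # speed, capped
    s += max(0.0, 1 - p.usd_per_mtok / 10)      # cheaper is better
    s += (p.uptime_pct - 99.0) * 2              # reliability above 99%
    s += 1.0 if p.has_context_caching else 0.0
    s += p.quant_bits / 16                      # fp16 reference = 1.0
    s += 1.0 if p.in_region else 0.0
    return round(s, 2)

candidates = [
    ProviderEval("host-a", 120, 0.5, 99.9, True, 4.5, True),
    ProviderEval("host-b", 60, 0.2, 99.5, False, 8.0, False),
]
best = max(candidates, key=score)
print(best.name, score(best))
```

The point is less the arithmetic than that every axis needs a number you have to measure yourself; the big-3 APIs let you skip the whole table.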

coderatlarge

true, but that ignores handing over all your prompt traffic without any real legal protections, as sama has pointed out:

[1] https://californiarecorder.com/sam-altman-requires-ai-privil...

I_am_tiberius

I wouldn't be surprised if those undeleted chats, or some inferred data based on them, are part of the gpt-5 training data. Somehow I don't trust this sama guy at all.

supermatt

> OpenAI confirmed it has been preserving deleted and non permanent person chat logs since mid-Might 2025 in response to a federal court docket order

> The order, embedded under and issued on Might 13, 2025, by U.S. Justice of the Peace Decide Ona T. Wang

Is this some meme where “may” is being replaced with “might”, or some word substitution gone awry? I don’t get it.

SickOfItAll

Clearly the author wrote the article with multiple uses of "may" and then used find/replace to change them to "might" without proofreading.
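
That theory is easy to reproduce: a blanket Replace All has no notion of word boundaries or proper nouns, so the month gets mangled exactly as quoted. An illustrative snippet:

```python
import re

# A context-blind Replace All mangles the month "May" too.
headline = ("OpenAI has been preserving chat logs since mid-May 2025, "
            "per an order issued on May 13, 2025.")
mangled = headline.replace("May", "Might").replace("may", "might")
print(mangled)
# -> "OpenAI has been preserving chat logs since mid-Might 2025, per an order issued on Might 13, 2025."

# A word-boundary-aware, lowercase-only replace leaves the month alone:
safe = re.sub(r"\bmay\b", "might", headline)
print(safe == headline)
# -> True, since the only "may"s here are the capitalized month
```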

wkat4242

Yeah noticed this too. Really weird for a professional publication

kekebo

:)) Apparently. I don't have a better guess. Well spotted

beowulfey

auto correct gone awry

mattmaroon

Or May in another language?

wkat4242

Gpt-oss comes only in 4.5 bit quant. This is the native model, so there's no fp16 original

jnmandal

I see a lot of hate for ollama doing this kind of thing but also they remain one of the easiest to use solutions for developing and testing against a model locally.

Sure, llama.cpp is the real thing, ollama is a wrapper... I would never want to use something like ollama in a production setting. But if I want to quickly get someone less technical up to speed to develop an LLM-enabled system and run qwen or w/e locally, well then it's pretty nice that they have a GUI and a .dmg to install.

mchiang

Thanks for the kind words.

Since the new multimodal engine, Ollama has moved off of llama.cpp as a wrapper. We do continue to use the GGML library, and ask hardware partners to help optimize it.

Ollama might look like a toy, and what we build might look trivial. I can say that to keep its simplicity, we go through a great deal of struggle to make it work with the experience we want.

Simplicity is often overlooked, but we want to build the world we want to see.

dcreater

But Ollama is a toy, it's meaningful for hobbyists and individuals to use locally like myself. Why would it be the right choice for anything more? AWS, vLLM, SGLang etc would be the solutions for enterprise

I knew a startup that deployed ollama on a customer's premises and when I asked them why, they had absolutely no good reason. Likely they did it because it was easy. That's not the "easy to use" case you want to solve for.

mchiang

I can say, having tried many inference tools after the launch, that many do not have the models implemented well, especially OpenAI's harmony.

Why does this matter? For this specific release, we benchmarked against OpenAI's reference implementation to make sure Ollama is on par. We also spent a significant amount of time getting harmony implemented the way it was intended.

I know vLLM also worked hard to implement against the reference and has shared their benchmarks publicly.

jnmandal

Honestly, I think it just depends. A few hours ago I wrote that I would never want it for a production setting, but actually, if I was standing something up myself and could just download headless ollama and know it would work? Hey, that would most likely also be fine. Maybe later on I'd revisit it from a devops perspective and refactor the deployment methodology/stack, etc. Maybe I'd benchmark it and realize it's fine actually. Sometimes you just need to make your whole system work.

We can obviously disagree with their priorities, their roadmap, the fact that the client isn't FOSS (I wish it was!), etc, but no one can say that ollama doesn't work. It works. And like mchiang said above: it's dead simple, on purpose.

leopoldj

> Ollama has moved off of llama.cpp as a wrapper. We do continue to use the GGML library

Where can I learn more about this? llama.cpp is an inference application built using the ggml library. Does this mean Ollama now has its own code for what llama.cpp does?

buyucu

This kind of gaslighting is exactly why I stopped using Ollama.

GGML library is llama.cpp. They are one and the same.

Ollama made sense when llama.cpp was hard to use. Ollama does not have a value proposition anymore.

mchiang

It’s a different repo. https://github.com/ggml-org/ggml

The models are implemented by Ollama https://github.com/ollama/ollama/tree/main/model/models

I can say as a fact, for the gpt-oss model, we also implemented our own MXFP4 kernel. Benchmarked against the reference implementations to make sure Ollama is on par. We implemented harmony and tested it. This should significantly impact tool calling capability.

I'm not sure if I'm feeding here. We really love what we do, and I hope it shows in our product, in Ollama's design and in our voice to our community.

You don’t have to like Ollama. That’s subjective to your taste. As a maintainer, I certainly hope to have you as a user one day. If we don’t meet your needs and you want to use an alternative project, that’s totally cool too. It’s the power of having a choice.

scosman

> GGML library is llama.cpp. They are one and the same.

Nope…

steren

> I would never want to use something like ollama in a production setting.

We benchmarked vLLM and Ollama on both startup time and tokens per second. Ollama comes out on top. We hope to be able to publish these results soon.

ekianjo

you need to benchmark against llama.cpp as well.

apitman

Did you test multi-user cases?

jasonjmcghee

Assuming this is equivalent to parallel sessions, I would hope so, this is like the entire point of vLLM

sbinnee

vLLM and ollama assume different settings and hardware. vLLM, backed by paged attention, expects a lot of requests from multiple users, whereas ollama is usually for a single user on a local machine.

romperstomper

It is weird, but when I tried the new gpt-oss:20b model locally, llama.cpp just failed instantly for me. At the same time, under ollama it worked (very slowly, but anyway). I didn't figure out how to deal with llama.cpp, but ollama is definitely doing something under the hood to make models work.

miki123211

> I would never want to use something like ollama in a production setting

If you can't get access to "real" datacenter GPUs for any reason and essentially do desktop, clientside deploys, it's your best bet.

It's not a common scenario, but a desktop with a 4090 or two is all you can get in some organizations.

moralestapia

Ollama is great but I feel like Georgi Gerganov deserves way more credit for llama.cpp.

He (almost) single-handedly brought LLMs to the masses.

With the latest news of some AI engineers' compensation reaching up to a billion dollars, feels a bit unfair that Georgi is not getting a much larger slice of the pie.

mrs6969

Agreed. Ollama itself is kind of a wrapper around llamacpp anyway. Feels like the real guy is not included in the process.

Now I am going to go and write a wrapper around llamacpp, that is only open source, truly local.

How can I trust ollama not to sell my data?

Patrick_Devine

Ollama only uses llamacpp for running legacy models. gpt-oss runs entirely in the ollama engine.

You don't need to use Turbo mode; it's just there for people who don't have capable enough GPUs.

rafram

Ollama is not a wrapper around llama.cpp anymore, at least for multimodal models (not sure about others). They have their own engine: https://ollama.com/blog/multimodal-models

iphone_elegance

looks like the backend is ggml, am I missing something? same diff

benreesman

`ggerganov` is one of the most under-rated and under-appreciated hackers maybe ever. His name belongs next to like Carmack and other people who made a new thing happen on PCs. And don't forget the shout out to `TheBloke` who like single-handedly bootstrapped the GGUF ecosystem of useful model quants (I think he had a grant from pmarca or something like that, so props to that too).

freedomben

Is Georgi landing any of those big-time money jobs? I could see a conflict of interest given his involvement with llama.cpp, but I would think he'd be well positioned for something like that.

apwell23

https://ggml.ai/

> ggml.ai is a company founded by Georgi Gerganov to support the development of ggml. Nat Friedman and Daniel Gross provided the pre-seed funding.

moralestapia

(This is mere speculation)

I think he's happy doing his own thing.

But then, if someone came in with a billion ... who wouldn't give it a thought?

webdevver

really a billion bucks is far too much, that is beyond the curve.

$50M, now that's just perfect. You're retired, not burdened with a huge responsibility.

am17an

Seriously, people are astroturfing this thread by saying ollama has a new engine. It literally is the same engine that llama.cpp uses, which Georgi and slaren maintain! VC funding will make people so dishonest and just plain grifters.

guipsp

No one is astroturfing. You cannot run any model with just GGML. It's a tensor library. Yes, it adds value, but I don't think it's unfair to say that ollama does too.

jasonjmcghee

Interested to see how this plays out - I feel like Ollama is synonymous with "local".

Aurornis

There's a small but vocal minority of users who don't trust big companies, but don't mind paying small companies for a similar service.

I'm also interested to see if that small minority of people are willing to pay for a service like this.

jillesvangurp

The issue is not companies but governance. OSS licenses and companies are fine. Companies have a natural conflict of interest that can lead them to take software projects they control in a direction that suits their revenue goals but not necessarily the needs/wants of their users. That happens over and over again. It's their nature. This can mean changes in direction/focus or, worst case, license changes that limit what you can do.

The solution is having proper governance for OSS projects that matter, with independent organizations made up of developers, companies, and users taking care of the governance. A lot of projects that have that have lasted for decades and will likely survive for decades more.

And part of that solution is to also steer clear of projects without that. I've been burned a couple of times now getting stuck with OSS components where the license was changed and the companies behind it had their little IPOs and started serving shareholders instead of users (elastic, redis, mongo, etc). I only briefly used Mongo and I got a whiff of where things were going and just cut loose from it. With Elastic the license shenanigans started shortly after their IPO and things have been very disruptive to the community (with half using Opensearch now). With Redis I planned the switch to Valkey the second it was announced. Clear cut case of cutting loose. Valkey looks like it has proper governance. Redis never had that.

Ollama seems relatively OK by this benchmark. The software (ollama server) is MIT licensed and there appears to be no contributor license agreement in place. But it's a small group of people that do most of the coding and they all work for the same vc funded company behind ollama. That's not proper governance. They could fail. They could relicense. They could decide that they don't like open source after all. Etc. Worth considering before you bet your company on making this a foundational piece of your tech stack.

recursivegirth

Ollama, run by Facebook. Small company, huh.

mchiang

Ollama is not run by Facebook. We are a small team building our dreams.

threetonesun

I view it a bit like I do cloud gaming, 90% of the time I'm fine with local use, but sometimes it's just more cost effective to offload the cost of hardware to someone else. But it's not an all-or-nothing decision.

theshrike79

Yep, if you just want to play one or two games at 4k HDR etc. it's a lot cheaper to pay 22€ for GeForce Now Ultimate vs. getting a whole-ass gaming PC capable of the same.

liuliu

Any more information on "Privacy first"? It seems pretty thin if it's just not retaining data.

For the "Cloud Compute" provided in Draw Things, we don't retain any data either (everything is done in RAM per request). But that is still unsatisfactory, personally. We will soon add "privacy pass" support, but that's still not satisfactory either. A transparency log that can be attested on the hardware would be nice (since we run our open-source gRPCServerCLI too), but I just don't know where to start.
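
For what it's worth, the core of the transparency-log idea is just an append-only log where every entry commits to everything before it, so rewriting history changes every later head. A toy hash-chain sketch (real systems like Certificate Transparency use Merkle trees with signed tree heads, and hardware attestation is a separate layer on top):

```python
import hashlib

def _h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class HashChainLog:
    """Append-only log: each head hashes the previous head + new record."""

    def __init__(self):
        self.entries = []
        self.head = _h(b"genesis")

    def append(self, record: str) -> str:
        self.head = _h((self.head + record).encode())
        self.entries.append((record, self.head))
        return self.head

log = HashChainLog()
log.append("server booted image sha256:abc...")
head = log.append("request served, no data retained")
print(head)  # tamper with the first record and this head changes
```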

pagekicker

I see no privacy advantage to working with Ollama, which can sell your data or have it subpoenaed just like anyone else.

liuliu

In theory, "privacy pass" should help: you can subpoena content, but cannot know who made it. But that is still thin (and Ollama is not doing that anyway).

jmort

I don't see a privacy policy and their desktop app is closed source. So, not encouraging.

[full disclosure I am working on something with actual privacy guarantees for LLM calls that does use a transparency log, etc.]

pbronez

I’d love to learn more about your project. I’m using socialized cloud regions for AI security and they really lag the mainstream. Definitely need more options here.

Edit: emailed the address on the site in your profile, got an inbox does not exist error.

pogue

I would pay more if they let you run the models in Switzerland or some other GDPR respecting country, even if there was extra latency. I would also hope everything is being sent over SSL or something similar.

seanmcdirmid

I had to do a double take here. Switzerland surely isn’t in the GDPR, so you mean their own privacy laws or GDPR in the EU?

jacekm

What could be the benefit of paying $20 to Ollama to run inferior models instead of paying the same amount of money to e.g. OpenAI for access to sota models?

daft_pink

I feel the primary benefit of this Ollama Turbo is that you can quickly test and run different models in the cloud that you could run locally if you had the correct hardware.

This allows you to try out some open models and better assess if you could buy a dgx box or Mac Studio with a lot of unified memory and build out what you want to do locally without actually investing in very expensive hardware.

Certain applications require good privacy control and on-prem and local are something certain financial/medical/law developers want. This allows you to build something and test it on non-private data and then drop in real local hardware later in the process.

jerieljan

> quickly test and run different models in the cloud that you could run locally if you had the correct hardware.

I feel like they're competing against Hugging Face or even Colaboratory then if this is the case.

And for cases that require strict privacy control, I don't think I'd run it on emergent models or if I really have to, I would prefer doing so on an existing cloud setup already that has the necessary trust / compliance barriers addressed. (does Ollama Turbo even have their Trust center up?)

I can see its potential once it gets rolling, since there's a lot of ollama installations out there.

fluidcruft

Me at home: $20/mo while I wait for a card that can run this or dgx box? Decisions, decisions.

dawnerd

Quickly test… the two models they support? This is just another subscription to quantized models.

daft_pink

it looks like the plan is to support way more models though. gotta start somewhere.

rapind

I'm not sure the major models will remain at $20. Regardless, I support any and all efforts to keep the space crowded and competitive.

adrr

Running models without a filter on them. OpenAI has an overzealous filter and won't even tell you what you violated. So you have to do a dance with prompts to see if it's copyright, trademark or whatever. Recently it just refused to answer my questions and said it wasn't true that a civil servant would get fired for releasing a report per their job duties. Another dance sending it links to stories showing it was true so it could answer my question. I want an LLM without training wheels.

michelsedgh

I think data privacy is the main point, and probably more usage before you hit limits? But mainly data privacy, I guess.

ibejoeb

I run a lot of mundane jobs that work fine with less capable models, so I can see the potential benefit. It all depends on the limits though.

_--__--__

Groq seems to do okay with a similar service but I think their pricing is probably better.

woadwarrior01

Groq's moat is speed, using their custom hardware.

Geezus_42

Yeah, the Nazi sex bot will be great for business!

fredoliveira

Groq (the inference service) != Grok (xAI's model)

gabagool

You are thinking of Elon Grok, not Groq

AndroTux

Privacy, I guess. But at this point it’s just believing that they won’t log your data.

dcreater

Called it.

It's very unfortunate that the local inference community has aggregated around Ollama when it's clear that's not their long term priority or strategy.

It's imperative we move away ASAP.

tarruda

Llama.cpp (the library which ollama uses under the hood) has its own server, and it is fully compatible with open-webui.

I moved away from ollama in favor of llama-server a couple of months ago and never missed anything, since I'm still using the same UI.

mchiang

totally respect your choice, and it's a great project too. Of course as a maintainer of Ollama, my preference is to win you over with Ollama. If it doesn't meet your needs, it's okay. We are more energized than ever to keep improving Ollama. Hopefully one day we will win you back.

Ollama does not use llama.cpp anymore; we do still keep it and occasionally update it to remain compatible with older models from when we used it. The team is great; we just have features we want to build, and want to implement the models directly in Ollama. (We do use GGML and ask partners to help optimize it. This is a project that also powers llama.cpp and is maintained by that same team.)

am17an

I’ve never seen a PR on ggml from Ollama folks though. Could you mention one contribution you did?

kristjansson

> Ollama does not use llama.cpp anymore;

> We do use GGML

Sorry, but this is kind of hiding the ball. You don't use llama.cpp, you just ... use their core library that implements all the difficult bits, and carry a patchset on top of it?

Why do you have to start with the first statement at all? "we use the core library from llama.cpp/ggml and implement what we think is a better interface and UX. we hope you like it and find it useful."

tarruda

> Ollama does not use llama.cpp anymore

That is interesting, did Ollama develop its own proprietary inference engine or did you move to something else?

Any specific reason why you moved away from llama.cpp?

daft_pink

So I’m using turbo and just want to provide some feedback. I can’t figure out how to connect raycast and project goose to ollama turbo. The software that calls it essentially looks for the models via ollama but cannot find the turbo ones and the documentation is not clear yet. Just my two cents, the inference is very quick and I’m happy with the speed but not quite usable yet.

halJordan

Fully compatible is a stretch; it's important we don't fall into a celebrity "my guy is perfect" trap. They implement a few endpoints.

jychang

They implement more openai-compatible endpoints than ollama at least

benreesman

I won't use `ollama` on principle. I use `llama-cli` and `llama-server` if I'm not linking `ggml`/`gguf` directly. It's like, two extra commands to use the one by the genius that wrote it and not the one from the guys who just jacked it.

The models are on HuggingFace and downloading them is `uvx huggingface-cli`; the `GGUF` quants were `TheBloke`'s (with a grant from pmarca IIRC) for ages and now everyone does them (`unsloth` does a bunch of them).

Maybe I've got it twisted, but it seems to be that the people who actually do `ggml` aren't happy about it, and I've got their back on this.

om8

It’s unfortunate that llama.cpp’s code is a mess. It’s impossible to make any meaningful contributions to it.

kristjansson

I'm the first to admit I'm not a heavy C++ user, so I'm not a great judge of the quality looking at the code itself ... but ggml-org has 400 contributors on ggml, 1200 on llama.cpp and has kept pace with ~all major innovations in transformers over the last year and change. Clearly some people can and do make meaningful contributions.

A4ET8a8uTh0_v2

Interesting. Admittedly, I am slowly getting to the point where ollama's defaults get a little restrictive. If the setup is not too onerous, I would not mind trying. Where did you start?

tarruda

Download llama-server from the llama.cpp GitHub and install it in some PATH directory. AFAIK they don't have an automated installer, so that can be intimidating to some people.

Assuming you have llama-server installed, you can download + run a hugging face model with something like

    llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 0 -fa --jinja

And access http://localhost:8080
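
llama-server also exposes an OpenAI-compatible API on that port. A stdlib-only sketch of a chat request (it falls back to printing a message when no server is running, so it's safe to try cold):

```python
import json
import urllib.error
import urllib.request

# llama-server's OpenAI-compatible chat endpoint on its default port.
url = "http://localhost:8080/v1/chat/completions"
payload = {"messages": [{"role": "user", "content": "Say hello in one word."}]}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=5) as resp:
        reply = json.load(resp)["choices"][0]["message"]["content"]
        print("model said:", reply)
except (urllib.error.URLError, OSError):
    print("llama-server not reachable at", url)
```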

theshrike79

Isn't the open-webui maintainer heavily against MCP support and tool calling?

mchiang

hmm, how so? Ollama is open and the pricing is completely optional for users who want additional GPUs.

Is it bad to fairly charge money for selling GPUs that cost us money too, and use that money to grow the core open-source project?

At one point, it just has to be reasonable. I'd like to believe that by having a conscience, we can create something great.

dcreater

First, I must say I appreciate you taking the time to be engaged on this thread and responding to so many of us.

What I'm referring to is a broader pattern that I (and several others) have been seeing. Off the top of my head: not crediting llama.cpp previously; still not crediting llama.cpp now, and saying you are using your own inference engine when you are still using ggml and the core of what Georgi made (most importantly, why even create your own version? Is it not better for the community to just contribute to llama.cpp?); making your own proprietary model storage platform, disallowing using weights with other local engines and requiring people to duplicate downloads; and more.

I don't know how to regard these other than as largely motivated by self-interest.

I think what Jeff and you have built has been enormously helpful to us. Ollama is how I got started running models locally, and I have enjoyed using it for years now. For that, I think you guys should be paid millions. But what I fear is going to happen is that you guys will go the way of the current dogma of capturing users (at least in mindshare) and then continually squeezing more. I would love to be wrong, but I am not going to stick around to find out, as it's a risk I cannot take.

tomrod

Everyone just wants to solarpunk this up.

dcreater

In an ideal world, yes, as we should; especially for us Californian/Bay Area people, that's literally our spirit animal. But I understand that is idle dreaming. What I believe certainly is within reach is a state that is much better than the one we are in.

sitkack

I believe that is what https://github.com/containers/ramalama set out to do.

janalsncm

Huggingface also offers a cloud product, but that doesn’t take away from downloading weights and running them locally.

idiotsecant

Oh no this is a positively diabolical development, offering...hosting services tailored to a specific use case at a reasonable price ...

SV_BubbleTime

They can’t keep getting away with this.

mrcwinn

Yes, better to get free sh*t unsustainably. By the way, you're free to create an open source alternative and pour your time into that so we can all benefit. But when you don't — remember I called it!

rpdillon

What? The obvious move is to never have switched to Ollama and just use Llama.cpp directly, which I've been doing for years. Llama.cpp was created first, is the foundation for this product, and is actually open source.

wkat4242

But there's much less that works with that. OpenWebUI for example.

Aurornis

> Its imperative we move away ASAP

Why? If the tool works then use it. They’re not forcing you to use the cloud.

dcreater

There are many, many FOSS apps that use Ollama as a dependency. If Ollama rugs, then all those projects suffer.

It's a tale we've seen played out many times. Redis is the most recent example.

Hasnep

Most apps that integrate with ollama that I've seen just have an OpenAI compatible API parameter which defaults to port 11434 which ollama uses, but can be changed easily. Is there a way to integrate ollama more deeply?
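
In practice, most such integrations just build requests from an OpenAI-compatible base URL, so switching backends is a one-line config change. A sketch (the local ports are the servers' documented defaults; backend names here are just illustrative keys):

```python
# Swapping inference backends is usually just swapping the base URL.
BACKENDS = {
    "ollama":       "http://localhost:11434/v1",  # ollama's OpenAI-compat API
    "llama-server": "http://localhost:8080/v1",   # llama.cpp's server
    "openai":       "https://api.openai.com/v1",
}

def chat_endpoint(backend: str) -> str:
    """Build the chat completions URL for a configured backend."""
    return BACKENDS[backend].rstrip("/") + "/chat/completions"

print(chat_endpoint("ollama"))
# -> http://localhost:11434/v1/chat/completions
```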

prettyblocks

Local inference is becoming completely commoditized imo. These days even Docker has local models you can launch with a single click (or command).

captainregex

I am so so so confused as to why Ollama of all companies did this, other than an emblematic stab at making money, perhaps to appease someone putting pressure on them to do so. Their stuff does a wonderful job of enabling local for those who want it. So many things to explore there, but instead they stand up yet another cloud thing? Love Ollama and hope it stays awesome.

janalsncm

The problem is that OSS is free to use but it is not free to create or maintain. If you want it to remain free to use and also up to date, Ollama will need someone to address issues on GitHub. Usually people want to be paid money for that.

captainregex

money is great! I like money! but if this is their version of buy me a coffee I think there’s room to run elsewhere for their skillset/area of expertise

mchiang

hmm, I don't think so. This is more of, we want to keep improving Ollama so we can have a great core.

For the users who want GPUs, which cost us money, we will charge money for it. Completely optional.

ahmedhawas123

So much that is interesting about this

For one of the top local open-model inference engines of choice, only supporting gpt-oss out of the gate feels like an angle to just ride the hype, knowing gpt-oss was announced today: "oh, gpt-oss came out and you can use Ollama Turbo to run it".

The subscription-based pricing is really interesting. Other players offer this, but not for API-type services. I always imagined that there would be a real pricing war with LLMs over time / as capabilities mature, and moving to monthly pricing on API services is possibly a symptom of that.

What does this mean for the local inference engine? Does Ollama have enough resources to maintain both?

timmg

It says “usage-based pricing” is coming soon. I think that is the sweet spot for a service like this.

I pay $20 to Anthropic, so I don’t think I’d get enough use out of this for the $20 fee. But being able to spin up any of these models and use as needed (and compare) seems extremely useful to me.

I hope this works out well for the team.

ac29

> It says “usage-based pricing” is coming soon. I think that is the sweet spot for a service like this.

Agreed, though there are already several providers of these new OpenAI models available, so I'm not sure what ollama's value add is there (there are plenty of good chat/code/etc interfaces available if you are bringing your own API keys).

wongarsu

A flat fee service for open-source LLMs is somewhat unique, even if I don't see myself paying for it.

Usage-based pricing would put them in competition with established services like deepinfra.com, novita.ai, and ultimately openrouter.ai. They would go in with more name-recognition, but the established competition is already very competitive on pricing

Aeolun

I mean $20/month for API access is definitely new.

paxys

A subscription fee for API usage is definitely an interesting offering, though the actual value will depend on usage limits (which are kept hidden).

mchiang

we are learning the usage patterns to be able to price this more properly.

turnsout

Man, busy day in the world of AI announcements! This looks coordinated with OpenAI, as it launches with `gpt-oss-20b` and `gpt-oss-120b`

sambaumann

Yep, on the ollama home page (https://ollama.com/) it says

> OpenAI and Ollama partner to launch gpt-oss

hobofan

I do hope Ollama got a good paycheck from that, as they are essentially helping OpenAI oss-wash its image with the goodwill that Ollama has built up.


Ollama Turbo - Hacker News