adeon
slabity
> Is Python so annoying to use that when a compelling non-Python solution appears, everyone will love it? Less hassle? Or did it take off for a different reason? Interested in hearing thoughts.
For me it's less about the language and more about the dependencies. When I want to run a Python ML program, I need to go through the hassle of figuring out what the minimum version is, which package manager/distribution I need to use, and what system libraries those dependencies need to function properly. If I want to build a package for my distribution, these problems are dialed up to 11 and make it difficult to integrate (especially when using Nix). On top of that, those dependencies typically hide the juicy details of the program I actually care about.
For something like C or C++? Usually the most complicated part is running `make` or `cmake` with `pkgconfig` somewhere in my path. Maybe install some missing system libraries if necessary.
I just don't want to install yet another hundred copies of dependencies in a virtualenv and just hope it's set up correctly.
xingped
> For me it's less about the language and more about the dependencies. When I want to run a Python ML program, I need to go through the hassle of figuring out what the minimum version is, which package manager/distribution I need to use, and what system libraries those dependencies need to function properly.
This is exactly why I hate Python. They even have a pseudo-package.json-style dependencies file that you should supposedly be able to just run "install" against, but it NEVER works. Not once have I downloaded someone's Python project from GitHub, installed the dependencies, and run it without hitting some issue.
The Python language itself may be great, I don't know, but I'm forever put off from learning or using it because clearly in all the years it's been around they have yet to figure out reproducibility of builds. And it's obviously possible! JavaScript manages to accomplish it just fine with npm and package.json. But for some reason Python and its community cannot figure it out.
tryauuum
Would the problem of "how can I run this cool ML project from GitHub" be solved if developers published their container images on Docker Hub? The only downside I see is enormous image sizes.
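The consumer side would then be roughly the following (image name here is made up; `--gpus all` assumes the host has the NVIDIA container toolkit set up):

    docker pull example/cool-ml-tool:latest      # hypothetical image name
    docker run --rm --gpus all -v "$PWD/models:/models" example/cool-ml-tool:latest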
KETpXDDzR
I use Poetry for dependency management in Python projects. It helps a lot, but it doesn't solve the problem of pip packages failing to install because you're missing system libraries. At least it specifies the Python version to use; with many open source ML repos I have to guess which Python version to use.
I'd really like to see more Docker images (images, not Dockerfiles that fail to build). Maybe Flatpak or Snap packages would do the trick, too.
buzzert
> Not once have I downloaded someone's Python project from GitHub, installed the dependencies, and run it without hitting some issue.
Same here. If it can't resolve dependencies or whatever, then there will almost certainly be some kind of showstopping runtime error (probably because of API changes or something). I avoid Python programs at all costs nowadays.
rubicks
This.
`pip` is the package manager that almost works.
Python is the language that almost supports package distribution.
I'll keep using `apt` on vanilla Debian.
troad
Strong agree. I'll willingly install a handful of dependencies from my distro package manager, where the dependencies are battle-hardened Unixy tools and I can clearly see what they do and how they do it.
I'm not going to install thousands of dodgy-looking packages from pip, the only documentation for which is a Discord channel full of children exchanging 'dank memes'.
I like Python, but I simply do not trust the pip ecosystem at this point (same for npm, etc.).
barbariangrunge
> I'm not going to install thousands of dodgy-looking packages from pip, the only documentation for which is a Discord channel full of children exchanging 'dank memes'.
This made me laugh. It’s true, isn’t it? That’s really what we deal with day to day (for me in the JS world, the create-react-app dependencies make my head spin).
the__alchemist
> For something like C or C++? Usually the most complicated part is running `make` or `cmake` with `pkgconfig` somewhere in my path. Maybe install some missing system libraries if necessary.
I don't mean to detract from your main point re Python dependencies, but I find this to be rarely true for C. `make` etc build flows usually result in dependency-related compile errors, or may require a certain operating system. I notice this frequently with OSS or academic software that doesn't ship binaries.
andybak
Yep. I've given up on C and C++ projects because I find they almost never work and waste hours of my time. Part of the issue might be that I'm often using Windows or macOS, but I've had bad experiences on Linux too.
pessimizer
> `make` etc build flows usually result in dependency-related compile errors,
Which are displayed to me during the `./configure` step before the `make`, and usually require me to type "apt-get install [blah] [blah] [blah]", and to run configure again.
potatolicious
So much this. As someone whose bread and butter is systems programming for things that run on end-user devices, every time I dig into a Python project I feel like I've been teleported into the darkest timeline, where everything is environment management hell.
Even the more complex and annoying dependency-management scenarios in native-land still feel positively idyllic compared to Python venvs.
pjmlp
When I initially started to learn Python (1.6), virtualenv was starting to be adopted, and since then things have hardly changed.
It also helps that even minor versions introduce breaking changes.
I doubt anyone really knows Python that well, unless they are on the core team.
bhy
Well, I just spent an hour diagnosing a llama.cpp build failure caused by it picking up the wrong nvcc path.
Dependency problems still happen even with C/C++.
Tostino
I had the same issue... It turned out to be because I was using the Flatpak version of IntelliJ IDEA, which had problems with paths. Running from a plain terminal worked fine.
adeon
This is also the reason I like it when I see a project in C or C++. It's often just a ./configure && make or something. With a Python project, even if the dependencies install, there might be some mystery crash because the package dependencies were not set correctly or something similar (I had a lot of trouble with the AUTOMATIC1111 Stable Diffusion UI when using some extensions that installed their own requirements, which could conflict with the main project).
With a boring C project, if it compiles it probably works without hassle.
Feels validating that other people have these thoughts too and I'm not just some old fart.
lynx23
I recently hit the "classic" case. Saw a CLI tool for an API I'd like to use, written in Python. Tried it and found out it didn't work on my machine. I later found out it was a bug in a dependency of that tool. 100 lines of shell script later, I had the functionality I needed, and a codebase which was actually free of unexpected surprises. I know, this is an extreme example, but as personal anecdotes go, Python has lost a lot of trust from my side. I also wonder how people can write >10k-line codebases without static types, but that is just me ....
raxxorraxor
I think it is the opposite for me, but I am also a fan of system-independent package managers, provided they support easy package configuration.
Otherwise you bind yourself not only to a system architecture and OS, but also to a distribution.
I find that Automatic1111 plugins tend not to share dependencies and instead redownload them for their own use. That can make your HDD cry, because some of these are large models. Advantages and disadvantages, probably...
There are package managers for C, and some are quite good. But for most projects you are quite dependent on your distro's package manager to supply a fitting foundation. Sometimes it is easy, but if there is a problem, I think handling C is far harder than Python. And I write quite a bit of C, while I can only perhaps read Python code.
No code is completely platform independent, especially a Stable Diffusion project, but Python is still more flexible than C by a long shot here.
Of course Llama is great. Time to get those LLMs on our devices for our personal dystopian AIs running amok.
drdaeman
> which package manager/distribution I need to use, and what system libraries those dependencies need to function properly
I don't understand why things are so complicated in Python+ML world.
Normally, when I have a Python project, I just pick the latest Python version - unless the documentation specifically tells me otherwise (like if it's still Python 2 or if 3.11 isn't yet supported). If the project maintainer had some sense, it will have a requirements list with exact locked versions, so I run `pip install -r requirements.txt` (if there is a requirements.txt), `pipenv sync` (if there is a Pipfile), or `poetry install` (if there's a pyproject.toml). That's three commands to remember, and it's not just one because pip (the de facto package manager) has its limitations and the community hasn't really settled on a successor. Kinda like `make` vs automake vs `cmake` (vs `bazel` and other less common stuff; same with Python).
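In practice that boils down to something like this (a sketch, assuming a reasonably maintained repo):

    python -m venv .venv && source .venv/bin/activate
    # then, depending on which file the repo ships:
    pip install -r requirements.txt   # requirements.txt
    pipenv sync                       # Pipfile / Pipfile.lock
    poetry install                    # pyproject.toml managed by Poetry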
External libraries are typically not needed - either they'll be provided in binary form as wheels (prebuilt for most common system types), or they'll be built automatically during the installation process, assuming that `gcc`, `pkgconfig` and the essential headers are available.
Although, I guess, maybe binary wheels aren't covering all those Nvidia driver/CUDA variations? I'm not an ML guy, so I'm not sure how this is handled - I've heard there are binary wheels for CUDA libraries, but I've never used them.
Ideally, there's Nix (and poetry2nix) that could take care of everything, but only a few folks write Flakes for their projects.
> Usually the most complicated part is running `make` or `cmake` with `pkgconfig` somewhere in my path
Getting the correct version of all the dependencies is the trickiest part as there is no universal package managers - so it's all highly OS/distro specific. Some projects vendor their dependencies just to avoid this (and risk getting stuck with awfully out-of-date stuff).
> Maybe install some missing system libraries if necessary.
And hope their ABIs (if they're just dynamically loaded) or headers (if linked against) are still compatible with what the project expects. At least that is my primary frustration when I try to build something and it says it doesn't work anymore with whatever the OS provides (mostly Debian stable's fault, lol). It is not exactly fun to backport a Debian package (twice so if doing it properly and not handwaving it with checkinstall).
rgoulter
> Ideally, there's Nix (and poetry2nix) that could take care of everything, but only a few folks write Flakes for their projects.
Relevant to "AI, Python, setting up is hard ... nix", there's stuff like:
NBJack
The right combo of Nvidia/CUDA/random Python ML library is a nightmare at times. This is especially true if you want to use older hardware like a Tesla M40 (dirt cheap, still capable). And may your maker be with you if you tried to use your distro's native drivers first.
It's fair to say part of the blame is on Nvidia, but wow, is it frustrating when you have to track down these eclectic mixes of versions.
kkfx
My personal recipe (on NixOS) is a pip-ed virtual environment for quick tests, or conda inside a nix-shell, on top of a dedicated zfs pool (conda mounted in ~/.conda with dedup=on), so nothing is nixified and nothing needs to last through a nixos-rebuild...
Many Python projects, not only in the ML world, tend to be just developer experiments, meant to be run as experiments, not worth packaging as stable, released programs...
Oh, BTW, projects like home-assistant fall into the same bucket...
nologic01
Wow, so interesting to see the "depth" of anti-python feeling in some quarters. I guess that is the backlash from all the hordes of Python-bros.
Having used both C++ and Python for some time, the idea that managing C++ dependencies is easier than venv and pip install is one of those moments where you wonder how credible HN opinion is on anything.
> a compelling non-Python solution appears
Confusing a large ML framework like PyTorch, which lets you experiment with and develop any type of model, with one particular optimized implementation in a low-level language suggests people aren't even aware of the basic workflows in this space.
> also popular
Of course it's popular. As in: people are delirious with LLM FOMO but can't fork over gazillions to cloud gatekeepers or NVIDIA, so anybody who can alleviate that pain is like a deus ex machina.
Of course llama.cpp and its creator are great. But the exercise primarily points out that there isn't a unified platform to both develop and deploy ML models in a flexible, hardware-agnostic way.
P.S. For Julia lovers who usually jump at Python's "two-language problem": here is your chance to shine. There is a Llama.jl that wraps llama.cpp. You want to develop a native one.
mook
Managing C++ dependencies _is_ much easier! It's either "run this setup exe" or "extract this zip file/tarball/dmg and run".
This is because most people don't care about developing the project, just using it. So they don't care what the dependencies are, just that things work. C++ might make dependencies more difficult to handle when building things, but few people will look into hacking on the code before checking to see if it's even relevant.
fransje26
> Managing C++ dependencies _is_ much easier! It's either "run this setup exe" or "extract this zip file/tarball/dmg and run".
Not sure if that was half sarcastic, but from experience, anything touching C++ & CUDA has the potential to devolve into a nightmare on a developer-friendly platform (hello, kernel updates!), or worse, into a hair-pulling experience if you happen to use the malware from Redmond (to compound your misery, throw Boost and Qt dependencies into the mix).
Then again, some of the configuration steps required will be the same for Python or C++. And if the C++ developer was kind enough to provide a one-stop compilation step depending only on an up-to-date compiler, it might, indeed, be better than a Python solution with strict package version dependencies.
nologic01
Maybe this distinction does indeed explain the dissonance! But it might be rather shortsighted given the state of those models and the need to tune them.
ric2b
If we're only talking about end-user "binaries", you can also package Python projects into exe files or similar formats that bundle all the dependencies and are ready to run.
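PyInstaller is one example of that route (the entry-point name here is just a placeholder):

    pip install pyinstaller
    pyinstaller --onefile app.py   # bundles the interpreter and dependencies into dist/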
simion314
>Wow, so interesting to see the "depth" of anti-python feeling in some quarters. I guess that is the backlash from all the hordes of Python-bros.
I think you are generalizing. I do not hate Python the language, but these ML projects are a very, very terrible experience. Maybe you can blame the devs of these ML projects, or the devs of the dependencies, but the experience is shit. You can follow step-by-step instructions that worked 11 days ago and today they're broken.
I had similar issues with Python GTK apps: if the app is old enough, you are screwed because that old GTK version is no longer packaged; if the app is very new, you are screwed again because it depends on the latest version of some GTK wrapper/helper.
nologic01
I think what has happened is that because Python is sweet and easy to use for many things, it generated irrational expectations that it is perfect for all things. But it's just an interpreted language that started out as a scripting- and glue-oriented language.
Its deployment story is where this gap frequently shows: desktop apps are at best passable, while e.g. Android apps are practically non-existent despite the efforts of projects like Kivy.
tempera
I think the Python hate has been manufactured, starting with Google's launch of the Go language, which wanted to eat Python's lunch.
And some jumped on that bandwagon unknowingly and never got off it.
lexandstuff
Just to balance things out: I still love Python. A lot!
noman-land
I don't even write Python, really, but I've been interfacing with llama.cpp and whisper.cpp through it recently because it's been the most convenient option. Before that I was using Node.js libraries that just wrap those two C++ libs.
I guess since these models are meant to be run "client side" or "at the edge" or whatever you want to call it, it helps if they can be used neutrally from just about any wrapper. Using them from JavaScript instead of Python is sort of huge for moving ML off the server and into the client.
I hadn't really dipped my toes into the space until llama.cpp and whisper.cpp came along, because they dropped the barrier extremely low. The documentation is simple and excellent. The tools they've enabled on top, like text-generation-webui, are next-level easy.
git clone. make. download model. run.
That's it.
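Spelled out, it's roughly the following (the model file name is a placeholder, and exact names/flags have changed over time):

    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp && make
    # drop a quantized model into ./models, then:
    ./main -m ./models/7B/ggml-model-q4_0.bin -p "Hello" -n 128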
v3ss0n
The quality of HN comments has been getting worse for a few months. This has nothing to do with the Python ML ecosystem; what you have to realize is that llama.cpp is doing inference on already-built models - that is, running the models.
Building (training) machine learning and deep learning models is much more complex - an order of magnitude more complex - than just running them, and doing it in C or C++ would take you years, whereas it takes just a few months with Python.
And the complexity of `pip install` is nothing compared to that.
That's why no real ETL + deep learning training work is done in C or C++.
cztomsik
You're not entirely wrong, but pretty much everything you use from Python is written in C++ anyway, so what's your point?
ageofwant
The point is, as you pointed out, that you code against the appropriate level of abstraction. You write an ML-workflow-appropriate language like Python in something like C++/Rust, and you write ML flows in Python. That should really not be that hard to understand.
v3ss0n
It's the same argument as saying that all the HTML, CSS, and JavaScript development you do is written in C/C++ anyway.
forgingahead
Yes, Python is incredibly annoying to use. Its dependency management is a total mess, and it's incredible how brittle packages are when there are even minor point changes in versions anywhere in a stack.
sp332
I have to agree. Installing dependencies for some git repos is a total crapshoot. I ended up wasting so much hard drive space with copies of pytorch. Meanwhile llama.cpp is just "make" and takes less time to build than to download one copy of pytorch.
cshimmin
So, the solution is that everyone should write code as self-contained C++ code and not use any software libraries ever. Dependency hell has been solved for all time!
realusername
I personally really don't like Python much; I find it as tedious to write as Go, but without the performance, type-safety, and autocomplete benefits that Go gives you in exchange.
If I have to use a dynamic language, at least make it batteries-included like Ruby. Sure, it's also not performant, but I get something back in exchange.
Python sits in a very uncomfortable spot that I don't find a use for: too verbose for a dynamic language and not performant enough compared to a more static language.
The testing culture is also pretty poor in my opinion; packages rarely have proper tests (especially in the ML field).
emmender
In addition to the above:
1) Function decorators etc. have made the code unreadable.
2) While the code is succinct, a lot of abstraction is hidden in some C/C++ language binding somewhere, so when there is a problem it is hard to debug.
3) PyTorch has become a monolithic monster with practically no one understanding its functionality end-to-end.
pjmlp
For me Python's main use cases are being BASIC replacement, and a saner alternative to Perl (I like Perl though) for OS scripting.
For everything else, I'd rather use compiled languages with JIT/AOT toolchains, and since most "Python libraries" for machine learning are actually C and C++, any language goes; there is nothing special about Python there.
yieldcrv
The Python apologists are more annoying than the language.
It’s always been obvious that ML’s marriage to Python has been driven by credential-laden proponents in tangentially related fields following groupthink.
As soon as we got a reason to ignore those PhDs, their gatekept moat evaporated and the community of [co-]dependencies became irrelevant overnight.
dontreact
As far back as 2015, it’s been common to take neural net inference and get a C++ version of it. That’s what this is.
It didn’t make python obsolete then and it won’t now.
Training (orchestration, architecture definition, etc.) and data munging (scraping, cleaning, analyzing, etc.) are much easier with Python than with C++, so there is no chance that C++ takes over as the lingua franca of machine learning until those activities become rare.
emmender
Python is syntactic sugar - the heavy lifting is done by C/C++ bindings.
Many ML experts are not software engineers. They just want syntax to get their job done. Fair enough.
aidenn0
Slightly OT:
I have been playing around with whisper.cpp; it's nice because I can run the large model (quantized to 8 bits) at roughly real time with cuBLAS on a Ryzen 2700 with a 1050Ti. I couldn't even run the PyTorch Whisper medium model on this card with X11 also running.
It blows me away that I can get real-time speech-to-text of this quality on a machine that is almost 5 years old.
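The setup is roughly the following, give or take exact flag and model file names (these are from memory and may differ by version):

    WHISPER_CUBLAS=1 make                       # build with cuBLAS offloading
    ./main -m models/ggml-large-q8_0.bin -f audio.wav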
eurekin
Seconded. I was playing around with my native language (Polish) and the large models actually blew me away. For example, it spelled "przescreenować" correctly, which is an English word with a Polish prefix and a conjugated suffix.
nitinreddy88
Is there a dummies' guide to getting started with any of these?
Tepix
Have you tried the Quick start in the https://github.com/ggerganov/whisper.cpp README?
m3affan
This is an impressive use case.
moneywoes
Is it possible to run on Apple M1 devices or mobile phones, or not yet?
michelb
I can recommend the MacWhisper app if you prefer a GUI.
Void_
And Whisper Memos for iOS https://whispermemos.com/
raihansaputra
Yeah, the whisper.cpp GitHub page has a demo for both. I've used it on my M1 MBA for the past few months.
shon
Nice to see Georgi has started a company:
https://twitter.com/ggerganov/status/1666120568993730561?s=4...
Godspeed
fnands
Nice. He's obviously a talented engineer who has struck a chord with the whisper.cpp/llama.cpp projects, so I hope he has success with whatever he plans to do.
csmpltn
A lot of work going into refactoring proprietary code that can be randomly deprecated and outcompeted without any prior notice by any number of large competitors... problematic business model, in my opinion.
logicchains
He's building a library, ggml; it's generally not hard to add support for new models. For instance, llama.cpp already supports the Falcon 7B model (a different architecture from LLaMA). And given how politicised AI has become, there's unlikely to be many companies releasing weights for models competitive with the current models (e.g. LLaMA 65B). They may have private models that are better, like GPT-3.5 and GPT-4, but you can't run these on your own server, so they're not competing with ggml.
csmpltn
> "there's unlikely to be many companies releasing weights for models competitive with the current models"
We're at the very dawn of this technology going mainstream, and you're saying that it's unlikely for new players to release new, competing and incompatible models?
underdeserver
Can someone ELI5 why AMD is not in this game? Is it really so much harder to implement this in a non-platform-specific library?
captainbland
CUDA is the best-supported solution, tends to get you the best performance, has a great profiler (it will literally tell you things like "your memory accesses don't seem to be coalesced properly" or "your kernel is ALU limited", as well as a bunch of useful stats), even works on Windows, all of that.
OpenCL is (was?) the main open alternative to CUDA and was mainly backed by AMD and Apple. Apple got bored of it when they decided Metal was the future. AMD got bored of it when they developed ROCm and HIP (basically a partially complete compatibility layer for CUDA).
There's also stuff like DirectML, which only works on Windows, and various (Vulkan, DirectX, etc.) compute shaders, which are really more oriented toward games.
There's also a bit of a performance aspect to it. Obviously GPGPU stuff is massively performance sensitive and CUDA gets you the best performance on the most widely supported platform.
AMD is also the main competitor in the space but has all but dropped support for GPGPU in desktop cards, focusing instead on gaming for Radeon and compute for MI/CDNA. They seem to have realised their mistake a bit and are now introducing some support for RDNA2+ cards.
sorenjan
If I remember correctly, it's not just that AMD has poor support for their consumer cards; their ROCm code doesn't compile to a device-agnostic intermediate, so you have to recompile for each chip. New and old CUDA-compatible cards (like all GeForce cards) can run your already-shipped CUDA code, as long as it doesn't use new, unsupported features. So even if AMD had supported more cards, the development and user experience would still be much worse, since you'd have to find out whether your specific card is supported.
slavik81
All RDNA2 GPUs and many Vega GPUs could use the same ISA (modulo bugs). Long ago, there was an assumption made that it was safer to treat each minor revision of the GPU hardware as having its own distinct ISA, so that if a hardware bug were found it could be addressed by the compiler without affecting the code generation for other hardware. In practice, this resulted in all but the flagship GPUs being ignored as libraries only ended up getting built for those GPUs in the official binary releases. And in source releases of the libraries, needlessly specific #ifdefs frequently broke compilation for all but the flagship ISAs.
There was an implicit assumption that just building for more ISAs was no big deal. That assumption was wrong, but the good news is that big improvements to compatibility can be made even for existing hardware just by more thoughtful handling of the GFX ISAs.
If you know what you're doing, it's possible to run ROCm on nearly all AMD GPUs. As I've been packaging the ROCm libraries for Debian, I've been enabling support for more hardware. Most GFX9 and GFX10 AMD GPUs should be supported in packages on Debian Experimental in the upcoming days. That said, it will need extensive testing on a wide variety of hardware before it's ready for general use. And we still have lots more libraries to package before all the apps that people care about will run on Debian.
jahav
True, it's better just to use OpenSYCL, which stores an intermediate device-agnostic form and compiles it as needed for the specific card.
I don't understand why SYCL isn't more widely used.
SubjectToChange
“Apple got bored of it when they decided Metal was the future. AMD got bored of it when they developed rocm and HIP”
Apple backed OpenCL because they needed an alternative after their divorce with Nvidia. No one was going to target an AMD alternative when they had such a trivial market share, so it had to be an open standard. Initially this arrangement was highly productive and OpenCL 1.x enjoyed terrific success. Vendors across compute markets piled support behind OpenCL and many even started actively participating in it. However this success is what precipitated the disastrous OpenCL 2.x series. In other words, OpenCL 2.x was far too revolutionary for many and far too conservative for others. What followed was Apple pulling out to pursue Metal, AMD having shoddy drivers, Nvidia all but ignoring it, and mobile chip vendors basically sticking to 1.2 and nothing more. Eventually this deadlock was fixed after OpenCL 3.0 walked back the changes of 2.x, but this was in large part because the backers of 2.x moved to SYCL.
As for AMD, OpenCL was a tremendous boon when it was first introduced. At least initially it gave them a fighting chance against CUDA. But it was never realistic for OpenCL to be a complete CUDA alternative. I mean, any standard that is basically “everything in CUDA and possibly more” is a standard no one could afford or bother to implement. ROCm and HIP are basically AMD taking an API people are already familiar with and putting software underneath it that plays to the strengths of their hardware.
“AMD are also the main competitor in the space but all but totally dropped support for GPGPU in desktop cards, trying to instead focus on gaming for Radeon and compute in MI/CDNA.”
Keep in mind that AMD has been under intense pressure to deliver world-class HPC systems, and they managed to do so with ORNL Frontier. I don’t blame them for being selective with ROCm support, because most of those product lines were in flight before ROCm started development in earnest. That said, Nvidia is obviously the clear leader in hardware support, and therefore the safest option for desktop users.
znpy
> partially complete compatibility layer
So… partial compatibility layer?
captainbland
Is your point that "partially complete" is a redundant phrasing?
In this case I still prefer my version. I feel that it puts greater emphasis on the fact that it can potentially be complete, given the massive value that could give to the project.
Also "I would have written you a shorter letter but I did not have the time" sentiment springs to mind.
throwaway888abc
Thanks
WithinReason
Short version: AMD's software incompetence. Very few hardware companies have the competence to properly support their HW. You see this problem again and again: HW companies design HW that's great on paper but can't be used properly because it's not properly supported with SW. Nvidia understands this and has 10 times as many SW engineers as HW engineers. AMD doesn't. Intel might too.
SilverBirch
I think you're totally right with this, just to add - it's often possible to do a lot of things that are neat in hardware but create difficult problems in software. Virtually always when this happens it turns out to be nearly impossible to actually create software that takes advantage of it. So it's massively important to have a closed loop feedback system between the software and hardware so that the hardware guys don't accidentally tie the software up in knots. This is common in companies that consider themselves hardware companies first.
davidgl
Examples being the PS3's Cell architecture and HP's Itanium chips.
pjc50
Strongly agree. There's a surprising cultural difference between the two. As a software engineer in a different hardware company, I can see where the fault lines are, and it takes continual management effort to make it work properly.
(I note that if we had an "open" GPU architecture in the same way that we have CPU architectures, things might be a lot better, but the openness of the IBM PC seems to be a historical accident that no company will allow again)
SubjectToChange
Chalking it up to “software incompetence” is a bit simplistic, to say the least. AMD was on the brink of bankruptcy not too long ago and their GPU division was struggling to even tread water. They didn’t have an alternative to CUDA because they couldn’t afford one and no one would use it anyway, OpenCL stagnated because most vendors didn’t want to implement functionality that only the biggest players wanted, and their graphics division had to pivot from optimizing for gaming (where they could sell) to optimizing for compute as well.
Now that AMD has the capital, they are playing catch-up to Nvidia. But it’s going to take time for their software to improve. Hiring a boatload of programmers all at once isn’t going to solve that.
WithinReason
It's been a while since AMD was on the brink of bankruptcy; they had enough time to do something about compute, and yet it's still not usable - see the George Hotz rant. OpenCL stagnated because 2.0 added mandatory features that Nvidia didn't support, so it never got adopted by the biggest player.
lhl
llama.cpp can be run with a speedup on AMD GPUs when compiled with `LLAMA_CLBLAST=1`, and there is also a HIPified fork [1] being worked on by a community contributor. The other week I was poking at how hard it would be to get an AMD card running w/ acceleration on Linux and was pleasantly surprised; it wasn't too bad: https://mostlyobvious.org/?link=/Reference%2FSoftware%2FGene...
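The CLBlast build itself is just something along these lines (it assumes the CLBlast and OpenCL dev packages are already installed):

    make clean && LLAMA_CLBLAST=1 make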
That being said, it's important to note that ROCm is Linux only. Not only that, but ROCm's GPU support has actually been decreasing over the past few years. The current list: https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h... Previously (2022): https://docs.amd.com/bundle/Hardware_and_Software_Reference_...
The ELI5 is that a few years back, AMD split their graphics (RDNA) and compute (CDNA) architectures, which Nvidia does too, but notably (and this is something Nvidia definitely doesn't do, and a key to their success IMO) AMD also decided they would simply not support any CUDA-parity compute features on Windows or on their non-"compute" cards. In practice, this means that community/open-source developers will never own, tinker with, port to, or develop on AMD hardware, while on Nvidia you can start with a GTX/RTX card in your laptop and use the same code up to an H100 or DGX.
llama.cpp is a super-high-profile project with almost 200 contributors now, but AFAIK no contributors from AMD. If AMD doesn't have the manpower, IMO they should simply be sending free hardware to top open source project/library developers (and on the software side, their #1 priority should be making sure every single current GPU they sell is at least "enabled" if not "supported" in ROCm, on Linux and Windows).
Kelteseth
I just tried this [1] and it still uses my CPU even though the prompt says otherwise.
[1] https://github.com/ggerganov/llama.cpp/issues/1433#issuecomm...
lhl
I saw there was already an answer in your issue, but if you plan on doing a lot of inferencing on your GPU, I'd highly recommend you consider dual-booting into Linux. It turns out exllama merged ROCm support last week, and it's more than 2X faster than the CLBlast code. A 13B GPTQ model at full context clocks in at 15 t/s on my old Radeon VII. (Rumor has it that ROCm 5.6 may add Windows support, although it remains to be seen what exactly that entails.)
Kelteseth
So it now uses the GPU after some help, but it is not that much faster on my Vega VII than on my 5950X 16-core CPU :/
born-jre
The short answer is that they have somewhat competent hardware but the software sucks - or you can watch George Hotz rant about how the AMD driver sucks.
alecco
He got a tarball fix for the driver after his rant went viral. Still not looking good, IMHO.
https://geohot.github.io/blog/jekyll/update/2023/06/07/a-div...
xiphias2
It looks like there are some great engineers inside the company fighting the bureaucracy.
jandrese
> So it fixed the main reported issue! Sadly, they asked me not to distribute it, and gave no more details on what the issue is.
I think they missed the thrust of the rant.
HellDunkel
He goes from building an ML rack out of ATI graphics cards, to coding some Python, to recommending reading the Unabomber manifesto, from Marx to saying he owns a Rolls-Royce... Lord, please have mercy!
marcyb5st
I think there are several reasons.
Firstly, nVidia has been at it much longer. Because of this alone, the tools on the nVidia side feel easier to set up and more polished (at least that was my feeling when fiddling with ROCm about a year ago).
Second, but still related to #1: from the beginning, even consumer nVidia cards were able to run CUDA, and this meant that hobbyists and prosumers/researchers on a budget bought nVidia cards, compounding the time/tooling advantage nVidia already had. I.e., a huge user base of not only gamers but people who use their cards for things other than gaming and know that things work on these cards.
These are, IMHO, the main reasons why everyone targets CUDA and explain why frameworks like Tensorflow or Pytorch targeted it as a first class citizen.
rapsey
If AMD were software-competent, the frameworks would support their drivers just as well. No one wants a monopoly.
marcyb5st
Agreed. Sorry if I gave the impression of being pro-nVidia. I am not.
But the reality is that when TensorFlow and PyTorch came to be, there was no alternative. Now you need to jump through hoops to make them work with non-CUDA hardware.
Additionally, while drivers play a role, I think the main difference is in the compute libraries (CUDA vs ROCm).
sorenjan
I'm hopeful for SYCL [0] to become the cross-platform alternative, but there doesn't seem to be a lot of uptake from projects like this, so maybe my hope is misplaced. It's an official Khronos standard, and Intel seems to like it [1], but neither of those things is enough to change the situation.
Can someone who knows this space comment on how likely it is that SYCL will eventually be a good option? Cross-platform and cross-vendor compatibility would be really nice, and not supporting the proprietary de facto standard would also be a bonus, as long as the alternative works well enough.
[0] https://www.khronos.org/sycl/
[1] https://spec.oneapi.io/versions/latest/elements/sycl/source/...
mschuetz
> It's an official Khronos standard
I think that's the problem. Khronos isn't known for good UX, and being from Khronos is exactly the reason why I'm not even bothering to check it out. I want an alternative to CUDA, but I also want it to be as easy to use as CUDA.
LoganDark
Being from Khronos is also a reason why it might actually be usable in a decade's time, like Vulkan.
(Vulkan is from 2015 and is just recently starting to become usable.)
bilekas
I'm no expert, but if I understand correctly, the main pull is the CUDA cores and the API to them.
They're supposed to be more optimized and more stable compared to AMD's. That's how it was before, anyway; not sure about today.
Aardwolf
Isn't the main component for AI matrix multiplication? What makes it so hard to create a good alternative API for matrix multiplication?
dotnet00
It's a lot more complicated than just writing a matrix multiplication kernel because there are all sorts of operations you need to have on top of matrix multiplication (non linearities, various ways of manipulating the data) and this sort of effort is only really worthwhile if it's well optimized.
On top of that, AMD's compute stack is fairly immature: their OpenCL support is buggy, and ROCm compiles device-specific code, so it has very limited hardware support and it's kind of unrealistic to distribute compiled binaries for it. Then, getting to the optimization aspect, NVIDIA has many tools that provide detailed information on the GPU's behavior, making it much easier to identify bottlenecks and optimize. AMD is still working on these.
Finally, NVIDIA went out of its way to support ML applications. They provide a lot of their own tooling to make using them easier. AMD seems to have struggled on the "easier" part.
bilekas
Well, I think there are 2 types, right? Tensor cores (which AFAIK AMD don't have), which are better for matrix ops, and CUDA cores, which are better for general parallel ops.
Maybe someone more clever than me can go into the specifics; I only understand the bare minimum of the low-level GPU details.
Here's a nice high-level document:
[0] https://www.acecloudhosting.com/blog/cuda-cores-vs-tensor-co...
marcyb5st
I think the API for matrix multiplication is just part of the issue. CUDA tooling has better ergonomics: it's easier to set up and is treated as a first-class citizen in tools like TensorFlow and PyTorch.
So, while I can't talk about the hardware differences in detail, the developer experience is firmly on nVidia's side, and now AMD has a moat to overcome to catch up.
emmender
There is NCCL, GPUDirect, NVLink, and so on and so forth. It is not just matmul on GPUs.
lofaszvanitt
Planned economy, planned who does what.
davidy123
I'm a bit surprised by the numbers. It's "only" a 2× speedup on a relatively top-end card (4090)? And you can only use one CPU core. With 16+ core CPUs becoming normal and 128GB+ RAM being cheap, that seems like leaving a lot on the table.
[edit] realized it's relative to the merged partial CUDA acceleration, so the speedup is more impressive, but still surprised by the core usage.
LoganDark
> And you can only use one CPU core.
because the core's job is solely to direct the GPU, which is doing all of the work.
davidy123
To the replies: I think one feature of llama.cpp is that it can handle models that need more RAM than the VRAM provides; that is where I would think more cores would be useful.
eurekin
That cheap RAM is about 10x slower than VRAM. I didn't see any actual figures for latency, but there must be a reason why newer GPUs have memory chips on both sides of the PCB, as close to the GPU as possible.
shgidigo
Excuse my ignorance, but can someone explain why llama.cpp is so popular? Isn't it possible to port the PyTorch LLaMA to any environment using ONNX or something?
jmiskovic
You can run it on an RPi or any old hardware, limited only by RAM and your patience. It is a lean codebase that is easy to get up and running, and it's designed to be interfaced with from any app without sacrificing performance.
They are also innovating (or at least implementing innovations from papers) on different ways to fit bigger models into consumer HW, making them run faster and with better outputs.
PyTorch and other libs (bitsandbytes) can be horrible to set up with the correct versions, and updating a repo is painful. PyTorch projects require a hefty GPU or enormous CPU+RAM resources, while llama.cpp is flexible enough to use a GPU but doesn't require one, and it runs smaller models well on any laptop.
ONNX is a generalized ML platform for researchers to create new models with ease. Once your model is proven to work, there are many optimizations left on the table. At least for distributing an application that relies on an LLM, it would be easier to add llama.cpp than ONNX.
brucethemoose2
In Stable Diffusion land, ONNX performance was not that great compared to ML compilers and some other implementations (at least when I tried it).
Also, llama.cpp is excellent at splitting the workload between CPU and accelerators since it's so CPU-focused. You can run 13B or 33B on a 6GB GPU and still get some acceleration.
Also, as said above, quantization. That is make or break. There is no reason to run a 7B model at fp16 when you can run 13B or 30B in the same memory pool at 2-5 bits.
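Back-of-the-envelope, counting weights only (ignoring context/KV-cache overhead, and quantized formats carry a bit of per-block metadata on top):

     7B × 16 bits ≈ 14 GB     (fp16)
     7B ×  4 bits ≈  3.5 GB   (q4)
    13B ×  4 bits ≈  6.5 GB   (q4)
    30B ×  4 bits ≈ 15 GB     (q4)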
v3ss0n
It should be similar in performance, but the ggml guy did it in what he knows best, and the biggest selling point is a single binary.
ianpurton
ONNX doesn't support the same level of quantization as GGML.
So basically GGML will run on hardware with less memory.
regularfry
Or alternatively, bigger models with the same memory (just quantised harder).
ykonstant
Python Torture Chamber is, of course, an eminently viable tool, but I gather some people prefer a more streamlined toolchain like that of C. Or COBOL.
naasking
llama.cpp runs better than PyTorch on a much wider variety of hardware, including mobile phones, Raspberry Pis, and more.
getcrunk
Anyone got performance numbers for other 30-series cards? A 3060 12GB?
I’m curious how it compares to his Apple Silicon numbers.
lhl
Not a 30 series, but on my 4090 I'm getting 32.0 tokens/s on a 13b q4_0 model (uses about 10GiB of VRAM) w/ full context (`-ngl 99 -n 2048 --ignore-eos` to force all layers on GPU memory and to do a full 2048 context). As a point of reference, currently exllama [1] runs a 4-bit GPTQ of the same 13b model at 83.5 tokens/s.
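I.e., an invocation along the lines of (the model path is a placeholder):

    ./main -m models/13B/ggml-model-q4_0.bin -ngl 99 -n 2048 --ignore-eos -p "..."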
speed_spread
Also, someone please let me use my 3050 4GB for something other than Stable Diffusion generation of silly thumbnail-sized pics. I'd be happy with an LLM that's specialized in insults and car analogies.
Blahah
You can split inference between CPU and GPU with llama.cpp, using whatever GPU VRAM is available. And you can run many small models with 4GB of VRAM; anything with 3B parameters quantized to 4-bit should be fine.
hospitalJail
Do you have your settings correct? I have a 1650 on an old computer and I have generated 512x512 pictures and merged models.
Heck, even using CPU, I've been able to generate 512x512.
If you aren't generating 512x512 pictures, lmk, I'll go grab my bat file's startup parameters.
getcrunk
I can't do 512x512 on a laptop 1050 Ti 4GB.
wing-_-nuts
Did you buy that card for CUDA? Because otherwise I have no idea why someone would choose a 3050 over a 6600.
machinawhite
What's a 6600, some AMD card? And why is it better?
jfdi
Is there a legitimate way to get the weights to actually use this without filling in forms?
dchest
You can use OpenLLaMA models (Apache 2.0 license, unrelated to LLaMA apart from their architecture and general approach to training):
https://huggingface.co/TheBloke/open-llama-7b-open-instruct-...
wmf
Not if you want the original LLaMA weights, but now there are other models like RedPajama available.
pram
There’s a torrent linked in the llama.cpp docs; it’s in a merge request on the LLaMA repo. It has all the files.
hnfong
It's almost (or actually is) a "pirated" torrent. So it might not be "legitimate".
v3ss0n
The best model currently is Falcon
csjh
They're a dime a dozen on Hugging Face; check out https://old.reddit.com/r/LocalLLaMA/wiki/models for a few options.
quickthrower2
Are these still done as LLaMA deltas? I.e., do I need to apply a patch to LLaMA, and do I therefore still need to source LLaMA?
Tostino
Most of them are merged models, so you don't need the base model.
It's stupidly simple to get going.
v3ss0n
The best model currently is Falcon
jokethrowaway
Great news, but I'd like to know how it compares with just using torchlib.
If this is faster than torchlib, these optimizations should flow into torchlib as well.
Love the idea of not having to deal with Python, though; dependency management there is just horrible. I'd much rather have ML projects written in C++.
Remmy
I've been using llama.cpp with the Python wrappers and the speed increase has been great, but it seemed to be limited to a max of 40 N_GPU_LAYERS. Going to have to update and see what sort of improvement I get.
gigel82
I'm a total newb about the implementation details, but I'm curious if a hybrid is possible (GPU+CPU) to enable inference with even larger models than what fits in consumer GPU VRAM.
skirmish
llama.cpp does it already. You tell it how many layers to offload to the GPU, and it runs the remaining ones on the CPU.
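For example, something like this (how many layers you can offload depends on the model size and your VRAM):

    # offload 32 of the model's layers to the GPU, run the rest on the CPU
    ./main -m ./models/13B/ggml-model-q4_0.bin -ngl 32 -p "Hello"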
hendry
An RTX 3090 isn't cheap - more than 1000 GBP new, crikey!
hospitalJail
As someone who uses their computers every day, for 7 years... Then I hand them down to my kids.... Then I turn them into servers.
I find the cost of computing extremely affordable, even for high-end stuff. What's the amortization on a 2-3k computer over 7 years? How about if I use it 4 hours a day actively and 24 hours passively?
I have considered spending 10-30k on a computer given the recent AI craze, but the thing stopping me is that by 2025, a 10-30k computer in the AI space is going to be 2-4x better. Only in the last year have we found out the importance of absurd amounts of VRAM. I feel like the 4090's 24GB of VRAM is going to age alright at best, but most likely poorly. (Not that 4090 buyers are going to have qualms about upgrading to the 6090.)
imranq
This is pretty cool, do you find the server farm of older computers valuable for your own work?
hospitalJail
Oh yeah, I have a computer for a Minecraft server. A computer hosting my kiddo's website (just for fun, it's silly, but randomly he will want me to pull it up from outside the house). That same computer hosts some listeners/watchdogs for a media computer, but I haven't actually used much of that information or those features in a year (WFH kind of removed the need for my remote tools).
I suppose that's it for now.
Oh, I thought of another use: I run a small business on the side, and when my interns occasionally don't have a laptop, I give them a crappy one. (They are basically just using Excel/Google Sheets.)
Tepix
Why not get a used one?
llama.cpp is great. It started off as a CPU-only solution and now it looks like it wants to support any compute device it can.
I find it interesting that it's an example of ML software that's totally detached from the Python ML ecosystem and is also popular.
Is Python so annoying to use that when a compelling non-Python solution appears, everyone will love it? Less hassle? Or did it take off for a different reason? Interested in hearing thoughts.