
mrb

For those wondering why the M2 Ultra is so fast, or the M1 & M2 series in general: inference's main bottleneck is memory bandwidth, not compute power. The M2 Ultra has a bandwidth of 800 GB/s, which is about 8 times that of an average modern desktop CPU (dual-channel DDR5-6400 offers about 102 GB/s).

This high bandwidth is really a result of Apple having designed a unified memory architecture for the M1 and M2 chips. Typically on a laptop or desktop, the CPU and GPU have distinct memory systems: high-bandwidth (but relatively low-capacity) graphics memory, and relatively low-bandwidth (but high-capacity) CPU memory. Apple decided to simplify that and instead implemented a single high-bandwidth memory system shared by the CPU and GPU. The only downside is that such high-bandwidth memory had to be tightly integrated into the M2 package, so the maximum capacity is limited. For example, whether you spend $5,600 (cheapest Mac Studio with an M2 Ultra and 192 GB) or $10k+ (maxed-out Mac Pro), you will only ever get 192 GB of RAM. For that amount, a PC could get 1024 GB of RAM (5× more!). But on the other hand, if your workload, like inference, doesn't need more than 192 GB, then that's great. Personally I think Apple made the right tradeoff here. 800 GB/s of memory bandwidth on a general-purpose CPU, on a single socket, has never been done before (to my knowledge).
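
A back-of-envelope way to see why bandwidth dominates: generating one token requires streaming essentially all of the model's weights through memory once, so bandwidth divided by model size gives an upper bound on tokens per second. A rough sketch (the 3.5 GB figure assumes a 7B model at roughly 4-bit quantization; these are illustrative numbers, not measurements):

```python
def tokens_per_sec_ceiling(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed: every token streams all weights once."""
    return bandwidth_gb_s / model_size_gb

# A 7B model quantized to ~4 bits is roughly 3.5 GB of weights.
print(tokens_per_sec_ceiling(800, 3.5))  # M2 Ultra: ~228 tokens/s ceiling
print(tokens_per_sec_ceiling(102, 3.5))  # dual-channel DDR5-6400: ~29 tokens/s
```

Real decode speeds land well below these ceilings, but the ratio between machines tracks the bandwidth ratio closely.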

kristianp

Compare with an RTX 4090, which has a memory bandwidth of 1,008 GB/s but only 24 GB of GDDR6X. The 4090 is cheaper.

https://www.techpowerup.com/gpu-specs/geforce-rtx-4090.c3889

mrb

I agree, GPUs can still generate more tokens per second per dollar, but what is new and great about the high-end M1 and M2 is Apple offering this much memory bandwidth on a general-purpose CPU, thus immediately available to all software running on the CPU.

kristianp

Looking at the prices, the 4090 is about 3/4 the price of a base-model M2 Ultra Mac Studio, which has 64 GB of RAM. With the rest of the PC to go with the graphics card, it's about 7/8 the price of an Ultra. Then you have compatibility: do you want your software to be CUDA compatible or Metal compatible? If you're writing the software, maybe you want both!

The budget option is to go with a used 3090, which still has greater memory bandwidth than the M2 Ultra.

In terms of FP16 FLOPS on the GPU, you have:

M2 Ultra, 27 TFLOPS

RTX 3090, 35 TFLOPS

RTX 4090, 82 TFLOPS

https://www.cpu-monkey.com/en/igpu-apple_m2_ultra_76_core

https://www.techpowerup.com/gpu-specs/geforce-rtx-3090.c3622

https://www.techpowerup.com/gpu-specs/geforce-rtx-4090.c3889

boesboes

Unless you calculate power usage, I'd bet.

mark_l_watson

The 800 GB/s bandwidth is amazing. I ordered a maxed-out Mac mini with 32 GB of RAM and 200 GB/s bandwidth. For the LLMs I want to run right now, that is sufficient for my needs, although I did consider over-buying and getting an M2 Ultra. I also pay Google for Colab, and as long as I don't over-use it, I can almost always get an A100. My strategy is to split my work as appropriate between the Mac mini when I get it in a week or two, and Colab. I used to run on Lambda Labs, also excellent, but setup time was non-negligible.

ProllyInfamous

I have an M2 Pro 16GB — anybody on Apple Silicon can download DiffusionBee.app and immediately be generating images (from text prompts) with its default model/engine... drag-and-drop.

Incredible what a desktop Mac mini can accomplish, even with the limitations of a single $1000 computer costing less than a single NVIDIA 4090.

----

For comparison: the SSD in the M2 Mini is FASTER THAN A Mac Pro 5,1's RAM!!!

j45

The M1 Max and M2 Max are quite serviceable too, if you don't want to jump to an Ultra.

sonthonax

I've noticed that M series Macs have extremely fast disk drives that the OS uses as swap quite efficiently. I've frequently used all my RAM on my Mac and barely noticed any slowdown when it starts swapping.

detourdog

I found that SSDs finally eliminated the drawbacks of Mach's VM. On platter drives, one needed lots of RAM to avoid swapping.

seec

As was already noted in other comments, the M2 Ultra's bandwidth is not that special against high-end GPUs (recent ones generally have over 700 GB/s), and this bandwidth has to be shared with the CPU. So technically, if you keep doing work on the CPU, there is less bandwidth available at any given time. A 4090 + 13900K has almost 1100 GB/s combined; not that it matters for most use cases. For regular CPU tasks the added bandwidth doesn't seem to make a difference as far as I can tell; at least Apple Silicon isn't winning in any scenario where it doesn't have a specialized block on the chip for the task. So what's the point? (Besides overpaying for memory.)

And to "win" here, the total VRAM available at once is what makes the difference, not really the bandwidth; this is just because the task has been parallelized as much as possible. Even then, it required optimizing for the architecture with parallelization, and it is absolutely not cost-competitive with the PC used as a reference. If you really need to maximize GPU VRAM in a single workstation (without going to server/cloud solutions) you could build a machine with multiple RTX A4000 SFF cards (1 slot, 20 GB). It would get more expensive than the maxed-out M2 Ultra, but at that point the M2 Ultra loses so hard in FLOPS that you really need to specifically look for situations where you would want more VRAM (up to 144 GB available for the M2 Ultra GPU vs. 80 GB for 4 single-slot cards) but wouldn't want to run the model faster/longer in a dedicated server rack that could potentially have even more VRAM (and be available to/shared with other people).

Realistically, NVIDIA knows how to put more RAM in their GPUs; it just doesn't make sense to scale VRAM faster than compute power for most workloads, you need a balance that makes sense. As an analogy, it's like coming up with a truck that can carry 150 t at once but can only do so at 1/3rd the speed of regular trucks. In most cases you're actually going to want to run 3 regular trucks even though it's going to be less efficient (it will still cost less and be faster overall), unless you really don't have a choice; at that point you are in "special convoy" territory (like for wind turbine blades), and it's going to cause lots of headaches on top of being slow and expensive.

Apple markets their stuff as an incredible innovation when in fact not only is it irrelevant for most workloads usually thrown at workstations (mobile or not), but I would argue that running the workloads where it would actually make a difference is a bad idea on a single-user workstation. For most things that actually matter in a single-user workstation/prosumer/enthusiast system, Apple Silicon loses quite hard, especially when it comes to GPU performance: viewport performance, close-to-real-time 3D rendering (before sending to a render farm for the final detailed render), games, etc.

And this is the Ultra version of the chip, which is out of reach for most people (it makes the 4090 look not that overpriced, which is quite funny). If you go down to the M2 Max version, suddenly the bandwidth is 400 GB/s, and not only is that not impressive at all, it is even worse than an Intel A770M laptop GPU (512 GB/s) while still having less raw power and costing way more. The further down the Apple Silicon roster you go, the worse it gets. AS is not competitive at the high-end workstation level, yet it is absurdly overpriced at almost every level.

The reason they have this architecture (which isn't very good for most traditional computer applications) isn't because they went out of their way to engineer something great. Nope. It is because they basically scaled up a mobile architecture that was like this from the get-go (power and space constraints, plus no need to have that much RAM nor have it upgradeable). And this is only because Apple is currently run by a Scrooge who figured he could get even more money out of their silicon division if they sold SKUs with binned parts and controlled the RAM supply/price.

If Apple had actually done useful engineering, they would have figured out a way to scale the GPU/VRAM combo independently and a way to package/sell it efficiently. It makes no sense to scale VRAM past a certain point: why would you want to load a 3D model/view/whatever if you cannot compute it fast enough? As for the CPUs, existing memory interfaces were fast enough for most things, and the "benefit" is nonexistent in most cases. They went about it in the worst way possible, with a cost-reduction-above-all approach while jacking the price up to 11. This is the laziest approach they could take, and they even dumped all the unnecessary costs directly onto the consumer (low yields for big-area chips, and RAM soldered close to the chip from a lack of dedicated GPU SKUs). Even if the consumer wants to absorb the cost, he still gets bad scaling and uncompetitive performance...

I just don't get how Apple gets away with it, and how there are people like you falling for their marketing bullshit, which is just a spin on what are actually weaknesses...

macwebcomputing

Interesting feedback, but the focus on cost is an old debate. Your feedback seems more about 3D than AI. A lot of developers also want a better user experience and reliability. My reply is that M1 and M2 are just the beginning, as Apple is investing billions of dollars in R&D. Also, no PC laptop can do better than the AS architecture today, at a lower cost than high-end PC laptops and with more battery life. Pro servers are the next step; the M2 Ultra is just a preview. Two weeks ago I saw an engineer doing a demo of Llama 2 on a recent AMD laptop. He was complaining how slow it was compared to a Mac. Again, Apple is leading for laptops now; servers are next. M3 will remove the memory limitation.

leetharris

For applications that aren't latency sensitive, my company has found that Apple hardware inference is far less expensive than the competition when accounting for electricity usage. I wish they would make a cloud offering.

whalesalad

Amazon has Apple servers for rent. There is also https://macminicolo.net and https://www.macminivault.com

nathancahill

I'm so happy MacMiniColo is still around. I had projects hosted there when they were just getting started. They've stayed true to their mission. I love that their website still feels like Wordpress 3.0.

hamandcheese

They aren't accepting new Mac minis for colocation, which means if you want to go from, say, 8 GB to 16 GB of RAM, that's an extra $90 a month for the privilege.

cyberge99

There’s also MacStadium.

detourdog

I have a fiber connection, the space, and the experience to run a Mac mini colo, if people are really interested. I would want to do it as a co-operative that helps with the expense of the physical location.

ruph123

MacMiniColo and MacStadium merged afaik.

JCharante

Doesn't macOS require you to rent it out with a one-month minimum? At that point, just buy the hardware.


garciasn

We went with an M2 Studio with maxed-out RAM because we simply cannot get reliable GPU availability with cloud providers, and for $6,000 (with tax) we can have the equivalent VRAM of ~2 80 GB GPUs instead of paying $5/hr for the pleasure.

ttt3ts

With a 70B-param model, how many tokens/second?

Did the math, and assuming 100% utilization and equal performance (which is certainly not the case), payback on your Mac is 9 months...

garciasn

You need to pay for dedicated instances because on-demand ones are generally unavailable in the moment. So it's more like 45 days if we're only talking about a single GPU, but we're talking about ~2x.
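
The break-even arithmetic here is simple enough to sketch. The rate and GPU count below are the figures mentioned in this thread, not authoritative pricing; whether it comes out near 45 days or 9 months depends mostly on the hourly rate and the utilization you assume:

```python
# Back-of-envelope payback for a $6,000 Mac Studio vs. renting
# ~2x 80 GB cloud GPUs at an assumed $5/hr each.
mac_cost_usd = 6000
rate_per_gpu_hr = 5
gpus = 2

break_even_hours = mac_cost_usd / (rate_per_gpu_hr * gpus)
print(break_even_hours / 24)  # 25.0 days at 100% utilization
```

At lower utilization the break-even point stretches out proportionally, which is presumably where the 9-month figure comes from.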

api

Apple really should license the M chip IP to someone to make a server chip out of it, or do it themselves. It's money on the table for them and would not cannibalize their Mac business at all. It's a very nice core.

Aurornis

Apple Silicon is great for low-power desktops and laptops, but it doesn't actually have groundbreaking performance relative to what we've got in the server space. If you dropped the M2 Ultra from the $4000 Mac Studio into a server, it would perform about the same as a $1500 AMD 7950X3D-based server (a common budget server setup with ECC) in CPU tasks. Stick a common GPU in there and you're running circles around the M2 silicon in GPU tasks.

Apple Silicon is great at really low-power work, but if you dial desktop or server GPU power limits down, they also become quite efficient. The marginal cost of electricity is cheaper than buying more hardware, so NVIDIA and others run their parts deep into the diminishing-returns part of the curve to maximize performance at the expense of power efficiency.

cyber_kinetist

What Apple Silicon brings to the table is not simply just performance, but a large amount of unified memory that can be used by the GPU (which are needed for inference of large deep neural networks like LLMs).

A top-of-the-line Mac Studio will give you 192 GB of unified RAM for less than $7,000. Meanwhile, an H100 from NVIDIA with 80 GB of VRAM will cost you something like $30,000...

sitkack

There are multiple RISC-V based solutions that will be out in 12-18 months. But for now, Apple hardware is the best solution.

mschuster91

Fully agree. It's time for some serious competition not just against Intel/x86 but also in the ARM space.

dzhiurgis

IDK about the ML part, but an equivalent-performing Ryzen mini PC cost me 3× less than an M1 MacBook (yes, I'm aware you get more with the MacBook).

starcraft2wol

You can buy a rack-mount Mac Pro.

crooked-v

You could always rack mount some Mac Minis.

ninkendo

It’s sad that racking and maintaining your own physical hardware is becoming such a lost art… I appreciate the up-front simplicity of cloud offerings as much as anyone, but there’s something to be said for owning your own hardware and avoiding the continual rent payments you’re sending the cloud providers.

The wisdom is that cloud providers are better at infra than you, and that the economies of scale make it better to piggy back on what they’re doing, but… AWS is the most profitable part of Amazon for a reason. They’re overcharging you.

ilc

For most orgs: AWS is not overcharging IMHO.

When you look at the cost of the hardware + hosting, yes, it certainly looks and feels that way.

But if you've dealt with corporate IT, and had to deal with 3-6 month lead times on getting hardware, or with politics to get your hands on hardware to get stuff done, you know AWS is cheap. It gives you velocity.

If your company is large enough that it can offer the elasticity of resources that Amazon offers, or even 1/4 of it, and you have an IT org that will let it happen, then yes, AWS is a waste.

But with AWS... when a project dies, you can wipe its costs out, people won't hold onto hardware so they have hardware for the next project, etc...

Trust me. I've been in IT; I can spec and build rack systems. I am a software dev, and I've been a dev for most of my career.

For 90%+ of orgs... they don't have the maturity and skills to handle that type of infra without substantially distracting from their primary business.

sbarre

The question I've often heard asked when deciding on build vs buy (which can apply to cloud vs. bare metal) is:

Are we in the business of building, maintaining and operating <thing to build> or do we want to buy that as a service instead and focus on our actual core business?

There's more to the cost of building and operating than just the hard costs.

Retaining good modern IT talent is getting harder and harder, and I'm not even talking about salaries. You need a whole department, including strong leaders who can hire, train, and lead the right people, etc.

This is something most companies wouldn't even know where to start with.

ClimaxGravely

I'm a former build engineer who used to do everything on-prem, and I gotta say I miss it (not being a build engineer; the on-prem experience). Since those days, pretty much every company I've worked at has moved their CI/CD to the cloud, and I gotta say it feels so much slower, even when working from home.

I remember twice switching from an in-house Jenkins/TeamCity/whatever type of CI to Azure DevOps, and the thing I remember most was how much longer a build took to complete, as well as the massively longer time downloading a build from Azure vs. from within the office. Even when working from home, the on-prem stuff was faster.

The thing is, the build/devops teams seem to be about the same size in both cases. It's just kind of worse in pretty much every case when we do CI in the cloud.

Notes

- My experiences are largely for game development so the build times and artifact sizes can be quite large.

- I've only ever had CI/CD experience with Azure, I've not tried other cloud providers

- Since this is game development and we're using CI, downtime is more acceptable than in other cases. That said, I don't remember much downtime when I was working as a build engineer. I have seen periods of 1-2 hours of downtime once in a blue moon, but then again I've seen that with Azure too. In both cases it wasn't so much the setup as a build-script deployment issue.

Also being able to cool off in the rack room when it's a hot day is always a treat :)

cherryswitch

Sometimes the flexibility and time savings are worth the added upfront cost, similar to how companies like to hire consultants or lease office space. Being able to walk away is better for the short term, because companies value profits in the short term.

whalesalad

You’ve never had to drive to the colo at 3am huh?

solardev

Where does this stop? Do you produce your own electricity? Farm your own food? Make your own silverware and shoes? Sometimes it's just easier to outsource the things you don't want to (or aren't good at) doing yourself.

If I wanted to host a website, sure, I can build a server out of parts and negotiate with my ISP and get a business pipe and handle all caching and such. Or like I can pay a provider $5/mo and get better performance and reliability with no management overhead. Yeah, maybe over 5 years I'd save more money doing it myself... but it's not worth the time.

If I wanted to generate a photo or a dozen, or a few paragraphs of text, that's like a few cents worth of cloud AI. Maybe low single-digit dollars. Or I could spend thousands on fat GPUs or a Macbook, spend forever training it, and still end up with a sub-par result.

AWS is profitable not just because they're overcharging you but because they are providing a hugely useful service for millions of businesses that don't want to deal with that infrastructure themselves, any more than they'd want to manage their own plumbing or electrical grid or roads and bridges leading to their office. DIY makes sense if you're doing it as a hobby or if your scale is so big that you would incur significant savings to in-house it, but for millions of small and medium businesses, it's just not the most practical approach. Nothing wrong with that.

I mean, it's like saying development is such a lost art... why hire a dev if you can learn to code yourself? Sure, but not everyone wants to, can, or has time.

GravityLab

I have a base-model m1 Mac Mini and it's a beast. I'm using it as my build/deploy server and also as a back-end server (for running jobs) for the prototype I'm working on. I also do development on it when I want to use my big monitor rather than my laptop. And I listen to music and run Cookie Clicker at the same time while doing development.

Got three databases up and running too. It's a beast. I'd definitely consider self-hosting with a few Mac Minis; that would be fun, and they're really cute, sleek devices too. I paid $650 for it and consider it a great deal. I definitely should've gotten it with more than 8 GB of RAM, but I got it to try it out and haven't yet really needed to upgrade to a unit with more memory.

nre

Interestingly enough, I was discussing this with a friend (who works in enterprise IT) the other day. Basically, rack servers are purpose-built for the task, with hot-swappable components, redundant power/storage, multiple NICs, ECC, remote management, and so on. They come with enterprise support and can be easily maintained in the field.

Meanwhile a Mini cluster is literally a bunch of mini pcs in a rack, and idk if Apple even supports this kind of industrial use. While it's a quality product the Mini isn't really designed for the datacenter.

xienze

> and idk if Apple even supports this kind of industrial use. While it's a quality product the Mini isn't really designed for the datacenter.

I think they know of it and tacitly approve of this use case, as evidenced by the Mac Mini having the same form factor for ages. They’re well aware that a lot of people use Minis (and Studios now) in data centers, and that the Mini footprint is sort of “standardized” at this point.

smoldesu

For applications that aren't latency sensitive, I run inference on a free 4 core Ampere server from Oracle. Once you ditch the "fast" prerequisite, a lot of hardware becomes viable.

chmod775

Nowadays GPU factory defaults try to squeeze out the last bit of performance at a huge cost in power.

You can run them at half the power usage and only lose a fraction of the performance, at least in gaming. Worth trying for AI tasks.

macwebcomputing

macweb.com is also a solution :-)

gcr

For folks curious what this is, this seems to be a caching optimization for saving time on parallel streams of text. The benchmark is incidental. Most individual users likely have just one un-batchable conversation going at once with llama-cpp, and I think it’s unclear whether this PR improves that case much.

Also note that the demo video is sped up to fit inside GitHub attachment limits. Your observed speed may vary. :)

lukeschlather

I've been curious about using LLMs for large-scale refactoring. Prompts like

anywhere you find `FooBarBaz(blip, kap)` replace it with `new newThing(blip).bump(kap)`

I don't know how reliable it is, but if you can easily run this on commodity hardware, it could totally replace most IDE refactoring tools. Obviously IDE refactoring is more reliable, but this could be made simple and flexible, and possibly just as reliable.

But also it could enable some interesting things that you could never do with an IDE refactoring tool.

potatoman22

LLMs aren't very good at string operations. They struggle with things like character counts, extracting substrings, and replacements.

simonw

Yeah, for this kind of thing I would get the LLM to generate the regular expression.
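
As an illustration of that approach, here is the kind of regex an LLM might produce for the `FooBarBaz` rewrite mentioned above. This is a sketch: the argument pattern is deliberately simple and won't handle nested calls or multi-line arguments.

```python
import re

# Rewrite FooBarBaz(x, y) -> new newThing(x).bump(y).
# The argument pattern ([^,()]+?) is a simplification: it matches any
# text without commas or parentheses, so nested calls won't match.
pattern = re.compile(r"FooBarBaz\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)")

def rewrite(src: str) -> str:
    return pattern.sub(r"new newThing(\1).bump(\2)", src)

print(rewrite("x = FooBarBaz(blip, kap);"))
# x = new newThing(blip).bump(kap);
```

For anything beyond simple call sites, an AST-based tool is the safer bet, which is essentially what IDE refactoring does.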

sharkjacobs

I already do things like "write a script to replace `FooBarBaz(blip, kap)` with `new newThing(blip).bump(kap)` in a project folder"

I'm more comfortable with that because I find it usually takes two or three prompts to get it right

e.g. a couple hours ago I prompted it to help me diff two commits ignoring all whitespace, just to check if there were any other changes. The first response didn't ignore newlines, the second was a multiline script, and the third gave me what I actually wanted:

  diff -w <(git show 0bb2c8579efe775de883e0182db48989bfa324f2:"path/to/file"|tr -d '\n') <(git show 6c71efc17497ad7c90b9c7b690075ec031c13c69:"path/to/file"|tr -d '\n')

Ductapemaster

I think that an LLM could be an amazing interface or translation layer for this sort of thing, but I would argue that the underlying operations of refactoring or something similar should remain very much like a function with discrete inputs and outputs.

dwringer

I believe that the application of multiple streams in parallel is a natural evolution of using a single stream. I've used some local models for help in creative writing, and some of the most productive results I got were from running the same prompt and sequence of interactions dozens and dozens of times. Although in that case, I was personally going through each result line by line, I can certainly imagine fully automated tools that leverage the range of responses to a given prompt.

gpderetta

I always wondered if pipelining/parallelization can help with AutoGPT-like tasks where a supervisor AI delegates subtasks to sub-instances.

londons_explore

Speed-wise, on a single stream, I have no need for it to generate text faster than I can read.

However, for scripts that try to use hundreds/thousands of invocations to solve some problem (eg. "write me a whole book"), the parallelism will be great (but obviously the script has to be written with that in mind).

fudged71

So this will only work on, say, one prompt being run many times at once?

ugh123

Is it possible that in a few years time, only Mac silicon and PCs with high-end GPUs will be required to run "In-home LLMs" affordably?

If we get closer to either "AGI" (whatever the hell that is) or at least a reasonably useful AutoGen/BabyAGI-like system that becomes popular to use at home, those machines will be the only ones capable of running advanced LLMs without having to pay OpenAI, Microsoft, Amazon/AWS, etc. inordinate sums of money to do what consumers will someday deem a utility.

anotherhue

> Is it possible that in a few years time, only Mac silicon and PCs with high-end GPUs will be required to run "In-home LLMs" affordably?

No. They work well on the apple chips thanks to the integrated memory and the large size of the models. I know of no reason why an x86 chip could not be designed in a similar way if desired. IANAChipDesigner but I have worked for one of them.

behnamoh

FWIW, while Apple silicon can _run_ huge models thanks to the unified memory (not to be confused with shared memory), the inference is pretty slow compared to dedicated GPUs, so it's a tradeoff. The significance of this PR is that inference speed can—at least in certain applications—be sped up using parallel decoding.

Metus

Do these implementations use the Neural Engine? I saw that there was a Stable Diffusion implementation using the Neural Engine, and I found that my MacBook noticeably did not run hot, as opposed to during an average Teams call.

snitty

It doesn't. You need to generate models for use on the Neural Engine, which Apple did for Stable Diffusion, but this is just taking advantage of lots of fast RAM and lots and lots of threads, if I understand it correctly.

ramesh31

It uses Metal acceleration and takes advantage of the shared memory architecture, meaning it's basically a GPU with 192 GB of VRAM. Trading space (VRAM) for time (FLOPS), it can beat the performance of an RTX 4080 here.

lostmsu

> can beat the performance of an RTX4080 here

This needs some backing. When the M1 first came out, people were claiming it was comparable to a 3080, until they saw the performance difference.

woadwarrior01

Encoder only transformers (like BERT) can be made to run on neural engine with CoreML. Efficient inference with autoregressive encoder-decoder and decoder only transformers (aka LLMs) needs KV-caching, which currently can't be efficiently implemented with CoreML (and thus neural engine). So, for now it's GPU only, with Metal.

smpanaro

You can do autoregressive decoding with KV caching on the Neural Engine. You have to make a bit of a trade-off and use fixed-size inputs [1], but the speed-up over no caching is meaningful.

There's a Whisper (Encoder-Decoder) [2] implementation if you want to see it in practice. Shameless plug, but I have a repo [3] where I'm working on autoregressive text generation on the Neural Engine. I'm running gpt2-xl (1.5B params) locally with KV caching at 120ms/token (vs. 450ms without caching). Will push an update soon.

Without quantization you can't go much higher than 1.5B params on M1's Neural Engine. M2 seems to have a higher ceiling but I haven't measured. I'm optimistic (but have not tried) that the new runtime quantization added to CoreML this year will allow for larger (and maybe faster) models on both.

[1] Technically you should be able to use 1 input with an enumerated set of sizes but I haven't been able to get it to work on the Neural Engine. This would likely be even faster. [2] https://github.com/wangchou/whisper.coreml/ [3] https://github.com/smpanaro/more-ane-transformers/
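
For those unfamiliar with the KV caching being discussed, here is a toy NumPy sketch of the idea (nothing to do with CoreML or the Neural Engine specifically): each generated token appends its key/value projections to a cache, so attention at step t computes only one new K/V pair instead of recomputing all t of them.

```python
import numpy as np

d = 8  # head dimension (toy size)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

K_cache, V_cache = [], []  # grows by one entry per generated token

def attend(x):
    """Project this token's K/V once, cache them, attend over all cached steps."""
    K_cache.append(x @ Wk)
    V_cache.append(x @ Wv)
    K, V = np.stack(K_cache), np.stack(V_cache)
    q = x @ Wq
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()                      # softmax over cached positions
    return w @ V

for _ in range(5):                    # autoregressive loop: one token at a time
    out = attend(rng.standard_normal(d))

# The cache holds exactly one K/V pair per generated token.
```

Without the cache, step t would recompute K and V for all t previous tokens, turning a linear per-step cost into a quadratic one over the whole sequence.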

cypress66

>I'm running gpt2-xl (1.5B params) locally with KV caching at 120ms/token (vs. 450ms without caching).

That seems very slow compared to llama.cpp?

GaggiX

Autoregressive transformer models are usually memory bound, whereas SD is compute bound, so perhaps the difference lies here. Also the reason why SD runs so much faster on the GPU than on the CPU.

ninkendo

M1 has (fast) unified memory between GPU and CPU, so something being memory bound ought not to have much bearing on whether it belongs on CPU or GPU… at least in theory. I’m a total noob here though so I may be wrong.

GaggiX

We were mostly discussing the NPU; I don't know if it makes a difference.

eurekin

Ok, impressive!

What are real world use cases for 7B family of models? Is anyone using them for anything productive?

wokwokwok

They're quite good at generating scaffolds and ideas (mistral specifically).

You can use them for trivial NLP tasks ("Between 0 and 1, how similar are these two sentences? Respond with an explanation."), and because it's a small model, you can just run it 4 or 5 times and take an average pretty quickly.
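
A minimal sketch of that run-several-times-and-average pattern. `query_model` is a hypothetical stand-in for whatever local inference call you use (llama.cpp bindings, an HTTP endpoint, etc.), stubbed out here so the example runs; for averaging, the prompt asks for just a number rather than an explanation.

```python
import statistics

def query_model(prompt: str) -> str:
    # Stub: replace with a real call to a local 7B model.
    return "0.8"

def similarity_score(a: str, b: str, runs: int = 5) -> float:
    """Ask the model the same question several times and average the answers."""
    prompt = (
        "Between 0 and 1, how similar are these two sentences? "
        f"Respond with just a number.\nA: {a}\nB: {b}"
    )
    return statistics.mean(float(query_model(prompt).strip()) for _ in range(runs))

print(similarity_score("The cat sat.", "A cat was sitting."))  # 0.8 with the stub
```

With a real model the individual answers will vary run to run, which is exactly why averaging a handful of cheap samples helps.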

anotherjesse

7B coding models? Having massive amounts of questionable code :)

d_sem

Welp, looks like I'm out of a job. Perhaps management will suit me well, where I can make massive amounts of questionable decisions.

eurekin

Yeah, same experience here


Havoc

They're perfectly fine for storytelling and basic chatbot duty. Generating basic code boilerplate also works just fine.

potatoman22

They make good classifiers when fine-tuned.

eurekin

Interesting!

I'd have one use case for classification: user text (from a jira issue) mapped to the team responsible for the fix.

Can you share some tutorials? I only just managed to get this working on windows/cuda:

https://colab.research.google.com/drive/1vk8i01apaSp59GVV2yI...

It's been a royal pain to set up.

objektif

Do you have any pointers on how to start fine-tuning Mistral locally?

cypress66

Use axolotl

nvm0n2

With these improvements llama.cpp/ggml is really becoming a pretty competitive serving stack even for large scale cloud hosted AI. I wonder how ggerganov finds the time to do all this, does anyone know if he's being sponsored?

killthebuddha

He founded a startup https://ggml.ai/

selectodude

Doesn’t seem like much of a business model there.

vineyardmike

IDK, I can see a future. It’s a one-man (for now) business, so minimal costs to consider. If he can swing consulting using the .cpp projects as advertising, that sounds like a good business.

Additionally, I can imagine companies investing and paying for the open source work to expand access to their licensed models. Use the same interface as people use LLAMA but upgrade to BetterModel, fully compatible.

Additionally, I could believe this is simply a build-up to a future acquihire, which is the most lucrative way to be hired.

simonw

He also raised a bunch of money, so I guess the plan is to figure out the business model as he goes.

bick

Anyone here wanting to try an M1 Mac mini 16GB in the cloud for free this month: just send me an email (click my handle). Now that we've moved on to the M2, I've got a handful of M1s available FREE for trying. You can also try an M2 Pro, Max, or Ultra, but for that you'll need to subscribe. https://www.macweb.com/macinthecloud


ngcc_hk

Wondering about the CUDA issue: some can but most cannot?

M2 Ultra can run 128 streams of Llama 2 7B in parallel - Hacker News