Show HN: 1-Bit Bonsai, the First Commercially Viable 1-Bit LLMs

jjcm

1 bit with a FP16 scale factor every 128 bits. Fascinating that this works so well.
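To make the layout described above concrete, here is a minimal sketch of 1-bit quantization with one shared FP16 scale per group of 128 weights. It assumes each weight is stored as a sign bit and that the scale is the mean absolute value of the group. That scale rule is an illustrative assumption, not something the thread confirms about Bonsai's format, and the function names are hypothetical.

```python
import struct

GROUP_SIZE = 128  # weights sharing one scale, as described in the comment above


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_group(weights):
    """Return (scale, bits) for one group; bit 1 means +scale, bit 0 means -scale.

    Assumption: the scale is the mean absolute weight of the group.
    """
    scale = to_fp16(sum(abs(w) for w in weights) / len(weights))
    bits = [1 if w >= 0 else 0 for w in weights]
    return scale, bits


def dequantize_group(scale, bits):
    """Rebuild approximate weights: every weight becomes +scale or -scale."""
    return [scale if b else -scale for b in bits]


def bits_per_weight(group_size=GROUP_SIZE, scale_bits=16):
    """Storage cost per weight: 1 sign bit plus the amortized scale."""
    return (group_size + scale_bits) / group_size


weights = [0.5, -0.25, 0.75, -1.0] * 32  # 128 toy weights
scale, bits = quantize_group(weights)
restored = dequantize_group(scale, bits)
print(scale, restored[:4], bits_per_weight())
```

Under these assumptions the storage cost works out to 1.125 bits per weight, since the 16-bit scale is spread over 128 weights.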

I tried a few things with it. Got it driving Cursor, which in itself was impressive - it handled some tool usage. Via Cursor I had it generate a few web page tests.

On a Monte Carlo simulation of pi, it got the logic correct but failed to build an interface to start the test. Requesting changes mostly worked, but it left behind some stray symbols that caused things to fail. Required a bit of manual editing.

Tried a Simon Willison pelican as well - very abstract, not recognizable at all as a bird or a bicycle.

Pictures of the results here: https://x.com/pwnies/status/2039122871604441213

There doesn't seem to be a demo link on their webpage, so here's a llama.cpp running on my local desktop if people want to try it out. I'll keep this running for a couple hours past this post: https://unfarmable-overaffirmatively-euclid.ngrok-free.dev

najarvg

Thanks for sharing the link to your instance. Was blazing fast in responding. Tried throwing a few things at it with the following results:

1. Generate an R script to take a city and country name, find its lat/long, and map it using ggmaps. It generated a pretty decent script (could be more optimal, but impressive for the model size) with warnings about using geojson if possible.

2. Generate a LaTeX script to display the Gaussian integral equation. It generated a (I think) non-standard version using probability distribution functions instead of the general version, but I still give it points for that. It gave explanations of the formula and parameters, as well as instructions on how to compile the script using Bash etc.

3. Generate a LaTeX script to display the Euler identity equation. This one it nailed.

Strongly agree that the knowledge density is impressive for a 1-bit model with such a small size and blazing fast response.

jjcm

> Was blazing fast in responding.

I should note this is running on an RTX 6000 pro, so it's probably at the max speed you'll get for "consumer" hardware.

ineedasername

consumer hardware?

That... pft. Nevermind, I'm just jealous

abrookewood

Holy hell ... that's a monster of a card

najarvg

I must add that I also tried the standard "should I walk or drive to the carwash 100 meters away for washing the car" and it made the usual error of suggesting a walk, given the distance, health reasons, etc. But then this does not claim to be a reasoning model, and I did not expect, even remotely, for this to be answered correctly. Even previous-generation larger reasoning models struggle with this.

jjcm

I ran it through a rudimentary thinking harness, and it still failed, fwiw:

    The question is about the best mode of transportation to a car wash located 100 meters away. Since the user is asking for a recommendation, it's important to consider practical factors like distance, time, and convenience.

    Walking is the most convenient and eco-friendly option, especially if the car wash is within a short distance. It avoids the need for any transportation and is ideal for quick errands.
    Driving is also an option, but it involves the time and effort of starting and stopping the car, parking, and navigating to the location.
    Given the proximity of the car wash (100 meters), walking is the most practical and efficient choice. If the user has a preference or if the distance is longer, they can adjust accordingly.

adityashankar

Here's the Google Colab link, https://colab.research.google.com/drive/1EzyAaQ2nwDv_1X0jaC5... since the ngrok link likely got DDoSed by the number of individuals coming along.

qingcharles

Thanks, that works. I only tested the 1.7B. It has that original GPT3 feel to it. Hallucinates like crazy when it doesn't know something. For something that will fit on a GTX1080, though, it's solid.

We're only a couple of years into optimization tech for LLMs. How many other optimizations are we yet to find? Just how small can you make a working LLM that doesn't emit nonsense? With the right math could we have been running LLMs in the 1990s?

nnevod

I think that we not only could have, we should have had them.

As far as I understand, neural networks were heavily hyped in the 60s and 70s, and when the hype went bust they fell out of focus. The hardware was not there yet.

Then they were neglected for many years, and really pioneering research was apparently only done by Google. Theoretical breakthroughs came in the 2010s; after GPT-2, mass attention caught up and we (over)focused on neural networks again. GPT-2 was way below the capabilities of current hardware, so we quickly caught up, and now we're optimising.

Had the previous hype bubble not burst, NNs wouldn't have been essentially forgotten, and we'd have had a steady stream of optimisations and improvements while using the maximum of the available hardware.

Something like a voice translation model running locally should have been possible by the end of the 1990s. That way we'd have had a steady increase in LLM capabilities, no hype, and time to adapt and understand how to properly use them with no disruption.

jjcm

Good call. Right now though traffic is low (1 req per min). With the speed of completion I should be able to handle ~100x that, but if the ngrok link doesn't work defo use the google colab link.

adityashankar

The link didn't work for me personally, but that may be a bandwidth issue with me fighting for a connection in the EU

AnthonBerg

As someone whose brain was addled by exposure to art history, I strongly support the suggested pelican on bicycle.

andai

Thanks. Did you need to use Prism's llama.cpp fork to run this?

jjcm

Yep.

andai

Could you elaborate on what you did to get it working? I built it from source, but couldn't get it (the 4B model) to produce coherent English.

Sample output below (the model's response to "hi" in the forked llama-cli):

X ( Altern as the from (.. Each. ( the or,./, and, can the Altern for few the as ( (. . ( the You theb,’s, Switch, You entire as other, You can the similar is the, can the You other on, and. Altern. . That, on, and similar, and, similar,, and, or in

rjh29

It reminds me of very early ChatGPT, with mostly correct answers but some nonsense. Given its speed, it might be interesting to run it through a 'thinking' phase where it double-checks its answers, and/or use search grounding, which would make it significantly more useful.

uf00lme

The speed is impressive. I wish it could be set up for something like speculative decoding.

abrookewood

man, that is really really quick. What is your desktop setup??? GPU?

jjcm

It is fast, but I do have good hardware. A few people have asked for my local inference build, so I have an existing guide that mirrors my setup: https://non.io/Local-inference-build

pdyc

thanks, i tested it, failed in strawberry test. qwen 3.5 0.8B with similar size passes it and is far more usable.

algoth1

Does asking it to think step by step, or character by character, improve the answer? It might be a combination of tokenization and unawareness of its own tokenization shortcomings.

pdyc

No, it did not. With character by character it concluded 2 :-)

cztomsik

I hope you are kidding, how is that a test of any capabilities? it's a miracle that any model can learn strawberry because it cannot see the actual characters and ALSO, it's likely misspelled a lot in the corpus. I've been playing with this model and I'm pleasantly surprised, it certainly knows a lot, quite a lot for 1.1G

selcuka

Interesting. Qwen 3.5 0.8B failed the test for me.

I ran my custom agentic SQL debugging benchmark against it and I'm impressed.

Results: 8 passed, 0 failed, 17 errored out of 25

That puts it right between Qwen3.5-4B (7/25) and Nanbeige4.1-3B (9/25) for example, but it took only 200 seconds for the whole test. Qwen3.5 took 976 seconds and Nanbeige over 2000 (although both of these were on my 1070 so not quite the same hardware)

Granite 7B 4bit does the test in 199 seconds but only gets 4/25 correct.

See https://sql-benchmark.nicklothian.com/#all-data (click on the cells for the trace of each question)

Errors are bad tool calls (vs failures which is incorrect SQL)

I used @freakynit's runpod (thanks!) [1]

[1] https://news.ycombinator.com/item?id=47597268

Imustaskforhelp

I have been using @freakynit's runpod as well. I like making working pomodoro apps as my own custom test, and although it's not good at that (none of the prototypes work), I feel like it can be good within a specific context like SQL, as you mention.

I imagine this being used as sub-agents with some sota models directing them but I wasn't really able to replicate it personally (I had asked Claude to create a detailed plan for a pomodoro app and then passed it to Bonsai)

I also tried its writing skills and they are actually kind of decent. I also found that this model uses comparatively few em-dashes. Its fine-tunes are going to be some really amazing things to come out. I hope someone makes a fine-tune for website/tampermonkey extensions ;)

I remember using ChatGPT-3 with Svelte/SvelteKit to turn a green button into a blue button and have the text inside the buttons change; that was my personal wow moment with GPT-3. (This model wasn't really able to replicate it accurately, even in plain JS.) Maybe the current model isn't good at writing HTML, but the possibilities of custom-training these models, and the idea of a 1-bit model, feel really great to me.

Especially combined with the idea of Ngram-embedding[0] (Meituanlongcat/LongCatFlashLite). I imagine a 1-bit model plus Ngram-embedding, and I feel it opens up endless possibilities.

[0]: https://news.ycombinator.com/item?id=46803687 (I had submitted this but it seems to have had no attention during that time)

Maybe a 1 bit model like this and diffusion models for coding purposes might also go hand in hand, there are many experiments which can be done with this! (Also yes, many thanks to @freakynit running the runpod, I think I really learnt many things about this model in particular because of his runpod)

TLDR: I feel like this model is good at writing, or at least better at it than usual, and it can be good for general-purpose questions by default, but it's not good at making HTML, which is fair. Good to see that it does well in SQL, but I'm not sure how it would fare in normal coding tasks. Either way, it's an extremely fun model to play with!

(Edit: After some more tries, I have been able to make one working prototype after Gemini held its hand by giving it the code and errors. It's not the best at this, but it still works, just barely: https://gist.github.com/SerJaimeLannister/e90e8a134e4163f205...)

> I feel like it can be good within a specific context like Sql as you mention.

Yes, I think very constrained tasks (known data universe, well-known language, etc.) should be the best possible place for small language models to play.

Imustaskforhelp

Yes, though I think this also shows 1-bit LLMs working, so more companies can do the same and we can get more competition in this space (plus the ngram-embedding idea).

Another point: I feel like we can see some really good fine-tuned models come out of this model, since the community is excited about the 1-bit LLM architecture. We are going to see some good innovation in this space in the near future.

simonw

You can run this model on an iPhone via the latest update to this Locally AI app: https://apps.apple.com/us/app/locally-ai-local-ai-chat/id674...

For its size (1.2GB download) it's very impressive.

Here's a pelican it drew me running on my phone - the SVG comments are good, the image not so much: https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D...

newman314

One thing I discovered tonight is that it appears smaller models are remarkably bad at converting time between timezones.

I tested the following using almost all available models on Locally and did not get a single model that got the right answer.

"What is 9:30 am (Taiwan Standard Time, TST) in US Pacific?"
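For reference, the conversion in that prompt can be checked mechanically. This is a minimal sketch, assuming fixed UTC offsets (UTC+8 for Taiwan, UTC-8 for Pacific Standard Time, UTC-7 for Pacific Daylight Time) and an arbitrary placeholder date. The prompt gives no date, so which Pacific offset applies is left open and both are shown.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets; the names are labels only. Real-world code would use a
# tz database so daylight saving time is picked from the date.
TAIWAN = timezone(timedelta(hours=8), "TST")
PACIFIC_STANDARD = timezone(timedelta(hours=-8), "PST")
PACIFIC_DAYLIGHT = timezone(timedelta(hours=-7), "PDT")


def convert_from_taiwan(hour: int, minute: int, target: timezone) -> datetime:
    """Convert a Taiwan wall-clock time to the target fixed-offset zone."""
    # The date is an arbitrary placeholder (an assumption); only the
    # offset arithmetic and the day rollover matter here.
    local = datetime(2024, 1, 15, hour, minute, tzinfo=TAIWAN)
    return local.astimezone(target)


for tz in (PACIFIC_STANDARD, PACIFIC_DAYLIGHT):
    result = convert_from_taiwan(9, 30, tz)
    print(result.strftime("%H:%M %Z on %Y-%m-%d"))
```

Either way the answer lands on the previous calendar day in US Pacific: 17:30 under standard time or 18:30 under daylight time.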

voxelghost

    <!-- Bicycle wheels -->
    <circle cx="285" cy="130" r="5" fill="#81c784" />
    <circle cx="315" cy="130" r="5" fill="#81c784" />
    <circle cx="285" cy="160" r="5" fill="#81c784" />
    <circle cx="315" cy="160" r="5" fill="#81c784" />

Did you ask for a pelican with a bicycle, or was that just an added bonus?

simonw

The prompt I always use for this is:

  Generate an SVG of a pelican riding a bicycle

IshKebab

It's a well known LLM test. Google "SVG pelican bicycle".

freakynit

Open access for the next 5 hours (8B model, running on an RTX 3090), or until the server crashes or this spot instance gets taken away :) =>

https://ofo1j9j6qh20a8-80.proxy.runpod.net

  ./build/bin/llama-server \
   -m ../Bonsai-8B.gguf \
   -ngl 999 \
   --flash-attn on \
   --host 0.0.0.0 \
   --port 80 \
   --ctx-size 65500 \
   --batch-size 512 \
   --ubatch-size 512 \
   --parallel 5 \
   --cont-batching \
   --threads 8 \
   --threads-batch 8 \
   --cache-type-k q4_0 \
   --cache-type-v q4_0 \
   --log-colors on

The server can serve 5 parallel requests, with each request capped at around `13K` tokens...
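The ~13K figure follows from the flags above if the total context is split evenly across the parallel slots. That even split is an assumption here, though it matches the numbers given. A quick check, with a hypothetical helper name:

```python
def context_per_slot(ctx_size: int, parallel: int) -> int:
    """Tokens available to each request if the context is divided evenly."""
    return ctx_size // parallel


# Values taken from the command above: --ctx-size 65500, --parallel 5
print(context_per_slot(65500, 5))  # 13100
```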

A few benchmarks I ran:

1. Input: 700 tokens, ttfs: ~0 seconds, output: 1822 tokens at ~190t/s

2. Input: 6400+ tokens, ttfs: ~2 seconds, output: 2012 tokens at ~135t/s

VRAM usage was consistently at ~4GiB.

ggerganov

Better keep the KV cache in full precision

freakynit

Wow.. the GOAT himself.. thank you sooo much for creating llama.cpp ... will re-deploy with full kv cache once requests stop coming.

ramon156

I genuinely love talking to these models

https://ofo1j9j6qh20a8-80.proxy.runpod.net/#/chat/5554e479-0...

I'm contemplating whether I should drive or walk to the car wash (I just thought of that one HN post) and this is what it said after a few back-and-forths:

- Drive to the car (5 minutes), then park and wash.

- If you have a car wash nearby, you can walk there (2 minutes) and do the washing before driving to your car.

- If you're in a car wash location, drive to it and wash there.

Technically the last point was fine, but I like the creativity.

logicallee

That was really impressive. https://pastebin.com/PmJmTLJN pretty much instantly. (Very weak models can't do this.)

freakynit

Update: this has been evicted by runpod as it was on spot.

Imustaskforhelp

Kind sir, May I say to you thanks for doing so! I really appreciate it :D

TRCat

Thank you! I am impressed by the speed of it.

kgeist

[dead]

wild_egg

Don't have a GPU so tried the CPU option and got 0.6t/s on my old 2018 laptop using their llama.cpp fork.

Then found out they didn't implement AVX2 for their Q1_0_g128 CPU kernel. Added that and getting ~12t/s which isn't shabby for this old machine.

Cool model.

UncleOxidant

Are you getting anything besides gibberish out of it? I tried their recommended commandline and it's dog slow even though I built their llama.cpp fork with AVX2 enabled. This is what I get:

    $ ./build/bin/llama-cli     -hf prism-ml/Bonsai-8B-gguf -p "Explain quantum computing in simple terms." -n 256 --temp 0.5 --top-p 0.85 --top-k 20 -ngl 99
    > Explain quantum computing in simple terms.

     \( ,

      None ( no for the. (,./. all.2... the                                                                                                                                ..... by/

EDIT: It runs fine in their Colab notebook. Looking at that, you have to do: git checkout prism (in the llama.cpp repo) before you build. That's a missing instruction if you're going straight to their fork of llama.cpp. Works fine now.

UncleOxidant

UPDATE: I was using the llama.cpp CPU backend and was still getting gibberish. On Google colab they're running with CUDA. I turned Claude loose on the problem and it discovered a problem in the llama.cpp CPU backend code where a float was being converted to an int and basically going to 0. Now it runs fine locally with the CPU backend.
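The class of bug described above is easy to reproduce in isolation. The sketch below is purely illustrative and is not the actual llama.cpp code or the fix: the scale value and helper name are hypothetical. It shows how truncating a small per-group scale to an integer zeroes out every dequantized weight, which would plausibly produce gibberish output.

```python
def dequantize(bits, scale):
    """Map sign bits to weights: bit 1 becomes +scale, bit 0 becomes -scale."""
    return [scale if b else -scale for b in bits]


scale = 0.037        # hypothetical per-group scale; such scales are usually well below 1
bits = [1, 0, 1, 1]  # toy sign bits

correct = dequantize(bits, scale)
# Python's int() truncates toward zero, like a C float-to-int cast,
# so int(0.037) is 0 and every weight collapses to zero.
broken = dequantize(bits, int(scale))

print(correct)
print(broken)
```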

gorgonical

Mind sharing the fix as a patch? I would like to run it this way, too.

cubefox

"Not shabby" is a big understatement.

ddtaylor

Why so?

boxedemp

Because it's the opposite of shabby

alyxya

I expect the trend of large machine learning models to go towards bits rather than operating on floats. There's a lot of inefficiency in floats because typically they're something like normally distributed, which makes the storage and computation with weights inefficient when most values are clustered in a small range. The foundation of neural networks may be rooted in real valued functions, which are simulated with floats, but float operations are just bitwise operations underneath. The only issue is that GPUs operate on floats and standard ML theory works over real numbers.

cubefox

> and standard ML theory works over real numbers.

This paper uses binary numbers only, even for training, with a solid theoretical foundation: https://proceedings.neurips.cc/paper_files/paper/2024/file/7...

TL;DR: They invent a concept called "Boolean variation" which is the binary analog to the Newton/Leibniz derivative. They are then able to do backpropagation directly in binary.

hrmtst93837

[flagged]

guerrilla

Well this is perfect then. We just post-process models like this after training.

drob518

I’m really curious how this scales up. Bonsai delivers an 8B model in 1.15 GB. How large would a 27B or 35B model be? Would it still retain the accuracy of those large models? If the scaling holds, we could see 100+B models in 64 GB of RAM.
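A back-of-the-envelope estimate for the size question, assuming every weight costs 1 bit plus an FP16 scale shared by each group of 128 weights (the layout described at the top of the thread), and ignoring file metadata and any tensors kept at higher precision. It says nothing about whether accuracy holds at larger scales. The function name is hypothetical.

```python
def estimated_size_gb(n_params: float, group_size: int = 128, scale_bits: int = 16) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    bits_per_weight = 1 + scale_bits / group_size  # 1.125 with the defaults
    return n_params * bits_per_weight / 8 / 1e9


for billions in (8, 27, 35, 100):
    size = estimated_size_gb(billions * 1e9)
    print(f"{billions}B parameters -> ~{size:.1f} GB")
```

With these assumptions an 8B model comes out near 1.1 GB, close to the 1.15 GB quoted above, and a 100B model lands around 14 GB, comfortably inside 64 GB of RAM. The small gap for the 8B case is not explained by anything in this thread.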

cubefox

Also depends on how expensive training these models is. It's probably at least as expensive as full precision models, otherwise they would have mentioned it.

londons_explore

My guess is the training process is their secret sauce...

cubefox

Yes, but their training speed is not secret. If their process were fast, they would have said so.

MeetRickAI

[dead]

_fw

What’s the trade-off? If it’s smaller, faster and more efficient - is it worse performance? A layman here, curious to know.

kvdveer

Their own (presumably cherry-picked) benchmarks put their models near the 'middle of the market' models (llama3 3b, qwen3 1.7b), not competing with Claude, ChatGPT, or Gemini. These are not models you'd want to interact with directly, but they can be very useful for things like classification, simple summarization, or translation tasks.

These models are quite impressive for their size: even an older Raspberry Pi would be able to handle them.

There's still a lot of use for this kind of model.

sossov

[dead]

adityashankar

If you look at their whitepaper (https://github.com/PrismML-Eng/Bonsai-demo/blob/main/1-bit-b...) you'll notice that it does have some tradeoffs due to model intelligence being reduced (page 10)

The average of MMLU Redux, MuSR, GSM8K, HumanEval+, IFEval, and BFCLv3 for this model is 70.5, compared to 79.3 for Qwen3. That said, the model is also 16x smaller and 6x faster on a 4090, so it is a pretty respectable tradeoff.

I'd be interested in fine tuning code here personally

druskacik

The 8B model response to my "Harry Potter knowledge-bench" question is too funny not to share.

> *Fathers of Harry and James Potter*: - Sirius Black is the *father* of *James Potter* (the older brother of Harry).

> - James Potter is *Harry's uncle* and the *older brother* of *Luna Lovegood*.

> - This means *Sirius and James are Harry's uncles*, though they are *father and brother*.

https://pastebin.com/WAAmFKfX

WaterRun

Feels a bit like gradually moving back toward analog circuits, step by step. There is less and less need for the precision that digital circuits provide.

TheLNL

What ? How did you come to this conclusion with this context ?

WaterRun

Traditional programming requires the absolute precision provided by digital circuits; a single bit flip can lead to a completely different outcome.

Large models do not require that kind of exactness. They are somewhat like a "field" or a "probability cloud": as long as the main directional tendency is correct, a few individual deviations—or even a whole cluster of them—make almost no difference.

fxwin

I'm very skeptical of the advantage they're claiming here. The whitepaper [0] only compares these to full precision models, when the more interesting (and probably more meaningful) comparison would be with other quantized models with a similar memory footprint.

Especially considering that these models seem to more or less just be quantized variants of Qwen3 with custom kernels and other inference optimizations (?) rather than fine tuned or trained from scratch with a new architecture, I am very surprised (or suspicious rather) that they didn't do the obvious comparison with a quantized Qwen3.

Their (to my knowledge) new measure/definition of intelligence seems reasonable, but introducing something like this without thorough benchmarking + model comparison is even more of a red flag to me.

[0] https://github.com/PrismML-Eng/Bonsai-demo/blob/main/1-bit-b...

riedel

Actually IMHO the promise would be beyond standard FP4 quants. I think the goal is more where 1.58 bit (ternary) quants are heading. Having said that it would be interesting to see performance on nonstandard HW.

kent8192

Oh, boy. This good tool hates my LM Studio... The following message appears when I run Bonsai in my LM Studio. I think my settings have done something wrong.

    Failed to load the model
    Error loading model. (Exit code: null). Please check the settings and try loading the model again.

liuliu

It needs an MLX fork, because the lowest bit width in MLX is currently 2 (for affine quantization).

riidom

That mlx is for apple hardware only, though? Or did I misunderstand something.

dragonwriter

It needs a llama.cpp fork, too; so the stock runtime (based on stock llama.cpp) used by LM Studio presumably won't work for it.

dodos

Same issue here, wanted to give it a shot but ran into that error trying to load the model in lm studio.
