Brian Lovin
/
Hacker News

bastawhiz

There's no way the red v2 is doing anything with a 120b parameter model. I just finished building a dual a100 ai homelab (80gb vram combined with nvlink). Similar stats otherwise. 120b only fits with very heavy quantization, enough to make the model schizophrenic in my experience. And there's no room for kv, so you'll OOM around 4k of context.

I'm running a 70b model now that's okay, but it's still fairly tight. And I've got 16gb more vram than the red v2.

I'm also confused why this is 12U. My whole rig is 4u.

The green v2 has better GPUs. But for $65k, I'd expect a much better CPU and 256gb of RAM. It's not like a threadripper 7000 is going to break the bank.

I'm glad this exists but it's... honestly pretty perplexing
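
To make the "no room for kv" point concrete, here is a rough sketch of how KV cache size grows with context. The formula is the standard one for attention caches; the model shape (80 layers, 8 KV heads, head dimension 128, 16-bit values) is an assumed Llama-3-70B-style configuration used only for illustration, not the shape of any model named in this thread.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for n_tokens of context."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# Assumed, illustrative shape (Llama-3-70B-style), 16-bit cache entries.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

for ctx in (4_096, 32_768):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"{ctx:>6} tokens: {gib:.2f} GiB of KV cache")
```

For this assumed shape the cache costs 320 KiB per token, so whatever VRAM is left after the weights are loaded sets a hard ceiling on context length.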

overfeed

> I'm also confused why this is 12U. My whole rig is 4u.

I imagine that's because they are buying a single SKU for the shell/case. I imagine their answer to your question would be: In order to keep prices low and quality high, we don't offer any customization to the server dimensions

ottah

That's just such a massively oversized server for the number of gpus. It's not like they're doing anything special either. I can buy an appropriately sized supermicro chassis myself and throw some cards in it. They're really not adding enough value to justify overspending on anything.

randomgermanguy

The major selling point of the tinyboxes is that you're able to run them in your office without any hassle.

I used to own a Dell PowerEdge for my home office, but those fans, even on the minimum setting, kept me up at night.

oceanplexian

It will work fine but it’s not necessarily insane performance. I can run a q4 of gpt-oss-120b on my Epyc Milan box that has similar specs and get something like 30-50 Tok/sec by splitting it across RAM and GPU.

The thing that’s less useful is the 64G VRAM/128G System RAM config. Even the large MoE models only need 20B for the router, so the rest of the VRAM is essentially wasted (mixing experts between VRAM and system RAM has basically no performance benefit).

androiddrew

Could you share what you are using for inference and how you are running it? I have a 64G VRAM/128G system RAM setup.

sosodev

Most people are using something in the llama family for inference. Llama server is my go to. Unsloth guides describe how to configure inference for your model of choice.

syntaxing

Split RAM and GPU impacts it more than you think. I would be surprised if the red box doesn’t outperform you by 2-3X for both PP and TG

datadrivenangel

Yeah I've got the q4 gpt-oss-120b running at ~40-60 tokens per second on an M5 Pro.

ericd

Was that cheaper than a Blackwell 6000?

But yeah, 4x Blackwell 6000s are ~32-36k, not sure where the other $30k is going.

bastawhiz

I bought the A100s used for a little over $6k each.

ericd

Oh, why'd you go that route? Considering going beyond 80 gigs with nvlink or something?

segmondy

folks have more money than sense. gpt-oss-120b full quant runs on my quad 3090 at 100tk/sec and that's with llama.cpp; with vllm it will probably run at 150tk/sec, and that's without batching.

Aurornis

> gpt-oss-120b full quant runs on my quad 3090

A 120B model cannot fit on 4 x 24GB GPUs at full quantization.

Either you're confusing this with the 20B model, or you have 48GB modded 3090s.

integralid

Thanks for chiming in. I'm looking for a reasonably cheap local LLM machine, and multiple 3090s is exactly what I planned to buy. Do you have any recommendations or recommend any reading material before I decide to spend money on that?

edit: Found your comment about /r/localllama, but if you have anything more to add I'm still very interested.

amarshall

You're almost certainly (definitely, in fact) confusing the 120b and 20b models.

ericd

How're you fitting a model made for 80 gig cards onto a GPU with 24 gigs at full quant?

gfiorav

I think Hotz basically created super specific software for the gpus that throws away anything that doesn't contribute to inference (not turing complete, for example).

zozbot234

> And there's no room for kv, so you'll OOM around 4k of context.

Can't you offload KV to system RAM, or even storage? It would make it possible to run with longer contexts, even with some overhead. AIUI, local AI frameworks include support for caching some of the KV in VRAM, using a LRU policy, so the overhead would be tolerable.
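
As a toy illustration of the LRU idea described here, the sketch below keeps a fixed number of KV blocks in a fast tier and spills the least recently used ones to a slow tier. The class name, the block granularity, and the two dictionaries standing in for VRAM and system RAM are all hypothetical; this is not how any particular inference framework implements it.

```python
from collections import OrderedDict


class LruKvStore:
    """Keeps at most `vram_slots` KV blocks in the fast tier; the rest spill."""

    def __init__(self, vram_slots):
        self.vram_slots = vram_slots
        self.vram = OrderedDict()  # block_id -> data, least recently used first
        self.system_ram = {}       # spilled blocks (stand-in for the slow tier)

    def put(self, block_id, data):
        self.vram[block_id] = data
        self.vram.move_to_end(block_id)  # mark as most recently used
        self._evict()

    def get(self, block_id):
        if block_id in self.vram:
            self.vram.move_to_end(block_id)
            return self.vram[block_id]
        # Miss: pull the block back from the slow tier and promote it.
        # In a real system this is the transfer that crosses the PCIe bus.
        data = self.system_ram.pop(block_id)
        self.put(block_id, data)
        return data

    def _evict(self):
        while len(self.vram) > self.vram_slots:
            old_id, old_data = self.vram.popitem(last=False)
            self.system_ram[old_id] = old_data


store = LruKvStore(vram_slots=2)
for block in ("a", "b", "c"):
    store.put(block, f"kv-{block}")
print(list(store.vram), list(store.system_ram))
```

Every miss in `get` costs a slow-tier transfer, so the overhead depends entirely on how often the working set exceeds the fast tier.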

tcdent

Not worth it. It is a very significant performance hit.

With that said, people are trying to extend VRAM into system RAM or even NVMe storage, but as soon as you hit the PCI bus with the high bandwidth layers like KV cache, you eliminate a lot of the performance benefit that you get from having fast memory near the GPU die.

zozbot234

> With that said, people are trying to extend VRAM into system RAM or even NVMe storage

Only useful for prefill (given the usual discrete-GPU setup; iGPU/APU/unified memory is different and can basically be treated as VRAM-only, though a bit slower) since the PCIe bus becomes a severe bottleneck otherwise as soon as you offload more than a tiny fraction of the memory workload to system memory/NVMe. For decode, you're better off running entire layers (including expert layers) on the CPU, which local AI frameworks support out of the box. (CPU-run layers can in turn offload to storage for model parameters/KV cache as a last resort. But if you offload too much to storage (insufficient RAM cache) that then dominates the overhead and basically everything else becomes irrelevant.)

bastawhiz

The performance already isn't spectacular with it running all in vram. It'll obviously depend on the model: MoE will probably perform better than a dense model, and anything with reasoning is going to take _forever_ to even start beginning its actual output.

ranger_danger

I know llama.cpp can, it certainly improved performance on my RAM-starved GPU.

Aurornis

> There's no way the red v2 is doing anything with a 120b parameter model.

I don't see the 120B claim on the page itself. Unless the page has been edited, I think it's something the submitter added.

I agree, though. The only way you're running 120B models on that device is either extreme quantization or by offloading layers to the CPU. Neither will be a good experience.

These aren't a good value buy unless you compare them to fully supported offerings from the big players.

It's going to be hard to target a market where most people know they can put together the exact same system for thousands of dollars less and have it assembled in an afternoon. RTX 6000 96GB cards are in stock at Newegg for $9000 right now which leaves almost $30,000 for the rest of the system. Even with today's RAM prices it's not hard to do better than that CPU and 256GB of RAM when you have a $30,000 budget.

ottah

Honestly two rtx 8000s would probably have a better return on investment than the red v2. I have an eight gpu server: five rtx 8000, three rtx 6000 ada. For basic inference, the 8000s aren't bad at all. I'm sure the green with four rtx pro 6000s is dramatically faster, but there's a $25k markup I honestly don't understand.

packetlost

This does not match my experience with ~120B models. I run Qwen3.5 122b A10B on about 80GB of vRAM just fine.

bastawhiz

Qwen 3.5 is MoE. But you're also almost certainly running a quantized version. 120B is well over 200gb at bf16. With int4 you're looking at 60gb or so. Qwen uses relatively little kv (only about 2gb for 64k context). So you're not too snug, but if qwen isn't cutting it for you, as it didn't for me, you're kind of in a pickle. For writing tasks, int4 was simply too chaotic. I also couldn't get it to use tools.

For me, qwen didn't cut it. You're not fine tuning a 120b parameter model with 80gb. You're probably not going to be able to abliterate it either, because it's moe. Other options use more vram, and where you'd have a fair amount of buffer with qwen, you're pressed with other big models.
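
The weight-size figures above follow from parameter count times bits per parameter. A minimal sketch of that arithmetic, ignoring quantization metadata, activations, and KV cache (the helper name is just for illustration):

```python
def weight_gb(n_params, bits_per_param):
    """Approximate weight storage in GB (10**9 bytes), weights only."""
    return n_params * bits_per_param / 8 / 1e9


for name, bits in (("bf16", 16), ("int8", 8), ("int4", 4)):
    print(f"120B @ {name}: {weight_gb(120e9, bits):.0f} GB")
```

This gives 240 GB at bf16 and 60 GB at int4 for 120B parameters, which is why an 80GB setup only works with a heavily quantized model.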

ivraatiems

There's some irony in the fact that this website reads as extremely NOT AI-generated, very human in the way it's designed and the tone of its writing.

Still, this is a great idea, and one I hope takes off. I think there's a good argument that the future of AI is in locally-trained models for everyone, rather than relying on a big company's own model.

One thought: The ability to conveniently get this onto a 240v circuit would be nice. Having to find two different 120v circuits to plug this into will be a pain for many folks.

solarkraft

I find that the most respected writing about AI has very few signs of being written by AI. I'm guessing that's because people in the space are very sensitive to the signs and signal vs. noise.

rimeice

And because people writing anything worth reading are using the process of writing to form a proper argument and develop their ideas. It’s just not possible to do that by delegating even a small chunk of the work to AI.

Aperocky

I found it useful to preface with

* this section written by me typing on keyboard *

* this section produced by AI *

Usually both exist in documents and lengthy communications. This gets what I wanted across with exactly my intention, and then I can attach 10x length worth of AI appendix that would be helpful indexing and references.

jolmg

> attach 10x length worth of AI appendix that would be helpful indexing and references.

Are references helpful when they're generated? The reader could've generated them themselves. References would be helpful if they were personal references of stuff you actually read and curated. The value then would be getting your taste. References from an AI may well be good-looking nonsense.

wat10000

If you’re spending $65,000 on this thing, needing two circuits seems like a minor problem

ycui1986

they could have gone with the Max-Q version RTX PRO 6000 and only required a single 120V circuit. 10% performance hit, but half the power.

fundamentally, looks like they are shipping consumer off-the-shelf hardware in a custom box.

ericd

Yeah, the other big benefit is that the Max-Q's have blowers that exhaust the hot air out of the box, the workstation cards would each blow their exhaust straight into the intake of the card behind it. The last card in that chain would be cooking, as the air has already been heated up by 1800W, essentially a hair dryer on high.

Or could be the server edition 6000s that just have a heatsink and rely on the case to drive air through them, those are 600W cards.

ivraatiems

The $12,000 one also requires it.

wat10000

The specs show that it only has one PSU. The docs just say that it has 2 and thus needs two circuits, but I’d guess that was meant to be for the more expensive one.

knollimar

Easier to get two circuits than rewire a breaker in an office you might be renting, no?

(I work for an electrical contractor so my sense of ease might be overcorrecting)

isatty

Surprisingly affordable but I’m not really interested in the 9070XT.

If it shipped with like 4090+ (for a higher price) it’d be more tempting.

jofzar

Good? That's what I want out of all websites. I don't want to read what an AI believes is the best thing for a website, I want to know the honest truth.

agnishom

I don't view this as irony. This seems like good sense in understanding when AI usage will make things better and when it will not.

Lerc

I am a little surprised that they openly solicit code contributions with "Invest with your PRs" but don't have any statement on AI contributions.

Maybe the volume is low enough for them that well-intentioned but poor quality PRs can be politely (or otherwise, culture depending) disregarded, and the method of generation is not important.

KeplerBoy

Tinygrad sure shared a few opinions on AI PRs on Twitter. I believe the gist was "we have Claude code as well, if that's all you bring don't bother".

all2

That's a pretty excellent take, IMO. Just an undirected AI model doesn't do much, especially when the core team has time with the code, domain expertise, _and_ Claude.

cyanydeez

I'm starting to think that if you have an AI repo that's basically about codegen, you should just close all issues automatically, then manually (or whatever) reopen the ones you/maintainers actually care about. That's about the only way to cut some of the noise AIs are creating.

Then you could focus fire, like the script kiddies did with DDoS in the old days on fixing whatever preferred issues you have.

adrianwaj

"locally-trained models for everyone"

Wouldn't there be a massive duplication of effort in that case? It'll be interesting to see how the costs play out. There are security benefits to think about as well in keeping things local-first.

all2

There are multiple efforts for 'folding at home' but for AI models at this point. I get the impression that we will see a frontier model released this year built on a system like this.

nutjob2

3200W at ~240V is ~15A, that's just a regular household socket, at least in Europe. I imagine 240V sockets in the US are at least 15A.

No need for separate circuits, just use a double adapter.
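
The current draw here is just power divided by voltage. A quick sketch using the 3200W figure from this comment, at both 240V and 120V:

```python
def amps(watts, volts):
    """Current in amperes drawn by a load of `watts` at `volts`."""
    return watts / volts


for volts in (240, 120):
    print(f"3200 W at {volts} V: {amps(3200, volts):.1f} A")
```

That works out to about 13.3 A at 240V and about 26.7 A at 120V, which is why the same load that fits one 240V socket has to be split across circuits at 120V.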

vessenes

The exabox is interesting. I wonder who the customer is; after watching the Vera Rubin launch, I cannot imagine deciding I wanted to compete with NVIDIA for hyperscale business right now. Maybe it’s aiming at a value-conscious buyer? Maybe it’s a sensible buy for a (relatively) cash-strapped ML startup; actually I just checked prices, and it looks like Vera Rubin costs half for a similar amount of GPU RAM. I’m certain that the interconnect will not be as good as NV’s.

I have no idea who would buy this. Maybe if you think Vera Rubin is three years out? But NV ships, man, they are shipping.

kulahan

Sometimes you can compete with the big boys simply because they built their infra 5+ years ago and it’s not economically viable for them to upgrade yet, because it’s a multi-billion dollar process for them. They can run a deficit to run you out of the business, but if you’re taking less than 0.01% of their business, I doubt they’d give a crap.

h14h

Have to imagine each tinybox is targeting different tiers of startups trying to fine-tune/RL their way to custom models for narrow use-cases.

Maybe the target profile for exabox looks like a smaller/younger Cursor? If you're a small team with some seed funding and expertise, this kind of compute in a single box you can set up in your office feels like it could be a great fit.

zozbot234

> The exabox is interesting.

Can it run Crysis?

dist-epoch

Yes, it can generate Crysis with diffusion models at 60 fps.

WithinReason

Only gamers understand that reference

-- Jensen Huang

zargon

*Only gamers know that joke.

bastawhiz

Probably, the rdna5 can do graphics. But it would be a huge waste, since you could probably only use one of the 720 GPUs

paxys

The problem with all these "AI box" startups is that the product is too expensive for hobbyists, and companies that need to run workloads at scale can always build their own servers and racks and save on the markup (which is substantial). Unless someone can figure out how to get cheaper GPUs & RAM there is really no margin left to squeeze out.

nine_k

Would a hedge fund that does not want to trust a public AI cloud just buy chassis, mobos, GPUs, etc, and build an equivalent themselves? I suspect they value their time differently.

paxys

Why do you think a hedge fund can't hire a couple of IT guys? Most of the larger ones have technical operations that would put big tech to shame.

ViscountPenguin

Medium sized hedge funds are a good portion of the market, and only really want to hire just enough tech people to keep the quant pipelines running.

signal_v1

[dead]

p1esk

They wouldn’t build anything - they would order from Dell or Supermicro.

mihaaly

We may be surprised how inefficient companies are at organizing the creation of sophisticated things (including processes) for their own use (that is, for the cost center column).

Higher management figures out what to do at a strategic level, in brief, and pushes it onto "soldiers", who kick it through in the least time (cheapest of the cheap, for the sake of the quarterlies) EXACTLY the way management told them to. They have to: their job is to make the given company objectives happen, the way they are given. The result is crap pushed out in the shape of the thing expected.

Larger organizations can use these kinds of things the most, even if they don't.

qubex

They’re kickstarting a TINY device that is pocketable and aimed at consumers. I’ve backed it (full disclosure).

jgrizou

griffinmb

This is not the same company. The OP Tiny Corp accused them of Trademark infringement on Twitter, due to exactly this kind of misconception.

ankaz

[dead]

kkralev

[flagged]

wmf

> just want to run a 7-8b model locally

This is already solved by running LM Studio on a normal computer.

zozbot234

Ollama or llama.cpp are also common alternatives. But a 8B model isn't going to have much real-world knowledge or be highly reliable for agentic workloads, so it makes sense that people will want more than that.

alexfromapex

$12,000 for the base model is insane. I have an Apple M3 Max with 128GB RAM that can run 120B parameter models using like 80 watts of electricity at about 15-20 tokens/sec. It's not amazing for 120B parameter models but it's also not 12 grand.

Thaxll

M3 max tflops is tiny compared to the 12k box. It's not even comparable.

davej

It is very comparable if you work out the $/tok/s on inference. I did some napkin math and it looks like you’re getting roughly 3x the performance for 3x the cost. Red v2 vs Mac Studio M3 Ultra 96GB.

If you compare tokens/kWh efficiency then my math has Mac Studio being about 1.5x more efficient.
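
For anyone who wants to redo this napkin math with their own measurements, a minimal sketch of the two metrics follows. Every input number below is a placeholder chosen only to reproduce the ratios mentioned above (3x price, 3x throughput, 1.5x efficiency); none of them is a measured price, throughput, or power draw for either machine.

```python
def dollars_per_tps(price_usd, tokens_per_sec):
    """Purchase price per unit of sustained throughput."""
    return price_usd / tokens_per_sec


def tokens_per_kwh(tokens_per_sec, watts):
    """Tokens generated per kilowatt-hour of energy at a steady power draw."""
    return tokens_per_sec * 3600 * 1000 / watts


# Placeholder inputs, NOT measurements: (price in USD, tokens/sec, watts).
machines = {
    "machine A": (4_000, 20, 200),
    "machine B": (12_000, 60, 900),
}

for name, (price, tps, watts) in machines.items():
    print(f"{name}: {dollars_per_tps(price, tps):.0f} $/(tok/s), "
          f"{tokens_per_kwh(tps, watts):,.0f} tok/kWh")
```

With these placeholders both machines cost the same per tok/s while the smaller one gets 1.5x the tokens per kWh; real numbers would of course shift both results.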

zozbot234

M3 has tolerable decode performance for the price, and that's what people would care about most of the time. They underperform severely wrt. prefill, but that's a fraction of the workload. AI, even agentic AI, spends most of its time outputting tokens, not processing context in bulk.

segmondy

it's for fools. i bought 160gb of vram for $1000 last year. 96gb of p40 VRAM can be had for under $1000. And it will run gpt-oss-120b Q8 at probably 30tk/sec

timschmidt

P40 is Pascal architecture (from the Tesla product line), which is no longer receiving driver or CUDA updates. And only available as used hardware. Fine for hobbyists, startups, and home labs, but there is likely a growing market of businesses too large to depend on used gear from ebay, but too small for a full rack solution from Nvidia. Seems like that's who they're targeting.

segmondy

99% of interest is in inference. If you want to fine-tune a model, just rent the best gpu in the cloud. It's often cheaper and faster.

siliconc0w

Tinybox is cool but I think the market is maybe looking more for a turn-key explicit promise of some level of intelligence @ a certain Tok/s like "Kimi 2.5 at 50Tok/s".

roarcher

> In order to keep prices low and quality high, we don't offer any customization to the box or ordering process. If you aren't capable of ordering through the website, I'm sorry but we won't be able to help.

Has this guy never worked on a B2B product before? Nobody is going to order a $10 million piece of infrastructure through your website's order form. And they are definitely going to want to negotiate something, even if it's just a warranty. And you'll do it because they're waving a $10 million check in your face.

The tone of this website is arrogant to the point of being almost hostile. The guy behind this seems to think that his name carries enough weight to dictate terms like this, among other things like requiring candidates to have already contributed to his product to even be considered for a job. I would be extremely surprised if anyone except him thinks he's that important.

codemog

I haven’t seen tinygrad used for any mainstream production project or thing of value, yet.

Besides a lot of self-congratulatory pats on the back for how elegant it is. Honestly, when I read it, it looked as confusing as all the other ML libraries. Not actually simple like Karpathy’s stuff.

All that to say, I do really want it to succeed. They should probably hire some practical engineers and not just guys and gals congratulating themselves on how elegant and awesome they are.

jen729w

Your framing of this section is misleading. On the site it's preceded by a FAQ-style 'question':

> Can you fill out this supplier onboarding form?

That's very important context, as anyone who has been asked to fill out a supplier onboarding form (hi) will attest.

roarcher

Filling out an onboarding form is an example of what he's not willing to do, not the only thing he isn't willing to do.

> we don't offer any customization to the box or ordering process

Every B2B deal of that size that I've ever seen requires at least weeks of meetings between the customer and vendor, in which every detail is at least discussed if not negotiated. That would certainly constitute a "customization" to this guy's prescribed ordering process, which is to "Buy it now" [1] through the website at the stated price like you're ordering a jar of peanuts on Amazon. This is not "framing", it's what the guy said. If it isn't what he meant then he needs to fix his copy.

[1] Yes, there is an actual "Buy it now" button for a $65,000 business purchase that takes you to a page that looks just like a Stripe form. There isn't even a textbox for delivery instructions. Wild.

awesomeMilou

Then if they succeed, I guess you're going to see a different process for the first time in your life.

On a website where we frequently talk about disruptive business models, this whole attitude kinda stinks.

phrotoma

> arrogant to the point of being almost hostile

First encounter with geohot eh?

crossroadsguy

What does this mean? Is it some reference to different temperaments across geographies? Or some Internet slang?

wmf

He's not actually selling the exabox yet. It sounds like he put up a hypothetical config to see if anyone is interested.

kube-system

The specs for the “exabox” scream “this is a joke” to me.

> 20,000 lbs

> concrete slab

Huge-scale IT systems are typically delivered in one or more 42/44u cabinets, and are designed to be installed on raised floors.

0xbadcafebee

It's a shipping container. Look at the dimensions. They say concrete slab probably half as a joke, half because building code would require it to consider it a non-temporary structure.

wmf

It's a shipping container that you install outdoors.

kube-system

Are you referring to the images of branded shipping containers on their Twitter page that have visible Gemini watermarks … and jokes in the comments about AI trailer parks?

roarcher

It's also funny that they explicitly list driver quality as "good" for the base option and "great" for the intermediate one. You're really going to deliberately provide worse drivers for the machine I paid you for, just because I didn't buy the more expensive one?

I mean I'm sure lots of companies do this in practice because tickets for higher-paying customers naturally get prioritized, but directly stating your intention to do it on your home page is hilarious.

wmf

Nvidia drivers are better than AMD. It's not really something they have control over. Geohot is definitely obsessed with bitching about driver bugs though.

kube-system

I took that as a dig against AMD vs Nvidia driver quality.

zekrioca

I guess it is called ‘honesty’.

HWR_14

There isn't a $10MM device right now, just $65k and under. I doubt the order process will remain the same in 12 months when the $10MM device becomes available.

jrflowers

I imagine that the FAQ might get updated when there’s actually a $10M machine for sale

roarcher

Maybe. Frankly I'd be very surprised if any business ordered a $65k machine that way either.

jrflowers

Yeah it’s a little odd. Maybe they are meant to be really really cool toys? People regularly spend more than $65k on things like cars to show off, so it could be like that.

I have no use for these but I might buy one anyway if I won the lottery. ¯\_(ツ)_/¯

Havoc

> arrogant to the point of being almost hostile.

The YouTube rap video of geohotz telling Sony lawyers suing him to blow him is still up.

His style of dealing with corporate matters is certainly unconventional

lofaszvanitt

Well, at least he had the power that average joes don't have. And he used it well.

hmokiguess

Is this like the new equivalent of crypto mining? I remember the early days when they would sell hardware for farming crypto, now it’s AI?

latchkey

Kind of yes, except there is no block reward.

barnabee

The block reward is firing humans and collecting ad revenue for slop

ekropotin

IDK, I feel it’s quite overpriced, even with the current component prices.

I'm almost sure it’s possible to custom build a machine as powerful as their red v2 within a 9k budget. And have a lot of fun along the way.

lostmsu

AMD now has 32 GiB Radeon AI Pro 9700. 4 of these (just under 2k each) would put you at 128 GiB VRAM

ekropotin

VRAM is not everything - GPU cores also matter (a lot) for inference

cyanydeez

inference speed is like monitor Hz; sure, you go from 60 to 120Hz and that's noticeable, but unless your model is AGI, at some point you're just generating more code than you'll ever realistically be able to control, audit and rely on.

So, context is probably more $/programming worth than inference speed.

lostmsu

4x Radeon will have significantly more GPU power than say Mac Studio or DGX Spark.

mellosouls

Where is the 120B documented? This seems to be an editorialized title.

Edit: found a third party referencing the claim but it doesn't belong in the title here I think:

Meet the World’s Smallest ‘Supercomputer’ from Tiiny AI; A Machine Bold Enough to Run 120B AI Models Right in the Palm of Your Hand

https://wccftech.com/meet-the-worlds-smallest-supercomputer-...

Aurornis

That third party link is from a different company (Tiiny with an extra i)

Now I'm wondering if the HN title was submitted by some AI bot that couldn't tell the difference.

mellosouls

Ha, good catch, I googled for Tinybox 120B and clearly didn't read the article beyond the seeming match.

adrianwaj

Perhaps this company should think about acting as a landlord for their hardware. You buy (or lease) but they also offer colocation hosting. They could partner with crypto miners who are transitioning to AI factories to find the space and power to do this. I wonder if the machines require added cooling, though, in what would otherwise be a crypto mining center. CoreWeave made the transition and also do colocation. The switchover is real.

I think Tinygrad should think about recycling. Are they planning ahead in this regard? Is anyone? My thought is that if there were a central database of who owns what and where, then at least when the recycling tech becomes available, people will know where to source their specific trash (and even pay for it). Having a database like that in the first place could even fuel the industry.

operatingthetan

The incremental price increases between products are funny.

$12,000, $65,000, $10,000,000.

znpy

I was more worried by the 600kW power requirement... that's 200 houses at full load (3kw) in southern europe... which likely means 400 houses at half load.

the town near my hometown has 650 – 800 houses (according to chatgpt).

crazy.

nine_k

Or it's two 300kW fast EV chargers working together.

A typical home just consumes rather little energy, now that LED lighting and heat pump cooling / heating became the norm.

delusional

I think the above commenter is reflecting on the total energy use from having a 600 kW load running 24/7. I suppose the more interesting observation is the 14 MWh of daily consumption, enough to charge 100 Rivians every day.
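
The numbers in this subthread can be checked with a few lines of arithmetic. The 600 kW load and the 3 kW household limit come from the comments above; the 135 kWh battery pack is an assumption picked as a plausible large EV pack, not a quoted spec.

```python
def daily_energy_kwh(load_kw, hours=24.0):
    """Energy used by a constant load over `hours`, in kWh."""
    return load_kw * hours


LOAD_KW = 600        # exabox power figure quoted in the thread
HOUSE_LIMIT_KW = 3   # per-household contract limit quoted in the thread
EV_PACK_KWH = 135    # assumption: illustrative large EV battery pack

houses = LOAD_KW / HOUSE_LIMIT_KW
daily_kwh = daily_energy_kwh(LOAD_KW)

print(f"{houses:.0f} houses at their full contracted load")
print(f"{daily_kwh / 1000:.1f} MWh per day")
print(f"{daily_kwh / EV_PACK_KWH:.0f} full EV charges per day")
```

Under these assumptions the load equals 200 households at their contract limit and 14.4 MWh per day, or roughly a hundred full EV charges.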

paganel

> and heat pump cooling / heating became the norm.

We're not all solidly middle-class (especially in Southern and Eastern Europe) and as such we cannot afford those heat pumps. But we'll have to eat the increased energy costs brought by insane server configurations like the ones from the article, so, yeey!!!

znpy

> now that LED lighting and heat pump cooling / heating became the norm.

My brother in Christ, you vastly overestimate southern europe

nutjob2

> at full load (3kw)

Do you live in a deprived rural village in a very poor country? Because you can't even run a heater and the oven with 3kW.

znpy

No it’s quite the norm actually.

Most power contracts give you a 3 kW power supply for a residential home. That’s the standard.

Bumping to 4.5 or 6 kW must be requested explicitly and costs extra on the base power supply bill.

dist-epoch

Your hometown also has public lighting, water pumps, and probably some other stuff.

ericd

That’s surprising, 200 amp 240v service is pretty common in the US.

sudo_cowsay

I mean the difference in performance is quite big too. However, the 10,000,000 is a little bit too much (imo).
