hereme888
nabla9
Memory is the bottleneck. It limits the size of the models you can run and what you pay for.
Spark: 128 GB LPDDR5x, unified system memory
5090 : 32 GB GDDR7,
Model sizes (parameter size):
Spark: 200B
5090: 12B (raw)
artemisart
That's very true and it's what's segmenting the market, but I don't understand why you're saying the 5090 supports only a 12B model when it can go up to 50-60B (= a bit less than 64B, to leave room for inference) as it supports FP4 as well.
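As a sanity check on those figures, capacity is roughly VRAM divided by bytes per parameter. A sketch of my own (not from the thread), ignoring the extra room needed for KV cache and activations:

```python
# Rough capacity estimate: how many parameters fit in VRAM at a given
# precision. Real limits are lower once KV cache and activations are
# accounted for.
def max_params_billions(vram_gb: float, bytes_per_param: float) -> float:
    return vram_gb / bytes_per_param

for label, bpp in [("FP16", 2.0), ("FP8/INT8", 1.0), ("FP4", 0.5)]:
    print(f"32 GB at {label}: ~{max_params_billions(32, bpp):.0f}B params")
```

That's where the ~64B FP4 ceiling for a 32 GB card comes from; leaving headroom for inference lands you in the 50-60B range.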
nabla9
It's for comparison using raw, non-optimized models. Both can do much better when you optimize for inference.
The information is in the ratio of these numbers, and the ratio stays the same.
Y_Y
You are doing god's work.
In fact you're also doing the work Nvidia should have done when they put together their (imho) ridiculously imprecise spec sheet.
aurareturn
It's not good value when you put it like that. It doesn't have a lot of compute and bandwidth. What it has is the ability to run DGX software for CUDA devs I guess. Not a great inference machine either.
jacquesm
It's great at one thing: memory. And that's interesting because memory is a commodity, but they still make bank on just being able to access it.
aurareturn
Memory is a commodity but access to high bandwidth memory is expensive whether it's HBM, or LPDDR/DDR connected to many memory channels.
dheera
The RTX Pro 6000 Blackwell gets you 96GB of RAM, a LOT more compute, and costs ~$7K which is not that much more than the $3K-$4K you'd pay for a DGX Spark.
I think the RTX Pro is probably the best deal right now if you're looking for a GPU dev desktop and don't care about physical size or power consumption.
canucker2016
5090: 32GB RAM (Newegg & Amazon lowest price seems to be ~$300 more)
4090: 24GB RAM
Thor & Spark: 128GB RAM (probably at least 96GB usable by the GPU if they behave similar to the AMD Strix Halo APU)
oliwary
True... It would be very interesting to compare various open models by token generation speed on these platforms. Presumably, starting at some size, the larger accessible RAM wins out over raw speed with low VRAM? Although I suppose things like MoE and FP precision would also matter.
bjackman
Note you cannot actually get a 5090 for $1999; that's just the RRP. I believe they actually cost ~$4k.
IshKebab
I just googled it and the first result was one in stock for £2200. That's including tax. I assume $1999 is excluding tax. Without tax and converted to dollars it's $2470.
From other less reliable sources like eBay they are more like £1800.
bjackman
Huh, you are right. I googled it yesterday too but I guess I had confirmation bias and happened to stumble across an old price and go "yep still 4k".
Well, I'm glad to be wrong on this!
hnuser123456
The prices came down to near MSRP in the last month or so.
alkonaut
So long as that's true, it's also likely you'll see the same markup for the spark, so they should compare similarly.
conradev
where does an RTX Pro 6000 Blackwell fall in this? I feel like that’s the next step up in performance (and about the same price as two Sparks)
qingcharles
I thought the 6000 was slightly lower throughput than 5090, but obviously has a shitload more RAM.
skhameneh
It's more throughput, but way less value, and there's still no NVLink on the 6000. Something like ~4x the price, ~20% more performance, 3x the VRAM.
There are two models that go by "6000"; the RTX Pro 6000 (Blackwell) is the one that's currently relevant.
undefined
boulos
As long as you're going to add FP8 dense, you could do the same for the parts mentioned in the FP4 section. Divide by two from dense => sparse, and another two for FP4 => FP8.
That gives you 250 TOPS of FP8-dense for the Spark.
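The conversion described above is just two halvings; as a sketch:

```python
# Sparse -> dense halves the throughput figure, and FP4 -> FP8 halves it
# again, per the rule of thumb in the parent comment.
def fp4_sparse_to_fp8_dense(tops: float) -> float:
    return tops / 2 / 2

print(fp4_sparse_to_fp8_dense(1000))  # Spark: 1000 TOPS FP4-sparse -> 250.0 TOPS FP8-dense
```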
syntaxing
While a completely different price point, I have a Jetson Orin Nano. Some people forget the kernels are more or less set in stone for products like these. I could rebuild my own JetPack kernel, but it's not that straightforward to update something like CUDA or any other module. Unless you're a business whose product relies on this hardware, I find it hard to justify buying this for consumer applications.
coredog64
Came in here to say the same thing. Have bought 3 Nvidia dev boards and never again as you quickly get left behind. You're then stuck compiling everything from scratch.
larodi
My experience with the Jetson Nano was that it had to have its Ubuntu debloated first (with a 3rd-party script) before we could get their NN library to run the image recognition workload designated to run on the device.
These seem to be highly experimental boards, even though they are super powerful for their form factor.
syntaxing
That’s true for the Jetson Nano. The Jetson Orin Nano (I know, the naming sucks) is much better in that aspect. Higher memory (8GB), way higher memory bandwidth (120 GB/s), and Orin has way more CUDA cores. It can pretty much run any “traditional” neural network, even YOLO large and even LLMs.
qwertox
According to Wendell from Level1Techs, the now-launched Jetson Thor uses a Linux kernel built by Nvidia, on Ubuntu 20.04 [0]. So I assume getting upgrades will have the same feel as with Chinese SBCs like those from Radxa, or cheap Android devices.
I wonder if this also applies to this DGX Spark. I hope not.
bri3d
In the case of Jetson, NVidia also have a fairly generic BSP which you can use to customize almost any distribution, and the Jetson-customized Ubuntu is standard enough that you can upgrade it using the normal Ubuntu upgrade path without major issue.
For most of the Tegra boards there’s also upstream support. Overall the situation with NVidia BSP is about 10000x better than weird Chinese stuff. In the case of Tegra/Jetson, there’s even detailed first-party documentation about reconstructing the BSP components from source:
https://docs.nvidia.com/jetson/archives/l4t-archived/l4t-327...
I’d assume the decent software support will carry over to DGX.
rowanG077
Oooof that would be an instant dealbreaker for me... Better get a mac pro with asahi linux. That at least has great linux support.
fh973
Marketing material says NVIDIA DGX™ OS, which at version 7 would be an Ubuntu 24.04: https://docs.nvidia.com/dgx/dgx-os-7-user-guide/introduction...
naikrovek
or how the Jetson Nano was on Ubuntu 18.04 when it was released in 2019 and never got a single major OS upgrade.
That platform was great for a few months...
cherioo
The mainstream options seem to be
Ryzen AI Max 395+, ~120 tops (fp8?), 128GB RAM, $1999
Nvidia DGX Spark, ~1000 tops fp4, 128GB RAM, $3999
Mac Studio max spec, ~120 tflops (fp16?), 512GB RAM, 3x bandwidth, $9499
The DGX Spark appears to potentially offer the most tokens per second, but it's less useful/valuable as an everyday PC.
lhl
RDNA3 CUs do not have FP8 support and its INT8 runs at the same speed as FP16 so Strix Halo's max theoretical is basically 60 TFLOPS no matter how you slice it (well it has double INT4, but I'm unclear on how generally useful that is):
512 ops/clock/CU * 40 CU * 2.9e9 clock / 1e12 = 59.392 FP16 TFLOPS
Note, even with all my latest manual-compilation bells and whistles and the latest TheRock ROCm builds, the best I've gotten mamf-finder up to is about 35 TFLOPS, which is still not amazing efficiency (most Nvidia cards are at 70-80%), although a huge improvement over the single-digit TFLOPS you might get out of the box.
If you're not training, your inference speed will largely be limited by available memory bandwidth, so Spark token generation will be about the same as the 395.
On general utility, I will say that the 16 Zen5 cores are impressive. It beats my 24C EPYC 9274F in single and multithreaded workloads by about 25%.
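The peak-throughput arithmetic above generalizes to a one-liner (a sketch; the 2.9 GHz clock is the figure assumed in the parent comment):

```python
# Theoretical peak = ops/clock/CU * CU count * clock, scaled to TFLOPS.
def peak_tflops(ops_per_clock_per_cu: int, n_cu: int, clock_hz: float) -> float:
    return ops_per_clock_per_cu * n_cu * clock_hz / 1e12

# Strix Halo FP16: 512 ops/clock/CU, 40 CUs, assumed 2.9 GHz boost clock
print(f"{peak_tflops(512, 40, 2.9e9):.3f} TFLOPS")  # 59.392 TFLOPS
```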
UncleOxidant
> Ryzen AI Max 395+, ~120 tops (fp8?), 128GB RAM, $1999
Just got my Framework PC last week. It's easy to set up to run LLMs locally - you have to use Fedora 42, though, because it has the latest drivers. It was super easy to get qwen3-coder-30b (8-bit quant) running in LMStudio at 36 tok/sec.
alias_neo
I'm pretty new to this, so if I wanted to benchmark my current hardware and compare to your results what would be the best way to do that?
I'm looking at going for a Framework Desktop and would like to know what kind of performance gain I'd get over the current hardware I have, which so far I have a "feel" for the performance of from running Ollama and OpenWebUI, but no hard numbers.
linuxftw
What nobody seems to ever share is the context and TTFT (time to first token). You can get a very good TPS by using small prompts, even if the output tokens are very large. If you try to do any kind of agentic coding locally, where contexts are 7k+, local hardware completely falls over.
qwen-code (cli) gives like 2k requests per day for free (and is fantastic), so unless you have a very specific use case, buying a system for local LLM use is not a good use of funds.
If you're in the market for a desktop PC anyway, and just want to tinker with LLMs, then the AMD systems are a fair value IMO, plus the drivers are open source so everything just works out of the box (with Vulkan, anyway).
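The TTFT point above is easy to quantify: prompt processing is roughly linear in context length, so long agentic contexts hurt far more than raw tok/s suggests. A back-of-envelope sketch (the 200 tok/s prompt-processing rate is an assumption for illustration, not a measured figure):

```python
# Time to first token ~= prompt length / prompt-processing throughput.
def ttft_seconds(prompt_tokens: int, pp_tokens_per_sec: float) -> float:
    return prompt_tokens / pp_tokens_per_sec

# A 7k-token agentic-coding context at an assumed 200 tok/s of prompt
# processing already means a ~35 s wait before the first output token.
print(f"{ttft_seconds(7000, 200):.0f} s")
```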
UncleOxidant
You could load up LMStudio on your current hardware, get qwen3-coder-30b (8-bit quant), and give it some coding tasks - something meaty (I had it create a recursive descent parser in C that parses the C programming language). At the end of its response it shows the tok/sec. I'm getting 36 tok/sec on the Framework running that model.
hasperdi
Hi, could you share whether you get decent coding performance (quality-wise) with this setup? I.e., is it good enough to replace, say, Claude Code?
UncleOxidant
qwen3-coder-30b is surprisingly good for a smallish model, but it's not going to replace Claude Code. Maybe if you're using it for Python it could do well enough. I've been trying it with C code generation and it's not bad, but certainly not at Claude Code level. I hope they come out with a qwen coder model in the 60b to 80b range - something like that would give higher quality results and likely still run in the 15 tok/sec range which would be usable.
pixelpoet
Very encouraging result, I'm waiting super anxiously for mine! How much memory did you allocate for the iGPU?
UncleOxidant
I haven't done any fiddling with that yet. Out of the box it seems to allocate 1/2 for the iGPU. The qwen3-coder-30b 8bit quant model was (as you would expect) only taking 30GB (a bit less than half of what was allocated). Though weirdly, in htop it shows that the CPU has 125GB available to it, so I'm not sure what to make of that.
jauntywundrkind
NVidia Spark is $4000. Or, will be, supposedly whenever it comes out.
Also notably, Strix Halo and DGX Spark are both ~275GBps memory bandwidth. Not always but in many machine learning cases it feels like that's going to be the limiting factor.
robbomacrae
GosuCoder's latest video seems to be a well timed test of using Ryzen AI Max on some local models getting 40 TPS on a quantized Qwen 3 coder.
rjzzleep
Maybe the real value of the DGX spark is to work on Switch 2 emulation. ARM + Nvidia GPU. Start with Switch 2 emulation on this machine and then optimize for others. (Yeah, I know, kind of expensive toy).
pta2002
I think you can get something a lot cheaper if that’s all you want, e.g. something in the Jetson Orin line. That’s more similar to the switch, also, since it’s a Tegra CPU.
ThatMedicIsASpy
Expensive today. But how quickly (years) will these systems lower in value? At least on the Nvidia side of things they can be stacked.. so maybe not so much =/
littlestymaar
You should add memory bandwidth to your comparison, as it's usually the bottleneck in terms of tps (at least for token generation, prompt processing is a different story).
aurareturn
> Mac Studio max spec, ~120 tflops (fp16?), 384GB RAM, 3x bandwidth, $9499
512GB.
DGX has 256GB/s bandwidth, so it wouldn't offer the most tokens/s.
rz2k
Perhaps they are referring to default GPU allocation that is 75% of the unified memory, but it is trivial to increase it.
jauntywundrkind
The GPU memory allocation refers to how capacity is allotted, not bandwidth. Sounds like the same 256-bit/quad-channel 8000MHz LPDDR5 you can get today with Strix Halo.
echelon
tokens/s/$ then.
seanalltogether
Do we need a new term to distinguish "unified memory" where the CPU and GPU are still isolated from each other and memory must be allocated to one or the other, from "unified memory" where the CPU and GPU can both access the same addresses? Which systems use which?
nightski
Am I missing something or does the comparably priced (technically cheaper) Jetson Thor have double the PFLOPs of the Spark with the same memory capacity and similar bandwidth?
Apes
My understanding is the DGX Spark is optimized for training / fine tuning and the Jetson Thor is optimized for running inference.
Architecturally, the DGX Spark has a far better cache setup to feed the GPU, and offers NVLINK support.
AlotOfReading
There's a lot of segmentation going on in the Blackwell generation from what I'm told.
modeless
Also Thor is actually getting sent out to robotics companies already. Did anyone outside Nvidia get a DGX Spark yet?
ComplexSystems
The RAM bandwidth is so slow on this that you can barely train or do inference or do anything on it. I think the only use case they have in mind for this is fine tuning pretrained models.
wmf
It's the same as Strix Halo and M4 Max that people are going gaga about, so either everyone is wrong or it's fine.
gardnr
Memory Bandwidth:
Nvidia DGX: 273 GB/s
M4 Max: (up to) 546 GB/s
M3 Ultra: 819 GB/s
RTX 5090: ~1.8 TB/s
RTX PRO 6000 Blackwell: ~1.8 TB/s
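A rough rule of thumb for why these bandwidth numbers dominate single-stream token generation: each generated token has to read (roughly) the active weights once, so tok/s is capped near bandwidth divided by model size in bytes. A sketch with illustrative numbers (the 30 GB model size is my assumption, e.g. a ~30B dense model at 8-bit):

```python
# Bandwidth-bound ceiling for single-stream decoding:
# tokens/s ~= memory bandwidth / bytes read per token (~ active weight size).
def tokens_per_sec_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

MODEL_GB = 30  # assumed: ~30B-param dense model at 8-bit quantization
for name, bw in [("Nvidia DGX", 273), ("M4 Max", 546), ("RTX 5090", 1800)]:
    print(f"{name}: ~{tokens_per_sec_ceiling(bw, MODEL_GB):.0f} tok/s ceiling")
```

This is only a ceiling: real throughput is lower, and MoE models (which read only the active experts per token) do much better than their total size suggests.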
aurareturn
M4 max has more than double the bandwidth.
Strix Halo has the same and I agree it’s overrated.
Rohansi
I would expect/hope that DGX would be able to make better use of its bandwidth than the M4 Max. Will need to wait and see benchmarks.
7thpower
The other ones are not framed as an “AI Supercomputer on your desk”, but instead are framed as powerful computers that can also handle AI workloads.
littlestymaar
Same as Strix Halo, which is 30% cheaper and readily available, yes.
Hence the disappointment.
garyfirestorm
What did I miss? This was revealed in May - I don’t see anything new in that link since it was revealed.
wmf
Not much. There was a presentation yesterday but it's mostly what we already knew: https://www.servethehome.com/nvidia-outlines-gb10-soc-archit...
numpad0
This has been getting delayed for months. Display out isn't working or something.
KingOfCoders
I think it depends on your model size
Fits into 32gb: 5090
Fits into 64gb - 96gb: Mac Studio
Fits into 128gb: for now 395+ $/token/s,
Mac Studio if you don't care about $
but don't have unlimited money for Hxxx
This could be great for models that fit in 128gb when you want the best $/token/s (if it is faster than a 395+).
timc3
The 395 although it can be supplied with 128GB can’t use all that for VRAM (unless something has changed in the last couple of weeks).
lhl
In Linux, you can set it as high as you want, although you should probably have a swap drive and still be prepared for your system to die if you set it to 128GiB. Here's how you'd set it to 120GiB:
# This is deprecated, but can still be referenced
# (gttsize is in MiB: 120 GiB = 122880 MiB)
options amdgpu gttsize=122880
# This specifies GTT by # of 4KB pages:
# 31457280 * 4KB / 1024 / 1024 = 120 GiB
options ttm pages_limit=31457280
KingOfCoders
From YouTube, it seems models up to ~105GB on disk work, yes.
dirtyhand
I was considering getting an RTX 5090 to run inference on some LLMs, but now I'm wondering if it's worth paying an extra $2K for this option instead.
apitman
If you want to run small models fast get the 5090. If you want to run large models slow get the Spark. If you want to run small models slow get a used MI50. If you want to run large models fast get a lot more money.
Gracana
You might be able to do "large models slow" better than the spark with a 5090 and CPU offload, so long as you stick with MoE architectures. With the kv cache and shared parts of the model on GPU and all of the experts on CPU, it can work pretty well. I'm able to run ~400GB models at 10 tps with some A4000s and a bunch of RAM. That's on a Xeon W system with poor practical memory bandwidth (~190GB/s), you can do better with EPYC.
Apes
RTX 5090 is about as good as it gets for home use. Its inference speeds are extremely fast.
The limiting factor is going to be the VRAM on the 5090, but nvidia intentionally makes trying to break the 32GB barrier extremely painful - they want companies to buy their $20,000 GPUs to run inference for larger models.
skhameneh
RTX 5090 for running smaller models.
Then the RTX Pro 6000 for running a little bit larger models (96gb VRAM, but only ~15-20% more perf than 5090).
Some suggest Apple Silicon only for running larger models on a budget because of the unified memory, but the performance won't compare.
BoorishBears
No. These are practically useless for AI.
Their prompt processing speeds are absolutely abysmal: if you're trying to tinker from time to time, a GPU like a 5090 or renting GPUs is a much better option.
If you're just trying to prep for impending mainstream AI applications, few will be targeting this form factor: it's both too strong compared to mainstream hardware, and way too weak compared to dedicated AI-focused accelerators.
-
I'll admit I'm taking a less nuanced take than some would prefer, but I'm also trying to be direct: this is not ever going to be a better option than a 5090.
aurareturn
> Their prompt processing speeds are absolutely abysmal
They are not. This is Blackwell with Tensor cores. Bandwidth is the problem here.
BoorishBears
They're abysmal compared to anything dedicated at any reasonable batch size because of both bandwidth and compute, not sure why you're wording this like it disagrees with what I said.
I've run inference workloads on a GH200 which is an entire H100 attached to an ARM processor and the moment offloading is involved speeds tank to Mac Mini-like speeds, which is similarly mostly a toy when it comes to AI.
querez
"developers can prototype, fine-tune, and inference [AI models]"...
shouldn't it be infer?
myrmidon
It should be "run inference on" in my opinion, and would be best shortened IMO to just "prototype, fine-tune, and run".
I'd argue that "inference" has taken on a somewhat distinct new meaning in an LLM context (loosely: running actual tokens through the model), and deviating from that term to the base verb form would make the sentence less clear to me.
killerstorm
No. It's quite common for technical slang to deviate from general vocabulary.
Cf. "compute" is a verb for normal people, but for techies it is also "hardware resources used to compute things".
querez
I don't think "inference" as a verb has become technical slang. At least not in my bubble.
globular-toast
Perhaps "infer from"? I was also taken aback by how they just decided to make "inference" a verb, though. A decent writer would have rewritten the sentence to make it work, similar to how a software implementation sometimes just doesn't work out. But apparently that's too much to ask from Nvidia marketing.
Funnily enough, things like this show that a human probably was involved in the writing. I doubt an LLM would have produced that. I've often thought about how future generations are going to signal that they are human, and maybe the way will be human language changing much more rapidly than it has done, maybe even mid-sentence.
nharada
I don’t think either of those are right…
FP4-sparse (TFLOPS) | Price | $/TF4s
5090: 3352 | 1999 | 0.60
Thor: 2070 | 3499 | 1.69
Spark: 1000 | 3999 | 4.00
____________
FP8-dense (TFLOPS) | Price | $/TF8d (4090s have no FP4)
4090 : 661 | 1599 | 2.42
4090 Laptop: 343 | varies | -
____________
Geekbench 6 (compute score) | Price | $/100k
4090: 317800 | 1599 | 503
5090: 387800 | 1999 | 516
M4 Max: 180700 | 1999 | 1106
M3 Ultra: 259700 | 3999 | 1540
____________
Apple NPU TOPS (not GPU-comparable)
M4 Max: 38
M3 Ultra: 36
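The $/TFLOPS columns in the tables above are just price divided by throughput; recomputing the FP4-sparse table as a quick check:

```python
# Price-per-TFLOPS, as in the FP4-sparse table above.
parts = {
    "5090":  (3352, 1999),   # (FP4-sparse TFLOPS, price in $)
    "Thor":  (2070, 3499),
    "Spark": (1000, 3999),
}
for name, (tflops, price) in parts.items():
    print(f"{name}: ${price / tflops:.2f} per FP4-sparse TFLOPS")
```

This prints 0.60, 1.69, and 4.00, matching the $/TF4s column.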