
JonChesterfield

I want one of these a lot. Spec https://www.amd.com/en/products/accelerators/instinct/mi300/....

It's 24 x86-64 cores and 228 CDNA compute units sharing a common 128GB of memory. Personally, I want to run constraint solvers on one. The general approach of doing lots of integer/float work on the GPU and branchy work on the CPU, both hitting the same memory, feels like an order-of-magnitude capability improvement over current systems.

brucethemoose2

Unfortunately the BoM on these things is probably super high. Like abnormally high, even for a datacenter GPU.

AMD is coming out with a "Strix Halo" APU somewhat appropriate for compute.

https://hothardware.com/news/amd-strix-halo-cpu-rumors

GeekyBear

> Unfortunately the BoM on these things is probably super high... even for a datacenter GPU

You would save money on improved yields of chiplets and on using cheaper nodes where appropriate. I imagine that would help offset the increased packaging costs.

However, they are moving to compete in a market where the profit margins are so obscene that a small increase in the cost of materials really wouldn't be an important factor.

> Nvidia Makes Nearly 1,000% Profit on H100 GPUs: Report

https://www.tomshardware.com/news/nvidia-makes-1000-profit-o...

varelse

And that's the consequence of giving the market leader a decade and a half head start. No one, and I mean no one, was preventing AMD or Intel from building a competitive hardware and software ecosystem other than themselves.

JonChesterfield

Yeah, I can't reasonably buy one. But rent one to do some maths? Good odds.

There are some existing APU systems - I use low wattage ones as thin clients. Currently thinking it should be possible to write code that runs slowly on a cheap APU and dramatically faster on a MI300A system. Debug locally, batch compute on AWS (or wherever ends up hosting these things).

abstractcontrol

BoM?

MR4D

Bill of Materials.

Also known as the cost of the chip.

Kye

Bill of Materials

fooker

If you can figure out how to effectively utilize this much parallelism for constraint solving, you are likely going to get a Turing award.

dmead

I want to run image stacking for my astrophotography habit. 12 hours to drizzle my current data sets is a long wait.

dsab

What tool would you advise to find the best combination of dividers and multipliers in the PLL clocks, and dividers in the UART and SPI peripherals, so that the deviation from the assumed baud rate is as small as possible, and so that the dividers meet a number of constraints, including e.g. those from the chip errata which say that one PLL must be 2 times faster than the other?

fedegiova

Write a half-page-long Python script that iterates over all combinations. It won't take long, and you only have to do it once.
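A minimal sketch of that script, with entirely made-up register ranges and an illustrative "PLL1 must be 2x PLL2" errata constraint; check the real datasheet for actual limits:

```python
# Brute-force search for PLL multiplier/divider and UART divider settings
# that minimize baud-rate error. All ranges and constraints below are
# illustrative assumptions, not taken from any real datasheet.

XTAL_HZ = 8_000_000      # assumed crystal frequency
TARGET_BAUD = 115_200    # assumed target baud rate

def search(max_results=5):
    results = []
    for mul in range(4, 65):         # assumed PLL multiplier range
        for div in range(1, 9):      # assumed PLL post-divider range
            pll1 = XTAL_HZ * mul // div
            pll2 = pll1 // 2         # errata-style constraint: PLL1 = 2 * PLL2
            if not 1_000_000 <= pll2 <= 100_000_000:
                continue             # assumed valid PLL output window
            for uart_div in range(1, 4096):  # assumed 12-bit UART divider
                baud = pll2 / uart_div
                err = abs(baud - TARGET_BAUD) / TARGET_BAUD
                if err < 0.05:       # keep only candidates within 5%
                    results.append((err, mul, div, uart_div, baud))
    results.sort()
    return results[:max_results]

for err, mul, div, uart_div, baud in search():
    print(f"mul={mul:2d} div={div} uart_div={uart_div:4d} "
          f"baud={baud:9.1f} err={err:.4%}")
```

With these made-up ranges the search space is only a couple of million combinations, so even naive Python finishes in seconds; exhaustive search is entirely practical here.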

segfaultbuserr

"When in doubt, use brute force." - Ken Thompson

rsaxvc

Excel. Conditionally format based on constraint violations and difference from the target baud rate. Brute force and scroll.

gavinray

I've had good experiences with OptaPlanner:

https://www.optaplanner.org/

imtringued

Try optlang. It mostly does ILP/LP though.

avereveard

there's a good chance an M3 Ultra will be competitive in that area and will actually be available to people

maciejgryka

Oxide and Friends just did an episode talking about this: https://oxide.computer/podcasts/oxide-and-friends/1643335

aidenn0

I just discovered On the Metal earlier this year and finished it a week ago. I'll probably get to this episode of Oxide & Friends sometime next year.

mkj

How do they physically align all the parts? Do they have some kind of self-aligning mechanism, or is it done with external manipulation?

(Or maybe that's TSMC's secret)

brucethemoose2

Yeah, it is TSMC, see:

https://3dfabric.tsmc.com/english/dedicatedFoundry/technolog...

Older explanation, but lots of this stuff is just now shipping: https://www.anandtech.com/show/16051/3dfabric-the-home-for-t...

Not that AMD doesn't deserve credit: they have considerable multi-chip experience under their belt and undoubtedly served as a guinea pig/pipe cleaner for TSMC's advanced packaging.

amelius

It doesn't sound very complicated compared to, say, aligning the masks.

bob1029

I think the other parameters are way more difficult to get right than any optical system alignment considerations. Things like statistical process control make photolithography look less scary.

You can check if you correctly patterned the wafer almost immediately. You won't know if the layer is any good until many process steps later. Maybe not for sure until EDS. Tuning the interactions between manufacturing processes is the actual secret sauce that all the manufacturers are trying to protect. How much dose on the EUV machine depends a lot on how you intend to etch the wafer. Imagine iteration cycles measured in months for changing individual floating point variables.

crotchfire

The masks only need to be aligned during manufacturing.

The chiplets need to stay aligned as temperatures change. Much more difficult.

lightedman

Not very difficult. Mask alignment is harder; soldering physics takes over once the oven comes up to temperature and (usually) keeps things locked down.

sylware

Ask ASML.

The constraints all depend on the scale of the alignment required.

I wonder what wavelength range is used for their interferometers, and what kind of mechanical actuators they use (probably piezoelectric-based).

pk-protect-ai

Ping me when the software stack for the AMD hardware is as good as CUDA.

tovej

What exactly are you missing when hipifying your CUDA codebase? For most of the software I've looked at this has been a breeze, mostly consisting of setting up toolchains.

Or do you mean the profiler tooling?

I hear everyone say that AMD doesn't have the software, but I'm a little confused --- have you tried HIP? And have you tried the automatic CUDA-to-HIP translation? What's missing?

dotnet00

I think they're referring to support in general. Technically HIP exists, but it's a pain to actually use: it has limited hardware support, is far less reliable about supporting older hardware, needs a new binary for every new platform, and so on.

CUDA runs on pretty much every NVIDIA GPU; this year they dropped support for GPUs released 9 years ago, and older binaries are very likely to be forward compatible.

Meanwhile my Radeon VII is already unsupported despite still being pretty capable (especially for FP64), and my 5700XT was never supported at all (I may be mixing this up with support for their math libraries); everyone was just led on with promises of upcoming support for 2 years. So "AMD has the software now" is not really convincing.

tovej

I suppose if you're talking about consumer cards, I agree, support is often missing.

But if we're talking datacenter GPUs, the software is there. Data centers are where most GPGPU computing happens, after all.

It's not ideal when it comes to hobby development, but if you're working in a professional capacity I'm assuming you're working with a modern HPC or AI cluster.


FL33TW00D

There is a truly gigantic demand for this - I expect you won't be waiting too long.

qwertox

Related (published yesterday): Intel CEO attacks Nvidia on AI: 'The entire industry is motivated to eliminate the CUDA market' https://www.tomshardware.com/tech-industry/artificial-intell...

bbatha

The chip industry is, sure. But are the customers? The customers who cared are jaded by nearly 15 years of Intel and AMD utterly failing to make a compelling alternative, and likely have a large existing investment in CUDA-based GPUs.

chrsw

It feels like playing software catchup in this fast moving sector is very challenging. NVIDIA+CUDA is kind of a standard at this point. AMD's CPUs still ran Windows. AMD's GPUs still ran DirectX and OpenGL. This feels different.

throwup238

I’ve been waiting sixteen years.

FL33TW00D

There has never been more money riding on eliminating the CUDA monopoly than now.

tails4e

HIP is a direct drop-in for CUDA. It's ready, and many folks using it have ported CUDA code with little to no effort.

The SW story has been bad for a long time, but right now it is perhaps better than you think.

orbital-decay

That's a lot of layers with different thermal expansion coefficients and conductivity. How do they cool all this?

bryanlarsen

The AMD announcement spent a surprising amount of time on this issue and the ways they're dealing with it.

flkenosad

Can't wait for n-layered chips.

yazaddaruvala

Anyone have insight on:

Why don’t they position the SOC to have the HBM in the center with the CCDs and XCDs on the perimeter?

Seems to me that would yield lower “wire length” through the interconnect for each CCD/XCD to all of the memory.

brucethemoose2

The XCDs need to talk to each other very quickly, way faster than the HBM, to act like a single chip.

Also, the physical interconnect between the XCDs is different than the HBM.

jauntywundrkind

The memory isn't designed to connect to a lot of different chips. So there's no reason to put it in the center.

The core interconnect sits at the base, center-most (well, there's a passive interposer too, but that's just wires): the IOD. It intermediates connections across the chips on top of it, the IOD next to it, and the memory.

thrtythreeforty

HBM is designed for the wire lengths that you're talking about. The core protocol isn't so terribly different from good old DDR4.

The die-to-die links, on the other hand, have extremely short wire length limits (less than a centimeter iirc for UCIe). At the physical layer, you just waste a bunch of power and area driving a high voltage to make the signal go further. Do that enough and you basically wind up with a PCIe PHY.

wtallis

Each memory chip is connected to a single DRAM controller. Those DRAM controllers are what you'd want to be close to each other to minimize NUMA effects.

soganess

complete and total unfounded guess... thermal expansion?

PedroBatista

Maybe thermals? Idk


dharma1

Priced in on the stock yet? Looks like team red starting to nip at Nvidia’s heels again

drexlspivey

It’s around 12% up since their event but some of it is beta, everything is pumping

FeepingCreature

Is this actually a GPU? I.e., can it even render graphics and scan out to a monitor?

Pomfers

AMD's compute-oriented cards used to come with DisplayPort output, but I haven't seen one of those in a long time. These cards are definitely GPUs in that they can handle graphical workloads, but I don't think anyone is trying to make them work for video games or the like.

daemonologist

I heard a rumor somewhere that Stadia ran on MI25s - not sure if that's true but certainly there have been a lot of them floating around on eBay in the past year.

slavik81

IIRC the graphics hardware was dropped in CDNA architectures (MI100, MI200, and MI300). For example, I don't think there are texture units.

brucethemoose2

AMD still sells FirePros (e.g. the W7900) like that.

Dylan16807

That's just a variant of the RX 7900. It's an entirely different architecture.

ooterness

Does this mean some of the AMD-Xilinx FPGAs will start stacking DRAM on-chip?

wtallis

They've had FPGAs with HBM DRAM for several years.

redder23

I don't have much clue about chips, but when AMD bought ATI I thought they would come out with some superchip that combines CPU and GPU into one.

This is just for AI? So not for gaming PCs?

kube-system

> when AMD bought ATI I thought they will come out with some superchip that combines CPU and GPU into one.

That is exactly what did happen. The year was 2006 and it was called the AMD Fusion project. AMD launched the chips, which they call "APUs", in 2011.

Nowadays, this configuration is common in both AMD and Intel "CPUs".

pstuart

I imagine they'll go for the big money stuff first and it will trickle back to gamers.

aappleby

That is literally what drives the PS5 and XSX.

fancyfredbot

"MI300 [is] three slices of silicon high and can sling as much as 17 terabytes of data vertically between those slices"

After you transfer 17 terabytes of data, it is worn out and you can't use it any more

opwieurposiu

Once the ones wear down below 0.5 they start to look like zeros.

fancyfredbot

Exactly. Moving vertically against gravity gradually squashes the bits down.

metabagel

Per second

fancyfredbot

Probably :-)

The literal interpretation of the article is funnier though.


29athrowaway

If not CUDA, then what can you use with those?

bayindirh

HPC admin here.

ROCm, which is AMD's equivalent of CUDA. The thing is you don't have to directly interface with CUDA or ROCm. Once the framework you want to use supports these, you're done.

AMD is consistently getting used in TOP500 machines, and this gives them insane amounts of money to improve ROCm. CUDA has an ecosystem and hardware moat, not because they're vastly superior, but because AMD prioritized processors first and NVIDIA played dirty (OpenCL support and performance shenanigans, anyone?).

This moat is bound to be eaten away by both Intel and AMD, and compute will be commoditized. NVIDIA foresaw this and bought Mellanox and wanted ARM to be a complete black box, but it didn't work out.

The Ethernet consortium woke up, agitated by the fact that the only ultra-low-latency fabric provider is no longer independent, so they've started to build alternatives to the InfiniBand ecosystem.

Interesting times are ahead.

Also there's oneAPI, but I'm not very knowledgeable about it. It's a super-API which can target many compute platforms, like OpenCL, but takes a different approach. It compiles to platform-native artifacts (CUDA/ROCm/Intel/x86/custom, etc.), IIRC.

formerly_proven

TOP500 has had all sorts of "odd" systems near the top. Unusual or custom architectures and fabrics are far more viable for "this is our nation's nukeputer" or "this is going to run the nation's weather forecasting model" type of systems compared to e.g. a commercial HPC system where you wanna run all sorts of commercial/proprietary software. AMD being successful in the former niche doesn't mean their software stack is viable for the latter niche.

bayindirh

The systems running AMD cards are not all "Nukeputers", and not all "Nukeputers" and "Fusionputers" run on custom silicon.

When you look at the latest list [0], the #1 system, Frontier, is a Nukeputer, but it's not only a Nukeputer. #5, LUMI, is definitely not a Nukeputer. They are very close to us; we work together with them under a project (we're equals: they operate under a different consortium, and we have stakes in another computer which is also very famous, also in the top 10 of TOP500, and not a Nukeputer). We also have our smaller systems in our own datacenter.

We operate mostly in the long tail of science, which means we see heavy use of popular software packages, and many special software packages optimized for these long-tail problems. ROCm was invisible in this niche before, but with Frontier and LUMI, and with AMD's announcements, this has started to change very quickly.

ROCm libraries are more open than their CUDA counterparts, and they have started to land in mainstream distribution repositories directly. This is important for our niche. AMD is sponsoring integration of its libraries into popular packages, and it has started to pay off already.

Also, independent programmers have started to optimize LLM training routines for AMD cards, getting 99% of the performance of the NVIDIA counterparts with way less power consumption.

As a result, AMD is already in a much more visible and capable position compared to last year.

[0]: https://top500.org/lists/top500/list/2023/11/

justinclift

> they've started to build alternatives to the InfiniBand ecosystem.

Cool, that sounds interesting. Anything you can point at? :)

viewtransform

Discussion on ultrafast ethernet and partners at AMD's December presentation. https://youtu.be/tfSZqjxsr0M?t=5198

timschmidt

UltraEthernet

neverrroot

Everything is possible, especially what they have in mind, where they will do a custom implementation. CUDA is important for the existing ecosystem, but that doesn’t make it the only show in town.

doikor

Simulating nuclear explosions, effects of decay on nuclear weapon stockpiles, etc

(this is literally what the El Capitan supercomputer mentioned in the article is for)

neverrroot

Wrong reply, please ignore.

hutzlibu

You can also delete your wrong comments (within 2h, I think).

anotherhue

CUDA is at the API level (ish, I know). There's plenty of room for new silicon that can expose different APIs.

Compare to switching from x86 to ARM.

ssijak

why no CUDA rosetta?

c0n5pir4cy

AMD has released HIP and a tool called HIPIFY which kind of behaves like this but at the source level¹. Rather than try and just translate CUDA to work on AMD compute they are more focused on higher level tooling.

Currently they seem to have a particular focus on AI frameworks and tools like PyTorch/Tensorflow/ONNX. They have sponsored and helped with a lot of PyTorch development for example, so PyTorch support for AMD is much better than it was this time last year².

¹(https://github.com/ROCm/HIP)

²(https://pytorch.org/blog/experience-power-pytorch-2.0/)

JonChesterfield

Nvidia could build that but doesn't want it to exist. Outside of Nvidia you'd have to reverse engineer their machine code, which would be a massive undertaking. The ISA is published by AMD and Intel; you could build tooling from the docs alone if you wish.

DeathArrow

Games? Rendering? Transcoding video?

sylware

Video is getting a lot of dedicated ASIC blocks (look at the latest VPE block in AMD GPUs).

I guess those chips are for the movie/video industry, online or not. For the "consumer", the CPU is already very efficient, and I don't think we would save an interesting amount of battery from a "real usage" perspective. I may be very wrong, but I don't watch hours and hours of ultra-high-quality video in a row on a small screen while off the AC plug, and the battery becomes unusable within a few years anyway.

imtringued

Encoding video in real time is expensive. You can run your blog on a 5€/month machine, but you'll most likely need ten times that money for a single real-time stream using nothing but software decoding, rendering, and encoding. If you think "hey, I'm going to use a GPU for this", then congratulations, you've increased your cloud bill by a factor of 20.
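A quick back-of-envelope using the comment's own figures (the 5€/month baseline and the 10x/20x multipliers are the commenter's ballpark numbers, not benchmarked prices, and the 20x is assumed to apply on top of the software-encoding cost):

```python
# Rough monthly-cost comparison using the ballpark multipliers from the
# comment above; all figures are illustrative, not real cloud prices.
BLOG_EUR = 5            # cheap VPS that can serve a blog
SW_ENCODE_FACTOR = 10   # ~10x for one real-time software-encoded stream
GPU_FACTOR = 20         # ~20x again if you move the stream to a cloud GPU

sw_stream_eur = BLOG_EUR * SW_ENCODE_FACTOR
gpu_stream_eur = sw_stream_eur * GPU_FACTOR

print(f"blog only:               ~{BLOG_EUR} EUR/month")
print(f"software-encoded stream: ~{sw_stream_eur} EUR/month")
print(f"GPU-encoded stream:      ~{gpu_stream_eur} EUR/month")
```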


AMD's Next GPU Is a 3D-Integrated Superchip - Hacker News