
PaulHoule

I wonder what Charlie Demerjian is going to say about this.

It is an awful lot of registers for a feature that few programs may use. There's a risk that bfloat16 is a fad and five years from now it is hardly used at all. At best it ends up powering a full-stack perception-and-synthesis feature about as good as

https://en.wikipedia.org/wiki/Intel_Quick_Sync_Video

Most of all it drives me nuts that Intel is going down the fixed-length SIMD instruction route (going wider) where the instructions you write are specific to the structures of the processor you run on. Machines like the ILLIAC, Cray, and the vector processor for the 3090 mainframe would automatically chunk the work so that you didn't need to rewrite your code for a future wider model.
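That automatic chunking (strip-mining against a machine-reported vector length, as on the Cray or SVE) can be sketched in a few lines of Python; `vector_length` here is a hypothetical stand-in for the hardware's vector-length register, not a real API:

```python
def vector_length():
    # Hypothetical stand-in for the machine's vector-length register;
    # a wider future model would simply report a larger value here.
    return 8

def scale(xs, a):
    """Strip-mined loop: work is processed in VL-sized chunks, so the
    code never hard-codes the vector width of any one processor."""
    out = []
    vl = vector_length()
    for i in range(0, len(xs), vl):
        chunk = xs[i:i + vl]             # one "vector operation" per chunk
        out.extend(x * a for x in chunk)
    return out
```

On a real vector machine the inner loop body would be a single instruction operating on `vl` lanes, with the final short chunk handled by the length register rather than by scalar cleanup code.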

ARM is doing it the right way:

https://alastairreid.github.io/papers/sve-ieee-micro-2017.pd...

Veedrac

It's a controversial opinion I've not heard anyone else express, but IMO variable-width vector instructions are the wrong approach, since they're optimizing for the easy cases and solving problems that don't matter much, like instruction counts.

Although x86's SIMD extensions have a bunch of crippling issues, fixed-width instructions are fundamentally more general because they work both for vectorizable length-N loops, and for the many, many cases where you have fixed-size units of work that you can perform in parallel. If you design your architecture to depend on loops to extract full performance, you cut yourself off from many productive uses of vector instructions outside of loops, and make it more difficult to juggle cases where you have more than one loop dimension.
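A toy sketch of the fixed-unit-of-work case (an illustrative example, not one from the comment above): blending one RGBA pixel is exactly four lanes of arithmetic with no surrounding loop, which maps naturally onto a single fixed-width 4×float32 register but gives a length-agnostic vector loop nothing to iterate over:

```python
def blend_rgba(dst, src, alpha):
    # Four independent lane operations: on a fixed-width SIMD machine
    # this is one multiply-add over a single 128-bit register, no loop.
    return tuple(d + alpha * (s - d) for d, s in zip(dst, src))
```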

Upgrading to new architectures with longer supported vector lengths is best left to recompilation and variable-width vector libraries.

ritter2a

Wouldn't be the first of Intel's ISA extensions to be unsuccessful because of its limitations; look at MPX: https://intel-mpx.github.io/

zamalek

> It is an awful lot of registers for a feature that few programs may use

I was just wondering what you could do with 8KiB of register memory, if you ignored the intent of the registers outright (assuming you can load/store to the classical registers from these registers).


api

Cache expanded cryptographic keys and tables entirely in registers for one...

fuoqi

AFAIK SIMD accelerated cipher implementations already mostly do it. For example, AES-NI based implementations of AES-128 and AES-192 keep round keys in XMM registers without reloading them during processing of blocks.

rbanffy

Intel reasons that it's so massive that any way it decides to go is The Right Way and people will accommodate whatever it does. So far, this has mostly been true - we have code that takes different branches depending on the ISA extensions available, probably all the way down to x87.

jeffbee

I don't think that's how these features get developed. Someone who orders CPUs by the cubic meter comes to Intel and asks for a bfloat16 unit. Six years later, they ship one. I know for a fact this is how BMI came to exist.

rbanffy

Indeed. We should never think that we are Intel's (or AMD's or IBM's) customers. They have a very short list: Dell, Lenovo, HP and, more recently, Google, AWS and Facebook; some of these are procuring directly from TSMC.


api

Intel is going to start thrashing around now, adding a million features to try to beat ARM and AMD on various microbenchmarks and special use cases. Meanwhile ARM and AMD will keep winning on throughput, price/performance, and (for ARM particularly) performance/watt.

Any win all these features bring can also be achieved with ARM or Zen by adding more cores, with the exception of those few cases where you have huge computational tasks that cannot be efficiently parallelized and where there are few discrete jobs to allow for coarse grained (job/task) parallelization. There are not many of these.

Meanwhile all these features are going to make Intel chips even more complex, making bugs more likely and making iteration more costly.

My read is that Intel is shooting for maximum possible single threaded performance because they can't compete on power efficiency or many-core. They can't compete in those areas because their process nodes are not as small as what TSMC can offer, and both (most) ARM chips and AMD are using TSMC and fabricating at 7nm and soon 5nm. (Yes I know nm node sizes are no longer directly comparable, but they are ahead of Intel and likely to stay ahead unless Intel can really push hard on fab engineering.)

TinkersW

It sounded interesting until I saw that the only float type it supports is Brain float :(

cesaref

I'm kind of interested to see what bfloat16 sounds like (for audio DSP). I'd expect it to be good enough for a large number of algorithms so long as they are properly stable, and if we get decent performance and reduced power use, I'm all for that!

klodolph

I’m sure you could come up with an application for it, but if you want audio output at some point, half-float sounds like quite a challenge.

- The -66dB noise floor is pretty bad, and it accumulates at every step.

- 11 bits is probably not enough for filter coefficients. So your filters would likely be running with single precision floats, at least. Even low-cost DSP chips tend to give you a big chunk of bits for your filters.

- Naïve oscillator designs would accumulate a lot of error. Back-of-the-envelope calculation: if you wanted an oscillator at C4, you’d likely be around a quarter tone sharp or flat unless you ran the oscillator at higher precision.
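These figures are easy to sanity-check with stdlib Python by emulating a reduced significand (8 significant bits for bfloat16, 11 for IEEE half; exponent range ignored) and measuring the quantization noise on a sine wave. `round_sig` and the test setup below are a rough sketch, not a reference implementation:

```python
import math, struct

def round_sig(x, sig_bits):
    # Round a float to sig_bits significant binary digits via its
    # float32 encoding (round-to-nearest-even on the dropped bits).
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    drop = 24 - sig_bits                   # float32 keeps 24 significant bits
    half = 1 << (drop - 1)
    bits += (half - 1) + ((bits >> drop) & 1)
    bits &= ~((1 << drop) - 1) & 0xFFFFFFFF
    return struct.unpack('>f', struct.pack('>I', bits))[0]

def noise_floor_db(sig_bits, freq=261.63, sr=48000.0, n=4800):
    # Quantization noise power relative to signal, for a C4 sine
    xs = [math.sin(2 * math.pi * freq * t / sr) for t in range(n)]
    err = sum((x - round_sig(x, sig_bits)) ** 2 for x in xs)
    ref = sum(x * x for x in xs)
    return 10 * math.log10(err / ref)

print(noise_floor_db(11))  # half-float; the -66 dB figure is the worst-case per-sample bound
print(noise_floor_db(8))   # bfloat16: about 18 dB worse (3 fewer significand bits)
```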

I’m definitely of the mind that bit depth is overrated in music: 16 bits is great for mastered music and simple tweaks. By contrast, from my experience writing DSP code, it often makes your code simpler and faster to run at higher depths and sample rates, and then convert to e.g. 16 bit as the very last step. The problem is that squeezing good output from low precision or low sample rates requires more complicated and slower algorithms.

rbanffy

Probably not as crisp as 16-bit integers: the mantissa is 7 bits, so 8 bits with the sign. Clever use of exponents may give some additional range, but I wouldn't be too optimistic.

The good thing is that, like any FP number, it can represent a very large range of values, so I'd expect percussion and other things with very high-frequency transients to sound nice. My bet is it'll sound "colorful". IEEE float16 should sound better.

But I'm no expert, and I don't have the time to write something that forces a good high-dynamic-range track to be rounded to the nearest bfloat16 and then expanded back. Also, I don't have gear good enough to hear anything better than CD-grade audio.

innocenat

Those tile registers are crazy. But I wonder how long it will actually take to be performant, considering the present problems with AVX512 switching.

google234123

AVX512 is already performant.

jfkebwjsbx

Only in a subset of cases, which is the problem: you cannot simply always use it like the previous extensions.

klodolph

None of the previous extensions could be used blindly, either. It was a while before people figured out how to use MMX or SSE well, and people still often find that the scalar version of an algorithm beats their vector version.

unwind

Meta: I'm not a native speaker, but that lonely 'W' in the title really irks me. Suggested alternate title would be something like "Intel's Sapphire Rapid debuts x86 Advanced Matrix Extension" (59 chars).

beojan

It needs a '/' following it (and should probably be a lowercase 'w'). 'w/' is quite a common abbreviation for 'with' though (with 'w/o' for 'without').

Presumably there's a character limit on HN titles, and '/' isn't allowed either?

messe

Or how about removing “the ” from the start of the sentence and just writing “with”?

stefan_

Or we just replace the w/ with "in".

waynesonfire

Where are these extensions being used? I didn't look hard, but are there open source libraries/compilers that will take advantage of these?

It's sort of amazing the amount of performance you can squeeze out when you fix your OS and CPU architecture.

There are so many extensions: https://software.intel.com/sites/landingpage/IntrinsicsGuide -- are we supposed to write our own libraries to leverage these, or do we need to file tickets with our favorite compilers for them to develop these optimizations?

Oh, one more question: how do these overlap with AMD?

mratsim

BLAS libraries, oneDNN, OpenCV, Eigen, Tensorflow, PyTorch, LLVM MLIR, ...

AMD usually implements them but with a couple of years of delay.

For example, AVX512 is not implemented, and we had to wait for Ryzen 3 to have the same AVX capabilities as Intel (2 AVX units per core instead of one).

sradman

OK, this is a new Intel SIMD-like instruction set: AVX for vectors, now AMX for matrices. I guess this is an alternative to Nvidia GPUs, Google TPUs, Apple's Neural Engine, etc.

nabla9

It's not a general alternative. It's good for some subset of inference tasks.

Intel will deploy its own GPUs some time in the future.

deltasquared

I am wondering why I would want this on a CPU when this kind of processing is already available on a GPU.

chrisseaton

Where is your data? Is it in the CPU cache or is it in the GPU? Computing where your data is, rather than moving your data to where your compute is, can often be the best option.

emcq

For small networks it's often a win to stay on-chip, at least on the power side. But if you do need to go off-chip for memory, it's hard to beat the memory bandwidth you have on a GPU.

mratsim

Looks very interesting, but... AVX512 is already problematic cooling-wise; this seems even worse.

im3w1l

How big of an issue is context switching in the middle of a sequence of AMX operations?

jcranmer

The AMX extensions drop 2 more components into XSAVE: XTILECFG (which is 64 bytes) and XTILEDATA (8192 bytes).

Interestingly, there does seem to be a new extension (see §3.2.6 of https://software.intel.com/content/www/us/en/develop/downloa...) that, on first glance, looks to be a per-thread enable/disable bit for using these registers, which suggests that an OS could make a process-level capability to enable/disable AMX and thereby not bother saving these registers on context switches if it's switching to a process without AMX.
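The context-switch cost is easy to put in perspective with a little arithmetic (the 8-tile × 16-row × 64-byte geometry is the maximum tile configuration; the ZMM figure is included purely for comparison):

```python
# Bytes of new architectural state AMX adds to each XSAVE/XRSTOR
xtilecfg = 64                 # tile configuration register
xtiledata = 8 * 16 * 64       # 8 tiles x 16 rows x 64 bytes per row
assert xtiledata == 8192      # matches the XTILEDATA component size

zmm_file = 32 * 64            # all 32 512-bit ZMM registers, for scale
print(xtilecfg + xtiledata)   # 8256 bytes, roughly 4x the ZMM register file
```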

snvzz

In other news, bloated CISC architecture becomes further bloated.

I'm looking forward to RISC-V's V extension, which is due to become standard around September. Unlike AVX512 and friends, this one is vector size agnostic.

monocasa

This isn't built on the AVX512-style register file either, or even on vector registers at all. It's a set of huge matrix registers, so it's pretty orthogonal to both AVX512 and RV-V.

jabl

I haven't followed the RV-V extension in a while, but IIRC it has acquired features to configure the vector registers as matrix tiles, and matrix multiplication instructions.

monocasa

It hasn't as of the 0.9 draft, but maybe there's something new I don't know about.
