Dissecting the Apple M1 GPU, Part III

Daily Digest email

Get the top HN stories in your inbox every day.

gfxgirl

That's very exciting!

A nit:

> For example, I have not encountered hardware for reading vertex attributes or uniform buffer objects. The OpenGL and Vulkan specifications assume dedicated hardware for each, so what’s the catch?

That is not my understanding of those specs (as someone that's written graphics drivers). Uniform Buffer Objects are not a "hardware" thing. They're just a way to communicate uniforms faster than one uniform per API call. What happens on the backend is undefined by those specs and is not remotely tied to some hardware implementation. Vertex Attributes might have been a hardware thing long ago but. I'm pretty sure there are older references but this 9yr old 2012 book already talks about GPUs that don't have hardware based vertex attributes.

https://xeolabs.com/pdfs/OpenGLInsights.pdf chapter 21

M277

This is completely off topic, I am very sorry, but given your comment and your username -- are there any learning resources you would particularly recommend for graphics programming? I have collected a few already (however they are all beginner level), and was wondering if there are hidden gems I missed.

dlivingston

An excellent starting point for anyone interested in low-level graphics programming is Sokolov’s tinyraytracer [0]. It’s also a great way to learn a new language (work through the code while porting it to $DIFFERENT_LANGUAGE).

[0]: https://github.com/ssloy/tinyraytracer

majjam

not op, but here are a few ive collected:

https://web.archive.org/web/20130517222528/http://www.arcsyn...

https://news.ycombinator.com/item?id=26017086

https://news.ycombinator.com/item?id=18840859

nextaccountic

Here's a free book on physically based rendering

https://www.pbr-book.org/

sharpneli

I agree with the nit.

> Simply put – Apple doesn’t need to care about Vulkan or OpenGL performance.

OpenGL and Vulkan allow an implementer to more easily make such specialized HW. But it doesn't assume it at all in any other way. If your HW is fast enough there is absolutely no need to implement specialized block for it without any performance penalty.

It's trivial to implement things like input assembler without specific HW, just issue loads. But it would be massive pain to go the other way around. Try to sniff what loads fit the pattern that could be tossed into fixed function input assembler. That's a no go.

This is the right way around to do things. As there is no performance penalty for "emulating" it, because there is nothing to emulate in the end.

lights0123

This is great work—I'm glad to see this being tackled with such speed.

From the Phoronix comments on this post[0]:

> I have an idea. Why not support exclusively Vulkan, and then do the rest using Zink (that keeps getting faster and faster)?

> This way you could finish the driver in one year or two.

(For context: Zink is an OpenGL to Vulkan translator integrated into Mesa)

I had the same thought in my mind—Zink is 95% the speed of Intel's OpenGL driver[1], so why not completely ignore anything but Vulkan? On the Windows side, dxvk (DirectX to Vulkan) already is much faster (in most cases) than Microsoft's DX9-11 implementation, so it's completely feasible that Zink could become faster than most vendors' OpenGL implementation.

I have no knowledge of low-level graphics, so I don't know the ease of implementing the two APIs. I could envision, however, that because this GPU was never designed for OpenGL, there may be some small optimizations that could be made if Vulkan was skipped.

[0]: https://www.phoronix.com/forums/forum/phoronix/latest-phoron...

[1]: https://www.phoronix.com/scan.php?page=news_item&px=Zink-95-...

stefan_

This is very much abstracted away in Mesa already, particularly if you use NIR and your driver lives in Gallium.

account42

Not really, none of the Vulkan drivers in Mesa are built on top of Gallium.

devit

Yes, it works the other way: Zink is the Gallium->Vulkan translation layer, while the main Mesa code is effectively an OpenGL->Gallium translation layer.

gsnedders

This is the whole point of Gallium, right?

Like, the classic "Intel OpenGL driver" in Mesa (i.e., i965) doesn't use Gallium and NIR, and hence has to implement each graphics API itself, whereas their modern "Iris" driver using Gallium presumably just handles NIR -> hardware?

Or does the Gallium approach still require some knowledge of higher-level constructs and some knowledge of things above NIR?

e_proxus

Does that imply a performance hit, or is it roughly equivalent to targeting Vulkan “directly”?

vetinari

Think of it as HAL, on top of which state trackers implement their chosen APIs. OpenGL is one of them, there's also Gallium Nine that implement DirectX 9.

raphlinus

This is top-notch and very impressive work. I'm currently in the middle of tuning performance of piet-gpu for the Pixel 4[1], and I find myself relying on similar open resources from the Freedreno project. When the time comes to get this running efficiently on M1, having detailed knowledge of the hardware will be similarly invaluable - just the info on registers and occupancy is an important start.

Is there a way to support the work?

[1]: https://github.com/linebender/piet-gpu/issues/83

marcan_42

Alyssa isn't personally taking donations for her work on this project, but she suggests you donate to the Autistic Self Advocacy Network or the Software Freedom Conservancy instead :)

raphlinus

Thanks. I've made a substantial donation to ASAN listed as being in her honor. I'm also a fan of SFC and plan to continue to support them.

vesrah

Previous part discussions, for everyone else that was interested:

Part 1: https://news.ycombinator.com/item?id=25673631

Part 2: https://news.ycombinator.com/item?id=25873887

ogre_codes

So much for "It'll take years before we get the GPU working". Obviously this is far from a full implementation but seems like progress has been quick. Hopefully the power management stuff will be equally quick.

citrusui

Also curious how far progress is on reversing the Apple NVMe SSDs. Last I heard, Linux couldn't properly install itself on modern Macs, only do liveboot.

marcan_42

Apple NVMe SSDs have worked fine for years in mainline. This is a myth that won't die.

The Linux driver required two new quirks (different queue entry size, and an issue with using multiple queues IIRC). That's it. That's all it was.

On the M1, NVMe is not PCIe but rather a platform device, which requires abstracting out the bus from the driver (not hard); Arnd already has a prototype implementation of this and I'm going to work on it next.

makomk

As I understand it, Apple's NVMe were pretty wildly non-standards-compliant - they assume that tags are allocated to commands in the same way as Apple's driver does, including crashing if you use the same tag at the same time in both the admin and IO queues and only accepting a limited range of tags, and as you say they use a totally different queue entry size from the one required by the standard. Also, apparently interrupts didn't work properly or something.

Oh, and it looks like the fixes only made it into mainline Linux in 5.4, less than a year and a half ago, and from there it would've taken some time to reach distros...

vetinari

Maybe I'm remembering it wrong, but wasn't there an issue with a secret handshake, and if the system didn't do it in a certain time after the boot, the drive disappeared? I.e. some kind of T2-based security?

CyberRabbi

Interesting... what bus does it use if not PCIe? At the driver level I’m guessing it just dumps NVMe packets onto shared memory and twiddle some sort of M1-specific hardware register?

jacquesm

Nonsense. Last you heard was 5 years ago or so. And even then it could be done, just a bit more work rather than a default install.

mrweasel

There has to be some Apple engineers reading this wondering which feature she’ll find next, and hopefully with a big smile when she get something right.

Firadeoclus

This is some great work!

One point I disagree with:

>What’s less obvious is that we can infer the size of the machine’s register file. On one hand, if 256 registers are used, the machine can still support 384 threads, so the register file must be at least 256 half-words * 2 bytes per half-word * 384 threads = 192 KiB large. Likewise, to support 1024 threads at 104 registers requires at least 104 * 2 * 1024 = 208 KiB. If the file were any bigger, we would expect more threads to be possible at higher pressure, so we guess each threadgroup has exactly 208 KiB in its register file.

>The story does not end there. From Apple’s public specifications, the M1 GPU supports 24576 = 1024 * 24 simultaneous threads. Since the table shows a maximum of 1024 threads per threadgroup, we infer 24 threadgroups may execute in parallel across the chip, each with its own register file. Putting it together, the GPU has 208 KiB * 24 = 4.875 MiB of register file! This size puts it in league with desktop GPUs.

I don't think this is quite right. To compare it to Nvidia GPUs, for example, a Volta V100 has 80 Shader Multiprocessors (SM) each having a 256 KiB register file (65536 32-bit wide registers, [1]). The maximum number of resident threads per SM is 2048, the maximum number of threads per thread block is 1024. While a single thread block _can_ use the entire register file (64 registers per thread * 1024 threads per block), this is rare, and it is then no longer possible to reach the maximum number of resident threads. To reach 2048 threads on an SM requires the threads to use no more than 32 registers on average, and two or more thread blocks to share the SM's register file.

Similarly, the M1 GPU may support 24576 simultaneous threads, yet there is no guarantee it can do so while each thread uses 104 registers.

[1] https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.... : table 15, compute capabilities 7.0

glial

I'm not really a developer so maybe I'm just not understanding something, but why in the world isn't Apple making it easier for people to optimize for the M1? I would think it's in their best interest to help developers make the best software possible, by sharing information about how to leverage the architecture. It's bizarre to me that the best sources of information are posts like this.

dev_tty01

Apple is helping developers make the best software possible. They have a full suite of incredibly optimized and performant frameworks for compute, graphics, signal processing, machine learning, rendering, animation, networking, etc... That is all available via free download for writing MacOS and iOS apps.

Remember, they are selling computers as a combination of hardware and software. They are not selling processors so they are of course not supporting driver and other low level software for other OSs. That's a bummer if you are into other OSs, but it is not part of their business model so it should not be surprising.

Other OSs are supported via their virtualization framework. My limited tests show about a 10% performance penalty. Not too bad right out of the gate with a new system.

That being said, Ms. Rosenzweig is doing some incredible and interesting work. Really enjoying the series.

glial

That's helpful, thank you!

saagarjha

In a few months Apple is going to release a successor to the M1 processor, and then maybe in a year or two another revision after that. Apple would like your code to be optimized for that processor as well, in addition to the M1. The way Apple does this is by wrapping their hardware in a high-level API that knows how to use the hardware it is running on, rather than exposing architectural details and having developers hardcode assumptions into their code until the end of time.

snovv_crash

Abstractions are both leaky and expensive. There are a lot of things that could have much better performance if they had access to the lower level APIs.

astrange

Metal is a good fit for the M1 GPU (since the GPU was essentially designed to run it). There isn't a need for a lower level API than Metal.

Most people will not end up writing code in the optimal way though, since they also want to support discrete GPUs with their own VRAM and those have totally different memory management.

nash

Apple's official graphics API is Metal. There is plenty of documentation for that. Apple considers Metal a commercial advantage over OpenGL / Vulkan. That is; they want developers to develop against Metal.

dylan604

Why do you think they are not? They are helping developers developing for the macOS platform that they own,develop,support. They are not responsible for 3rd part OSes and the development on those platforms. Why would you expect anything else based on their history and track record for closed/walled ecosystems?

judge2020

Believe it or not, you're just purchasing rare earth materials packed very tightly into a nice box when you buy a Mac or an iPhone. The operating system is paid for on the backend by developers giving up 30% of revenue that goes through the App Store. The M1 is turning the Mac into an iPhone in exchange for an extremely fast processor and insane battery life, so they're not interested in helping you bypass the technical challenges of running other operating systems on their hardware (much like how they don't help you jailbreak iOS in order to help you install Cydia or third-party app stores).

marcan_42

You are drawing a false equivalency between the Mac and iPhones. iOS devices are deliberately locked so that running your own low-level software on them is not supposed to be possible, and requires breaking Apple's security. If they make no mistakes, doing it is completely impractical (the cost of an attack outside their attack model is greater than the price of the device).

macOS devices are not, and Apple invested significant development effort into allowing third-party kernels on M1 Macs. The situation is very different. They are not actively supporting the development of third-party OSes, but they are actively supporting their existence. They built an entire boot policy system to allow not just this, but insecure/non-Apple-signed OSes to coexist and dual-boot on the same device next to a full-secure blessed macOS with all of their DRM stuff enabled, which is something not even open Android devices do.

You can triple-boot a macOS capable of running iOS apps (only possible with secureboot enabled), a macOS running unsigned kernel modules or even your own XNU kernel build, and Linux on the same M1 Mac.

refulgentis

This pales in comparison to the rhetoric when OS X came out - you see that rhetoric survive today at opensource.apple.com, it's there, just, not in the spirit of the freedom that was promised.

CyberRabbi

I acknowledge that it’s possible to run unsigned code at “ring 0” on M1 MacBooks but the existence of the DRM restrictions leads me to believe that certain DRM-relevant hardware is not accessible unless a signed OS is running. I’m not exactly sure how the attestation works from software to hardware but I have to guess that it exists, otherwise it would be relatively trivial to load a kext that enables 4K DRM with SIP disabled.

One may consider that not important but I think it’s important to at least note the hardware of these machines are not fully under user software control.

Then again, I don’t think the iOS support requires any hardware so I’m not sure why someone hasn’t released a mod (requiring a kext or not) that enables iOS app loading with SIP disabled.

Rhedox

TBF they explicitly implemented a way to boot other kernels.

'turning the Mac into an iPhone' suggest they are locking it down, which isn't entirely true.

They could do more to help driver development though.

anentropic

> If changing fixed-function attribute state can affect the shader, the compiler could be invoked at inopportune times during random OpenGL calls. Here, Apple has another trick: Metal requires the layout of vertex attributes to be specified when the pipeline is created, allowing the compiler to specialize formats at no additional cost. The OpenGL driver pays the price of the design decision; Metal is exempt from shader recompile tax.

I've just started playing with OpenGL recently and I don't know what "changing fixed-function attribute state can affect the shader" means.

Can anyone give an example of what kind of operations in the shader code might cause these unnecessary recompiles?

gmueckl

OpenGL has a model of the hardware pipeline that is quite old. A lot of things that are expressed as OpenGL state are now actually implemented in software as part of the final compiled shader on the GPU. For example, GLSL code does not define the data format in which vertex attributes are stored in their buffers. This is set when providing the attribute pointers. The driver then has to put an appropriate decoding sequence for the buffer into the shader machine code. Similar things happen for fragment shader outputs and blending these days. This can lead to situations where you're in the middle of a frame and perform a state change that pulls a rug from under the shader instances that that driver created for you so far. So the driver has to go off and rewrite and reupload shader code for you before the actually requested command can be run.

More modern interfaces now force you to clump a lot of state together into pretty big immutable state objects (e.g. pipeline objects) so that the driver has to deal with fewer surprises at inopportune times.

anentropic

Thanks for the more elaborated explanation.

I think I understand now. Ideally the GLSL shader code is compiled once and sent to the GPU and used as-is to render many frames.

But if you use the stateful OpenGL APIs to send instructions from the CPU side during rendering you can invalidate the shader code that was compiled.

It had not occurred to me because the library I am using makes it difficult to do that, encouraging setting the state up front and running the shaders against the buffers as a single "render" call.

undefined

[deleted]

WhyNotHugo

I’m very amused by the fact that this hardwares does not have hardware for some specialised operations that competitors do.

From the article, it would seem that compensating via software was fine (performance wise). Apple’s approach seems to be seems to break the norm in fields where the norm has proven to be unnecessary complexity. Which open tip room for just more raw performance.

hrydgard

AMD has removed exactly the same hardware, and NVIDIA doesn't benefit much from the specialized vertex fetch hardware anymore, even specialized uniform buffer stuff is getting close to marginal benefit. Hardly unique to Apple.

kevingadd

Lots of what they do on the M1 is stuff that PowerVR was doing before (and apple's older GPUs were based on PowerVR's via a licensing deal). There are other vendors who have also ditched some of this stuff.

It's a smart move for Apple to double down on pruning hw features you don't think you need, but sadly you can only go all-in on it if you control the entire ecosystem.

hishnash

As described in the article by not having fixed function units apple can put more regular floating point math units. These units can be used in many, for example when doing compute tasks or when doing 3d tasks that do not make use of those very focused use cases. In the end if it is a big perf hit for those approaches devs will just use different solutions as they need to developer explicitly for metal anyway.

mrpippy

In case anyone isn’t aware, on Apple Silicon macOS (and recent iOS I believe), OpenGL is implemented on top of Metal.

iso8859-1

Is Collabora paying Alyssa Rosenzweig for this work?

lyssa

No, this is purely a hobby project undertaken in my spare time. (The email addresses on the git commits are force-of-habit, apologies for the confusion.)

MegaDeKay

That you are doing this "on the side" makes your accomplishments even more incredible. Keep up the amazing work!

wslack

Well done!

thechao

Stop talking to those backstabbin’ compilerfolk; when you’re ready to join the dark side, come to us and make GPUs.

Daily Digest email

Get the top HN stories in your inbox every day.