Graviton2 and Graviton3


December 4, 2021


> most workloads could actually run more efficiently if they had more memory bandwidth and lower latency access to memory

Turns out memory access speed is more or less the entire game for everything except scientific computing or insanely optimized code. In the real world, CPU frequency seems to matter much less than DRAM timings, for example, in everything but extremely well engineered games. It'll be interesting to learn (if we ever do) how much of the "real-world" 25% performance gain is solely due to DDR5.

I remember getting my AMD K8 Opteron around 2003 or 2004 with the first on-die memory controller. Absolutely demolished Intel chips at the time in non-synthetic benchmarks.


> everything except scientific computing or insanely optimized code

For insanely unoptimized code, such as accidentally ending up writing something compute intensive in pure Python, it's very plausible for it to be compute constrained -- but less because of the hardware and more because 99% or 99.9% of the operations you're asking the CPU to perform are effectively waste.


I'm skeptical of this. I would expect many of those wasted instructions to be expensive memory instructions. I'm pretty sure regular Python code allocates a lot and the interpreter spends lots of time chasing after memory references.
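As a rough illustration of the allocation point, the sketch below compares the approximate footprint of a plain Python list of integers (a pointer array plus one heap object per element) with a packed `array` of 64-bit values. It assumes CPython. The helper names `boxed_bytes` and `packed_bytes` are made up for this example, and `sys.getsizeof` only gives an estimate.

```python
import sys
from array import array

def boxed_bytes(values):
    """Approximate footprint of a list: the pointer array plus every int object."""
    # Caveat: CPython shares small ints, so this overcounts repeated small values.
    return sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

def packed_bytes(values):
    """Footprint of the same numbers stored inline as signed 64-bit integers."""
    return sys.getsizeof(array("q", values))

if __name__ == "__main__":
    data = list(range(1000, 101000))
    print("list of int objects:", boxed_bytes(data), "bytes")
    print("array('q'):         ", packed_bytes(data), "bytes")
```

Every element of the list is reached through a pointer, so even a simple loop over it touches more memory than the packed layout would need.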


Certainly, but many "wasteful" operations are reading/writing main memory unnecessarily - and since a single trip out to memory and back can take many many CPU cycles, typically optimizing memory access time does more for "bad code" than optimizing the number of CPU cycles per second. But obviously, you're right too - faster is faster after all :)


The team that designed the original Arm CPU in 1985 came to the conclusion that bandwidth was the most important factor influencing performance - they even approached Intel for a 286 with more memory bandwidth!


In the 90’s there were people trying to solve this problem by putting a small CPU on chip with the memory and running some operations there. I routinely wonder why memory hasn’t gotten smarter over time.


Distributing work to leverage faster memory locality is hard. It's not quite what you're talking about, but consider the Cell processor used in the PS3 - the compute capability in HTPC space was supposed to be prodigious, but even having faster (streaming) access to RAM had the tradeoff of dealing with code dispatch. (It's not a perfect example, the SPE model was also just kind of a pain, but you have to think about how to get your code local to the memory you want, how to keep allocation nonrivalrous, etc. - it's a lot!)


Heh, funny you mention this, because I was _so excited_ about the CELL architecture when it was announced - exactly because of memory read/write speed. Then I tried to actually write some code for it, got depressed, and moved back to x86 for the next decade :(


I suspect some future improvement on borrow checkers will facilitate doing this to a degree. But it's likely to be one of those things that only comes into being when someone needs it very badly.


DDR5 has basically minor cpus on the RAM for improved performance. That's the main reason why they're so expensive and sometimes need active cooling


So like a (micro)SD card that has some arm cpu for wear levelling and other performance enhancements.

It sounds interesting.


Interesting - do you know what kind of performance gains DDR5 will provide for real-life workloads?

Having active cooling, making the system noisier and potentially hotter, seems like a pretty big down side.


Man that sounds so hard to program. I wonder if that's the reason.


Memory bandwidth is not a problem. Even puny 1 channel memory desktops don't usually saturate it.

It's memory latency.
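A minimal sketch of the latency point, assuming CPython and made-up helper names: chasing a chain of indices forces each load to wait for the previous one, so the walk is bound by latency rather than bandwidth. The random chain is built with Sattolo's algorithm so that it forms a single cycle. Interpreter overhead blurs the effect in Python; a C version would show it more clearly.

```python
import random
import time

def make_cycle(n, seed=0):
    """Return a permutation of range(n) that forms one cycle (Sattolo's algorithm)."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, which guarantees a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt

def make_sequential(n):
    """Return the trivial cycle 0 -> 1 -> ... -> n-1 -> 0 (prefetcher friendly)."""
    return [(i + 1) % n for i in range(n)]

def chase(nxt, steps, start=0):
    """Follow the chain; every load depends on the result of the previous one."""
    i = start
    for _ in range(steps):
        i = nxt[i]
    return i

if __name__ == "__main__":
    n = 1 << 18
    for name, nxt in (("sequential", make_sequential(n)), ("random", make_cycle(n))):
        t0 = time.perf_counter()
        chase(nxt, n)
        print(f"{name:10s} {time.perf_counter() - t0:.3f} s")
```

Both walks perform the same number of loads; only the access pattern differs, so any timing gap comes from how well caches and prefetchers can hide latency.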


For some problems, there is a choice of how to organize its data structures. One that requires random access, and another that is mostly sequential access.

The latter might take an order of magnitude more space, while still being faster.

An example of such a problem is the Cuckoo Cycle Proof-of-Work [1].
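To make the trade-off concrete with a much simpler problem than Cuckoo Cycle (this is not that algorithm), here is a sketch of two ways to find duplicated values. The function names are hypothetical. One probes a hash set at effectively random addresses; the other makes a sorted copy, which costs extra space, and then scans it sequentially. Which one wins depends on data size and hardware.

```python
def dupes_hashed(xs):
    """Hash-set approach: each lookup lands at an effectively random address."""
    seen, dupes = set(), set()
    for x in xs:
        if x in seen:
            dupes.add(x)
        else:
            seen.add(x)
    return dupes

def dupes_sorted(xs):
    """Sort-then-scan approach: extra copy of the data, mostly sequential passes."""
    ordered = sorted(xs)
    # Equal values end up adjacent, so one linear pass finds every duplicate.
    return {a for a, b in zip(ordered, ordered[1:]) if a == b}

if __name__ == "__main__":
    sample = [5, 3, 5, 1, 3, 3, 9]
    print(dupes_hashed(sample), dupes_sorted(sample))
```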



Certain instruction sequences prohibit you from fully utilizing memory bandwidth and decrease it by as much as 40-60% so this statement is not true. For example, not using store and load buffers in ways they were designed to be used will lead to subpar performance for no apparent reason


Can you elaborate? I have a side project(1) where all profilers I've used give a very muddled picture, so I'm very interested in the question of what slows down code on a modern "big" CPU with wide dispatch, a few kB of decoded ops buffer and a lot of OoO hardware.

(1) It encodes and decodes a protocol from a potentially untrusted source, so there is obviously a lot of waiting for previous results. That much is clear, however I expected profilers to show me some causal link between serial nature and slow execution, but they don't. I have tried perf, Valgrind-Callgrind and AMD μProf (because I have a Ryzen CPU on both of my main private computers). I'm not sure if the tools suck, my test cases suck, or I just don't know how to interpret the tools' results - assignment of cost to lines of code seeming highly unreliable is my main problem. Maybe the stupid things (most of optimization is about not doing stupid things, after that it gets properly hard) that these profilers are designed to catch aren't the stupid or unavoidable things my code is doing.


For well optimized code it is.


That's the same thing


I find this claim hard to believe, honestly. Could you point to examples where performance is limited by DRAM speed and not by CPU / caches? They must be applications with extremely bad design causing super low cache hits.


> They must be applications with extremely bad design causing super low cache hits

Yep, this is exactly the case - also, on systems that are busy and context-switching often and thus flushing their CPU caches more frequently. Combine the two, busy systems running loads of un-optimized code, and boom, you have described how most computers run in the real world. This is why "synthetic" benchmarks, which are well designed code running on quiet machines, more or less match up to CPU frequency exclusively.

I don't really have any good charts to show you, but you might check out an old review of the processor I mentioned as having one of the first on-die memory controllers:

The AMD Opteron 240 1.4GHz keeps up with chips close to 2x its frequency - and the memory access times are close to 1/2 as costly (ie: almost all the performance gain from 2x frequency is made up by 1/2 memory access time) - this makes sense, but remember these are well optimized applications (POV-Ray and Lightwave were extremely synthetic). In the real world, opening 10 misc Windows applications from 2003, the K8 (particularly when overclocked) was a _beast_.
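A toy model of that trade-off, with entirely made-up workload numbers and a hypothetical `run_time` helper: total time is compute time plus time stalled on memory, with no overlap assumed. If half the baseline time is memory stalls, doubling the clock and halving memory latency give the same speedup.

```python
def run_time(compute_cycles, freq_hz, mem_accesses, mem_latency_s):
    """Toy model: compute time plus memory stall time, assuming no overlap."""
    return compute_cycles / freq_hz + mem_accesses * mem_latency_s

if __name__ == "__main__":
    # Illustrative workload: 1.4e9 compute cycles and 1e7 uncached memory accesses.
    base = run_time(1.4e9, 1.4e9, 10_000_000, 100e-9)          # 1.4 GHz, 100 ns
    faster_clock = run_time(1.4e9, 2.8e9, 10_000_000, 100e-9)  # 2x frequency
    faster_memory = run_time(1.4e9, 1.4e9, 10_000_000, 50e-9)  # 1/2 latency
    print(f"baseline      {base:.2f} s")
    print(f"2x frequency  {faster_clock:.2f} s")
    print(f"1/2 latency   {faster_memory:.2f} s")
```

Real CPUs overlap computation with memory access, so this is only a first-order picture, but it shows why the balance between the two terms matters more than either number alone.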


Well, Opteron is an ancient processor, I don't think we can make any conclusion based on that. Today's server processors have enormous caches compared to Opteron.

Honestly, in the cases you mention, badly designed processes killing the CPU, I fail to see how faster RAM makes a huge difference.


This is why you dedicate entire machines to the same kind of load. All application code or all database. There was a moment where people tried to integrate—which was indeed faster but only for very limited use cases.


In data compression, inverting a BWT with large blocks or using Context Mixing to compress large blocks (which requires huge context maps). These 2 cases require a lot of random memory accesses.


Anything where the working set is larger than L3 cache.


> They must be applications with extremely bad design causing super low cache hits.

So basically any program written in a language with pointer types exclusively.


I thought I’d heard that Java VMs go to great lengths to maintain cache coherence? I’d be curious to hear from the Lisp folks because I always hear that Lisps can be surprisingly performant.


Even then, today's CPUs have enormous caches, and not all parts of a program are pointers. You can't make a crappy application much faster just because you have faster RAM.


Well, I disagree with pretty much everything in the claims.

First, most real unoptimised code faces many issues before memory bandwidth. During my PhD, the optimisation guys sat next door, and they produced beautiful plots of what limits performance for a bunch of tasks and how each optimisation they do removes an upper bound line until at last they get to some bandwidth limitation. Real code will likely have false IPC dependencies, memory latency problems due to pointer chasing, or branch mispredictions well before memory bandwidth.

Then the database workload is something I would consider insanely optimized. Most engines are in fierce performance competition. And normally they hit the memory bandwidth in the end. This probably answers why the author is not comparing to EPYC instances that have the memory bandwidth to compete with Graviton.

Then the claims that they choose not to implement SMT or to use DDR5 are both coming from their upstream providers.
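The bound-by-bound plots described in this comment sound like roofline-style diagrams, though that is an assumption; the comment does not name the method. In the basic roofline model, attainable throughput $P$ is capped either by peak compute $P_{\text{peak}}$ or by memory bandwidth $B$ multiplied by arithmetic intensity $I$ (operations per byte moved):

```latex
P = \min\left(P_{\text{peak}},\; B \cdot I\right)
```

With purely illustrative numbers, $P_{\text{peak}} = 100$ GFLOP/s and $B = 50$ GB/s, a kernel with $I = 0.5$ FLOP/byte is capped at $25$ GFLOP/s by bandwidth, while one with $I = 4$ FLOP/byte reaches the compute ceiling of $100$ GFLOP/s. Extended versions of the model add lower ceilings for effects such as latency stalls or missing vectorisation, which would match the idea of removing one upper bound at a time.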


Wouldn't SMT be a feature that you are free to use when designing your own cores? I'm assuming Amazon has an architectural license (Annapurna acquisition probably had them, this team is likely the Graviton design team at AWS). So who is the upstream provider? ARM?

And if they designed the CPU wouldn't they decide which memory controller is appropriate? Seems like AWS should get as much credit for their CPUs as Apple gets for theirs.

Bottom line for Graviton is that a lot of AWS customers rely on open source software that already works well on ARM. And the AWS customers themselves often write their code in a language that will work just as well on ARM. So AWS can offer its customers tremendous value with minimal transition pain. But sure, if you have a CPU-bound workload, it'll do better on EPYC or Xeon than Graviton.


I can't escape the feeling that AWS is taking credit for industry trends (DDR5) and Arm's decisions (Neoverse).


> I can't escape the feeling that AWS is taking credit for industry trends (DDR5) and Arm's decisions (Neoverse).

ARM is just a design. AWS brought it to market. ARM-based server processors are still rare on the ground. IIRC Equinix Metal and Oracle Cloud offer them (Ampere chips) but not GCP or Azure.

We've tested Graviton2 for data warehouse workloads and the price/performance was about 25% cheaper and 25% faster than comparable Intel-based VMs. Still crunching the numbers but that's the approximate shape of the results.
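For what those two figures could mean combined, here is a small arithmetic sketch. It assumes one particular reading of the comment, namely that "25% cheaper" means 0.75x the hourly price and "25% faster" means 1.25x the throughput; the function name is hypothetical.

```python
def relative_cost_per_job(price_ratio, speed_ratio):
    """Cost per unit of work relative to the baseline (1.0 means no change)."""
    return price_ratio / speed_ratio

if __name__ == "__main__":
    # Assumed reading: 25% cheaper -> 0.75x price, 25% faster -> 1.25x throughput.
    ratio = relative_cost_per_job(0.75, 1.25)
    print(f"cost per job vs. baseline: {ratio:.2f}")
```

Under that reading the cost per job is 0.75 / 1.25 = 0.6 of the baseline, i.e. roughly 40% less per unit of work.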


Any rumors on when GCP will get an ARM offering?


Yeah, the tone of these talks is kind of weird. They talk about how "we decided to do foo" when the reality is "we updated to the latest tech from our upstream providers which got us foo".


Like how Apple takes credit for packaging new technology in an easy to use product? What’s wrong with that? They’re not exactly hiding it.


Isn't making the CPU wider one of the things Apple also did with M1? Doesn't feel like they are the first.


Apple designed the M1. AWS is (probably) using off-the-shelf Neoverse V1 cores that they did not design.

[Imagine "you made this, I made this" meme here]


They have a huge design team making custom silicon. They deserve a bit more credit even if they’re leaning on ARM IP.


Maybe but M1 doesn't really compete in this market.


A recurring theme is "build a processor that performs well on real workloads".

It occurs to me that AWS might have far more insight into "real workloads" than any CPU designer out there. Do they track things like L1 cache misses across all of EC2?


Reality varies. It's a truism in optimization that the only valid benchmark is the task you are trying to accomplish. These chips have been optimized for an average of the tasks run on AWS (which is entirely sensible for them), but that doesn't mean they'll be the best for your specific job.


They'll definitely have information that traditional CPU designers won't. Check out this talk from Brendan Gregg (he's probably lurking), where he specifically calls this out:

See slide 26 (and the rest ofc :)).


Hard to track for other people’s VMs, but they probably have (or can sample) that data for every AWS-operated service (Dynamo, S3, Redshift, etc.)


There is a strong internal mandate for internal services to switch over to Graviton. So either they likely have this data, or they are just trying to free up more x64 cores for external customers.


Yeah, this is key. They definitely optimize for their own services. And they don’t run S3 and redshift on the same cpu/server at the same time.


They may not run those particular services on the same hosts but they heavily use Lambda (and docker) which can share hosts and be tossed around the data centers to saturate cores.


AWS can also build slightly different CPUs under the same name for different workloads and not tell anybody.


Arm seems poised to replace x86 in servers. If I were Intel this would make me really nervous.


Very unlikely. See, for example, Linus's reasoning


I think the flaw in Linus’ argument is that this happened in the 90s-2010s for x86. A foundational time, especially for his worldview, but I don’t know that the pattern repeats (some of his viewpoint is colored by his time at Transmeta).

The development world today looks very different. Back then, language support for other architectures was more bespoke and CPU vendors had to add support for their chips. Today, there are plenty of very rich, platform-agnostic (both CPU and OS) libraries. Additionally, mobile development has sufficiently matured ARM development that I don’t think that argument holds. If it did, then developers wouldn’t be able to develop on their x86 MacBooks and deploy to their mobile Apple devices (yes it’s ARM now but it hasn’t been for the majority). I think the plain x86-box -> server story was pretty solid, but the cloud has changed that. Everyone is now starting out in the cloud with CPU-agnostic languages where switching architectures usually is as simple as changing 1 line in a config. In some cases it matters but the vast majority of SW dev shops don’t feel this like you used to in the 90s and 00s. Plus M1s now provide developers with local ARM development.


Linus's reasoning is sound, but the issue is that ARM development platforms are becoming a thing and to be honest I see x86 as being in the early stages of a death spiral and so does Intel the way they're focusing on the fabrication side of their business.

If anything programmers are adopting ARM based computers faster than the rest of the market. As pretty much every developer tool gets ported for Apple silicon every company is going to shrug and go "May as well release an ARM Windows/ARM Linux build as well".


I totally agree with everything you said except that devs are switching faster. I think the first to switch were low end Chromebooks and Surface Go type devices. M1 is pulling the devs and professionals in, and gaming will be the last holdout (due to optimized IP that may be abandonware and never updated).


The good thing I see at work is that we all make everything work for x86 and arm. So we can deploy on any kind of cloud platform cpu and not worry about that anymore.


We've been migrating our production to Graviton2 (now Graviton3). Our developers run x86 Macs. Everything runs on the JVM, Python, Node, Go, so nobody feels like there's a difference. The ARM transition has been transparent for us.

Linus' reasoning makes sense, but the real world disagrees with him (at least in our case).


Linus's argument is that devs will use the same processors in production that they develop on. But everyone already has to develop for ARM because mobile runs ARM. And now the M1 Macs do too (and these AWS servers). So if you're forced to use ARM because of mobile and now there are good options for desktop and servers to use ARM as well, I don't see why people wouldn't switch to them. Basically Linus's own logic seems to contradict his claim.


That is, as far as the reasoning applies, why I consider the M1 Macs so pivotal. The MB Pro already was a very popular machine for developers. Now it not only got much faster and better, it also offers access to great ARM development machines. Be it for the largest market, smartphones, or for the cloud solutions based on ARM machines as the Graviton.


1. Fewer and fewer people run their stack on the laptop. There is tooling today to run even unit tests remotely pretty painlessly like bazel (and the like) and docker

2. With languages like java, go, python, node it doesn’t even matter

3. Devs are migrating to arm en masse (M1s)


Agree with 1. I'm part of 3. Regarding 2, it does matter for anything that has bits of optimized C code that was only done for x86. I have a lot of Node and Python things that don't run on my M1 natively (they even crash on qemu x86 VMs whatever the kind of CPU features I emulate).
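When debugging that kind of breakage, a first step is often to check which architecture the interpreter itself is running as. Below is a minimal sketch using the standard library's `platform.machine()`; the `arch_family` helper and its grouping of names are assumptions made for this example. On an Apple silicon Mac, an x86 build of Python running under Rosetta 2 should report `x86_64` rather than `arm64`.

```python
import platform

def arch_family(machine):
    """Map a platform.machine() string to a coarse architecture family."""
    m = machine.lower()
    if m in ("arm64", "aarch64"):
        return "arm64"
    if m in ("x86_64", "amd64"):
        return "x86_64"
    return "other"

if __name__ == "__main__":
    reported = platform.machine()
    print(reported, "->", arch_family(reported))
```

Pure Python code does not care about the answer, but packages that ship compiled extensions are built per architecture, which is where a mismatch tends to surface.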


Unlikely for now. The ball has just started rolling with the M1.


Possibly on the desktop too... I imagine we'll see many M1-like Windows PC options in the near future...


They are coming. Clearly not as impressive as the latest M1 Max and Pro, but getting there.


Don't forget Ampere's A1. I found them really, really impressive for SAT solving, and that you can get them at 1ct/core/hour at Oracle makes them really financially attractive.


5 or 6 years ago Marc Andreessen was saying this would happen eventually. I was skeptical when I first heard the claim, but it's seeming more and more likely.


I don't understand this at all.

The first and second transcripts seemingly contradict each other. The first one says:

>Cores got so big and complex that it was hard to keep everything utilized.

But then the second one is about how they improved performance by making the cores bigger and more complex. Why is it possible to feed their wide core but not the competition's? Why is it that idle transistors are bad in the competition but Graviton benefits from specialized vector instructions that are only useful in some workloads?

>With Graviton2, one of the things we prioritized was large core local caches. In fact, the core local L1 caches on Graviton2 are twice as large as the current generation x86 processors.

This doesn't make sense. All modern x86 machines have both an L1 and L2 cache that are local to each core, with only the L3 cache being shared. My laptop has a total of 288 KiB of cache dedicated to each core.

The fact that the competition uses 32 KiB L1 caches has nothing to do with a difference in philosophy. Everyone realizes that caching is important and everyone uses the biggest caches they can get away with. The reason x86 designers chose a smaller cache is because, in their designs, increasing the cache size would reduce performance. Large caches are slower than small ones. Increasing the hit rate is not necessarily worth it if it makes every cache access more expensive.
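The trade-off in that last sentence is usually written as average memory access time (AMAT), where $t_{\text{hit}}$ is the cache hit latency, $m$ the miss rate, and $t_{\text{miss}}$ the extra penalty paid on a miss:

```latex
\text{AMAT} = t_{\text{hit}} + m \cdot t_{\text{miss}}
```

As a purely illustrative example with made-up numbers (not measurements of any real chip): a small L1 with $t_{\text{hit}} = 4$ cycles and $m = 5\%$, backed by a 12-cycle miss penalty, gives $4 + 0.05 \cdot 12 = 4.6$ cycles. A larger L1 with $t_{\text{hit}} = 5$ cycles and $m = 4\%$ gives $5 + 0.04 \cdot 12 = 5.48$ cycles, so the bigger cache loses despite its better hit rate. With a different design point, where the hit latency does not grow or the miss penalty is much larger, the comparison can flip.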

>most workloads could actually run more efficiently if they had more memory bandwidth and lower latency access to memory

In other news, water is wet. Memory has been the bottleneck for as long as I've been alive.