cletus
cavisne
The past few years there has been a weird situation where Google and AWS have had worse GPU offerings than smaller providers like CoreWeave and Lambda Labs. This is because they didn't want to buy into Nvidia's proprietary Infiniband stack for GPU-to-GPU networking, and instead wanted to make it work on top of their own (but still pretty proprietary) Ethernet stacks.
The outcome was really bad GPU-to-GPU latency and bandwidth between machines. My understanding is that ConnectX is Nvidia's supported (and probably still very profitable) way for these hyperscalers to use their proprietary networks without buying Infiniband switches and without paying the latency cost of moving bytes from the GPU to the CPU.
latchkey
Your understanding is correct. Part of the other issue is that at one point there was a huge shortage of IB switches... lead times of a year or more... so another solution had to be found.
RoCE is RDMA (the IB transport) over Ethernet. All the underlying documentation and settings to put this stuff together are the same. It doesn't require ConnectX NICs, though. We do the same with 8x Broadcom Thor 2 NICs (into a Broadcom Tomahawk 5 based Dell Z9864F switch) for our own 400G cluster.
neomantra
Nvidia got ConnectX from their Mellanox acquisition -- Mellanox were experts in RDMA, particularly over Infiniband but eventually pushing Ethernet (RoCE) as well. These NICs hardware-accelerate RDMA, so over an RDMA fabric GPUs can communicate with each other with very little CPU involvement (the "GPU-to-GPU" traffic mentioned in the article).
[I know nothing about Jupiter, and little about RDMA in practice, but I used ConnectX for VMA, its hardware-accelerated kernel-bypass tech.]
ceph_
From memory: Firehose > Watchtower > WCC > SCC > Jupiter v1
virtuallynathan
This latest revision of Jupiter is apparently 400G, as is the ConnectX-7; A3 Ultra will have 8 of them!
CBLT
I would guess the Nvidia ConnectX is part of a secondary networking plane, not plugged into Jupiter. Current-gen Google NICs are custom hardware with a _lot_ of Google-specific functionality, such as running the borglet on the NIC to free up all CPU cores for guests.
alex_young
Like most discussions of the last 25 years, this one starts 9 years ago. Good times.
eru
The Further Resources section goes a bit further back.
DeathArrow
It seems all cutting-edge datacenters, like x.ai's Colossus, are using Nvidia networking. Now Google is upgrading to Nvidia networking too.
Since Nvidia owns most of the GPGPU market and has top-notch networking and interconnects, I wonder if they have a plan to own all datacenter hardware in the future. Maybe they plan to also release CPUs, motherboards, storage, and whatever else is needed.
danpalmer
I read this slightly differently: specific machine types with Nvidia GPU hardware also get Nvidia networking for tying those GPUs together.
Google has its own TPUs and doesn't really use GPUs except to sell them to end customers on Cloud, I think. So using Nvidia networking for Nvidia GPUs across many machines on Cloud is really just a reflection of what external customers want to buy.
Disclaimer, I work at Google but have no non public info about this.
adrian_b
Nvidia networking is what used to be called Mellanox networking, which was already dominant in datacenters.
immibis
Only within supercomputers (including the smaller GPU ones used to train AI). Normal datacenters use Cisco or Juniper or similarly well-known Ethernet equipment, and they still do. The Mellanox/Nvidia Infiniband networks are specifically used for supercomputer-like clusters.
wbl
Mellanox Ethernet NICs got used in a bunch of places due to better programmability.
crmd
Mellanox IB was ubiquitous in storage networking. None of the storage systems I worked on would have been possible without Mellanox tech.
wil421
Most people on the TrueNAS/FreeNAS forums use 10Gb Mellanox NICs sold on eBay. The sellers get them after server gear is retired.
dafugg
You seem to have a narrow definition of “normal” for datacenters. Meta were using OCP Mellanox NICs for common hardware platforms a decade ago and still are.
HDThoreaun
I have to wonder if Nvidia has reached a point where it hesitates to develop new products because they would hurt its margins. Sure, they could probably release a profitable networking product, but if they did, their net margin would decrease even as profit increased. That could actually hurt their market cap, as investors absolutely love high margins.
jonas21
I believe this is what they plan on doing. See, for example:
https://www.youtube.com/live/Y2F8yisiS6E?si=GbyzzIG8w-mtS7s-...
Kab1r
Grace Hopper already includes Arm based CPUs (and reference motherboards)
mikeyouse
Yeah, there’s a bit of industry worry about that very eventuality -- hence the Ultra Ethernet Consortium trying to work on open alternatives to the Mellanox/Nvidia lock-in.
ravetcofx
Interesting that Nvidia is on the steering committee.
jclulow
Cisco has sat on the steering committees for a lot of things where it had a proprietary initial version of something. It's not that unusual. Also, these efforts are often frankly not actually that open; e.g., see the rent-seeking racket for access to PCI documentation, or USB-IF actively seeking to prevent open source hardware from existing, etc.
timenova
That was their plan with trying to buy ARM...
maz1b
Pretty crazy. Supporting a 1.5 Mbps video call for each human on Earth? Did I read that right?
Just goes to show what drastic and extraordinary levels of scale look like.
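A quick back-of-the-envelope check on that claim. The figures are my assumptions, not the article's: ~8 billion people, and the 13 Pbps Jupiter bandwidth number quoted elsewhere in this thread.

```python
# Does ~13 Pb/s of fabric bandwidth cover a ~1.5 Mb/s
# video call for every human on Earth? (Both inputs are assumptions.)
fabric_bps = 13e15   # ~13 Pb/s, the Jupiter figure mentioned in the thread
population = 8e9     # ~world population
per_person_mbps = fabric_bps / population / 1e6
print(f"{per_person_mbps:.3f} Mb/s per person")  # → 1.625 Mb/s
```

So the quoted bandwidth works out to roughly 1.6 Mb/s per person, consistent with the 1.5 Mb/s claim.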
sethammons
Scale means different things to different people
jerzmacow
Wow, and it doesn't open with a picture of their Lego server? Wasn't that their first one, 25 years ago?
teractiveodular
It's a marketing piece, they don't particularly want to emphasize the hacky early days for an audience of Serious Enterprise Customers.
ksec
They managed to double from 6 Petabit per second in 2022 to 13 Pbps in 2023. I assume with ConnectX-8 this could be 26 Pbps in 2025/26. The ConnectX-8 is PCIe 6, so I assume we could get a 1.6 Tbps ConnectX-9 with PCIe 7.0, which is not far away.
Cant wait to see the FreeBSD Netflix version of that post.
This also goes back to how increasing throughput is relatively easy and has a very strong roadmap, while increasing storage is difficult. I notice YouTube has been serving higher-bitrate video in recent years with H.264, rather than storing yet another copy of video files in VP9 or AV1 unless they are 2K+.
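The PCIe math behind that NIC roadmap roughly checks out. A sketch using raw x16 link rates per generation (my numbers, from the PCIe specs; real payload throughput is somewhat lower after encoding and protocol overhead):

```python
# Raw PCIe x16 link rates per generation: GT/s per lane x 16 lanes.
# From PCIe 6.0 on, PAM4 signaling means GT/s maps ~1:1 to raw Gb/s.
lane_gtps = {5: 32, 6: 64, 7: 128}
for gen, gtps in lane_gtps.items():
    raw_gbps = gtps * 16
    print(f"PCIe {gen}.0 x16: ~{raw_gbps} Gb/s raw")
# PCIe 5.0 x16 (~512 Gb/s)  fits a 400G ConnectX-7,
# PCIe 6.0 x16 (~1024 Gb/s) fits an 800G ConnectX-8,
# PCIe 7.0 x16 (~2048 Gb/s) would fit a hypothetical 1.6T ConnectX-9.
```

Each PCIe generation doubles the lane rate, which is why NIC line rates can plausibly keep doubling on the same schedule.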
dangoodmanUT
Does GCP have the worst networking for GPU training, though?
dweekly
For TPU pods they use a 3D torus topology with multi-terabit cross-connects. For GPUs, A3 Ultra instances offer "non-blocking 3.2 Tbps per server of GPU-to-GPU traffic over RoCE".
Is that the worst for training? Namely: do superior solutions exist?
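For what it's worth, that 3.2 Tbps figure is consistent with one 400G ConnectX-7 per GPU and the 8 NICs per A3 Ultra server mentioned upthread (the per-GPU pairing is my assumption):

```python
# 8 NICs/server x 400 Gb/s each = the quoted GPU-to-GPU bandwidth per server
nics_per_server = 8   # A3 Ultra, per the comment upthread
nic_gbps = 400        # ConnectX-7 line rate
print(nics_per_server * nic_gbps / 1000, "Tb/s per server")  # → 3.2
```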
cletus
This mentions Jupiter generations, which I think go back 10-15 years at this point. It doesn't really talk about what existed before, so it's not really 25 years of history here. I want to say "Watchtower" came before Jupiter? But honestly it's been about a decade since I read anything about it.
Google's DC networking is interesting because of how deeply it's integrated into the entire software stack. Click on some of the links and you'll see it mention SDN (Software Defined Networking). This is so Borg instances can talk to each other within the same service at high throughput and low latency. 8-10 years ago these were (IIRC) 40Gbps connections. It's probably 100Gbps now, but that's just a guess.
But the networking is also integrated into global services like traffic management to handle, say, DDoS attacks.
Anyway, from reading this it doesn't sound like Google is abandoning its custom TPU silicon (i.e. it talks about the upcoming A3 Ultra and Trillium). So where does Nvidia ConnectX fit in? AFAICT that's just the NIC they're plugging into Jupiter, and probably what enables (or will enable) 100Gbps connections between servers. Yes, 100GbE optical NICs have existed for a long time. I would assume Nvidia produces better ones in terms of price, performance, size, power usage and/or heat produced.
Disclaimer: Xoogler. I didn't work in networking though.