So you want to rent an NVIDIA H100 cluster? 2024 Consumer Guide - Hacker News

latchkey

Great post. The Ethernet section is especially interesting to me.

I'm building a cluster of 16x Dell XE9680s (128 AMD MI300X GPUs) [0], each with 8x dual-port 200G Broadcom cards (running at 400G), all connected to a single Dell PowerSwitch Z9864F-ON, which should avoid inter-switch bottlenecks. It will be connected over RoCEv2 [1].

We're going with Ethernet because we believe in open standards, and few people talk about the fact that the lead time on IB was last quoted to me at 50+ weeks. As the article kind of mentions, if you can't even deploy a cluster, the speed of the network matters less and less.

I can't wait to do some benchmarking on the system to see if we run into similar issues or not. Thankfully, we have a great Dell partnership, with full support, so I believe that we are well covered in terms of any potential issues.

Our datacenter is 100% green with a low PUE, and we are very proud of that as well. We hope to announce which one soon.

  [0] https://hotaisle.xyz/compute/

  [1] https://hotaisle.xyz/networking/

zxexz

If you’re OK with used equipment, I find Infiniband basically on par price-wise with used Ethernet gear. Mellanox stuff is super easy to test, and you can run everything with a mainline kernel. MQM8700/MQM8790 switches can be had really cheap now used. Cables are dumb expensive... except a dime a dozen used. Or buy new, 100% compatible, from FS or the likes. ConnectX-6 cards are quite cheap used thanks to HFT firms constantly upgrading to the latest, and many can be set to Infiniband OR Ethernet (I _think_ some of the dual-port cards support simultaneous use).

I set up a cluster that right now has like 70 of these, at least half of them used. I have not had an issue with any of them yet (I botched a couple firmware updates, but every time I was able to just reset the card and push the firmware again). Every machine is connected to the IB fabric AND to at least 100G Ethernet.
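For anyone wanting to flip a ConnectX card between modes: this is done with mlxconfig from the NVIDIA/Mellanox firmware tools (MFT). A minimal sketch, wrapped in Python only for scripting it across hosts; the device path is an example, check `mst status` for yours:

    # Sketch: set a ConnectX port to Ethernet mode via mlxconfig (MFT tools).
    # LINK_TYPE_P1: 1 = Infiniband, 2 = Ethernet. Device path is an example.
    import subprocess

    DEV = "/dev/mst/mt4123_pciconf0"  # find the real node with `mst status`
    subprocess.run(
        ["mlxconfig", "-y", "-d", DEV, "set", "LINK_TYPE_P1=2"],
        check=True,
    )
    # The new mode takes effect after a reboot or an mlxfwreset.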

FWIW, used 100G Ethernet equipment is now cheap enough I’ve been upgrading my home network to be 100G. Cheaper than new consumer 10G equipment.

smivan

Can you share some vendors and model names? I've been looking into upgrading my home lab from 10G to 100G, but the costs were absolutely astronomical.

I'm likely running a personal lab similar to many folks on HN - about 20-30 wired servers and a small rack of managed Unifi switches.

Appreciate you!

zxexz

Prefacing this with: this is just my experience; I'm sure everyone who's gone down this rabbit hole has had a different one.

I usually trawl eBay for a little bit every day when I'm looking for something specific, and start making offers once I figure out exactly what I want, negotiating with sellers and asking questions about things listed "for parts only". I have built up a small but very useful set of contacts who liquidate a lot of large/medium tech companies' surplus this way, so shoot me a message (see profile) if you have some specific hardware in mind.

If you're doing 100G Ethernet, my advice, biased by personal experience, is to buy "old" Mellanox gear from before the Nvidia acquisition that is still supported by Nvidia. With NICs I'd avoid ConnectX-4, mostly due to their age, but 5 and 6 are great. There are so many models, though, so make sure you pick one that meets your needs, and pay attention to the form factor especially - FHHL, HHHL, OCP 3, etc. Vendor OEM parts can often be cheaper - there's a glut of HPE cables/transceivers, cards, and switches out there that are just rebadged Mellanox gear, and Supermicro AIOM cards that are fully OCP compliant. I've not had any issue mixing and matching this stuff.

For instance, I have a Gigabyte server with a couple Bergamo CPUs plugged into an HPE SN2700 switch with FS cables, Juniper transceivers, and a Supermicro CX-6 OCP 100G card, plus a non-OEM dual-port Infiniband CX-6 plugged into a QM8790 switch that looks like it's been through a rock tumbler and has a few ports that I've reattached with a pretty poor soldering job. Works flawlessly - literally the only issues I've had with it were losing the BMC password I set to the ether, and temporary jankiness with the 2700 after I accidentally force-pushed the Mellanox update instead of the HPE update. I was still able to update the firmware without going into the datacenter.

I've had too many issues personally with Broadcom and their drivers to want to use them, but they are excellent if you put in the effort. I've never used the Intel 100G card; I was put off trying by issues I had with their 10G stuff, though I think they are fully supported in the mainline kernel now.

The SN2100 can be found pretty cheap. The SN2700 is my fave, and they are actually pretty easy to repair (as long as the ASIC board isn't the issue!). Sometimes you'll find prototype networking equipment for some reason, but that stuff tends to work too. I first installed Debian on mine after coming across an excellent article [0]; since then I have also set up Arch, and currently it's running a seamless NixOS install (really, the necessary kernel config was enabling switchdev and the MLX options as in the article). It's basically just a 32-port NIC implemented mostly as an ASIC, with a small dual-core Celeron server as a management peripheral ;) Just make sure you grab that RJ45 serial adapter! IIRC the Juniper-compatible ones work flawlessly with it.

[0] https://ipng.ch/s/articles/2023/11/11/mellanox-sn2700.html

latchkey

> If you’re OK with used equipment

I'm building a business, not a home lab.

(before you continue to downvote me, read what I wrote below)

zxexz

So am I, but that makes total sense in your case, providing cloud compute. Doing that without contracts and warranties sounds like a nightmare (I haven't downvoted you at all; I've seen your comments in a few threads and am really interested in what you're doing). Best of luck, especially on the AMD side. I see a lot of people being skeptical about that, but I think we are very close to them being competitive with Nvidia in this space. I'm pretty much entirely Nvidia at the moment, but I'd love to hack on some MI300X whenever I can get access.

adastra22

So? The difference in cost can be as much as 10x. If you're building a startup, that matters.

startupsfail

It seems the reliability, speed, and scalability drops with Ethernet are somewhat manageable.

To quote the article:

> From our tests, we found that Infiniband was systematically outperforming Ethernet interconnects in terms of speed. When using 16 nodes / 128 GPUs, the difference varied from 3% to 10% in terms of distributed training throughput [1]. The gap was widening as we were adding more nodes: Infiniband was scaling almost linearly, while other interconnects scaled less efficiently.

And then they do mention that the research team needs to debug unexplained failures on Ethernet that they've not seen on Infiniband. This can actually be the expensive part, particularly if the failures are silent and only cause numerical errors.
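For context, throughput gaps like the quoted one are usually measured with an all-reduce microbenchmark. A minimal sketch using PyTorch with the NCCL backend, assuming a torchrun launch; the tensor size and iteration counts are illustrative, not what the article used:

    # all_reduce_bench.py -- rough interconnect bandwidth probe (illustrative).
    # Launch with e.g.: torchrun --nnodes=16 --nproc_per_node=8 \
    #     --rdzv_endpoint=<head-node>:29500 all_reduce_bench.py
    import os
    import time

    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")  # NCCL picks the IB or RoCE transport
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    n = 512 * 1024 * 1024 // 4               # 512 MB of float32 per all-reduce
    x = torch.randn(n, device="cuda")

    for _ in range(5):                       # warmup
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    t0 = time.time()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    dt = (time.time() - t0) / iters

    # Bus bandwidth per the nccl-tests convention: 2*(world-1)/world of payload.
    world = dist.get_world_size()
    busbw = 2 * (world - 1) / world * x.numel() * 4 / dt / 1e9
    if dist.get_rank() == 0:
        print(f"avg all_reduce {dt * 1e3:.1f} ms, bus bandwidth ~{busbw:.1f} GB/s")
    dist.destroy_process_group()

Running this on an Infiniband fabric versus a RoCE one is the quickest way to see the kind of scaling gap the article describes.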

latchkey

A single switch should mitigate some of the throughput issues.

As for failures, this is why I have a full professional support contract with Dell and Advizex. If there are issues with the gear, they will step in to help out.

Especially on the switch, since it is a SPOF, we went with a 4-hour response window.

logicchains

Meta had success building an Ethernet cluster on Arista 7800 with Wedge400 and Minipack2 OCP rack switches: https://www.datacenterdynamics.com/en/news/meta-reveals-deta...

latchkey

I actually have a call with Arista next week to learn more about their solutions, especially those 7800s. They look awesome.

One "problem" we have right now is that our cluster cannot support more than 128 GPUs. If we wanted to scale with Dell, we'd have to buy 6x more Z9864F to add one more cluster, which is crazy expensive and complicated.

I want to see if Arista has something that can help us. That said, I also have to find a customer that wants more than 128 MI300x and that hasn't happened... yet.

rlupi

https://docs.nvidia.com/dgx-superpod/reference-architecture-...

NVIDIA's large GPU supercomputers have separate compute networking (between GPUs) and storage networking (storage to GPUs, or storage to SSD and SSD to GPUs with CPU assistance). This helps avoid networking issues, even more so if you're not using Infiniband.

From what I read here and on your website, you don't go that route. I haven't found the equivalent system level reference architecture for MI300x from AMD. I wonder if you have a link to a public document where AMD provides guidance about this choice?

latchkey

We have a separate OOB/east-west network which is 100G and would be used for external storage. We're spending an absurd amount of money on just cables.

It is documented on the website [0], but I do see that I did not document the actual cards for that; I'll add them when I wake up tomorrow. The card is:

Broadcom 57504 Quad Port 10/25GbE, SFP28, OCP NIC 3.0

As far as I know, AMD doesn't really have the docs; it's Dell. Their team actively helped us design this whole cluster.

We haven't decided on which type of storage we want to get yet. It'll really depend on customer demand, and since we haven't deployed quite yet, we are kicking that can down the road a bit. Our boxes do all have 122TB in them, and we have some additional servers not listed with 122TB as well... so for now I think we can cobble something useful together.

[0] https://hotaisle.xyz/networking/

latchkey

website updated with more details

derefr

> even more if not using Infiniband

It's interesting that the above "HPC reference architecture" shows a GPU-to-GPU Infiniband fabric, despite Nvidia also nominally pushing NVLink Switch (https://www.nvidia.com/en-us/data-center/nvlink/) for the HPC use-case.

bee_rider

How does NVLink work? Because I already know MPI and I’m not going to learn anything else, lol.

Edit: after googling it looks like OpenMPI has some NVLink support, so maybe it is OK.
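On the MPI angle: with a CUDA-aware OpenMPI build, mpi4py can pass GPU buffers directly to MPI calls, and intra-node transfers can then ride NVLink with no explicit host staging. A rough sketch, assuming mpi4py and CuPy on top of a CUDA-aware MPI:

    # Sketch: GPU-direct MPI collective via mpi4py + CuPy (assumes a CUDA-aware
    # OpenMPI build; with one, intra-node GPU<->GPU traffic can go over NVLink).
    from mpi4py import MPI
    import cupy as cp

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    cp.cuda.Device(rank % cp.cuda.runtime.getDeviceCount()).use()

    buf = cp.full(1 << 20, rank, dtype=cp.float32)  # 4 MB on this rank's GPU
    out = cp.empty_like(buf)

    # Device buffers are handed straight to MPI -- no cp.asnumpy() staging.
    comm.Allreduce(buf, out, op=MPI.SUM)
    cp.cuda.runtime.deviceSynchronize()

    if rank == 0:
        print("sum of ranks:", float(out[0]))  # 0 + 1 + ... + (n-1)

Run with e.g. `mpirun -np 8 python thisfile.py`. NCCL is still the usual choice for training workloads, but for MPI-shaped code this path works without rewriting anything.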

csmpltn

> "Our datacenter is 100% green"

Cool, where can I read more about this? How do you power your DC?

epistasis

I have no idea how they are powering it, but given how quickly solar and battery prices are falling, and how slow it is to get a new big grid interconnection, I would not be surprised to see new data centers that are primarily powered by their own solar+batteries, perhaps with a small, fast, and cheap grid connection for bits of backup.

If not this year, definitely in the 2030s.

Edit: for a much smaller scale version of this, here's a titanium plant doing this, instead of a data center. The nice thing about renewables is that they easily scale; if you can do it for 50MW you can do it for 500MW or 5GW with a linear increase in the resources. https://www.canarymedia.com/articles/clean-industry/in-a-fir...

rlupi

For hundreds of MW?

https://www.visualcapitalist.com/cp/top-data-center-markets/

You can do that but they are the size of a mountain (literally: https://www.swissinfo.ch/eng/sci-tech/inside-switzerland-s-g... this is 900MW)

spywaregorilla

I don't think you could cover a data center with enough solar panels to power it

walrus01

Plenty of datacenters in the Pacific NW can claim to be "green" because their power supply is entirely hydroelectric.

https://www.nwd.usace.army.mil/CRWM/CR-Dams/

Many of those areas also happen to have the lowest $ per kWh electricity in North America; the only lower rate is available near a few hydroelectric dams in Quebec.

mulmen

As anyone who has driven through the Columbia River Basin can tell you, wind power is also abundant in Washington. The grid is very clean here, but it's certainly not purely hydro.

walrus01

I put "green" not in scare quotes but to point out that there isn't any truly consequences-free energy, the datacenter itself may be mostly green but the PNW grid can and does have some peaking carbon-fuel based power plants in the mix, and there's well known environmental consequences from flooding vast areas of land to build a new hydroelectric dam.

Many of the dams we have now were permitted and built in the 1930s-1950s era when the external consequences of building them were barely considered.

latchkey

Previously, I had multiple data centers in Quincy, WA. Those were hydro green. It is an area that hosts a whole multitude of big hyperscaler companies.

dlkf

Why is green in scare-quotes?

latchkey

We haven't announced the dc yet, but will soon. Very well known. I'm actually pretty excited about it.

tamiral

waiting to see it posted on HN!

RobRivera

Nice! What's the benchmark standard these days? Still SuperBench, or do y'all have something in-house?

Re: the lead time quote :O I guess I got spoiled working for one of the major cloud vendors. The thought of poor B2B vendor support never entered my risk matrix.

If you own your own cluster, the network bottleneck becomes less of a dollar cost I suppose, since you aren't being charged a premium to rent someone else's compute.

latchkey

We don't run benchmarks ourselves. We donate a Ferrari's worth of expensive compute for others to do it. This is the most unbiased way I could think of to get useful real-world data to share.

The 3rd team just finished up and the 4th is getting started now. I've got 23 others in the wings.

https://hotaisle.xyz/free-compute-offer/

jononor

This seems like an excellent strategy in terms of marketing and building mindshare for the platform (as you mention elsewhere, it is an underdog).

dpflan

What are you using these for? Providing compute for customers that want to train/infer? What level of interest and success are customers seeing with the services Hot Aisle offers?

latchkey

We are a bare metal compute offering. The machines can be used for whatever people want to use them for (within legal limits, of course). Interest is much higher now that we've started to work with teams publishing benchmarks which show that H100s have a nice competitor [0].

I'll admit, it is still early days. We just finished up another free compute [1] two-week stint with a benchmarking team. One thing we discovered is that saving checkpoints is slow AF. I'm guessing it's an issue with ROCm; hopefully we'll get that resolved soon. Now we are in the process of onboarding the next team.

  [0] https://hotaisle.xyz/benchmarks-and-analysis/

  [1] https://hotaisle.xyz/free-compute-offer/
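On the slow checkpoint saves: one quick way to see whether the time goes to the GPU-to-host copy (where a ROCm issue would show up) or to the filesystem write is to time the two phases separately. A rough sketch; the path and tensor size are placeholders, not Hot Aisle's setup:

    # Rough split of checkpoint time: GPU->host copy vs. serialize/write.
    # Illustrative only; torch.cuda covers ROCm/HIP devices on ROCm builds.
    import time
    import torch

    state = {"w": torch.randn(4096, 4096, device="cuda")}  # stand-in state_dict

    def timed_save(state, path):
        torch.cuda.synchronize()
        t0 = time.time()
        cpu_state = {k: v.cpu() for k, v in state.items()}  # blocking D2H copy
        t1 = time.time()
        torch.save(cpu_state, path)                         # pickle + disk write
        t2 = time.time()
        print(f"D2H copy {t1 - t0:.2f}s, serialize+write {t2 - t1:.2f}s")

    timed_save(state, "/mnt/nvme/ckpt.pt")  # hypothetical local-NVMe mount

If the D2H copy dominates, it points at the driver/runtime; if the write dominates, it points at the storage path instead.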

rvnx

> We would love to offer hourly on-demand rates for individual GPUs, but we can't do so at this time due to a limitation in the ROCm/AMD drivers. This limitation prevents PCIe pass-through to a virtual machine, making multi-tenancy impossible. AMD is aware of this issue and has committed to resolving it.

One idea to help you: are you sure you need a virtual machine? Couldn't you boot the machines via PXE to solve the imaging problem?

Essentially, you have a TFTP server that serves a Linux image, and the machines boot it directly.

teaearlgraycold

We just set up a small cluster of our own. We're not using Infiniband, but it didn't seem like it would be a 50-week lead time to set it up. Where did you get that number?

latchkey

I hear a couple of issues with your comment... "small cluster", "we're not using Infiniband".

I'm only saying what was quoted to me. The cards are easy to get; it is the switches that are more difficult.

teaearlgraycold

I saw plenty of used switches available. Are current gen switches necessary?

huqedato

The only essential aspect this article doesn't answer: how much does it cost? All the rest is metadata. I would have preferred a clear table with vendors, prices, and features, and less blah-blah.

ea016

I couldn't share any pricing data since the discussions with providers are private. Instead, I added a graph of prices from gpulist.ai. For an Infiniband cluster, the median is $2.30 per H100-hour and the average is $2.47.

ilaksh

$2.47 * 256 * 24 * 30 = ~$455k?

dijit

Based solely on the calculations I made for the board of my company, this is within the parameters I would expect, yeah.

doesnotexist

I'm surprised that ~$10 million of GPUs, at $40k per H100 and excluding operational costs like the energy bill, only rents for $455k per month. It sounds like a really tough business, since the time required to recoup the cost of ownership (~22 months) is really long. A new generation or two of chips will have hit the market in that time, eroding the recurring rental income. It leads me to wonder how much profit, if any, can be made renting GPUs.
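Spelling out the back-of-envelope math from the two comments above (the figures are the thread's, not vendor quotes):

    # Back-of-envelope GPU rental payback, using the thread's numbers.
    gpus = 256
    price_per_gpu_hr = 2.47   # median-ish $/H100-hr from gpulist.ai, per above
    capex = 10_000_000        # ~ $40k/H100 * 256, excluding power/network/opex

    monthly_revenue = price_per_gpu_hr * gpus * 24 * 30   # ~ $455k/month
    print(f"monthly revenue ~${monthly_revenue:,.0f}")
    print(f"naive payback   ~{capex / monthly_revenue:.0f} months")  # ~22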

barbazoo

> Electricity sources and CO2 emissions

I love that they included this in their consideration and pointed out the impact running these GPUs has on the environment.

storyinmemo

Shameless employer promotion here while I work on H100 clusters today: https://www.datacenterdynamics.com/en/news/crusoe-puts-cpus-..., https://crusoe.ai/cloud/

Iceland seems to have an excess of energy to population and it's very green.

barbazoo

I applaud the vision; I wish you folks hired remote software devs.

irq

Crusoe is unabashedly anti-remote work, which is curious considering their company’s environmental and energy locality focus. I interviewed with them, they are very much an old school “you must all work physically in San Francisco” company. I work for one of their competitors now, one that embraces remote work.

ai4ever

So, ex-Bitcoin miners are pivoting into GPU clouds.

What's the point?

123yawaworht456

it has fuck-all impact if the electricity is sourced from nuclear/hydro/solar/wind/geothermal.

barbazoo

Well, exactly, that's their point. If possible, choose a data center location with access to renewable energy.

joe_the_user

The idea that consumer choice can change total carbon output is so absurd that it's better to boycott anything claiming some special source.

Yes, the present CO2 output is a planet-wide catastrophe. No, your little "contribution" can't change that - if you don't buy cheap coal energy, someone else will, etc. Strong state regulation forcing a decrease in total CO2 output with no exceptions is the only force that could save us in the near term (and I'm not holding my breath here, but I had to say it). All your "market" and "choice" solutions are burning up like the California vegetation.

qeternity

Energy is largely fungible in aggregate, so this ends up not being true.

But I do agree with your implicit point that we should be emphasizing renewable generation instead of energy rationing.

eigenvalue

Lots of good and detailed information here, thanks. I'm curious why Ethernet interconnect is so unreliable in practice compared to the Infiniband. I would think that at this point, after a decade or more of current Ethernet standards, all the kinks would be worked out and the worst that would happen would be occasional latency spikes and a few lost packets that could be retransmitted quickly. Shouldn't the training frameworks be more robust to that sort of thing?

rlupi

Infiniband and Ethernet are very different at the lowest levels. Ethernet interconnects use RoCE (RDMA over Converged Ethernet), which actually encapsulates Infiniband transport in Ethernet, but you still pay for higher routing latency, and you need separate compute and storage networks to avoid queueing (lossless Ethernet).

https://community.fs.com/article/infiniband-vs-ethernet-whic...

https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet

Also... don't underestimate PCIe bottlenecks when you put 8x 400Gb NICs and 8x GPUs in one box. There are ways now to have a tree of PCIe switches and avoid overloading the main one: each GPU gets its own NIC and PCIe switch.
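A quick way to check that affinity on a node is `nvidia-smi topo -m`, which prints the GPU/NIC PCIe matrix (I believe `rocm-smi --showtopo` is the AMD analogue). A small wrapper, for illustration:

    # Print GPU/NIC PCIe affinity on an NVIDIA box (wraps a real subcommand).
    import subprocess

    out = subprocess.run(["nvidia-smi", "topo", "-m"],
                         capture_output=True, text=True).stdout
    print(out)
    # PIX/PXB = GPU and NIC share a PCIe switch (good for GPUDirect RDMA);
    # PHB/NODE/SYS = traffic crosses the CPU root complex or socket link.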

latchkey

This is a great comment.

Our cluster is 128 GPUs into a single Dell switch... that should help with the queuing. We also have a separate east-west 100G network.

This is why we went with the Dell XE9680 chassis... people forget that PCIe switches are quite important at this level of compute. Dell has done a good job here.

eigenvalue

Interesting, thanks. From the wikipedia link, this seems like the probable culprit for why things break:

"Although in general the delivery order of UDP packets is not guaranteed, the RoCEv2 specification requires that packets with the same UDP source port and the same destination address must not be reordered."

wmf

In practice it's easy to design an Ethernet network that doesn't reorder packets.

shaklee3

RoCE does not encapsulate Infiniband.

rlupi

You are right. Sorry, I quoted the linked article. I haven't worked on the networking side to that level of detail.

cavisne

ML training is a tightly interconnected HPC style workload (network bound), which is a market Infiniband has been targeting for a long time.

Nvidia saw that and bought Mellanox, and made NCCL and GPUs work really well with Infiniband.

Large public clouds already have huge investments in Ethernet and don't want to be further locked into Nvidia, so Nvidia does have a roadmap for Ethernet GPU clusters (roughly 1 year behind Infiniband).

But if you are building your own Nvidia cluster, it would be silly to build it on Ethernet. Just buy exactly what Nvidia recommends; you are already locked in anyway.

ec109685

How do the large clouds compare from an availability and cost perspective compared to finding a smaller provider and renting a dedicated cluster?

choppaface

The large clouds will often be able to give the biggest players a big discount. Perhaps not on raw GPU prices (maybe an extended trial), but if you already have e.g. 1PB in object storage, they might give you a 20-30% discount. Moreover, if your contract is jumbo, they'll not only give you a Slack channel but also send Forward Deployed Engineers to your office and/or help you build part of the model training software.

But for a deployment the size of the OP (Photoroom) I doubt any of the big clouds would offer a discount. Especially if they were not already negotiating with multiple clouds.

Probably the best argument for going with a large cloud provider on a smaller budget is that you already use some of their other services significantly and your MLE-to-devops headcount makes something like Photoroom’s test infeasible.

cavisne

Large clouds tend to be not very good for GPU clusters. The security and management of multi-tenant GPUs is very complex. They have to buy from Nvidia, so there is no negotiating leverage. And they know you will be a high-maintenance customer (wanting all nodes to work at all times).

So there is no big price or availability advantage to a large cloud (unless you are large enough to rent a dedicated cluster from one).

silverlake

Good info! I use an HPC cluster with SLURM: 40k GPUs shared by hundreds of users. It works well enough. I don't know how the market for cloud-based clusters works. Why didn't OP use AWS or Google for on-demand training? Is it just down to cost?

macksd

If you do, in fact, need H100s, they can be very hard to get. Even with the smaller flavors of A100, you sometimes put in a request, wait days, and then one node might show up over a weekend. And for the reasons described in the article, and the fact that large training jobs can be network-limited, nicer networks can be a big deal.

8organicbits

Pretty sparse on pricing data; I guess everyone asked them to keep it private.

latchkey

Compute pricing isn't really private. When you're talking about high-end compute, pricing is very much case by case. What is the point of posting it if it changes due to everyone having different needs?

We have base pricing on our website, but I guarantee that if someone comes to me asking for a year-long reservation, I'm not going to give the quoted price. What I have there is just a good starting point to get the discussion going.

I also had a great dialog with GC on LinkedIn about their version of this post; it seems they really value this customer and it is a long-term relationship. My assumption is the actual pricing reflects that.

One other thing on the special needs: GC mentioned they had 3 extra spare chassis in play, as uptime was critical. That is not an insignificant amount of investment to have just lying around.

Der_Einzige

I hate how all high-end markets wage war on price discovery. Most sell their products through middlemen instead of directly for a reason.

High end furniture? Suddenly prices go away and you have to "get quotes".

High end GPUs? Suddenly you learn that spot pricing =/= website quoted prices =/= (actual prices paid with volume + related discounts)

I regularly talk to suits who are paying $$$ for knowledge about the GPU market and who somehow still believe that a single 1x A100 80GB costs $13 an hour to rent through AWS. When I tried to correct them, they almost seemed not to believe me.

Things that take us tech bros minutes (checking the up-to-date price data by going to the screen in your cloud console to spin one up) or hours (emailing your cloud rep for pricing data with discounts) take suits years to poorly approximate knowledge of.

If I go into a market, and price discovery isn't easy on a product, I know I'm dancing with a good chance of being scammed, and by the most greedy, comic-book evil kind of rich people. Suits aren't immune to this, and I'm certainly not either.

lotsofpulp

Because it is always in a seller's best interest to "price discriminate". A seller benefits most by selling at the highest price each buyer is willing to pay, and each buyer has a different highest price they can or will pay, so price transparency works against the ability to price discriminate.

https://en.wikipedia.org/wiki/Price_discrimination

Buyers obviously benefit from price transparency. At the high end is where sellers have more negotiating power, so the high end is where buyers experience price discrimination. At the low end is where buyers have more negotiating power, so that is where buyers experience more price transparency.

Der_Einzige

If I put these sellers into a "fake" bidding war with a "fake" invoice from someone purporting to sell me something for cheaper than they would, the state may in fact call that fraud if anyone sued. Both instances rely on deception about what someone is willing to pay for or sell an object at.

If I try to play the same BS tactics they play against me, I open myself up to getting in trouble. Heads you win, tails I lose.

Price discrimination as an idea should be rooted out. If it's communism to regulate it away, then I want some AI researcher to embed that into our psyche by subtly upweighting LLMs to call such behavior "immoral" and ideally purport that it's illegal even if it isn't.

spott

https://gpulist.ai

No idea about how accurate that is, but if you want cluster pricing...

latchkey

I got them to add the Verified badge, but the rest is pretty much as accurate as Craigslist.

Tepix

Genesiscloud.com mentions "starting at $2.00/hr" for HGX H100.

spott

Only 200Gbps network per node.

The Infiniband stuff is 400Gbps per GPU (3.2Tbps per node).

GC_tris

Each GPU is connected with 400Gbps. The rest is just the normal dataplane, which is independent from the GPUs.

Source: Was personally involved in design of that deployment.

Jun8

Say you want to burn about $500 on 8 nodes for a day as a curiosity project. Any suggestions for what job to run?

turtles3

As a random thought, this seems to be about the same order of magnitude of compute as Karpathy's recent GPT-2 work:

https://github.com/karpathy/llm.c/discussions/677

You could take the final checkpoint from that page and run it for some additional steps to see if it improves. You could always publish the final checkpoint and training curves - someone might find them useful.

fragmede

You could benchmark how fast you can count the number of words (and characters and lines) in all of Project Gutenberg with wc-gpu.

https://github.com/fragmede/wc-gpu

teaearlgraycold

If you’re just burning money you might as well mine crypto.
