andersa

Incredible! I'd been wondering if this was possible. Now the only thing standing in the way of my 4x 4090 rig for local LLMs is finding time to build it. With tensor parallelism, this will be both massively cheaper and faster for inference than an H100 SXM.

I still don't understand why they went with 6 GPUs for the tinybox. Many things will only function well with 4 or 8 GPUs. It seems like the worst of both worlds now (use 4 GPUs but pay for 6 GPUs, don't have 8 GPUs).

georgehotz

tinygrad supports uneven splits. There's no fundamental reason for 4 or 8, and work should almost fully parallelize on any number of GPUs with good software.

We chose 6 because we have 128 PCIe lanes, i.e. 8 x16 ports. We use one for NVMe and one for networking, leaving 6 for GPUs so we can connect them in a full fabric. If we used 4 GPUs, we'd be wasting PCIe, and if we used 8 there would be no room for external connectivity aside from a few USB3 ports.
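
For the arithmetic, a quick Python sketch (just the numbers from the paragraph above):

    TOTAL_LANES = 128                           # EPYC / Threadripper Pro class CPU
    LANES_PER_PORT = 16
    ports = TOTAL_LANES // LANES_PER_PORT       # 8 full x16 ports
    reserved = {"nvme": 1, "networking": 1}     # one x16 port each
    gpu_ports = ports - sum(reserved.values())
    print(ports, "x16 ports,", gpu_ports, "left for GPUs")   # 8 x16 ports, 6 left for GPUs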

andersa

That is very interesting if tinygrad can support it! Every other library I've seen had the limitation on dividing the heads, so I'd (perhaps incorrectly) assumed that it's a general problem for inference.

spmurrayzzz

There are some interesting hacks you can do, like replicating the K/V weights by some factor so they become evenly divisible by whatever number of GPUs you have. Obviously there is a memory cost there, but it does work.
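
A rough sketch of that trick in PyTorch (the helper name, sizes, and weight layout are made up for illustration; real checkpoints and engines differ):

    import torch

    def replicate_kv_heads(k_proj, n_kv_heads, n_gpus):
        # Repeat each K/V head until the head count divides the GPU count.
        factor = 1
        while (n_kv_heads * factor) % n_gpus != 0:
            factor += 1
        head_dim = k_proj.shape[0] // n_kv_heads
        heads = k_proj.view(n_kv_heads, head_dim, -1)
        heads = heads.repeat_interleave(factor, dim=0)   # copy each head `factor` times
        return heads.reshape(-1, k_proj.shape[1]), factor

    k = torch.randn(8 * 128, 4096)                  # 8 KV heads, head_dim 128 (made up)
    k6, factor = replicate_kv_heads(k, n_kv_heads=8, n_gpus=6)
    print(factor, k6.shape)                         # 3 torch.Size([3072, 4096])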

WanderPanda

Did you at least front-run the market and stock up on 4090s before this release? Also, gamers are probably not too happy about these developments :D

tinco

4090s have consistently been around $2000. I don't think there are many gamers who would be affected by price fluctuations of the 4090, or even the 4080.

swalsh

Gamers have a TON of really good, really affordable options. But you kind of need 24GB minimum unless you're using heavy quantization. So 3090s and 4090s are what local LLM people are building with (mostly 3090s, as you can get them for about $700, and they're dang good).

davidzweig

Is it possible a similar patch would work for P2P on 3090s?

BTW, I found a Gigabyte board on Taobao that is unlisted on their site: the MZF2-AC0, which costs $900. Dual-socket EPYC and 10 PCIe slots; may be of interest. A case that should fit, with 2x 2000W Great Wall PSUs and a PDU, is 4050 RMB (https://www.toploong.com/en/4GPU-server-case/644.html). You still need blower GPUs.

georgehotz

It should if your 3090s have Resizable BAR support in the VBIOS. AFAIK most card manufacturers released BIOS updates enabling this.

Re: 3090 NVLink, that only allows pairs of cards to be connected. PCIe allows full fabric switch of many cards.

davidzweig

Update: I checked with the case company, Toploong, and they say that board is about 5mm too big for the case.

cjbprime

Doesn't NVLink work natively on 3090s? I thought it was only removed (and here re-enabled) in the 4090.

doctorpangloss

Have you compared 3x 3090-3090 pairs over NVLink?

IMO the most painful thing is that since these hardware configurations are esoteric, there is no software that detects them and moves things around "automatically", regardless of what people think device_map="auto" does. And anyway, Hugging Face's transformers/diffusers are all over the place.
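
For what it's worth, you can at least take "auto" out of the loop by passing explicit memory caps (or a {module_name: device} dict). A hedged sketch, with the model name and limits as placeholders, assuming accelerate is installed:

    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-70b-hf",            # placeholder model
        device_map="balanced",                  # or an explicit {module_name: device} dict
        max_memory={0: "22GiB", 1: "22GiB", 2: "22GiB", 3: "22GiB", "cpu": "64GiB"},
        torch_dtype="auto",
    )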

AnthonyMouse

Is there any reason you couldn't use 7? 8 PCIe lanes each seems more than sufficient for NVMe and networking.

Tepix

6 GPUs because they want fast storage and it uses PCIe lanes.

Besides, the goal was to run a 70B FP16 model (requiring roughly 140GB of VRAM), and 6*24GB = 144GB.

andersa

That calculation is incorrect. You need to fit both the model (140GB) and the KV cache (about 5GB per sequence at 32k tokens in FP8 with FlashAttention-2) times the batch size into VRAM.

If the goal is to run an FP16 70B model as fast as possible, you would want 8 GPUs with P2P, for a total of 192GB of VRAM. The model is then split across all 8 GPUs with 8-way tensor parallelism, letting you make use of the full 8TB/s of aggregate memory bandwidth on every iteration. That leaves about 50GB, spread across the GPUs, for KV cache pages, so you can raise the batch size to 8 (or maybe more).
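
Back-of-envelope, with the same rough numbers as above (not measurements):

    params       = 70e9
    weight_bytes = params * 2            # FP16 -> ~140 GB
    kv_per_seq   = 5e9                   # ~5 GB per 32k-token sequence in FP8
    total_vram   = 8 * 24e9              # eight 24 GB cards

    max_batch = (total_vram - weight_bytes) // kv_per_seq
    print(weight_bytes / 1e9, "GB of weights,", int(max_batch), "sequences of headroom")
    # -> 140.0 GB of weights, 10 sequences of headroom (before activations/overhead)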

renewiltord

I’ve got a few 4090s that I’m planning on doing this with. Would appreciate even the smallest directional tip you can provide on splitting the model that you believe is likely to work.

Tepix

I know there's some overhead; it's not my calculation.

https://www.tweaktown.com/news/97110/tinycorps-new-tinybox-a...

Quote: "Runs 70B FP16 LLaMA-2 out of the box using tinygrad"

Related: https://github.com/tinygrad/tinygrad/issues/3791

liuliu

6 seems reasonable. The 128 lanes from Threadripper need to leave a few for networking and NVMe (4x NVMe would be x16 lanes, and a 10G network would be another x4 lanes).

numpad0

I was googling public NVIDIA SXM2 materials the other day, and it seemed SXM2/NVLink 2.0 was just a six-way system. NVIDIA SXM has since been updated to versions 3 and 4, and this isn't based on any of those anyway, but maybe there's something we don't know that makes six-way reasonable.

andersa

That was probably just from before running LLMs with tensor parallelism became interesting. There are plenty of other workloads that can be divided by 6 nicely; it's not an end-all thing.

dheera

What is a six-way system?

TylerE

Old-school way of saying core (or in this case GPU), basically.

boromi

Any chance you could share the details of the build you'd go for? I need a server for our lab, but am kinda out of my depth with all the options.

ShamelessC

> Many things will only function well with 4 or 8 GPUs

What do you mean?

andersa

For example, if you want to run low-latency multi-GPU inference with tensor parallelism in TensorRT-LLM, there is a requirement that the number of heads in the model is divisible by the number of GPUs. Most currently published models have head counts divisible by 4 and 8, but not 6.
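
Concretely, with the commonly published Llama-2-70B head counts (64 query heads, 8 KV heads):

    n_heads, n_kv_heads = 64, 8          # Llama-2-70B

    for tp in (2, 4, 6, 8):
        ok = n_heads % tp == 0 and n_kv_heads % tp == 0
        print(f"tp={tp}: {'ok' if ok else 'not divisible'}")
    # tp=2: ok, tp=4: ok, tp=6: not divisible, tp=8: ok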

bick_nyers

Interesting... 1 Zen 4 EPYC CPU yields a maximum of 128 PCIe lanes, so it wouldn't be possible to put 8 full-fat GPUs on while maintaining some lanes for storage and networking. Same deal with Threadripper Pro.

segfaultbuserr

It's more difficult to split your work evenly across 6 GPUs, and easier when you have 4 or 8. The latter setups are powers of 2, which, for example, can evenly divide a 2D or 3D grid, whereas 6 GPUs are awkward to program. Thus, the OP argues that a 6-GPU setup is highly suboptimal for many existing applications and there's no point in paying more for the extra two.

cjbprime

I don't think P2P is very relevant for inference. It's important for training. Inference can just be sharded across GPUs without sharing memory between them directly.

andersa

It can make a difference when using tensor parallelism to run small batch sizes. Not a huge difference like in training, because we don't need to update all the weights, but still a noticeable one. In the current inference engines there are some allreduce steps that are implemented using NCCL.

Also, paged KV cache is usually spread across GPUs.
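
Roughly, in PyTorch terms, the allreduce looks like this (a sketch, not any particular engine's implementation; launch with torchrun, shapes are illustrative, and it assumes the world size divides the hidden dim):

    import torch
    import torch.distributed as dist

    dist.init_process_group("nccl")                      # one process per GPU, e.g. via torchrun
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    hidden = 4096
    shard = hidden // dist.get_world_size()
    x_shard = torch.randn(1, shard, device="cuda")       # this rank's slice of the activations
    w_shard = torch.randn(shard, hidden, device="cuda")  # this rank's rows of the weight

    partial = x_shard @ w_shard                          # partial output on each GPU
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)       # NCCL sums the partials (P2P helps here)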

namibj

Batching during inference massively helps arithmetic intensity, and the batch sizes you'd want for that tend to exceed the memory capacity of a single GPU. Hence the desire for training-like cluster processing, e.g. using a weight for every inference stream that needs it each time it's fetched from memory. It's just that you typically can't fit the context for 100+ inference streams on one GPU, hence the desire to shard along dimensions that are less wasteful (w.r.t. memory bandwidth) than entire inference streams.
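
The intuition in numbers (illustrative sizes only):

    d = 8192                        # hidden size, illustrative
    for b in (1, 8, 64):
        flops = 2 * b * d * d       # multiply-accumulates for a d x d matmul on b tokens
        weight_bytes = 2 * d * d    # FP16 weights, fetched once per batch
        print(f"batch {b:3d}: ~{flops // weight_bytes} FLOPs per weight byte")
    # batch 1: ~1, batch 8: ~8, batch 64: ~64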

qeternity

You are talking about data parallelism. Depending on the model, tensor parallelism can still be very important for inference.

corn13read2

A MacBook is cheaper, though.

tgtweak

The extra $3k you'd spend on a quad-4090 rig vs the top MacBook Pro... and that's ignoring the fact that you can't put the two on even ground for versatility (very few libraries are adapted to Apple silicon, let alone optimized).

Very few people that would consider an H100/A100/A800 are going to be cross-shopping a macbook pro for their workloads.

LoganDark

> very few libraries are adapted to Apple silicon, let alone optimized

This is a joke, right? Have you been anywhere in the LLM ecosystem for the past year or so? I'm constantly hearing about new ways in which ASi outperforms traditional platforms, and new projects that are optimized for ASi. Such as, for instance, llama.cpp.

andersa

Sure, it's also at least an order of magnitude slower in practice, compared to 4x 4090 running at full speed. We're looking at 10 times the memory bandwidth and much greater compute.

chaostheory

Yeah, even a Mac Studio is way too slow compared to Nvidia, which is too bad, because at $7000 maxed out to 192GB it would be an easy sell. Hopefully they will fix this by the M5; I don't trust the marketing for the M4.

faeriechangling

Buying a MacBook for AI is great if you were already going to buy a MacBook, as this makes it a lot more cost competitive. It's also great if what you're doing is REALLY privacy sensitive, such as if you're a lawyer, where uploading client data to OpenAI is probably not appropriate or legal.

But in general, I find the appeal is narrow, because consumer GPUs are better both for training in general and for inferencing at scale[1]. Cloud services also allow the vast majority of individuals to get higher-quality inferencing at lower cost. The result is Apple Silicon's appeal being quite niche.

[1] Mind you, Nvidia considers this a licensing violation, not that geohot has historically ever been scared to violate a EULA and force a company to prove its terms have legal force.

llm_trw

So is a TI-89.

amelius

And looks way cooler

numpad0

4x 32GB (128GB) of DDR4 is ~$250. 4x 48GB (192GB) of DDR5 is ~$600. Both are cheaper than the memory upgrade options for Macs (~$1k).

papichulo2023

Not many consumer mobos support 192GB of DDR5.

thangngoc89

Training on the MPS backend is suboptimal and really slow.

wtallis

Do people do training on systems this small, or just inference? I could see maybe doing a little bit of fine-tuning, but certainly not from-scratch training.

chriskanan

This is great news. As an academic, I'm aware of multiple labs that built boxes with 4090s, not realizing that Nvidia had impaired P2P communication among the cards. It's one of the reasons I didn't buy 4090s, despite them being much more affordable for my work. It isn't NVLink, but Nvidia has mostly gotten rid of that except on their highest-end cards anyway. It is better than nothing.

Late last year, I got quotes for machines with four NVLink H100s, but the lead time for delivery was 13 months. I could get the non-NVLink ones in just four months. For now, I've gone with four L40S cards to hold my lab over, but supply chain issues and gigantic price increases are making it very hard for my lab to do its work. That's not nearly enough to support 6 PhD students and a bunch of undergrads.

Things were a lot easier when I could just build machines with two GPUs each with Nvlink for $5K each and give one to each student to put under their desks, which is what I did back in 2015-2018 at my old university.

uniqueuid

And before that, Nvidia made our lives harder by phasing out blower-style designs in consumer cards that we could put in servers. In my lab, I'd take a card for 1/4 the price that has half the MTBF over a card for full price anytime.

photonbeam

How does cost compare with some of the GPU-cloud providers?

uniqueuid

Not OP, but I found this benchmark of Whisper large-v3 interesting [1]. It includes the cloud providers' pricing per GPU, so you can directly calculate break-even timing.

Of course, if you use different models, training, fine-tuning, etc., the benchmarks will differ depending on RAM, FP8 support, and so on.

[1] https://blog.salad.com/whisper-large-v3/

jstanley

What does P2P mean in this context? I Googled it and it sounds like it means "peer to peer", but what does that mean in the context of a graphics card?

__alexs

It means you can send data from the memory of one GPU to the memory of another GPU without going via system RAM. https://xilinx.github.io/XRT/master/html/p2p.html
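
You can check whether the driver will allow it between a given pair of cards, e.g. through PyTorch's wrapper around the CUDA peer-access query (a quick sketch):

    import torch

    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                ok = torch.cuda.can_device_access_peer(i, j)
                print(f"GPU{i} -> GPU{j}: {'P2P' if ok else 'no P2P'}")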

ot1138

Is this really efficient or practical? My understanding is that the latency required to copy memory from CPU or RAM to GPU negates any performance benefits (much less running over a network!)

llm_trw

Yes, the point here is that you do a direct write from one card's memory to the other using PCIe.

In older Nvidia cards this could be done through a faster link called NVLink, but the hardware for that was ripped out of consumer-grade cards and is only in data-center-grade cards now.

Until this post it seemed like they had ripped all such functionality out of their consumer cards, but it looks like you can still get it working at lower speeds over the PCIe bus.

jmalicki

For very large models, the weights may not fit on one GPU.

Also, sometimes having more than one GPU enables larger batch sizes if each GPU can only hold the activations for perhaps one or two training examples.

There is definitely a performance hit, but GPU<->GPU peer is less latency than GPU->CPU->software context switch->GPU.

For "normal" pytorch training, the training is generally streamed through the GPU. The model does a batch training step on one batch while the next one is being loaded, and the transfer time is usually less than than the time it takes to do the forward and backward passes through the batch.

For multi-GPU there are various data parallel and model parallel topologies of how to sort it, and there are ways of mitigating latency by interleaving some operations to not take the full hit, but multi-GPU training is definitely not perfectly parallel. It is almost required for some large models, and sometimes having a mildly larger batch helps training convergence speed enough to overcome the latency hit on each batch.

zamadatix

Peer to peer as in one PCIe slot directly to another without going through the CPU/RAM, not peer to peer as in one PC to another over the network port.

brrrrrm

Yea. It’s one less hop through slow memory

publicmail

PCIe buses are like a tree with "hubs" (really switches).

Imagine you have a PC with a PCIe x16 interface which is attached to a PCIe switch that has four x16 downstream ports, each attached to a GPU. Those GPUs are capable of moving data in and out of their PCIe interfaces at full speed.

If you wanted to transfer data from GPU0 and 1 to GPU2 and 3, you have basically 2 options:

- Have GPU0 and 1 move their data to CPU DRAM, then have GPU2 and 3 fetch it

- Have GPU0 and 1 write their data directly to GPU2 and 3 through the switch they’re connected to without ever going up to the CPU at all

In this case, option 2 is better both because it avoids the extra copy to CPU DRAM and because it avoids the bottleneck of two GPUs trying to push x16 worth of data up through the CPU's single x16 port. This is known as peer to peer.

There are some other scenarios where the data still must go up to the CPU port and back due to ACS, and this is still technically P2P, but doesn’t avoid the bottleneck like routing through the switch would.
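
In PyTorch terms, the two options look roughly like this; whether the second copy actually goes peer-to-peer through the switch depends on the driver and topology, which is exactly what the patch being discussed changes:

    import torch

    x = torch.randn(1 << 20, device="cuda:0")

    staged = x.cpu().to("cuda:1")    # option 1: bounce through CPU DRAM
    direct = x.to("cuda:1")          # option 2: device-to-device copy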

whereismyacc

This would be directly over the memory bus, right? I think it's just always going to be faster like this if you can do it?

fulafel

Yes, networking is similarly pointless.

haunter

Shared memory access for Nvidia GPUs

https://developer.nvidia.com/gpudirect

CamperBob2

The correct term, and the one most people would have used in the past, is "bus mastering."

wmf

PCIe isn't a bus and it doesn't really have a concept of mastering. All PCI DMA was based on bus mastering but P2P DMA is trickier than normal DMA.

publicmail

I consider it bus mastering when the endpoints initiate the transactions

amelius

Stupid terminology. Might as well call an RS-232 link "peer to peer".

userbinator

I wish more hardware companies would publish more documentation and let the community figure out the rest, sort of like what happened with the original IBM VGA (look up "Mode X" and the other non-BIOS modes the hardware is actually capable of, even 800x600 in 16 colors!). Sadly, it seems the majority of them would rather tightly control every aspect of their products' usage, since they can then milk the user base for more $$$, but IMHO the most productive era of the PC was also when it was the most open.

rplnt

Then they couldn't charge different customers different amounts for the same HW. It's not a win for everyone.

axus

The price of the 4090 may increase now; in theory, locking out some features might have been a favor to some of the customers.

Sayrus

But it wouldn't if all cards supporting this were "unlocked" by default and thus the other "enterprise-grade" cards weren't that much more expensive. Of course that'd reduce profits by a lot.

greggsy

Which (as controversial as it sounds in this kind of forum) is a sensible pricing model to recover and fund R&D and finance operations.

golergka

If I'm a hardware manufacturer and my soft lock on product feature doesn't work, I'll switch to a hardware lock instead, and the product will just cost more.

matheusmoreira

> the most productive era of the PC was also when it was the most open

The openness certainly was great but it's not actually required. People can figure out how to work with closed systems. Adversarial interoperability was common. People would reverse engineer things and make the software work whether or not the manufacturer wanted it.

It's the software and hardware locks that used to be rare and are now common. Cryptography was supposed to be something that would empower us but it ended up being used against us to lock us out of our own machines. We're no longer in the driver's seat. Our operating systems don't even operate the system anymore. Our free Linux systems are just the "user OS" in the manufacturer's unknowable amalgamation of silicon running proprietary firmware, just a little component to be sandboxed away from the real action.

mhh__

Nvidia's software is their moat.

thot_experiment

That's a huge overstatement, it's a big part of the moat for sure, but there are other significant components (hardware, ecosystem lock-in, heavy academic incentives)

mhh__

No software -> hardware is massively hobbled. Evidence: AMD.

Ecosystem -> Software. At the moment especially people are looking for arbitrages everywhere i.e. inference costs / being able to inference at all (llama.cpp)

Academics -> Also software but easily fiddled with a bit of spending as you say.

No1

The original justification that Nvidia gave for removing Nvlink from the consumer grade lineup was that PCIe 5 would be fast enough. They then went on to release the 40xx series without PCIe 5 and P2P support. Good to see at least half of the equation being completed for them, but I can’t imagine they’ll allow this in the next gen firmware.

HPsquared

Is this one of those features that's disabled on consumer cards for market segmentation?

mvkel

Sort of.

An imperfect analogy: a small neighborhood of ~15 houses is under construction. Normally it might have a 200kVA transformer sitting at the corner, which provides appropriate power from the grid.

But there is a transformer shortage, so the contractor installs a commercial-grade 1250kVA transformer. It can power many more houses than required, so it's operating way under capacity.

One day, a resident decides he wants to start a massive grow farm and figures out how to activate that extra transformer capacity just for his house. That "activation" is what geohot found.

bogwog

That's a poor analogy. The feature is built into the cards that consumers bought, but Nvidia is disabling it via software. That's why a hacked driver can enable it again. The resident in your analogy is just freeloading off the contractor's transformer.

Nvidia does this so that customers that need that feature are forced to buy more expensive systems instead of building a solution with the cheaper "consumer-grade" cards targeted at gamers and enthusiasts.

bpye

This isn’t even the first time a hacked driver has been used to unlock some HW feature - https://github.com/DualCoder/vgpu_unlock

segfaultbuserr

Except that in the computer hardware world, the 1250 kVA transformer was used not because of a shortage, but because making a 1250 kVA transformer on the existing production line and selling it as 200 kVA is cheaper than setting up a separate production line for 200 kVA transformers.

hatthew

And then because this residential neighborhood now has commercial grade power, the other lots that were going to have residential houses built on them instead get combined into a factory, and the people who want to buy new houses in town have to pay more since residential supply was cut in half.

zten

This represents pretty well how gamers (residential buyers) are going to feel when the next generation of consumer cards are scooped up for AI.

HPsquared

Excellent analogy of the other side of this issue.

cesarb

That's a bad analogy, because in your example, the consumer is using more of a shared resource (the available transformer, wiring, and generation capacity). In the case of the driver for a local GPU card, there's no sharing.

A better example would be one in which the consumer has a dedicated transformer. For instance, a small commercial building which directly receives 3-phase 13.8 kV power; these are very common around here, and these buildings have their own individual transformers to lower the voltage to 3-phase 127V/220V.

m3kw9

Where is the hack in this analogy?

pixl97

Taking off the user's panel on the side of their house and flipping it to 'lots of power' when that option had previously been covered up by the panel interface.

rustcleaner

I am sure many will disagree-vote me, but I want to see this practice in consumer devices either banned or very heavily taxed.

xandrius

You're right. Especially because you didn't present your reasons.

wmf

Of course power users want an end to price discrimination because it benefits them... at a cost of more expensive products for the masses.

yogorenapan

Curious as to your reasoning.

imtringued

Well, they have zero incentive to implement and test this feature for consumer GPUs. Multi-GPU setups never really worked that well for gaming.

ivanjermakov

I was always fascinated by George Hotz's hacking abilities. Inspired me a lot for my personal projects.

jgpc

I agree. It is fascinating. When you observe his development process (btw, it is worth noting his generosity in sharing it like he does), he frequently gets stuck on random shallow problems which a perhaps more knowledgeable engineer would find less difficult. It is common to see him writing really bad code, or even wrong code. The whole Twitter chapter is a good example. Yet, by himself, just iterating resiliently, he just as frequently creates remarkable improvements. A good example to learn from. Thank you geohot.

zoogeny

This matches my own take. I've tuned into a few of his streams and watched VODs on YouTube. I am consistently underwhelmed by his actual engineering abilities. He is that particular kind of engineer who constantly shits on other people's code or on the general state of programming, yet his actual code is often horrendous. He will literally call someone out for some code in tinygrad that he has trouble with, and then he will go on a tangent to attempt to rewrite it. He will use the most blatant and terrible hacks, only to find himself out of his depth and reverting back to the original version.

But his streams last 4 hours or more. And he just keeps grinding and grinding and grinding. What the man lacks in raw intellectual power he makes up for (and more) in persistence and resilience. As long as he is making even the tiniest progress he just doesn't give up until he forces the computer to do whatever it is he wants it to do. He also has no boundaries on where his investigations take him. Driver code, OS code, platform code, framework code, etc.

I definitely couldn't work with him (or work for him) since I cannot stand people who degrade the work of others while themselves turning in sub-par work as if their own shit didn't stink. But I begrudgingly admire his tenacity, his single minded focus, and the results that his belligerent approach help him to obtain.

ctrw

There are developers who have breadth and developers who have depth. He is very much on the breadth end of the spectrum. It isn't lack of intelligence but lack of deep knowledge of esoteric fields you will use once a decade.

That said, I find it a bit astonishing how little AI he uses on his streams. I convert all the documentation I need into a RAG system that I query stupid questions against.

spirobelv2

link your github. want to see your raw intellectual power

gorkish

If a stopped clock is right twice a day, relentlessly winding a clock forward will make it right quite frequently. That is geohot.

vrnvu

I agree, I feel so inspired with his streams. Focus and hard work, the key to good results. Add a clear vision and strategy, and you can also accomplish “success”.

Congratulations to him and all the tinygrad/comma contributors.

sambull

He's got that focus like a military pilot on a long flight.

postalrat

Any time I open the guy's stream, half of it is some sort of politics.

CYR1X

You can blame chat for that lol

Jerrrry

His Xbox 360 laptop was the crux of teenage motivation for me.

llm_trw

Skimming the readme this is p2p over PCIe and not NVLink in case anyone was wondering.

formerly_proven

RTX 40 doesn’t have NVLink on the PCBs, though the silicon has to have it, since some sibling cards support it. I’d expect it to be fused off.

llm_trw

A cursory google search suggests that it's been removed at the silicon level.

steeve

jsheard

I'm pretty sure that's just a remnant of a 3090 PCB design that was adapted into a 4090 PCB design by the vendor. None of the cards based on the AD102 chip have functional NVLink, not even the expensive A6000 Ada workstation card or the datacenter L40 accelerator, so there's no reason to think NVLink is present on the silicon anymore below the flagship GA100/GH100 chips.

HeatrayEnjoyer

How to unfuse it?

magicalhippo

I don't know about this particular scenario, but typically fuses are small wires or resistors that are overloaded so they irreversibly break the connection. Hence the name.

Either done during manufacture or as a one-time programming[1][2].

Though reprogrammable configuration bits are sometimes also called fuse bits. The ATmega328P of Arduino fame uses flash[3] for its "fuses".

[1]: https://www.nxp.com/docs/en/application-note/AN4536.pdf

[2] https://www.intel.com/programmable/technical-pdfs/654254.pdf

[3]: https://ww1.microchip.com/downloads/en/DeviceDoc/Atmel-7810-...

mepian

Use a Focused Ion Beam instrument.

klohto

AFAIK the 4090 doesn't support PCIe 5.0, so you are limited to 4.0 speeds. Still an improvement.

jsheard

It'll be nice while it lasts, until they start locking this down in the firmware instead on future architectures.

mnau

Sure, but that was something that was always going to happen.

So it's better to have it at least for one generation instead of no generation.

jagrsw

Was it George himself, or a person working for a bounty that was set up by tinycorp?

Also, a question for those knowledgeable about the PCI subsys: it looked like something NVIDIA didn't care about, rather than something they actively wanted to prevent, no?

toast0

PCI devices have always been able to read and write to the shared address space (subject to IOMMU); most frequently used for DMA to system RAM, but not limited to it.

So, poking around to configure the device to put the whole VRAM in the address space is reasonable, subject to support for resizable BAR or just having a fixed size large enough BAR. And telling one card to read/write from an address that happens to be mapped to a different card's VRAM is also reasonable.

I'd be interested to know whether PCIe switching capacity will be a bottleneck, or if it'll just be the point-to-point links and VRAM that bottleneck. Saving a bounce through system RAM should help in either case, though.
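
On Linux you can eyeball the BAR sizes from sysfs; a sketch (assumes the standard sysfs layout, and which BAR holds the VRAM aperture varies by vendor):

    from pathlib import Path

    for dev in sorted(Path("/sys/bus/pci/devices").iterdir()):
        if not (dev / "class").read_text().startswith("0x03"):   # display controllers only
            continue
        bars = (dev / "resource").read_text().splitlines()[:6]   # the six standard BARs
        for i, line in enumerate(bars):
            start, end, _flags = (int(field, 16) for field in line.split())
            if end > start:
                print(f"{dev.name} BAR{i}: {(end - start + 1) / 2**20:.0f} MiB")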

namibj

A fixed large BAR exists in some older accelerator cards, e.g. (iirc) the MI50/MI60 from AMD (the data center variants of the Radeon VII, the first GPU with PCIe 4.0, also famous for dominating memory bandwidth until the RTX 40-series took that claim back; it had 16GB of HBM delivering 1TB/s of memory bandwidth).

It's notably not compatible with some legacy boot processes and, iirc, with 32-bit kernels in general, so consumer cards had to wait for Resizable BAR to get the benefits of a large BAR. The main benefit is direct flat memory mapping of VRAM, so CPUs and PCIe peers can read and write all of VRAM without dancing through a command interface with doorbell registers. AFAIK it also allows a GPU to talk directly to NICs and NVMe drives by running the driver in GPU code (I'm not sure how/if they let you properly interact with doorbell registers, but polled io_uring as an ABI would be no problem, and I wouldn't be surprised if some NIC firmware already allows offloading this).

mtlynch

Commits are by geohot, so it looks like George himself.

throw101010

I've seen him work on tinygrad on his Twitch livestream a couple of times, so it's more than likely him indeed.

squarra

He also documented his progress on the tinygrad discord

throwaway8481

I feel like I should say something about Discord not being a suitable replacement for a forum or bug tracker.

guywhocodes

We are talking about a literal monologue while poking at a driver for a few hours; this wasn't a huge project.

rfoo

Glad to see that geohot is back being geohot, first by dropping a local DoS for AMD cards, then this. Much more interesting :p

jaimehrubiks

Is this the same guy that hacked the PS3?

dji4321234

He has a very checkered history with "hacking" things.

He tends to build heavily on the work of others, then use it to shamelessly self-promote, often to the massive detriment of the original authors. His PS3 work was based almost completely on a presentation given by fail0verflow at CCC. His subsequent self-promotion grandstanding world tour led to Sony suing both him and fail0verflow, an outcome they were specifically trying to avoid: https://news.ycombinator.com/item?id=25679907

In iPhone land, he decided to parade around a variety of leaked documentation, endangering the original sources and leading to a fragmentation in the early iPhone hacking scene, which he then again exploited to build on the work of others for his own self-promotion: https://news.ycombinator.com/item?id=39667273

There's no denying that geohot is a skilled reverse engineer, but it's always bothersome to see him put on a pedestal in this way.

pixelpoet

There was also that CheapEth crypto scam he tried to pull off.

hatware

[dead]

delfinom

[flagged]

mikepurvis

Yes, but he spent several years in self-driving cars (https://comma.ai), which while interesting is also a space that a lot of players are in, so it's not the same as seeing him back doing stuff that's a little more out there, especially as it pertains to IP.

nolongerthere

Did he abandon this effort? That would be pretty sad, because he was approaching the problem from a very different perspective.

mepian

Yes, that's him.

WithinReason

And the iPhone

yrds96

And android

modeless

What are the chances that Nvidia updates the firmware to disable this and prevents downgrading with efuses? Someday cards that still have older firmware may be more valuable. I'd be cautious upgrading drivers for a while.
