
sxp

> "Extremely fast"

When people make claims like this, it would be good if they put the benchmarks on the first page. E.g., how does it compare with https://github.com/gfx-rs/wgpu, which lets you target Vulkan, Metal, DX, GL, or WASM+WebGPU with Rust?

smoldesu

I hope this doesn't come off as handwaving, but you're kinda comparing apples to oranges here. CUDA has always been in a class of its own when it comes to GPU compute, for better and worse, so I think the people out there who want to use this will pretty quickly get an idea of who it's for. Benchmarks would be nice, but I don't really think they'd be germane when comparing a proprietary compute system with a generic cross-platform GPU binding.

rak1507

What? So you can just claim something is 'fast' with no evidence, but then if it is slower than proprietary things, who cares? That's ridiculous - if you claim something is 'extremely fast' you should back that up with benchmarks.

joe_guy

If it's not relative to anything, then the word "fast" doesn't have much meaning.

seeekr

"Fast", to me, from a software development perspective, can still be meaningful: it means knowing which techniques, patterns, and paths enable performant execution, and providing easy and straightforward paths for the user along those. Which, ultimately, leads to high performance in most of the ways the user will apply the provided framework (in a more general sense of the word). Hope that makes sense.

And it must be OK to claim "fast" as a goal from the early stages of a project, even before it may be possible to create any meaningful benchmarks, as long as the precise stage of development or maturity the project is at is discernible to the intended audience. Which, I believe, the project in question is communicating just fine ("still in early development").

outworlder

Would have it made any difference had the parent mentioned OpenCL?

nynx

wgpu is a library for running WGSL on GPUs, not Rust.

nextaccountic

It turns out that when wgpu is combined with rust-gpu, it can run Rust on GPUs too:

https://github.com/EmbarkStudios/rust-gpu/tree/main/examples...

(the "builder" directory builds shaders with the spirv-builder crate)

zozbot234

https://github.com/embarkstudios/rust-gpu would be the closest equivalent, AFAICT.

undefined

[deleted]

raggi

I wrote some OpenCL code recently, wrapped in a Rust program using opencl3. My CL code was pretty slow early on; it's a math-heavy problem implemented initially with a bunch of for loops.

I largely "optimized" it by smattering the code with #pragma unroll, which was exceptionally effective for the problem at hand, given that I had tons of statically defined loops that could be easily unrolled.

I know Rust has a tendency to aggressively inline and flatten everything, but I'm curious about things like this, where it can be so important in this domain to ensure things are unrolled, as the loop conditionals can otherwise introduce pipeline/concurrency issues.
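For what it's worth, Rust has no stable equivalent of `#pragma unroll`, but a loop whose trip count is a compile-time constant gives LLVM the same opportunity to unroll fully. A minimal CPU-side sketch (the `dot8` function is hypothetical, just for illustration):

```rust
// Sketch: a loop over a fixed-size array has a statically known trip
// count, which LLVM will typically unroll completely at opt-level 2+,
// similar in effect to `#pragma unroll` in OpenCL/CUDA C.
fn dot8(a: &[f32; 8], b: &[f32; 8]) -> f32 {
    let mut acc = 0.0f32;
    // Trip count is the constant 8: no runtime loop conditional needs
    // to survive into the optimized code.
    for i in 0..8 {
        acc += a[i] * b[i];
    }
    acc
}

fn main() {
    let a = [1.0f32; 8];
    let b = [2.0f32; 8];
    assert_eq!(dot8(&a, &b), 16.0);
    println!("{}", dot8(&a, &b));
}
```

Whether unrolling actually happens is still up to the optimizer; inspecting the emitted PTX/assembly is the only way to be sure.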

wpietri

When they say "extremely fast GPU code" I think they mean relative to Rust's normal home: standard CPUs. So I don't think the claim needs any specific support.

quietbritishjim

To me, it is clear that "fast GPU code" means GPU code that is faster than other GPU code.

Like a "fast sports car" would obviously have to be faster than other sports cars. You couldn't (truthfully) make that claim if it was faster than regular cars but slower than other sports cars.

"Accelerate Rust code by using the GPU" would be a more honest description.

wpietri

I agree it could be read both ways. I'm just saying what I think their intent is.

dragontamer

https://github.com/RDambrosio016/Rust-CUDA/blob/master/guide...

* Missing Atomics -- Gamebreaker IMO. Atomics are absolutely essential when you are dealing with 10,000+ threads on a regular basis. You'll inevitably come across a shared data structure that requires write access from each thread, and some coordination mechanism is needed for that. Atomics are one important such mechanism.

Ironically, a few days ago I argued for the use of fork-join parallelism in most cases (aka kernel launch / synchronized kernel exits). Now I find myself arguing the opposite, since the topic here is missing atomics. Like... atomics need to be used very, very rarely, but those rare uses are incredibly important.

* Warp Vote / Match / Reduce / Shuffle missing (Very useful tools for highly-optimized code, but you can write slower code that does the same thing through __shared__ memory just fine)

------

Wait, does this support __shared__ memory at all? Raw access to memory is not really amenable to Rust's programming style, but it's absolutely necessary for high-performance GPU programming.

If this is missing __shared__ memory concepts, then the entire idea of "efficient GPU code" is dead IMO. GPU threads can only communicate quickly over __shared__ memory within an OpenCL Workgroup / CUDA Block (A Workgroup or Block is roughly a grouping of 1024 "threads" or SIMD-lanes)

All other forms of communication are incredibly slow. Atomics are maybe the next fastest form of communication, but only across __shared__ memory. Relaxed Atomics to global memory are reasonably performant but once you have either Seq-cst or Acquire/Release semantics (aka: the right memory barriers in the right place), things slow down dramatically in GPU-land.

The big issue is that __shared__ memory is only 64kB in size, and it's locked down to workgroups / blocks. On NVidia GPUs, the __shared__ memory "eats" into your L1 cache as well. (In fact, __shared__ memory can be thought of as programmer-managed cache: the caching heuristics just aren't good enough for high-performance GPU programmers, who want to manually manage that high-speed memory for maximum performance.)
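A CPU-side sketch of the ordering trade-off described above (the `count` function is hypothetical): Rust's atomics expose the same Relaxed vs SeqCst distinction that the comment mentions for GPU global atomics, where Relaxed avoids the expensive memory fences.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// A shared counter bumped by many threads. Relaxed ordering only
// guarantees atomicity of each increment (cheap, no fences); SeqCst
// additionally imposes a global order (on GPUs, dramatically slower).
// Both produce the same final count here, since nothing else depends
// on the ordering of the increments.
fn count(ordering: Ordering) -> u64 {
    let counter = AtomicU64::new(0);
    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| {
                for _ in 0..1000 {
                    counter.fetch_add(1, ordering);
                }
            });
        }
    });
    counter.load(Ordering::SeqCst)
}

fn main() {
    assert_eq!(count(Ordering::Relaxed), 4000);
    assert_eq!(count(Ordering::SeqCst), 4000);
    println!("ok");
}
```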

rdambrosio

As I mentioned, it is an early project; just making the simplest kernel compile was very difficult. Atomics and shared memory are great, but both are very difficult: atomics need "proper" atomics (i.e. special instructions on sm_70+ and emulation on <sm_70), and shared memory needs some weird codegen support. I will get to both of them. Nevertheless, noalias does cause significant performance speedups in memory-bound kernels; see this blog post: https://developer.nvidia.com/blog/cuda-pro-tip-optimize-poin...

So please do not be surprised that an early project does not contain every single feature of CUDA, something that's been around for well over a decade.

dragontamer

No problem. I understand it's a work in progress.

I'd push most strongly for CUDA __shared__ support first, along with thread barriers (CUDA's __syncthreads()), followed by __shared__ atomics, and finally global atomics plus the associated memory-barrier stuff (e.g. seq-cst and acquire/release atomics would work but may be a bit difficult; it might be easier to support the older-style memory barriers instead?).
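The __syncthreads() pattern above can be sketched on the CPU with `std::sync::Barrier` (the `phased_sum` function is hypothetical): every "thread" publishes a partial result, waits at the barrier, and only then reads its neighbours' values, exactly the write-barrier-read shape of a __shared__-memory reduction.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Barrier;
use std::thread;

// Barrier plays the role of __syncthreads(); the atomic array stands
// in for a __shared__ buffer visible to all threads in the "block".
fn phased_sum(n: usize) -> u32 {
    let barrier = Barrier::new(n);
    let partial: Vec<AtomicU32> = (0..n).map(|_| AtomicU32::new(0)).collect();
    let totals: Vec<u32> = thread::scope(|s| {
        let handles: Vec<_> = (0..n)
            .map(|tid| {
                let barrier = &barrier;
                let partial = &partial;
                s.spawn(move || {
                    // Phase 1: each "thread" publishes its partial value.
                    partial[tid].store(tid as u32 + 1, Ordering::Relaxed);
                    // __syncthreads() analogue: nobody proceeds until
                    // every thread has written.
                    barrier.wait();
                    // Phase 2: now safe to read everyone's results.
                    partial.iter().map(|p| p.load(Ordering::Relaxed)).sum::<u32>()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    totals[0]
}

fn main() {
    assert_eq!(phased_sum(4), 10); // 1 + 2 + 3 + 4
    println!("ok");
}
```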

--------

EDIT: Alternatively, maybe making a Thrust-like library for Rust is a better step 1? High performance GPU-code is very peculiar and strange. I can't imagine there's a big market for it. It seems like most people writing GPU "code" these days are just Python programmers punching in a Tensorflow, rather than actually trying to write a high-performance GPU thingy.

structural

Hi! Engineer who writes many high-performance GPU things. We don't get nearly as much attention since part of the job is to write language bindings so that not everyone on the team has to look at CUDA kernels directly. There's a pretty big market for it, but it's still hard/expensive to both write performant code and train people to be able to do so. A lot of it reads roughly similar to CPU math library implementations (programming in "assembler" via compiler intrinsics). I'd say any way this can be made easier without losing a ton of performance will just make more problems more affordable to solve, since the time of senior performance/optimization engineers is so limited.

I'd argue that this project will be successful if you can write at least a few primitives of Thrust-like library in it: it'll give reasonably-sized problems to tackle, a production implementation to compare against for both testing and performance benchmarking, and demonstrate that the project has the basic functionality (optimized primitives like parallel sums/reductions are building blocks for a lot of useful things).

jabl

> EDIT: Alternatively, maybe making a Thrust-like library for Rust is a better step 1? High performance GPU-code is very peculiar and strange. I can't imagine there's a big market for it. It seems like most people writing GPU "code" these days are just Python programmers punching in a Tensorflow, rather than actually trying to write a high-performance GPU thingy.

What NVIDIA is pushing these days in the HPC space is, in addition to CUDA, so-called "standards-based parallelization" (stdpar), meaning the NVIDIA HPC SDK compilers can offload C++17 parallel algorithms and Fortran DO CONCURRENT + certain array intrinsics to GPUs. And similarly in the Python world there's Legate numpy and CuNumeric. Now of course these are quite limited and can only offload certain relatively simple algorithms, but for many real-world computational problems this is all they need, and if you can get 90% of the performance of a full CUDA implementation for 1% of the effort it's a pretty attractive proposition.

You can think of it a bit like a flowchart:

1. Can your problem be solved with a well-known algorithm that already has an optimized GPU implementation (cuBLAS, cuFFT, cuTensor, etc. etc.)? Use the existing library implementation!

2. Can your problem be expressed in a relatively straightforward data-parallel fashion? Use the standards-based parallelization with C++, Fortran, Numpy.

3. Can your problem be expressed with directive-based parallelization like OpenACC or OpenMP? Use those.

4. If neither of the above work, sure, drop down to CUDA.
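Tier 2 of the flowchart above (a plain data-parallel algorithm) can be sketched in Rust with scoped threads (the `parallel_square` function is hypothetical): in C++17 this would be std::transform with std::execution::par, which nvc++ can offload; here the same shape runs on CPU threads.

```rust
use std::thread;

// A map over independent elements: each chunk of the slice is
// processed by its own thread, with no communication between them.
// This independence is exactly what makes a problem offloadable by
// stdpar-style compilers.
fn parallel_square(data: &mut [f64]) {
    let n_threads = 4;
    let chunk = (data.len() + n_threads - 1) / n_threads; // ceil division
    thread::scope(|s| {
        for part in data.chunks_mut(chunk) {
            s.spawn(move || {
                for x in part {
                    *x = *x * *x;
                }
            });
        }
    });
}

fn main() {
    let mut v: Vec<f64> = (1..=8).map(|i| i as f64).collect();
    parallel_square(&mut v);
    assert_eq!(v, vec![1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0]);
    println!("{:?}", v);
}
```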

None of this means that CUDA is going away, NVIDIA continues to invest heavily into the CUDA ecosystem. And CUDA is the foundation upon which the first three options in the list are built upon. Think of it more like bringing new programmers into the GPGPU fold, people who before didn't bother with utilizing a GPU at all.

Here's a recentish slide deck about it: https://www.alcf.anl.gov/sites/default/files/2021-03/NVHPC-S...

Now, getting back to Rust, yes I think stdpar-like support would be nice, but it depends on robust underlying support for handling the GPU. Something that this Rust-CUDA effort could provide.

Disclaimer: I work for NVIDIA, though not in this particular space. All opinions my own.

nynx

Looks like they just haven't gotten around to it. Rust has fine language support for atomics and shared memory.

dragontamer

You're lucky I've had this discussion before with other Rust programmers. But I forgot about this issue...

CUDA __shared__ memory is a 64kB block of RAM that is located __INSIDE THE CORE__ and is incredibly fast, but has very peculiar semantics. (Since the memory never "leaves" the core, it's "stuck" only on a small set of 1024 "threads". Difficult to manage when you write 30,000+ thread programs, but the speed benefits are insane and well worth the trouble.)

Rust "shared" memory is thread-to-thread communications that simply doesn't exist in the GPU-world.

-------

Maybe it'd be more appropriate if I used OpenCL terminology instead of CUDA terminology here, because Rust programmers have an unfortunate name conflict. In OpenCL, this 64kB on-core buffer is called __local memory. Does that help?

Whenever I said "__shared__", replace that in your mind with __local instead. Because CUDA __shared__ is NOTHING like Rust-Shared.

DSingularity

The 1024 threads of a warp/block/whatever or just the current threads or what?

nynx

That sounds a little complicated to deal with, but I see no reason why either the Rust atomic types or a new type supplied by the rust-cuda library couldn't handle that just fine.

I just want to make sure that you realize that Rust does have regular atomics (and that's how other shared memory abstractions are generally implemented underneath).

WithinReason

> Atomics are absolutely essential

Most problems don't need atomics to solve

> If this is missing __shared__ memory concepts, then the entire idea of "efficient GPU code" is dead IMO

Sure, shared memory is great, but not always needed and communicating through global memory can be fast if you stay inside the L1/L2 cache and hide latency.

dragontamer

> Most problems don't need atomics to solve

How do you handle a global, concurrent memory write and/or read ?? (across many different blocks, maybe even across different grids).

For example: let's say you have a global hash table and 30,000 CUDA threads are running. How do you insert data into the hash table safely?
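The lock-free insert being asked about can be sketched on the CPU (the `insert` function and constants are hypothetical): an open-addressing table where each slot is claimed with one compare-and-swap, the same shape that atomicCAS gives you in CUDA.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

const EMPTY: u64 = 0; // sentinel: keys must be non-zero
const TABLE_SIZE: usize = 64;

// Linear-probing insert: try to claim a slot with compare_exchange;
// if another thread already stored this key, that also counts as done.
fn insert(table: &[AtomicU64], key: u64) -> bool {
    let mut slot = (key as usize) % table.len();
    for _ in 0..table.len() {
        match table[slot].compare_exchange(EMPTY, key, Ordering::AcqRel, Ordering::Acquire) {
            Ok(_) => return true,                            // claimed an empty slot
            Err(existing) if existing == key => return true, // already present
            Err(_) => slot = (slot + 1) % table.len(),       // probe next slot
        }
    }
    false // table full
}

fn main() {
    let table: Vec<AtomicU64> = (0..TABLE_SIZE).map(|_| AtomicU64::new(EMPTY)).collect();
    // Four "threads" insert 32 distinct keys concurrently.
    thread::scope(|s| {
        for t in 0..4u64 {
            let table = &table;
            s.spawn(move || {
                for k in 1..=8u64 {
                    assert!(insert(table, k * 10 + t));
                }
            });
        }
    });
    let filled = table.iter().filter(|s| s.load(Ordering::Relaxed) != EMPTY).count();
    assert_eq!(filled, 32);
    println!("{filled}");
}
```

Without a CAS primitive, two threads can both see a slot as empty and both write to it, silently losing one insert; that's why atomics are non-negotiable for this kind of structure.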

> Sure, shared memory is great, but not always needed and communicating through global memory can be fast if you stay inside the L1/L2 cache and hide latency.

Scan operations (sum, min, max, AND, OR to name a few) through the parallel prefix pattern (https://en.wikipedia.org/wiki/Prefix_sum). How do you plan to do it, if not through __shared__ memory?

This is a fundamental operation in almost all GPU code I've ever seen. Just pushing / popping to a stack will require a prefix-sum to determine the size across the workgroup.

If you can't do prefix-sum, you won't be able to do effective load-balancing on a GPU. This is something that'd normally take a dozen clock ticks, but if you do it over L2 you're looking at hundreds of clock ticks instead.
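The scan pattern referenced above can be sketched as a Hillis-Steele inclusive prefix sum (the `prefix_sum` function is hypothetical): on a GPU each of the log2(n) passes is one __shared__-memory round followed by __syncthreads(); here the passes are plain loops so the data movement is easy to see.

```rust
// Hillis-Steele inclusive scan: after pass with stride s, element i
// holds the sum of elements (i - 2s + 1)..=i from the original input.
// The per-pass clone stands in for the double-buffering / barrier a
// GPU kernel would use; it is O(n log n) work, traded for low depth.
fn prefix_sum(input: &[u32]) -> Vec<u32> {
    let mut buf = input.to_vec();
    let mut stride = 1;
    while stride < buf.len() {
        // One "pass": every lane i >= stride adds the value at i - stride.
        // On a GPU all lanes do this in lockstep, then hit a barrier.
        let prev = buf.clone();
        for i in stride..buf.len() {
            buf[i] = prev[i] + prev[i - stride];
        }
        stride *= 2;
    }
    buf
}

fn main() {
    let v = prefix_sum(&[3, 1, 7, 0, 4, 1, 6, 3]);
    assert_eq!(v, vec![3, 4, 11, 11, 15, 16, 22, 25]);
    println!("{:?}", v);
}
```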

------

Sorting networks are also probably best implemented in __shared__ memory... with maybe warp-level intrinsics beating them. (But warp-level programming is much much harder and I prefer to avoid it).

WithinReason

As I said, these features are often not needed. You can implement e.g. a neural network library without needing atomic operations.

> How do you plan to do it, if not through __shared__ memory?

Can't you use workgroup barriers and global memory the same way you'd use __shared__ memory? Might be slower, but good caching should make it comparable, which should be the case for prefix sum (you read right after writing, so you should get a good cache-hit probability).

floatboth

Global hash table on GPUs sounds cursed and perverted. What you’re meant to do with them is run the same computation on all pixels in parallel independently! :P

shaklee3

Shared memory and L1 are the same on modern GPUs, but shared memory gives you control of what goes on there. I agree with dragontamer that it's absolutely essential to have atomics and shared memory.

WithinReason

How is it "absolutely essential" if you just want to sum 2 large arrays for example?

DiabloD3

Since this doesn't target the majority of GPUs, nor the majority of GPUs used in enterprise compute, why continue to target CUDA? If you target SPIR-V, this can be used with all the existing modern APIs (Vulkan, OpenCL, OpenGL; Microsoft has a SPIR-V-to-DXIL translator, third parties have Vulkan/SPIR-V-on-legacy-API support, and there is at least one project that converts LLVM IR to SPIR-V).

... but if you target NVVM alone, you're stuck on the minority of machines that can run CUDA. Even Nvidia has admitted that SPIR-V is the future: they are both a major contributor to Khronos specs and the maker of a top-notch SPIR-V compiler that produces better code for Nvidia GPUs than their NVVM compiler.

gaze

The writing is better than I might have produced as a first-year college student, but this needs copy editing. I might suggest that the word “extremely” be removed not only from the entire repository but from all of the user’s repositories. “Fast” might be removed as well; nobody is trying to generate slow GPU code. The salient feature is that one can write the same code as one might write in CUDA, with the advantages of Rust’s type system, which is indeed useful! However, there’s no speed to be gained by using this relative to CUDA.

rdambrosio

> there’s no speed to be gained by using this relative to CUDA.

That is not totally true; there are two main things that can make kernels generated by this codegen faster:

- noalias, the LLVM equivalent of __restrict__: CUDA can take massive advantage of noalias by using the read-only cache if a pointer does not alias. If you don't believe me, just take a look at NVIDIA's blog post: https://developer.nvidia.com/blog/cuda-pro-tip-optimize-poin...

- CUDA builds all the .cu files and links their PTX together, which means no LTO happens by default. I do something different, where I lazily load every module using dependency graphs, which gives you the benefit of LTO by default. It's not perfect, because right now it leaves behind a lot of dead code, but I know how to fix it.
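The noalias point can be illustrated with a plain Rust sketch (the `saxpy` function is hypothetical): holding a `&[f32]` and a `&mut [f32]` simultaneously already guarantees the two slices don't overlap, which is exactly the property C code has to assert manually with __restrict__.

```rust
// In C, `x` and `y` could alias, forcing the compiler to re-load x[i]
// after every store to y[i] unless the pointers are __restrict__.
// Rust's borrow rules make &mut exclusive, so the compiler can emit
// noalias (and CUDA can use the read-only cache path for x) for free.
fn saxpy(alpha: f32, x: &[f32], y: &mut [f32]) {
    for (yi, xi) in y.iter_mut().zip(x) {
        *yi += alpha * *xi;
    }
}

fn main() {
    let x = vec![1.0f32, 2.0, 3.0];
    let mut y = vec![10.0f32, 10.0, 10.0];
    saxpy(2.0, &x, &mut y);
    assert_eq!(y, vec![12.0, 14.0, 16.0]);
    println!("{:?}", y);
}
```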

jacquesm

But there is relative to plain Rust. Besides, even if nobody is trying to write slow GPU code, it's a very easy thing to get subtly wrong, resulting in tens of percent of speed loss.

scottlamb

> I might suggest that the word “extremely” should be removed not only from the entire repository but all of the user’s repositories.

You've reminded me of this classic, often (mis?)attributed to Mark Twain: [1]

> “Substitute 'damn' every time you're inclined to write 'very;' your editor will delete it and the writing will be just as it should be.”

[1] https://quoteinvestigator.com/2012/08/29/substitute-damn/

imron

> I might suggest...

Best way to suggest is with a pull request.

ajkjk

Not if you're critiquing the way someone writes English.

imron

Depends how it’s worded.

If someone with copy editing experience made constructive suggestions for the readme and other documentation, it may well be appreciated.

A PR also gives the maintainers a way to discuss wording if there are points of disagreement - far more so than a post on HN that they may or may not even see.

127

Would be really nice to have an actual cross-platform GPGPU library. Having only vendor lock-in is really holding every kind of progress back.

Maybe WebGPU will be capable of compute to the extent that CUDA isn't necessary. https://github.com/UpsettingBoy/gpgpu-rs

neatze

> An ecosystem of libraries and tools for writing and executing extremely fast GPU code fully in Rust.

Well at least it does not say extremely safe.

esjeon

The title says fast, but there are no benches; the README puts more emphasis on it being a more usable solution than LLVM-PTX + Rust.

I mean, what’s the point of “fast” if LLVM-PTX is clunky with Rust in the first place?

zozbot234

How would this compare with Accel (which is also built on the cuda ecosystem)?

rdambrosio

Accel uses the LLVM PTX backend, which is unusable for serious projects and doesn't work on Windows. I have more about it here: https://github.com/RDambrosio016/Rust-CUDA/blob/master/guide...

glacambre

>The LLVM PTX backend is still very much WIP and often doesn't have things and/or breaks.

What is missing in your opinion aside from working debug info? I've worked on experimental CUDA support in GNAT (the Ada frontend of GCC), using LLVM's ptx backend to target GPUs and it was mostly working.

Djrhfbfnsks

How does this compare with writing GPU code in Julia?

nextaccountic

Could this maybe support AMD's HIP as well?

rdambrosio

Not in the near future; HIP does not seem to be language-agnostic, and the codegen is for NVVM IR, not whatever AMD uses. It might be possible to target amdgpu with LLVM, because all the GPU-unfriendly things are gone in my codegen. So maybe in the future? I'm not sure.

russdpale

very cool!

m0zg

NVIDIA should hire the guy, then hire whoever he says he wants on the team and let them rip. That's what I'd do if I were in their shoes. Viable paths off C/C++ are badly needed, and currently the only real viable path with an ecosystem and community is Rust.

pjmlp

NVidia chose Ada over Rust for their automotive firmware, their cards are designed around C++'s memory model, and they have a big ISO C++ presence.

m0zg

NVIDIA has so much cash these days that they don't have to do just one thing. They can do _all_ the things, at the same time. This realistically would only take a 5-7 person team, including the manager. That's $2M/yr tops, all in, which is peanuts for NVIDIA. Much less if the author is not based in the US.

pjmlp

With what ROI? This isn't charity.

imtringued

Yeah, do it before he gets the crazy idea of supporting AMD GPUs...

Rust-CUDA: Fast GPU code fully in Rust - Hacker News