Hacker News

brrrrrm

> It compiles a custom kernel for every operation, allowing extreme shape specialization.

This doesn't matter. Just look at the performance of the cuDNN kernels that back PyTorch: they're dynamically shaped and hit near peak. For dense linear algebra at the size of modern neural networks, optimizing for the loop bound condition won't help much.

> All tensors are lazy, so it can aggressively fuse operations.

This matters. PyTorch teams are trying to implement that now (they have LazyTensor, AITemplate, TorchDynamo), but I'm not sure of the status (it's been tried repeatedly).

> The backend is 10x+ simpler, meaning optimizing one kernel makes everything fast.

The first part of that sentence matters, the second part doesn't. Kernels are already fast and their reuse outside of being fused into each other (which you need a full linear algebra compiler to do) isn't very high. If you make sum fast, you have not made matrix multiplication fast even though MM has a sum in it. It just isn't that easy to compose operations and still hit 80+% of hardware efficiency.

But it is easier to iterate fast and build a seamless lazy compiler if your backend is simple. You can pattern match more easily and ensure you handle edge cases without insanely complicated things like alias analysis (which PyTorch has to do).

georgehotz

> they're dynamically shaped and hit near peak

While this is true for most common GEMM looking ops, if you tread off the beaten path things get slow (odd channel sizes, batch sizes, etc...). Right now in PyTorch, GroupNorm is 2x slower than BatchNorm. There's no fundamental reason, just that the kernels loop over axes in a less than ideal order. Dynamic recompilation allows you to change the loop order too, not just deal with boundary conditions.
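The loop-order effect is easy to sketch outside of any framework. Below is a toy numpy illustration (hypothetical, nothing to do with PyTorch's actual kernels): both nests compute the same column sums, but only one walks the array in memory order.

```python
import numpy as np

# Toy sketch of the loop-order point (hypothetical, not PyTorch's
# kernels): both nests compute the same column sums, but the first
# walks the C-contiguous array in memory order (unit-stride inner
# loop), while the second strides down each column, which becomes
# cache-hostile at realistic sizes.
a = np.arange(12, dtype=np.float32).reshape(3, 4)

out1 = np.zeros(4, dtype=np.float32)
for i in range(3):              # rows outer: unit-stride inner loop
    for j in range(4):
        out1[j] += a[i, j]

out2 = np.zeros(4, dtype=np.float32)
for j in range(4):              # columns outer: strided inner loop
    for i in range(3):
        out2[j] += a[i, j]

assert np.allclose(out1, out2)  # same math, different memory traffic
```

A compiler that recompiles per shape can pick whichever nest order matches the layout; a fixed kernel is stuck with whatever order it was written in.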

brrrrrm

> tread off the beaten path things get slow

Yea, makes sense. I think there's something to be said for dynamic compilation solving this problem more elegantly than providing tons of hand-tuned kernels (PyTorch is 890MB lmao https://pypi.org/project/torch/#files), but I don't think it's a strict reason for a performance win.

> change the loop order too

Memory layout as well! I'm 100% for dynamic compilation, but I'm claiming that it really finds its stride when you fuse things.

georgehotz

Agreed. For anything at all common, most of the gains will be from fusion; the rest is just free. PyTorch also uses tons of GPU memory right after initializing; I wonder if it's copying all the kernels in?

twothreeone

> Right now in PyTorch, GroupNorm is 2x slower than BatchNorm

How did you benchmark this? I think there are like 3 or 4 different GN implementations in PyTorch...

georgehotz

Whole-net performance at comma: switching from BatchNorm to GroupNorm adds 70ms to the training step time, and it's -70ms for no norm at all. We also wrote a custom AllNorm that's about 10% slower than BatchNorm (and I put several hours into trying to optimize it). Obviously not indicative of everyone's experience, but my point is that BatchNorm is hyperoptimized and the others, which are pretty much the same computation, aren't.

markisus

What does it mean to "fuse operations"?

brrrrrm

avoiding writes to memory and reducing the number of loops (although not FLOPs)

    for j in range(10):
      c[j] = a[j] + b[j]
    for j in range(10):
      d[j] = c[j] * 2
becomes

    for j in range(10):
      d[j] = (a[j] + b[j]) * 2

thrtythreeforty

Or, better, identifying that the machine has a primitive that is better than doing each op individually. For example, a multiply-accumulate instruction vs a multiply and separate accumulate. The source code still says "a*b+c", the compiler is just expected to infer the MAC instruction.

FL33TW00D

Any more writing on laziness in frameworks? I'm trying to implement it myself.

brrrrrm

The only thing I'd recommend is exposing "eval()" or something to let users tell you when they want you to evaluate things. It'll save a ton of time when it comes to hot-fixing performance and memory use issues. It's really hard to determine when to evaluate, and although it's a fun problem to figure out, it's nice to have an escape hatch for users to just tell you. (Flashlight has explored this and written about it here: https://fl.readthedocs.io/en/latest/debugging.html?highlight...)

If you're interested, I've looked into symbolic laziness, which allows you to infer correct input sizes even when the constraints happen later. Can be useful for errors. https://dev-discuss.pytorch.org/t/loop-tools-lazy-frontend-e...
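The explicit eval() escape hatch is easy to sketch. Here's a minimal hypothetical lazy wrapper (my own toy classes, not Flashlight's or tinygrad's API): operations only build a graph, and nothing computes until the user forces it.

```python
# Minimal sketch of a lazy tensor with an explicit eval() escape hatch
# (hypothetical classes, not any real framework's API): arithmetic
# builds a graph of thunks, and nothing runs until eval() is called.
class Lazy:
    def __init__(self, fn, *parents):
        self.fn, self.parents = fn, parents
        self._cache = None

    def __add__(self, other):
        return Lazy(lambda a, b: a + b, self, other)

    def __mul__(self, other):
        return Lazy(lambda a, b: a * b, self, other)

    def eval(self):
        # Users call this to force computation at a known point, which
        # makes performance and memory issues easy to localize.
        if self._cache is None:
            self._cache = self.fn(*(p.eval() for p in self.parents))
        return self._cache

def const(v):
    return Lazy(lambda: v)

x = (const(2) + const(3)) * const(4)   # builds a graph, computes nothing
print(x.eval())                        # forces evaluation: 20
```

The hard part a real framework adds on top is deciding when to evaluate automatically, which is exactly why the user-facing eval() is such a useful hatch.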

orlp

    > It's extremely simple, and breaks down the most complex networks into 4 OpTypes:
    >
    > - UnaryOps operate on one tensor and run elementwise. RELU, LOG, RECIPROCAL, etc...
    > - BinaryOps operate on two tensors and run elementwise to return one. ADD, MUL, etc...
    > - ReduceOps operate on one tensor and return a smaller tensor. SUM, MAX
    > - MovementOps operate on one tensor and move the data around, copy-free with ShapeTracker. RESHAPE, PERMUTE, EXPAND, etc...
    >
    > But how...where are your CONVs and MATMULs? Read the code to solve this mystery.
Ok, I was curious, so I read the code. The answer is that it represents a MATMUL as a 1x1 CONV. And it lied about CONV, which is a ProcessingOps.CONV, explicitly represented and implemented: https://github.com/geohot/tinygrad/blob/c0050fab8ff0bc667e40... Quite the letdown of figuring out this 'mystery'.

georgehotz

That CONV is only used on the older backends. The GPU and LLVM backend rewrite CONV as MUL+SUM, to be fused later, and thus only use the 4 OpTypes.

https://github.com/geohot/tinygrad/blob/master/tinygrad/lazy...
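A numpy stand-in for that rewrite (just the algebra, not tinygrad's code): a matmul becomes a broadcasted MUL followed by a SUM reduction, and a fusing compiler's job is to avoid ever materializing the expanded intermediate.

```python
import numpy as np

# Sketch of the MUL+SUM rewrite with numpy (a stand-in, not tinygrad's
# code): a (M,K) @ (K,N) matmul becomes an expand to (M,K,N), an
# elementwise MUL, then a SUM reduction over K. A fusing compiler
# avoids materializing the (M,K,N) intermediate.
a = np.random.rand(3, 4).astype(np.float32)   # (M, K)
b = np.random.rand(4, 5).astype(np.float32)   # (K, N)

expanded = a[:, :, None] * b[None, :, :]      # MUL on broadcast views: (M, K, N)
out = expanded.sum(axis=1)                    # SUM over the K axis: (M, N)

assert np.allclose(out, a @ b, atol=1e-5)
```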

WithinReason

That's cool, am I right in assuming that you want to automate the production of efficient GPU (or other accelerator) code based on these low level primitives? But you would still need a piece of sorcery that can produce high performance OpenCL code, right? And that code could be different for every device, so you would need some trial and error, benchmark-based compilation at the very least. Or would OpenCL code be generated by hand for each device?

georgehotz

Yea, benchmark based compilation, that's already happening in the tinygrad compiler we use for openpilot to determine the local group size. https://github.com/geohot/tinygrad/blob/caea34c52996cde2ed46...

Working on parameterizing a search space that includes more than the local group size. The end dream is some ML guided search to optimize the kernels :)
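Benchmark-based compilation in miniature looks something like this (a hypothetical sketch; run_kernel and the block-size search space stand in for a real kernel and a tuning parameter like the local group size): time every candidate on the actual workload and keep the fastest.

```python
import time

# Hypothetical autotuning sketch: run_kernel is a stand-in for a real
# compiled kernel, and the block size stands in for a tuning parameter
# like the local group size. Every candidate is timed on the actual
# workload and the fastest wins.
def run_kernel(block, n=2**16):
    data = list(range(n))
    total = 0
    for start in range(0, n, block):
        total += sum(data[start:start + block])
    return total

def autotune(candidates, repeats=3):
    best, best_t = None, float("inf")
    for block in candidates:
        times = []
        for _ in range(repeats):
            t0 = time.perf_counter()
            run_kernel(block)
            times.append(time.perf_counter() - t0)
        if min(times) < best_t:   # min over repeats rejects timing noise
            best, best_t = block, min(times)
    return best

choice = autotune([64, 256, 1024, 4096])
print("picked block size:", choice)
```

The ML-guided version replaces the exhaustive loop with a model that proposes promising configurations, but the measure-and-keep-the-best skeleton is the same.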

brrrrrm

I've done some work in the past on NN representations and you actually can represent Conv and MatMul in more primitive ways. I ended up writing an IR called loop_tool that exposes this stuff:

https://github.com/facebookresearch/loop_tool/blob/main/pyth...

The idea is basically this: https://news.ycombinator.com/item?id=28883086

WithinReason

To directly quote the source:

    # these are the llops your accelerator must implement, along with toCpu
    UnaryOps = Enum("UnaryOps", ["NOOP", "NEG", "RELU", "EXP", "LOG", "SIGN", "RECIPROCAL"])
    BinaryOps = Enum("BinaryOps", ["ADD", "SUB", "MUL", "DIV", "POW", "CMPEQ"])
    ReduceOps = Enum("ReduceOps", ["SUM", "MAX"])
    MovementOps = Enum("MovementOps", ["RESHAPE", "PERMUTE", "EXPAND", "FLIP", "STRIDED", "PAD", "SHRINK"])
    ProcessingOps = Enum("ProcessingOps", ["CONV"])
https://github.com/geohot/tinygrad/blob/caea34c52996cde2ed46...

There is a MAX but not a MIN? Is that because max(x,y) = -min(-x,-y)? But then why is there a SUB? Why is there a RELU if it's only max(0,x)? Maybe MIN is just too rare to be worth implementing?

georgehotz

Min is an HLOP.

From: https://github.com/geohot/tinygrad/blob/master/tinygrad/tens...

    def min(self, axis=None, keepdim=False): return -((-self).max(axis=axis, keepdim=keepdim))

All folded together, no slower than MAX.

WithinReason

But then SUB, DIV and RELU could be an HLOP as well, no?

liuliu

Very similar idea as Jittor; convolution can definitely be broken down: https://github.com/Jittor/jittor/blob/master/python/jittor/n...

jerpint

Just looking at the code from my phone, but it seems that the conv op calls another primitive and einsum, which I believe is just a fancy MUL with broadcasting? So it might still be technically correct?

PartiallyTyped

Einsum is an expressive way of doing element wise products and then possibly reducing them. An einsum is essentially a description of the dimensions of the input tensors and the dimensions of the resulting output after multiplication. If the output has reduced dimensions, then a summation is applied over them. The package einops provides reductions such as summation, averaging, and so on.

For example, the einsum "b k n p, k -> b k n p" broadcasts the second tensor b to b[None, :, None, None] and does elementwise multiplication. It can be changed to a vector product by writing "b k n p, k -> b n p", which for all intents and purposes is identical to a.transpose(0, 2, 3, 1) @ b.

I can easily recommend the einops package; using einsum simplifies things significantly.
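Those two einsums can be checked directly with numpy (shapes for the b, k, n, p axes picked arbitrarily):

```python
import numpy as np

# Checking the two einsums from the comment above (arbitrary sizes
# for the b, k, n, p axes):
a = np.random.rand(2, 3, 4, 5)   # b k n p
v = np.random.rand(3)            # k

# Broadcasted elementwise product, no reduction:
kept = np.einsum("bknp,k->bknp", a, v)
assert np.allclose(kept, a * v[None, :, None, None])

# Same product, then summed over k -- the vector-product form:
reduced = np.einsum("bknp,k->bnp", a, v)
assert np.allclose(reduced, a.transpose(0, 2, 3, 1) @ v)
```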

sakras

I must say they gained instant credibility with the minimalistic website given how fast it loaded.

Code looks simple and easy to follow, and I love how the comments are constantly mentioning hardware characteristics, making maxing the hardware the goal. It seems that it’s trying to achieve this by jitting optimal code for the operations at hand rather than hand-optimizing kernels, and betting that the small number of operations will make tuning the codegen tractable.

I haven’t kept up much with what’s happening in ML, but at least in the realm of columnar database engines, interpreting a series of hand-optimized kernels seems to be the dominant approach over compiling a vectorized query plan. Are compilers good enough at optimizing ML operations that specializing on input shape makes a difference over hand-tuned kernels?

kklisura

It's geohot. He comes with credibility. [1]

[1] https://en.wikipedia.org/wiki/George_Hotz

JacobiX

I love those tiny DNN frameworks. Some examples that I studied in the past (I still use PyTorch for work-related projects):

thinc by the creators of spaCy https://github.com/explosion/thinc

nnabla by Sony https://github.com/sony/nnabla

LibNC by Fabrice Bellard https://bellard.org/libnc/

Dlib dnn http://dlib.net/ml.html#add_layer

emaro

I love this website. Their style tag literally is:

    <style>
      body {
        font-family:'Lucida Console', monospace
      }
    </style>
Also looks like a very cool project.

arketyp

I like this too and I don't understand the downvotes. It says a lot about the philosophy of the project. Minimalist, bold, brutalist, no-frills first principles thinking. For better and worse.

therealchiggs

There's an interesting roadmap in the "cherry" folder of the git repo[0]. It begins by bringing up a design on FPGA and ends with selling the company for $1B+ by building accelerator cards to compete with NVIDIA:

  Cherry Three (5nm tapeout)
  =====
  * Support DMA over PCI-E 4.0. 32 GB/s
  * 16 cores
  * 8M elements in on board RAM of each core (288 MB SRAM on chip)
  * Shared ~16GB GDDR6 between cores. Something like 512 GB/s
  * 16x 32x32x32 matmul = 32768 mults
  * 1 PFLOP @ 1 ghz (finally, a petaflop chip)
  * Target 300W, power savings from process shrink
  * This card should be on par with a DGX A100 and sell for $2000

  * At this point, we have won.
  * The core Verilog is open source, all the ASIC speed tricks are not.
  * Cherry will dominate the market for years to come, and will be in every cloud.
  * Sell the company for $1B+ to anyone but NVIDIA
[0] https://github.com/geohot/tinygrad/blob/master/accel/cherry/...

lr1970

As it was recently discussed at length here on HN [0] (401 comments), George Hotz (the lead of tinygrad) is taking time off his self-driving startup comma.ai [1]. Curious if this would help or hurt tinygrad progress.

[0] https://news.ycombinator.com/item?id=33406790

[1] https://comma.ai/

fragmede

Of course, the stable diffusion tie-in is not to be missed!

https://github.com/geohot/tinygrad/blob/master/examples/stab...

eterevsky

How is it compared to JAX? After TensorFlow and PyTorch, JAX seems very simple, basically an accelerated numpy with just a few additional useful features like automatic differentiation, vectorization and jit-compilation. In terms of API I don't see how you can go any simpler.

learndeeply

JAX is a DSL on top of XLA rather than regular Python. For example, a JAX for loop looks like this:

   def summ(i, v): return i + v
   x = jax.lax.fori_loop(0, 100, summ, 5)
A for loop in TinyGrad or PyTorch looks like regular Python:

   x = 5
   for i in range(0, 100):
      x += i
By the way, PyTorch also has JIT.

eterevsky

I've just tried making a loop in a jit-compiled function and it just worked:

    >>> import jax
    >>> def a(y):
    ...   x = 0
    ...   for i in range(5):
    ...     x += y
    ...   return x
    ...
    >>> a(5)
    25
    >>> a_jit = jax.jit(a)
    >>> a_jit(5)
    DeviceArray(25, dtype=int32, weak_type=True)

koningrobot

It definitely works; JAX just sees the unrolled loop:

  x = 0
  x += y
  x += y
  x += y
  x += y
  x += y
  return x
The reason you might need `jax.lax.fori_loop` or some such is if you have a long loop with a complex body. Replicating a complex body many times means you end up with a huge computation graph and slow compilation.

cl3misch

He mentioned in a recent stream that he dislikes the complexity of the XLA instruction set used by JAX. So it's less the user-facing API, and more the inner workings of the library.

tucosan

Can someone from the ML crowd ELI5 to me what tinygrad does, how it plugs into an ML pipeline, and what its use cases are?

gamegoblin

There are libraries like tensorflow and PyTorch that allow the user to define their neural net in simple, readable Python code, and they internally "compile" and optimize your neural net to run on GPUs and such.

Tinygrad is like a very, very lean PyTorch with a different philosophy -- it intends to keep the codebase and API surface very very small and focus most of its energy on optimizing the way the output neural net runs on physical hardware.

The author, George Hotz, has observed in the last few years that neural net performance is hindered by lack of optimization here, particularly around memory accesses.

ivalm

“Almost 9k stars” is actually 7.3k stars…

But otherwise very cool project :)

queuebert

Maybe he hired a bunch of bot accounts to star it; then when the accounts were banned the stars were removed? ^_^

dqft

because milestones for children of the 90s

ivalm

I am dumb but now it makes sense. His power is indeed over 9k.

dedoussis

It's funny that geohot/tinygrad chooses to not meet the PEP8 standards [0] just to stay on brand (<1000 lines). Black [1] or any other python autoformatter would probably 2x the lines of code.

[0] https://peps.python.org/pep-0008/

[1] https://github.com/psf/black

kurisufag

to anybody experienced in writing functional-esque oneliners, PEP8 is an appalling waste of space

koningrobot

More like 10x. Black is truly a terrible thing.
