Launch HN: Deepsilicon (YC S24) – Software and hardware for ternary transformers

Hey Hacker News! We’re Abhi and Alex from deepsilicon (https://deepsilicon.com). We are building software and hardware for training and inferencing ternary transformer models. Here's a video of the software: https://www.youtube.com/watch?v=VqBn-I5D6pk.

Transformer-based models are getting bigger every generation, making the inference hardware requirements more and more expensive. Running large transformer models on device is even more challenging. Usually, they require trillions of FLOPs to run at decent speeds and use too much energy and space.

Our solution is to train ternary transformer models. There are two advantages to using ternary values. The first is that the weights can now be stored in two bits (or even less) from 16 bits. This represents an almost 8x compression ratio for every weight matrix in the transformer model (slightly less because of the float16 scaling value and extra norm, but that’s negligible). The second advantage is a reduction in the arithmetic intensity. If we do a dot product between ternary values and INT8 values, we either add the INT8 if the ternary value is 1, subtract the INT8 if the ternary values is -1, or do nothing if the ternary value is 0. There are numerous ways to take advantage of this change in arithmetic, from look up tables to bit mask reductions. As for why ternary and not quaternary/binary, ternary hits a sweet spot of compression and (symmetric) representational value for weights in our experiments.

Currently, hardware is not really optimized for extreme low bit-width matrix operations (whether multiplication or otherwise). We’ve tried various implementations of kernels on both CPUs/GPUs (really only NVIDIA GPUs). We don’t even come close to the theoretical maximum speed for our kernels, and a large part of the failure is because the architecture of existing hardware isn’t optimized for the operations we want them to do. Creating custom silicon for ternary LLMs can accelerate inference by implementing and designing algorithms/circuits that only work for ternary LLMs. Unlike most hardware companies, which need silicon to show improvements, we can already show improvements to active VRAM usage and throughput with our custom kernels on existing hardware. This sets pretty impressive lower bounds for custom silicon.

We originally started working on this after reading the BitNet paper from Microsoft, and were disenchanted that we couldn't run SOTA models on our consumer hardware (3090 and 3070M). Both Alex and I worked on research at Dartmouth, I worked more on the ML/model architecture side, while Alex worked on randNLA CUDA kernels to accelerate training. The research experience, and opportunity to talk to professors, made us realize that if we could pull off ternary transformers, it could solve the large scale inference problem on the edge and cloud.

First, we must either retrain or pretrain a model with our custom linear layers based on the Bitnet 1.58 layers (we’re working on open sourcing our framework for training, data labelling, and synthetic data generation here: https://github.com/deepsilicon/Sila). The model is trained with FP16 weights, but the weights are quantized and the quantization function is detached from the computational graph to allow gradients to flow, and the loss is measured w.r.t. the quantized weights. Once the model converges, we can inference the model with our custom kernels written for CPUs or GPUs (we are working on Inferentia and TPU support). The end goal is to create purpose-built custom silicon to work with the ternary weights, where we can have better compression, throughput, latency, and energy improvements compared to our kernels on existing hardware.

We know this is a highly challenging problem due to technical and market difficulties. Plenty of hardware companies have tried to accelerate inference, but most are not profitable. The biggest problem in the ML hardware market, perplexingly, is software. It's challenging to convince companies to switch to some new hardware when their entire infrastructure and software stack has been configured for some other hardware. On the technical side, we must support various deployment options and model architectures to make large-scale custom silicon production worthwhile. This is compounded by the fact we want to have a single line of code handle everything, abstracting what we're doing away from the ML engineers. So, we need to handle everything on the technical side: compiling the right kernels for your platform, generating the right bindings for ONNX/TensorRT, tuning the kernels, setting the mode to training or inference, etc.

We’d love to hear your opinions about ASICs for transformer inference - and if you know anyone who might be interested in deploying these models, my email is abhi@deepsilicon.com. We can’t wait to hear what you all think!

Daily Digest email

Get the top HN stories in your inbox every day.

danjl

In my experience, trying to switch VFX companies from CPU-based rendering to GPU-based rendering 10+ years ago, a 2-5x performance improvement wasn't enough. We even provided a compatible renderer that accepted Renderman files and generated matching images. Given the rate of improvement of standard hardware (CPUs in our case, and GPU-based inference in yours), a 2-5x improvement will only last a few years, and the effort to get there is large (even larger in your case). Plus, I doubt you'll be able to get your HW everywhere (i.e. mobile) where inference is important, which means they'll need to support their existing and your new SW stack. The other issue is entirely non-technical, and may be an even bigger blocker -- switching the infrastructure of a major LLM provider to a new upstart is just plain risky. If you do a fantastic job, though, you should get aquahired, probably with a small individual bonus, not enough to pay off your investors.

areddyyt

We're targeting the edge market first, such as NVIDIA's Jetson line, because it's far less supported/focussed on. In our experience, whenever we did training runs on H100 clusters with x86, any pip package would be easily installable, and a wide array of software just worked. This is not the case in Jetson, where we constantly have to rebuild packages from source, and in general, NVIDIA will only release a better board every five years. As for the second part of your question, we agree. Much of our work has been trying to make switching to our software layer straightforward (a single line of code). The ideal endgame is that, given an ONNX file, we can parse the generated node tree and determine if our hardware supports all the nodes. Of course, this is assuming we have a large enough share of the market using our software, so we know what operations we need to support on the hardware side of things.

danjl

I cannot see any way of building HW profitably for the Jetson market. You are really competing with Raspberry PI, not Jetson, IMO. I mean, I'm no expert, but I would suggest doing a deep dive on your business plan if you intend to target the small hardware world rather than spending any time designing HW or SW. Then reduce your estimate by at least half since doing anything in that embedded/edge world has many more technical issues.

areddyyt

In general, Jetson has quite a large market. Vehicle companies use automotive-rated Jetson Orins, and defense companies also use Jetson Orins to power ML applications on the edge (Anduril). Many of the companies we currently talk to are robotics companies that are forced to use Jetsons because they are both the least of the bad options and the only edge compute provider with enough juice to run larger transformer models.

danjl

I think the only way this could work is if you had the backing of one of the major LLM providers who decided that your ideas are worth doing a PoC. That way you actually have a client on board before you spend all the money. I know you guys probably like the designing of the HW and SW, and maybe the implementation of both, but really, what you need now is to do sales.

lumost

There are multiple ways to run a business like this.

1. Go deep on the tech, there are funders who will want equity stakes in risky startups because they operate in adjacent markets. It's often cheaper to invest 1MM on a startup than internal R&D activities. If it has promising results, those same investors may ramp up their spend or pivot to an acquisition strategy.

2. Get early customers, if you have 1-10 large enterprises with a committed spend - then you are likely golden. However as nice as this option sounds, there are few avenues to get this type of commitment. If you are in the fortunate position of knowing the exec/founding/investor team of a large LLM provider - it's possible. But easier said than done.

3. Build it and they will come, business strategies take time to develop - maybe that time is poorly spent. Build the best version of your product and someone might take it up. There are a few investors who will take a flyer on this type of founder mentality. Benefit to the investor is that they can get a much larger equity stake/board position in exchange for the early creative freedom. If it works out, the investor can get a lot of alpha. A card which handled LLM inference at 1/100th the cost of an H100 could produce quite a bit of value for the right buyer.

areddyyt

Agreed. We don't plan on making hardware until there is enough demand from customers to make it economically viable.

acidhax

I'm currently working on a portable computer vision project using Pi/Jetson with some Luxonis camera modules and I completely see where you're headed. In the long-game I think you could capture hw accelerated robotics CV.

ben_ja_min

Why not target the enthusiast first? The buzz created around something interesting an "amateur" cooked up may be what you need. The investment involved with creating dev hardware should be minimal, correct?

simne

I may be wrong, but from few other enthusiast niches I conclude, enthusiasts number is very little to feed hardware development. - Need millions sells, but really most real project have made thousands sells.

And this is long known - even Raspberry born for other market, fortunately, was not just killed but conversed to target enthusiast and even now incomplete project.

jroesch

Having been working in DL inference for now 7+ years (5 of which at startup) which makes me comparably ancient in the AI world at this point. The performance rat race/treadmill is never ending, and to your point a large (i.e 2x+) performance improvement is not enough of a "painkiller" for customers unless there is something that is impossible for them to achieve without your technology.

The second problem is distribution: it is already hard enough to obtain good enough distribution with software, let alone software + hardware combinations. Even large silicon companies have struggled to get their HW into products across the world. Part of this is due to the actual purchase dynamics and cycle of people who buy chips, many design products and commit to N year production cycles of products built on certain hardware SKUs, meaning you have to both land large deals, and have opportune timing to catch them when they are evening shopping for a new platform. Furthermore the people with existing distribution i.e the Apple, Google, Nvidia, Intel, AMD, Qualcomms of the world already have distribution and their own offerings in this space and will not partner/buy from you.

My framing (which has remained unchanged since 2018) is that for silicon platform to win you have to beat the incumbents (i.e Nvidia) on the 3Ps: Price (really TCO), Performance, and Programmability.

Most hardware accelerators may win on one, but even then it is often theoretical performance because it assumes their existing software can/will work on your chip, which it often doesn't (see AMD and friends).

There are many other threats that come in this form, for example if you have a fixed function accelerator and some part of the model code has to run on CPU the memory traffic/synchronization can completely negate any performance improvements you might offer.

Even many of the existing silicon startups have been struggling with this since them middle of the last decade, the only thing that saved them is the consolidation to Transformers but it is very easy for a new model architecture to come out and require everyone to rework what they have built. This need for flexibility is what has given rise to the design ethos around GPGPU as flexibility in a changing world is a requirement not just a nice to have.

Best of luck, but these things are worth thinking deeply about as when we started in this market we were already aware of many of these things but their importance and gravity in the AI market have only become more important, not less :)

areddyyt

We've spent a lot of time thinking about these things, in particular, the 3Ps.

Part of making the one line of code work is addressing programmability. If you're on Jetson, we should load the CUDA kernels for Jetson's. If you're using a CPU, we should load the CPU kernels. CPU with AVX512, load the appropriate kernels with AVX512 instruction, etc.

The end goal is that when we introduce our custom silicon, one line of code should make it far easier to bring customers over from Jetson/any other platform because we handle loading the correct backend for them.

We know this will be bordering impossible, but it's critical to ensure we take on that burden rather than shifting it to the ML engineer.

danjl

Why start a company to make this product? Why not go work at one of the existing chip manufacturers? You'd learn a ton, get to design and work on HW and/or SW, and not have to do the million other things required to start a company.

gchadwick

Is GPU rendering used today for VFX? From a quick google it seems that yes GPU based rendering is definitely an option, even if there's various reasons to still prefer CPU. So in your case was it really what you were aiming to do was pointless or simply your particular solution failed to succeed?

You're right that as a small player it's very hard to gain traction, even if the tech is fantastic because it's risky to switch your tech stack over. Though if you do do a good job with the tech I'd say you have a decent chance of an acquisition from a bigger player who wants a ready-made (or 90% of the way there) solution they can make their own. Perhaps you can call this an aquihire but I think you're significantly underplaying the potential upside of this exit. Imagine this startup is seen as having a great ternary transformer solution and ternary transformers are the way to go you could get multiple large players eyeing up an acquisition to get ahead pushing the price up.

My feeling is custom ASICs for ternary transformers is a great area to look at. There is a genuine chance of providing a significant step up from GPUs in terms of power efficiency and potentially performance. Plenty of risk of course, ternary models might just not perform as well as the full fat equivalents and building custom silicon, especially as a start-up, comes with all kinds of issues.

jsheard

> Is GPU rendering used today for VFX?

Yes by small studios with the agility to change their workflow without too much friction, and whose projects are small enough to fit into the constraints of GPU renderers, but largely not by huge studios who already have in-house CPU farms and whose projects need hundreds of gigs of RAM to render anyway.

kridsdale3

The Unreal Engine I hear is getting a lot of work these days.

cs702

Watching the video demo was key for me. I highly recommend everyone else here watches it.[a]

From a software development standpoint, usability looks great, requiring only one import,

  import deepsilicon as ds

and then, later on, a single line of Python,

  model = ds.convert(model)

which takes care of converting all possible layers (e.g., nn.Linear layers) in the model to use ternary values. Very nice!

The question for which I don't have a good answer is whether the improvement in real-world performance, using your hardware, will be sufficient to entice developers to leave the comfortable garden of CUDA and Nvidia, given that the latter is continually improving the performance of its hardware.

I, for one, hope you guys are hugely successful.

---

[a] At the moment, the YouTube video demo has some cropping issues, but that can be easily fixed.

areddyyt

Thank you!

CUDA and Nvidia are practically impenetrable on the server side. To be very concrete, we did training for our models on AWS with parallel cluster. We used P5 instances (8xH100) that were scheduled with SLURM. A problem we ran into however, was that our training jobs were containerized. Thankfully, pyxis and enroot exist to run containerized jobs on SLURM. And who else, but Nvidia, develop and maintain those plugins. For practically any weird niche use case, Nvidia seems to have some software solution - but only on x86.

Jetson is a whole other beast. There is no guarantee any pip package you install has an aarch64/arm64 wheel. For example, we could not use torch_tensorrt, to compile to TensorRT via Torch Inductor. Why? Because the Bazel build system was only configured to build for Jetpack 4.6 or Jetpack 5.1, and we were using Jetpack 6. While Nvidia provides docker images for x86 systems that come with torch_tensorrt installed, their L4T (Linux for Tegra) images do not. Instead we had to manually write out a new workspace file and compile for Jetpack6 to provide TensorRT compiling support.

tl;dr: Nvidia and CUDA have a great walled garden on x86, not so much on their edge computing devices

cs702

My understanding is that, so far, most deployments of AI on edge devices are on mass-market mobile and entertainment devices relying on software and hardware tightly controlled by a handful of mega-corporations, such as Apple (iOS), Google (Android), Samsung (phones, TVs, etc.), and Tesla (proprietary in-car chips for FSD), and so on. Aren't those mega-corporations, not Nvidia, the ones who have the actual walled gardens on AI edge computing?

Do you think otherwise?

areddyyt

You're absolutely right about mobile devices (Apple, Google, etc.). However, most companies, with the exception of Tesla, do use Nvidia for edge computing capabilities. We know for a fact that most of the automotive industry uses automotive rated Orins (the 32GB unified RAM SKU) [1] and Anduril also use Orins. Our primary GTM is with robotics companies, and we have not met a single robotics company not using Jetson, I'm not exaggerating.

[1] Particularly vehicles with advanced self driving capabilities. Qualcomm is another large vendor of hardware for vehicles (though they have even worse support)

areddyyt

Video cropping issues should be fixed!

0xDA7A

I think the part I find most interesting about this is the potential power implications. Ternary models may perform better in terms of RAM and that's great, but if you manage to build a multiplication-free accelerator in silicon, you can start thinking about running things like vision models in < 0.1W of power.

This could have insane implications for edge capabilities, robots with massively better swarm dynamics, smart glasses with super low latency speech to text, etc.

I think the biggest technical hurdle would be simulating the non linear layers in an efficient way, but you can also solve that since you already re-train your models and could use custom activation functions that better approximate a HW efficient non linear layer.

areddyyt

The non-linear layers, particularly the softmax(QK^T), will be crucial to getting ultra-low latency and high throughput. We're considering some custom silicon just for that portion of every transformer block

jacobgorm

I was part of a startup called Grazper that did the same thing for CNNs in 2016, using FPGAs. I left to found my own thing after realizing that new better architectures, SqueezeNet followed by MobileNets, could run even faster than our ternary nets on off-the-shelf hardware. I’d worry that a similar development might happen in the LLMs space.

areddyyt

It's always possible, but transformers have been around since 2017 and don't seem to be going anywhere. I was bullish on Mamba and researched extended context for structured state-space models at Dartmouth. However, no one cared. The bet we're taking is that Transformers will dominate for at least a few more years, but our bet could be wrong.

nicoty

Could the compression efficiency you're seeing somehow be related to 3 being the closest natural number to the number e, which also happens to be the optimal radix choice (https://en.wikipedia.org/wiki/Optimal_radix_choice) for storage efficiency?

areddyyt

We don't achieve peak compression efficiency because more complex weight unpacking mechanisms kill throughput.

To be more explicit, the weight matrix's values belong to the set of -1, 0, and 1. When using two bits to encode these weights, we are not effectively utilizing one possible state:

10 => 1, 01 => 0, 00 =>-1, 11 => ?

I think selecting the optimal radix economy will have more of a play on custom silicon, where we can implement silicon and instructions to rapidly decompress weights or work with the compressed weights directly.

nostrebored

What do you think about the tension between inference accuracy and the types of edge applications used today?

For instance, if you wanted to train a multimodal transformer to do inference on CCTV footage I think that this will have a big advantage over Jetson. And I think there are a lot of potentially novel use cases for a technology like that (eg. if I'm looking for a suspect wearing a red hoodie, I'm not training a new classifier to identify all possible candidates)

But for sectors like automotive and defense, is the accuracy loss from quantization tolerable? If you're investing so much money in putting together a model, even considering procuring custom hardware and software, is the loss in precision worth it?

areddyyt

Great question. So a little bit of background about quantization (apologies if you are already familiar).

There are two types of quantization (generally), post training quantization (PTQ) and quantization aware training (QAT).

PTQ almost always suffers from some kind of accuracy degradation. This is because usually the loss is measured with respect to the FP16/BF16 parameters, and so the weights and distribution are selected to minimize the loss with those weights. Once the quantization function is applied, the weights and distribution change in some way (even if it's by a tiny amount), resulting in your model no longer being at minima.

We do QAT to get around the problem of PTQ. We actually quantize the weights during the forward pass of training, and measure the loss with respect to the quantized weights. As a result, once we converge the model, we have converged the ternary weights as well, and the accuracy it achieved at the end of training is the accuracy of the quantized model. At ~3B parameters the accuracy on downstream task performance between FP16 and ternary weights is identical.

henning

I applaud the chutzpah of doing a company where you develop both hardware and software for the hardware. If you execute well, you could build yourself a moat that is very difficult for would-be competitors to breach.

sidcool

Congrats on launching. This is inspiring. .

transfire

Combine it with TOC, and then you’d really be off to the races!

https://intapi.sciendo.com/pdf/10.2478/ijanmc-2022-0036#:~:t...

areddyyt

Funnily enough, our ML engineer, Eddy, did a hackathon project working with Procyon to make a neural network with a photonic chip. Unfortunately, I think Lightmatter beat us to the punch.

Edit: I don't think the company exists in its current form anymore

Havoc

> This represents an almost 8x compression ratio for every weight matrix in the transformer model

Surely you’d need more ternary weights though to achieve same performance outcome?

A bit like a Q4 quant is smaller than a Q8 but also tangibly worse so the “compression” isn’t really like for like

Either way excited about more tenary progress.

areddyyt

We do quantization-aware training, so the model should minimize the loss w.r.t. the ternary weights, hence no degradation in performance.

stephen_cagle

Is one expectation from moving from a 2^16 state parameter to a tristate one that the tristate one will only need to learn the number of states of the 2^16 states that were actually significant? I.E. we can prune the "extra" bits from the 2^16 that did not really affect the result?

mikewarot

Since you're flexible on the silicon side, perhaps consider designing things so that the ternary weights are loaded from an external configuration rom into a shift register chain, instead of fixed. This would allow updating the weights without having to go through the whole production chain again.

areddyyt

We actually were thinking about this to flush the weights in at initialization

mikewarot

Cool.... if you want to make a general purpose compute engine out of it, you could go full BitGrid[1]. ;-)

[1] https://bitgrid.blogspot.com/2005/03/bitgrid-story.html

areddyyt

This seems super cool. I'll have my cofounder look into it :)

Daily Digest email

Get the top HN stories in your inbox every day.