GistNoesis
Great work. It's a real breath of fresh air coming from the huge frameworks that use CUDA.
So many different CUDA versions, each framework using its own, all relying on different drivers; everything needs a new version every 3 months and takes ~10 GB (and don't even get me started on cuDNN needing a manual, logged-in install).
Here everything is just two files. For embedded systems that don't have a GPU, it's perfect.
Here the parallelization and vectorization have been done by hand, but there is a glimmer of hope coming from various compiler projects:
ISPC is an interesting Intel project that does the parallelization and vectorization automatically for different architectures and is definitely worth a look: https://ispc.github.io/ispc.html
For auto-differentiation, when I need performance or memory I currently use Tapenade ( http://tapenade.inria.fr:8080/tapenade/index.jsp ) and/or manually written gradients when I need to fuse some kernels, but Enzyme ( https://enzyme.mit.edu/ ) is also very promising.
MPI for parallelization across machines.
ShamelessC
> MPI for parallelization across machines.
Some things never change.
ProtoAES256
Ditto about the CUDA and cuDNN part. My project that was running fine for the past 4 years just "died" after a colleague's oversight in upgrading the GPU (1080Ti -> 3090), which isn't compatible with the new cuDNN. It is just too much of a hassle maintaining that *expletive* jargon, so I made the wise decision to kill it.
stephc_int13
100%.
So much more practical to hack around with and/or build small apps.
AMICABoard
I can vouch for this. Pretty solid and keeps improving. The OP is in the class of magic wizards of programming, like Fabrice Bellard!
There are frequent updates and performance improvements. There is also a small community of active users around this.
Almost all feedback gets implemented, and the OP is very responsive.
The OP made it possible to do state-of-the-art voice recognition in C/C++ without the PyTorch baggage - pretty incredible! It's one of those rare high-value projects.
Very grateful for this project and respect to the OP!
Someday, if an open version of ChatGPT becomes available, this could mean voice assistants that speak sense and understand humans - as long as you have a beefy machine.
The current efficiency is pretty surprising, even on a low spec device it performs faster than real time.
I don't know what to say. But I'm blown away.
I expect to see more magic from the OP in future.
He even has a project for a cool sound modem that works over ultrasonic frequencies! Not new stuff, but the implementation is the most robust I have seen.
I recommend the hackers here check out his other projects too, and maybe contribute testing, patches, and such!
stevenhuang
Yup, this is so magical. I've always felt there was something off about requiring end users to set up what is essentially a PyTorch/ML dev environment every time they "just" want to run inference.
A single binary that does this all w/o the python stack is just incredible!
edit: Got it going in 1 min!
I grabbed the prebuilt artifacts (windows)
- https://github.com/ggerganov/whisper.cpp/actions/runs/363552...
Then downloaded ggml-base.bin (148mb) and put it in models/ggml-base.en.bin
- https://huggingface.co/datasets/ggerganov/whisper.cpp/blob/m...
Ran it and everything worked! Amazing. Note that only the large (3 GB) whisper-v2 model is available at the moment, but I haven't seen any errors yet from the older, smaller ones. Wild.
NWoodsman
Can you expand on your steps a bit more? I've never used Github Actions which seems like step 1. Not sure how to get an installer.
NWoodsman
Ok, never mind, figured it out - it requires login. Then the archive is at the very bottom of the page.
gdz
I've been watching this repo pretty much since the beginning, and the amount of work you've achieved is incredible.
I started tinkering with the code about a week ago and, despite knowing nothing about C/C++, I was able to make some edits to fit my use case and connect it to a custom Python front end (I initially tried to use Qt in C++ but struggled so much to get it to compile that I switched to Python instead). This probably means your code is very clean and well documented.
It's a game changer in terms of accessibility: it can caption almost anything live!
I'm very grateful for the effort you've led. Thank you ggerganov, and thanks to everyone who contributed.
thot_experiment
10/10 you're doing god's work my friend, can't wait to spend some time this weekend to try and understand what's going on here. I can't overstate how much I value small libraries. I can't think of a faster way to learn about a concept than to step through someone else's barebones implementation.
ggerganov
Thanks! Indeed, I agree that the project has an educational aspect and value. For me, it helped me get a better understanding of the neural network layers involved in the transformer model. Also, it was a good playground to practice my low-level optimization techniques. I guess another cool thing was that with the help of the community, we came up with a faster way to evaluate the Encoder (at the cost of some accuracy), which ultimately enabled the WASM and RPi4 examples (see #137 if interested in the discussion).
ahgamut
I liked reading the different implementations of the low-level tensor ops (simple C/AVX/AVX2/WASM128bit/ARM-NEON) -- it will help me learn about how to use x86 ASM. Thank you for writing this! Do you have any other recommendations/examples on how numerical code can be optimized via SIMD routines?
ggerganov
I don't have other recommendations, as I am a novice myself when it comes to SIMD. I think the multiplication routines in `whisper.cpp` are relatively basic - dot product and fused multiply-add. With a bit of trial and error I came up with these implementations - not sure if they are optimal.
NWoodsman
For those who want to try, here are the steps I took over about an hour to set it up:
1. Downloaded the Win10 artifact: https://github.com/ggerganov/whisper.cpp/actions/runs/363552... at the bottom of the page (requires logging in to GitHub). Extracted and placed this folder on my F:\ drive, renaming it to 'Whisper'.
2. Downloaded `ggml-large.bin` here: https://huggingface.co/datasets/ggerganov/whisper.cpp/tree/m.... Within F:\Whisper, add a folder named 'models'. Move ggml-large.bin to the 'models' folder.
3. Downloaded ffmpeg, extracted the archive to F:\FFMpeg, and set the environment variable by going to (right click) This PC -> Properties -> Advanced system settings -> (Advanced tab) -> Environment Variables -> click Path -> Edit -> (paste in ffmpegs path i.e. F:\FFMpeg\)
4. Used PowerShell to run ffmpeg against an mp3 file to convert it to WAV (the only format that works), e.g.:
ffmpeg -i F:\Rec\input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le F:\Output\output.wav
5. Opened PowerShell again, `cd`'d to the Whisper folder, and ran this:
./main -m models/ggml-large.bin -f F:\Rec\output.wav
eigenvalue
This is really awesome, great work! I downloaded a long video from YouTube that is very challenging to transcribe (an interview with Hayek, who was both very soft-spoken and had a thick German accent) because I wanted to evaluate OpenAI's claims about Whisper being "superhuman" at recognition. It was a bit picky about having the audio in exactly the right format (it needs 16 kHz WAV files - it would be really nice if the release just included ffmpeg and automatically piped any input through it first to convert to the desired format), but once it got started it cranked away extremely quickly on my iMac M1. And the results do seem to be pretty good. I just wish the model also did some basic speaker identification, so it could insert "Speaker1:" or something at the beginning of each line/timestamp. Even if it's not sure, it could insert "Speaker<?>:" and that would still be useful.
eigenvalue
For those interested:
Video link: https://www.youtube.com/watch?v=34Bre91Ey3Q
Resulting transcript text: https://pastebin.com/5M1iW8yf
The whole thing took 1.5 minutes to run on an M1.
ggerganov
Btw, there is the `yt-wsp.sh` helper script to download, convert, and transcribe a video from a given url:
./examples/yt-wsp.sh <video-url>
eigenvalue
I actually couldn't get the script to work - it keeps complaining that it can't find the Whisper executable. Do you know what I would have to do first to get it to work starting from a "blank slate" and cloning the repo? Thanks!
zimpenfish
It's trying to run `WHISPER_EXECUTABLE` which it defaults to `whisper` - but if you follow the instructions in the README, you end up with `main`.
WHISPER_EXECUTABLE=main examples/yt-wsp.sh youtube_url
Or the script itself prints out a handy hint. Whisper needs to be built into the `main` binary with make; then you can rename it to something like `whisper` and add it to your PATH for convenience.
eigenvalue
Ah thank you, I somehow missed that! Amazing job on this. You could literally create whole new companies from this system.
codetrotter
This is very cool. Excellent work!
Perhaps in combination with https://www.npmjs.com/package/peertube-plugin-transcription your port of Whisper could be used for generating subtitles for videos in PeerTube?
I just recently set up a PeerTube instance of my own and uploaded my first video on it ("No Brain Required - ChatGPT solves Advent of Code in Rust, episode 1", https://video.nstr.no/w/6z7PxB4J92H3NHhgMmfYVw)
I want to try and make use of your port of Whisper on my PeerTube instance, so that I can have subtitles generated for my videos on it :D
fimdomeio
I don't fully understand what's going on behind the scenes, but I tried the repo some days ago (I guess there's even a new model now) and everything seemed very simple to build and try, so thank you for your amazing work.
stephc_int13
Offline models, especially for speech recognition, are a game changer for many apps.
A fully CPU-based implementation, simple and with minimal dependencies, also helps tremendously to reduce initial friction and enables potential low-cost applications.
Excellent and impressive work, can’t wait to try this thing at home.
buzzier
Live demo: https://whisper.ggerganov.com/
naillo
Holy crap, I just loaded a random video file into the tiny model and it did it perfectly and quickly too. This is amazing.
IshKebab
This is awesome. I'm sure the fact that most ML models require an insane mess of Python packages is holding applications back.
Hook this up to ChatGPT and you've got something better than Google Assistant with almost no work.
(You can tell ChatGPT an API, and ask it to generate a script in response to a voice assistant query.)
wfcollins
Hi Georgi,
I am experimenting with your code now. Is there a way to force Whisper to consider only a limited vocabulary and then respond with confidence levels? I am working on an app where it is important to restrict answers, and I would like to know how confident it is that a response is one of a set of words. If the answer could be word A with a confidence level of 95% and word B with a level of 50%, I would want to know that so that I could perform context verification.
Thanks!
Bill
jjwiseman
Hopefully this version will add the prompting ability that the original Whisper has. In the original Whisper, you can give it a prompt for the recognition like "Please respond with only one of the following words: A, B, or C." It isn't foolproof, but it helps.
https://github.com/openai/whisper/discussions/117#discussion...
ggerganov
A follow up on this - I came up with an interesting strategy to achieve this. Still a prototype, but I think it looks very promising:
https://github.com/ggerganov/whisper.cpp/pull/271
The source code is in `command.cpp`, and I will soon write up some more details on how it works. If you give it a try, definitely let me know if it worked for you.
ggerganov
Hi, it's not obvious how to achieve this, but it feels like it could be done. I think all the "tools" are available in the existing interface in `whisper.h` - for example, `whisper_get_probs()` gives you the probability for each token.
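One possible shape for such a restricted-vocabulary check, sketched only as pseudocode (the helper names besides `whisper_get_probs()` are hypothetical; check `whisper.h` for the actual signatures and how token probabilities are indexed):

```
// pseudocode sketch - not the actual whisper.cpp API surface
run inference on the audio segment
probs = whisper_get_probs(ctx)            // per-token probabilities
for each candidate word w in {A, B, C}:
    tokens(w) = tokenize(w)               // a word may span several tokens
    score(w)  = product of probs[t] for t in tokens(w)
pick the argmax over score(w); report the scores as confidence levels
```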
Hi HN,
OpenAI recently released a model for automatic speech recognition called Whisper [0]. I decided to reimplement the inference of the model from scratch using C/C++. To achieve this, I implemented a minimalistic tensor library in C and ported the high-level architecture of the model to C++. The entire implementation is less than 8000 lines of code, contained in just 2 source files without any third-party dependencies. The GitHub project is here:
https://github.com/ggerganov/whisper.cpp
With this implementation, I can very easily build and run the model with `make base.en`. It also allows me to run it on a wide range of devices. For example, I have provided examples of running the model on an iPhone, a Raspberry Pi 4, and even in a web page via WebAssembly!
The implementation runs fully on the CPU and utilizes FP16, AVX intrinsics on x86 architectures and NEON + Accelerate framework on Apple Silicon. The latter is especially efficient and I observe that the inference is about 2-3 times faster compared to the current PyTorch implementation provided by OpenAI when running it on my MacBook M1 Pro. The WASM port utilizes SIMD 128-bit intrinsics - a feature supported in some modern web browsers [1].
I am very happy with the performance that I observe on Apple Silicon devices. I didn’t expect that the Accelerate framework [2] (i.e. CBLAS) offers such a dramatic performance boost for matrix multiplications so I was very pleasantly surprised! To enable the framework in your C/C++ projects, all you have to do is add `-framework Accelerate` to your clang command-line flags.
This entire exercise of implementing the Whisper model was very interesting to me and helped me understand a lot about how the transformer architecture works. I also got a lot of positive feedback from people finding and using my project. We brainstormed on a lot of interesting tools that can potentially be created with this library (such as speech-to-text plugin for Vim, RPi4 voice assistant, WASM chat bot, etc). If interested, checkout the “Examples” section and the “Show and tell” discussions for some ideas!
Would love to know what you think about this project and about your experience with using the Accelerate framework in any of your projects. Cheers!
[0] https://github.com/openai/whisper
[1] https://chromestatus.com/feature/6533147810332672
[2] https://developer.apple.com/documentation/accelerate