
pen2l

Neat, https://github.com/openai/whisper - they have open-sourced it, even the model weights, so they are living up to their name in this instance.

The 4 examples are stunningly good (the examples have speakers with heavy accents, speaking in foreign language, speaking with dynamic background noise, etc.), this is far and away better than anything else I've seen. Will be super curious to see other folks trying it out and seeing if it's as robust as it seems, including when confronted with audio speech with natural tics and uhhh's and uhmm's and everything in-between.

I think it's fair to say that AI transcription accuracy is now decidedly superior to the average human's; what the implications of this are, I'm not sure.

anigbrowl

It was already better. I edit a podcast and have > a decade of pro audio editing experience in the film industry, and I was already using a commercial AI transcription service to render the content to text and sometimes edit it as such (outputting edited audio).

Existing (and affordable) offerings are so good that they can cope with shitty recordings off a phone speaker and maintain ~97% accuracy over hour-long conversations. I'm sure it's been an absolute godsend for law enforcement and other people who need to gather poor-quality audio at scale, though much less great for the targets of repressive authority.

Having this fully open is a big deal though - now that level of transcription ability can be wrapped as an audio plugin and just used wherever. Given the parallel advances in resynthesis and understanding idiomatic speech, in a year or two I probably won't need to cut out all those uuh like um y'know by hand ever again, and every recording can be given a noise reduction bath and come out sounding like it was recorded in a room full of soft furniture.

thfuran

>~97% accuracy over hour-long conversations. I'm sure it's been an absolute godsend for law enforcement

97% accuracy means roughly three or four errors per minute of speech (at a typical conversational pace of around 130 words per minute, a 3% error rate works out to about four misrecognized words a minute). That seems potentially extremely problematic for something like law enforcement use, where decisions with significant impact on people's day and/or life might be made on the basis of "evidence".

anigbrowl

No it isn't. That just means 2-3% of your content needs to be double-checked by a person at the audio level, saving huge amounts of time - equally true of human transcription, in which individual words are often [UNINTELLIGIBLE].

Would you want to review this fully before going into court? Absolutely - because you'd want to play the recording to a jury for emotional impact. Can you rely on it when you want to quickly read through hours of conversation and make decisions about whether to invest further resources (which might just mean another hour of listening back to the original audio)? Also absolutely. Bear in mind that a lot of these errors have little to no semantic impact, being on the same level as typos or misspellings in a written communication.

Bear in mind too that if law enforcement (honest or not) is so interested in you that they're willing to record your conversations, your day is already ruined, you just don't know it yet. The change here is one of scale rather than quality.

gs17

Yeah, I tried to use automated transcription for a research project and we had to do it all manually because the few errors (I would say it did pretty well given our recording quality) were often dropped words like "not", which changed the whole meaning of a sentence! It was useful assistance during transcription, but I really hope they would verify it was correct before arresting anyone based on it.

hadlock

Microsoft announced their voice transcription technology a couple of years ago and were also touting ~97-98% accuracy, which was actually better than human transcription error rates. The errors usually stem in part from people garbling their own speech, or moving their head while talking so the microphone misses a syllable. Anything in that error bar would probably fall under "reasonable doubt".

selfmodruntime

I've worked with similar technology in the law enforcement space and the software is never used to make decisions. You can make out critical timestamps in conversations and a law enforcement officer will always manually confirm the software's assessments.

CTDOCodebases

I imagine a certain percentage of a given population is on a voice call at any one time.

1. Set up a computer with voice recognition software that flags certain patterns.

2. Connect computer to voice call communication network.

3. Configure computer to switch between calls every x number of seconds.

Think of it like a system to generate leads for law enforcement that can be integrated with other systems to produce the best quality leads.

Thorentis

Not really. Imagine that they do simple keyword matching on the text. Anything the transcription gets wrong (the ~3%) might be missed, and the criminals get away with it. Anything that matches is then checked by a human (by listening to the audio at that timestamp). So you only need to manually verify the matches, and even then only if something you're interested in is found.
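
As a rough illustration of that workflow (a minimal sketch, not anything from an actual product: the segment structure matches what the whisper Python package returns, and the keyword list is purely hypothetical), flagging segments for a human to verify could look like:

    import whisper

    model = whisper.load_model("base")
    result = model.transcribe("call.mp3")

    keywords = {"example", "keyword"}  # hypothetical watch list

    # each segment carries start/end times in seconds plus the decoded text
    for seg in result["segments"]:
        if keywords & set(seg["text"].lower().split()):
            # a human then listens to this timestamp in the original audio
            print(f"{seg['start']:7.1f}s - {seg['end']:7.1f}s  {seg['text']}")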

golem14

One would think that the few crucial bits of information gleaned are listened to manually, and that the machine transcription is not the only thing the judge or a jury sees.

adamgordonbell

I've not found that to be the case.

For technical content, I use Rev.com and provide a glossary, and real humans do the transcript. Other AI transcription services get lots wrong because the context often matters. Words like "TCP/IP" or "FAT disk format" or "Big Endian" I've never found AI to handle well so far.

I'm interested to test out whisper on this one.

https://corecursive.com/063-apple-2001/

deegles

There's already software that can imitate a person's voice, so we already have all the pieces to do speech-to-text, clean up the text with GPT-3, and go back to text-to-speech in the original person's voice. Maybe with a style transfer to keep the person's inflections etc. the same?
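
A rough sketch of that pipeline, assuming the whisper package for the speech-to-text step and the OpenAI completions API for cleanup; the synthesize() call at the end is a purely hypothetical stand-in for whatever voice-cloning TTS you use, no specific product implied:

    import openai
    import whisper

    openai.api_key = "..."  # your API key

    # 1. speech-to-text
    stt = whisper.load_model("medium")
    text = stt.transcribe("raw_recording.mp3")["text"]

    # 2. strip fillers/disfluencies with GPT-3
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=f"Rewrite this transcript without filler words:\n\n{text}",
        max_tokens=1024,
    )
    cleaned = response["choices"][0]["text"]

    # 3. back to audio in the original speaker's voice; placeholder only,
    #    this function does not exist as written
    audio = synthesize(cleaned, voice_profile="original_speaker")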

Karuma

I think something similar already exists. See this, for example: https://koe.ai/recast/

Although I don't know if they're using anything similar to what you suggest. Very cool idea, anyway!

biomcgary

Since you work on podcasts, do any open source transcription tools currently identify the speaker in the output? This would be particularly helpful for interviews.

nico

Not sure about open source, but in general, automated transcription systems need a separate track for each different speaker. So for example, for a phone call with one person on each end, you need two separate channels (recording systems usually split them left/right on one stereo file).
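
If the recording is already one stereo file with a speaker per channel, ffmpeg's channelsplit filter will separate it into two mono tracks before transcription (file names here are just placeholders):

    ffmpeg -i call_stereo.wav \
      -filter_complex "[0:a]channelsplit=channel_layout=stereo[left][right]" \
      -map "[left]" speaker_a.wav \
      -map "[right]" speaker_b.wav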

nonoesp

I'm not sure if you've tried Descript, but their ML-based "Studio Sound" filter makes bad audio sound like it was recorded and edited nicely.

solarmist

Any recommendations for particular services?

anigbrowl

I use a service called sonix.ai. It's paid but I think they have a free tier or trial period, and it's not very expensive. I'm excited about this new OpenAI thing because I'd rather do it on my own hardware than send it to the cloud, but this company has earned its commercial success.

solarmist

That is an exciting possibility. Being able to fix bad setups and missed takes automagically. It’s always been possible, just expensive and time consuming for moderate improvements.

bambax

The French version is a little contrived. The speaker is a native speaker, but the text is obviously the result of a translation from English to French, not idiomatic French.

I will try to put the code to the test, see how it goes.

pen2l

Interesting. I'm a non-native French speaker, and the original French piece struck me as entirely normal (but maybe it was just the perfect French accent that swayed me). Can you please point out what he said that wasn't idiomatic or naturally worded French?

bambax

Little details. The second sentence is really bizarre:

> Nous établissons que l'utilisation de données d'un tel nombre et d'une telle diversité est la raison pour laquelle le système est à même de comprendre de nombreux accents...

It doesn't sound natural at all. An idiomatic formulation would be more along the lines of:

Le recours à un corpus [de données] si riche et varié est ce qui permet au système de comprendre de nombreux accents (With 'corpus', 'données' is implied.)

Of course this is just an example, and I'm sure other French speakers could come up with a different wording, but "données d'un tel nombre et d'une telle diversité" sounds really wrong.

This is also weird and convoluted:

> Nous distribuons en tant que logiciel libre le code source pour nos modèles et pour l'inférence, afin que ceux-ci puissent servir comme un point de départ pour construire des applications utiles

It should at least be "le code source DE nos modèles" and "servir DE point de départ", and "en tant que logiciel libre" should be placed at the end of the clause (after 'inférence').

Also, "construire" isn't used for code but for buildings, and "applications utiles" is unusual, because "utiles" (useful) is assumed. "...pour le développement de nouvelles applications" would sound more French.

_plg_

At the start, the "Nous établissons" part, for example. You wouldn't write that if you were starting from scratch in French.

not_math

You can see from the transcript where the model made some errors, for example:

> We distribute as a free software the source code for our models and for the inference [...]

Should be

> We are open-sourcing models and inference code [...]

Another example

> We establish that the use of such a number of data is such a diversity and the reason why our system is able [...]

Should be

> We show that the use of such a large and diverse dataset leads to improved robustness [...]

octref

I'm interested in building something with this to aid my own French learning. Would love to read your findings if you end up posting it somewhere like twitter/blog!

bambax

Last try for tonight with Baudelaire.

Original:

    Trois mille six cents fois par heure, la Seconde
    Chuchote Souviens-toi !– Rapide, avec sa voix
    D'insecte, Maintenant dit Je suis Autrefois,
    Et j'ai pompé ta vie avec ma trompe immonde !

    Remember ! Souviens-toi ! prodigue ! Esto memor !
    (Mon gosier de métal parle toutes les langues )
    Les minutes, mortel folâtre, sont des gangues
    Qu'il ne faut pas lâcher sans en extraire l'or !
Transcription:

> Trois mille six cents fois par heure, la seconde chuchote « Souviens toi », rapide, avec sa voix d''insecte, maintenant dit « Je suis autrefois », et j''ai pompé ta vie avec ma trompe immonde. « Remember, souviens toi, prodigue, est au mémoire, mon gosier de métal, parle toutes les langues, les minutes, mortelles folâtres, sont des gangs qu''il ne faut pas lâcher sans en extraire l''or. »

Not bad! Far from perfect but it's a difficult text. Interesting that it works better with Baudelaire than Pascal.

bambax

Tried again with Blaise Pascal -- the famous fragment of a letter where he says he's sorry he didn't have enough time to make it shorter.

Original:

> Mes révérends pères, mes lettres n’avaient pas accoutumé de se suivre de si près, ni d’être si étendues. Le peu de temps que j’ai eu a été cause de l’un et de l’autre. Je n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir de la faire plus courte. La raison qui m’a obligé de me hâter vous est mieux connue qu’à moi. Vos réponses vous réussissaient mal. Vous avez bien fait de changer de méthode ; mais je ne sais si vous avez bien choisi, et si le monde ne dira pas que vous avez eu peur des bénédictins.

Transcription:

> Mes rêves errent pères, mais l'detre navais pas accoutumé de se suivre de si près ni d'detre si étendu. Le peu de temps que j'sais eu a été cause de l'de l'de l'de autre. J'sais n'detre plus longue que parce que j'sais pas eu le loisir de la faire plus courte. La raison qui m'sa obligée de me hâter vous est mieux connue qu'moi. Vos réponses vous réussissaient mal. Vous avez bien fait de changer de méthode, mais je ne sais pas si vous avez bien choisi et si le monde ne dira pas que vous avez eu peur des bénédictes.

Here there are many more mistakes, so many that the beginning of the text is unintelligible. The language from the 17th century is probably too different. Still on the "medium" model, as the large one crashes the Colab (not sure how to select a beefier machine.)

Still fascinating and exciting though.

bambax

I'm playing with a Colab posted in this thread (https://news.ycombinator.com/item?id=32931349), and it's incredibly fun and accurate!

I tried the beginning of L'étranger (because you seem to be a fan of Camus ;-)

Here's the original:

> Aujourd’hui, maman est morte. Ou peut-être hier, je ne sais pas. J’ai reçu un télégramme de l’asile : « Mère décédée. Enterrement demain. Sentiments distingués. » Cela ne veut rien dire. C’était peut-être hier.

> L’asile de vieillards est à Marengo, à quatre-vingts kilomètres d’Alger. Je prendrai l’autobus à deux heures et j’arriverai dans l’après-midi. Ainsi, je pourrai veiller et je rentrerai demain soir. J’ai demandé deux jours de congé à mon patron et il ne pouvait pas me les refuser avec une excuse pareille. Mais il n’avait pas l’air content. Je lui ai même dit : « Ce n’est pas de ma faute. » Il n’a pas répondu. J’ai pensé alors que je n’aurais pas dû lui dire cela. En somme, je n’avais pas à m’excuser. C’était plutôt à lui de me présenter ses condoléances.

Here's the transcription:

> Aujourdhui, maman est morte, peut être hier, je ne sais pas. J''ai reçu un télégramme de l''asile. Mère décédée, enterrement demain, sentiment distingué. Cela ne veut rien dire. C''était peut être hier.

> L''asile de Vieillard est à Maringot, à 80 km d''Alger. Je prendrai l''autobus à deux heures et j''arriverai dans l''après midi. Ainsi, je pourrai veiller et je rentrerai demain soir. J''ai demandé deux jours de congé à mon patron et il ne pouvait pas me les refuser avec une excuse pareille. Mais il n''avait pas l''air content. Je lui ai même dit, ce n''est pas de ma faute. Il n''a pas répondu. J''ai alors pensé que je n''aurais pas dû lui dire cela. En somme, je n''avais pas à m''excuser. C''était plutôt à lui de me présenter ses condoléances.

Except for the weird double quotes instead of the single apostrophe ('), it's close to perfect, and it only uses the "medium" model.

This is extremely exciting and fun! Happy to try other texts if you have something specific in mind!

suyash

More of this is welcome; they should live up to their name and original purpose and share other models (code, weights, datasets) with the open source community as well.

Workaccount2

Can't wait to see twelve new $49.99/mo speech parser services pop up in the next few weeks.

quickthrower2

Make hay before Google gives away free hay.

That said there is value in integration of this into other things.

quickthrower2

This has been running on my laptop all day for a 15 min mp3! Definitely not cheap to run, then (I don't even want to imagine how much AWS compute it would require).

knaik94

It seems far from good with mixed-language content, especially English and Japanese together. The timestamps are far from perfect, and it's nowhere close to human level for the more ambiguous translations that depend on the context of a word. It's far below what anyone who speaks either language would consider acceptable. Maybe it's unfair to use music, but music is the most realistic test of whether it's superior to the average human.

quickthrower2

Some music is hard for even people to make out the lyrics to.


darepublic

> Neat, https://github.com/openai/whisper - they have open-sourced it, even the model weights, so they are living up to their name in this instance.

Perhaps it will encourage people to add voice commands to their apps, which can then be sent to GPT-3.

pabs3

Is the training dataset and code open too?

jfoster

It seems like OpenAI are finally living up to their name for once with this release? Anything I'm missing?

From what I can gather:

1. Includes model weights. I can't find the URL, but they reference them enough and have a CLI tool, so I presume I just haven't found them yet.

2. Includes code: https://github.com/openai/whisper

3. Released under MIT License: https://github.com/openai/whisper/blob/main/LICENSE

thesausageking

It's one model and in a non-strategic area where there are existing open source projects (Kaldi, DeepSpeech, ...).

For a company that raised $1B, that's not exactly living up to their name and original mission.

blagie

Yes. The same is true of many products from many companies.

I feel bad about GPT-3 and DALL-E being released under the terms they were, but I don't feel bad about this. I'm not going to condemn OpenAI for the good things they did, but I will hold them accountable for bad things or good ones they didn't do.

I'd given up on OpenAI being open or ethical, but this is a start. It took them down from "evil super-villain" status to mere villain.

whimsicalism

> It's one model and in a non-strategic area where there are existing open source projects (Kaldi, DeepSpeech, ...).

I can already tell this is much better than any of the existing open source projects with the exception of the wav2* sequence of projects and potentially nvidia's nemo.

thesausageking

Kaldi is an open, pluggable framework and is a ton more flexible and powerful than this. It's used by hundreds of teams, including a number of consumer tech companies you've heard of. They're not going to move to this over it.

Especially because ASR is a living organism. You have to constantly update your language model as new people, ideas, and words move into the normal lexicon. As people start talking about "COVID", "metaverse", "King Charles", or whatever new things come up, these need to be added to your language model. You need these updates monthly at a minimum, and OpenAI didn't release the raw data, which means you can't retrain it even if you wanted to spend the time/resources.

So, this is an interesting research project and helpful for small teams and side projects, but it's unlikely it makes any real impact on the industry.

solarmist

This kind of model is harder to abuse, so I guess it passed their internal checks much more easily.

I can understand not releasing GPT-3, even if I disagree with the decision.

ignoramous

> This kind of model is harder to abuse, so I guess it passed their internal checks much more easily.

The version I choose to believe: stability.ai ate DALL-E for lunch, and that woke them up.

solarmist

This is probably also true.

jfoster

True. The potential of GPT-3 to cause internet mayhem was/is significant. I would argue that the mere act of announcing it was still a catalyst for an eventual GPT-3-like model being released. In revealing it, they established a target for what open source models could aim to achieve, and simultaneously got bad actors thinking about ways to abuse it.

zarzavat

It was a credible argument when GPT-3 was released. But now there are open models that are as capable as GPT-3 and that mayhem has not materialized, with the possible exception of GPT-4chan. They could release it now under a non-commercial license, if they cared to.

dwohnitmok

> I can understand not releasing GPT-3, even if I disagree with the decision.

Why do you disagree?

solarmist

Two reasons. First, someone else will release something similar. Second, I didn't see a related push from them to work with others in the industry to do something productive towards safety with the time they got by delaying availability of these kinds of models. So it felt disingenuous.

bigyikes

I don’t see how GPT-3 is any more dangerous than Stable Diffusion, Photoshop, that fake news website the crazy person you’re friends with on Facebook really likes, or any of the number of other tools and services that can be used to generate or spread fake information.

mmh0000

Because why should the wealthy and connected be the only ones -allowed- to have access to such life-improving technology?

danso

This is an astonishing package. Every AI voice-to-text model I've tried on "The Wire's" famous "fuck" scene [0] usually fails, because the youtube clip's audio quality is bad and it's a scene with virtually no dialogue except breathing and "Fuck". But Whisper returned impressive results [1]

[0] https://www.youtube.com/watch?v=DS6pE88Xg3s

[1]

    $ yt-dlp --extract-audio --audio-format mp3 -o wire-fuck.mp3 https://www.youtube.com/watch?v=DS6pE88Xg3s

    $ whisper --language en wire-fuck.mp3
    [00:00.000 --> 00:02.000]  Oh
    [00:13.260 --> 00:15.260]  Fuck
    [00:15.260 --> 00:31.260]  Motherfucker
    [00:50.700 --> 00:52.700]  Fuck
    [00:52.700 --> 00:58.700]  Oh
    [00:58.700 --> 01:10.700]  Fuck
    [01:28.700 --> 01:55.900]  Fuck
    [02:02.340 --> 02:03.700]  Motherfuck.
    [02:10.220 --> 02:11.220]  Oh, fuck.
    [02:11.780 --> 02:12.780]  Oh, fuck.
    [02:25.900 --> 02:27.900]  Fuck, fuck, fuck, fuck, fuck, fuck.
    [02:27.900 --> 02:28.900]  Motherfucker.
    [02:32.900 --> 02:33.900]  Oh, fuck.
    [02:34.900 --> 02:35.900]  Fuck.
    [02:35.900 --> 02:36.900]  Oh, fuck.
    [02:36.900 --> 02:37.900]  Oh, fuck.
    [02:37.900 --> 02:38.900]  Oh, fuck.
    [02:48.900 --> 02:49.900]  Motherfucker.
    [02:53.900 --> 02:54.900]  Fucking A.
    [02:54.900 --> 02:56.900]  Mm hmm.
    [02:56.900 --> 03:12.900]  Fuck.
    [03:26.900 --> 03:28.900]  Motherfucker.
    [03:28.900 --> 03:32.900]  Fuck me.
    [03:58.900 --> 04:01.900]  Oh.
    [04:28.900 --> 04:34.900]  Fuck.

marcelfahle

As interesting as it is funny. Great benchmark! Here's the rev.ai output for comparison:

  Speaker 0    00:00:12    Oh, fuck motherfucker. Okay. Fuck, fuck, fuck, fuck, fuck, fuck, fuck, fuck. 
 My little fuck.  
  Speaker 1    00:02:10    Oh, fuck. Oh, fuck,  
  Speaker 0    00:02:25    Fuck, fuck, fuck, fuck, fuck, fuck, fuck, fuck my motherfucker.  
  Speaker 1    00:02:53    Fucking a.  
  Speaker 0    00:02:54    Mm-hmm. <affirmative> motherfucker. Fuck me. Um,

AndrewKemendo

I've been on HN since 2012 and this might be one of the best comments I've ever read

TaylorAlexander

Hey this looks great! I like to record audio notes while driving in my car after work, to kind of decompress my thoughts from the day. But I never go back and listen as they can be long and meandering. Sometimes in the audio log I will sum up my thoughts, but this might be 20 minutes in and hard to find. I really wish I had transcriptions so I could easily scan the full contents. I have tried Mozilla Deepspeech (I don't want a cloud solution) and I was surprised to find that I could not get Deepspeech to reliably transcribe them. There is a bit of road noise, though I think for a human listener they are easy to understand. It looks like this one might actually do the trick!

EDIT: Tried it and it worked great! It is very easy to use. I just did the pip install line in the readme and was ready to go. You literally just run the one pip install line, and then you run the program in the format "whisper my_audio.wav" and it goes. Really nice job OpenAI!
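
For a backlog of voice notes, a shell loop like this (paths are placeholders; --model and --output_dir are real flags of the whisper CLI) should batch everything into text files you can grep later:

    for f in voice-notes/*.wav; do
        whisper "$f" --model medium --language en --output_dir transcripts
    done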

Snitch-Thursday

Google's recorder app for android will let you record audio files and make some transcriptions, right on the device.

capableweb

Is that application actually doing on-device transcription? Under "Data safety" on the Google Play page it says "This app may share these data types with third parties: Audio" which doesn't exactly instill confidence that my audio will 100% always stay on my device. It also says "Data is encrypted in transit" but if data stays on the device, why it has to be "encrypted in transit"? There should be no transit at all.

bruckie

Yes, it works completely offline, including transcription and recognition of music. There's an optional cloud sync feature, which I assume is the reason for the notice on Google Play.

(Work for Google, don't speak for them.)

Tenoke

I just tested it and it was pretty mediocre at least with my accent. I can definitely benefit from a decent app for quick note recording with a button press->transcribe->upload to gdrive/good UI app for later grepping.

TaylorAlexander

Was this with the default base model, or the medium or large model? This can be specified with the --model flag.

olao99

Google's recorder app is NOT available for most phones - only Pixels and a couple of other selected handsets.

petercooper

I'll probably explore using this, but I've used an app called Just Press Record to do what you say. Runs on Apple Watch too, so you can tap a complication at any time in the day, speak, and you get a transcript on your phone, etc.

zhynn

I do this too! I have been doing it for about a year now, and haven't ever run into someone else who does this kind of audio journaling. Would you be up for comparing notes sometime about how it is working out for you? I am finding that it is an extremely effective form of self-care, but with lots of personal caveats. I would be so interested to hear about your experience.

TaylorAlexander

Oh cool! Yeah I have stopped doing it lately as I was not really using them (I would like to use them for making rough notes for future youtube video scripts), though in general it does seem like good self care too even if I don't review them. That said I just tried the base model on one of my voice logs and it was pretty good! Trying the medium model now and it seems basically perfect. So I will have to start doing these logs more!

Anyway I am pretty terrible with email but short exchanges can work for me, or maybe we can connect over signal. Send me a message to my email in my profile and I would be happy to sync up!

tekacs

I do this too, and I’ve built some software for it just for myself.

I’d love to chat and hear about how you use this! My email is in my profile, or I’m @tekacs on Twitter (and everywhere). :)

blueberrychpstx

Count me in!! Working on tools actually to turn these transcriptions into something more social

gok

Comparing this model's word error rates to the state of the art [1] on a few common test sets:

                           Whisper    SoTA
  LibriSpeech test-clean      2.7%     1.8%
  LibriSpeech test-other      5.6%     2.9%
  Switchboard                13.1%     4.9%
  CallHome                   15.8%     9.5%
The authors do explicitly state that they're trying to do a lot of fancy new stuff here, like be multilingual, rather than pursuing just accuracy.

[1] https://github.com/syhw/wer_are_we
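
For reference, word error rate is the edit distance between the hypothesis and the reference transcript, normalized by the reference length:

    WER = (S + D + I) / N

where S, D, and I are the substituted, deleted, and inserted words and N is the number of words in the reference, so the figures above read roughly as "errors per 100 reference words."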

lunixbochs

I suspect Whisper is more robust than other "SOTA" models, but this release is likely leaving a fair bit of accuracy on the table considering the amount of resources OpenAI is capable of throwing at training it.

Comparing the readily available test sets from the paper to some of my personal robust models (for the Talon models, this is greedy decoding, no language model):

                       Talon  Talon  Talon  Whisper  wav2vec 2.0
                       28M    300M   1B     Large    960h
    librispeech clean   3.21   2.52   2.40   2.7      2.7
    librispeech other   8.21   6.56   5.63   5.6      6.2
    common voice       13.88  11.65   8.86   9.5     29.9
    tedlium             7.51   6.55   5.47   4.0     10.5
I have a battery of more difficult tests on hand (including adversarial tests, and diverse accent-specific metrics). I'll look at running these tests on each of the Whisper model sizes and following up with a larger comparison.

ma2rten

I'm looking forward to your comparison. It's really hard to make sense of how good this model actually is without being an expert in the area.

allanrbo

Talon was the first thing that came to my mind when I saw this news. Would be nice if it could benefit from Whisper. (Big fan of your work on Talon!)

nshm

It is interesting how they compare with wav2vec2 instead of nemo conformer (which is more accurate) in Table 2.

sjnair96

Indeed interesting.

On that note, a core Nvidia NeMo developer I follow posted this: https://twitter.com/HaseoX94/status/1572748653189791745

He calls it a "T5 for ASR" paper :) More insights in there, have a look! Curious to see what your blog would put up as well!

StevenWaterman

One of the things they point out is that the SoTA on e.g. LibriSpeech is only good at LibriSpeech, and doesn't generalise as well.

> Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper’s zero-shot performance across many diverse datasets we find it is much more robust and makes 50% fewer errors than those models.

lunixbochs

My own experience agrees: the generally available "SOTA" models are not especially robust, and can be _extremely_ bad (>50% absolute error rate) at some tasks. I'll post some preliminary numbers in a sibling comment and look into running my full set of tests on Whisper.

It looks like Whisper is probably leaving a lot of accuracy on the table, but initially it does seem to be a lot more robust than general "SOTA" models.

For a quick comparison, Silero's accuracy charts are kind of nice because they post results for a large variety of datasets. Scroll down to the EN V6 xlarge EE model (not the xlarge CE) [1]

[1] https://github.com/snakers4/silero-models/wiki/Quality-Bench...

petercooper

Just tested this on some developer podcasts which usually fail hard given they're full of technical jargon, brand names, etc. Whisper is a revolution! It's picking up terms like Heroku, DigitalOcean, GitHub, ECS, AWS, etc. and capitalizing properly - something nothing else did unless you provided a whole pile of guiding vocabulary.

ma2rten

Did these podcasts have transcripts? You might be inadvertently evaluating it on data that it was trained on, which is basically cheating. Even if not, it might be trained on similar podcasts. Judging how good these kinds of models are is really hard.

petercooper

No transcripts, no. And recent episodes, within the past couple of weeks, so probably not part of the training either.

WiSaGaN

True. The test should only be done on the material released after the model.

andy_xor_andrew

Hold on - it does not only do speech recognition, but also language translation, in the same model?

What an interesting approach. What benefits does this have over having two dedicated models, one for speech-to-text, and another for translation?

It just seems so odd, given that the problems of speech-to-text and Spanish-to-English translation seem so different from one another (in terms of the problem domain). Seems so unusual to have both handled by one model!

Does knowledge of speech-to-text carry over into knowledge of translation? Does knowledge of translation carry over into knowledge of speech-to-text? So weird.

TaylorAlexander

It seems these days that language-oriented models are commonly becoming multilingual by default. There are a lot of common threads when understanding sentence construction between different languages. French and English have different rules but they will still have things like nouns, adjectives, subjects, prepositions, etc. It seems that by training models on many languages you get both a more robust understanding of language, and it saves you the trouble of having to make many more localized models for every language. I also believe that the other languages help the models construct sentences in languages which have very small training sets. If it has a few examples in a rare language as well as good translations to a better-known language, then it can provide good support for the rare language.

We also see in image generation models that multi-modal networks are more powerful than single purpose networks. As we move towards more advanced AI systems I suspect we will see more and more generalizable networks with distinct advantages over separate networks that get plugged together.

magicalhippo

Would a multilingual model perhaps also be better at understanding non-native speakers' speech?

TaylorAlexander

Good question but I don’t know the answer.

newhaus1994

My understanding is that multi-modal models are the primary focus of OpenAI right now, due to their stated goal of achieving AGI. This product is probably better thought of as an offshoot of their work to create a fully generalizable model, rather than a specific attempt to provide translation/transcription services.

ByThyGrace

Judging from the chart in their GitHub README, Whisper performs much better at parsing Spanish audio than any other language, and that in particular blows my mind. I would have expected English to be at the top of any such model, it being such an IT lingua franca.

Now I wonder if it works equally well with Spanish from Spain (and its different regions) and Spanish from the New World (in all its myriad different flavours).

beanlog

It sounds useful to me because you can use tone information to help with the translation, which text-to-text translation can't do. But I'm not sure if that's how this model actually works.

thuttinger

I tried running it in realtime with live audio input (kind of).

If you want to give it a shot, you can find the python script in this repo: https://github.com/tobiashuttinger/openai-whisper-realtime

A bit more context on how it works: the system's default audio input is captured with Python, split into small chunks, and then fed to OpenAI's original transcription function. It tries (currently rather poorly) to detect word breaks and doesn't split the audio buffer in those cases. With how the model is designed, it doesn't make the most sense to do this, but I figured it would be worth trying. It works acceptably well.
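
The general shape of that approach (a minimal sketch, not the repo's actual code; it assumes the sounddevice package for capture and the whisper Python package for transcription) is roughly:

    import sounddevice as sd
    import whisper

    model = whisper.load_model("base")
    SAMPLE_RATE = 16000      # Whisper expects 16 kHz mono audio
    CHUNK_SECONDS = 5

    while True:
        # blocking capture of one chunk from the default input device
        chunk = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                       channels=1, dtype="float32")
        sd.wait()
        result = model.transcribe(chunk.flatten(), fp16=False)
        print(result["text"])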

kkielhofner

Haven’t tried it yet but love the concept!

Have you thought of using VAD (voice activity detection) for breaks? Back in my day (a long time ago) the webrtc VAD stuff was considered decent:

https://github.com/wiseman/py-webrtcvad

Model isn’t optimized for this use but I like where you’re headed!
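
For what it's worth, the webrtcvad API is tiny. A minimal sketch, assuming 16 kHz 16-bit mono PCM input (the library only accepts 10, 20 or 30 ms frames):

    import webrtcvad

    vad = webrtcvad.Vad(2)   # aggressiveness 0 (permissive) to 3 (strict)
    SAMPLE_RATE = 16000
    FRAME_MS = 30
    FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit samples

    def is_speech(frame: bytes) -> bool:
        # frame must be exactly FRAME_BYTES of raw PCM
        return vad.is_speech(frame, SAMPLE_RATE)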

thuttinger

Interesting. I'll take a look at this, thanks!

adeptima

Japanese results look pretty impressive!

Took マッコウクジラ14頭が海岸に打ち上げられる オーストラリア(2022年9月21日) https://www.youtube.com/watch?v=bZkNIzeRBk4

Extracted audio with youtube-dl -f bestaudio https://www.youtube.com/watch\?v\=bZkNIzeRBk4

Converted into:

    [00:00.000 --> 00:13.000]  オーストラリア南部の島で、真っ向くじら14棟が海岸に打ち上げられて死んでいるのが見つかり、専門家が調査のため原地入りしました。
    [00:13.000 --> 00:25.000]  原地メディアによりますと、オーストラリア南部のキング棟で、19日、少なくとも14棟の真っ向くじらが海岸に打ち上げられて死んでいるのが見つかりました。
    [00:25.000 --> 00:31.000]  ほとんどが若いオーストを見られ、専門家が現場に重むき調査に当たっています。
    [00:31.000 --> 00:41.000]  くじらの死害は大きく運んだり埋めたりすることが難しいため、自然に分解されるのを待つ方針が検討されています。
    [00:41.000 --> 00:52.000]  また、死害を狙い、サメが海に集まる可能性があるとして、原地東局はサーファーなどに周囲に近づかないように呼びかけています。
    [00:52.000 --> 01:02.000]  一方、21日にはタスマニア棟でおよそ230棟のくじらが浜辺に打ち上げられた状態で見つかりました。
    [01:02.000 --> 01:07.000]  およそ半数がまだ生きている模様で急助活動が進められています。
    [01:07.000 --> 01:23.000]  見つかったのは、ゴンドーくじらの仲間と見られています。

gzer0

Shocked at how good the results are, and how easy of an installation it is.

Here are the exact steps to follow to get it running on Ubuntu 22.04 via WSL and yt-dlp:

  1. pip install git+https://github.com/openai/whisper.git

  2. yt-dlp -f 'ba' -x --audio-format mp3 https://www.youtube.com/watch/?v\=bZkNIzeRBk4

  3. renamed the file to test.mp3

  4. whisper test.mp3 --language Japanese --task translate --model large
Note: the large model will download a ~3Gb file

NaturalPhallacy

I did something similar (my ytdl is ytdlp too). You don't even have to grab just the audio, it'll take a webm: https://i.imgur.com/03UFGc8.gif

Amazing work.

adeptima

That's because ffmpeg is in there:

https://github.com/openai/whisper/blob/main/requirements.txt

so it should process most formats.

adeptima

"--model large" option produces much better results at higher resources consuming costs

knaik94

Did you try translating them to english? I want to see if you get a similar error as me with a random phrase "Translated by Releska" showing up.

lynguist

It's called hallucination. As the model is trained on unsupervised data, such errors do occasionally happen. The model picks up that such phrases occur in translations and inserts them even when they do not appear in the source. This is described in the paper.

knaik94

I came across it during a silent/instrumental portion in the song I was testing. I asked only because I am curious how frequently the error might show up, I don't expect it to be very common. It's looking at phrase level instead of word level timestamps which is going to make it hard to tokenize music. I asked simply because the parent comment also tested on Japanese.

dom96

This really makes me want to build a Amazon Echo/Google Nest/etc replacement that's open hardware, open source and most importantly recognises voice completely offline. I find that I don't use these smart devices for much more than setting timers anyway so this seems like an easy project.

I just wonder what system requirements Whisper has and whether there are open source voice recognition models that are specifically built for embedded devices.

MacsHeadroom

I really want all this too. The smallest model is ~80 MB and the largest is ~3 GB. Not sure about system requirements yet, but models that small suggest this may be doable locally on a single-board computer.

Edit: According to this comment[0] the base model runs in real time on an M1 CPU. The tiny model apparently decodes an audio file twice as fast. These are promising results.

[0] https://news.ycombinator.com/item?id=32927360#32929739

lunixbochs

For an offline (non-streaming) model, 1x realtime is actually kind of bad, because you need to wait for the audio to be available before you can start processing it. So if you wait 10 seconds for someone to finish speaking, you won't have the result until 10 seconds after that.

You could use really small chunk sizes and process them in a streaming fashion, but that would impact accuracy, as you're significantly limiting available context.

dom96

I'd be interested to see how well it performs on something like an RPi. M1 is pretty beefy.

olao99

To be more precise, the original comment said "M1 Max", which is in itself significantly beefier than a bare "M1".

suyash

This is only one side of the coin; you still need really good models for speech synthesis, and then you need it all working in near real time, ideally locally on device.

ricopags

As far as TTS goes, Mycroft.ai[0] has released a decent offline one.

[0]https://mycroft.ai/

arbol

I'm pretty sure mycroft sends your speech snippets to Google for processing so it's not exactly offline.

https://mycroft-ai.gitbook.io/docs/using-mycroft-ai/customiz...

I'm currently trying to setup a deepspeech server on my raspberry pi to see if it works ok for commanding spotify.

Edit: just realised you said `TTS` not `STT`

smcameron

pico2wave with "-l=en-GB" option to get the British lady voice is pretty decent (way better than the other voices it does for some reason).

solarkraft

Are you thinking about reimplementing Mycroft?

Mycroft has done a lot of cool and important work in the field to ship an actual personal assistant product (stuff like wake word detection).

dom96

hah, of course someone had the idea already and executed on it. But yeah, basically that but without the screen (probably would go a long way to decrease the cost, $299 is pretty steep for such a device)

sheepybloke

One thing they don't touch much on is the STT, as they use models from third parties. You could definitely do something that utilizes this model and then feeds the tokens to some of their parsing code. I've been working on something similar to this, but burned out around adding the STT portion [0].

[0]: https://github.com/Sheepybloke2-0/trashbot - It was called trashbot because the final implementation was going to look like oscar the grouch in a trashcan displaying the reminders.

MayeulC

Well, you can always install Mycroft on a Pi, or on your computer.

Almond is also interesting as a voice assistant, though I think it doesn't perform speech recognition itself.

mwlp

Super impressive. I tested it on a Japanese streamer whose enunciation isn't exactly perfect and it did a decent job: https://www.youtube.com/watch?v=ROiOU1scaNA

  [00:00.000 --> 00:06.500]  Since the last one started, the number of times I've eaten has decreased.
  [00:06.500 --> 00:11.000]  If I get too carried away with the last one, I'll get hungry and do it.
  [00:11.000 --> 00:14.500]  I don't have time to eat.
  [00:15.500 --> 00:18.000]  I'm going to eat now.
  [00:20.000 --> 00:23.000]  It's going to take about 10 minutes from here.
  [00:23.000 --> 00:31.000]  It's been a while since I've had my last meal.
  [00:31.000 --> 00:36.000]  I feel like I'm losing my女子力.
  [00:36.000 --> 00:39.000]  I have to go back to my original self.
  [00:39.000 --> 00:44.000]  I have to get ready and go to bed.
  [00:44.000 --> 00:46.000]  It's not good.
  [00:46.000 --> 00:51.000]  I've been drinking a lot lately, so I'm going home.
  [00:51.000 --> 00:53.000]  I have to get my nails done this fall.
  [00:53.000 --> 00:54.000]  Halloween nails.
  [00:54.000 --> 00:57.000]  Halloween, Halloween, Halloween.
  [00:57.000 --> 00:59.000]  I'm going to the beauty salon today.
  [00:59.000 --> 01:02.000]  I'm going to get my nails done the day after tomorrow.
  [01:02.000 --> 01:10.000]  I used to look at a lot of clothes, but I stopped looking at them.
  [01:10.000 --> 01:12.000]  I'm going crazy.
  [01:12.000 --> 01:22.000]  My stomach's stopped in the middle of summer.

magicalhippo

It's struggling with Norwegian. Which I guess isn't shocking. The large model performs a fair bit better than the small, though neither is "good".

Though I assume the amount of Norwegian it has been exposed to is fairly limited, so in that light I'm actually impressed as well.

I tried it on a news segment from the radio[1], this is the large model output:

    [00:14.000 --> 00:17.200]  En skamløs krenking av FN pakten.
    [00:17.200 --> 00:24.000]  USAs president og verdensledere svarer på den russiske presidentens atomtrusler og krigsmobilisering.
    [00:25.500 --> 00:29.400]  Arbeidsklær som er ment til å være til begge kjønn, har det med å være tilpasset.
    [00:29.400 --> 00:33.400]  Men hvordan ville det gått, om det var motsatt?
    [00:34.100 --> 00:38.900]  Dyrevernsorganisasjon vil ha digital merking av regnstyr,
    [00:38.900 --> 00:44.900]  men næringen selv insisterer på den gamle tradisjonsrike måten med rissing av kniv.
    [00:45.600 --> 00:51.400]  Mange strømselskaper er positive til å tilby kundene fastpris på strøm, og det årevis.
    [00:51.400 --> 00:59.900]  Da risikerer de å måtte betale mye i nettopp åretsvis, sier aktører som aldri tilbyr fastpris.
    [00:59.900 --> 01:21.900]  Dette er onsdagens Dagsnytten. Jeg heter Espen Ås.
For reference, here's what he actually said, from the source[1] itself:

    * En skamløs krenking av FN-pakten. USAs president og verdensledere svarer på den russiske presidentens atomtrusler og krigsmobilisering.
    * Arbeidsklær som er ment å være til begge kjønn, er som regel tilpasset ... menn. Hvordan hadde det gått om det var motsatt?
    * Dyrevernsoganisasjon vil ha digital merking av reinsdyr, men næringen selv insisterer på den gamle tradisjonsrike måten med rissing av kniv.
    * Mange strømselskaper er positive til å tilby kundene fastpris på strøm - og det i årevis.
    - Da risikerer de å måtte betale mye i nettopp; årevis, sier aktør som aldri tilbyr fastpris
    Dette er onsdagens Dagsnytt 18 - jeg heter Espen Aas.
The translation didn't fare that well though:

    [00:14.000 --> 00:17.000]  A shameless violation of the UN treaty.
    [00:17.000 --> 00:24.000]  The US president and world leaders respond to the Russian president's nuclear threats and war mobilization.
    [00:24.000 --> 00:33.000]  Work clothes that are meant to be for both genders have to be suitable, but how would it be if it was the other way around?
    [00:34.000 --> 00:44.000]  The animal welfare organization will have a digital marking of reindeer, but the industry itself insists on the old traditional way of tearing a knife.
    [00:45.000 --> 00:51.000]  Many electricity companies are positive in offering customers fixed electricity prices, and that is annual.
    [00:51.000 --> 00:58.000]  Then they risk having to pay a lot in just a year, says an actor who has never offered fixed prices.
    [00:58.000 --> 01:20.000]  This is Wednesday's Dagsnytt 18. My name is Espen Ås.
For reference, here's Google Translate's attempt, which is pretty good:

    * A shameless violation of the UN Charter. The US president and world leaders respond to the Russian president's nuclear threats and war mobilization.
    * Work clothes intended for both sexes are usually adapted to ... men. How would it have gone if it had been the other way around?
    * Animal welfare organizations want digital marking of reindeer, but the industry itself insists on the old, traditional way of marking with a knife.
    * Many electricity companies are positive about offering customers a fixed price for electricity - and for years.
    - Then they risk having to pay a lot in precisely; for years, says a player who never offers a fixed price
    This is Wednesday's Dagsnytt 18 - my name is Espen Aas.

[1]: https://radio.nrk.no/podkast/dagsnytt_atten/l_5ce3e323-97a3-... (not sure if it's available outside of Norway)

magicalhippo

Re-reading the transcription, I guess I was a bit harsh by saying it's not "good". It gets most of it right, but it keeps messing up some key words. Like "regnstyr" (not a word) rather than "reinsdyr" (reindeer), or "Dagsnytten" rather than "Dagsnytt 18".

It also didn't handle the hanging "... menn", instead thinking it was the start of the following sentence. Almost everyone would understand it was the end of the sentence based on the context.

The double-A vs Å is not an issue as it's the same letter, double-A is the older form.

The small model was considerably worse than the large one though.

karencarits

I am impressed; some of the words are not that common, such as atomtrusler, krigsmobilisering, strømselskaper and dyrevernsorganisasjon, yet it got them right.

perlgeek

Everything (and everyone, including myself :D ) seems to struggle with Norwegian; it seems the corpus size is simply too small. And/or maybe the market is.

Deepl didn't do any Norwegian last I looked, even though it does most other Germanic languages (including Danish and Swedish).

Duolingo doesn't have a Norwegian course for German speakers either, though they do have one with English as the source language.

olao99

How are you getting the transcription of the NRK episode? I am learning Norwegian and often struggle to find reliable transcriptions for audio where the text exactly matches the audio (often subtitles are heavily edited compared to what's actually being said)

magicalhippo

The stuff I quoted was listed as an abstract of sorts for the episode. I know NRK is very good at providing subtitles for their TV productions, but as you say they're abbreviated.

I'm guessing audio books along with the actual books would be the best source for that? I mean, there's Mozilla Common Voice, but it's quite limited in the Norwegian department and perhaps not quite as interesting as an audio book would be.

alach11

How long until this gets implemented in Twitch? Real-time subtitles for any stream in the language of your choice?! That would be huge.

adeptima

Translation is not the strongest part; transcription looks very good.

shpx

We shouldn't call this open source. The model definition + the data is the source code. The model weights are a compilation artifact.

> The source code must be the preferred form in which a programmer would modify the program. [...] Intermediate forms such as the output of a preprocessor or translator are not allowed.

> https://opensource.org/osd

If I asked a programmer from OpenAI to modify the model to better support Japanese speakers from Hokkaido, their "preferred form" of the model's source code would include the 680,000 hours of audio used to train the model.

Yes that means that there are almost no open source models and yes it's awesome that they released this and made the weights available. Just don't call it open source.

pabs3

The Debian deep learning team's machine learning policy would call this a "toxic candy" model:

https://salsa.debian.org/deeplearning-team/ml-policy

BTW, wouldn't you take the existing model and do additional Hokkaido Japanese speaker training on top of it, rather than retraining the model from scratch?

rvz

Yes. It's just like calling the release of compiled closed binary blobs 'open source' even when the source for reproducing the compiled output is unavailable.

> If I asked a programmer from OpenAI to modify the model to better support Japanese speakers from Hokkaido, their "preferred form" of the model's source code would include the 680,000 hours of audio used to train the model.

Precisely. These 'users' lifting the model can't do it themselves. You will still be contacting OpenAI for support or to add support for another language and they will be the ones able to modify the model.

> Just don't call it open source.

That is true, it is still closed source, and already we are seeing the hype squad apologising for OpenAI because they 'open sourced' a closed model that you can't modify yourself.

OpenAI is still business as usual and nothing has changed.

MacsHeadroom

>You will still be contacting OpenAI for support or to add support for another language and they will be the ones able to modify the model.

This isn't quite correct. The model weights are all you need to fine-tune the model on your own audio.

Without the original training set this still isn't open source. But you aren't powerless to modify the model without the original training set.

nl

This isn't really true.

You can do a lot with weights and no training data - for example you can pull the end layer off it and use it as a feature extractor.
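
For instance, a minimal sketch of the feature-extractor idea using the released whisper package (the file name is just a placeholder):

    import whisper

    model = whisper.load_model("base")

    audio = whisper.load_audio("clip.wav")      # anything ffmpeg can read
    audio = whisper.pad_or_trim(audio)          # fixed 30-second window
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # (1, n_frames, d_model) embeddings, usable as input features elsewhere
    features = model.encoder(mel.unsqueeze(0))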

And to modify it for Japanese speakers you'd fine-tune the existing model on additional data. If you wanted to modify the model you can (sometimes, depending on what you want to do) modify an existing architecture by removing layers, adding replacements and fine-tuning.

I don't quite know what the right analogy for trained weights is. In many ways they are more valuable than the training data, because the compute needed to generate them is significant. In other ways it is nice to be able to inspect the data.

> The source code must be the preferred form in which a programmer would modify the program.

As a machine learning programmer I'd much prefer the weights to the raw data. It's not realistic for me to use that training data in any way with any compute I have access to.
