
simiones

I read the comments praising these voices as very lifelike, and went to the page primed to hear very convincing voices. That is not at all what I heard, though.

The voices are decent, but the intonation is off on almost every phrase, and there is a very clear robotic-sounding modulation. It's generally very impressive compared to many text-to-speech solutions from a few years ago, but for today, I find it very uninspiring. The AI generated voice you hear all over YouTube shorts is at least as good as most of the samples on this page.

The only part that seemed impressive to me was the English + (Mandarin?) Chinese sample; that one seemed to switch very seamlessly between the two. But this may well be simply because (1) I'm not familiar with any Chinese language, so I couldn't really judge the pronunciation, and (2) the different character systems make it extremely clear that the model needs to switch between languages. Peut-être que cela n'aurait pas été si simple ["Perhaps it would not have been so simple"] if it had been switching between two languages using the same writing system - I'm particularly curious how it would have read "simple" in the phrase above (I think it should be read with the French pronunciation, for example).

And, of course, the singing part is painfully bad, I am very curious why they even included it.

Uehreka

Their comments about the singing and background music are odd. It’s been a while since I’ve done academic research, but something about those comments gave me a strong “we couldn’t figure out how to make background music go away in time for our paper submission, so we’re calling it a feature” vibe as opposed to a “we genuinely like this and think it’s a differentiator” vibe.

phildougherty

Totally felt the same way! Singing happens spontaneously? What?

lyu07282

They mention that in the FAQ here: https://github.com/microsoft/VibeVoice/tree/main?tab=readme-...

> In fact, we intentionally decided not to denoise our training data because we think it's an interesting feature for BGM to show up at just the right moment. You can think of it as a little easter egg we left for you.

It's not a bug, it's a feature! Okaaaaay

jstummbillig

Is there any better model you can point at? I would be interested in having a listen.

There are people – and it does not matter what it's about – that will overstate the progress made (and others will understate it, case in point). Neither should put a damper on progress. This is the best I personally have heard so far, but I certainly might have missed something.

Uehreka

It’s tough to name the best local TTS since they all seem to trade off on quality and features and none of them are as good as ElevenLabs’ closed-source offering.

However Kokoro-82M is an absolute triumph in the small model space. It curbstomps models 10-20x its size in terms of quality while also being runnable on like, a Raspberry Pi. It’s the kind of thing I’m surprised even exists. Its downside is that it isn’t super expressive, but the af_heart voice is extremely clean, and Kokoro is way more reliable than other TTS models: It doesn’t have the common failure mode where you occasionally have a couple extra syllables thrown in because you picked a bad seed.

If you want something that can do convincing voice acting, either pay for ElevenLabs or keep waiting. If you’re trying to build a local AI assistant, Kokoro is perfect, just use that and check the space again in like 6 months to see if something’s beaten it. https://huggingface.co/hexgrad/Kokoro-82M

refulgentis

There's a certain know-nothing feeling I get that makes me worried if we start at the link (which has data showing it beats ElevenLabs on quality), jump to "eh, it's actually worse than anything I've heard in the last 2 years", and end up at "none are as good as ElevenLabs". The recommendation and the commentary on it, of course, have nothing to do with my feeling, cheers

sandreas

What is your opinion about F5-TTS or Fish-TTS?

lynx97

I cobbled together llm-tts to run as many local (and remote) TTS models as I could find and get working.

https://github.com/mlang/llm-tts

Strictly speaking, even music generation fits the usage pattern: text in, audio out.

llm-tts is far from complete, but it makes it relatively "easy" to try a few models in a uniform way.

nipponese

Not open-source or local, but just try ChatGPT Voice Conversation mode. To my ears, it's a generation ahead of these VibeVoice samples.

riquito

Probably not even the best ones, but among some recent models I find Dia and Orpheus more natural

- http://dia-tts.com/

- https://github.com/canopyai/Orpheus-TTS

popalchemist

Higgs Audio v2 is currently SOTA in open-source TTS.

satellite2

Elevenlabs v3 (not local)

whimsicalism

i think orpheus and sesame sound better

rcarmo

One of the things this model is actually quite good at is voice cloning. Drop a recorded sample of your voice into the voices folder, and it just works.

watsonmusic

bonus usage

IshKebab

I agree. For some reason the female voices are waaay more convincing than the male ones too, which sound barely better than speech synthesis from a decade ago.

selkin

Results correlate to investment, and there’s more of it in synthesizing female-coded voices. As for why female-coded voices get more investment, we all know; the only difference is in our attitude toward that (the correct answer, of course, is “it sucks”).

recursive

We all know? Female voices have better intelligibility? That's my guess anyway.

odie5533

It's good but not the best free model. I find Chatterbox to be more realistic, with none of the robotic sound and better (though not perfect) intonation.

lastdong

Chatterbox sounds great, their demo page is a good introduction: https://resemble-ai.github.io/chatterbox_demopage/

eaglehead

I agree. We switched from elevenlabs to chatterbox (hosted on Resemble.ai) and it is much much cheaper and better.

iansinnott

The English/Mandarin section was VERY impressive. The accents of both the woman speaking English and the man speaking Chinese were spot on. Both sound very convincingly like someone speaking a second language; anyone here can hear that in the Chinese woman's English voice, and I'd add that the foreigner speaking Chinese was also spot on.

echelon

This is close to SOTA emotional performance, at least the female voices.

I trust the human scores in the paper. At least my ear aligns with that figure.

With stuff like this coming out in the open, I wonder if ElevenLabs will maintain its huge ARR lead in the field. I really don't see how they can continue to maintain a lead when their offering is getting trounced by open models.

kamranjon

Hmmmm… what is your opinion on the examples showcased here vs the ones on the Dia demo page?

https://yummy-fir-7a4.notion.site/dia

I am not sure why but I find the pacing of the parakeet based models (like Dia) to be much more realistic.

watsonmusic

11labs is facing a real competitor

skripp

The male Chinese speakers had THICK American accents. Nothing really wrong with the language, but think of the stereotypical German speaking English. That was kind of strange to me.

ascorbic

I think it's because it was using the American voice for it. Conversely the female voice in the Mandarin conversation spoke English with a Chinese accent.

giancarlostoro

I really hope someone within Microsoft is naming their open source coding agent Microsoft VibeCode. Let this be a thing. It's either that or "Lo", then you can have Lo work with Phi, so you can vibe code with Lo Phi.

https://techcommunity.microsoft.com/blog/azure-ai-foundry-bl...

simiones

Knowing the history of Microsoft marketing, it will either be called something like "Microsoft Copilot Code Generator for VSCode" or something like "Zunega"...

giancarlostoro

Well, don't forget "Microsoft SQL" ;) They'll name something as though they invented it, and now you have the worst possible way to google it.

kelvinjps10

For me it doesn't sound like they invented it, but that it's Microsoft's version of SQL. Idk, but I hate Microsoft's version of anything.

loloquwowndueo

“Microsoft Word” haha reminds me of the old joke : “Microsoft Works” is an oxymoron.

parineum

Just like MariaDB sounds as though they invented databases, right?

cush

Later renamed to Microsoft Zune, a handheld AI companion that lives in your pocket

polytely

GitHub Dotnet Copilot Code Generator for VSC (new)

yellowapple

Microsoft Copilot .NET for Workgroups

airstrike

Now I need a new project just so I can call it Zunega... lmao

undefined

[deleted]

malnourish

This is clearly high quality but there's something about the voices, the male voices in particular, which immediately register as computer generated. My audio vocabulary is not rich enough to articulate what it is.

heeton

I'm no audio engineer either, but those computer voices sound "saw-tooth"y to me.

From what I understand, the more basic models/techniques undersample, so there is a series of audio pulses that gives the output that buzzy quality. Better models produce smoother output.

https://www.perfectcircuit.com/signal/difference-between-wav...
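A tiny pure-Python sketch (not from the thread, just an illustration) of why a sawtooth-like signal reads as "buzzy": its periodic discontinuities inject strong high-frequency content, which a crude "sharpness" measure over adjacent samples can pick up. The constants and the `sharpness` metric are invented for demonstration.

```python
import math

SAMPLE_RATE = 8_000  # deliberately low rate, for illustration only
FREQ = 220.0         # A3

def sine_wave(n: int) -> list[float]:
    """Smooth waveform: a single harmonic, no buzz."""
    return [math.sin(2 * math.pi * FREQ * t / SAMPLE_RATE) for t in range(n)]

def sawtooth_wave(n: int) -> list[float]:
    """Sawtooth: ramps up and snaps back each period. The snap is a
    discontinuity, i.e. lots of high harmonics, heard as a buzzy edge."""
    period = SAMPLE_RATE / FREQ
    return [2.0 * ((t / period) % 1.0) - 1.0 for t in range(n)]

def sharpness(samples: list[float]) -> float:
    """Largest second difference between samples: a crude proxy for how
    abruptly the waveform changes direction (its high-frequency content)."""
    return max(abs(samples[i + 1] - 2 * samples[i] + samples[i - 1])
               for i in range(1, len(samples) - 1))

saw = sawtooth_wave(SAMPLE_RATE)
sin_ = sine_wave(SAMPLE_RATE)
# The sawtooth's snap-back dwarfs the sine's gentle curvature.
print(f"sawtooth sharpness: {sharpness(saw):.3f}")
print(f"sine sharpness:     {sharpness(sin_):.3f}")
```

The gap between the two numbers is the "buzz": smoother synthesis moves energy out of those abrupt transitions.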

codebastard

I would describe it as blocky: as if, were we to visualise the sound wave, it would have no peaks, being cut off at the top and bottom, producing a metallic, boxy echo.

jofzar

Yeah, it sounds super low-bitrate to me; reminds me of someone on a Bluetooth microphone.

lvncelot

After hearing them myself, I think I know what you mean. The voices get a bit warbly and sound at times like they are very mp3-compressed.

strangescript

The male voices seem much worse than the female voices, borderline robotic. Every sample on their website starts with a female voice. They are clearly aware of the issue.

jsomedon

I felt the same, male voice feels kinda artificial.

davorak

Any insight on why the code and the large model were removed? Some copies are floating around and are MIT licensed. In cases like this I do not know why the projects are yanked. If the project was mistakenly released under MIT and copied elsewhere, is any damage control possible by yanking the copies you have control over? Mostly it seems like bad PR, if minor.

androiddrew

Ok anyone have a link to the code and weights?

fivestones

Wondering this too.

aargh_aargh

Is there a current, updated list (ideally, a ranking) of the best open weights TTS models?

I'm actually more interested in STT (ASR) but the choices there are rather limited.

Uehreka

Yes: https://huggingface.co/models?pipeline_tag=text-to-speech

Generally if a model is trending on that page, there’s enough juice for it to be worth a try. There’s a lot of subjective-opinion-having in this space, so beyond “is it trending on HF” the best eval is your own ears. But if something is not trending on HF it is unlikely to be much good.

odie5533

Best TTS: VibeVoice, Chatterbox, Dia, Higgs, F5 TTS, Kokoro, Cosy Voice, XTTS-2.

kroaton

Unmute.sh (same team as Kokoro) gets slept on, but it's really good.

xnx

Click leaderboard in the hamburger menu: https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2

prophesi

Is there a way to filter out hosted models? The top three winners currently are all proprietary as far as I can tell.

edit: Ah, there's a lock icon next to the name of each proprietary model.

odie5533

That's a highly incomplete comparison

watsonmusic

yes the best

TheAceOfHearts

Unfortunately it's not usable if you're GPU-poor. Couldn't figure out how to run this with an old 1080. I tried VibeVoice-1.5B on my old CPU with torch.float32 and it took 832 seconds to generate a 66 second audio clip. Switching from torch.bfloat16 also introduced some weird sound artifacts in the audio output. If you're GPU-poor the best TTS model I've tried so far is Kokoro.

Someone else mentioned in this thread that you cannot add annotations to the text to control the output. I think for these models to really level up there will have to be an intermediate step that takes your regular text as input and generates an annotated output, which can then be passed to the TTS model. That would give users way more control over the final output, since they would be able to inspect and tweak the details instead of expecting the model to get everything correct in a single pass.
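The two-stage idea above can be sketched as a pipeline where the intermediate annotation is plain text the user can inspect and edit before synthesis. Everything here is hypothetical: `annotate`, its SSML-like tags, and the `synthesize` stub are invented for illustration and are not any real model's API.

```python
import re

def annotate(text: str) -> str:
    """Hypothetical first stage: wrap plain text in SSML-style control
    tags. Here we just mark questions as rising and exclamations as
    emphatic; a real annotator would emit far richer markup."""
    out = []
    for sentence in re.split(r'(?<=[.!?])\s+', text.strip()):
        if sentence.endswith('?'):
            out.append(f'<prosody contour="rising">{sentence}</prosody>')
        elif sentence.endswith('!'):
            out.append(f'<emphasis>{sentence}</emphasis>')
        else:
            out.append(sentence)
    return ' '.join(out)

def synthesize(annotated: str) -> bytes:
    """Stand-in for the second stage: a real system would hand the
    (possibly user-edited) markup to a TTS model that honors the tags."""
    return annotated.encode('utf-8')  # placeholder, not real audio

markup = annotate("Nice to meet you! Where are you from? I grew up nearby.")
print(markup)  # user inspects/tweaks this before calling synthesize(markup)
```

The point is the seam in the middle: because the annotation is text, the user gets a place to intervene between the language decisions and the audio rendering.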

tempodox

This is ludicrous. macOS has had text-to-speech for ages with acceptable quality, and they never needed energy- and compute-expensive models for it. And it reacts instantly, not after ridiculous delays. I cannot believe this hype about “AI”, it’s just too absurd.

NitpickLawyer

> with acceptable quality

Compared to IBM's Stephen Hawking chair, maybe. But Apple TTS is not acceptable quality by any modern understanding of SotA, IMO.

selkin

Different use cases:

If you need a non-visual output of text, SotA is a waste of electrons.

If you want to try and mimic a human speaker, then it isn't.

The question is why you would need the computer to sound more human, except for "because I can".

baxuz

Looking forward to the day when tts and speech recognition will work on Croatian, or other less prevalent languages.

It seems that only variants of English, Spanish, and Chinese are somewhat working.

lukax

Have you tried Soniox for speech recognition? It supports Croatian. Or are you just looking for self-hosted open-source models? Soniox is very cheap ($0.1/h for async, $0.12/h for real-time) and you get $200 free credits on signup.

https://soniox.com/

Disclaimer: I used to work for Soniox

baxuz

I meant in general purpose tools from Google and Apple. Most of this assistant and "AI" stuff is practically useless for me because I refuse to talk to my devices in English.

In Android Auto / CarPlay I can't even get voice guidance that works properly, much less reading notifications, or composing a reply using STT

Insanity

What an odd name to me, because "vibe" is, in my mind, equal to somewhat poor quality. Like "vibe coding". But that's probably just some bias on my side.

mxfh

Vibe coding only became a term this spring. I doubt that the substantial parts, like giving the research project a code name and getting company approval, started after that. It's not like vibe has a negative connotation in general yet.

Insanity

'Vibe' as a word / product was definitely less common though. I kinda doubt that 'VibeVoice' is _not_ a consequence of 'VibeCode'.

But I do agree with you in that generally there's probably no negative connotation (yet).

andrew_lettuce

Vibe always meant "specific feel", and it makes sense in relation to AI coding "by touch" vs. understanding what's actually happening. It's just that the results have now made the word pejorative.

rafaelmn

The Spontaneous Emotion dialog sounds like a team member venting through LLMs.

They could have skipped the singing part, it would be better if the model did not try to do that :)

kridsdale1

It did get me to look up the song [1] again though, which is a great stimulator of emotion. The robot singing has a long way to go.

1. https://music.youtube.com/watch?v=xl8thVrlvjI&si=dU6aIJIPWSs...

eibrahim

Hahahah. Thats what I thought too

Meneth

Open-source, eh? Where's the training data, then?

Joel_Mckay

Scraped data is often full of copyright, usage agreement, and privacy law violations.

Making it "open" would be unwise for a commercial entity. =3

zoobab

Open source is being abused to not provide the actual source. Stop this.

Joel_Mckay

A lot of code has multiple FOSS licenses that are not contaminating like the GPL. GPL violations do occur on code, but have nothing to do with the training data.

For example, many academic data sets are not public domain, and can't be used in a commercial context. A GPL claim on that data is often an argument of which thief showed up first.

Rule #24: A lawyer's Strategic Truth is to never lie, but also to avoid voluntarily disclosing information that may help opponents.

Thus, a business will never disclose they paid a fool to break laws for them... =3

crvdgc

Very impressive that it can reproduce the Mandarin accent when speaking English and English accent when speaking Mandarin.


VibeVoice: A Frontier Open-Source Text-to-Speech Model - Hacker News