This Voice Doesn't Exist – Generative Voice AI

blog.elevenlabs.io

piotr11

Hey - developers behind ElevenLabs here. Thank you so much for the constructive and positive feedback - we’re taking it onboard!

We’re currently focused on researching and deploying a different approach to speech synthesis, one that can generate nuanced intonation and emotions by understanding text and taking context into account. Additionally, we provide creators with a way to clone their own voice based on very short samples. With the published blog post, we are now deploying a way to help them design entirely new voices!

Anyone will be able to generate that level of quality just with a copy-paste. We are planning to open up Beta later this month. Our goal is to let you convert any written content into high-quality, compelling audio.

To address a few questions that frequently came up:

- Latency for our streaming TTS is <1s at the quality of the samples in the post; high latency is the usual problem with existing good TTS models (like tortoise-tts)

- We can clone voices instantly, based just on 5s of speech, without training required

- We are working on adding SSML-like support for better control; speed controls will be coming as part of that too

- API is directly available as part of Beta; we are preparing the infrastructure to scale easily for the release!

We are hiring researchers, frontend and full-stack developers! If you are interested, send over your GitHub account and a short message to founders[at]elevenlabs.io.

diminikolaou

Hey Piotr - just wanted to say congratz for the awesome work so far man. The quality is genuinely unbelievable. I don't know if you guys are ready to take clients at scale, but I don't see any reason why all newsletter creators wouldn't use your tech right now to address whole new markets. I'll be following the journey, excited for what's to come.

hiisukun

Maybe I'm late to the party -- but this [1] graphic is great in the linked article.

Could the designer share a little about how it was made? Does it represent one of the generated voices, or is it just 'artistic'? (both are cool, I think).

[1] https://blog.elevenlabs.io/content/images/2023/01/Sequence-0...

fireant

The voices are really amazing, I couldn't really tell that they are synthetic and I was looking for it.

The only issue is that the actual recordings sound like they have been overcompressed, or poorly recorded - is there any way to improve this? Something like superresolution, but for voice?

rexreed

What is your business model? How are you deciding who gets Beta access? What does the voice generation interface look like?

matisqe

We are offering both Speech Synthesis (/TTS) and Voice Lab (Rapid Voice Cloning and Voice Design) as a standard SaaS model (w/ fixed quota of characters you can voice per month). API is directly available on the platform. Outside the standard package, that flips to a usage-based model, and we do tailored deals for custom needs and discounts for high-volume usage.

We are currently testing the Beta with a range of storytelling and publishing use cases, tackling relevant feedback, and making sure the infrastructure supports it. We are planning to open up Beta to everyone by end of this month.

The Voice Design interface is currently a set of sliders and toggles, but we are iterating on what is most accessible.

TheMrZZ

Hi! Are your models English-only, or do you plan on tackling other languages?

piotr11

They will be multi-lang, the tech scales to any language and we are working to add more (it is relatively easy). Here is the demo in Polish TTS: https://www.youtube.com/watch?v=ra8xFG3keSs

pronlover723

What are the odds of this kind of thing being open source so I can use it at home? So far, most of the "good" text-to-speech systems are all commercial services:

https://aws.amazon.com/polly/

https://cloud.google.com/text-to-speech

https://azure.microsoft.com/en-us/products/cognitive-service...

And now this one is also a service.

I tried using tortoise-tts on my M1. Generating a 7-minute speech took 3 days and, while better than the 15-year-old text-to-speech built into the OS, it wasn't close to the quality of the services above. Maybe I don't know how to use it, but of course it's not as simple as text-to-speech. You ideally need the system to understand the text so it can act out parts.

Of course see my username. I want to generate personal adult content so I'd prefer not to upload it to a service.

yreg

Any time I see AI model news on hn nowadays, my first question is whether I can run it locally, and if not, what are the alternatives that I can run locally.

EarlKing

> what are the alternatives that I can run locally

...you will be disappointed by the answers to that question for the foreseeable future.

yreg

I'm the opposite of disappointed. The amount of public pretrained models that have been popping up recently is crazy.

Roark66

The speed of progress on this front is increasing. These days even "cheap" Rockchip MCUs are packing 5 TOPS AI accelerators. And both AMD and Intel are working on much more powerful ones for their CPUs. Heck, I recently wrote a mobile (Android) app that runs pretty powerful AI for intensive image processing locally on mobile phones, thinking improved privacy would be more in demand than sending everything "to the cloud". I was mildly surprised to discover most people don't care (after writing the app). Still, I wouldn't be surprised if in 10 years the majority of AI people use runs on end-user devices.

yreg

Yeah, most people don't care, but it might also be the case that many people who care use iOS, since that's the platform where all photo machine learning provided by the system happens on device.

kerpotgh

That’s because you’re running tortoise on a CPU. It does about a sentence a minute on my 3090 gpu. It’s also quite good if you pick “high quality” and train it with 10 sec clips at the framerate and bitrate it asks for.

taf2

You could try https://github.com/mozilla/TTS

mirkonasato

Effectively superseded by https://github.com/coqui-ai/TTS

hackernewds

What kind of personal adult content do you generate? We are curious for details

didericis

I can't tell if I'm starting to get that old person "new things are scary" instinct or if my gut level of fear about the implications of these things is warranted.

As impressive as a lot of these models are, I can't help but feel like they're going to end up making an incredible amount of sterile soulless content that makes everyone's lives worse. We're already drowning in ad dominated cynical soulless computer generated search results. Are all online forums going to end up being drowned out by cynical pumped out super cheap to produce simulacrums of creative content now too?

If I want people to buy more Triscuits next year, what's stopping me from writing a bunch of prompts to insert subtle marketing cues to buy Triscuits with entire fake ecosystems of users, fan art, radio call-ins, user stories, etc. in like every niche community in existence and flooding them with soulless fake interaction?

That exists to a certain extent already, but I don't see how this stuff won't make it way easier, way more effective, and way more widespread.

spaceman_2020

My YouTube feed is currently filled with videos of whitehats hacking into Indian scam call centers.

Most of the time, the giveaway is the callers' Indian accent. If you could simply type into a box and speak with an American accent, it would be really hard to get caught.

We're opening a Pandora's box here if I'm honest. I'm hardly pro-regulation, but good God, we're playing with things here that can really hurt us down the line.

shaky-carrousel

> If you could simply type into a box and speak with an American accent, it would be really hard to get caught.

Not really. If they say "kindly do something", they are Indian scammers.

tux3

Yes, however if that were a problem in the scenario above, I'm pretty sure LLMs could fix that as well.

They're already very good at translation today, it stands to reason that they could do the needful when it comes to turning regional English into American English. Or Bri'ish English, if that's the accent you want your TTS model to have.

spaceman_2020

"Hey chatGPT, write a short script convincing someone that I'm from a small town in America"

moffkalast

And here I thought the giveaway was just them trying to blatantly scam you.

pclmulqdq

There are a lot of words and phrases that indicate that you are speaking Indian English, separate from the accent. Using "learning" as a noun is a very common one in tech.

spaceman_2020

If you can use AI to create a fake voice, you can also use AI to create a prompt for the voice

sebzim4500

I'm sure that could be corrected by even a very basic language model

barking_biscuit

>As impressive as a lot of these models are, I can't help but feel like they're going to end up making an incredible amount of sterile soulless content that makes everyone's lives worse.

My sentiments exactly. I think it's a bit of column A and a bit of column B. I'm reminded of the quote "everything has its pleasure and its price". The more expensive things are to produce, the less of it there will be, but what is produced will be higher quality across the board. The less expensive it becomes to produce, the more of it there will be, and the aggregate quality will be lower.

It's not always a bad thing, but the downsides are plain to see when you look at the amount of spam and low-effort content out there. That said, we've all massively enjoyed the upsides too, so it's a balancing act. I think where things were at before the recent wave of generative AI tools was perhaps right on the sweet spot of "it's democratized enough that anyone can have a go, but still requires effort and a degree of talent to do well". The knowledge (and entertainment) I've been able to access thanks to randoms on YouTube is pretty incredible, and I sort of always just accepted the avalanche of spam and clickbait that came with it.

These new tools potentially push that effort/reward ratio to the point where the signal/noise ratio simply gets too low. Of course the "make money online" community is all over this stuff, and today I watched a video of a guy showing how you could supposedly clone courses on Udemy using ChatGPT and other tools etc. The problem is the "course" would literally consist of generic advice, high-level information on a particular topic that suffices only as a very surface-level introduction and isn't enough to help you build any functional skills in that domain, so it's effectively useless. The only person it's not useless to is him, as he would pocket a cool $5-ish per sale. It was somewhat sad and somewhat sick to hear him cackling away about being able to con people out of money while passing himself off as an expert.

And yet, it's entirely what I would expect would happen.

dwighttk

>The knowledge (and entertainment) I've been able to access thanks to randoms on YouTube is pretty incredible, and I sort of always just accepted the avalanche of spam and clickbait that came with it.

Adblockers are amazing

barking_biscuit

Are there "influencer" blockers?

ghaff

I suppose the optimistic view--such as it is--is that there is already a vast amount of low quality content out there that was created for pennies and plastered with ads and/or hoping someone will pay a modest amount. So I'm not sure that things like ChatGPT make things that much worse than they already are--and we can mostly live with things today. The pessimistic view of course is a whole new cohort of grifters decide to give it a run whether they ultimately make money or not.

vouaobrasil

I agree with this completely. Technology has always made us trade quality for low-quality quantity in exchange for convenience. People now interact more through technology which removes a lot of body language and other enriching experiences.

The most dangerous aspect of this is that each step seems relatively harmless: right now, ChatGPT and DALL-E are amusements, but each small step is building a monstrous and as you say, soulless machine that overloads us so much that we will forget what it's like to even be human.

I firmly believe (and I have given this a lot of thought) that technology is ultimately evil, and that tech companies are trading short term gain of enormous wealth for the very essence of humanity, preying upon the basic instincts of individuals who are also trading their personal worth for convenience.

If I could have one single wish fulfilled in this world, it would be that every single human being gain a natural and instinctual revulsion for advanced technology. If someone asked me what disease was the worst that ever plagued humanity, it would not be smallpox or the flu or COVID, it would be the tech company.

hexage1814

It's mind-boggling that you criticize and demonize technology on a forum about technology, all while not only being on the internet but also using electricity, a computer, and certainly surrounded by gadgets and other amenities of modern life.

Nature is SHIT, that is why people created technology. There is nothing preventing you from going to the middle of nowhere and rejecting modernity; no one is forcing you, but you are here because you wanted it and liked it. You say people should have an "instinctual revulsion" towards technology, but not even you yourself have this reaction towards technology, because it is a stupid idea that not even luddites like you commit to.

If anything, the technology we have nowadays is not even 0.01% of what we should have. We should have the technology to make any movie anyone ever wanted to see in the blink of an eye, all done in the best quality ever imagined. We should have the power to build a Dyson Sphere around the sun to harness its energy. We should be able to construct fully immersive virtual reality, like San Junipero from the Black Mirror episode; we should have the power to extend human life indefinitely.

aktenlage

Why are you so hostile? What sense does it make to attack him because he does not already have what he is wishing for?

Nature is not "SHIT", for whatever that should mean. Neither the blanket statement "Technology is evil" nor "Nature is shit" make sense. We are humans. We need nature - it is what we evolved to and our technology is not able to replace it without loss. Specific technology is great to overcome existential limitations, but most technology is not.

Sure, there is great technology out there that improves our lives. On the other hand, there is so much technology that makes our lives worse (because of how it is used: e.g., by being of advantage to few people while being bad for everyone else, or by helping individuals now but having severe effects later on) that it can hardly be ignored that a better process for selection or containment of technology would be necessary to improve everybody's life. But mankind is bad at forgoing.

Current technology seems to be great at generating convenience and excitement. And the examples you mention (movies, infinite energy, VR and eternal life) feel like the wishes for more excitement of a teenager (and this is not meant condescendingly), but life is so much more than excitement. Excitement is just the cherry on top. I'd rather see more tech that is wholesome - but that area seems to be left to nature.

vouaobrasil

> It's mind-boggling that you criticize and demonize technology on a forum about technology, all while not only being on the internet but also using electricity, a computer, and certainly surrounded by gadgets and other amenities of modern life.

I don't demonize all technology. There must be an optimum somewhere, and I would like to engage in open discourse in order to understand where that optimum is. I believe advanced AI takes us away from the optimum.

Extending human life indefinitely is a terrible idea. We have a natural lifespan and we need to function within it. We should not proceed towards being saturated in technology as that will surely destroy the natural life on this planet.

Teever

https://knowyourmeme.com/memes/we-should-improve-society-som...

8note

There's no lack of revolutionary tech that has made life overall better with higher quality.

Even like, a Bic lighter is so much better quality than a flint and steel or fire sticks.

Smart phones are fantastic quality computers that enable cool stuff like meeting up with friends without first having to leave a note at their house some amount of time beforehand

Dishwashers and laundry machines and modern quality clothing let you avoid spending half your waking hours cleaning stuff, keeping us healthier, and enabling feminism

Electricity lets us stay awake at night without smoke inhalation from candles and fireplaces, with less likelihood of burning the house down, and advanced tech in housing standards make sure that when the building does catch fire, you'll be able to get out safely

Advancements in technology are mostly quite good, and improve both quality and convenience

vouaobrasil

Seems like for every advantage you list there's also a disadvantage.

> Even like, a Bic lighter is so much better quality than a flint and steel or fire sticks.

And is part of the disposable society creating immense amounts of waste.

> Smart phones are fantastic quality computers that enable cool stuff like meeting up with friends without first having to leave a note at their house some amount of time beforehand

Smartphones reduce the quality of social interaction. People often check them when they should be paying attention to their friend, and they make cancelling last-minute easier thereby making people more flaky.

> Dishwashers and laundry machines and modern quality clothing let you avoid spending half your waking hours cleaning stuff, keeping us healthier, and enabling feminism

It's hard to argue with you there, though I suspect that all these "time-saving" inventions also make it more likely that we will spend more time on other things like more work and on electronic devices.

> Electricity lets us stay awake at night without smoke inhalation from candles and fireplaces, with less likelihood of burning the house down, and advanced tech in housing standards make sure that when the building does catch fire, you'll be able to get out safely

And electricity has also made it easier to stay awake at night, staying up later and reducing the quality of sleep. Hundreds of people get worse sleep by being exposed to devices at night. I think it's actually nice to wind down activities when the sun goes down though obviously that is not as easy in the latitudes closer to the pole.

Basically, I think there are a lot of hidden dangers that people accept because in the short term they don't realize that technology makes life less fulfilling.

vagrantJin

> Dishwashers and laundry machines

> enabling feminism

That's all it took?

So if there's no electricity - it's back to square one?

snek_case

> If I could have one single wish fulfilled in this world, it would be that every single human being gain a natural and instinctual revulsion for advanced technology. If someone asked me what disease was the worst that ever plagued humanity, it would not be smallpox or the flu or COVID, it would be the tech company.

Time to go live in a cabin in the woods and go write your manifesto on a typewriter...

barking_biscuit

Typewriter? You heathen! It must be chiseled into the cave wall with a bone.

gmadsen

I think what gets lost in these doom and gloom predictions is that there is a large healthy portion of young adults that do not engage in internet forums or social media.

It is perfectly viable in the modern day, to work a job, have passionate hobbies, regularly meet for social events, volunteer, etc and spend minimal to zero time engaging on the internet, besides pragmatic things like map directions

beebmam

I would have died long ago without modern technology, and the many surgeries I have needed. It's hard to take your argument seriously when I consider the consequences of what you're advocating for.

sebzim4500

Yeah but you have to balance the positives and negatives. Sure you being alive is all very well, but sometimes GP has to overhear teenagers talking about TikTok, and that is unacceptable.

barking_biscuit

>I firmly believe (and I have given this a lot of thought) that technology is ultimately evil, and that tech companies are trading short term gain of enormous wealth for the very essence of humanity, preying upon the basic instincts of individuals who are also trading their personal worth for convenience.

I've more or less come to a pretty similar conclusion. I wouldn't characterize it as evil per se, but it's a fool's errand at best. My line of thinking goes somewhat like this - before the Neolithic revolution humans had an extremely small set of problems. The main problem being "what am I going to eat?", and to a large degree, life must have revolved around this problem almost entirely. There weren't that many people, there weren't that many problems, we somehow persisted in that state for hundreds of thousands of years with literally nothing to write home about. Any advance in technology has literally been trading one problem for at least three more. Now there are loads of problems, loads more people, and the standard approach to solving all the problems is to invent new technologies, which in practice seem to actually exacerbate the problems. So, I just sort of view the current state of things as "somewhere around the turn of the Neolithic Revolution we took a wrong turn, and it has widely been regarded as a bad move."

It's a weird sort of defeatist, nihilistic, melancholy worldview, but to be honest, I don't think we're wrong. I mean... what's the endgame of technology?

thatguy0900

I would put the optimal state around native American level of technology. At least some sense of medicine and first aid, food is largely figured out, but no real oppressive technologies figured out yet.

PurpleRamen

> Technology has always made us trade quality for low-quality quantity in exchange for convenience.

Technology evolves. Even if it may start with some low quality aspects, it doesn't need to stay that way.

> People now interact more through technology which removes a lot of body language and other enriching experiences.

Which is just different communication, not better, nor worse in general. Of course this kinda sucks for people who do not know the new communication-code well enough. But people do evolve communication to replace relevant missing parts. Body language for example was mostly replaced with emojis and memes, which can be better, or worse.

> we will forget what it's like to even be human.

You can't forget what you are. You are you every day, every minute, every second of your existence. What you speak about is people having a different culture from the one you know and understand. That's something completely different.

> technology is ultimately evil

Technology is a tool, it can't be evil or good. It's up to the users how they handle it.

vouaobrasil

> Technology is a tool, it can't be evil or good. It's up to the users how they handle it.

I fundamentally disagree with this premise. I believe evil is roughly equivalent to the inevitability of bringing about evil, and I believe AI falls under such a classification.

vouaobrasil

> Which is just different communication, not better, nor worse in general.

That is where we disagree fundamentally. I do posit that the communication is actually absolutely and unequivocally worse.

djmips

> Technology has always made us trade quality for low-quality quantity in exchange for convenience. People now interact more through technology which removes a lot of body language and other enriching experiences.

I went to the mall today and you can tell malls are dying. I lived in a small town where the mall died and it had a zombie like existence a long time before it finally cratered. The mall here in this larger town has that feeling. I also thought about how nice it is to go to the mall just to be out among people. The same is true of the downtown. If the endgame is for everyone to stay home and shop online that's going to be a very soulless existence.

tchaffee

Or don't shop at all and use that extra time to walk with friends in nature. Or when you really do need to shop, avoid the commute and use that extra time to spend with friends in nature. Being forced to be around strangers to get chores done doesn't put soul into my life.

I also avoid laundromats and do laundry at home and it doesn't feel soulless.

Sightline

I never went to malls to socialize.

CM30

But at the same time, there are tons of positive uses for things like this too. Imagine being a creator who wants to share their interests with the world, but hates their own voice or doesn't have the confidence to speak on camera. You could make a lot of people's lives better by creating content for YouTube, Twitch, TikTok, Instagram, etc, but you wouldn't be brave enough to otherwise.

Something like this could be incredible for those people. A natural sounding alternative to text to speech for people who dislike how they sound.

And it could also be used to anonymise people in documentaries about serious topics (like say, organised crime) without actors, letting people bring the atrocities of said folks to light without need to trust others or the risk of being found out.

Other examples could include vTubers, artists creating characters for TV shows, films and video games, etc.

All technology can be abused, and sadly, with how humanity acts, it will be by a small percentage of the population. But for every person abusing it for dubious purposes, there are dozens or hundreds or thousands of others who can make the world better with it.

ChildOfChaos

I think a great tool for this would be a cross over voice changer AI, so you could still speak naturally but then sound like the model voice, that way it would be a little less soulless.

CM30

Honestly, that would be incredible for so many purposes! vTubers and amateur media creators would love to be able to just speak and have it translated into their voice of their characters in a more natural way!

Would also be an interesting one for theme parks, since it could let the costumed characters speak in the voices of the relevant characters rather than remaining silent, which would add a lot to the sense of immersion there too. (something like the website on the other hand could let the animatronics, CGI characters and others hold conversations with guests too, which would also be neat)

didericis

Yes, I’m sure there are many positive uses as well, I just have a hard time seeing how that’s not going to be outweighed by the bad given the current environment. There’s going to need to be some sort of social/cultural/technological adaptation when the negative starts hitting with force to curb it towards positive uses. People need to start thinking about mitigation strategies now.

jsemrau

I am with you on this one. What defines us as a people is the ability to enjoy shared social experiences. The more tailored and personalized an experience becomes, the more it isolates us. We don't (at least I and my social circle) speak about TikToks the way we speak about YouTube videos.

But more importantly, boredom triggers innovation. As we are consuming ourselves to death, we might lose the ability to truly create. Maybe that's why the last 20 years of content feel quite generic and sterile.

mouzogu

I think we already have a lot of soulless human generated search results.

I think there will be need for a greater level of filtering and curation yes, but I see it as an opportunity both for creators and curators.

The barriers to entry for media creation will go down, but with saturation also the already low margins of profit will get worse.

affgrff2

Also, AI will do the filtering, not just blocking uninteresting content, but actually removing known and uninteresting information from content.

8note

I'm not confident that AI will ever catch up with new scammer tactics.

spaceman_2020

we'll go back to the old way of consuming media - recommended by friends, vetted by known curators.

SV_BubbleTime

> As impressive as a lot of these models are, I can't help but feel like they're going to end up making an incredible amount of sterile soulless content that makes everyone's lives worse.

Eh. I’ll take a MAYBE over the past 10 years or more of the human driven social media manipulations and scams and poison. We’ve made almost literally fucking nothing of value in a decade. It’s been ads, Ponzi schemes, and a race to the bottom of tolerance.

I’ll take the democratization of content. Knowing that it will allow the good and the bad.

… so how is it different from the radio or TV or “influencers” now? I have limited time to consume media and am not going to be less picky when it gets easier for people to make garbage.

fullsend

There was some innovation but 2010-2020 had some dead air as investors lavished ponzi scheme SaaS companies with cash and big firms poured the profits of early internet into VR, AR, AI, drones, self-driving, etc.

The last year and a half things are starting to pop off. OpenAI, SpaceX, Comma, Helion, many more…that doomer “everything sucks and is collapsing” mentality is on the way out in my opinion. The time for talk is over and it’s time to build, or so they say.

berniedurfee

I’m hopeful that, as with most apocalypse-capable technologies, humans will adapt and overcome.

Humans will be different on the other side of this coming wave of simulation indistinguishable from reality, but we’ll be okay.

It’ll suck living through the transition though. Not looking forward to the crap tsunami on the horizon.

But as a species, we’ll adapt and survive as always.

drewbug01

The “narrative” example is pretty good, but the “conversational” example is rather unpleasant to listen to.

(Especially if you know how well Meryl Streep delivers that monologue in the original: https://youtu.be/Ja2fgquYTCg)

tedd4u

That's a pretty high bar. Even most Hollywood productions can't afford Meryl Streep, let alone a new site, podcast, or video game.

From wikipedia:

Mary Louise "Meryl" Streep [is] often described as "the best actress of her generation." Streep is particularly known for her versatility and accent adaptability. She has received numerous accolades throughout her career spanning over five decades, including a record 21 Academy Award nominations, winning three, and a record 32 Golden Globe Award nominations, winning eight. She has also received two British Academy Film Awards, two Screen Actors Guild Awards, and three Primetime Emmy Awards, in addition to nominations for a Tony Award and six Grammy Awards.

xp84

Maybe it's because I haven't heard the source material, but that Conversational voice really appeals to me. I wish my phone and assistants used that voice.

(and also I can't wait for a "real" ChatGPT-era AI to go with it, to put those braindead jokes of an "assistant" Siri, Alexa, and Google Assistant out to pasture)

logicallee

Let's talk about this "Narrative" example.

When I listened to it, my first impression was that it must be the real actor they included for comparison purposes but that they failed to label it correctly. I thought it is not machine-generated. I couldn't tell the slightest artifact except what sounded like a low-bitrate sound encoding (maybe using a codec geared toward speech). Can you tell anything "off" about it?

As for the encoding artifact such as a tinny sound or low-bitrate sound, that is the type you hear on an MP3 or low bitrate codec for speech. For example, when I record a message on https://vocaroo.com/ the "premier" voice recording service it sounds 10x worse. Here is a sample I just recorded of my own speech: https://voca.ro/18oSJ1sHU5w5

After my first impression that the narrative example might be a real human mislabelled for comparison purposes, I listened to the next two, labelled News and Conversational. I found these very easy to tell as AI-generated.

Thinking back to why I found the narrative example so compelling, I thought perhaps the issue is that the first example is in British English which I'm less used to than American English. I grew up in the United States. Perhaps since the accent doesn't match my own, it is harder for me to perceive it as generated.

-> Can a native speaker of British English tell us whether listening to the first example you can tell in any way that it is a robot? Maybe it is as obvious to you as the next two are to me.

Still, I've listened to a fair amount of British English in my life so perhaps there is an alternative explanation for why the first one was better. For example, it could have been trained on a reader's voice who has narrated thousands of hours in very high studio quality in a fairly consistent way, leaving this type of text much easier to synthesize than the other two examples due to more training data or higher-quality audio.

For me, the first one is really indistinguishable from a narrator's true voice, though it does sound a bit tinny which could also happen as an artifact of the recording process.

In terms of "how confident are you that this is a real person" the second two examples I would put at 0 - it's totally obvious that it is not a real person, whereas the first one sounds like a 10 to me: obviously a real narrator. (With a bit of artifacting that sounds like an mp3.)

[1] The text is here https://www.nytimes.com/2001/11/19/books/chapters/the-lord-o...

iggykova

Hey! ElevenLabs here, confirming that all 3 samples (including the Narrative one) were AI-generated! We'll be opening up our platform later this month and would love for you to test it yourself!

logicallee

congratulations!! I can't tell the first isn't human no matter how much I try. That is an amazing achievement.

majolo

I'm a native British English speaker and can confirm the first example is incredibly good. It would be very difficult/impossible for most people to tell that the voice is generated from that clip alone.

rkagerer

Agreed, the intro and narrative ones are great. The news one is terrible.

feoren

Okay can I ask a question that has been bothering me for a long time?

Why do seemingly all these text-to-speech programs attempt to produce spoken voice based solely on raw text? Why don't they consume a MIDI-like text-markup language where you can write phonetic pronunciations along with markup about the emotion, volume, speed, etc.? I feel like this is a huge unnecessary roadblock holding back this kind of technology. It'd be like if every music composition program rendered a wave file not by MIDI or VST, but by trying to visually read sheet music. I totally understand why TTS solutions that have to consume arbitrary content, like screen-readers, need to read purely raw text. But content creators don't need to be limited to raw text! Why is everyone doing it that way? Where is the TTS markup language for content creators?

bredren

I don’t know how many of the solutions offer this, but there is a markup language for TTS:

https://en.wikipedia.org/wiki/Speech_Synthesis_Markup_Langua...

Amazon Polly, (which seems kind of ancient with all these new solutions showing up) has supported SSML for some time.

AWS Polly SSML docs: https://docs.aws.amazon.com/polly/latest/dg/ssml.html
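
For readers who haven't seen SSML, here is a minimal sketch in Python of what such markup looks like. The element names (`speak`, `prosody`, `emphasis`, `break`, `phoneme`) come from the W3C SSML specification; the sentence and attribute values are made up for illustration, and engines support different subsets, so check a provider's docs before relying on any tag.

```python
import xml.etree.ElementTree as ET

# Build a small SSML document. Element names follow the W3C SSML spec;
# the text and attribute values are illustrative only.
speak = ET.Element("speak")

# Slow, soft delivery for the whole sentence.
prosody = ET.SubElement(speak, "prosody", rate="slow", volume="soft")
prosody.text = "This voice does "

# Extra stress on a single word.
emphasis = ET.SubElement(prosody, "emphasis", level="strong")
emphasis.text = "not"
emphasis.tail = " exist."

# An explicit pause after the sentence.
ET.SubElement(speak, "break", time="500ms")

# Spell out a pronunciation in IPA instead of letting the engine guess.
phoneme = ET.SubElement(speak, "phoneme", alphabet="ipa", ph="təˈmeɪtoʊ")
phoneme.text = "tomato"

ssml = ET.tostring(speak, encoding="unicode")
print(ssml)
```

The resulting string is what you would pass to an engine in place of raw text, assuming that engine accepts these particular tags.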

KRAKRISMOTT

In practice they are next to useless; the expressions are not very...expressive (just try it in the AWS editor). I suspect an LLM would be able to infer the context, or we can use prompt engineering to generate the appropriate tokens encoding emotions for the intermediate neural codecs directly (Mel spectrograms are so passé now post Vall-E).

TaylorAlexander

Something I always noticed is that they get Morgan Freeman to do voiceovers for science shows, but he’s not a scientist so he has a sort of generic inflection when he talks about the various ideas in the script. And then you watch Carl Sagan’s COSMOS, where he co-wrote the material, and there is so much depth and expression to his delivery. There’s a lifetime of public speaking, specifically delivering complex scientific topics to a general audience, that Sagan drew from when recording his show.

Sagan would have learned this through conversation with people, and careful updates to his expression and delivery as he matured.

I guess an LLM could improve upon previous methods but I would also say there is a gap that even humans struggle with, which requires really complex knowledge both of public speaking and of the material. It may be a long time before we can really master that with AI systems.

slim

maybe the only way to express speech precisely is the speech itself ?

matisqe

ElevenLabs dev here - we believe this is a 2 step process and agree it is needed!

First, we want the quality you get out-of-the-box to already be brilliant by taking context into account. Granted, that sometimes gets you only 98% of the way there, and we are working to add manipulation options to get you to that 100%; for long texts, though, the quality you get is great.

For the second part, TTS providers currently give complicated toggles that frequently don't affect the speech in the way you want. Initially we are adding basic SSML-like support, and we have a more robust language-based idea which we hope will come over the next few months!

tkgally

Your context-aware TTS is already sounding very good. If I were using it to produce a narration that other people would be listening to, I would want to make at most a couple of minor adjustments every few sentences. Most of those adjustments would fall into a few categories: stronger or weaker stress on a particular word, rising or falling intonation on a phrase, longer or shorter pauses between words, and correction of the phonemes in a word. A half dozen toggles for those adjustments might be enough for most cases.

I wonder, though, how much training people would need to understand what adjustments need to be made. Experienced actors and narrators should have a good sense of what to fix, but many people might have trouble identifying what sounds strange in the initial TTS output and how it needs to be changed.

spywaregorilla

I feel like it would be much harder to create a set of hard controls, like MIDI, to affect the voice acting vs. trying to do a co-embedding space of voices and descriptions of the voices and just saying "Say this quietly and meanly". Thoughts?

matisqe

Exactly! The only issue is having a well-labelled dataset with those types of cues. We have an idea on how to do it though!

riceart

> I feel like this is a huge unnecessary roadblock holding back this kind of technology.

There are speech synthesis markup languages, like SSML. And targeting even lower level has always been possible with commercial speech engines.

Think about how tedious and time-consuming it is to mark up a large amount of copy. Unless we’re talking about little hints here and there (which is also doable), it rapidly becomes more cost-effective to just pay for voice talent. For this stuff to be appealing it really must be close to fire and forget.

feoren

I think there are two "sweet spots" here.

The first is being able to correct a few things that sound off, as another poster pointed out. "Hey, that's not actually how you pronounce 'synecdoche', it should be 'sɪˈnɛk.doʊ.k'." Or "Less emphasis on the first word, more on the second". Little corrections like that. I imagine a two-stage process where the first generates 'best guess' SSML (or whatever markup) based on the text. Then the content creator can modify it as necessary before it goes into the second step of actual voice synthesis.

The second sweet spot is when your text is dynamically generated. Marking up the entire copy might be a lot of work for pre-written text, but it's a great option for dynamically generated text.
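
To make the two-stage idea concrete, here is a rough Python sketch under stated assumptions. `Draft`, `draft_from_text`, and `render` are hypothetical names, and the "best guess" stage is a naive sentence splitter standing in for whatever model would really predict the markup. The IPA string is copied verbatim from the comment above. The actual voice synthesis step is left out entirely.

```python
import re
from dataclasses import dataclass, field
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Draft:
    """Editable intermediate form sitting between raw text and synthesis."""
    sentences: list[str]
    # Creator corrections: lowercase word -> IPA pronunciation.
    pronunciations: dict[str, str] = field(default_factory=dict)


def draft_from_text(text: str) -> Draft:
    """Stage 1 stand-in: split raw text into sentences (no model involved)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return Draft(sentences)


def render(draft: Draft) -> str:
    """Turn the (possibly corrected) draft into SSML-style markup."""
    def mark(sentence: str) -> str:
        out = []
        # Split into words and separators, keeping both.
        for part in re.split(r"(\w+)", sentence):
            ipa = draft.pronunciations.get(part.lower())
            if ipa is None:
                out.append(escape(part))
            else:
                out.append(
                    f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>'
                    f"{escape(part)}</phoneme>"
                )
        return "".join(out)

    body = "".join(f"<s>{mark(s)}</s>" for s in draft.sentences)
    return f"<speak>{body}</speak>"


draft = draft_from_text("Synecdoche is a figure of speech. It is hard to say.")
# The creator's correction, applied before anything is synthesized.
draft.pronunciations["synecdoche"] = "sɪˈnɛk.doʊ.k"
ssml = render(draft)
print(ssml)
```

Keeping corrections as data on the draft, instead of editing the markup string directly, means the same fix survives if the first stage is rerun on revised text.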

bdhcuidbebe

Just my 2 cents, but it seems to me that too little focus in the tech world has been spent on understanding what speech is. Tonality, mood, facial expression and body language are all ignored, or people pretend there is no such thing. I believe this is broadly true in western society by now: people went digital but do not yet realize why communication went to hell in the last decade.

havnagiggle

I used to work in automotive navigation. Other colleagues handled our voice systems, but I do remember all our prompts were written in SSML[1] with varying amounts of specificity. We would use Lua to configure and customize the SSML, including some custom extensions for different voice renderers.

Even with the prompts marked up, there were huge differences between products. Some car OEMs would pay higher fees for better voice and some wouldn't. It's fairly tedious work and difficult to scale as the number of sentences grows. We basically built up a catalog over many years, and the prompts were always explicitly stated as part of our requirements docs. Of course the renderers could say anything you wanted, but letting them go free-form was such a big risk from a product point of view.

[1] https://en.m.wikipedia.org/wiki/Speech_Synthesis_Markup_Lang...
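
A prompt catalog like this can be sketched in a few lines. The example below uses Python instead of Lua, and the prompt IDs, wording, and `render_prompt` helper are all hypothetical, not taken from any real navigation product. The one detail worth showing is that dynamic values such as street names need XML escaping before they go into the markup.

```python
from string import Template
from xml.sax.saxutils import escape

# Hypothetical prompt catalog: prompt id -> SSML template.
PROMPTS = {
    "turn": Template(
        '<speak>In $distance, turn <emphasis level="moderate">$direction</emphasis> '
        'onto $street.<break time="300ms"/></speak>'
    ),
    "arrive": Template("<speak>You have arrived at $destination.</speak>"),
}


def render_prompt(prompt_id: str, **values: str) -> str:
    """Fill a catalog prompt, XML-escaping every dynamic value."""
    safe = {key: escape(value) for key, value in values.items()}
    # substitute() raises KeyError if a placeholder is left unfilled,
    # so an incomplete prompt never reaches the voice renderer.
    return PROMPTS[prompt_id].substitute(safe)


ssml = render_prompt(
    "turn", distance="200 meters", direction="left", street="Fifth & Main"
)
print(ssml)
```

Because only cataloged templates can be rendered, the system cannot say arbitrary free-form text, which matches the product constraint described above.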

IanCal

There's lots of text and audio already without this, that's probably the key factor practically. Similarly then for use cases, converting text that already exists is much more approachable than creating new marked up text.

Tortoise lets you add prompts into the text like [I am angry] which modifies the voice interestingly.

montag

There's a pretty advanced Mac OS speech markup language, I wrote about it here: https://www.mattmontag.com/personal/mac-os-x-speech-synthesi...

Going back further, there was also a prosody markup for Sound Blaster speech synthesis (Dr. Sbaitso, anyone?).

IshKebab

I think markup would always be more work and less effective than using your own voice input to guide its tone.

wpietri

But not nearly as manageable. Imagine saying the same thing about music, for example. Musical notation is clearly more work than just humming a tune, but there's still a need for it.

dj_mc_merlin

The examples are insanely good. Insanely good. I can barely believe we really live in a world where this is possible. I don't have anything constructive to add.. just wow.

wand3r

I work in TTS and I just don't believe this. If these really are random texts, not trained on literally the copy they are reading and with no corrections, I would be surprised. Also, our competitors have good voices, but they take ages to produce. Maybe these really are legit but take like 1 minute to produce or something. So while this is impressive, I doubt that in practice it would be this high quality and could even approach real time.

piotr11

Thanks! ElevenLabs dev here - these are generated 6x faster than real-time, with latency of <1s. No corrections required.

We are working on long-form speech synthesis too. Needless to say, the audio reading the article was also synthesized by a voice that does not exist.
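
As a back-of-the-envelope reading of that figure, the Python sketch below assumes "6x faster than real-time" means a speedup factor of 6, so one second of compute yields six seconds of audio. `synthesis_time` is a hypothetical helper, and the clip lengths are examples, not numbers from ElevenLabs.

```python
def synthesis_time(audio_seconds: float, speedup: float = 6.0) -> float:
    """Seconds of compute needed to produce `audio_seconds` of speech."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


for label, seconds in [("1-minute clip", 60), ("10-hour audiobook", 10 * 3600)]:
    minutes = synthesis_time(seconds) / 60
    print(f"{label}: {minutes:.1f} min of synthesis")
```

Under that assumption, any speedup above 1 means a streaming player that starts after the initial latency should not run out of audio, since speech is produced faster than it is played back.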

moffkalast

Ok I think it's fair to say you're either full of shit or the world leading experts in TTS.

ThePyCoder

I want to agree, but I searched on their website and found their narration service with 2 full book examples. I listened to the first one for a while and it's the first time an AI narrator was good enough to keep me listening: https://www.audiostory.ai/2065785/11707800-alice-s-adventure...

kreddor

It's noticeably worse than the examples in the blog post. I mean, it's good enough for listening, but no better than the competition.

WheelsAtLarge

I'm listening to an audiobook whose reader is not as good as some of these voices. At one level I'm impressed, but at another I'm saddened, since we are heading towards uncharted territory. We are looking at a future where we'll have content, video, audio, and text by the truckload. More does not mean better. It just means more blah stuff. I don't think that's the future I'm looking forward to living in.

Fordec

The key will be authenticity and trust. And in a world where authentic, trustworthy material ends up a tiny minority of online content, in-person expertise and meetings will have to make a return out of sheer necessity.

It's starting to very much feel like we're entering the age of information manipulation outlined in the Ghost in the Shell TV series. Except it isn't a '90s/'00s depiction of the future; it just has far fewer robots and prosthetics and is a lot more mundane.

I just keep coming back to the scene where they have satellite video footage of a nuclear submarine preparing for a nuclear attack and the discussion lamenting that it's just video, nobody will believe it as evidence.

sanroot98

I think you are overestimating the capabilities of AI to create novel content. Genuinely high-quality content will always be there, but the amount of BS content will increase.

kerpotgh

[dead]

xeonmc

Imagine if in-game voice chat automatically converted player speech into the voice of the character they're playing -- this would resolve a lot of the gender-based harassment problems arising from competitive games requiring vocal communication, since _everyone's_ default would now hide the actual player's voice, in contrast to the "just use a voice changer if you're a girl playing" suggestion, which itself draws attention by being out of the ordinary.

yieldcrv

I’m looking forward to NPCs having dynamic responses with real voices

Doesn't have to be prerecorded, just trained

wlesieutre

Games could have more than three dialogue options again!

93po

I feel like if Bethesda really wants another industry defining game, this is the path they should be taking. AI generated conversation with AI generated voice acting with voice-to-text recognition. You can literally have microphone-voice conversations with NPCs that have rich, AI generated backgrounds and personalities.

didericis

A galaxy scale exploration game on the scale of Elite Dangerous where you could have more complex and varied interaction would be pretty amazing. The way you could apply these new AI models to video games has some wild potential. I think video games are one of the areas where I see the most potential for positive impact rather than negative impact.

yieldcrv

The file sizes would drop by 50 gigabytes again

Most of it is high definition audio these days, and then that just gets replaced by a 10gb training set, or maybe the training set becomes a shared resource on the console

LarsDu88

I'm working on a VR space game that actually uses SSML with Azure cloud-generated voices for dialog, but I've ditched the rogue-like procedural elements, which are wickedly hard to implement.

CM30

This would be incredible, especially with the thousands of unique characters games often have nowadays. Imagine every NPC having a unique voice, and the ability to dynamically respond to the players?

Damn that would do a ton for immersion!

DaedPsyker

Even customisable, like your character's appearance. This was one of my criticisms of Fallout 4: the voice actors weren't bad, they just didn't fit some player characters very well.

petepete

Could really bring life to some older games with lots of text. Daggerfall springs to mind.

barking_biscuit

Imagine if in-game voice chat automatically converted a % of guys' voices into girls' voices, so they would start getting harassed, realize how awful that is, and then over time stop doing it.

lm28469

Imagine if all POC wore white people body suits, that would solve racism!

moffkalast

That's just kind of a lame workaround that doesn't tackle the actual issue though.

anigbrowl

Less than a week ago, I said AI would upend the market for voice actors within the next couple of years: https://news.ycombinator.com/item?id=34271948

bsenftner

Not only voice actors, include radio hosts, documentary/news content, any voice over for anything, as well as imitation of familiar voices.

holler

This will really open Pandora's box for scammers and other bad actors. Grandma won't know she's speaking with an AI.

elboru

Grandma already falls for scams. Will I know I’m speaking with an AI?

AlfeG

I have an AI service from my mobile company that talks to scammers. The idea is to keep the scammer on the call as long as possible. Then you can listen to those calls or read the transcripts.

intelVISA

Blade Runner: 2024

spotting open (closed) AI models by doing the Voight-KAPTCHA test

bongoman37

[dead]

trinovantes

There's already an AI streamer on Twitch

https://kotaku.com/neuro-sama-twitch-vtuber-ban-holocaust-mi...

QuantumGood

The "budget advantage" doesn't matter in the top half of the industry; directing a human voice talent is not going away anytime soon.

Budget clients are suspicious of AI voices and feel "cheated" if they think someone they hired is using one. This will change fastest.

jurassic

I'd like to see this technology become cheap and ubiquitous enough that everyone can choose for themselves what voice they would like to hear right at the moment of consumption. It's always a huge bummer when there's a book I want to listen to on Audible with terrible narration. Somebody must have liked that voice for the person to be hired, but people's tastes differ and sometimes the people they've selected just really grate on my ears.

It would also be cool if celebrities / existing voice talent could somehow license the synthesis of their voice. I read something about James Earl Jones doing this with Disney for future Star Wars projects. I'm sure there are people out there who would love to have every work they listen to be in the voice of their favorite narrator/celebrity.

coverband

This is cooler than ChatGPT and image generation as far as I'm concerned. If they're able to bring out the emotional connectivity and purposefulness of the human voice, it will be revolutionary...

belter

The laughing examples are pretty impressive.

"The first AI that can laugh" - https://blog.elevenlabs.io/the_first_ai_that_can_laugh/

cheeseface

There are so many use cases for this, even with the current quality. Many game developers dream of having something like this.

intelVISA

Awesome, I think in a few years we'll hit levels of AI generative media tech where you can produce, as a lone greybeard, a Cyberpunk 2077 tier title. Same # of bugs too ;)

purplepatrick

Still sounds pretty fake to me. There’s a hurriedness to the speech and a monotonic uniformity in enunciation that is uncannily machine. Good to know that voice actors will have jobs for a while longer…

drivers99

I thought the Narrative one was 100% there. I'd still give the News one 99% and Conversational 98%.

affgrff2

Yes, for the sake of humanity, I hope the examples are cherry-picked and The Lord of the Rings audiobook is in the training set...

UncleEntity

> Good to know that voice actors will have jobs for a while longer…

They don’t have to work anymore; just selling their voice and sitting at home collecting royalty payments is the future, according to TFA.

And they’ve been making progress on the roboticness with every new model that comes out. Just a matter of time (and data) for the AIs to figure out how words string together naturally.

janosdebugs

This assumes that legislation/adjudication won't tell AI companies that grabbing any content they can find and not reimbursing the original author is "fair use" or something equivalent in other jurisdictions. Here's to hoping.

imtringued

The random voice generator is pretty bad but sometimes you actually get a reasonably good voice except you can hear clicking sounds that interrupt the voice.

Kiro

Yeah right. You would never pass a blind test on this.

sebzim4500

I think the narrative one would pass a blind test.

The conversational one wouldn't, although it could pass for a bad (human) voice actor.

kerpotgh

[dead]
