
synesthesiam

Hi all, author here. Besides the tech of Mimic 3 itself, I'm interested in training voices in as many (human) languages as possible. All it takes is one person willing to donate a dataset for everyone to benefit!

...well, that and a bunch of stuff with phonemes. But I'll do that part :)

dEnigma

Can't you use the Mozilla Common Voice dataset for that?

krisgesling

The Mozilla Common Voice dataset is awesome - however it's useful for the opposite purpose: speech-to-text. That's because it's a lot of different people, using a range of hardware, speaking similar phrases.

For good text-to-speech you need one person speaking different phrases, but very consistently. Here's an example dataset from Thorsten, a German open voice enthusiast: https://openslr.org/95/
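For a sense of what such a dataset looks like on disk: single-speaker TTS corpora commonly follow the LJSpeech convention of a `metadata.csv` with one pipe-delimited `utterance_id|transcript` line per audio clip (some corpora add a third, normalized-text column). A minimal sketch of reading that layout, assuming the two-column variant:

```python
def load_tts_metadata(text):
    """Parse LJSpeech-style metadata: one 'utterance_id|transcript' line per clip."""
    pairs = []
    for line in text.strip().splitlines():
        # Split on the first pipe only, in case the transcript itself contains one.
        utt_id, _, transcript = line.partition("|")
        pairs.append((utt_id, transcript))
    return pairs

sample = "wav_0001|Guten Morgen.\nwav_0002|Wie geht es dir?"
print(load_tts_metadata(sample))
# [('wav_0001', 'Guten Morgen.'), ('wav_0002', 'Wie geht es dir?')]
```

The filenames and phrases above are illustrative, not taken from the Thorsten dataset itself.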

dEnigma

Thanks for the explanation!

rjzzleep

What does it take to add Chinese and Japanese to this? Surely it's a lot more than just training sets, right? I have an Android phone without access to Google TTS, so this might actually be a nice alternative.

josephg

How can people contribute? I'd be happy to sit in front of a microphone for a while if I could use my own voice in a TTS engine!

sampo

They want you to make good-quality audio recordings of yourself speaking about 20,000 phrases. It could take 40 to 80 hours of speaking and recording, at a maximum of 4 hours per day.

https://github.com/MycroftAI/mimic-recording-studio

https://mycroft.ai/contribute/

synesthesiam

The amount of data depends on if there's a voice for the language already. If so, about 2 hours of data is usually good enough. Otherwise, 10-20 hours usually does it.

wilsonjholmes

Where could I donate my voice?

worthless-trash

What kind of workload are we looking at, do you care for the Australian accent?

krisgesling

Bloody oath we do!

krisgesling

Translation: "Yes"

... Hi from Darwin :D

Quequau

Does anyone know just how much of the total functionality of Mycroft is actually running on the Raspberry Pi? I asked this question four years ago on Reddit (I'll paste the response below) and now I wonder if things have changed, particularly with regard to speech to text.

There are several ‘layers’ to a voice assistant;

Wake Word - that detects when you are speaking to the device. This is local to the device and we use PocketSphinx.

Speech to text - that detects what you say to determine Intents - we currently use a cloud service for this

Intent matching - this is done locally using our own open source software - Adapt and Padatious

Skills - Intents then match to Skills. Some Skills require internet connectivity.

Text to Speech - We use our own software called Mimic for this, it’s local to the device.

krisgesling

If you download Mycroft today, by default everything except the STT is running on the Pi. There are options to do this locally: https://mycroft-ai.gitbook.io/docs/using-mycroft-ai/customiz... however, they have their limitations.

Before the Mark II ships in September, local STT will be available by default!

diggernet

Thanks for jumping in here.

> In order to provide an additional layer of privacy for our users, we proxy all STT requests through Mycroft's servers. This prevents Google's service from profiling Mycroft users or connecting voice recordings to their identities. Only the voice recording is sent to Google, no other identifying information is included in the request. Therefore Google's STT service does not know if an individual person is making thousands of requests, or if thousands of people are making a small number of requests each.

Well, unless Google does voiceprint analysis. But Google wouldn't do that, would they? /s

Beyond that, if I'm reading right, local STT will still require a separate STT server. It won't run on the Mark II itself, right?

Quequau

thanks for the info!

capableweb

The blogpost linked in this submission says the following:

> Mimic 3: Mycroft’s newer, better, privacy-focused neural text-to-speech (TTS) engine. In human terms, that means it can run completely offline and sounds great. To top it all off, it’s open source.

If "skills" are what I think they are (something like external commands, for example "Play X on Spotify"), then my understanding would be that everything but those runs offline and local-only.

But if things like `speech to text` require an internet connection and send the data to some cloud service, then the entire value proposition of this product falls apart.

I hope that's really not the case, as that would be outright lying and false advertising.

krisgesling

On device STT will be available before the Mark II ships in September!

There are some options for that available if you're running Mycroft already: https://mycroft-ai.gitbook.io/docs/using-mycroft-ai/customiz...

But these are not what will be shipped on the Mark II...

bluGill

The Raspberry Pi 3 used in older products doesn't have enough power to run everything offline. Maybe you could set up a server at home (but they won't help you!), but you cannot do it on the hardware they have. The next-gen Mark II (should ship this fall; first announced many years ago) will have a Pi 4, which might have enough power to run offline, though this isn't clear yet.

capableweb

> might have enough power to run offline, this isn't clear yet

It's very unclear and misleading to claim "it can run completely offline" if you're not 100% sure it can actually run completely offline, hardware be damned.

diggernet

According to their privacy policy (https://mycroft.ai/embed-privacy-policy/):

"*When you use our Services including the Mycroft Voice Assistant, your voice and audio commands are transmitted to our Servers for processing.*"

So it appears that STT is still cloud-based, which is a pity. That's the only thing keeping me from ordering one today.

krisgesling

On device STT will be available before the Mark II ships in September!

As will a new privacy policy that better reflects what we actually do.

suyash

Interesting so they run the services on cloud and advertise their company as on device, what a bunch of crooks.

krisgesling

On device STT will be available before the Mark II ships in September!

Note: I'm from Mycroft

notahacker

If they're marketing it as "fully offline", they ought to be doing the speech-to-text bit locally now. I worked on part of a platform which could use Rasa for this a couple of years ago, but it was running on something a bit more powerful than a Raspberry Pi!

krisgesling

On device STT will be available before the Mark II ships in September!

throwaway2016a

I've been dying to replace my Echos with an open source smart speaker, but half of them use AWS or Azure for speech recognition and speech synthesis, so really all you are in control of is the software that runs on the device itself. So this is a good step in the right direction.

Semaphor

The Rhasspy [0] author was recently hired by Mycroft to work on satellites and fully local operation. Rhasspy requires a lot of manual work, but replacing Alexa is already possible. I'm somewhat stuck with the current hardware availability issues, but I have a Pi 3 satellite that does wake word detection (this is supposed to be handled by a Pi Zero 2 W in the future) and sends the voice to the MQTT server running on a Pi 4. The data gets picked up by the Rhasspy instance also running there, which does STT and intent recognition, sends the intent to Home Assistant, and then does TTS back to the satellite.

My main software issue currently is how to replicate the music functionality: playing music at the satellite that requested it, and lowering the volume when it recognizes the wake word. Preselection of "commands" for band and genre names should be easily scriptable afterwards.

In a quiet room, I have no issues with wake word detection using a PlayStation Eye camera (I wanted the Seeed USB microphone array, but between discovering it and starting to buy hardware, the supply chain bit once again).

[0]: https://rhasspy.readthedocs.io/en/latest/
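For anyone curious about the "intent to Home Assistant" hop in a setup like the one above: Rhasspy publishes recognized intents as JSON over MQTT on Hermes-protocol topics like `hermes/intent/<intentName>`, which a small subscriber (e.g. with paho-mqtt) can pick up. A minimal sketch of parsing such a payload; the intent name and example slot values here are made up, and real messages carry more fields:

```python
import json

def parse_hermes_intent(payload):
    """Pull the intent name and slot values out of a Hermes-style intent message."""
    msg = json.loads(payload)
    name = msg["intent"]["intentName"]
    slots = {s["slotName"]: s["value"]["value"] for s in msg.get("slots", [])}
    return name, slots

# A hypothetical "turn off the lights" intent, shaped like a Hermes message.
example = json.dumps({
    "input": "turn off the lights",
    "intent": {"intentName": "ChangeLightState"},
    "slots": [{"slotName": "state", "value": {"kind": "Custom", "value": "off"}}],
})
print(parse_hermes_intent(example))
# ('ChangeLightState', {'state': 'off'})
```

From here, forwarding to Home Assistant is a matter of mapping the intent name and slots onto a service call or automation trigger.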

Havoc

Didn't realize Rhasspy already has satellite support. I shall have to check that out!

I've got a home server and a Seeed array, so it would be ideal to split the mic (on the Pi) and the processing.

rcarmo

Playing music from a Plex server is a major use case for me, and I have given up on Rhasspy because I couldn’t get all the pieces to work together (I have the mic array HAT and a Synology I can run recognition on). Do you have a write-up of your setup?

Semaphor

> Playing music from a Plex server is a major use case for me, and I have given up on Rhasspy because I couldn’t get all the pieces to work together (I have the mic array HAT and a Synology I can run recognition on). Do you have a write-up of your setup?

I have not yet managed it / worked enough on it (the lack of hardware making everything theoretical, which kills my motivation). The way I understand it, there'll either be a casting server on the satellite, or a PulseAudio/PipeWire server reachable via the network. But I have next to no experience with consumer Linux, so the configuration of those parts is… hard.

But there are many tutorials for playing multi-room audio (with Icecast or something). I just assumed it would be easier without multi-room, as I don't need it, but it turns out it's not ;)

puchatek

And how well does STT work when the room is not quiet anymore, e.g. when music is playing?

Semaphor

My understanding is that the Seeed array would work better than the PS Eye, but at the volume I normally listen to music at, it still works okay.

kelnos

They're also just really not great. I tested out Mycroft a couple years ago and found that the success rate for getting it to understand its wake word and listen for commands was under 10%. Maybe if you buy their prepackaged product, it works better, but that's not something I want to do. I just want to run it on a Pi 4 (which they claim works) with a mic array.

krisgesling

Yeah, I think there are two sides to this coin (and just for clarity, all of this relates to Picroft, not Mimic 3, the TTS engine that just launched).

The audio hardware makes a huge difference to audio input, which is why we've developed the custom SJ201 board that's in the Mark II. But even on DIY units we have been making big improvements to wake word detection by better balancing our training data sets. Once the Mark II is shipping, there are additional wake word improvements on the roadmap. Eventually the system will optimize for the users of each device, so the wake word model on your device wouldn't be exactly the same as the model on mine. We've also ported the wake word model to TensorFlow Lite, which means it uses a small fraction of the system resources that it used to :D

We're also about to make some bigger changes to mycroft-core that will help support a broader range of hardware in a more consistent way. So whilst you could try it again today, and I can guarantee it's better than the last time you used it, if you want a DIY system instead of a Mark II, I'd suggest adding a reminder to check it again in a couple of months once these bigger changes land.

kelnos

That sounds fantastic! Thank you for replying; I'll definitely check back and give it another go.

stavros

10% doesn't sound much worse than my Alexas' 30%...

kevinmgranger

And if it's anything like Siri, it can barely do anything useful, so it doesn't matter if it understands you.

kelnos

My Google Home is pretty near 100%; I can count on one hand the number of times it hasn't "heard" me over the past year. That's my benchmark.

Brendinooo

They’ve done a lot of work in the last year on the software side. Might be worth revisiting. They’re tentatively on track to (finally!) ship in September of this year.

azalemeth

This looks awesome and I love seeing FOSS, privacy-first equivalents of Big Tech. The video was really, really cute – and you could hear the improvements of the tech as time went on. I must confess that my initial thought on watching it was that it was something to help blind or partially sighted people, as a document-to-words reader. Only later did I twig that they are essentially Alexa-speaker-alikes.

Therefore, I'll ask the question I always think of when I see smart speakers: what exactly is their use case? I've never used voice assistants. I've never had a PA. I have a variety of good, dumb speakers. If I am cooking, I have the radio on in the background and a smartphone in my pocket if I desperately wish to change something. I've always thought that the voice recognition was cool, but I've just never quite recognised a situation where I would use it!

For the record, I live in a house with at least two Raspberry Pis on all the time (one as a DTV tuner), so I am far from a Luddite in that regard. I just genuinely don't really know what use case a smart speaker solves. Please enlighten me!

vidarh

I use mine for at least half a dozen timers on an average day. The more you use it, the more often you get the impulse to just set another timer, be it for "remember to stop playing that game and be productive" or remembering to leave the house on time, because it's so simple. I also use it to turn the TV and lights on and off. Not much of a point if it's a single one, but helpful when it's a number of lights, e.g. when we go to bed or leave the house, and you can address them all with a group ("alexa turn off downstairs"/"alexa turn off everything").

And to play music. Just asking it to play a track and then asking it to play similar music (very hit and miss), for example, and then asking what's playing, all without having to reach for my phone, finding an app etc.

My experience was that I bought my first one mostly because I wanted something to play music on in the living room anyway and didn't really care about getting a full-on stereo setup, as I'm not very picky about sound quality, but I was curious. I never use voice assistants on my phone. But I found myself using it more and more as I got used to being able to turn things on and off without reaching for anything, or when my hands were otherwise full.

It's not something I'd have the slightest difficulty living without, but it feels like it decreases friction for a lot of small things.

I now have four: one by my desk, one in the living room, one in my son's bedroom and one in mine.

Fredej

My dad uses the Google speaker for two things:

- Making animal / fart sounds for his grand-children.

- Timers for cooking ("Hey Google, set a timer for 4 minutes").

I recently spent a month at my parents' place. I really, really miss the timer thing.

Additionally, I can see voice assistants as a pretty good interface for Home Assistant. At least for some parts.

whywhywhywhy

Timers are the only thing I've used Siri for since the first weeks of its introduction.

At the end of the day, it feels like that's the only thing it's good for, and it often fails at even doing that.

thejohnconway

Saying “hey dingus, add $item to the shopping list” is a killer feature for me. It’s so much easier than adding something manually on your phone (especially if you have your hands full cooking). Reminders and timers are something I use too. It’s also pretty good for playing music when you don’t have something particular in mind: “play some classical music” for example.

It’s definitely something I could live without, but even Apple’s speaker is pretty cheap. Especially so if it’s your main speaker (if you care about audio quality it might be a problem, but I don’t so it’s not).

wyldfire

What an awesome project! And AGPL is really perfect for this kind of work.

What's the BOM look like? I'd love to understand more about the design. The software's open source, right? After a brief skim I didn't see a repo link. Does anyone know where the source is? Do they use an AI accelerator DSP/TPU or just plain-old-software-on-a-CPU?

krisgesling

Hey there - the source for Mimic 3 can be found here: https://github.com/mycroftAI/mimic3 It can run CPU only or accelerated with a GPU.

Regarding the BOM, I assume you mean the Mark II? That you can find here: https://github.com/MycroftAI/hardware-mycroft-mark-II/tree/m... We actually ended up designing our own RPi daughterboard called the SJ201. It's mostly an audio front end with an XMOS XVF-3510 and dual mics, but also includes a 23W amp, some LEDs for feedback, buttons, a hardware mic switch, GPIO breakout and power management (amongst other things).

tunesmith

That video is hilarious.

imranq

Agree! I wonder if a sitcom using only artificial voices is coming soon

simcop2387

Written by GPT3 and illustrated by Dall-e. Coming soon to a spammy YouTube channel

follower

Here's my pitch, semi-related to that topic, from when the Mimic 3 beta made it to HN: :)

https://news.ycombinator.com/item?id=31422342

undefined

[deleted]

riffraff

I had skipped the video, I went back and watched it, and I agree, it's brilliant.

dmos62

Let's get down to business. How do I use this as my Android TTS engine?

krisgesling

No Android release yet unfortunately, but you can drop your email in at the bottom of this page and select the platforms you are interested in to be notified about specific future releases: https://mycroft.ai/mimic-3/

notjustanymike

That's a wonderfully effective marketing video. It's funny, gives me a background on the technology itself, and effectively highlights the new features.

krisgesling

Thanks :)

I'm going to assume this isn't our developer Mike's alternate account congratulating himself lol

notjustanymike

Hah no! But clearly Mikes are skilled individuals with a good sense of humor.

knodi123

How do you actually use it on a project? I see where you can order a dedicated piece of hardware, but I'd love to download this and replace pyttsx3 on my homemade IoT Linux server.

But all I see is documentation, discussion of what they used to build it, and... where's the actual software?!

krisgesling

Hey there, instructions for installation are all here: https://mycroft-ai.gitbook.io/docs/mycroft-technologies/mimi...

and the source code is available at: https://github.com/MycroftAI/mimic3

There is currently a Docker image, DEB package, and PyPI release.
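For the PyPI route, the basic flow looks something like the following. This is a sketch based on the Mimic 3 docs; the package name was `mycroft-mimic3-tts` at the time of writing, and the voice key is illustrative (check the docs for the current list of voices):

```shell
# Install the Mimic 3 TTS engine and CLI from PyPI
pip install mycroft-mimic3-tts

# Synthesize a phrase to a WAV file; --voice selects one of the bundled voices
mimic3 --voice en_US/vctk_low "Hello from Mimic 3." > hello.wav
```

There is also a `mimic3-server` entry point for running it as a local HTTP service, per the same docs.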

knodi123

yeah, that did it. thanks!

NileTheGreat

Any chance you'll add just the SJ201 board on its own to your store? I'd love to experiment with my own case designs, but already have too many PCBA projects on my TODO bench, and am totally okay with paying a premium for a pre-assembled RPi daughter board.

krisgesling

We do get this request a bit but not yet at the scale where it is economically viable for us to do so.

I would however point out that the Mark II is completely hackable. So whilst it's not an SJ201 on its own, you can absolutely pull the whole thing apart, use it in other enclosures and even put it all back together again.

Another one of the reasons we made it is because having a single daughterboard greatly simplified production and made the Mark II more robust overall. There's no longer the possibility of a loose wire to the power supply or amp after it gets kicked around in the back of a delivery van. They're all in one and connected via the 40 pin GPIO header, but absolutely removable from the Mark II unit itself.

thecosmicfrog

Glad to see Popey lives on in Mimic 3. Does it have a better understanding of beans these days?

black_puppydog

And his voice sounds so realistically Popey now!

synesthesiam

I believe I fixed the bean bug ;)


Mimic 3 by Mycroft - Hacker News