balloob
Some feedback to make your project easier to install and integrate better with Home Assistant (I'm the founder):
Home Assistant is building a voice assistant as part of our Year of the Voice theme. https://www.home-assistant.io/blog/2023/04/27/year-of-the-vo...
As part of our recent chapter 2 milestone, we introduced new Assist Pipelines. This allows users to configure multiple voice assistants. Your project is using the old "conversation" API. Instead it should use our new assist pipelines API. Docs: https://developers.home-assistant.io/docs/voice/pipelines/
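For anyone curious what targeting the pipelines API involves: a pipeline run is started over the Home Assistant WebSocket API with an `assist_pipeline/run` message. A minimal sketch of building that message (field names follow the pipelines docs linked above; exact details may differ between HA versions, so treat this as illustrative):

```python
import json

def build_pipeline_run(msg_id, start_stage="intent", end_stage="intent", text=None):
    """Build an assist_pipeline/run message for the HA WebSocket API.

    A text-only run starts at the "intent" stage with the recognized
    speech as input; audio runs start at "stt" instead.
    """
    msg = {
        "id": msg_id,
        "type": "assist_pipeline/run",
        "start_stage": start_stage,
        "end_stage": end_stage,
    }
    if text is not None:
        msg["input"] = {"text": text}
    return msg

# Serialize exactly as it would be sent over the authenticated WebSocket.
print(json.dumps(build_pipeline_run(7, text="turn on the kitchen lights")))
```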
You can even off-load the STT and TTS fully to Home Assistant and only focus on wake words.
You will see a much higher adoption rate if users can just buy the ESP BOX and install the software on it without installing/compiling anything. That's exactly why we created ESP Web Tools. It lets projects offer browser-based installation directly from their website. https://esphome.github.io/esp-web-tools/
If you're going the ESP Web Tools route (and you should!), we've also created Improv Wi-Fi, a small protocol to configure Wi-Fi on the ESP device. This will allow ESP Web Tools to offer an onboarding wizard in the browser once the software has been installed. More info at https://www.improv-wifi.com/
Good luck!
timtom39
Home Assistant would be a lot more convincing if every upgrade did not completely break my install.
Flashed this on ESP I had laying around and did NOT have to upgrade HA (which would have made me not try the project).
lannisterstark
HA would be a lot more convincing if basic layout itself alongside config wasn't YAML hell. Every time I want to create some new layout or add something new to my home screen, I dread it.
I hate using it. Yet, I have no viable OSS alternatives.
peterhoeg
openHAB is very nice and completely OSS.
raman325
Can you share more details about what's breaking? Is it a specific integration? Is it in general? What breaks? This is not consistent with most users' experience, but it's hard to know without more specifics.
stragies
Some of the things that happened to me during the last 18 months:
- The changeover to the new Bluetooth subsystem broke many integrations. My Bluetooth TRVs still don't work right (again).
- ONVIF support recently broke for an (admittedly shitty old) IP webcam. PTZ never worked/was exposed.
- My USB-connected Android devices can't be controlled by the ADB integration anymore. There was some integration renaming/rescoping recently.
Home Assistant is still (imho) the best solution in this space for most combinations of metrics. I'd still recommend it to anyone.
(I tinker a lot with my HA-install/network, so maybe some of the above are issues on my end)
kkielhofner
> Flashed this on ESP I had laying around
So question is - what do you think :)?
kkielhofner
Hey there!
First of all, everyone involved in this project has been big fans and users of HA for many years (in my case at least a decade). THANK YOU! For now Willow wouldn't do anything other than light up a display and sit there without Home Assistant.
We will support the pipelines API and make it a configuration option (eventually default). HA has very rapid release cycles and as you note this is very new. At least for the time being we like the option of people being able to point Willow at older installs and have it "do something" today without requiring an HA upgrade that may or may not include breaking changes - hence the conversation API.
One of our devs is a contributor for esphome and we're heading somewhere in that direction, and he's a big fan of improv :).
We have plans for a Willow HA component and we'd love to run some ideas past the team. Conceptually, in my mind, we'll get to:
- Flashing and initial configuration from HA like esphome (possibly using esphome, but the Espressif ADF/SR/LCD/etc frameworks appear to be quite a ways out for esphome).
- Configuration for all Willow parameters from wifi to local speech commands in the HA dashboard, with dynamic and automatic updates for everything including local speech commands.
- OTA update support.
- TTS and STT components for our inference server implementation. These will (essentially) be very thin proxies for Willow but also enable use of TTS and STT functionality throughout HA.
- Various latency improvements. As the somewhat hasty and lame demo video illustrates[0] we're already "faster" than Alexa while maintaining Alexa competitive wake word, voice activity detection, noise suppression, far-field speech quality, accuracy, etc. With local command recognition on the Willow device and my HA install using Wemo switches (completely local) it's almost "you can't really believe it" fast and accurate.
I should be absolutely clear on something for all - our goal is to be the best hardware voice interface in the world (open source or otherwise) that happens to work very well with Home Assistant. Our goal is not to be a Home Assistant Voice Assistant. I hope that distinction makes at least a little sense.
You and the team are doing incredible work on that goal and while there is certainly some overlap we intend to maintain broad usability and compatibility with just about any platform (home automation, open source, closed source, commercial, whatever) someone may want to use Willow with.
In fact, our "monetization strategy" (to the extent we have one) is based on the various commercial opportunities I've been approached with over the years. Turns out no one wants to see an Amazon Echo in a doctor's office but healthcare is excited about voice (as one example) :).
Essentially, Home Assistant support in Willow will be one of the many integration modules we support, with Willow using as many bog-standard common denominator compliant protocols and transports that don't compromise our goals, while maintaining broad compatibility with just about any integration someone wants to use with Willow.
This is the very early initial release of Willow. We're happy for "end-users" to use it but we don't see the one-time configuration and build step being a huge blocker for our current target user - more technical early adopters who can stand a little pain ;).
daredoes
Thanks for all your work!
discardedrefuse
So I was just looking at the installation process for this device's dev environment (ESP-IDF from Espressif) and it seems kind of... insane.
The manual install method in the directions is not manual at all. It's a script that calls several python scripts. One has 2660 LOC and installs a root certificate (hard coded in the script itself) because of course, even though you just cloned the whole repo, it still has to download stuff from the internet. According to the code, "This works around the issue with outdated certificate stores in some installations".
Does anyone familiar with Espressif have an actual manual method of installing a dev environment for this device that doesn't involve pwning myself?
detaro
yes, do it in a container or VM. Welcome to the wonderful world of hardware manufacturer SDKs.
stintel
Indeed. This is exactly the reason why we standardized on building in a container.
yelite
If you are open to Nix, you can try https://github.com/mirrexagon/nixpkgs-esp-dev. I used it for a small project a while ago and the experience was pretty good.
kkielhofner
Nice!
For anyone who would try to use this with Willow (I like the effort and CERTAINLY don't love the ESP dev environment as-is):
- ESP ADF is actually the root of the project. ESP-IDF and all other sub-components are themselves components of ADF.
- We use bleeding edge ESP SR[0] that we also include as an ADF component.
- Plus LVGL, ESP-DSP, esp-lcd-touch, and likely others I'm forgetting ATM.
WaitWaitWha
Congratulations! This is great news!
I do not see anything posted on the Home Assistant (HA) Community forums.
> Configuring and building Willow for the ESP BOX is a multi-step process. We're working on improving that but for now...
This is crucial as your "competitors" are ready out of the box. I believe HA can be a Google/Alexa alternative to the masses only if the "out-of-the-box" experience is comparable to the commercial solutions.
Good luck, and keep us updated!
kkielhofner
Thanks!
HN was my first stop (of course) - I'll be heading over there shortly to post.
Oh yeah, we're well aware of how much of a "pain" getting Willow going can be. I don't like it (at all).
That said, you configure and build once for your environment and then get a .bin that can be flashed to the ESP BOX with anything that does ESP flashing (like various web interfaces, etc) or you can re-run the flash command across X devices. So even now, in this early stage, it's at least only painful once ;).
Down the road we want to have a Willow Home Assistant component that does everything inside of the HA dashboard so users (like esphome, maybe even using esphome) can point-click-configure-flash entirely from the HA dashboard. Not to mention ongoing dynamic configuration, over the air updates, etc.
I talk about all of this on our wiki[0].
[0] - https://github.com/toverainc/willow/wiki/Home-Assistant
freedomben
IMHO better to release early like this to a group of hackers than to wait until you have a nice out of the box setup going. This way you're going to get a lot of great feedback and hopefully some help. Awesome project!
kkielhofner
Bingo, thanks!
Cheetah26
This looks like something I've been wanting to see for a while.
I currently have a google home and I'm getting increasingly fed up with it. Besides the privacy concerns, it seems like it's getting worse at being an assistant. I'll want my light turned on by saying "light 100" (for light to 100 percent) and it works about 80% of the time, but the others it starts playing a song with a similar name.
It'd be great if this allows limiting/customizing what words and actions you want.
t-vi
Personally, I plugged a Jabra conference speaker into a Raspberry Pi, and if it hears something interesting, it sends it to my local GPU computer for decoding (with Whisper) + answer-getting + a response sent back to the Raspberry Pi as audio (with a model from coqui-ai/TTS but using more plain PyTorch). Works really nicely for having very local weather, calendar, ...
kkielhofner
Neat!
If you don't mind my asking, what do you mean "if it hears something interesting"? Is that based on wake word, or always listen/process?
t-vi
Both:
A long while ago, I wrote a little tutorial[0] on quantizing a speech commands network to the Raspberry. I used that to control lights directly and also for wake word detection.
More recently, I found that I can just use more classic VAD because my uses typically don't suffer if I turn on/off the microphone. My main goal is to not get out the mobile phone for information. That reduces the processing when I turn on the radio...
Not as high-end as your solution, but nice enough for my purposes.
[0]. https://devblog.pytorchlightning.ai/applying-quantization-to...
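The "classic VAD" idea can be as simple as an energy gate over PCM frames. A toy stdlib-only sketch (not what t-vi runs; real VADs such as webrtcvad model speech characteristics, not just loudness - the threshold here is an arbitrary illustrative value):

```python
import array
import math

def rms(frame: bytes) -> float:
    """Root-mean-square level of a frame of 16-bit little-endian PCM."""
    samples = array.array("h", frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Crude energy gate: treat the frame as speech if its RMS level
    exceeds a fixed threshold."""
    return rms(frame) > threshold

silence = array.array("h", [0] * 160).tobytes()
loud = array.array("h", [4000, -4000] * 80).tobytes()
print(is_speech(silence), is_speech(loud))  # False True
```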
kkielhofner
Totally get it!
There are at least two ways to deal with this frustrating issue with Willow:
- With local command recognition via ESP SR, recognition runs completely on the device and the accepted command syntax is defined up front. It essentially does "fuzzy" matching to address your light command ("light 100"), but there's no way it's going to send some random match off to play music.
- When using the inference server -or- local recognition we send the speech to text output to the Home Assistant conversation/intents[0] API and you can define valid actions/matches there.
[0] - https://developers.home-assistant.io/docs/intent_index/
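For reference, the conversation API mentioned here is a plain REST endpoint: you POST the recognized text to `/api/conversation/process` with a long-lived access token. A sketch that just builds the request without sending it (the host and token are placeholders):

```python
import json
import urllib.request

def conversation_request(base_url: str, token: str, text: str) -> urllib.request.Request:
    """Build a POST to Home Assistant's /api/conversation/process endpoint.

    base_url and token are placeholders - use your HA instance URL and a
    long-lived access token from your HA profile page.
    """
    return urllib.request.Request(
        f"{base_url}/api/conversation/process",
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = conversation_request(
    "http://homeassistant.local:8123", "LLAT", "turn on the office light"
)
print(req.full_url, req.get_method())
```

Sending it with `urllib.request.urlopen(req)` returns a JSON intent response that your client can inspect.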
chankstein38
This drives me nuts and happens all the time as well. To be honest, I unplugged my Google Home device a while back and haven't missed it. It mostly ended up being a clock for me, because I'd try to change the color of my lights to a color it apparently wasn't capable of, and then have to sit there for minutes listening to it list stores in the area that might sell those colored lights or something. It wouldn't stop. This is just one of many frustrating experiences I've had with that thing.
schainks
THIS. It's hilarious and infuriating that our digital assistants struggle to understand variants of "set lights at X% intensity".
However, if I spend the time to configure a "scene" with the right presets, Google has no issue figuring it out.
If only it could notice regular patterns about light settings and offer suggestions that I could approve/deny.
api
I love seeing lots of practical refutations of the "we have to do the voice processing in the cloud for performance" rationales peddled by the various home 1984 surveillance box vendors.
It's actually faster to do it locally. They want it tethered to the cloud for surveillance.
kkielhofner
We can do either.
For "basic" command recognition the ESP SR (speech recognition) library supports up to 400 defined speech commands that run completely on the device. For most people this is plenty to control devices around the home, etc. Because it is all local it's extremely fast - as I said in another comment pushing "Did that really just happen?" fast.
However, for cases where someone wants to be able to throw any kind of random speech at it "Hey Willow what is the weather in Sofia, Bulgaria?" that's probably beyond the fundamental capabilities of a device with enclosure, display, mics, etc that sells for $50.
That's why we plan to support any of the STT/TTS modules provided by Home Assistant to run on local Raspberry Pis or wherever they host HA. Additionally, we're open sourcing our extremely fast highly optimized Whisper/LLM/TTS inference server next week so people can self host that wherever they want.
java_beyb
First, good initiative! Thanks for sharing. I think you gotta be more diligent and careful with the problem statement though.
Checking the weather in Sofia, Bulgaria requires the cloud for current information; it's not "random speech". ESP SR's capability limits don't mean that you cannot process it locally.
The comment was about "voice processing", i.e. sending speech to the cloud, not sending an API request to get the weather information.
Besides, for local intent detection beyond 400 commands, there are great local STT options that work better than most cloud STTs for "random speech":
https://github.com/alphacep/vosk-api https://picovoice.ai/platform/cheetah/
kkielhofner
Thanks!
There are at least two things here:
1) The ability to do speech to text on random speech. I'm going to stick by that description :). If you've ever watched a little kid play with Alexa it's definitely what you would call "random speech" haha!
2) The ability to satisfy the request (intent) of the text output. Up to and including current information via API, etc.
Our soon to be released highly optimized open source inference server uses Whisper and is ridiculously fast and accurate. Based on our testing with nieces and nephews we have "random speech" covered :). Our inference server also supports LLaMA, Vicuna, etc and can chain together STT -> LLM/API/etc -> TTS - with the output simply played over the Willow speaker and/or displayed on the LCD.
Our goal is to make a Willow Home Assistant component that assists with #2. There are plenty of HA integrations and components to do things like get weather in real time, in addition to satisfying user intent recognition. They have an entire platform for it[0]. Additionally, we will make our inference server implementation (that does truly unique things for Willow) available as just another TTS/STT integration option on top of the implementations they already support so you can use whatever you want, or send the audio output after wake to whatever you want like Vosk, Cheetah, etc, etc.
[0] - https://developers.home-assistant.io/docs/intent_index/
COGlory
Ordered a box, can't wait to try this out! I've really been looking for something like this. My dream would be to have an LLM "agent" running locally, that knows who I am, etc, that can also double as a smart assistant for HA.
Huntszy
That might be closer than you think. The new Assist pipeline is fully customizable and can use other models as well. They already have a ChatGPT integration which isn't able to control entities in HA, but at least you can have a spoken conversation with ChatGPT through HA.
So if you spin up an LLM locally, create an HA Assist pipeline with it, and then use Willow (a future release should be able to leverage the new Assist feature) as a physical interface, you're golden.
It may be hard or impossible today, but I think within months HA and Willow will mature to the point where the biggest problem will be training and running a good enough LLM locally. But I bet a good number of hackers are already hard at work on that part anyway.
tmzt
Starting with this post:
https://community.home-assistant.io/t/using-gpt3-and-shorcut...
I've been trying to adapt it to an offline LLM model, probably a LLaMA-like one using the llm package for Rust, or a ggml-based C implementation like llama.c.
It could even be fine-tuned or trained to perform better and always output only the JSON.
This could be a good fit with Tovera's open-sourced inference server when that is released.
I like the idea of supporting natural language commands that feel more natural and don't have to follow a specific syntax.
It can also process general LLM requests, possibly using a third-party LLM like Bard for more up to date responses.
barbariangrunge
I never really considered getting a home assistant doodad because of the privacy issues around them. This sounds like a cool project
Huntszy
What kind of privacy issues are you referring to? Legit question btw, I'm not aware of any but would like to read about it.
stintel
Thanks!
tbyehl
What's the story for multiple devices being triggered by a single utterance of the wake word?
I have an Alexa or Google device in nearly every room, so that '[wake word] lights [on|off]' or whatever does the right thing for that space. Alexa devices are pretty good about processing from the 'right' device when multiple are triggered. Google, not so much.
(Also a gap in both platforms is that they don't pass along the triggering device information)
kkielhofner
I get really excited about this one!
Right now we don't do anything about it. BUT - I get excited because our wake word detection and speech rec is so good I have to go around my house and unplug all of my other devices when I'm doing development because otherwise a bunch of them wake. So it's good and bad right now :).
My thinking hasn't completely formed but I believe I have a few potential solutions to this issue in mind.
I've been replying to comments for 12 hours, can you let me slide on this one ;)? I promise we'll start discussing/working on it publicly fairly soon.
edf13
Sounds interesting - one question I have is about the mic array... Isn't this one of the supposed benefits of a physical Alexa device, rumored to be sold at a loss because of the hardware quality?
How does the ESP BOX compare? E.g. in a noisy environment, TV in the background, kids and dogs running around?
kkielhofner
The ESP BOX has an acoustically optimized enclosure with dual microphones for noise cancelation, separation, etc.
Between that and the Espressif AFE (audio front-end) doing a bunch of DSP "stuff", in our testing it does remarkably well in noisy environments and far-field (25-30 feet) use cases.
Our inference server implementation (open source, releasing next week) uses a highly performance optimized Whisper which does famously well with less-than-ideal speech quality.
All in, even though it's all very early, it's very competitive with Echo, etc.
glenngillen
What’s the latency of inference on a Raspberry Pi (I assume it’s not running directly on the device)? I think I read previously that it was up to 7 secs, and if you wanted sub-second you’d need an i5.
kkielhofner
Willow supports the Espressif ESP SR speech recognition framework to do completely on-device speech recognition for up to 400 commands. When configured, we pull light and switch entities from Home Assistant and build the grammar to turn them on and off. There's no reason it has to be limited to that; we just need to do some extra work on dynamic configuration and tighter integration with Home Assistant to allow users to define up to 400 commands to do whatever they want with their various entities.
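The entity-to-grammar step described above can be sketched as a small pure function over Home Assistant's `/api/states` response (the entity shape follows the HA REST API; the phrasing template and function name are illustrative, not Willow's actual code):

```python
def build_multinet_commands(states, limit=400):
    """Build on/off command phrases for Multinet from HA state entries.

    Only light and switch entities are used, mirroring what Willow pulls
    today; Multinet caps the grammar at `limit` commands.
    """
    commands = []
    for state in states:
        domain = state["entity_id"].split(".", 1)[0]
        if domain not in ("light", "switch"):
            continue
        name = state.get("attributes", {}).get("friendly_name", state["entity_id"])
        commands.append(f"turn on {name}")
        commands.append(f"turn off {name}")
    return commands[:limit]

states = [
    {"entity_id": "light.office", "attributes": {"friendly_name": "office light"}},
    {"entity_id": "sensor.temp", "attributes": {}},  # ignored: not light/switch
]
print(build_multinet_commands(states))
# → ['turn on office light', 'turn off office light']
```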
With local command recognition with Willow I can turn my wemo switches on and off, completely on device, in roughly 300ms. That's not a typo. I'm going to make another demo video showing that.
We also support live streaming of audio after wake to our highly optimized Whisper inference server implementation (open source, releasing next week). That's what our current demo video uses[0]. It's really more intended for pro/commercial applications as it supports CPU but really flies with CUDA - where even on a GTX 1060 3GB you can do 3-5 seconds of speech in ~500ms or so.
We also plan to have a Willow Home Assistant component to support Willow "stuff" while enabling use of any of the STT/TTS modules in Home Assistant (including another component for our inference server you can self-host that does special Willow stuff).
stavros
Where can I get one? I can't find it on Ali :(
kkielhofner
You never know if people are going to love your pet project as much as you do. We had a hunch the community would appreciate Willow but like I said, you just never know.
My suspicion is Espressif (until now, hah) hasn't sold a lot of ESP Boxes. We were concerned that if Willow takes off they will sell out. That already appears to be happening.
Espressif has tremendous manufacturing capacity and we hope they will scale up ESP BOX production to meet the demand now that (with Willow) it exists. The only gating item for them is probably the plastic enclosure, and they should be able to figure out how to produce that en masse :).
RileyJames
I’ve been living in a house for the past few months with a google assistant. I only use it to put on music, but I have noticed I play more music due to the ease of putting it on.
But I hate the privacy invasion aspect. I’m definitely in the market for something like this. And this one looks great.
Additionally, I’ve noticed that the google voice assistant (connected to Spotify) doesn’t keep playing the albums I ask for.
It states it’s playing the album, but after 4 or 5 songs it starts playing different songs, or different artists.
kkielhofner
Music output is "on the list".
The biggest fundamental issue is that the speaker built into the ESP BOX is optimized for speech and isn't going to impress anyone playing music.
That said, the ESP BOX (of course) supports bluetooth so we can definitely pair with a speaker you bring.
Willow is the first of its kind that I'm aware of to enable this kind of functionality at anything close to this price point in the open source ecosystem. We (or someone else) will likely manufacture an improved ESP BOX with market-competitive speakers built in for music playback.
Then it's "just" a matter of actually getting the music audio but we'll figure that out ;).
COGlory
Could we not use Willow to cast music, say, via Spotify or some other network streamer, through HA, to my pre-existing sound system?
kkielhofner
The approach there would be to ignore Willow for music output and just do what it does today:
- Wake
- Get command
- Send to Home Assistant conversation/intents API[0]
- Home Assistant does whatever you define, including what you describe just like it does today
So unless I'm missing something your use case should "just work".
[0] - https://developers.home-assistant.io/docs/intent_index/
RileyJames
Nice. I guess I don’t expect willow to cover the speaker element. I’d rather connect with my existing hifi / Bluetooth speakers.
But with google I’m stuck with their integration to Spotify. It’s that component I’d like control over, and that’s why I’d use willow.
That and not being spied on in my own home.
Definitely keen for one.
kkielhofner
The ESP Box with the ESP32 S3 has robust bluetooth support and I don't see A2DP/BT/pairing management/etc being that big of a lift. In full transparency it's probably towards the bottom on the priorities list ATM but the important thing is it's on the list already and it happens to be something I'm personally interested in :).
chankstein38
It also, at least in my case, frequently won't stop playing when you tell it to. And, if you want a song that has a title that isn't family friendly, it'll completely ignore that title and just play whatever the heck it wants.
mdrzn
Very interesting, I would buy an "off the shelf" version if it worked out of the box with Vicuna 13B or a similar LLM.
kkielhofner
Our inference server (open source - releasing next week) has support for loading LLaMA and derivative models complete with 4-bit quantization, etc. I like Vicuna 13B myself :). Not to mention extremely fast and memory optimized Whisper via ctranslate2 and a bunch of our own tweaks.
Our inference server also supports long-lived sessions via WebRTC for transcription, etc applications ;).
You can chain speech to text -> LLM -> text to speech completely in the inference server and input/output through Willow, along with other APIs or whatever you want.
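The STT -> LLM -> TTS chaining described here is just function composition; a sketch with stand-in stubs for each stage (all three stage functions are hypothetical placeholders with canned outputs, not the inference server's actual API):

```python
# Stand-in stubs: the real inference server wires Whisper, a LLaMA-family
# model, and a TTS model into the same chain.
def speech_to_text(audio: bytes) -> str:
    return "what is the weather in sofia bulgaria"  # canned STT output

def run_llm(prompt: str) -> str:
    return f"Answering: {prompt}"  # canned LLM response

def text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")  # stand-in for synthesized audio

def pipeline(audio: bytes) -> bytes:
    """Chain STT -> LLM -> TTS; the result would be played on the
    Willow speaker and/or shown on its LCD."""
    return text_to_speech(run_llm(speech_to_text(audio)))

print(pipeline(b"\x00\x01").decode("utf-8"))
# → Answering: what is the weather in sofia bulgaria
```

Swapping any stage for a different model (or an external API call between LLM and TTS) keeps the same shape.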
vlugorilla
Awesome work! May I ask what are you using for text-to-speech?
kkielhofner
Thanks, of course!
For wake word and voice activity detection, audio processing, etc we use the ESP SR (speech recognition) framework from Espressif[0]. For speech to text there are two options and more to come:
1) Completely on device command recognition using the ESP SR Multinet 6 model. Willow will (currently) pull your light and switch entities from Home Assistant and generate the grammar and command definition required by Multinet. We want to develop a Willow Home Assistant component that will provide tighter Willow integration with HA and allow users to do this point and click with dynamic updates for new/changed entities, different kinds of entities, etc all in the HA dashboard/config.
The only "issue" with Multinet is that it only supports 400 defined commands. You're not going to get something like "What's the weather like in $CITY?" out of it.
For that we have:
2-?) Our own highly optimized inference server using Whisper, LLamA/Vicuna, and Speecht5 from transformers (more to come soon). We're open sourcing it next week. Willow streams audio after wake in realtime, gets the STT output, and sends it wherever you want. With the Willow Home Assistant component (doesn't exist yet) it will sit in between our inference server implementation doing STT/TTS or any other STT/TTS implementation supported by Home Assistant and handle all of this for you - including chaining together other HA components, APIs, etc.
canadiantim
Wow this looks beyond epic. I've been looking for something like this.
Going to try to hack this into something my mom can use (who has trouble with confusion and memory). Could potentially be very great.
Thank you
kkielhofner
Thanks!
We are really, truly, and seriously committed to building a device that with support from Home Assistant and other integrations doesn't leave any reason whatsoever to buy an Echo or similar creepy commercial device. No compromises on cost, performance, accuracy, speed, usability, functionality, etc.
We're really looking forward to getting additional testing and feedback from the community on speech recognition results, other integrations, etc. It's just two of us working on this part time over the last month or so - this is VERY early but I think we're off to a good start!
canadiantim
Wow yeah I think you're really onto something here. No one actually wants the creepiness from Echo or Alexa etc. That's what prevented me from trying any Home Assistant thing before, but I know it could be very useful if actually sensitive to privacy-concerns.
Best of luck with the development! I'll definitely be following closely. Do you sell the pre-built hardware yourself?
kkielhofner
Thanks!
When you're releasing a pet project of love like this you never really know if other people are going to appreciate it as much as you do. Looking here on HN it seems like people appreciate it.
We don't sell the hardware currently because:
1) Espressif has well established sales channels and distribution worldwide.
2) It's not our "business model". In my capacity as advisor to a few startups in the space I've been approached by various commercial entities that want a hardware voice interface they fully control. In healthcare, for example, there are all kinds of interesting audio and speech applications but NO ONE, and I mean NO ONE is going to be ok with seeing an Echo in their doctor's office. That's where an ESP BOX or custom manufactured hardware and Willow come in.
Our business model is to combine our soon to be released very high performance inference and API server with Willow to support these commercial applications (and home users with HA, of course). In all but a few identified and very limited cases all work will come back to the open source projects like our inference server and Willow.
As the Home Assistant project says, it's the year of voice!
I love Home Assistant and I've always thought the ESP BOX[0] hardware is cool. I finally got around to starting a project to use the ESP BOX hardware with Home Assistant and other platforms. Why?
- It's actually "Alexa/Echo competitive". Wake word detection, voice activity detection, echo cancellation, automatic gain control, and high quality audio for $50 means with Willow and the support of Home Assistant there are no compromises on looks, quality, accuracy, speed, and cost.
- It's cheap. With a touch LCD display, dual microphones, speaker, enclosure, buttons, etc it can be bought today for $50 all-in.
- It's ready to go. Take it out of the box, flash with Willow, put it somewhere.
- It's not creepy. Voice is either sent to a self-hosted inference server or commands are recognized locally on the ESP BOX.
- It doesn't hassle or try to sell you. If I hear "Did you know?" one more time from Alexa I think I'm going to lose it.
- It's open source.
- It's capable. This is the first "release" of Willow and I don't think we've even begun scratching the surface of what the hardware and software components are capable of.
- It can integrate with anything. Simple on the wire format - speech output text is sent via HTTP POST to whatever URI you configure. Send it anywhere, and do anything!
- It still does cool maker stuff. With 16 GPIOs exposed on the back of the enclosure there are all kinds of interesting possibilities.
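A minimal receiver for that HTTP POST can be stdlib-only. This sketch treats the request body as the recognized speech text (the exact payload format is an assumption - check the Willow docs for the real one):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class WillowHandler(BaseHTTPRequestHandler):
    """Accept the plain HTTP POST Willow sends after speech recognition."""
    received = []

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        text = self.rfile.read(length).decode("utf-8")
        WillowHandler.received.append(text)  # do anything you like here
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet

# Quick self-test: serve on an ephemeral port, POST one command, shut down.
server = HTTPServer(("127.0.0.1", 0), WillowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"
urllib.request.urlopen(urllib.request.Request(url, data=b"turn on the lamp", method="POST"))
server.shutdown()
print(WillowHandler.received)  # → ['turn on the lamp']
```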
This is the first (and VERY early) release but we're really interested to hear what HN thinks!
[0] - https://github.com/espressif/esp-box