
modeless

I made a 100% local voice chatbot using StyleTTS2 and other open source pieces (Whisper and OpenHermes2-Mistral-7B). It responds so much faster than ChatGPT. You can have a real conversation with it instead of the stilted Siri-style interaction you have with other voice assistants. Fun to play with!

Anyone who has a Windows gaming PC with a 12 GB Nvidia GPU (tested on 3060 12GB) can install and converse with StyleTTS2 with one click, no fiddling with Python or CUDA needed: https://apps.microsoft.com/detail/9NC624PBFGB7

The demo is janky in various ways (requires headphones, runs as a console app, etc.), but it's a sneak peek at what will soon be possible to run on a normal gaming PC just by putting together open source pieces. The models are improving rapidly; there are already several improved models I haven't yet incorporated.

lucubratory

How hard does the task of making the chatbot converse naturally look on your end? Specifically I'm thinking about interruptions: if it's talking too long, I would like to be able to start talking and interrupt it like in a normal conversation, or if I'm saying something, it could quickly interject. Once you've got the extremely high speed, theoretically faster than real time, you can start doing that stuff, right?

There is another thing remaining after that for fully natural conversation, which is making the AI context-aware like a human would be: basically giving it eyes, so it can see your face and judge body language to know if it's talking too long and needs to be more brief, the same way a human does.

modeless

Yes, I implemented the ability to interrupt the chatbot while it is talking. It wasn't too hard, although it does require you to wear headphones so the bot doesn't hear itself and get interrupted.

The other way around (bot interrupting the user) is hard. Currently the bot starts processing a response after every word that the voice recognition outputs, to reduce latency. When new words come in before the response is ready, it starts over. If it finishes its response before any more words arrive (~1 second usually), it starts speaking. This is not ideal because the user might not be done speaking, of course. If the user continues speaking, the bot will stop and listen. But deciding when the user is done speaking, or whether the bot should interrupt before the user is done, is a hard problem. It could possibly be done zero-shot by prompting an LLM, but you'd want a GPT-4-level LLM to do a good job, and GPT-4 is too slow for instant response right now. A better idea would be to train a dedicated turn-taking model that directly predicts who should speak next in conversations. I haven't thought much about how to source a dataset and train a model for that yet.
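The restart-on-new-words loop described above could be sketched roughly like this (a minimal illustration, not the actual chatbot code; `generate_reply` stands in for the LLM call, and the silence threshold is an assumed value):

```python
SILENCE_GAP = 1.0  # assumed: seconds of silence before the bot commits its reply


class TurnTaker:
    """Speculative reply generation: restart the LLM after every recognized
    word, and speak only once the user has been silent for SILENCE_GAP."""

    def __init__(self, generate_reply):
        self.generate_reply = generate_reply  # e.g. a call into the local LLM
        self.transcript = []
        self.pending_reply = None
        self.last_word_time = None

    def on_words(self, words, now):
        # New words invalidate any half-finished reply; start over.
        self.transcript.extend(words)
        self.last_word_time = now
        self.pending_reply = self.generate_reply(" ".join(self.transcript))

    def maybe_speak(self, now):
        # Commit the reply only after a silence gap; the user may still
        # continue talking, in which case on_words() restarts everything.
        if self.pending_reply is not None and now - self.last_word_time > SILENCE_GAP:
            reply, self.pending_reply = self.pending_reply, None
            self.transcript = []
            return reply
        return None
```

The hard part the comment describes is exactly the `SILENCE_GAP` heuristic: a fixed gap is a stand-in for a real turn-taking model.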

Ultimately the end state of this type of system is a complete end-to-end audio-to-audio language model. There should be only one model, it should take audio directly as input and produce audio directly as output. I believe that having TTS and voice recognition and language modeling all as separate systems will not get us to 100% natural human conversation. I think that such a system would be within reach of today's hardware too, all you need is the right training dataset/procedure and some architecture bits to make it efficient.

As for giving the model eyes, there are actually already open source vision-language models that could be used for this today! I'd love to implement one in my chatbot. It probably wouldn't have the social intelligence to read body language yet, but it could definitely answer questions about things you present to the webcam, read text, maybe even look at your computer screen and have conversations about what's on it. The latter could potentially be very useful; the endgame there is something like GitHub Copilot for everything you do on your computer, not just typing code.

lucubratory

Thanks, fascinating insights. I think an everything-to-everything multimodal model could work if it's big enough because of transfer learning (but then there are latency issues), and so could a refined system built on LLMs/LMMs with TTS (like what you are using), but I haven't seen any good research on audio-to-audio language models. My suspicion is that that would take a lot of compute, much more than text, and that the amount of semantically meaningful accessible data might be much lower as well. And if you do manage to get to the same level of quality as text, what is latency like then? Not 100% sure, just intuitions, but I doubt it's great.

I like the idea of an RL predictor for interruption timing, although I think it might struggle with factual-correction interruptions. It could be a good way to make a very fast system, and if latency on the rest of the system is low enough you could probably start slipping in your "Of course", "Yeah, I agree", and "It was in March, but yeah" for truly natural speech. If latency is low you could just use the RL system to find opportunities to interrupt, give them to the LLM/LMM, and it decides how to interrupt, all the way from "mhm", to "Yep, sounds good to me", to "Not quite, it was the 3rd entry, but yeah otherwise it makes sense", to "Actually can I quickly jump on that? I just wanted to quickly [make a point]/[ask a question] about [some thing that requires exploration before the conversation continues]".

Tuning a system like this would be the most annoying activity in human history, but something like this has to be achieved for truly natural conversation so we gotta do it lol.

fintechie

> although it does require you to wear headphones so the bot doesn't hear itself and get interrupted.

Maybe you can use some sort of speaker identification to sort this out?

https://github.com/openai/whisper/discussions/264
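A crude version of that idea: embed each incoming utterance with a speaker-verification model and drop anything that matches a reference embedding of the bot's own TTS voice. A sketch, with the embedding model left abstract and the similarity threshold an assumed tuning knob:

```python
import numpy as np


def is_own_voice(utterance_emb, bot_voice_emb, threshold=0.8):
    """Cosine similarity between a speaker embedding of the incoming audio
    and a reference embedding of the bot's TTS voice; anything above the
    (assumed) threshold is treated as the bot hearing itself."""
    a = np.asarray(utterance_emb, dtype=float)
    b = np.asarray(bot_voice_emb, dtype=float)
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cos >= threshold
```

A transcript segment flagged by `is_own_voice` would simply be dropped instead of being fed to the LLM.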

CygnusX1-2112

I installed this on my system and I'm having a little issue like you mention below, where the bot hears itself: it got into a loop of talking to itself and replying to itself. It was pretty funny.

Could the issue be that I am using a pair of Bluetooth headphones with the microphone built into them? What is the optimum setup? Should I be listening on the headphones and using a different mic input instead of the headphone mic?

This is pretty interesting and I would love to get it to work. Running a 3060.

Thanks.


generalizations

> It could possibly be done zero-shot using prompting of a LLM

That's how I've been thinking of doing it - seemed like you could use a much smaller GPT-J-ish model for that, and measure the relative probability of 'yes' vs 'no' tokens in response to a question like 'is the user done talking'. Seemed like even that would be orders of magnitude better than just waiting for silence.
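That yes/no scoring could look something like this, assuming you can read next-token log-probabilities out of whichever small model you use (the prompt wording and function names are illustrative):

```python
import math


def user_is_done(logprob_yes, logprob_no, margin=0.0):
    """The turn is finished when the model prefers the ' yes' token over
    ' no' by at least `margin` nats, after a prompt like:
    'Transcript: <...> Is the user done talking? Answer:'"""
    return (logprob_yes - logprob_no) > margin


def done_probability(logprob_yes, logprob_no):
    """Renormalize over just the two answer tokens (ignoring the rest of
    the vocabulary) to get a 0-1 'done' score you can threshold."""
    p_yes = math.exp(logprob_yes)
    p_no = math.exp(logprob_no)
    return p_yes / (p_yes + p_no)
```

In practice you'd pull `logprob_yes`/`logprob_no` out of the model's final-position logits for the two answer token ids; restricting to those two tokens is what makes this cheap enough to run on every tick.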

slow_numbnut

Instead of sacrificing flexibility by building one monolithic model that does audio-to-audio in one go, wouldn't it be better to train a model that handles conversing with the user (knows when the user is done talking, when it's hearing itself, etc.) and leave the thinking to other, more generic models?

regularfry

It would be very interesting to have something like BakLLaVA's image description fed from a webcam used as a context for the LLM. "You can see: <description of scene, description of changes from last snapshot>" or something along those lines in the system prompt.
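For instance, the caption could be folded into the system prompt on every frame, something like this (wording and function name purely illustrative; the caption would come from a vision-language model such as BakLLaVA):

```python
def build_system_prompt(scene_desc, prev_desc=None):
    """Compose a system prompt that tells the LLM what the webcam sees,
    plus what changed since the last snapshot."""
    lines = [
        "You are a voice assistant with a webcam.",
        f"You can see: {scene_desc}",
    ]
    if prev_desc and prev_desc != scene_desc:
        lines.append(f"In the last snapshot you saw: {prev_desc}")
    return "\n".join(lines)
```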

globalnode

You'd have to do something along the lines of what voice comms software does to combat the output feedback problem. I think it involves an FFT to analyse the two signals and cancel out the feedback; I'm not 100% sure on the details.

eigenvalue

Tried it but it seems it only works with Cuda 11 and I have 12 installed. Not really willing to potentially screw up my Cuda environment to try it.

modeless

Thanks for trying, what error message did you get? It works without CUDA installed at all on my test machine.

eigenvalue

  Process Process-2:
  Traceback (most recent call last):
    File "multiprocessing\process.py", line 314, in _bootstrap
    File "multiprocessing\process.py", line 108, in run
    File "chirp.py", line 126, in whisper_process
    File "chirp.py", line 126, in <listcomp>
    File "faster_whisper\transcribe.py", line 426, in generate_segments
    File "faster_whisper\transcribe.py", line 610, in encode
  RuntimeError: Library cublas64_11.dll is not found or cannot be loaded
  tts initialized

nmstoker

Using a conda environment should be able to get around that, I believe.


shon

Cool work! I tested it and got some mixed results:

1) it throws an error if it's installed to any drive other than C:\ --I moved it to C: and it works fine.

2) I'm seeing huge latency on an EVGA 3080Ti with 12GB. Also seeing it repeat the parsed input, even though I only spoke once, it appears to process the same input many times with slightly different predictions sometimes. Here's some logs:

  Latency to LLM response: 4.59  latency to speaking: 5.31
  speaking 4: Hi Jim!
  user spoke: Hi Jim.
  user spoke recently, prompting LLM. last word time: 77.81 time: 78.11742429999867 latency to prompting: 0.31

  Latency to LLM response: 2.09  latency to speaking: 3.83
  speaking 5: So what have you been up to lately?
  user spoke: So what have you been up to lately?
  user spoke recently, prompting LLM. last word time: 83.9 time: 84.09415280001122 latency to prompting: 0.19
  user spoke: So what have you been up to lately? No, I'm watching.
  user spoke a while ago, ignoring. last word time: 86.9 time: 88.92142140000942
  user spoke: So what have you been up to lately? No, just watching TV.
  user spoke a while ago, ignoring. last word time: 87.9 time: 90.76665070001036
  user spoke: So what have you been up to lately? No, I'm just watching TV.
  user spoke a while ago, ignoring. last word time: 87.9 time: 94.16581820001011
  user spoke: So what have you been up to lately? No, I'm just watching TV.
  user spoke a while ago, ignoring. last word time: 88.9 time: 97.85854300000938
  user spoke: So what have you been up to lately? No, I'm just watching TV.
  user spoke a while ago, ignoring. last word time: 87.9 time: 101.54986060000374
  user spoke: No, I just bought you a TV.
  user spoke a while ago, ignoring. last word time: 87.8 time: 104.51332219998585
  user spoke: No, I'll just watch you TV.
  user spoke a while ago, ignoring. last word time: 87.41 time: 106.60086529998807
  Latency to LLM response: 46.09  latency to speaking: 50.49

Thanks for posting it!

Edit:

3) It's hearing itself and responding to itself...

modeless

Thanks for trying it, and thanks for the feedback! Yes, right now you need to use headphones so it doesn't hear itself. Sometimes Whisper inexplicably fails to recognize speech promptly; it seems to depend on what you say, so try saying something else. I have improvements that I haven't had time to release yet that should help, and a lot more work is definitely needed. This is MVP-level stuff right now; it's fixable, but it'll take time.

funtech

Is 12 GB the minimum? I got an out-of-memory error with 8 GB.

modeless

Yes, unfortunately these models take a lot of VRAM. It may be possible to do an 8GB version but it will have to compromise on quality of voice recognition and the language model so it might not be a good experience.

joshspankit

This might be silly because of how few people it benefits, but could it be broken up onto multiple 8 GB cards in the same system?

samsepi0l121

But Whisper does not support input streaming, so you have to wait for the whole utterance to trigger the transcription, no?

yencabulator

Apparently, by running it on windows of audio very often:

https://news.ycombinator.com/item?id=38340938

zestyping

Wow! For those of us who don't have the necessary GPU hardware, can you post a video?

tomp

How do you get Whisper to be fast?

Isn't it quite non-realtime?

modeless

Great question! Whisper processes audio in 30 second chunks. But on a fast GPU it can finish in only 100 milliseconds or so. So you can run it 10+ times per second and get around 100ms latency. Even better actually because Whisper will predict past the end of the audio sometimes.

This is an advantage of running locally. Running whisper this way is inefficient but I have a whole GPU sitting there dedicated to one user, so it's not a problem as long as it is fast enough. It wouldn't work well for a cloud service trying to optimize GPU use. But there are other ways of doing real time speech recognition that could be used there.
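One way to structure that repeated-chunk trick is a fixed 30-second rolling buffer that gets re-transcribed on every tick (a sketch only; the actual Whisper call is omitted, and 16 kHz is Whisper's standard input rate):

```python
import numpy as np

SAMPLE_RATE = 16_000       # Whisper's expected input rate
WINDOW_SECONDS = 30        # Whisper always processes fixed 30 s windows


class RollingAudioBuffer:
    """Keep the most recent 30 s of microphone audio, zero-padded on the
    left, so Whisper can be re-run on it many times per second."""

    def __init__(self):
        self.window = np.zeros(SAMPLE_RATE * WINDOW_SECONDS, dtype=np.float32)

    def append(self, samples):
        samples = np.asarray(samples, dtype=np.float32)
        n = min(len(samples), len(self.window))
        self.window = np.roll(self.window, -n)   # shift old audio left
        self.window[-n:] = samples[-n:]          # newest audio at the end
        return self.window  # hand this to the (fast) Whisper call each tick
```

Each tick you would transcribe the returned window and diff the text against the previous transcription to extract the new words.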

TOMDM

The community upgrades to whisper are far faster than real-time, especially if you have a powerful gpu

Jach

There's several faster ones out there. I've been using https://github.com/Softcatala/whisper-ctranslate2 which includes a nice --live_transcribe flag. It's not as good as running it on a complete file but it's been helpful to get the gist of foreign language live streams.

wahnfrieden

Use distil-whisper, it's like 5-8x faster.

aik

Hey modeless. Love it. Is your project open source by any chance? Would love to see it.

modeless

I haven't decided yet what I'm going to do with it. I think ideally I would open source it for people who have GPUs but also run it as a paid service for people who don't have GPUs. Open source that also makes money is always the holy grail :) I'll post updates on my Twitter/X account.

lhl

I tested StyleTTS2 last month, my step-by-step notes that might be useful for people doing local setup (not too hard): https://llm-tracker.info/books/howto-guides/page/styletts-2

Also I did a little speed/quality shootoff with the LJSpeech model (vs VITS and XTTS). StyleTTS2 was pretty good and very fast: https://fediverse.randomfoo.net/notice/AaOgprU715gcT5GrZ2

kelseyfrog

> inferences at up to 15-95X (!) RT on my 4090

That's incredible!

Are infill and outpainting equivalents possible? Super-RT TTS at this level of quality opens up a diverse array of uses esp for indie/experimental gamedev that I'm excited for.

huac

It is theoretically possible to train a model that, given some speech, attempts to continue the speech, e.g. Spectron: https://michelleramanovich.github.io/spectron/spectron/. Similarly, it is possible to train a model to edit the content, a la Voicebox: https://voicebox.metademolab.com/edit.html.

taneq

Great. :P

Me: Won’t it be great when AI can-

Computer: Finish your sentences for you? OMG that’s exactly what I was thinking!

JonathanFly

>Are infill and outpainting equivalents possible?

Do you mean outpainting as in you still specify what words to say, or the model just extends the audio unconditionally, the way some image models expand past an image's borders without a specific prompt? (In audio, like https://twitter.com/jonathanfly/status/1650001584485552130)

refulgentis

Not sure what you mean: If you mean could inpainting and outpainting with image models be faster, it's a "not even wrong" question, similar to asking if the United Airlines app could get faster because American Airlines' did. (Yes, getting faster is an option available to ~all code.)

If you mean could you inpaint and outpaint text...yes, by inserting and deleting characters.

If you mean could you use an existing voice clip to generate speech by the same speaker in the clip, yes, part of the article is demonstrating generating speech by speakers not seen at training time

pedrovhb

I'm not sure I understand what you mean to say. To me it's a reasonable question asking whether text to speech models can complete a missing part of some existing speech audio, or make it go on for longer, rather than only generating speech from scratch. I don't see a connection to your faster apps analogy.

Fwiw, I imagine this is possible, at least to some extent. I was recently playing with XTTS and it can generate speaker embeddings from short periods of speech, so you could use those to provide a logical continuation to existing audio. However, I'm not sure it's yet possible, or easy, to manage the "seams" between what is generated and what is preexisting.

It's certainly not a misguided question to me. Perhaps you could be less curt and offer your domain knowledge to contribute to the discussion?

Edit: I see you've edited your post to be more informative, thanks for sharing more of your thoughts.

kelseyfrog

Ignore the speed comment; it is unrelated to my question.

What I mean is: can output be conditioned on antecedent audio as well as text, analogous to how image diffusion models can condition inpainting and outpainting on static parts of an image and CLIP embeddings?

rahimnathwani

Thanks. Following the instructions now. BTW mamba is no longer recommended (for those like me who aren't already using it), and the #mambaforge anchor in the link didn't work.

lhl

I switched from conda to mamba a while ago and never looked back (it's probably saved dozens of hours from waiting for conda's slow as molasses package resolution). I'm looking at the latest docs and it doesn't look like there's any deprecation messages or anything (it does warn against installing mamba inside of conda, but that's been the case for a long time): https://mamba.readthedocs.io/en/latest/installation/mamba-in...

It looks like miniforge is still the recommended install method, but also the anchor has changed in the repo docs, which I've updated, thx. FWIW, I haven't run into any problems using mamba. I'm not a power user, so there are edge cases I might have missed, but I have over 35 mamba envs on my dev machine atm, so it's definitely been doing the job for me and remains wicked fast (if not particularly disk efficient).

rahimnathwani

I had somehow missed the introduction of mamba, and have been using the default conda solver (which I think is the 'classic' one). Apparently conda now supports using the mamba solver: https://www.anaconda.com/blog/a-faster-conda-for-a-growing-c...

  conda update -n base conda
  conda install -n base conda-libmamba-solver
  conda config --set solver libmamba

eigenvalue

It was somewhat annoying to get everything to work, as the documentation is a bit spotty, but after ~20 minutes it's all working well for me on WSL Ubuntu 22.04. Sound quality is very good, much better than other open source TTS projects I've seen. It's also SUPER fast (at least using a 4090 GPU).

Not sure it's quite up to Eleven Labs quality. But to me, what makes Eleven so cool is that they have a large library of high quality voices that are easy to choose from. I don't yet see any way with this library to get a different voice from the default female voice.

Also, the real special sauce for Eleven is the near instant voice cloning with just a single 5 minute sample, which works shockingly (even spookily) well. Can't wait to have that all available in a fully open source project! The services that provide this as an API are just too expensive for many use cases. Even the OpenAI one which is on the cheaper side costs ~10 cents for a couple thousand word generation.

eigenvalue

To save people some time, this is tested on Ubuntu 22.04 (google is being annoying about the download link, saying too many people have downloaded it in the past 24 hours, but if you wait a bit it should work again):

  git clone https://github.com/yl4579/StyleTTS2.git
  cd StyleTTS2
  python3 -m venv venv
  source venv/bin/activate
  python3 -m pip install --upgrade pip
  python3 -m pip install wheel
  pip install -r requirements.txt
  pip install phonemizer
  sudo apt-get install -y espeak-ng
  pip install gdown
  gdown https://drive.google.com/uc?id=1K3jt1JEbtohBLUA0X75KLw36TW7U1yxq
  7z x Models.zip
  rm Models.zip
  gdown https://drive.google.com/uc?id=1jK_VV3TnGM9dkrIMsdQ_upov8FrIymr7
  7z x Models.zip
  rm Models.zip
  pip install ipykernel pickleshare nltk SoundFile
  python -c "import nltk; nltk.download('punkt')"
  pip install --upgrade jupyter ipywidgets librosa
  python -m ipykernel install --user --name=venv --display-name="Python (venv)"
  jupyter notebook
  
Then navigate to /Demo and open either `Inference_LJSpeech.ipynb` or `Inference_LibriTTS.ipynb` and they should work.


degobah

Very helpful, thanks!

wczekalski

One thing I've seen done for style cloning is a high-quality fine-tuned TTS -> RVC pipeline to "enhance" the output: TTS for intonation + pronunciation, RVC for voice texture. With StyleTTS and this pipeline you should get close to ElevenLabs.

eigenvalue

I suspect they are doing many more things to make it sound better. I certainly hope open source solutions can approach that level of quality, but so far I've been very disappointed.

KolmogorovComp

RVC? R… Voice Model?

stavros

Retrieval-based voice conversion, apparently.

sandslides

The LibriTTS demo clones unseen speakers from a five-second-or-so clip.

eigenvalue

Ah ok, thanks. I tried the other demo.

eigenvalue

I tried it. Sounds absolutely nothing like my voice or my wife's voice. I used the same sample files as I used 2 days ago on the Eleven Labs website, and they worked flawlessly there. So this is very, very far from being close to "Eleven Labs quality" when it comes to voice cloning.

wczekalski

Have you tested longer utterances with both ElevenLabs and with StyleTTS? Short audio synthesis is a ~solved problem in the TTS world, but things start falling apart once you want to do something like create an audiobook with text to speech.

wingworks

I can say that the paid service from ElevenLabs can do long form TTS very well. I used it for a while to convert long articles to voice to listen to later instead of reading. It works very well. I only stopped because it gets a little pricey.

stavros

The OpenAI API is ten times cheaper and a fair bit faster.

Also, ElevenLabs keeps diverging for me, and starts mispronouncing words after two or three sentences.

satvikpendem

Funnily enough, the TTS2 examples sound better than the ground truth [0]. For example, the "Then leaving the corpse within the house [...]" example has the ground truth pronounce "house" weirdly, with some change in the tonality that sounds higher, but the TTS2 version sounds more natural.

I'm excited to use this for all my ePub files, many of which don't have corresponding audiobooks, such as a lot of Japanese light novels. I am currently using Moon+ Reader on Android which has TTS but it is very robotic.

[0] https://styletts2.github.io/

qingcharles

First Wife is a professional voice-over actor. I saw someone left her a bad review saying "Clearly an AI."

2023. There is no way to win.

KolmogorovComp

The pace is better, but imho there is still a very noticeable "metallic" tone which makes it inferior to the real thing.

Impressive results nonetheless, and superior to all other TTS.

risho

How are you planning on using this with ePubs? I'm in a similar boat; I would really like to leverage something like this for ebooks.

satvikpendem

I wonder if you can add a TTS engine to Android as an app or plugin, then have Moon+ Reader or another reader use that custom engine. That's probably the easiest approach, but if that doesn't work, I might just have to make my own app.

a_wild_dandan

I’m planning on making a self-host solution where you can upload files and the host sends back the audio to play, as a first pass on this tech. I’ll open source the repo after fiddling and prototyping. I’ve needed this kinda thing for a long time!

jrpear

You can! [rhvoice](https://rhvoice.org/) is an open source example.

gjm11

HN title at present is "StyleTTS2 – open-source Eleven Labs quality Text To Speech". Actual title at the far end doesn't name any particular other product; arXiv paper linked from there doesn't mention Eleven Labs either. I thought this sort of editorializing was frowned on.

stevenhuang

Eleven Labs is the gold standard for voice synthesis. There is nothing better out there.

So it is extremely notable for an open source system to be able to approach this level of quality, which is why I'd imagine most would appreciate the comparison. I know it caught my attention.

lucubratory

OpenAI's TTS is better than Eleven Labs, but they don't let you train it to have a particular voice out of fear of the consequences.

huac

I concur that, for the use cases that OpenAI's voices cover, it is significantly better than Eleven.

yreg

But is this even approaching Eleven? Doesn't seem like it from the other comments here.

modeless

It is editorializing and it is an exaggeration. However I've been using StyleTTS2 myself and IMO it is the best open source TTS by far and definitely deserves a spot on the top of HN for a while.

GaggiX

Yes, it's against the guidelines. In fact, when I read the title, I didn't think it was a new research paper but a random GitHub project.

jasonjmcghee

Out of curiosity - to folks that have had success with this...

This voice cloning is... nothing like XTTSv2, let alone ElevenLabs.

It doesn't seem to care about accents at all. It does pretty well with pitch and cadence, and that's about it.

I've tried all kinds of different values for alpha, beta, embedding scale, diffusion steps.

Anyone else have better luck?

Sure it's fast and the sound quality is pretty good, but I can't get the voice cloning to work at all.

jsjmch

See my previous comment about this point. ElevenLabs is based on Tortoise-TTS, which was already pre-trained on millions of hours of data, but this one was only trained on LibriTTS, which is about 500 hours at best. XTTS was also trained on probably millions of speakers in more than 20 languages.

If you have seen millions of voices, there are definitely gonna be some of them that sound like you. It is just a matter of training data, but it is very difficult to have someone collect these large amounts of data and train on it.

lossolo

> It is just a matter of training data, but it is very difficult to have someone collect these large amounts of data and train on it.

It's really not that difficult; they are trained mostly on audiobooks and high-quality audio from YouTube videos. If we're talking about the EV model, then we're talking about around 500k hours of audio, but Tortoise-TTS is only around 50k from what I remember.

wczekalski

What's your basis for the claim that they are based on TorToiSe? I have seen this claim made (and rebutted) many times.

jsjmch

Very similar features, quite slow inference speed, and various rumors.

dsrtslnd23

See the conclusion remarks in the paper - they acknowledge that voice cloning is not that good (yet).

carbocation

I had the same experience as what you described (with a lot of experimentation with alpha and beta, as well as uploading different audio clips).

wg0

The quality is really really INSANE, and pretty much unimaginable in the early 2000s.

It could have interesting prospects for games, where you have an LLM assuming a character and TTS like this giving those NPCs a voice.

beachy

This is a big thing for one area I'm interested in - golf simulation.

Currently, playing in a golf simulator has a bit of a post-apocalyptic vibe. The birds are cheeping, the grass is rustling, the gameplay is realistic, but there's not a human to be seen. Just so different from the smack-talking of a real round, or the crowd noise at a big game.

It's begging for some LLM-fuelled banter to be added.

eXpl0it3r

In Super Video Golf, a more old-school/retro-style game that a friend of mine made, there are some clapping sound effects when people are in view. However, I feel like the nature sound on its own is also kind of relaxing.

billylo

Or the occasional "Fore!!"s. :-)

sandslides

Just tried the Colab notebooks. Seems to be very good quality. It also supports voice cloning.

fullstackchris

Great stuff, took a look through the README but... what are the minimum hardware requirements to run this? Is this gonna blow up my CPU / harddrive?

sandslides

Not sure. The only inference demos are Colab notebooks. The models are approx 700 MB each, so I imagine it will run on a modest GPU.

bbbruno222

Would it run in a cheap non-GPU server?

thot_experiment

I skimmed the github but didn't see any info on this, how long does it take to finetune to a particular voice?

stevenhuang

I really want to try this but making the venv to install all the torch dependencies is starting to get old lol.

How are other people dealing with this? Is there an easy way to get multiple venvs to share like a common torch venv? I can do this manually but I'm wondering if there's a tool out there that does this.

wczekalski

I use nix to setup the python env (python version + poetry + sometimes python packages that are difficult to install with poetry) and use poetry for the rest.

The workflow is:

  > nix flake init -t github:dialohq/flake-templates#python
  > nix develop -c $SHELL
  > # I'm in the shell with poetry env, I have a shell hook in the nix devenv that does poetry install and poetry activate.

stavros

I generally try to use Docker for this stuff, but yeah, it's the main reason why I pass on these, even though I've been looking for something like this. It's just too hard to figure out the dependencies.

lukasga

Can relate to this problem a lot. I have considered starting using a Docker dev container and making a base image for shared dependencies which I then can customize in a dockerfile for each new project, not sure if there's a better alternative though.

stevenhuang

Yeah there is the official Nvidia container with torch+cuda pre-installed that some projects use.

I feel more projects should start with that as the base instead of pinning whatever variants. Most aren't using specialized CUDA kernels, after all.

I suppose there's the answer: just pick the specific torch+CUDA base that matches the major version of the project you want to run. Then cross your fingers and hope the dependencies mesh :p.

eurekin

Same here. I'm using conda and eyeing simply installing a pytorch into the base conda env

lhl

I don't think "base" works like that (while it can be a fallback for some dependencies, afaik, Python packages are isolated/not in path). But even if you could, don't do it. Different packages usually have different pytorch dependencies (often CUDA as well) and it will definitely bite you.

The biggest optimization I've found is to use mamba for everything. It's ridiculously faster than conda for package resolution. With everything cached, you're mostly just waiting for your SSD at that point.

(I suppose you could add the base env's lib path to the end of your PYTHONPATH, but that sounds like a sure way to get bitten by weird dependency/reproducibility issues down the line.)

eurekin

Thank you! First time I've come across it. Looks very promising.

amelius

> is starting to get old lol.

If it's starting to get old, then this means that an LLM like Copilot should be able to do it for you, no?

stevenhuang

I mean that I already have like 10 different torch venvs for different projects all with various pinned versions and CUDA variants.

Still worth the trade-off of not having to deal with dependency hell, but you start to wonder if there is a better way. All together this is many GBs of duplicated libs, wasted bandwidth and compute.

carbocation

Curious if we'll see a Civitai-style LoRA[1] marketplace for text-to-speech models.

1 = https://github.com/microsoft/LoRA

carbocation

Having now tried it (the linked repo links to pre-built colab notebooks):

1) It does a fantastic job of text-to-speech.

2) I have had no success in getting any meaningful zero-shot voice cloning working. It technically runs and produces a voice, but it sounds nothing like the target voice. (This includes trying their microphone-based self-voice-cloning option.)

Presumably fine-tuning is needed - but I am curious if anyone had better luck with the zero-shot approach.

Evidlo

What's a ballpark estimate for inference time on a modern CPU?


StyleTTS2 – open-source Eleven-Labs-quality Text To Speech - Hacker News