
ggerganov

Hi HN, happy to see this here!

I highly recommend taking a look at the technical details of the server implementation that enables large context usage with this plugin - I think it is interesting and has some cool ideas [0].

Also, the same plugin is available for VS Code [1].

Let me know if you have any questions about the plugin - happy to explain. Btw, the performance has improved compared to what is seen in the README videos thanks to client-side caching.

[0] - https://github.com/ggerganov/llama.cpp/pull/9787

[1] - https://github.com/ggml-org/llama.vscode
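For the curious: as I understand the linked PR, a key idea is that the client maintains a ring buffer of recently visited code chunks and sends them along as extra context, with caching so unchanged chunks aren't reprocessed. A toy sketch of such a deduplicating ring buffer (the names and eviction policy here are illustrative, not the actual implementation):

```python
from collections import deque

class ChunkRing:
    """Fixed-capacity ring of context chunks; re-adding an existing
    chunk refreshes it instead of storing a duplicate."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.chunks = deque()

    def add(self, chunk):
        if chunk in self.chunks:        # refresh: move to the most-recent slot
            self.chunks.remove(chunk)
        self.chunks.append(chunk)
        while len(self.chunks) > self.capacity:  # evict the oldest chunk
            self.chunks.popleft()

    def as_context(self):
        return "\n".join(self.chunks)

ring = ChunkRing(capacity=2)
ring.add("def a(): ...")
ring.add("def b(): ...")
ring.add("def a(): ...")   # refreshed, not duplicated
ring.add("def c(): ...")   # evicts "def b(): ..." (now the oldest)
```

The point of the cap is that context assembly stays cheap and bounded no matter how long the editing session runs.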

amrrs

For those who don't know, he is the gg of `gguf`. Thank you for all your contributions! It's literally the core of Ollama, LM Studio, Jan, and multiple other apps!

kennethologist

A. Legend. Thanks for having DeepSeek available so quickly in LM Studio.

sergiotapia

well hot damn! killing it!

halyconWays

[flagged]

kamranjon

They collaborate! Her name is Justine Tunney - she took her “execute everywhere” work with Cosmopolitan to make Llamafile using the llama.cpp work that Georgi has done.

madeforhnyo

Someone did? Could you please share a link?

bangaladore

Doing some quick testing in VS Code to see if I'd consider replacing Copilot with this. The biggest showstopper for me right now is that the output length is quite small. The default length is set to 256, but even if I up it to 4096, I'm not getting any larger chunks of code.

Is this because of a max latency setting, or the internal prompt, or am I doing something wrong? Or is it only really made to autocomplete lines, and not blocks like Copilot will?

Thanks :)

ggerganov

There are 4 stopping criteria atm:

- Generation time exceeded (configurable in the plugin config)

- Number of tokens exceeded (not the case since you increased it)

- Indentation - stops generating if the next line has a shorter indent than the first line

- Small probability of the sampled token

Most likely you are hitting the last criterion. It's something that should be improved in some way, but I am not sure how. Currently, it uses a very basic token sampling strategy with custom threshold logic to stop generating when the token probability is too low. Likely this logic is too conservative.
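For readers following along, the four criteria above could be sketched like this (the names and thresholds are made up for illustration; the real logic lives in the llama.cpp server):

```python
def should_stop(elapsed_ms, n_tokens, line, first_line, token_prob,
                t_max_ms=500, n_max=256, p_min=0.05):
    """Illustrative FIM stopping rules; all thresholds are hypothetical."""
    if elapsed_ms > t_max_ms:                   # 1. generation time exceeded
        return True
    if n_tokens >= n_max:                       # 2. token budget exceeded
        return True
    indent = len(line) - len(line.lstrip())
    first_indent = len(first_line) - len(first_line.lstrip())
    if line.strip() and indent < first_indent:  # 3. dedent below the first line
        return True
    if token_prob < p_min:                      # 4. sampled token too unlikely
        return True
    return False
```

The fourth rule is the one discussed above: a hard probability floor is a blunt instrument, since perfectly good continuations can contain an occasional low-probability token.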

bangaladore

Hmm, interesting.

I hadn't caught T_max_predict_ms, so I upped that to 5000ms for fun. It doesn't seem to make a difference, so I'm guessing you are right.

eklavya

Thanks for sharing the vscode link. After trying I have disabled the continue.dev extension and ollama. For me this is wayyyyy faster.

jerpint

Thank you for all of your incredible contributions!

liuliu

KV cache shifting is interesting!

Just curious: how much of your code nowadays is completed by an LLM?

ggerganov

Yes, I think it is surprising that it works.

I think a fairly large amount, though I can't give a good number. I have been using GitHub Copilot from the very early days, but with the release of Qwen Coder last year I fully switched to local completions. I don't use the chat workflow to code though, only FIM.

menaerus

Interesting approach.

Am I correct to understand that you're basically minimizing the latencies and required compute/mem-bw by avoiding the KV cache? And encoding the (local) context in the input tokens instead?

I ask this because you set the prompt/context size to 0 (--ctx-size 0) and the batch size to 1024 (-b 1024). The former would mean that llama.cpp only uses the context encoded in the model itself and no local (code) context beyond what is provided in the input tokens - but perhaps I misunderstood something.

Thanks for your contributions and obviously the large amount of time you take to document your work!

gloflo

What is FIM?

LoganDark

llama.cpp supports FIM?
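For context: FIM (“fill in the middle”) is a completion mode where the model sees the text both before and after the cursor and predicts what goes between; the llama.cpp server exposes this via its /infill endpoint. With Qwen2.5-Coder-style special tokens (shown here purely for illustration - the server normally assembles this for you), the prompt looks roughly like:

```python
def build_fim_prompt(prefix, suffix):
    """Assemble a Qwen2.5-Coder-style fill-in-the-middle prompt:
    the model generates the text that belongs between prefix and suffix."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prompt = build_fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(2, 3))")
```

A plain left-to-right model only sees the prefix; a FIM-trained model conditions on both sides, which is what makes mid-file completions coherent.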

attentive

Is it correct to assume this plugin won't work with ollama?

If so, what's ollama missing?

mistercheph

This plugin is designed specifically for the llama.cpp server API. If you want Copilot-like features with Ollama, you can use an Ollama instance as a drop-in replacement for GitHub Copilot with this plugin: https://github.com/bernardo-bruning/ollama-copilot

There is also https://github.com/olimorris/codecompanion.nvim, which doesn't have text completion but supports a lot of other AI editor workflows that I believe are inspired by Zed, and it supports Ollama out of the box.

nancyp

TIL: Vim has its own language. Thanks Georgi for llama.cpp!

nacs

Vim is incredibly extensible.

You can use C or Vimscript, but editors like Neovim support Lua as well, which makes it really easy to write plugins.

halyconWays

Please make one for Jetbrains' IDEs!

eigenvalue

This guy is a national treasure and has contributed so much value to the open source AI ecosystem. I hope he’s able to attract enough funding to continue making software like this and releasing it as true “no strings attached” open source.

nacs

> This guy is a national treasure

Agreed, but he's an international treasure (his GitHub profile states Bulgaria).

feznyng

They have: https://ggml.ai/ under the Company heading.

cosmojg

Georgi Gerganov is the "gg" in "ggml"

acters

Also the gg in gguf

frankfrank13

Hard agree. This alone replaces GH Copilot/Cursor ($10+ a month)

estreeper

Very exciting - I'm a long-time vim user but most of my coworkers use VSCode, and I've been wanting to try out in-editor completion tools like this.

After using it for a couple hours (on Elixir code) with Qwen2.5-Coder-3B and no attempts to customize it, this checks a lot of boxes for me:

  - I pretty much want fancy autocomplete: filling in obvious things and saving my fingers the work, and these suggestions are pretty good
  - the default keybindings work for me, I like that I can keep current line or multi-line suggestions
  - no concerns around sending code off to a third-party
  - works offline when I'm traveling
  - it's fast!
So that I don't need to remember how to run the server, I'll probably set up a script that checks whether it's running, starts it in the background if not, and then runs vim - and alias vim to use that. I looked in the help documents but didn't see a way to disable the "stats" text after the suggestions, though I'm not sure it will bother me that much.
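A sketch of that wrapper idea in Python (the model path is a placeholder and 8012 is the port from the llama.vim README - adjust both to your setup):

```python
#!/usr/bin/env python3
"""Start llama-server in the background if it isn't already up, then
launch vim. Model path, port, and flags are examples only."""
import os
import socket
import subprocess
import sys
import time

HOST, PORT = "127.0.0.1", 8012   # port llama.vim expects by default

def server_running(host=HOST, port=PORT):
    """Return True if something is listening on host:port."""
    try:
        with socket.create_connection((host, port), timeout=0.5):
            return True
    except OSError:
        return False

def main(args):
    if not server_running():
        # Model file is a placeholder -- point it at your own GGUF.
        subprocess.Popen(
            ["llama-server", "-m", "qwen2.5-coder-3b-q8_0.gguf",
             "--port", str(PORT)],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        )
        time.sleep(2)  # crude: give the server a moment to bind the port
    os.execvp("vim", ["vim"] + args)  # replace this process with vim

# To use as a script, call main(sys.argv[1:]) here and alias vim to it.
```

A more robust version would poll the server's health endpoint instead of sleeping a fixed two seconds, but this captures the shape of it.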

ggerganov

Appreciate the feedback!

Currently, there isn't a user-friendly way to disable the stats from showing apart from modifying the "'show_info': 0" value directly in the plugin implementation. These things will be improved with time and will become more user-friendly.

A few extra optimizations will soon land which will further improve the experience:

- Speculative FIM

- Multiple suggestions

tomnipotent

First extension I've used that perfectly autocompletes Go method receivers.

First tab completes just "func (t *Type)" so then I can type the first few characters of something I'm specifically looking for or wait for the first recommendation to kick in. I hope this isn't just a coincidence from the combination of model and settings...

douglee650

So I assume you have tried vscode vim mode. Would love to hear your thoughts. Are you on Mac/Linux or Windows?

msoloviev

I wonder how the "ring context" works under the hood. I have previously had (and recently messed around with again) a somewhat similar project designed for a more toy/exploratory setting (https://github.com/blackhole89/autopen - demo video at https://www.youtube.com/watch?v=1O1T2q2t7i4), and one of the main problems to address definitively is the question of how to manage your KV cache cleverly so you don't have to constantly perform too much expensive recomputation whenever the buffer undergoes local changes.

The solution I came up with involved maintaining a tree of tokens branching whenever an alternative next token was explored, with full LLM state snapshots at fixed depth intervals so that the buffer would only have to be "replayed" for a few tokens when something changed. I wonder if there are some mathematical properties of how the important parts of the state (really, the KV cache, which can be thought of as a partial precomputation of the operation that one LLM iteration performs on the context) work that could have made this more efficient, like to avoid saving full snapshots or perhaps to be able to prune the "oldest" tokens out of a state efficiently.

(edit: Georgi's comment that beat me by 3 minutes appears to be pointing at information that would go some way to answer my questions!)
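The snapshot-at-fixed-depth scheme described above can be sketched independently of any LLM: checkpoint the state every k tokens, and after an edit restore the nearest earlier checkpoint and replay only the tokens past it. Here "state" is just a tuple of tokens standing in for a real KV cache, and everything is illustrative:

```python
SNAPSHOT_EVERY = 4  # snapshot interval k (illustrative)

class ReplayBuffer:
    """Toy model of KV-cache checkpointing: the saved 'state' is the
    tuple of tokens consumed so far, standing in for a real KV cache."""

    def __init__(self):
        self.tokens = []
        self.snapshots = {0: ()}       # position -> saved state

    def feed(self, token):
        self.tokens.append(token)
        pos = len(self.tokens)
        if pos % SNAPSHOT_EVERY == 0:  # checkpoint every k tokens
            self.snapshots[pos] = tuple(self.tokens)

    def replay_cost(self, edit_pos):
        """Tokens that must be re-processed after an edit at edit_pos:
        everything past the nearest snapshot at or before the edit."""
        base = max(p for p in self.snapshots if p <= edit_pos)
        return edit_pos - base

buf = ReplayBuffer()
for t in range(10):
    buf.feed(t)
```

The trade-off is the classic checkpointing one: smaller k means less replay after an edit but more snapshot memory, which is exactly why avoiding full-state snapshots (e.g. via KV-cache shifting) is attractive.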

h14h

A little bit of a tangent, but I'm really curious what benefits could come from integrating these LLM tools more closely with data from LSPs, compilers, and other static analysis tools.

Intuitively, it seems like you could provide much more context and better output as a result. Even better would be if you could fine-tune LLMs on a per-language basis and ship them alongside typical editor tooling.

A problem I see w/ these AI tools is that they work much better with old, popular languages, and I worry that this will grow as a significant factor when choosing a language. Anecdotally, I see far better results when using TypeScript than Gleam, for example.

It would be very cool to be able to install a Gleam-specific model that could be fed data from the LSP and compiler, and wouldn't constantly hallucinate invalid syntax. I also wonder if, with additional context & fine-tuning, you could make these models smaller and more feasible to run locally on modest hardware.

sdesol

> work much better with old, popular languages

I think this will improve significantly over time as hardware becomes cheaper. As long as newer languages can map to older languages (syntax/function-wise), we should be able to generate enough synthetic data to make working with lesser-known languages easier.

rabiescow

You are free to contribute your own Gleam LLMs. They're only as good as their inputs, so if there are very few publicly available packages for a certain language, that's all the input they get...

mijoharas

Can anyone compare this to Tabbyml?[0] I just set that up yesterday for emacs to check it out.

The context gathering seems very interesting[1], and very vim-integrated, so I'm guessing there isn't anything very similar for Tabby. I skimmed the docs and saw some stuff about context for the Tabby chat feature[2] - which I'm not super interested in using, even if what the docs describe sounds nice - but nothing obvious for the autocompletion[3].

Does anyone have more insight or info to compare the two?

As a note, I quite like that the LLM context here "follows" what you're doing. It seems like a nice idea. Does anyone know if anyone else does something similar?

[0] https://www.tabbyml.com/

[1] https://github.com/ggerganov/llama.cpp/pull/9787#issue-25729... "global context onwards"

[2] https://tabby.tabbyml.com/docs/administration/context/

[3] https://tabby.tabbyml.com/docs/administration/code-completio...

mijoharas

Ahhh, it seems that tabby does use RAG and context providers for the code completion. Interesting:

> During LLM inference, this context is utilized for code completion

Hmmm... I wonder what's better. As I'm coding I jump and search to relevant parts of the codebase to build up my own context for solving the problem, and I expect that's likely better than RAG. Llama.vim seems to follow this model, while tabby could theoretically get at things I'm not looking at/haven't looked at recently...

ghthor

I’ve been using Tabby happily since May 2023

mijoharas

Have you setup the context providers, and do you find that important for getting good completions?

dingnuts

Is anyone actually getting value out of these models? I wired one up to Emacs and the local models all produce a huge volume of garbage output.

Occasionally I find a hosted LLM useful but I haven't found any output from the models I can run in Ollama on my gaming PC to be useful.

It's all plausible-looking but incorrect. I feel like I'm taking crazy pills when I read about others' experiences. Surely I am not alone?

remexre

I work on compilers. A friend of mine works on webapps. I've seen Cursor give him lots of useful code, but it's never been particularly useful on any of the code of mine that I've tried it on.

It seems very logical to me that there'd be orders of magnitude more training data for some domains than others, and that existing models' skill is not evenly distributed cross-domain.

dkga

This. Also across languages. For example, I suppose there is a lot more content in Python and JavaScript than in AppleScript. (And to be fair, not a lot of the Python suggestions I receive are actually mind-blowingly good.)

q0uaur

I'm still patiently waiting for an easy way to point a model at some documentation and make it actually use that.

My use case is GDScript for Godot games, and all the models I've tried so far use Godot 2 stuff that's just not around anymore; even if you tell them to use Godot 4, they give way too much wrong output to be useful.

I wish I could just point one at the latest Godot docs and have it give up-to-date answers. But seeing as that's still not a thing, I guess it's more complicated than I expect.

psytrx

There's llms.txt [0], but it's not gaining much popularity.

My web framework of choice provides these [1], but they can't easily be injected into the LLM context without a lot of fuss. It would be a game changer if more LLM tools implemented them.

[0] https://llmstxt.org/

[1] https://svelte.dev/docs/llms

doctoboggan

It's definitely a thing already. Look up "RAG" (Retrieval Augmented Generation). Most of the popular closed source companies offer RAG services via their APIs, and you can also do it with local llms using open-webui and probably many other local UIs.
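To illustrate the retrieve-then-prepend shape of RAG without any embedding model, here is a toy version using plain word-overlap scoring (real systems use vector embeddings and chunking, but the structure is the same; the example docs are invented):

```python
def retrieve(query, docs, k=1):
    """Rank docs by word overlap with the query; return the top k."""
    q = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, docs):
    """Prepend the retrieved passages to the user question."""
    context = "\n".join(retrieve(query, docs, k=1))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = [
    "In Godot 4 you export a variable with the @export annotation.",
    "Godot 2 used the export keyword.",
]
prompt = build_prompt("how do I export a variable in godot 4", docs)
```

The retrieved passage lands in the context window ahead of the question, so the model answers from current documentation rather than whatever stale version dominated its training data.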

mohsen1

Cursor can follow links

fovc

> I feel like I'm taking crazy pills when I read about others' experiences. Surely I am not alone?

You're not alone :-) I asked a very similar question about a month ago: https://news.ycombinator.com/item?id=42552653 and have continued researching since.

My takeaway was that autocomplete, boilerplate, and one-off scripts are the main use cases. To use an analogy, I think the code assistants are more like an upgrade from handsaw to power tools and less like hiring a carpenter. (Which is not what the hype engine will claim.)

For me, only the one-off script (write-only code) use-case is useful. I've had the best results on this with Claude.

Emacs abbrevs/snippets (+ choice of language) virtually eliminate the boilerplate problem, so I don't have a use for assistants there.

For autocomplete, I find that LSP completion engines provide 95% of the value for 1% of the latency. Physically typing the code is a small % of my time/energy, so the value is more about getting the right names, argument order, and other fiddly details I may not remember exactly. But I find that LSP-powered autocomplete and tooltips largely solve those challenges.

sdesol

> like an upgrade from handsaw to power tools and less like hiring a carpenter. (Which is not what the hype engine will claim).

I 100% agree with the not-hiring-a-carpenter part, but we need a better way to describe the improvement over just a handsaw. If you have domain knowledge, it can become an incredible design aid/partner. Here is a real-world example of how it is changing things for me.

I have a TreeTable component which I built 100% with LLM and when I need to update it, I just follow the instructions in this chat:

http://beta.gitsense.com/?chat=dd997ccd-5b37-4591-9200-b975f...

Right now, I am thinking about adding folders to organize chats, and here is the chat with DeepSeek for that feature:

http://beta.gitsense.com/?chat=3a94ce40-86f2-4e68-b5d7-88d33...

I'm thoroughly impressed as it suggested data structures and more for me to think about. And here I am asking it to review what was discussed to make the information easier to understand.

http://beta.gitsense.com/?chat=8c6bf5db-49a7-4511-990c-5e6ad...

All of this cost me less than a penny. I'm still waiting for my Anthropic API limit to reset and I'm going to ask Sonnet for feedback as well, and I figure that will cost me 5 cents.

I fully understand the not hiring a carpenter part, but I think what LLMs bring to the table is SO MUCH more than an upgrade to a power tool. If you know what you need and can clearly articulate it well enough, there really is no limit to what you can build with proper instructions, provided the solution is in its training data and you have a good enough BS detector.

strogonoff

> If you know what you need and can clearly articulate it well enough, there really is no limit to what you can build with proper instructions, provided the solution is in its training data and you have a good enough BS detector.

In other words: you must already know how to do what you are asking the LLM to do.

In other words: it may make sense if typing speed is your bottleneck and you are dealing with repetitive tasks that have been solved well many times (i.e., you want an advanced autocomplete).

This basically makes it useless for me. Typing speed is not a bottleneck, I automate or abstract away repetition, and I seek novel tasks that have not yet been well solved—or I just reuse those existing solutions (maybe even contributing to respective OSS projects).

In cases where something new was needed, in areas I don't know well, it completely failed me. NB: I never actually used it myself; I only gave in to a suggestion by a friend (whom LLMs reportedly help) to use his LLM-wrangling skills on a thorny case.

barrell

I think you make a very good point about your existing devenv. I recently turned off GitHub copilot after maybe 2 years of use — I didn’t realize how often I was using its completions over LSPs.

Quality of Life went up massively. LSPs and nvim-cmp have come a long way (although one of these days I’ll try blink.cmp)

sangnoir

> Is anyone actually getting value out of these models?

I've found incredible value in having LLMs help me write unit tests. The quality of the test code is far from perfect, but AI tooling - Claude Sonnet specifically - is good at coming up with reasonable unit test cases after I've written the code under test (sue me, TDD zealots). I probably have to fix 30% of the tests and expand the test cases, but I'd say it cuts the number of test-code lines I author by more than 80%. This has decreased the friction so much that I've added Continuous Integration to small, years-old personal projects that had no tests before.

I've found lesser value with refactoring and adding code docs, but that's more of autocomplete++ using natural language rather than AST-derived code.

coder543

> I wired one up

“One”? Wired up how? There is a huge difference between the best and worst. They aren’t fungible. Which one? How long ago? Did it even support FIM (fill in middle), or was it blindly guessing from the left side? Did the plugin even gather appropriate context from related files, or was it only looking at the current file?

If you try Copilot or Cursor today, you can experience what “the best” looks like, which gives you a benchmark to measure smaller, dumber models and plugins against. No, Copilot and Cursor are not available for emacs, as far as I know… but if you want to understand if a technology is useful, you don’t start with the worst version and judge from that. (Not saying emacs itself is the worst… just that without more context, my assumption is that whatever plugin you probably encountered was probably using a bottom tier model, and I doubt the plugin itself was helping that model do its best.)

There are some local code completion models that I think are perfectly fine, but I don’t know where you will draw the line on how good is good enough. If you can prove to yourself that the best models are good enough, then you can try out different local models and see if one of those works for you.

Lanedo

There is https://github.com/copilot-emacs/copilot.el, which gets Copilot to work in Emacs via JS glue code and binaries provided by copilot.vim.

I hacked up a slim alternative localpilot.js layer that uses llama-server instead of the copilot API, so copilot.el can be used with local LLMs, but I find the copilot.el overlays kinda buggy... It'd probably be better to instead write a llamapilot.el for local LLMs from scratch for emacs.

b5n

Emacs has had multiple LLM integration packages available for quite a while (relative to the rise of LLMs). `gptel` supports multiple providers including Anthropic, OpenAI, Ollama, etc.

https://github.com/karthink/gptel

whimsicalism

there's avante.nvim

colonial

I do not, but I suspect that's mainly because I far and away prefer statically typed languages with powerful LSP implementations.

Copilot, Ollama, and the others have all been strictly inferior to rust-analyzer. The suggested code is often just straight up invalid and takes just long enough to be annoying. Compare that to just typing '.'/'::' + a few characters to fuzzy-select what I'm looking for + enter.

ETA: Both did save me a few seconds here and there when obvious "repetition with a few tweaks each line" was involved, but to me that is not worth a monthly subscription or however much wall power my GPU was consuming.

codingdave

Yep - I don't get a ton of value out of autocompletions, but I get decent value from asking an LLM how they would approach more complex functions or features. I rarely get code back that I can copy/paste, but reading their output is something I can react to - whether it is good or bad, just having a starting point speeds up the design of new features vs. me burning time creating my first/worst draft. And that is the goal here, isn't it? To get some productivity gains?

So maybe it is just a difference in perspective? Even incorrect code and bad ideas can still be helpful. It is only useless if you expect them to hand you working code.

whimsicalism

I don't find value in the models that make economic sense to self-host. I do get value out of a Llama 70B, for instance, though.

righthand

Honestly, I just disabled my TabNine plugin and have found that the LSP server is good enough for 99% of what I do. I really don’t need hypothetical output suggested to me. I’m comfortable reading docs though, so others may feel different.

frankfrank13

Is this more or less the same as your VSCode version? (https://github.com/ggml-org/llama.vscode)

binary132

I am curious to see what will be possible with consumer grade hardware and more improvements to quantization over the next decade. Right now, even a 24GB gpu with the best models isn’t able to match the barely acceptable performance of hosted services I’m not willing to even pay $20 a month for.

mohsen1

Terminal coding FTW!

And when you're really stuck you can use DeepSeek R1 for a deeper analysis in your terminal using `askds`

https://github.com/bodo-run/askds

opk

Has anyone actually got this llama stuff to be usable on even moderate hardware? I find it just crashes because it doesn't find enough RAM. I've got 2G of VRAM on an AMD graphics card and 16G of system RAM and that doesn't seem to be enough. The impression I got from reading up was that it worked for most Apple stuff because the memory is unified and other than that, you need very expensive Nvidia GPUs with lots of VRAM. Are there any affordable options?

horsawlarway

Yes. Although I suspect my definition of "moderate hardware" doesn't really match yours.

I can run 2b-14b models just fine on the CPU on my laptop (framework 13 with 32gb ram). They aren't super fast, and the 14b models have limited context length unless I run a quantized version, but they run.

If you just want generation and it doesn't need to be fast... drop the $200 for 128gb of system ram, and you can run the vast majority of the available models (up to ~70b quantized). Note - it won't be quick (expect 1-2 tokens/second, sometimes less).

If you want something faster in the "low end" range still - look at picking up a pair of Nvidia p40s (~$400) which will give you 16gb of ram and be faster for 2b to 7b models.

If you want to hit my level for "moderate", I use 2x3090 (I bought refurbed for ~$1600 a couple years ago) and they do quite a bit of work. Ex - I get ~15t/s generation for 70b 4 quant models, and 50-100t/s for 7b models. That's plenty usable for basically everything I want to run at home. They're faster than the m2 pro I was issued for work, and a good chunk cheaper (the m2 was in the 3k range).

That said - the m1/m2 macs are generally pretty zippy here, I was quite surprised at how well they perform.

Some folks claim to have success with the K80s, but I haven't tried them, and while 24 GB of VRAM for under $100 seems nice (even if it's slow), the Linux compatibility issues make me inclined to just go for the P40s right now.

I run some tasks on much older hardware (ex - willow inference runs on an old 4gb gtx 970 just fine)

So again - I'm not really sure we'd agree on moderate (I generally spend ~$1000 every 4-6 years to build a machine to play games, and the machine you're describing would match the specs for a machine I would have built 12+ years ago)

But you just need literal memory. Bumping to 32gb of system ram would unlock a lot of stuff for you (at low speeds) and costs $50. Bumping to 128gb only costs a couple hundred, and lets you run basically all of them (again - slowly).

zamadatix

2G is pretty low, and the sizes of things you can get to run fast on that setup probably aren't particularly attractive. "Moderate hardware" varies, but you can grab a 12 GB RTX 3060 on eBay for ~$200. You can get a lot more RAM for $200, but it'll be so slow compared to the GPU that I'm not sure I'd recommend it if you actually want to use things like this interactively.

If "moderate hardware" is your average office PC then it's unlikely to be very usable. Anyone with a gaming GPU from the last several years should be workable though.

horsawlarway

I'll second this, actually - $250 for a 12gb rtx 3060 is probably a better buy than $400 for 2xp40s for 16gb.

It'd been a minute since I checked refurb prices and $250 for the rtx 3060 12gb is a good price.

Easier on the rest of the system than a 2x card setup, and it's probably a drop-in replacement.

bhelkey

Have you tried Ollama [1]? You should be able to run a 8b model in RAM and a 1b model in VRAM.

[1] https://news.ycombinator.com/item?id=42069453

basilgohar

I can run 7B models with Q4 quantization on a 7000-series AMD APU without GPU acceleration quite acceptably fast. This is with DDR5-5600 RAM, which is the current roadblock for performance.

Larger models work but slow down. I do have 64GB of RAM but I think 32 could work. 16GB is pushing it, but should be possible if you don't have anything else open.

Memory requirements depend on numerous factors. 2GB VRAM is not enough for most GenAI stuff today.
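As a rough rule of thumb behind these numbers: weight memory is approximately the parameter count times the bits per weight divided by 8, plus headroom for the KV cache and buffers. A back-of-the-envelope sketch (the 20% overhead factor is a guess for illustration, not a measurement):

```python
def weight_gb(params_b, bits):
    """Approximate quantized weight size: params (billions) * bits / 8, in GB."""
    return params_b * bits / 8

def fits(params_b, bits, mem_gb, overhead=1.2):
    """Very rough check: weights plus ~20% (KV cache, buffers) vs available memory."""
    return weight_gb(params_b, bits) * overhead <= mem_gb

# A 7B model at 4-bit needs ~3.5 GB for weights alone:
print(weight_gb(7, 4))   # 3.5
print(fits(7, 4, 2))     # False: well beyond a 2 GB GPU
print(fits(7, 4, 16))    # True: fine in 16 GB of system RAM
```

This is why 2 GB of VRAM rules out even small quantized models, while 16-32 GB of system RAM opens up the 7B-14B range at CPU speeds.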

whimsicalism

2g of vram is pretty bad...

mrinterweb

Been using this for a couple hours, and it's really nice. It's a great alternative to something like GitHub Copilot. I appreciate how simple and fast it is.


Llama.vim – Local LLM-assisted text completion - Hacker News