MemGPT – LLMs with self-editing memory for unbounded context

pacjam

Hey all, MemGPT authors here! Happy to answer any questions about the implementation.

If you want to try it out yourself, we have a Discord bot up-and-running on the MemGPT Discord server (https://discord.gg/9GEQrxmVyE) where you can see the memory editing in action - as you chat with the bot, you'll see MemGPT edit its memory to update its profile about you (and itself).

Everything's open source, so you can also try running MemGPT locally using the code here: https://github.com/cpacker/MemGPT. In the repo we also have a document-focused example where you can chat with MemGPT about the LlamaIndex API docs.

empath-nirvana

Is there any reason you're just doing everything within a single context window? I experimented with similar stuff months ago and basically parallelized everything into multiple requests to different agents in pre and post-processing steps. The main context window, for example, wasn't aware of memories being generated or retrieved. I had a post-processor just automatically generating memories and saving them, along with all the conversations being saved in a vector database, and a pre-processor that would automatically inject relevant memories and context based on the conversation, even re-writing the history so it would look to the main context window like the memory had always been there.

It saved a lot of space in the main context window for unnecessary system prompts and so on.

pacjam

These are all great points - who or what you ask to manage memory is a design decision and IMO there are two main ways to do it (in the context of chatbots):

* implicit memory management, where the "main LLM" (or for chat, the "dialogue thread") is unaware that memory is being managed in the background (by a "memory LLM", a rule-based script, a small neural network, etc.), and

* explicit memory management (MemGPT), where one LLM does everything

Prior research in multi-session / long-range chat is often implicit, with a designated memory creation process. If I had to guess, I'd say the vast majority of consumer chatbots that implement some type of memory store are also implicit. This is because getting explicit memory management to work requires a lot of complex instruction following, and in our experience this just isn't possible at the moment with most publicly available LLMs (we're actively looking into ways to fix this via eg fine-tuning open models).

The tradeoffs are as you mentioned: with implicit, you don't have to stuff all the memory management instructions into the LLM preprompt (in MemGPT, the total system message is ~1k tokens). But on the upside, explicit memory management (when the LLM works) makes the overall system a lot simpler - there's no need to manage multiple LLM models running on parallel threads, which can add a lot of overhead.

spunker540

Is it fair to call “implicit”, essentially retrieval augmented generation? While “explicit” is something different?

spaintech

This is a fascinating approach. I’m working on something similar but as part of the feedback loop, as you said, rewriting history with transactional data as part of the context window. I feel as though the LLM and the NLP could potentially be a more realizable interface to structured data, well, I should say, this is the idea we are exploring. For us, as data is created (within a certain context of the business) we extract the data, generate the embeddings and build out the vector database as to:

Pre and Post-Processing:

- Post-Processing: After the main model responds, a post-processor takes over, automatically generating memories from the conversation and saving them. This ensures that important context is stored without burdening the primary model with these tasks. We also execute any relevant business logic as part of the request, then feed that back to the systems…

- Pre-Processing: Before a new input is sent to the main model, a pre-processor checks saved memories and injects relevant context. * executes logic * It’s as if this pre-processor gives the main model a “refresher” on prior conversations, preparing it to provide more informed and consistent responses.
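
As a rough illustration of the pre/post-processing loop described in this subthread, here is a minimal sketch. Everything in it is an assumption made for illustration: `MemoryStore`, `pre_process`, `post_process`, and `main_model` are hypothetical names, keyword overlap stands in for vector similarity, and the "memory" saved is simply the user's message verbatim rather than an LLM-generated summary.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Toy memory store (hypothetical); keyword overlap stands in for vector search."""
    memories: list = field(default_factory=list)

    def save(self, text):
        self.memories.append(text)

    def relevant(self, query, k=2):
        # Score each memory by how many words it shares with the query.
        query_words = set(query.lower().split())
        scored = [(len(query_words & set(m.lower().split())), m) for m in self.memories]
        scored = [(s, m) for s, m in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [m for _, m in scored[:k]]


def pre_process(store, user_message):
    """Inject relevant memories ahead of the new user message."""
    context = [f"[memory] {m}" for m in store.relevant(user_message)]
    return context + [f"[user] {user_message}"]


def main_model(prompt_lines):
    """Stub for the main LLM, which never sees the memory machinery itself."""
    return f"(reply based on {len(prompt_lines)} prompt lines)"


def post_process(store, user_message, reply):
    """Save a memory after the reply (assumption: store the user message verbatim)."""
    store.save(user_message)


def chat_turn(store, user_message):
    prompt = pre_process(store, user_message)
    reply = main_model(prompt)
    post_process(store, user_message, reply)
    return prompt, reply
```

Calling `chat_turn` twice with related messages shows the second prompt starting with a `[memory]` line that the main model did not have to request.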

sabareesh

Multi-agent setups have a lot of potential. I'm gaining confidence in them, since there is some level of entropy in agent replies that makes the approach worthwhile.

keithnoizu

Yes, I have a similar solution.

huevosabio

Good job!

On the limitations you wrote: ``` Similarly, we also found that the most popular Llama 2 70B model variants (even those fine-tuned for function calling) would consistently generate incorrect function calls or even hallucinate functions outside the provided schema. ```

You could use grammar-based sampling [0] to ensure that the function call is at least syntactically correct.

[0] https://github.com/ggerganov/llama.cpp/tree/master/grammars

pacjam

Grammar-based sampling is a great idea and a perfect fit for something like MemGPT! In our experiments using MemGPT with non-gpt-4 models, the biggest issue impacting performance ended up being incorrect use of function parameters and function hallucination. For example, even large models finetuned on function call data (eg https://huggingface.co/jondurbin/airoboros-l2-70b-2.1#agentf...) would generally output correct parsable JSON, but the arguments or function name would be wrong. For example the LLM might output a call to `personal_diary.add` (never specified in the preprompt) instead of the correct `working_context.append` call (explicitly specified in the preprompt) when trying to write data.
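
The failure mode described here (parsable JSON, wrong function name or arguments) can at least be detected after generation. The sketch below is one hedged way to do that, not MemGPT's actual validation code: the JSON shape (`"function"` and `"arguments"` keys), the `ALLOWED_FUNCTIONS` table, and the `working_context.replace` entry are assumptions for illustration; only `working_context.append` and `personal_diary.add` come from the comment above.

```python
import json

# Hypothetical schema: function name -> set of required argument names.
ALLOWED_FUNCTIONS = {
    "working_context.append": {"content"},
    "working_context.replace": {"old_content", "new_content"},
}


def validate_call(raw_output):
    """Return (ok, reason) for a model-generated function call."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError as err:
        return False, f"not valid JSON: {err.msg}"
    if not isinstance(call, dict):
        return False, "call must be a JSON object"

    name = call.get("function")
    if not isinstance(name, str) or name not in ALLOWED_FUNCTIONS:
        # Covers hallucinated names such as personal_diary.add.
        return False, f"unknown function: {name}"

    args = call.get("arguments")
    if not isinstance(args, dict):
        return False, "arguments must be a JSON object"

    expected = ALLOWED_FUNCTIONS[name]
    missing = expected - set(args)
    unexpected = set(args) - expected
    if missing or unexpected:
        return False, f"bad arguments: missing={sorted(missing)}, unexpected={sorted(unexpected)}"
    return True, "ok"
```

A rejected call could be fed back to the model as an error message so it can retry, though whether that recovers reliably depends on the model.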

radarsat1

I don't see why you'd need to stop at "grammar". If you have something like intellisense that can tell you all legal completions (eg. list of member functions of the class of the last-referenced object) then you could use the same approach to limit sampling to function names that actually exist.
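
To make the idea concrete, here is a toy, character-level sketch of restricting decoding to names that actually exist. It is an illustration under stated assumptions, not how llama.cpp grammars or any real sampler is implemented: `score` is a hypothetical stand-in for the model's preference for the next character, and the function names in the usage note are made up.

```python
END = "<end>"  # hypothetical marker meaning "stop here"


def legal_next(prefix, legal_names):
    """Characters that keep `prefix` a prefix of some legal name, plus END if complete."""
    options = {
        name[len(prefix)]
        for name in legal_names
        if name.startswith(prefix) and len(name) > len(prefix)
    }
    if prefix in legal_names:
        options.add(END)
    return options


def constrained_decode(score, legal_names):
    """Greedy decoding where illegal continuations are never considered.

    Assumes `legal_names` is non-empty; `score(prefix, token)` returns a number.
    """
    prefix = ""
    while True:
        options = legal_next(prefix, legal_names)
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(options), key=lambda tok: score(prefix, tok))
        if best == END:
            return prefix
        prefix += best
```

Even if `score` strongly prefers spelling out a hallucinated name, the result is always a member of `legal_names` (for example `{"working_context.append", "archival.search"}`), because every step only chooses among legal continuations. Note that this guarantees a valid name, not the right one.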

KhoomeiK

The title made me think this was an approach that used memory editing techniques (e.g. ROME [1]) to allow an LLM's neural memory (not just its context) to change over the course of conversation. Pretty happy to realize that this is just a fancy RAG work—will be building my version of MemEditGPT soon.

[1] https://arxiv.org/abs/2202.05262

pacjam

Awesome, feel free to open issues or PRs to our repo if you want to contribute! It's all open source and under Apache 2.0, and we're actively looking at integrating common workflows to the CLI.

You're correct that MemGPT doesn't do editing of LLM weights like in ROME - the "memory" we're considering in MemGPT is at the text/token level, not the weight level. The core concept behind MemGPT is giving the LLM the ability to edit a working memory scratchpad (held in-context) and to read/write external context via functions. An important detail is that reads are always paginated (chunked) to deal with finite context limits, and MemGPT can do many iterative read/writes from a single user input (by chaining functions together). This allows MemGPT to search over a large database of documents for example, collecting information from various sources to return an answer (as in our LlamaIndex API docs example on the README).
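
A minimal sketch of what paginated reads and chained requests could look like, assuming a plain list of strings as the external store and substring matching as the search. The function names `search_external` and `collect_all` are hypothetical and are not taken from the MemGPT codebase.

```python
def search_external(documents, query, page, page_size=2):
    """Return one page of matches plus a flag saying whether more pages exist."""
    matches = [d for d in documents if query.lower() in d.lower()]
    start = page * page_size
    results = matches[start:start + page_size]
    has_more = start + page_size < len(matches)
    return results, has_more


def collect_all(documents, query, page_size=2):
    """Chain page requests until the results run out, as an agent might do
    across several function calls triggered by a single user input."""
    collected, page = [], 0
    while True:
        results, has_more = search_external(documents, query, page, page_size)
        collected.extend(results)
        if not has_more:
            return collected
        page += 1
```

Each call returns a bounded chunk, so a single read never exceeds a fixed share of the context window; the caller decides whether to ask for the next page.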

Difwif

I've had a suspicion for a while now that this is what ChatGPT does within a conversation (chat.openai.com, not the api). I've had very long chat histories that seem to gracefully degrade instead of just forgetting everything. Maybe there's more clues in the context than I realize though.

Either way this type of idea will probably be a fundamental feature for all chat bots in the future IMO.

pacjam

Recursive summarization is a simple and popular way to provide the illusion of infinite context (when you need to free up space, just summarize the oldest N messages into 1 summary message). It's lossy and you'll inevitably lose important information, but it should degrade relatively gracefully. In MemGPT we use (implicit) recursive summarization on top of all the explicit memory management.
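
The "summarize the oldest N messages into 1" loop can be sketched as follows. This is an illustration only: `summarize` is a stub where a real system would call an LLM, and `count_tokens` is a crude word count rather than a real tokenizer.

```python
def summarize(messages):
    """Stub summarizer (assumption): truncates each message instead of calling an LLM."""
    return "summary(" + "; ".join(m[:12] for m in messages) + ")"


def count_tokens(messages):
    """Crude token estimate: whitespace-separated words."""
    return sum(len(m.split()) for m in messages)


def compress_queue(messages, max_tokens, n_oldest=2):
    """Fold the oldest messages into one summary until the queue fits the budget.

    n_oldest must be at least 2 so that every pass shortens the queue.
    Earlier summaries get summarized again, which is where the loss compounds.
    """
    queue = list(messages)
    while count_tokens(queue) > max_tokens and len(queue) > 1:
        oldest, rest = queue[:n_oldest], queue[n_oldest:]
        queue = [summarize(oldest)] + rest
    return queue
```

The most recent messages survive verbatim while older ones are folded into an increasingly compressed summary at the front of the queue.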

ASalazarMX

Would this be the same method used to assign a title to your chat based on the first prompt? It's surprisingly effective at getting the core idea most of the time.

pacjam

Thanks for your interest! Question - does the title of the chat ever change after it's first assigned? If so, using a recursive summary to refresh the title sounds like a reasonable idea (especially if you're already computing a summary to extend context).

From what I remember the title in ChatGPT gets set once after a few messages, in which case I'd assume it's generated with a special "title generation" prompt (that gets the first few messages as input).

In either case since I don't work at OpenAI I can't tell you for sure ;)

icelancer

This is how we do things at our work with the API and chunking since we don't have the 32k API. It works fairly well in limited windows.

hansvm

There are definitely a lot more clues than you realize (plus the context window is something like 12 written pages of standard English text, without much space wasted for the system prompts). If you were doing anything interesting at all, the output is heavily biased by your prompt. You lose some bits of information in that you only have one sample (the previous output/history) rather than the soft probabilities, and you lose some bits in that multiple inputs can map to the same output (like the class of prompts "output the 2nd letter of the following phrase: ..."), but real-world prompts tend to be the easiest/shortest thing to come to mind that you think will give you the result you're looking for, so the LLM's best guess for that prompt (there are lots of ways of guessing, so suppose for the sake of argument you did something like textual inversion on the one sample) is likely to not be a half-bad interpretation of the missing context -- i.e., a lot of the seemingly missing information was retained in the LLM's output, and you don't lose too many bits at a time as the old context trails off.

Der_Einzige

ChatGPT degrades precisely because they aren't doing anything special to extend their memory beyond the context length.

There are trivial techniques to implement "lossy" memory, such as just average pooling tokens (the same approach used by sentence transformers). Not sure why it's so rare to see this used for condensing a huge amount of context into a prompt. It is effectively "medium" term memory.
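
For reference, average pooling over token vectors is only a few lines. The sketch below assumes token embeddings are already available as plain lists of floats and pools fixed-size windows; how (or whether) a given LLM can consume the pooled vectors as input is a separate question that this sketch does not address.

```python
def average_pool(token_vectors, window):
    """Average each run of `window` consecutive vectors into a single vector."""
    pooled = []
    for start in range(0, len(token_vectors), window):
        chunk = token_vectors[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(v[i] for v in chunk) / len(chunk) for i in range(dim)])
    return pooled
```

With `window=2`, a sequence of three 2-dimensional vectors shrinks to two: the first two are averaged, and the last one is kept as its own (single-element) chunk.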

lgats

https://chat.openai.com/share/e367a1de-c28b-4408-aa3d-2e4b85...

Fed chatGPT special numbers, then 3k tokens, then 2k tokens. after that, it was unable to understand any question about the special numbers provided.

sharkjacobs

On the other hand

https://chat.openai.com/share/8a0675b6-2876-4606-ac79-646391...

visarga

At the very least I would average vectors inside single words or word compounds getting a 2-3x reduction in length without much work.

shishirpatil

Yeah! While it’s not known what closed-source models do, what we think is happening, based on some prompt attacks, is that they also use recursive summarization (in addition to what others have mentioned in this thread).

JCharante

To me it just feels like they’re trimming the min amount of oldest tokens in the conversation to stay under the token limit. Conversations don’t degrade in a way that feels like it has medium term memory.
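
The plain trimming behavior described here is easy to sketch. This version is an assumption-laden simplification: it drops whole messages rather than individual tokens, and it counts whitespace-separated words instead of real tokens.

```python
def trim_oldest(messages, max_tokens):
    """Drop the oldest messages, one at a time, until the rest fit the budget."""
    kept = list(messages)
    while kept and sum(len(m.split()) for m in kept) > max_tokens:
        kept.pop(0)
    return kept
```

Unlike summarization, nothing about a dropped message survives, which would match a conversation that forgets abruptly rather than degrading gradually.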

kristopolous

I'm still very much learning this stuff, but I wonder if that's related to the vanishing gradient problem, which seems to be a fundamental aspect of these types of approaches. (Please don't assume that's correct)

https://en.wikipedia.org/wiki/Vanishing_gradient_problem

visarga

Vanishing gradient was an issue for non-residual deep networks and vanilla RNNs. The long-context memory issues, by contrast, are along the sequence dimension, not network depth.

The problem could be some kind of instability of attention as it scales above 10k tokens. A recent paper suggests the attention mechanism needs a default value (a "sink"), and its absence produces instability.

https://arxiv.org/abs/2309.17453

Another paper says the middle part is lossy while the beginning and end are better attended.

sandkoan

For anyone who's curious, the paper in question is "Lost in the Middle: How Language Models Use Long Contexts" (https://arxiv.org/abs/2307.03172).

kristopolous

That's a really recent paper. Do you actually keep up to date with everything? How do you find the time?

amelius

Regarding the vanishing gradient problem, has anyone tried to train using only a randomly chosen set of independent parameters in each iteration? (Updating only the weights in a small random independent set).

jdthedisciple

Are you referring to Regularization?

https://www.kaggle.com/code/sid321axn/regularization-techniq...

o11c

I don't remember the name, but there's already an Esolang that executes its commands unreliably. Through careful program design, you can ensure that a sequence of commands will execute with 99%, 99.9%, etc. reliability.

Aeolos

Java2000 iirc.

2 decades later, the same approach was unironically popularized for infrastructure as “chaos engineering”.

majewsky

Not sure if that was /s or not, but it is indeed an important insight to realize that no IT system can have 100% reliability once you factor in hardware failures and power outages. And that's before we talk about bugs.

ansc

Not to even mention the heat death of the universe!

codezero

Sounds a bit like Malbolge but not quite. https://en.m.wikipedia.org/wiki/Malbolge

pacjam

Update - we just released a Discord perpetual chatbot implemented on top of MemGPT, you can try it here: https://discord.gg/9GEQrxmVyE

You can also run the chatbot demo + a doc QA bot demo (where you can ask MemGPT about API docs) locally with the code on GitHub.

quickthrower2

Thanks! How do I use the bot?

pacjam

If you want to try the MemGPT Discord chatbot, join the Discord server (linked above), then check out the #memgpt channel to start messaging the bot.

If you want to run the chatbot or API docs examples locally, you can follow the instructions here: https://github.com/cpacker/MemGPT/tree/main#quick-setup.

qup

Discussed last night: https://news.ycombinator.com/item?id=37894403

(Mostly arguing about the authors' choice of title)

dang

OK, I've merged the comments from that thread which were not arguing about the title into this thread.

swyx

[flagged]

mcbuilder

"GPT" refers to decoder only transformer models sampled "generatively". It's a technical term for a class of LLMs. Not cringe at all.

cosmojg

I mean, if they're building on top of what is commonly understood as a generative pre-trained transformer model[1], then that just seems accurate? Although, I do agree that it's pretty cringe when they do it purely as a deceptive marketing tactic rather than to communicate meaningful information.

[1] https://en.m.wikipedia.org/wiki/Generative_pre-trained_trans...

minimaxir

It's a necessary marketing tactic lately to surface AI projects above the bullshit hype. The AI space hypesters who do podcasts and speak at conferences do not care about cringe.

The real deterrent to adding GPT to a project name is a cease-and-desist from OpenAI.

littlestymaar

It may be cringe, but at the same time it undermines the ability for OpenAI to enforce its trademark, which means the “GPT” word will remain in the public domain, so it's not entirely useless at least ;).

zerop

The context window is the biggest limitation of LLMs, IMO. Their great reasoning capabilities hit the context window limit in many practical use cases.

shishirpatil

Yeah absolutely! And hopefully with some of the techniques we introduce here, we can think of designing perpetual chat bots!

wokwokwok

> Recursive summarization (Wu et al., 2021b) is a simple way to address overflowing context windows, however, recursive summarization is inherently lossy and eventually leads to large holes in the memory of the system.

Yes, it does.

> In our experiments ... conversational context is read-only with a special eviction policy (if the queue reaches a certain size, a portion of the front is truncated or compressed via recursive summarization), and working context is writeable by the LLM processor via function calls.

You're doing the same thing, and you have the same problems.

You're just doing it slightly differently; in this case instead of recursively summarizing everything, you're selectively searching the history and generating it for each request. Cool idea.

...but, I'm skeptical; this fundamentally relies on the assumption that the existing context consists of low entropy summarizable context, and that any query relies only on a subset of the history.

This might be true for, eg. chat, or 'answer question about some document in this massive set of documents'.

...but, both of these assumptions are false in some contexts; for example, generating code, where the context is densely packed with information which is not discardable (eg. specific api definitions), and a wide context is required (ie. many api definitions).

It is interesting how this is structured and done, and hey, the demo is cool.

I'm annoyed to see these papers about summary things fail to acknowledge the fundamental limitations of the approach.

pacjam

Thanks for checking out the paper! Just to clarify in case there was any misunderstanding, recursive summarization is just one part of the memory management in MemGPT: as you mentioned, in MemGPT the conversation queue is managed via recursive summarization, just like in prior work (and many chatbot implementations). However there is also a (read/write) "pinned" section of "LLM memory" that's unrelated to recursive summarization, we call this "working context" in the paper. So MemGPT has access to both recursive summaries (generated automatically), as well as working context, which MemGPT actively manages to keep up-to-date.

These are both separate from MemGPT's external context, which is pulled into the conversation queue via function calls. In all our examples, reads from external context are uncompressed (no summarization) and paginated. MemGPT receives a system alert when the queue summarization is triggered, so if MemGPT needs to keep specific details from the conversation queue it can write it to working context before it's erased or summarized.

In the conversational agent examples, working context (no summarization, and separate from the conversation queue) is used to store key facts about the user and agent to maintain consistent conversation. Because the working context is always seen by the LLM, there's no need to retrieve it to see it. In doc QA, working context can be used to keep track of the current task/question and progress towards that task (for complex queries, this helps MemGPT keep track of details like the previous search, previous page request, etc.).
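
The interplay between the conversation queue, the eviction alert, and the pinned working context can be sketched roughly as below. This is a hedged toy model, not MemGPT's implementation: `AgentMemory`, the `warn_at` threshold, the alert wording, and the choice to drop (rather than summarize) evicted messages are all assumptions made for illustration.

```python
class AgentMemory:
    """Toy model of a pinned working context plus a bounded conversation queue."""

    def __init__(self, max_queue, warn_at):
        self.working_context = []  # pinned: always included in the prompt
        self.queue = []            # conversation queue: subject to eviction
        self.max_queue = max_queue
        self.warn_at = warn_at     # hypothetical warning threshold

    def add_message(self, message):
        """Append a message; return any alerts the agent should see."""
        self.queue.append(message)
        alerts = []
        if len(self.queue) >= self.warn_at:
            alerts.append("system alert: queue near limit, oldest messages will be evicted")
        while len(self.queue) > self.max_queue:
            self.queue.pop(0)  # a real system might summarize instead of dropping
        return alerts

    def working_context_append(self, fact):
        """What the agent would call to pin a fact before it is evicted."""
        self.working_context.append(fact)

    def build_prompt(self):
        pinned = ["[working context] " + fact for fact in self.working_context]
        return pinned + self.queue
```

In this sketch a fact written to working context after an alert stays at the top of every later prompt, even once the message it came from has left the queue.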

majestic5762

We took a similar approach to MemGPT (working memory: summarized conversation with eviction), but our long-term memory is a graph we can operate on (add/remove/edit nodes & edges). We bring the top_k nodes and their neighbors into the working memory.
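
A small sketch of the "top_k nodes and their neighbors" retrieval step, under loudly stated assumptions: nodes are a dict of id to text, edges are undirected pairs, and ranking uses keyword overlap as a stand-in for whatever scoring the commenter's system actually uses. `retrieve_subgraph` is a hypothetical name.

```python
def retrieve_subgraph(nodes, edges, query, top_k=1):
    """Pick the top_k best-matching nodes, then add their direct neighbors.

    nodes: dict mapping node id -> text
    edges: iterable of (a, b) undirected pairs of node ids
    """
    query_words = set(query.lower().split())

    def overlap(node_id):
        return len(query_words & set(nodes[node_id].lower().split()))

    # Highest overlap first; node id breaks ties deterministically.
    ranked = sorted(nodes, key=lambda n: (-overlap(n), n))
    seeds = set(ranked[:top_k])

    selected = set(seeds)
    for a, b in edges:
        if a in seeds:
            selected.add(b)
        if b in seeds:
            selected.add(a)
    return sorted(selected)
```

Pulling in neighbors means a fact linked to the best match can reach working memory even when it shares no words with the query.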

wokwokwok

> Just to clarify in case there was any misunderstanding

I am not confused.

It's good; it solves a specific set of problems with querying large datasets, the same as a vector search would.

...but the various memory zones you've created make absolutely no difference to the fundamental limitation of the LLM context length.

No matter how you swing it, this is just creative prompt engineering. You're packing the context with relevant information; but, if you have too much relevant information, it won't work.

Tostino

Heh, I've been working on...a good portion of the basics that this project / paper have tested out for the past few months as an idea (as I work more on other more material problems for my side project).

I have a whole document of my thoughts on this topic, and as I was reading through the paper just piece after piece of the concepts that I had documented kept coming up.

Glad I am not the only one thinking in this direction.

pacjam

Happy to chat more about other ideas in this direction! There are plenty of things we tried with varying degrees of success (especially when trying to get MemGPT to work on less powerful LLMs), and we'd be interested in hearing what you observed in your own work.

Tostino

I know we're chatting on Discord, but figured I'd leave the link to what I was working on a couple of months ago here if anyone else is interested: https://gist.github.com/Tostino/3f0b0887591ed06aa9f54ca2ddbd...

wilg

I was just suggesting something like this to a friend yesterday! (Neither of us know enough to do it or know if it's a good idea.)

However, I do think the context length is one of the top improvements that would make LLMs much more useful.

pacjam

If you have a chance try it out via the Discord bot or with the GitHub repo! Or even just check out the short demo GIFs we released (at https://github.com/cpacker/memgpt) to get an idea of the MemGPT inputs/outputs.

The high-level memory read/writes are quite intuitive and you may be surprised at how closely it matches what you were suggesting to your friend.

littlestymaar

Same here, it looks like the idea was pretty obvious. Glad to see it implemented though.

Context length being so limited is the number one thing that rules out LLMs as possessing something that resembles “intelligence”, so if we get this kind of unbounded context length we're entering a completely new universe in terms of LLM abilities.

shishirpatil

Thanks @wilg and @littlestymaar ! Yeah totally, with the benefit of hindsight, this makes total sense! Hope you find some of our codebase useful to build on top of. We are an Apache 2.0 licensed open source project and welcome contributions :)
