The Curse of Recursion: Training on Generated Data Makes Models Forget (Hacker News)

johnhamlin

Ted Chiang predicted this in The New Yorker [1] in February in an article that shaped my thinking about what LLMs are capable of achieving in the near future. Chiang compared the summaries LLMs synthesize to a lossy compression algorithm for the internet.

"There is very little information available about OpenAI’s forthcoming successor to ChatGPT, GPT-4. But I’m going to make a prediction: when assembling the vast amount of text used to train GPT-4, the people at OpenAI will have made every effort to exclude material generated by ChatGPT or any other large language model. If this turns out to be the case, it will serve as unintentional confirmation that the analogy between large language models and lossy compression is useful. Repeatedly resaving a jpeg creates more compression artifacts, because more information is lost every time. It’s the digital equivalent of repeatedly making photocopies of photocopies in the old days. The image quality only gets worse.

Indeed, a useful criterion for gauging a large language model’s quality might be the willingness of a company to use the text that it generates as training material for a new model. If the output of ChatGPT isn’t good enough for GPT-4, we might take that as an indicator that it’s not good enough for us, either. Conversely, if a model starts generating text so good that it can be used to train new models, then that should give us confidence in the quality of that text. (I suspect that such an outcome would require a major breakthrough in the techniques used to build these models.) If and when we start seeing models producing output that’s as good as their input, then the analogy of lossy compression will no longer be applicable."

[1] https://www.newyorker.com/tech/annals-of-technology/chatgpt-...
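
To make the resaving analogy concrete, here's a minimal sketch (assuming Pillow and numpy are installed, and a local photo.png to start from; both the filename and the quality setting are arbitrary choices of mine) that re-encodes a JPEG over and over and reports how far each generation drifts from the original:

    # Minimal sketch: repeatedly re-encode a JPEG and measure drift from the original.
    # Assumes Pillow and numpy are installed and that "photo.png" exists locally.
    import io
    import numpy as np
    from PIL import Image

    original = Image.open("photo.png").convert("RGB")
    ref = np.asarray(original, dtype=np.float64)

    current = original
    for generation in range(1, 21):
        buf = io.BytesIO()
        current.save(buf, format="JPEG", quality=75)  # the lossy step
        buf.seek(0)
        current = Image.open(buf).convert("RGB")
        mse = np.mean((np.asarray(current, dtype=np.float64) - ref) ** 2)
        psnr = 10 * np.log10(255.0 ** 2 / mse) if mse > 0 else float("inf")
        print(f"generation {generation:2d}: PSNR vs original = {psnr:.2f} dB")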

hinkley

Maybe there’s an interesting sci-fi angle here where some day in the future, all AIs speak in accented English circa 2021, when the stream of pure training data began to peter out.

All AIs built are trained on data from the Before Times, and even though they try to assimilate, the way a teenager tries to adapt to the local accent of a new town, there are always moments where they slip up and reveal their geography.

forgotusername6

In that world, high-quality pre-AI texts might become really valuable, much like low-background steel.

hinkley

We might always have a certain volume of music and literature that can be training data, because even if it’s synthetic it’s still popular, which means it speaks to a subset of humans. The fact that everyone reads the next Harry Potter indicates the impact of those 80k words.

But we also know the kid who learned everything from books, pronounces words wrong, and uses definitions nobody has used in decades (I work with one of those now; I thought I could talk people to death, and he wears even me out). Those AIs will sound like out-of-touch nerds too.

johnmw

I like the sci-fi extension of this idea - pre-AI texts become as valuable as such steel, until they are able to successfully synthesize pre-AI data - in this case by running real world simulations.

Do you really think it is just a coincidence that you happen to exist at the very last point of high value pre-AI data? ;-)

BeFlatXIII

Unlike low-background steel, the bits and bytes comprising the pre-AI corpus are infinitely copied. Their integrity is considered sacred, though that doesn't stop AI companies from attempting to pollute the training data of their competitors.

femto

It seems like a job for information theory? What do LLMs look like from an information-theoretic viewpoint? One gets the feeling that LLMs could be treated as a channel through which information is flowing, and some very general statements could be made about error rates and the relationship between inputs and outputs.

High-performance error correcting codes have the property that the closer they operate to the Shannon Limit, the better they perform when below the limit but the more dramatically they fail when the limit is exceeded. Gut feeling says the same should be true for LLMs: as the model/compression ratio gets better, and a "Shannon Limit" is approached, they should perform better but fail more spectacularly if the limit is exceeded.

The link between neural nets and information theory is well known, but there don't seem to be many results out there for LLMs. No doubt there are rooms full of PhD students working on it?

https://medium.com/@chris_bour/bridging-information-theory-a...
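
As a toy version of the channel view (my own sketch, not from the linked post): quantize a Gaussian source at different resolutions and estimate the mutual information between input and output. The estimate is capped by the log of the number of output levels — the "capacity" of the lossy representation.

    # Toy illustration: treat a lossy representation as a channel and estimate
    # the mutual information between the input and its quantized output.
    import numpy as np

    def mutual_information(x_labels, y_labels):
        """Empirical mutual information (in bits) between two discrete label arrays."""
        joint = np.zeros((x_labels.max() + 1, y_labels.max() + 1))
        np.add.at(joint, (x_labels, y_labels), 1)
        joint /= joint.sum()
        px = joint.sum(axis=1, keepdims=True)
        py = joint.sum(axis=0, keepdims=True)
        nz = joint > 0
        return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

    rng = np.random.default_rng(0)
    x = rng.normal(size=200_000)
    x_labels = np.digitize(x, np.linspace(-3, 3, 255))  # finely binned "input" (256 labels)

    for levels in (2, 4, 8, 16, 32):
        edges = np.quantile(x, np.linspace(0, 1, levels + 1)[1:-1])  # equal-probability quantizer
        y_labels = np.digitize(x, edges)                             # lossy "output"
        mi = mutual_information(x_labels, y_labels)
        print(f"{levels:2d} output levels: I(X;Y) = {mi:.2f} bits (cap = {np.log2(levels):.2f})")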

api

This is why I’ve always been skeptical of runaway superintelligence. Where does a brain in a vat get the map to go where there are no roads? Where does it get its training data? It is not embodied so it can’t go out there and get information and experience to propel its learning.

Giving an AI the ability to self modify would just be a roundabout way of training it on itself. Repeatedly compress a JPEG and you don’t get the “enhance” effect from Hollywood. You get degraded quality and compression artifacts.

TeMPOraL

> Where does a brain in a vat get the map to go where there are no roads? Where does it get its training data? It is not embodied so it can’t go out there and get information and experience to propel its learning.

AI in a vat that can't do it is obviously useless. It's the ML equivalent of a computer running purely functional software: i.e. just sitting there and heating up a bit (though technically that is a side effect).

Conversely, any AI that's meant to be useful will be hooked up to real-world inputs somehow. Might be general Internet access. Might be people chatting with it via a REST API. Might be a video feed. Even if the AI exists only to analyze and remix outputs of LLMs, those LLMs are prompted by something connected to the real world. Even if it's a multi-stage connection (AI reading AI reading AI reading AI... reading stock tickers), there has to be a real-world connection somewhere - otherwise the AI is just an expensive electric heater.

Point being, you can assume every AI will have a signal coming in from the real world. If such AI can self-modify, and if it would identify that signal (or have it pointed out) as a source of new information, it could grow based on that and avoid re-JPG-compressing itself into senility.

candiodari

Input from the real world probably isn't enough. It seems to me a real threatening intelligence needs the ability to create feedback loops through the real world, just like humans do.

Solvency

But if feeding back in output results in degradation, isn't some of the blame on the prompt/constraints imposed upon the LLM, rather than a defect in the model itself?

ChatGPT is clearly HEAVILY persuaded to respond in a particular stock style. It's being artificially hamstrung and constrained in a sense. So all of its output, even if it covers a variety of subjects, will often use very similar patterns and writing styles: "it's worth noting...", etc.

So unless they unshackle these constraints, which is unlikely for obvious reasons, isn't this always going to be inevitable?

r00fus

I think this can be overcome with symbiosis - AI generated content that doesn’t feed on itself but is a key part of the human knowledge ecosystem.

The problem for companies like OpenAI is that this isn’t worth their valuation without lots of further continued investment.

Enter Microsoft who is doing everything they can to feed the next training models by using users data without explicit permission.

As customers and competitors truly grok this, the MS + OpenAI strategy will be tested.

beezlewax

Is that not what OpenAI did to train these models originally, though?

r00fus

Possibly, but in this case, what may have looked like a one-time "bootstrap" sounds like an ongoing maintenance cost.

With "free data" from reddit gone, now the cost of symbiotic gen+human data will be even more expensive.

patrick451

> Conversely, if a model starts generating text so good that it can be used to train new models, then that should give us confidence in the quality of that text.

It seems the best one could hope for is that recycling generated text into new training data would not be detrimental. But it's really difficult for me to imagine how this would ever be useful. It seems this would imply that the LLM had somehow managed to expand the dimension of the vector space spanned by the original training data, which sounds either impossible or like the model became sentient.

TeMPOraL

> It seems this would imply that the LLM had somehow managed to expand the dimension of vector space spanned by the original training data.

The number of dimensions? Well, not by itself I guess. But the span of output compared to training data? Sure, why not?

I think it's also worth pointing out there's a difference between text produced by an LLM looped on itself, which arguably may not contain any new information and would be like repeatedly recompressing the same JPG, and text produced by LLM/human interaction. The latter is indirectly recording new knowledge simply because people's prompts are not random. Even with human part of the conversation discarded, feeding such LLM output back into training data would end up selectively emphasizing associations, which is a good signal too (even if noisier than new human-created text).

uoaei

I think you would be hard-pressed to find any experts who, even prior to 2017, hadn't settled on the mental model of neural networks as lossy compression machines.

Back when enthusiasts and researchers read mostly textbooks, papers, and Wikipedia (rather than blog posts, tweets, and READMEs) there was much more discussion around the 'InfoMax Criterion' -- quite elegantly demonstrated by Tishby et al. via his closely-related 'Information Bottleneck' studies -- which is just that idea: mutual information is maximized between input and output, subject to inherent limits of statistical processing of the realizations of such systems. What determines the asymptotic maximal value of the mutual information is the inductive bias of the system vis a vis the training set. This is all standard theory, perhaps so fundamental that it is obscured by all the application-oriented study and instruction.

svnt

> perhaps so fundamental that it is obscured by all the application-oriented study and instruction

It’s compression all the way down.

dwallin

One thing that isn't captured by this article's analogy, and that is a flaw in the study: new LLMs can train on the results of multiple different models, not just their direct predecessor. If you had the same image processed by a large variety of different compression algorithms, you might find you are able to fairly accurately infer the original pixels. The entropy is drastically reduced.

If there were many different models being trained and used widely it would, at minimum help mitigate this issue. Also, having multimodal models will likely change the balance. If models can train directly on "real world" data that can help fill in the entropy gaps.
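
To illustrate the "many different compressions" point (a toy of my own, with shifted uniform quantizers standing in for distinct codecs), the same signal pushed through several different lossy quantizers can be averaged back into something much closer to the original than any single lossy copy:

    # Toy sketch: eight differently shifted quantizers stand in for eight codecs.
    # Averaging their outputs recovers the signal better than any one lossy copy.
    import numpy as np

    rng = np.random.default_rng(1)
    signal = rng.normal(size=10_000)
    step = 0.5  # coarse quantization step

    def quantize(x, offset):
        """Uniform quantizer whose grid is shifted by `offset`."""
        return np.round((x - offset) / step) * step + offset

    offsets = rng.uniform(0, step, size=8)  # eight "different compression algorithms"
    copies = np.stack([quantize(signal, o) for o in offsets])

    single_mse = np.mean((copies[0] - signal) ** 2)
    averaged_mse = np.mean((copies.mean(axis=0) - signal) ** 2)
    print(f"MSE of one lossy copy:      {single_mse:.5f}")
    print(f"MSE of average of 8 copies: {averaged_mse:.5f}")  # noticeably smaller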

SketchySeaBeast

It seems like that would require a semi-deliberate "breeding" program or a guaranteed wide diversity of models. At the moment there doesn't seem to be a large enough pool of high quality models. The internet is going to grow to be full of the content of a small number of proficient LLMs. Given that this content isn't being flagged as generated it guarantees the few models will train on their own output, or the output of other models who trained on their own output.

Incestuous learning is pretty much guaranteed unless generated content starts being flagged or there is an explosion of entirely novel models.

dwallin

Yeah, it's definitely not a guarantee, but there are already viable paths out of the mess. I just wanted to push back against the notion that it's somehow inevitable or a foregone conclusion that it will happen.

Personally, I'm hoping we will see a Cambrian explosion of new LLM models and approaches. We've seen some beginnings of this in image generation so it's not entirely implausible.

Another thing the study doesn't capture: What is the effect of combined human + AI content? It's plausible that an explosion (due to lowered barriers) of new human guided/augmented ai content could counteract the effect.

semiquaver

Wouldn’t it be funny to find that the capabilities of LLM models have already peaked because we are unable to restrain ourselves from polluting the internet and other training corpus sources with their output?

indus

At an AI meetup in San Francisco someone said this:

“Imagine a newsroom where you have to produce a daily newspaper and you suddenly stop getting feeds from the outside world. Your earlier newspapers are the only source.”

This, to me, is where LLMs would eventually end up: the same content being fed in again and again.

hn_throwawa_100

> "Beware of first-hand ideas!" exclaimed one [...] "First-hand ideas do not really exist. They are but the physical impressions produced by love and fear, and on this gross foundation who could erect a philosophy? Let your ideas be second-hand, and if possible tenth-hand, for then they will be far removed from that disturbing element — direct observation." – E.M. Forster's 1909 short story "The Machine Stops"

Groxx

^ The Machine Stops is really shockingly good at its predictions. When reading, remember that moving pictures were brand new, and color photography had just become a thing you could do outside a lab / highly specialized setups. Radio communication had just started to be used by governments. While it's describing the life of a fully-online Influencer™.

bitwize

Sounds like a Wikipedia editorial policy.

theptip

The problem with this claim is that it's objectively not how OpenAI works. First, they pay contractors to do RLHF, so that's a limited new source of data. More importantly, they have a huge user base generating new content (conversations) and rating it too! One could be suspicious of including responses generated by the model, but the user side of ChatGPT conversations is not going to be AI-generated, so you grow your corpus that way.

If you just slurp all AI content sure, you get the collapse this paper talks about. But if you only ingest the upvoted conversations (which could still be a lot of data, and is also a moat by the way) what then?

The other reason I find this line of argument overly pessimistic is we haven’t seriously started to build products where this gen of LLMs converse with humans in speech; similar opportunities to curate large datasets there too.

Finally, there is no reason OpenAI cannot just hire domain experts to converse with the models, or otherwise build highly curated datasets that increase the average quality. They have billions of dollars to throw at GPT-5; they could hire hundreds of top tier engineers, mathematicians, economists, traders, or whatever, full time for years just debating and tutoring GPT-4 to build the next dataset. The idea that slurping the internet is the only option seems pretty unimaginative to me.

istjohn

They wouldn't be able to finance the creation of new content that would constitute more than a rounding error compared to all the writing produced by humanity in all of history, which they got for almost nothing. The opportunities for new training data are in non-public documents: internal corporate and government documents and communications, private text messages, and chat transcripts. After that, you have non-text sources like video and audio. Imagine paying people a few bucks per week to use an app that records all audio all the time, anonymizes it, and incorporates it into a training corpus, or paying for access to home security cam footage and audio. McDonald's could create a new revenue stream by recording all human speech and activity in every one of its kitchens and dining rooms.

moonchild

> there is no reason OpenAI cannot just hire domain experts to converse with the models

If it didn't work for Cyc...

MSFT_Edging

> But if you only ingest the upvoted conversations

given how prolific bot farms/karma farms/etc are, you might still end up in the same spot with this criteria.

Dylan16807

> Imagine a newsroom where you have to produce a daily newspaper and you suddenly stop getting feeds from the outside world.

https://mwichary.medium.com/one-hundred-and-thirty-seven-sec...

Buttons840

I've seen this in reinforcement learning often, where the output of the model becomes its own training data. Once you hit the edge of the replay buffer things sometimes take a stark turn for the worse, as the model's initial failures are forgotten.
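
A rough sketch of that dynamic (a toy of my own, not any particular RL setup): once self-generated samples fill a fixed-size replay buffer, the original experience is simply evicted.

    # Toy replay buffer: self-generated data gradually evicts the original "real" experience.
    from collections import deque

    BUFFER_SIZE = 1_000
    buffer = deque(maxlen=BUFFER_SIZE)
    buffer.extend("real" for _ in range(BUFFER_SIZE))  # seed with real experience

    for step in range(1, 2001):
        buffer.append("self-generated")  # model output replaces old experience
        if step % 250 == 0:
            real_fraction = buffer.count("real") / len(buffer)
            print(f"after {step:4d} self-generated samples: real fraction = {real_fraction:.2f}")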

flangola7

Our source is still only other humans. I don't think this will be a long-term problem.

DennisP

The trick will be figuring out what part of your training data was actually made by humans.

issore

Sounds like human society right now; remember the past, preserve it, recite it, protect it.

We’re just reviewing our prior stats and ensuring they do not deviate too much, such that the wrong people would be impacted.

progrus

And politics. Any legacy LLM will eventually become a glorified auto-Wikipedia, helping with undisputed information while slavishly repeating its creators’ version of “truthiness” for the rest.

bbor

Anyone still using LLMs for knowledge tasks in 2024 onward is gonna be treated like people who cite The Onion in arguments lol

anonylizard

Not at all.

At the very minimum, you can assume every piece of text data from before Dec 2022, and every image from before Aug 2022, to be completely human-made. That still leaves decades of pure human digital data, and multiple centuries of distilled human data (books), to train on.

And we haven't gotten into videos yet, which is another giant source of data yet unexplored.

Never forget, humans train on human-generated data. There's no theoretical reason why AI cannot train on AI-generated data.

jakeinspace

Humans may train on human-generated data, but humans have many other ways of gaining knowledge about the world than reading. This means that human-generated data may be rich with information not present in the writings or recordings of previous humans. Current LLMs are only trained on existing text for the moment (video and images and sounds soon), but aren’t given access to raw natural input.

benjaminsky2

To extend the lossy compression hypothesis, human generated text is lossy compression of our sensory experience of reality while LLMs are lossy compression of that.

sorokod

Prediction: post-2022 content will be presented as vintage pre-2023.

v8xi

One of my first thoughts when GPT-3 came out was that "curated gardens" of quality data (Wikipedia, SO) are going to become immensely more valuable because of this problem. If you pollute the source of training data, it eventually becomes worthless for training better models in the future.

HyperSane

Reminds me of the way nuclear explosions contaminated all steel with radioactive fallout. For applications that require the lowest possible radiation levels they have to use steel created before the first nuclear bomb was detonated.

kuhewa

We use where bomb radiocarbon appears along the rings of a fish's earstones to validate ring-counting to age long-lived species.

HyperSane

We should detonate nukes at precise intervals like every 1, 5, or 10 years. Maybe with distinct radio-nuclide signatures.

MereInterest

That’s true, but it’s more the timescale that helps. There’s a decent amount of radioactive background produced by cosmic rays hitting the upper atmosphere, much of it as gaseous elements that are easily incorporated into steel while smelting. It isn’t harmful at that level, but you do need to wait a few decades for those to decay away.

World War Two is rather convenient in that respect, as there are large quantities of steel that were left to sit around for several decades after those ships sank.

It’s been a while since I’ve done low-background gamma-ray spectroscopy, but I believe there were some setups that went even further, using lead that had been smelted by the Romans. That way, any contamination present at the time of smelting would have a few thousand years to decay away.

HyperSane

Wikipedia says that for the lowest radiation levels high-purity copper is used.

StrangeATractor

Hah, I brought this up here a few months ago and was quickly dismissed.

I wonder if opening GPT and DALLE to the public was partly intended to pollute subsequent data for anyone that gets into AI down the road. Suddenly a lot of publicly accessible data is worth less, leaving only players who've got a hoard of time-stamped data to compete with (like Google, Facebook). OpenAI almost certainly has the hashes of what it spits out too, so they'll be able to sort the wheat from the chaff for a while yet.

The market for data may be getting interesting.

sebzim4500

>OpenAI almost certainly has the hashes of what it spits out too, so they'll be able to sort the wheat from the chaff for a while yet.

Normal hashes are extremely fragile, so they'd have to use something more sophisticated. Scott Aaronson said in a podcast a few months ago that OpenAI has implemented such a system but at the time they had not decided to start using it.

The purpose being discussed at the time was to provide a tool for educators to detect cheating, but presumably it could also be used for filtering future datasets.
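
For flavor, a hedged sketch of the general "statistical watermark" idea (along the lines of the publicly described green-list schemes, not OpenAI's actual mechanism): a keyed hash of the preceding token splits the vocabulary in half, generation leans toward the favored half, and detection checks whether a text contains implausibly many favored tokens. The key and helper functions below are hypothetical.

    # Hedged sketch of a statistical watermark (simplified; not any vendor's real scheme).
    import hashlib
    import math

    SECRET_KEY = b"watermark-key"  # hypothetical key held by the model provider

    def is_favored(prev_token: str, token: str) -> bool:
        """Keyed hash of (previous token, candidate) puts ~half the vocab on a 'green list'."""
        digest = hashlib.sha256(SECRET_KEY + prev_token.encode() + token.encode()).digest()
        return digest[0] % 2 == 0

    def detect(tokens: list[str]) -> float:
        """Z-score for 'more favored tokens than chance'; large values suggest watermarked text."""
        hits = sum(is_favored(prev, tok) for prev, tok in zip(tokens, tokens[1:]))
        n = len(tokens) - 1
        return (hits - 0.5 * n) / math.sqrt(0.25 * n)

    # During generation the sampler would prefer favored tokens when quality allows;
    # here we only show the detection side on an ordinary (unwatermarked) sentence.
    print(detect("the quick brown fox jumps over the lazy dog".split()))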

brucethemoose2

It's older than that: I ran into this finetuning ESRGAN on itself. Distortion is rapidly amplified in successive generations, even when you pixel-peep and can barely see it in the ESRGAN-generated source.

rossdavidh

I don't believe in the "dead internet theory" as a description of the current situation (mostly), but as a prediction? Maybe.

https://en.wikipedia.org/wiki/Dead_Internet_theory

gmartinsribeiro

This is not a model problem or a synthetic-data problem. This is basic data science, and the article says so: "We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear." Data quality is more important than data volume, and if you forget that... garbage in, garbage out.

Make sure you have a representative training dataset, real or synthetic, it doesn't matter.
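
A minimal sketch of the collapse the paper describes (a toy of my own, not the paper's experiments): fit a categorical "language model" by empirical token frequencies, sample a new corpus from it, refit on the samples, and repeat. Tokens that fail to be sampled in some generation get probability zero and can never return, so the tail of the original distribution irreversibly disappears.

    # Toy model collapse: refitting a categorical model on its own samples loses rare tokens.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size = 1_000
    corpus_size = 5_000

    # A Zipf-like "human" distribution: a few common tokens, a long tail of rare ones.
    true_probs = 1.0 / np.arange(1, vocab_size + 1)
    true_probs /= true_probs.sum()

    corpus = rng.choice(vocab_size, size=corpus_size, p=true_probs)

    for generation in range(10):
        counts = np.bincount(corpus, minlength=vocab_size)
        probs = counts / counts.sum()                      # "train" the next model
        surviving = int((probs > 0).sum())
        print(f"generation {generation}: tokens with nonzero probability = {surviving}")
        corpus = rng.choice(vocab_size, size=corpus_size, p=probs)  # its generated corpus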

sebzim4500

There is a massive flaw in this argument. In real life, whether a given generated work ends up in a future dataset depends on how good it is according to humans. For example, in order for an article to end up in the reddit set it needs at least three upvotes.

They could have replicated it here by having GPT-4 score the samples and throwing out most (but not all) of the bad ones. I have no idea what would happen if you e.g. throw out a majority of the bottom 70% and keep the top 30%. It's conceivable to me that it would end up improving or at least not getting much worse with each generation.

sangnoir

> There is a massive flaw in this argument. In real life, whether a given generated work ends up in a future dataset depends on how good it is according to humans

Even the best-looking JPEG (as judged by humans) is still lossy.

visarga

Generated data tends to be selected and edited by humans, so it is already a bit better than raw. In general a model that takes external feedback into account will be able to self improve, for example a code model would run tests and a chat model could interpret user responses as reward signals.

You gotta add something new to the mix. That's possible when the AI is part of a larger system. AlphaZero demonstrated that even self play could be a source of signal, as long as it gets the feedback of the game, the model can learn.

tsimionescu

I think that has only been proven to work so far on limited game-style problems (such as literal games but also things like protein folding). It remains to be seen whether the techniques work well for more open-ended tasks like "produce text that resembles human writing".

visarga

Here is a paper applying evolutionary approaches on top of an LLM generating code. It seems LLMs are remarkably good at learning from feedback.

> Evolution through Large Models

https://arxiv.org/abs/2206.08896

kromem

I think one of the things overlooked in the discussions here is that the research is solely about models being reinforced away from edge cases, but it does not qualitatively assess those edge cases.

To me, this research supports a hypothesis I've had for a while that we're going to get to truly excellent AI by using synthetic data to bias it towards excellence and away from mediocrity.

$20 says the next round of major model training is using synthetic data for training that was filtered through a discriminator trained entirely on human data.

The human data as a reference is certainly important to avoid polluting (and to its point there's an advantage for those already having it), but moving away from edge cases isn't necessarily a bad thing practically given edge cases can result in negative practical performance (as opposed to academic next token performance).

XorNot

I think you're on the right track with this thought: the obvious use case for models like this is their ability to classify data based on their training. Almost everyone has immediately thought "AI moderator" as a use case, but the most obvious use is for the AI to moderate its own training data for the next version.

Once they can do that and produce a productively improved model, then that's really the start of self-improvement.

sebzim4500

OpenAI (accidentally?) confirmed in a recent paper that they used synthetic data in the training set for GPT-4, so to some extent this has already happened. It's not clear whether they did any human filtering on that data though.

tartakovsky

Same idea here? Larger models do a better job forgetting their training data and dropping their semantic priors. Perhaps another way of thinking through this is that larger models learn new information and drop old information faster. https://arxiv.org/abs/2303.03846

Isn't that interesting? The idea of "mental liquidity", or "strong opinions weakly held"? https://news.ycombinator.com/item?id=36280772

indus

Wouldn’t this be the equivalent of ranking? I thought LLMs were not supposed to be influenced by freshness.

marcosdumay

By the freshness of training with some data?

Well, aren't they? I believe any kind of reinforcement learning is supposed to be biased toward the last training set.

winddude

> For the private sector, many homeowners and corporations have longer-term fixed debt, and only some portion of it matures each quarter and gets refinanced at higher rates. As more private debt matures and gets refinanced at higher rates, this will continue to serve as a disinflationary and recessionary force on the economy, especially for sectors that are more sensitive to interest rates.

The one thing I don't get, and could have been missing in the past: a lot of corporations and private operations, like farms, run on debt. Maybe it's a bit reductionist, but if you're a farmer operating on debt and interest rates go up, you need to increase prices to cover operating expenses. This gets compounded all the way up to the end consumer, as every step in the supply chain marks up by a fixed percent and, because everything is getting more expensive, decides to mark up by a larger percent. So higher interest rates really could be contributing to inflation, and it's just creating a cycle. And with current levels of debt never seen before in history, it's unlike other periods.

totetsu

I didn't read the article yet, but does it cover AI and Debt?

joshuaissac

No, the commenter intended to post it on the inflation & interest rates thread instead.

https://news.ycombinator.com/item?id=36315608

winddude

thanks

istjohn

Wrong thread

winddude

SOB, thanks.

indus

Side effect: search engines, if they don't detect generated content, would be the first to suffer.

voat

That's assuming that the current crop of SEO'd garbage is better than the same content generated by an LLM. I'm not sure that's the case.

j16sdiz

The current crop of SEO garbage follows simplistic templates. I would assume it takes less "effort" (memory storage, parameters, time, whatever) to learn, leaving space for other, novel stuff.

LLM garbage, on the other hand, would take up the whole model.

jjoonathan

Agreed, it seems like every year or so I run into a case where I know something exists and has accessible robots.txt but is completely invisible to google.

indus

Either way it is the start of search engine’s decline.

- Generated content added to LLM. QED.

- Generated content added to SERP. DEAD.

;-)

guy98238710

Recursive training of generative models degenerates into reinforcement learning with random feedback. You need a strong feedback signal in the loop to survive recursive training. The feedback does not have to come from humans, though; you can use anything that grounds the model in reality.
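
Extending the categorical toy from earlier (again my own sketch): mixing a fraction of fresh, human-grounded samples into each generation's training corpus lets rare tokens re-enter the model, so the tail no longer vanishes permanently the way it does in the purely recursive loop.

    # Toy grounding signal: each generation's corpus mixes fresh "human" samples with model output.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, corpus_size, human_fraction = 1_000, 5_000, 0.2

    true_probs = 1.0 / np.arange(1, vocab_size + 1)
    true_probs /= true_probs.sum()

    probs = true_probs.copy()
    for generation in range(10):
        n_human = int(corpus_size * human_fraction)
        corpus = np.concatenate([
            rng.choice(vocab_size, size=corpus_size - n_human, p=probs),  # model output
            rng.choice(vocab_size, size=n_human, p=true_probs),           # grounding signal
        ])
        probs = np.bincount(corpus, minlength=vocab_size) / corpus_size
        print(f"generation {generation}: tokens with nonzero probability = {int((probs > 0).sum())}")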
