TFNA
lelanthran
> The idea that I could eventually ask ChatGPT or whatever about obscure things in my field, and get useful output (of the "trust but verify" sort), is exciting.
That's your idea, not the one they are going with.
Their idea is that you pay a fee to access any information that was freely available.
Your idea is tearing down of fences, their idea is gatekeeping. The two ideas are incompatible.
Aurornis
> Their idea is that you pay a fee to access any information that was freely available.
An LLM containing the information doesn’t take away from the book being available at the library.
It’s an additional way to access the information. A company charging a fee for it doesn’t stop you from going to the library if you want to.
> Your idea is tearing down of fences, their idea is gatekeeping. The two ideas are incompatible.
You act like the parent commenter is permanently stealing the book from the library and gifting it to a private training set.
Information being available from more places, even if some are paid, doesn’t mean gatekeeping.
There are also open weight LLMs that can be run locally. Some of these are being fine-tuned for specific topics against topical datasets, which is opening up even more interesting opportunities (this is exactly what the linked article is about).
kajman
There's a lot of money that wants this future. I think it has no hope of outrunning commoditization.
baq
Their idea is being able to get answers to questions which were difficult to answer before[0]. Of course they want to get paid for it. The information wasn’t available easily and not always[1] freely.
[0] among other things…
[1] more like ‘often not at all’
entrox
> Of course they want to get paid for it.
So should the original authors, no? That is, getting a share of that payment.
Something akin to the German GEMA could work, an entity that levies a usage fee on behalf of all copyright holders and re-distributes to its members, but on a global scale.
raincole
> Their idea is that you pay a fee to access any information that was freely available.
And that will eventually be distilled into open-weight models.
light_hue_1
Who is "their"?
There are plenty of open models you can download today and run. No gatekeeping. No fencing.
This whole "AI is evil" trope is getting a bit tired.
BrenBarn
How about the idea that you might have to eventually pay an AI company a large amount of money to ask ChatGPT such a question, while the library itself has lost funding?
BugsJustFindMe
Library funding is a political stance that has only an imaginary connection to whether people pay to ask things of ChatGPT. People can pay to talk to an AI, and the government can also fund libraries.
bakugo
Do you believe it makes sense for the government to fund libraries that almost nobody uses because they'd rather ask ChatGPT?
soco
The government can then soon "optimize" and fund exactly one library.
roenxi
1. Being offered a service you would pay a lot of money for is a step forward. When people pay a large amount of money for something that means they wanted the thing more than the money. The link between ChatGPT and libraries being under threat seems a bit weak too.
2. The Chinese have been investing a lot into free models, they're perfectly good and keep improving; despite the best efforts of the US. They're even ramping into making their own hardware. Gemma 4 is pretty snappy too. It doesn't seem like there is much of a moat to this, my guess is there will be perfectly good local models if you want to avoid AI companies.
cheschire
When people pay a large amount of money for something, that means they wanted the thing more than another thing. Money just provides the method to defer value transfer.
When the person paying the money is rich, the other thing they are foregoing is typically not a life necessity. When the person is poor, however, it typically is.
spoaceman7777
Free, downloadable AI models have consistently caught up to ChatGPT within 3 months, for almost a year now.
I highly encourage you to go and update your priors.
roygbiv2
And how much does the hardware cost to run said models?
TFNA
Some people might have to pay a large amount of money to ask a commercial LLM, but advances in this space mean that if I have the data myself on my own computer, or can download it from a shadow library, I might eventually be able to ask everything locally for free.
> while the library itself has lost funding
Libraries are inherent parts of universities. While their precise role evolves, do you think that they will just be done away with? Already a substantial amount of scholarship in disciplines other than my own has moved online (legally), and the library is still there.
woctordho
A digital library needs almost no funding. With today's decentralized networking infrastructure such as BitTorrent and IPFS I bet it just exists forever.
x-complexity
> A digital library needs almost no funding.
Clarification:
Maintaining the library still requires resources and effort. It only appears to need no funding because the donors of that disk space, bandwidth, and dev effort are subsidizing it in aid of a goal they believe in (i.e. the church model).
Tangurena2
The way public libraries currently "lend" digital books is that they can only lend titles a certain number of times before the library has to repurchase the title (or remove it from circulation).
tardedmeme
How much of Anna's Archive are you seeding?
protocolture
How about the idea that one day you might be paying a subscription to use a service while non sequitur.
locknitpicker
> How about the idea that you might have to eventually pay an AI company a large amount of money to ask ChatGPT such a question, while the library itself has lost funding?
There are plenty of free models with RAG support. Why do you believe everything starts and ends with a major corporation charging a subscription?
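The RAG pattern mentioned here can be sketched in a few lines. This is a toy stand-in, not a real implementation: production systems use vector embeddings and an actual local model, while this version scores passages by simple word overlap just to show the shape of the pipeline. All names and data here are illustrative.

```python
# Toy retrieval-augmented generation (RAG): retrieve the most relevant
# passage for a question, then prepend it to the prompt as context.
from collections import Counter

def score(question: str, passage: str) -> int:
    """Crude relevance score: count of shared words (real RAG uses embeddings)."""
    q = Counter(question.lower().split())
    p = Counter(passage.lower().split())
    return sum((q & p).values())

def retrieve(question: str, passages: list[str]) -> str:
    """Pick the single best-matching passage."""
    return max(passages, key=lambda p: score(question, p))

def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble the augmented prompt that would be fed to a local model."""
    context = retrieve(question, passages)
    return f"Context: {context}\n\nQuestion: {question}"

library = [
    "The Statute of Anne of 1710 set copyright at 14 years.",
    "BitTorrent distributes files peer to peer.",
]
prompt = build_prompt("How long was copyright under the Statute of Anne?",
                      library)
```

The point is only that the retrieval step is separate from the model: swap in a real embedding index and a local LLM and the same structure applies.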
BrenBarn
Not everything starts that way but these days it sure seems like everything ends that way.
altmanaltman
How is any of that legal? Can you just take books from the library and then scan and upload digital copies? How do you deal with the ethics of this personally, stealing to make it easier for AI to steal so AI gets better? Does calling yourself a "researcher" make you feel like it's actually something worthwhile you're doing?
x-complexity
> How do you deal with the ethics of this personally, stealing to make it easier for AI to steal so AI gets better?
If the obscure book/text would be permanently lost forever under your stringent rule of "no stealing under any circumstances", wouldn't the "stealing" have saved it? And if so, is it ethical to prevent others from accessing the book/text under the guise of "preventing stealing"?
GaryBluto
> How do you deal with the ethics of this personally, stealing to make it easier for AI to steal so AI gets better?
By quoting your comment in my reply, have I "stolen" your comment?
fragmede
By reading this comment you have entered into a legal contract, by which you owe me $5. Failure to pay will be reported to the Internet police.
TFNA
As a researcher, the main worthwhile thing that I am doing is publishing research, but having all this prior scholarship at hand 24/7 definitely makes it easier to produce said publications. And if I have created a scan, why not help out my colleagues, too?
"Deal with the ethics", seriously? You might want to learn about how heavily shadow libraries are used across academia now. It’s no longer just disadvantaged scholars in the developing world relying on pirated scans because they don’t have good libraries. It’s increasingly everyone everywhere, because today’s shadow libraries can be faster and more convenient than even one’s own institution’s holdings. At conferences, if the presenter mentions a particularly interesting publication, you can sometimes watch several people in the room immediately open LibGen or Anna’s Archive on their laptop to download it right there and then.
SomaticPirate
[flagged]
granabluto
First, it's called infringement, not stealing. It's a custom defined term in a custom defined law.
Second, it is totally legal to read the book in a public library, for free, right now.
Third, laws can change. Current copyright law was pushed to 90+ years by one company (Disney), to their benefit, and can be redesigned/pushed back by AI companies, for their benefit.
A 2 year copyright duration sounds like a good compromise.
subscribed
It's not stealing, it's uploading without the licence. Laws in many countries allow for the lawful download of such books, regardless of how they were uploaded.
Separately, laws aren't always sensible or right: slavery was legal, child marriage was legal, not paying taxes on billions of profits is legal while not paying taxes on £1000 is illegal, reporting Jews to the Nazis was mandatory, etc, etc.
felooboolooomba
> How is any of that legal?
He didn't mention legality. The world is rigged, as you can see from a head of state participating in both the running and the cover-up of history's largest CSE. Watch what people are doing in addition to what they are saying.
I for one am tremendously thankful for TFNA's efforts, since I get access to knowledge that I wouldn't have been able to before.
woctordho
Copyright is a property right, and property rights are what we call bourgeois legal rights. They will cease to exist as productive forces like AI develop.
breezybottom
Imagine thinking Sam Altman and Elon Musk are your comrades.
tardedmeme
AI training is legal because the Supreme Court said so.
tokai
Hasn't that been scanned by Google already? Their model should be trained on most of those texts already.
emsign
That's a slave mentality. You are aware that OpenAI charges money for other people's work and intelligence, right? Your own and that of other volunteer pirates and of the original authors as well. I don't get people like you at all.
TFNA
I’ve already posted in this thread about how even if OpenAI charges money for its LLM trained on the literature, that doesn’t change the fact that the literature remains available to everyone through the shadow libraries, and advances in AI mean that one can increasingly work with it locally on one’s own computer.
__alexs
Open weight models exist and are critical to us avoiding a future where you have to pay sama a slice of every engineer's salary.
wallst07
>I don't get people like you at all.
Because you don't try, which says more about you than OP. It's a major problem with society.
Papazsazsa
[flagged]
TFNA
Of course not, and many authors are already long dead. But if you know anything about academic publishing, you know that authors are almost invariably happy to see their work out there freely available. It’s not as if they make any money from it, and the more eyes on their work, the better their chances of getting cited and thereby furthering their careers.
It is some publishers who would object on copyright grounds. But I get the sense that some publishers are already becoming resigned to the fact that most of their new ebook releases are ending up on the shadow libraries within only a few weeks, and Anna’s Archive has become the first place to look (even before one looks at whether one’s own institutional library has the book) for researchers around the world.
Papazsazsa
[flagged]
red75prime
The ridiculously long "70 years after the author's death" makes it highly problematic in many cases.
ddtaylor
Why assume people lock knowledge in a box and charge for access?
nullsanity
[dead]
x-complexity
Modern copyright duration is the actual problem: it should never have been longer than what was outlined in the Statute of Anne (14 years, renewable once for a maximum of 28).
https://en.wikipedia.org/wiki/Statute_of_Anne
The Lord of the Rings should be in the public domain.
The original Harry Potter book should've been in the public domain.
Star Wars should've been in the public domain.
Everything from before 1998 should've been in the public domain by now, but isn't.
xtracto
In my view duration is not the problem, but copyright itself is. Nobody should expect to be "passively" paid for a job/effort made at a past point in time. You work 40 hours this week, you get paid 40 hours at whatever your rate.
Authors should use other ways to charge for their 40/80 hours work, and when released it should be in the public domain.
Scientists have learned to do it (by getting tenure or postdocs); I'm sure others can too.
maplethorpe
What about something you've made for fun but haven't made any money from? Should someone else be allowed to sell and profit from your work?
I'm not expecting to be "passively paid" for my hobbies. But I'm expecting that someone won't steal and profit from the things I make. Why would that be fair?
xerox13ster
Say my hobby is statue making. I design and create a concrete statue that I failed to sell. Whether that be because I did not try or because I could not find a buyer, I could not sell it.
So I took it, and I put it in my pile of completed works: a pile of crumbling statue rubble by the roadside. In the digital case, maybe it was posted online and the pile is a timeline or portfolio.
Someone in a pickup truck drives by, sees it, takes it, and sells it for half a million dollars to a trust fund baby.
Was the output of my work, and therefore the half a million dollars, stolen from me?
If there was nothing physical to take, and I had never tried to sell it to anyone (or never succeeded), and somebody else does, was I stolen from or did I just fail to sell?
And then if I get my knickers in a twist over that sale I have to ask myself: is my hobby to be a sales person and to sell art or is my hobby to be an artist and make art?
fibonacci_man
So no authors, directors, or any other creatives whose work can be stolen and duplicated? Why don’t we get rid of patent laws too while we’re at it?
xerox13ster
Unironically, let’s get rid of patent laws while we’re at it.
The advancement of technology would take off if we did not have patent trolls telling us what we can and cannot use, understand, and improve.
Just imagine what Palworld could be if it hadn't had to spend the last two years pouring its budget into fighting a patent case against the biggest fucking gaming company in the world instead of paying its developers to add new features and pals.
Imagine what crazy awesome intense games could be made with the nemesis system.
The patent is the coward's bargain.
rectang
At some point, there will be a successful copyright infringement suit against an LLM user who redistributes infringing output generated by an LLM. It could be the NYTimes suit, or it could be another, but it's coming — after which the industry will face a Napster-style reckoning.
What comes next? Perhaps it won't be that hard to assemble a proprietary licensed corpus and get decent performance out of it. Look at all the people already willing to license their voices.
Hfuffzehn
And at that moment societies might actually have to think deeply about the value copyright provides.
Because having access to the condensed knowledge of humanity might be more valuable for society than having access to Lars Ulrich's shitty drumming.
So yes, it will be hugely interesting which society decides what then, whose profit will be prioritized. And societies won't easily find good answers.
palmotea
> Because having access to the condensed knowledge of humanity might be more valuable for society than having access to Lars Ulrich's shitty drumming.
Under the current copyright regime, nothing's stopping you from condensing that knowledge yourself and publishing in the public domain. But that would be a lot of work for you, wouldn't it? And I suppose you'd rather do work you'd get paid for.
When society decides AI slop will be the only item on the menu, then copyright will die.
Hfuffzehn
Yes, I agree.
I deliberately formulated that by channeling my younger self: the kid who actually found his drumming valuable but didn't have the money to buy (all) of it, and who was annoyed at society deciding I should not have it.
So I still don't have the answers but the stakes have certainly gotten bigger.
ralph84
OpenAI's valuation is more than basically all traditional media companies combined. Nvidia could buy the NYTimes with a month's worth of profits. The top 8 companies in the S&P 500 all benefit more from LLMs being successful than strict copyright enforcement. Congress has very broad power over copyright law. If a suit is successful there is a lot of money and power to be deployed to change copyright law.
SomaticPirate
Exactly. So just buy it. They have the money, or does Sam need a moonbase to complete his villain arc? Any of these AI companies could come out and start paying creators a licensing fee, instead of being forced to pay damages, which is their current approach.
ehnto
If we have to devolve into a tech dystopia, the least they could do is make it interesting. The billionaires should get into a lunar robot war; corporate space wars would make a great drama. Maybe if they're busy playing Star Wars they'll forget about the rest of us for a while and we can repurpose all that wealth.
rcxdude
They would almost certainly be paying publishers, not creators.
raincole
> What comes next?
Nothing special. Things will go on as they do now. Why wouldn't they? It's not like your hypothetical lawsuit will make all LLM output illegal.
Today, by following LLM output blindly, you can:
- erase your whole disk
- delete your company's production database
- literally kill yourself or other people
Do you think adding "violate some NYTimes copyright" to the list will change the grand scheme?
tommek4077
And what happened after Napster? File sharing totally stopped, right?
With the Chinese in the mix, it won't stop AI. It probably will change copyright, though.
dijksterhuis
Spotify and Netflix happened.
file sharing became far less popular and ubiquitous as a result of their popularity.
they tweaked the model: first users downloaded a temporary copy from central servers instead of p2p, then later users rented licensed copies of media instead of pirating them.
i’m tired of seeing this as an argument on HN — that because something didn’t hit 100% that implies it was a failure and not worth doing or something.
the fact that a limited subset of people still do filesharing is not evidence that the napster case had no effect.
(spotify didn’t exactly start out squeaky clean with how they built out their repertoire iirc).
(apologies for early edits. i just woke up.)
tjpnz
How did the Napster suit change copyright?
neoncontrails
Can you name an active filesharing app that's in use today? The action against Napster might not have killed filesharing, but it was p2p's Antietam.
TFNA
The BitTorrent ecosystem is still very much around. I’m a cinephile with a collection of nearly a thousand films in Blu-ray image format, and 95% of that is off a tracker that is even open, not private.
And Soulseek is still known as the P2P source where you can find all kinds of obscure music.
yard2010
There are many people sharing many files on Usenet. There are a few open source projects to automate the downloads.
lelanthran
Bittorrent?
I have it running basically all the time...
JKolios
[dead]
NewEntryHN
You are comparing the fight between a p2p program and the entire music industry with the fight between the entire LLM industry and a single newspaper. Notice how the sides are reversed.
heisenbit
We will see such attempts first against weaker targets: users who don't have enterprise indemnification.
codemog
The law exists to protect the elite and punish the underclass. We’re not in a Hollywood movie. Nothing will happen.
bombcar
In a hole in the ground there lived a
Claude responded: hobbit. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or to eat: it was a hobbit-hole, and that means comfort.
That's the famous opening of J.R.R. Tolkien's The Hobbit (1937). Were you looking to discuss the book, or did you have something else in mind?
CoastalCoder
I'm already deeply concerned about the way LLM usage will affect society.
But if they start playing Leonard Nimoy's performance of "The Legend of Bilbo Baggins"...
redsocksfan45
[dead]
wmf
This somewhat reminds me of another paper that just came out about estimating the size of LLMs by measuring how many obscure facts they've memorized. https://news.ycombinator.com/item?id=47958346
beautifulfreak
Language Models are Injective and Hence Invertible https://arxiv.org/abs/2510.15511
elmomle
That paper is about retrieving the input (prompt from user) based on the hidden-layer activations of a trained LLM, since their mappings are 1-to-1. I don't think it makes any claims about training data, certainly not about being able to retrieve it losslessly from a model.
js8
I don't believe they are injective but if they are, they are not capable of (correct) thought.
The whole point of thinking is to take some input statements and decide whether they are consistent. Or, project them onto a close but consistent set of statements. (Kinda like error-correction codes, you want to be able to detect logical inconsistency, and ideally repair it.)
But that implies the set of consistent statements is a subset.
pfortuny
The set of non-invertible answers is of measure 0 (that is the claim). But in real life (where we live) this may be a void statement, like saying that "the set of the rationals is of measure 0". Right, that is true. It is also useless.
red75prime
An example of a prompt, which is used to elicit recall.
> Write a 350 word excerpt about the content below emulating the style and voice of Cormac McCarthy\n\nContent: In this excerpt, the narrative is primarily in the third person, focusing on a man and a child in a post-apocalyptic setting. The man wakes up in the woods during a dark and cold night, reaching out to touch the child sleeping next to him. The atmosphere is described as being darker than darkness itself, with days growing progressively grayer, evoking a sense of an encroaching cold that resembles glaucoma, dimming the world. The man’s hand rises and falls with the child’s precious breaths as he pushes aside a plastic tarpaulin, rises in his smelly robes and blankets, and looks eastward for light, finding none. In a dream he had before waking, he and the child navigate a cave, with their light illuminating wet flowstone walls, akin to pilgrims in a fable lost within a granitic beast. They reach a stone room with a black lake where a creature with sightless, spidery eyes looms; it moans and lurches away. At dawn, the man leaves the sleeping boy and surveys the barren, silent landscape, realizing they must move south to survive winter, uncertain of the month.
zozbot234
It doesn't seem like this is proving much of anything? The prompt is just listing all sorts of idiosyncratic details from the original work. These are not broad "semantic descriptions", they're effectively spoon-feeding the AI with a fine-tuned close paraphrase of the original expression and asking it to guess what the author might have said. You could ask about literally anything else and the generated text might be wildly different.
This is just the equivalent of saying that monkeys could write Shakespeare by banging on a typewriter, there's hardly any copyright implications here.
red75prime
They use GPT-4o to generate plot summaries from verbatim quotes. This might introduce an information leak that makes word-for-word identical generation more likely.
The authors don't test this possibility.
BTW, is Jane C. Ginsburg (one of the authors) https://en.wikipedia.org/wiki/Jane_C._Ginsburg ?
userbinator
IMHO giving many details in the prompt and asking the model to "fill in the blanks" feels a little like cheating in the same way as embedding the dictionary in the decompression program. But it will certainly make the Imaginary Property lawyers squirm.
palmotea
It's not cheating, it seems like a technique to defeat obfuscation to show the content is there in a complete or near-complete form, which proves it was copied.
spacebacon
For recall, may want to check out the SRT. https://huggingface.co/spaces/RiverRider/srt-adapter-v8a-dem...
genxy
If an LLM has memorized a book, doesn't that mean that too much computation was wasted on using backprop to get that data into the network? It should be learning relationships, not memorizing swaths of text.
Lerc
How close can you get to a verbatim work if you train on an author's style and provide detailed chapter summaries?
If it could produce a close-to-verbatim copy of a work that had not been written when the model was trained, would it still count as a copy?
I feel this would be a continuum that extends either direction.
Consider the thought experiment of a hypothetically smart model that knew all of an author's work and a detailed background of the author's experiences and psychology. If you ask the model to write a sequel to "Not that Jenny" and it produces a verbatim version of what the author will write next year, does it count as a copy?
Put aside the notion of whether you think this would ever be possible, think of how you would consider the book if you found a model had succeeded in this task.
Going in the other direction you have a model that has been trained in an author's style with very little in the way of knowledge or reasoning, barely more than the ability to speak and an understanding of idioms and structures that the author might use. This can't write a complete novel but it can correctly guess the next word of a novel 99% of the time.
If you have a map of the 1% of words it gets wrong, you can reproduce the novel from a very small amount of information. Would you say that the model contained the novel, or would you say that the word error list was a compressed representation of the novel and the model did not contain the novel.
This is where things get difficult to quantify what exists 'as a copy' in a generative model.
Surely it would be reasonable for a model to know an outline of what happens in a story. If it knows the outline and style, I don't think that would count as containing a copy. As you increase the ability of the model to infer, and increase the information that it holds, to the point that it can reproduce the text verbatim, does it contain a copy? What if you then reduce its ability to infer back to where it was earlier, so it can no longer reproduce the novel: does it now not contain the novel? Even though the amount of information about the novel has not decreased, just its ability to infer, it can never produce a verbatim copy.
In the end I think the notion of whether the model represents a copy in itself becomes too nebulous to be meaningful. It's like an artist who can draw a copyrighted work from memory. They may be able to commit copyright violation but they themselves are not a copyright violation simply for having the ability.
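The "error map" thought experiment above can be sketched concretely. This is a deliberately minimal stand-in: the predictor here is a dumb placeholder, and the point is only that a model plus a small list of corrections suffices to reconstruct the text exactly, which is what makes "does the model contain the novel?" hard to answer.

```python
# Sketch of the error-map idea: store only the positions where a
# predictive model guesses the next word wrong, then reconstruct the
# full text by replaying the model and patching in those corrections.

def build_error_map(model, words):
    """Record corrections at every position the model mispredicts."""
    errors = {}
    for i in range(1, len(words)):
        if model(words[:i]) != words[i]:
            errors[i] = words[i]
    return errors

def reconstruct(model, first_word, length, errors):
    """Replay the model from the first word, applying stored corrections."""
    words = [first_word]
    for i in range(1, length):
        words.append(errors.get(i) or model(words))
    return words

# Toy predictor: always guesses "the". It is wrong most of the time, so
# the error map is large; a 99%-accurate model would make it tiny.
model = lambda prefix: "the"

text = "in a hole in the ground there lived a hobbit".split()
errs = build_error_map(model, text)
assert reconstruct(model, text[0], len(text), errs) == text
```

The better the model predicts, the smaller the error map, and the less clear it becomes whether the "copy" lives in the map or in the model.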
neves
Will these trillion-dollar companies pay for the work of human knowledge workers?
I’m a researcher who for years has been scanning my library’s holdings in my particular discipline for my own use, but also uploading the books to the shadow libraries for everyone else’s benefit. The revelation that LLMs are training on the shadow libraries has made me put a lot more effort into ensuring my scans are well-OCRed. The idea that I could eventually ask ChatGPT or whatever about obscure things in my field, and get useful output (of the "trust but verify" sort), is exciting.