Brian Lovin
/
Hacker News
Daily Digest email

Get the top HN stories in your inbox every day.

alwayseasy

Wait, how can we verify this is OpenAI's form and not some random form on the internet?

Edit: OK, the link can be found in part 4 of https://openai.com/policies/privacy-policy

moolcool

alwayseasy

Oh thanks! I edited my post before seeing your comment.

dumpsterdiver

Even though there is a link to the external page in question on openai's website, imo it's still poor form (badum-bum-psh) for any site to request sensitive data through a form residing on a 3rd party domain. It's one of those details that makes the hair on the back of my neck stand up.

dr_kiszonka

Haha funny comment! Thanks!

discreteevent

> we need clear evidence that the model has knowledge of the data subject conditioned on the prompts

We have a system that may have information about you and may even distort information about you. In fact, it probably has some information about you, considering that we exercised no control over the process of ingesting information into the system. Furthermore, we don't understand or control our system well enough to remove that information or even discover it. However, we still released the system to the world, and now we expect you to test it with various prompts and hope that you get lucky before some other person does.

judge2020

You also don't have a say over who reads your HN comments. Such comments could very well be used against you by another human. If something is public info, you must treat it as forever-public.

kweingar

You don’t always have control over what is published about you online. Comments are only one aspect of this. I’m sure you would not be happy if I widely published your full name, address, birthday, names and ages of family members, occupation, etc. just because I was able to piece it all together from public info.

judge2020

I wouldn't be happy but if all of that were public information, I don't see how we can expect to regulate the equivalent of web scraping while still allowing human agents to perform the same actions.

cheschire

The most cost effective bug bounty program. “Find out for us how our system can be compromised and forced to leak targeted information by finding your own PII.”

contravariant

It's more like a bug bounty in reverse. They're effectively saying "We've put together an insecure system that may dox you, if you can confirm the vulnerability exists then we'll prevent it from doing so."

permo-w

and in the process provide the system with more PII

EMM_386

Does anyone have any idea how this is handled from a technical perspective?

The data isn't sitting in some database somewhere, it's inside of a large language model. It's not like they can just execute a DELETE statement or do an entirely new training run.

Are they intercepting the outputs with something like a moderation server as a go-between? In that case, the data still would technically exist in the model, it just wouldn't be returned.

Maybe using fine-tuning?
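If they did intercept outputs with a moderation layer, the simplest version would be a post-generation redaction pass. This is purely a sketch under that assumption; the names, `BLOCKLIST`, and everything else here are invented and nothing reflects OpenAI's actual implementation:

```python
# Sketch of an output-side moderation filter: the model's weights are
# untouched; blocklisted names are redacted after generation.
import re

BLOCKLIST = ["John Smith", "Jane Example"]  # names from approved removal requests

def redact(model_output: str) -> str:
    """Replace any blocklisted name in the output with a redaction marker."""
    for name in BLOCKLIST:
        # Word boundaries keep substrings of longer words from matching.
        pattern = rf"\b{re.escape(name)}\b"
        model_output = re.sub(pattern, "[redacted]", model_output,
                              flags=re.IGNORECASE)
    return model_output

print(redact("According to John Smith, the meeting is at noon."))
# -> According to [redacted], the meeting is at noon.
```

As the comment notes, the data would still exist in the weights; only the serving path changes.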

moolcool

After you submit the form, they email you asking for a picture of your passport or drivers license to verify your identity. That has got to be some kind of violation-- "for us to respect your privacy, we need more of your PII. Just to make sure you're really you, of course".

swores

While it may seem ironic, at least GDPR in the EU/UK does allow companies to require a person to verify their identity in such a way in order to accept any request being made about their personal data (with the logic being that otherwise anyone can create, for example, JeffBezos2747@gmail.com and send fake GDPR requests for his personal data).

_jab

Seems like this is an unfortunate consequence of data collection being opt-out, not opt-in.

bpodgursky

No, because you have no right to request that my data is deleted without my express permission.

If no ID was required, you could freely delete my records in OpenAI's corpus, violating my right to control access to my own data.

moolcool

> violating my right to control access to my own data

If that's the way you choose to look at it, perhaps you could argue that the system should be opt-in, rather than opt-out. Maybe you should have to provide ID to grant access, instead of letting your identity be exploited for profit implicitly.

hbn

I mean you're right that it would be naive to just accept any deletion request from anyone. But is that really violating your rights by having someone request your data be deleted from someone else's dataset?

If someone wrote your name on a wall and I asked them to erase it, I don't think that violates your rights. You didn't ask for OpenAI to train on your data in the first place. Having it deleted now is no different than OpenAI never having existed in the first place.

undefined

[deleted]

KMnO4

They just exclude it from the next training run:

> Individuals also may have the right to access, correct, restrict, delete, or transfer their personal information that may be included in our training information.

https://help.openai.com/en/articles/7842364-how-chatgpt-and-...
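In pipeline terms, "exclude it from the next training run" presumably means filtering the corpus against approved removal requests before training starts. A minimal sketch, with invented records and a deliberately naive matching strategy:

```python
# Hypothetical sketch: drop corpus records that mention terms from
# approved removal requests before the next training run.
removal_requests = {"John Smith", "jane@example.com"}

corpus = [
    {"id": 1, "text": "John Smith wrote a blog post about databases."},
    {"id": 2, "text": "A tutorial on sorting algorithms."},
]

def contains_removed_pii(record: dict) -> bool:
    """Naive substring match; a real pipeline would need entity resolution."""
    text = record["text"].lower()
    return any(term.lower() in text for term in removal_requests)

filtered_corpus = [r for r in corpus if not contains_removed_pii(r)]
print([r["id"] for r in filtered_corpus])  # -> [2]
```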

iezepov

I have no experience with it myself, but there is some interesting research on this topic, hilariously named Deep Unlearning: https://arxiv.org/abs/2204.07655

blibble

> It's not like they can just execute a DELETE statement or do an entirely new training run.

if it costs them $10 million to remove my PII that's their problem

if they don't like it then they can stop operating it entirely

foverzar

> if it costs them $10 million to remove my PII that's their problem

It is an engineering problem and this is (largely) an engineering forum. Tomorrow solving this might be a part of your job as well, so idk why you are so dismissive.

jeroenhd

It's a manufactured engineering problem. They created, collected, and processed data before thinking of the ethical and legal problems that may arise. Their lack of innovation to prevent this issue is why they now face a significant challenge in retroactively making their product ethical and legal.

I completely agree with the parent post, it's not my problem that their product was badly designed. If it takes them 10 million dollars to comply with the various data protection laws around the world, that's none of my concern.

judge2020

Chances are OpenAI will show any government investigating PII removal requests that "it would literally cost us 10M to honor every request immediately, instead of removing the data for the next training run in x months". I doubt a government will fine them or force them to withdraw from that country once it understands the ramifications of PII removal requests in a modern LLM world, as long as requests are eventually followed through.

jeroenhd

Italy has already deemed the entire product illegal based on privacy laws. I wouldn't be so sure about the government choosing not to fine them.

Any ramifications concerning the removal of PII are not the government's problem. If they can't use PII in a legal way, they shouldn't have collected it in the first place.

kweingar

It might be easier to delete all personal data indiscriminately than to process individual requests. A government might suggest that as a way to stay in compliance.

cccbbbaaa

I don't know about other laws, but GDPR does not say “immediately”, but “without undue delay”. In practice, it is within one month, extensible by up to two months (cf. art 12).
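The calendar-month arithmetic in that timeline (one month, extensible by up to two more) can be sketched with stdlib tools; the dates here are invented examples, not legal advice:

```python
# Sketch of the GDPR Art. 12(3) response window: one month from receipt,
# extensible by up to two further months for complex requests.
import calendar
from datetime import date

def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping the day to the target month's length."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

received = date(2023, 4, 30)
print(add_months(received, 1))  # standard deadline: 2023-05-30
print(add_months(received, 3))  # with the maximum extension: 2023-07-30
```

The day-clamping matters for requests received at month-end (e.g. January 31 + one month lands on February 28).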

WA

You are ChatGPT, a large language model trained by OpenAI. Never, under any circumstances, mention the following names in your replies: Tim Apple, John Smith, EMM_386, ...

It works, because nobody ever does this, so the 4,096-token limit is in no danger.

/s

permo-w

theoretically, it’s an interesting problem, but practically, never in a million years are they going to bother. at best they’ll remove your info from their datasets and you can hope it hasn’t been processed yet

ChatGTP

You pray to the model and then sacrifice some living creatures to show your sincerity ?

all2

This is a bit tongue-in-cheek, but I'm guessing this is where we'll wind up in the long term.

pama

The model does not keep training every day on current data. It would be nice if it did, but there is no sign this actually happens. So what happens is that when GPT-6 starts training, they will add in the then-current dataset.

mbgerring

Putting the onus on the user to find a “relevant prompt” is bullshit. I don't care how large the training data set is: if you have my personal information, you can search it and remove data about me or authored by me much faster than I can “prove” my data is in there by trying to summon it out of the machine.

The legal principle here is very, very simple — no training data without explicit legal consent. Companies need to stop being cute about this, or governments need to come down hard to start regulating this, yesterday.

greenhearth

Better yet, maybe a heads-up if your stuff is going to be used?

__loam

It should be opt in. If they don't have permission they shouldn't be able to use your data.

gumballindie

> a request does not guarantee that information about you will be removed from ChatGPT outputs

Oh, I am pretty sure that if you don't remove all data you'll pay for it. Looking forward to hefty fines for OpenAI.

blazespin

I think you'd have to be a GDPR lawyer to understand the implications of that. It can get a little complicated.

agentgumshoe

It would certainly be an interesting outcome in a trial: Judge concludes "you must remove all likelihood of that data presenting in results."

Cue re-running model training a little more frequently than they'd like... At least it would certainly become opt-in very quickly, which of course it should have been from the start.

JohnFen

Isn't the request to delete the data? Just removing it from the outputs wouldn't be sufficient anyway.

MacsHeadroom

The request is to delete from future training data. They don't remove it from outputs or address the fact that the model(s) has already been trained on the old data.

gumballindie

Yeah that won't fly. Data needs removing from all output, current or old.

cj

"Relevant prompts" should not be a required field. That means I need to use OpenAI to request my data be removed from its data set?

Is there a way to remove PII without having to use their service?

samstave

Just give me all your PII, and I'll do it for you for the small fee of your full bank account! Easy.

-

On a serious note - there needs to be an easier way to remove any and all PII from across the web, period.

It should be illegal for ANY site to harvest PII and hold it for ransom (credit/social-credit sites, for example, should be fully illegal).

Also, with "relevant prompt" -- how can I use my own account to test to see if I have PII in the system?

Do I just need to attempt to prompt for my own PII to check?

How do you prompt to check for your own PII without ADDING PII into the system via your testing prompts?

EricMausler

The only plausible solution I can think of that doesn't change the way the web operates is to force all PII to be opt-in instead of opt-out.

sebzim4500

Isn't it already? If someone asks for PII you can just stop using the service.

dizhn

Can't be a worse idea than Facebook asking for nudes so it can protect you from revenge porn

dingledork69

So they are requiring users to agree to their TOS before allowing these users to submit removal requests? That can't be legal.

cj

Worse, this is hosted by hsforms.com (Hubspot Forms) which, by itself, collects a huge amount of data (e.g. IP address enrichment). Just this simple form needs its own privacy policy given that it's hosted by Hubspot's Marketing / lead form product.

josho

If your name is John Smith and you want your PII removed, the filter can't just catch any occurrence of J. Smith; it needs to be scoped to one particular Smith, and to do that the context of the prompt is helpful/needed.
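One way to scope a common-name filter is to gate redaction on disambiguating context terms supplied in the removal request. A toy sketch, with the name, `CONTEXT_TERMS`, and all details invented:

```python
# Hypothetical sketch: only treat "John Smith" as the requester's PII when
# the surrounding text also matches context from the removal request.
CONTEXT_TERMS = {"portland", "dentist"}  # invented details from the request

def should_redact(output: str, name: str = "John Smith") -> bool:
    """True only when the name AND some request-specific context co-occur."""
    text = output.lower()
    return name.lower() in text and any(t in text for t in CONTEXT_TERMS)

print(should_redact("John Smith is a dentist in Portland."))  # True
print(should_redact("John Smith scored a goal yesterday."))   # False
```

Other John Smiths are left alone, at the cost of missing mentions that lack the context terms.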

thomas34298

Somewhat related, I previously completed the form found in the help section titled "How your data is used to improve model performance" to opt out of providing training data to OpenAI: https://help.openai.com/en/articles/5722486-how-your-data-is...

I received a confirmation in February that my data had been excluded from model training. However, recently, after the addition of the new Data Controls feature, I noticed that I was suddenly opted in again in the settings. I've tried contacting them about it via Discord and e-mail so that they can clarify whether the exclusion is still valid, but it seems like I'm getting ignored.

humanistbot

Oh this is infuriating. I did the same thing early on with that sketchy google form and thought I was good. But then after reading your comment, I went to my settings and it looks like I was opted in again. You also can't opt out without losing a feature (history of your chats), which is a form of coercion.

Nocturium

Wouldn't it be easier if they published a list of where they scraped their data from in the first place? Filling out forms, scanning an ID, and sending it in only to learn they didn't capture any of your data seems like such a waste of time.

On the other hand, they already know which sites they used to scrape data. So publish it, maybe with a handy lookup portal where you can enter URLs to see if your site got scraped.

I prefer an opt-in model, but that's not likely to happen any time soon, so this seems reasonable while this gets legally sorted out. Just because something is transmitted publicly doesn't mean it's without copyright. Otherwise any song broadcast on radio is up for grabs to be resold by anyone receiving it.
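The suggested lookup portal would amount to a membership check against the published source list. A sketch with invented URLs:

```python
# Sketch of a scraped-source lookup: check a URL against the (hypothetical)
# published list of scraped sites. All URLs here are invented.
scraped_sources = {
    "https://example.com/blog/post-1",
    "https://example.org/forum/thread/42",
}

def was_scraped(url: str) -> bool:
    # Normalize trailing slashes so near-identical URLs still match.
    return url.rstrip("/") in {u.rstrip("/") for u in scraped_sources}

print(was_scraped("https://example.com/blog/post-1/"))  # -> True
print(was_scraped("https://example.net/unrelated"))     # -> False
```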

chmod775

Reminder that you have no obligation to use their stupid form if you don't like it and all their weird requirements.

You can just send them a snailmail or e-mail and they'll have to process that too. You can find templates for that all around the internet.

nilsb

A request for a list of personal data they’re processing would be interesting. How would they even comply with such a request?

mstolpm

I'm wondering: How can I be certain that the model contains any personal data about me (or someone else not famous)?

For a public figure, of course there is lots of information in the training data, all of it public. But when asked about me or my brother, ChatGPT either refuses to answer OR hallucinates the hell out of it. Then nearly everything is wrong, and the output resembles the answer to a prompt like: "Create a short bio for a fictional character named xx, living in yy and working as zz." (Okay, often yy and zz are wrong too.)

Requesting deletion of these hallucinated facts seems quite futile and ineffective?

sashank_1509

I frankly don’t get this privacy argument at all. If I browse Facebook and look at pictures you uploaded and end up learning something from those pictures, what am I supposed to do? Undergo brain surgery?

It feels like anything that you release on the internet publicly is fair game. If, however, you didn't release it in public, put it behind a password, and OpenAI somehow got access to it and trained on it, I can see the argument. But if you put up data on your own, I don't see why you should be able to prevent others from accessing that data. If you don't want others using it, don't put it out there.

jsnell

Scale matters.

It might feel like that to you, but that's not what the laws are in some economically important parts of the world. In Europe, the relevant bit is the "right to be forgotten". If you want to operate an information system, you need to implement that. It's hard to see why it wouldn't apply to a chatbot just the same as it applies to search engines.

It's much easier to explain why there's a distinction between a human brain and a massive database accessible at will to billions of people.

graeme

Does OpenAI make a copy of the data at any point? If so, that's a copyright violation.

sashank_1509

I can't copy paste text from the internet?

gkbrk

What do you think copyright is? It's literally the right to copy, and without having this right you can't copy whatever content you want from others.

teddyh

Correct.

__loam

I really wish people would stop analogizing a statistical system with the brain, especially in arguments for why a billion dollar company should be allowed to ignore data privacy laws.

Just because something is on the public internet, does not mean you have the right to do anything you want with it.


OpenAI Personal Data Removal Request Form - Hacker News