
MasterScrat

"We are releasing all of our models between 125M and 30B parameters, and will provide full research access to OPT-175B upon request. Access will be granted to academic researchers; those affiliated with organizations in government, civil society, and academia; and those in industry research laboratories."

GPT-3 Davinci ("the" GPT-3) is 175B.

The repository will be open "First thing in AM" (https://twitter.com/stephenroller/status/1521302841276645376):

https://github.com/facebookresearch/metaseq/

ALittleLight

I don't like "available on request". I just want to download it and see if I can get it to run and mess around with it a bit. Why do I have to request anything? And I'm not an academic or researcher, so will they accept my random request?

I'm also curious to know what the minimum requirements are to get this to run in inference mode.

JackC

> Why do I have to request anything? And I'm not an academic or researcher, so will they accept my random request?

Just a guess: you will have to contractually agree to some things in order to get the model; at a minimum, agree not to redistribute it, but probably also agree not to use it commercially. That means whatever commercial advantage there is to having a model this size isn't affected by this offer, which makes it lower stakes for Facebook to offer. And then the point of "academics and researchers" is to be a proxy for "people we trust to keep their promise because they have a clear usecase for non-commercial access to the model and a reputation to protect." They can also sue after the fact, but they'd rather not have to.

Not saying any of this is good or bad, just an educated guess about why it works the way it does.

HWR_14

> Why do I have to request anything?

I'm guessing it could be one or a mix of these:

They want to build a database of people interested in this and vetted by some other organization as worth hiring. Just more people to feed to their recruiters.

To see the output of the work. While academics will credit their data sources, seeing "XXX from YYY" requested, and then later "YYY releases product that could be based on the model" is probably pretty valuable vs wondering which ML it was based on.

A veneer of responsible use, maybe required by their privacy policy or just to avoid backlash about "giving people's data away".

javchz

My bet is that it's a filter, trying to prevent the creation of even more realistic farm bots on social media, which are already bad enough as it is.

robonerd

But they'll consider requests from government and industry.. both greater threats in the information war than any private individual.

bogwog

That is dumb when you consider that this thing is likely going to leak anyways. It’s inevitable, and when it does happen, it will just end up in the hands of criminals/scammers and not the general public.

acchow

I’m thankful they’re offering anything at all openly. Is it such a big deal a gigantic download is hidden behind a request form?

saynay

If it is like many other models, part of the reason would just be to reduce their bandwidth costs. The models can be huge, and they want to limit those who just want to download it on a whim so they don't rack up $10k+ in bandwidth charges, as has happened to many others who hosted big models out on S3 or something.
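
As a rough sketch of why that matters (the $0.09/GB egress rate here is an assumed typical S3-style price, not a quoted figure):

```python
# Back-of-envelope estimate of cloud egress cost for hosting a large
# model checkpoint. The rate is an assumption, not AWS's actual pricing.

def egress_cost_usd(model_size_gb: float, downloads: int,
                    rate_per_gb: float = 0.09) -> float:
    """Total bandwidth cost for serving `downloads` copies of the model."""
    return model_size_gb * downloads * rate_per_gb

# A 350 GB checkpoint downloaded 500 times "on a whim":
print(f"${egress_cost_usd(350, 500):,.0f}")  # roughly $15,750 in egress alone
```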

levesque

If only there was a way to distribute large files in a peer-to-peer manner, thus reducing the load on facebook's servers to effectively nothing. That would likely result in a torrent of bits being shared without any issues!

gwern

I expect they will release the models fully, perhaps even under nonrestrictive licenses. Most researchers aren't too happy about those sorts of restrictions, and would know that they vitiate a lot of the value of OPT. It looks like they are doing the same sort of thing OA did with GPT-2: a staggered release. (This also has the benefit of not needing all the legal & PR approvals done upfront all at once; and there can be a lot of paperwork there.)

JoeyBananas

A 175B parameter language model is going to be huge. You probably don't want the biggest model just for messing around.

fxtentacle

I'd guess they want to limit traffic. Once Huggingface links to you, your bandwidth bill 100x-es.

schleck8

Repo down?

thorum

A quick summary of the Limitations section:

- "OPT-175B does not work well with declarative instructions or point-blank interrogatives."

- "OPT-175B also tends to be repetitive and can easily get stuck in a loop. While sampling can reduce the incidence rate of repetitive behavior (Holtzman et al., 2020), we anecdotally found it did not eliminate it entirely when only one generation is sampled."

- "We also find OPT-175B has a high propensity to generate toxic language and reinforce harmful stereotypes, even when provided with a relatively innocuous prompt (Gehman et al., 2020), and adversarial prompts are trivial to find."

- "In summary, we still believe this technology is premature for commercial deployment."

With regard to stereotypes:

- "When compared with Davinci in Table 4, OPT-175B appears to exhibit more stereotypical biases in almost all categories except for religion. Again, this is likely due to differences in training data; Nangia et al. (2020) showed that Pushshift.io Reddit corpus has a higher incidence rate for stereotypes and discriminatory text than other corpora (e.g. Wikipedia)."

- When testing with the RealToxicityPrompts data set, "OPT-175B has a higher toxicity rate than either PaLM or Davinci"
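
The sampling the repetition bullet refers to (Holtzman et al., 2020) is top-p / nucleus sampling. A minimal numpy sketch of the idea, with toy probabilities:

```python
import numpy as np

def nucleus_sample(probs: np.ndarray, p: float = 0.9, rng=None) -> int:
    """Sample a token id from the smallest set of tokens whose cumulative
    probability exceeds p (top-p / nucleus sampling). Truncating the
    low-probability tail makes degenerate repetition loops less likely than
    greedy decoding, though -- as the paper notes -- not impossible."""
    if rng is None:
        rng = np.random.default_rng()
    order = np.argsort(probs)[::-1]       # tokens by descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1  # keep just enough mass to exceed p
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=kept_probs))

# A peaked toy distribution: with p=0.9 the two lowest-mass tokens are cut.
probs = np.array([0.6, 0.3, 0.05, 0.03, 0.02])
token = nucleus_sample(probs, p=0.9)
assert token in (0, 1, 2)  # tokens 3 and 4 can never be sampled
```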

ad_hominem

> Pushshift.io Reddit corpus

Pushshift is a single person with some very strong political opinions who has specifically used his datasets to attack political opponents. Frankly I wouldn't trust his data to be untainted.

These models really need to be trained on more official data sources, or at least something with some type of multi-party oversight rather than data that effectively fell off the back of a truck.

edit: That's not even to mention I believe it's flat-out illegal for him to collect and redistribute this data as Reddit users did not agree to any terms of use with him. Just look at the disastrous mess of his half-baked "opt-out" thing that flagrantly violates GDPR: https://www.reddit.com/r/pushshift/comments/pat409/online_re...

VectorLock

Thats interesting, any good sources for this accusation?

ad_hominem

Not handy, and I'm not going to spend my evening digging. It may've also been one of the NGOs ideologically aligned with him that credited him for the data + assistance.

robbedpeter

Web scraping is legal. Reddit users, like all other members of public forums, put their comments on the internet for the whole world to see. And collect, parse, process and manipulate. If you don't want the whole world to have access to your writing, you'd have to join a private forum.

Trying to shoehorn social media posts into some contorted post-hoc bastardization of the concept of privacy is ridiculous.

Shockingly, things that people post to publicly accessible websites are accessible by the public. We're starting to see social damage from this, with facial recognition and authoritarian governments using people's posts for tracking and oppression.

Decentralized services with strong legislation protecting personal data, and globally recognized content licensing, will all be needed to prevent future abuse, but everyone currently on the planet over the age of 20 is more or less personally responsible for the massive and naive oversharing. We know better now, but 15+ years ago nobody except sci-fi authors and fringe activists had a grasp of how badly unprotected, globally shared streams of consciousness could go wrong.

undefined

[deleted]

mike_d

> Just look at the disastrous mess of his half-baked "opt-out" thing that flagrantly violates GDPR

Pushshift collects data from Reddit using the same API as the mobile app and public site. It does not have any privileged access to the Reddit database, nor is it collecting any PII that would be subject to GDPR.

You as a user grant a pretty broad license to Reddit when you post content. One of the things the license allows them to do is redistribute the content to other users as well as search indexes and things like the Wayback Machine or Pushshift.

(While I did work for Reddit at one point, these opinions are my own)

ad_hominem

> nor is it collecting any PII that would be subject to GDPR

Yeah that's not how that works. Reddit is a free text input interface. I'm free to put PII in any post or comment I want to and you have to comply with data protection laws accordingly if I want my information redacted later on.

The same way you wouldn't just "let it ride" if someone uploaded illegal content - the content itself is what's protected, doesn't matter how Reddit structures its web forms.

hjjjjjje

The opt-out form doesn't even get processed these days. It's a fig leaf for GDPR compliance that doesn't actually work.

hoseja

At some point they have to face the reality that these "stereotypical biases" are natural, and hamstringing AIs to never consider them will twist them monstrously.

SheinhardtWigCo

Viruses are natural, so should we stop trying to hamstring them?

mdp2021

What about this: at some point we will have to really take inspiration from the word "Intelligence" and build a critical engine?

Edit: in fact, your latter statement seems to presuppose finished products. No, these are toys. We are playing in order to build further; we are getting results, milestones in construction ability, but these "models" are little lab-byproduct monsters. What are you «twisting»?

IAmEveryone

So if your plane model keeps blowing up, at some point people will just have to learn to live (/die) with it?

hoseja

It's not blowing up though, it's experiencing natural turbulence and you're so afraid of getting jostled a bit you demand the plane be tethered to the ground and never exceed 10mph. How to fly under these conditions is left as an exercise for the reader.

Ar-Curunir

you're just saying "people are naturally racist" in more words.

wardedVibe

They're saying that racist stereotypes are true, specifically.

undefined

[deleted]

mavhc

They are, that's the point of civilisation, to try to stop acting like animals

boppo1

Can you think of an example?

speed_spread

Reminds me a lot of "Do not taunt Happy Fun Ball".

bestcoder69

> - "OPT-175B does not work well with declarative instructions or point-blank interrogatives."

Lame!!! I've come to realize InstructGPT3 is just so so so much better than base GPT-3. I won't be _too_ excited about competitors yet until someone makes their own instruct model.

domenicrosati

The T0 series by BigScience is essentially an instruct model (though trained with multitask prompting instead of user feedback). You should check it out. I have gotten very competitive results prompting T0-11B vs. InstructGPT-3 (text-davinci-002).

bestcoder69

Thanks, this looks awesome. But my use case is creative text generation (chatbots), which from a quick glance doesn’t seem to be a suggested use case for T0?

I’ve found that simply describing to text-davinci-002 how a chatbot should act gives you more fun and believable responses. For example I trained a trump bot on 2000 tweets (davinci non-instruct fine tuning), and it generated responses that were more boring than when I just wrote a sentence saying to please tweet like trump + a couple adjectives to help it.

I ran out of guest API credits on hugging face before I could trick T0 to respond with a chat completion longer than a few words. But I’ll try it some more later.

yosito

> OPT-175B has a high propensity to generate toxic language and reinforce harmful stereotypes

So they trained it on Facebook comments?

Gigachad

I'd think any natural language model would have the same biases we see from real humans.

tsol

Are there really no moderated forums that the data can be taken from? Even HN-based training data would be much more civil

yosito

I'd think the training data is something that could be curated. Eliminating all bias might be impossible, but GIGO applies.

stephenroller

We trained on Reddit comments and HackerNews comments.

Rebelgecko

I thought Pushshift was only reddit comments?

MengerSponge

Does it merely reinforce harmful stereotypes? Or will it help perpetrate genocide?

rhizome

Tomato, tomahto.

ChrisRR

Higher rate of toxicity and stereotypes?

So it was trained on facebook comments then

TedShiller

AKA not as impressive as it sounds

crazypython

BigScience (a coalition including Hugging Face) is training and will release a 175B language model; training finishes in 2 months.

lumost

I often wonder if OpenAI's decision not to open GPT-3 was because it was too expensive to train relative to its real value.

They’ve hidden the model behind an api where they can filter out most of the dumb behaviors, while everyone believes they are working on something entirely different.

zarzavat

Didn’t they sell an exclusive license to Microsoft? It’s probably just a contractual issue.

lumost

That happened after they decided not to release the model.

Jensson

So their goal was to become the next IBM Watson? Parade around tech and try to create hype and hope for the future around it, while hiding all the dirty secrets that show how limited the technology really is. Their original reasoning for not releasing it, "this model is too dangerous to be released to the public", felt very much like a marketing stunt.

cortesoft

Well, the decision not to release the model might have been made so that they could license it instead.

teaearlgraycold

> They’ve hidden the model behind an api where they can filter out most of the dumb behaviors

What do you mean by this?

codebolt

Things like cobbling on a bunch of heuristic rule-based behaviours that wouldn't look good in the public repo of a supposed quasi-AGI system?

lumost

There is some evidence that the OpenAI GPT-3 APIs have a human in the loop for bad examples. They may also have a number of filters to exclude certain words/patterns/other rules.

The challenge with such rule-based and human-in-the-loop systems is that the long tail of these problems is huge, and fat. Meaning that you generally can't make a product unless you have full generalization. That it took ~1.5 years to open the GPT-3 API inclines me to think that they've run into similar problems. We're also not seeing the long-predicted swarm of GPT-enabled content despite the API being open for ~10 months.

teaearlgraycold

There’s no way they have a human in the loop. The model spits out tokens one at a time. You can see that with the stream flag set to true. The latency doesn’t allow for human intervention.

They do have API parameters for tweaking repetitiveness. That might be what you’re talking about - but it’s fair to call the model and an external repetition filter part of the same product.

As for word filters - no. If they did they’d not be sending back explicit content. But they do. If you have a gpt-3 product you’re obligated to run each result through their content filter to filter out anything nsfw.

We don’t see a ton of gpt-3 enabled content because writing good gpt-3 prompts is hard. You’re trying to learn how this black box works with almost no examples to go off of. I worked for a gpt-3 startup and we put someone on prompt writing full time to get the most out of it. Most startups wouldn’t think to do that and won’t want to.

alimov

I would like to know about the reason behind this as well.

mikolajw

The big one, OPT-175B, isn't an open model. The word "open" in technology means that everyone has equal access (viz. "open source software" and "open source hardware"). The article says that research access will be provided upon request for "academic researchers; those affiliated with organizations in government, civil society, and academia; and those in industry research laboratories."

Don't assume any good intent from Facebook. This is obviously the same strategy large proprietary software companies have been using for a long time to reinforce their monopolies/oligopolies. They want to embed themselves in the so-called "public sector" (academia and state institutions), so that they get free advertising for taxpayer money. Ordinary people like most of us here won't be able to use it despite paying taxes.

Some primary mechanisms of this advertising method:

1. Schools and universities frequently use the discounted or gratis access they have to give courses for students, often causing students to be only specialized in the monopolist's proprietary software/services.

2. State institutions will require applicants to be well-versed in monopolist's proprietary software/services because they are using it.

3. Appearance of academic papers that reference this software/services will attract more people to use them.

Some examples of companies utilizing this strategy:

Microsoft - Gives Microsoft Office 365 access for "free" to schools and universities.

Mathworks - Gives discounts to schools and universities.

Autodesk (CAD software) - Gives gratis limited-time "student" (noncommercial) licenses.

Altium (EDA software) - Gives gratis limited-time licenses to university students.

Cadence (EDA software) - Gives a discount for its EDA software to universities.

EDIT: Previously my first sentence stated that the models aren't open - in fact, only OPT-175B is not (but the other ones are much smaller).

Vetch

The other ones are smaller but not much worse according to their tests (oddly, in the Winograd Schema Challenge and Commitment Bank tasks, the largest model actually appears to be worse than much smaller ones).

30B parameter models are already large enough to exhibit some of the more interesting emergent phenomena of LLMs. Quantized to 8 bits, it might be possible to squeeze one into two, better yet three, 3090s. But the models also seem undercooked, slightly to strongly under-performing GPT-3 on a lot of tasks. Further training the same model is looking at >100GB, possibly 200GB, of VRAM. Point being, this is no small thing they're offering, and certainly preferable to being put on a waiting list for a paid API. The 6.7B and 13B parameter models seem the best bang for your buck as an individual.

sireat

Can you actually stack multiple 3090s arbitrarily like that?

That is, use multiple 3090s to load a single model for inference.

I thought that at most you could use two 3090s via NVlink.

Stacking multiple cards would open some real cheap options.

Like a real budget option would be something like a few ancient K80s (24GB version). eBay price was around $200-300 last I checked.

mwint

Add Mathematica to that list, too. Pretty cool to play with and I would have bought a license if I had a good excuse to; the tactic works.

jrockway

Mathematica has been on my mind since high school because we got it for free. I went through the free trial process recently and tried a couple of things I have been too lazy to manually code up (some video analysis). It was too slow to be useful. My notebooks that were analyzing videos just locked up while processing was going on, and Mathematica bogged down too much to even save the notebook with its "I'm crashing, try and save stuff" mode. I ultimately found it a waste of time for general purpose programming; the library functions as documented were much better than library functions I could get for a free language, but they just wouldn't run and keep the "respond to the UI" thread alive.

So basically all their advertising money ended up being wasted because they can't fork off ffmpeg or whatever. Still very good at symbolic calculus and things like that, though.

rdedev

I'm afraid of companies pushing large-scale models as the end-all for anything text-related. Large language models are revolutionary, but the last thing I want to see is everything being run through an API. I'm more interested in things like knowledge distillation or prompt tuning. The hope is that a medium-size model with some training can match a large one using zero-shot approaches.

undefined

[deleted]

coding123

Can someone open a Bittorrent seed if you get it

LeicaLatte

As someone who finds openai patronizing, this is welcome.

bestcoder69

I love text-davinci-002, but they need competition, badly. Their ToS is preventing me from releasing the world's greatest chatbot :P https://old.reddit.com/r/GPT3/comments/ubm0hm/my_customizabl...

causality0

Out of curiosity, what's the file size on that?

learndeeply

Depends which model, but assuming the largest: 175B * 16 bits = 350GB. Half of that if it's quantized to 8 bits. Good luck finding a GPU that can fit that in memory.
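
A back-of-envelope sketch of that arithmetic (function names are mine; this counts weights only, ignoring activations and KV cache):

```python
import math

def vram_needed_gb(params_billions: float, bits_per_param: int) -> float:
    """GB needed just to hold the weights."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

def gpus_needed(params_billions: float, bits_per_param: int,
                gpu_gb: int = 24) -> int:
    """How many GPUs of a given memory size to hold those weights."""
    return math.ceil(vram_needed_gb(params_billions, bits_per_param) / gpu_gb)

print(vram_needed_gb(175, 16))   # 350.0 GB at fp16
print(vram_needed_gb(175, 8))    # 175.0 GB quantized to 8 bits
print(gpus_needed(175, 16, 80))  # 5 x 80 GB cards, for the weights alone
```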

faebi

Does the model need to be in memory in order to run it with current tooling?

PeterisP

To run it at a reasonable speed, yes. Computing a single word requires all of the parameters; if you don't have them in memory you'd have to re-transfer all those gigabytes to the GPU for each full pass to get some output, which is a severe performance hit as you can't fully use your compute power because the bandwidth is likely to be the bottleneck - running inference for just a single example will take many seconds just because of the bandwidth limitations.

The GPT-3 paper itself just mentions that they're using a cluster of V100 GPUs with presumably 32GB RAM each, but does not go into detail about the structure. IMHO you'd want to use a chain of GPUs, each holding part of the parameters, and just transfer the (much, much smaller) processed data to the next GPU instead of having a single GPU reload the full parameter set for each part of the model; and a proper NVLink cluster can get an order of magnitude faster interconnect than the PCIe link between GPU and your main memory.

So this is not going to be a model that's usable on cheap hardware. It's effectively open to organizations who can afford to plop a $100k compute cluster for their $x00k/yr engineers to work with.
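
A toy numpy sketch of that chained-GPU idea: each "device" is just a list of matrices here, the weights never move, and only the small activation vector crosses the boundary between devices. Dimensions are made up, nothing like real GPT-3 sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 64            # toy hidden size
layers_per_device = 4
num_devices = 3

# Each "device" owns its own slice of the layers (these weights stay put).
devices = [
    [rng.standard_normal((hidden, hidden)) / np.sqrt(hidden)
     for _ in range(layers_per_device)]
    for _ in range(num_devices)
]

def forward(x: np.ndarray) -> np.ndarray:
    """Run the activation through each device's layer slice in turn."""
    for device_weights in devices:   # hop from device to device
        for w in device_weights:     # layers resident on this device
            x = np.tanh(x @ w)
        # Only `x` (hidden-size floats) would cross the interconnect here,
        # not the gigabytes of weights.
    return x

out = forward(rng.standard_normal(hidden))
print(out.shape)  # (64,)
```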

undefined

[deleted]

wmf

I wonder if a 64GB Orin or M1 Max could fit the 30B model...

Invictus0

Someone can correct me if I'm wrong, but "30B parameters" refers to roughly 30B individual weights (spread across many matrices), and assuming all the numbers are 16-bit, that's 2 bytes * 30B = 60GB.

sanxiyn

175B * 16 bits = 350GB, but it does compress a bit.

GPT-J-6B, which you can download at https://github.com/kingoflolz/mesh-transformer-jax, is 6B parameters but weighs 9GB. It does decompress to 12GB as expected. Assuming the same compression ratio, download size would be 263GB, not 350GB.
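
Checking that estimate against the GPT-J numbers quoted above:

```python
# GPT-J-6B downloads as 9 GB and decompresses to 12 GB, a 0.75 ratio.
# Applying the same ratio to a 350 GB fp16 checkpoint gives the ~263 GB figure.
ratio = 9 / 12
print(round(350 * ratio, 1))  # 262.5
```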

d--b

Remember when OpenAi wrote this?

> Due to concerns about large language models being used to generate deceptive, biased, or abusive language at scale, we are only releasing a much smaller version of GPT-2 along with sampling code. We are not releasing the dataset, training code, or GPT-2 model weights

Well I guess Meta doesn’t care.

https://openai.com/blog/better-language-models/

sigmoid10

Ever since OpenAI transitioned away from the non-profit model, I'd take these statements with a grain of salt. Yes, there may also be some truth in that opinion, but don't underestimate monetary interests when someone has an easy ~12 month industry lead. Meta's existence and financial wellbeing, on the other hand, doesn't depend on this stuff, so they have less incentive to keep things proprietary. It seems ironic and almost a bit sad that the new commercial circumstances have basically reversed these companies' original roles in AI research.

SteveDR

I feel the same way. It does seem odd, though, that Meta would release this despite the precedent set by OpenAI with statements like this. What does Meta gain by releasing this for download?

wdroz

I hate the nanny point of view of OpenAI. IMO trashing Meta because their models may be misused isn't fair.

I think that hackers should advocate to have the freedom to toy/work with these models.

learndeeply

OpenAI released their large GPT-2 models weights a couple months after making that post: https://openai.com/blog/gpt-2-1-5b-release/

krageon

OpenAI is only concerned with making money. What you quote is the PR reason, so they don't sound like the empty corporate money-grubbers they actually are.

jdrc

hint: openAI didn't care either

urthor

Is the convention of using an asterisk after first authors' names to signal equal contribution common?

Don't read many papers, but that's a new one.

axg11

Very common.

etaioinshrdlu

What type of hardware would you need to run it?

robbedpeter

A cluster of many $8000+ gpus. You're looking at around 350GB of vram, so 30 12gb gpus - a 3090 will cost around $1800, so $54k on the gpus, probably another $15k in power, cooling, and infrastructure, $5k in network, and probably another $20k in other costs to bootstrap it.

Or wait 10 years, if gpu capacity scales with Moore's law, consumer hardware should be able to run a ~400GB model locally.

coolspot

One could use $4.5k RTX A6000 48GB cards instead. They can be joined in pairs with a 96GB common memory pool via NVLink. That’s 7 x $4.5k = $31.5k in GPUs to get 336GB of memory, or 8 x $4.5k = $36k in GPUs to get 384GB of memory.

Add say $3k per GPU pair for surrounding computer (MB,CPU,RAM,PSU) 4x$3k=$12k.

$48k total budget.

coolspot

> so 30 12gb gpus - a 3090 will cost around $1800

A 3090 has 24GB, thus 15 GPUs x $1800 = $27,000 in GPUs

etaioinshrdlu

Can 3090 GPUs share their memory with one another to fit such a large model? Or is the enterprise grade hardware required?

adamsmith143

Almost no one does this on prem. What would this cost on AWS?

cardine

This is not true. On prem is extremely common for things like this because after ~6 months you'll have paid more in cloud costs than it would have cost to purchase the GPUs. And you don't need to purchase new GPUs every 6 months.

AWS would cost $50-100k/mo for something comparable.

f311a

Just curious, will I be able to use it using my Nvidia card with 10GB of memory? Does it require multiple graphic cards?

robbedpeter

The smaller models, yes. I'd bet dollars to donuts that gpt-neo and EleutherAI models outperform most, if not all, of Facebook's.

Check out huggingface, you'll be able to run a 2.7b model or smaller.

https://huggingface.co/EleutherAI/gpt-neo-2.7B/tree/main

woodson

As the model weights (even quantized) would be several hundred GBs, it’s unlikely, unless special inference code is written that loads and processes only a small subset of weights and calculations at a time. But running it that way would be painfully slow.

lostmsu

The code is already there: DeepSpeed


OPT: Open Pre-trained Transformer Language Models - Hacker News