Hacker News
The Pile: An 800GB dataset of diverse text for language modeling (2020)