When I tell people that Large Language Models (LLMs), like the ones that power ChatGPT, are trained on a large amount of text, it’s hard to communicate quite how large a dataset we’re talking about.
I was interested in how this compares with the size of Wikipedia:
"As of 14 January 2025, there are 6,939,793 articles in the English Wikipedia containing over 4.7 billion words (giving a mean of about 690 words per article)."
https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
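As a quick sanity check on those figures, here is a back-of-the-envelope calculation (a sketch; it takes the quoted "over 4.7 billion" as a lower bound):

```python
# Sanity check on the quoted Wikipedia figures.
articles = 6_939_793          # article count from the quote
words_lower_bound = 4.7e9     # "over 4.7 billion" words, treated as a lower bound

mean_words = words_lower_bound / articles
print(f"mean words per article: {mean_words:.0f}")  # ~677
# The quoted "about 690 words per article" implies the actual total is
# closer to 4.8 billion words, which is consistent with "over 4.7 billion".
```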
No.
You perpetuate two MAJOR falsehoods that play into the hands of exploiters and weaken the case of rights holders: the "wide scrape" myth and the "even distribution" myth.
Wide Scrape myth:
There is nothing random about which sources were targeted for scraping. Target lists were compiled manually and include paywalled content, news archives, pirate troves and torrent sites. The popular myth is that it was all randomly gathered from the web at large. This has been admitted to varying degrees in interviews and court hearings, and was to be detailed by OpenAI whistleblower Suchir Balaji before he was found dead in his home.
Even Distribution myth:
Scraping is stage one. Not all property is created equal. Garbage is filtered out, and quality sources are upweighted 10x or more. So while there is some averaging going on, there is a clear value differential between quality corpora like the NYT archive and the Books3 pirated-ebook collection on the one hand, and forum threads on the other. And while there is a whole lot of blending going on, streaks of verbatim regurgitation are commonplace and by design.
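To make the upweighting point concrete: a training mixture is typically defined by per-source sampling weights, so a small curated corpus can be sampled far more often than its share of raw bytes. A minimal sketch (hypothetical source names, sizes and weights, not any lab's actual mixture):

```python
import random

# Hypothetical data mixture: (source, raw size in tokens, sampling weight).
# Weights are illustrative only; the 10x differential mirrors the claim above.
sources = [
    ("news_archive",  2e9, 10.0),   # small, high-quality, heavily upweighted
    ("books",        25e9,  5.0),
    ("forum_threads", 60e9,  1.0),  # bulk web text, baseline weight
]

# Effective sampling probability is proportional to size * weight,
# so curated corpora punch far above their raw share of the data.
totals = [size * w for _, size, w in sources]
probs = [t / sum(totals) for t in totals]

total_size = sum(size for _, size, _ in sources)
for (name, size, _), p in zip(sources, probs):
    print(f"{name:14s} raw share {size / total_size:5.1%} -> sampled {p:5.1%}")

# Drawing a training batch then just samples sources by these probabilities.
print(random.choices([name for name, _, _ in sources], weights=probs, k=8))
```

The exact weights for current models are not public, but published training-mixture descriptions (for example, the per-dataset sampling weights in the GPT-3 paper) use exactly this kind of scheme.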
And this is all before getting into the meaning myth and the fact myth: that text compression magically derives meaning from symbols, or that symbols map to any external reality…