When I tell people that Large Language Models (LLMs), like the ones that power ChatGPT, are trained on a large amount of text, it’s hard to communicate quite how large a dataset we’re talking about.
OpenAI hasn’t disclosed how many tokens were used to train its more recent LLMs but, according to Meta’s model card, its Llama 3.1 family of LLMs was pre-trained on ~15 trillion tokens of data. Tokens correspond to words or parts of words, with one token equating to around 0.75 of a word. Therefore, ~15 trillion tokens equates to roughly 11 trillion words.
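If you want to sanity-check that conversion, it’s just one multiplication (the 0.75 words-per-token figure is the rough rule of thumb above, not an exact property of any particular tokeniser):

```typescript
// Back-of-the-envelope conversion from pre-training tokens to words.
const tokens = 15e12;        // ~15 trillion tokens (Llama 3.1 model card)
const wordsPerToken = 0.75;  // rough rule of thumb, not tokeniser-specific
const words = tokens * wordsPerToken;
console.log(`${words / 1e12} trillion words`); // ≈ 11.25 trillion words
```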
Now we all know 11 trillion is a big number, but it’s hard to conceive of quite how big.
To try to illustrate the sheer magnitude of LLMs’ training data, I used the AI chatbot Claude to visualise the number of words reportedly used to train Llama 3.1 relative to an estimate of the number of words a literate person reads in their entire lifetime*.
Here’s what we produced:
Llama 3.1 may well represent a high watermark in training set size. With most of the open internet already scraped, the human-generated data cupboard is now looking pretty bare. (One of OpenAI’s original co-founders recently suggested as much: “We’ve achieved peak data and there’ll be no more. We have to deal with the data that we have. There’s only one internet.”)
Additionally, training larger and larger models no longer seems to be delivering materially better performance, and AI companies are increasingly focused on other approaches to improving their models, such as giving them more time to process each prompt. While smaller models would be good news from an environmental perspective (pre-training large models is very energy intensive), it’s worth noting that having models spend longer processing each prompt is also energy intensive in aggregate.
As interesting to me as the visualisation itself was the process of creating it: a back-and-forth in which I gave Claude instructions in natural language and it modified a React component in a workspace to the right of our chat in response.
I started with “visualise the difference between 300 million and 11 trillion” and then iterated from there with instructions like “change to circle instead of squares and keep the circles centred” and “increase the font size of the circle text a smidge more and move the red circle text down a bit”.
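For a flavour of what Claude was editing, here’s a stripped-down sketch of that sort of component. It isn’t the code Claude actually produced; it just illustrates the area-proportional circles idea using the figures from this post, with the layout and styling as guesses:

```tsx
import React from "react";

// Illustrative only: two concentric circles whose AREAS are proportional to the
// two word counts, so each radius scales with the square root of its value.
const LIFETIME_WORDS = 300e6;  // rough lifetime-reading estimate used in the post
const TRAINING_WORDS = 11e12;  // ~11 trillion words of Llama 3.1 pre-training data

export default function WordScale() {
  const bigRadius = 300; // training-data circle, in px
  const smallRadius = bigRadius * Math.sqrt(LIFETIME_WORDS / TRAINING_WORDS); // ≈ 1.6 px

  return (
    <svg width={bigRadius * 2 + 40} height={bigRadius * 2 + 40}>
      {/* ~11 trillion words of training data */}
      <circle cx={bigRadius + 20} cy={bigRadius + 20} r={bigRadius} fill="#c0392b" />
      {/* a lifetime of reading: barely visible at this scale */}
      <circle cx={bigRadius + 20} cy={bigRadius + 20} r={smallRadius} fill="#2c3e50" />
      <text x={bigRadius + 20} y={60} textAnchor="middle" fill="#fff" fontSize={18}>
        ~11 trillion words of training data
      </text>
      <text x={bigRadius + 20} y={bigRadius + 50} textAnchor="middle" fill="#fff" fontSize={14}>
        a lifetime of reading (~300 million words)
      </text>
    </svg>
  );
}
```

Scaling by the square root keeps the areas, rather than the radii, in proportion to the word counts, which is exactly why the lifetime-reading circle all but vanishes.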
If you haven’t tried coding with Claude using nothing but natural language then I’d encourage you to give it a go. Here’s a simple game I made with it soon after the visualisation capability (known as Artifacts) was released.
*Lifetime reading estimates triangulated from multiple LLMs: ChatGPT 4o (74-148m words), ChatGPT o1 (100m), Le Chat (182m), Microsoft Copilot (328.5m), Gemini (328m), Claude (600-900m).
I was interested in how this compares with the size of Wikipedia:
"As of 14 January 2025, there are 6,939,793 articles in the English Wikipedia containing over 4.7 billion words (giving a mean of about 690 words per article)."
https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
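For rough scale, using the ~11 trillion word figure from the post and the ~4.7 billion from that page (a back-of-the-envelope comparison, nothing more):

```typescript
// How many English Wikipedias' worth of words is the reported training set?
const trainingWords = 11e12;   // ~11 trillion words (from the post)
const wikipediaWords = 4.7e9;  // ~4.7 billion words (English Wikipedia, Jan 2025)
console.log(Math.round(trainingWords / wikipediaWords)); // ≈ 2340
```

So the reported training set comes out at something over two thousand English Wikipedias.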
No.
You perpetuate two MAJOR falsehoods that play into the hands of exploiters and weaken the case of rights holders: the “wide scrape” myth and the “even distribution” myth.
Wide Scrape myth:
There is nothing random about what sources were targeted for scraping. Target lists were compiled manually and include paywalled content, news archives, pirate troves and torrent sites. The popular myth is that it’s all randomly compiled from the web at large. This has been admitted to varying degrees in interviews and court hearings, and was to be revealed in detail by OpenAI whistleblower Suchir Balaji before he was murdered in his home.
Even Distribution myth:
Scraping is stage one. Not all property is created equal. Garbage is filtered out, and quality sources are upregulated 10x or more. So while there is some averaging going on, there is a clear value differential between quality corpora like the NYT archive and the Books3 pirate ebook collection on the one hand, and forum threads on the other. And while there is a whole lot of blending going on, streaks of verbatim regurgitation are commonplace and by design.
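To make “upregulated” concrete: mechanically it amounts to weighted sampling over corpora, so high-weight sources are seen far more often per token than low-weight ones. The corpora and weights below are purely illustrative, not a description of any real training mixture:

```typescript
// Hypothetical data mixture: weight controls how often each corpus is sampled.
const mixture = [
  { corpus: "news archive",  tokens: 50e9,  weight: 10 },
  { corpus: "pirated books", tokens: 80e9,  weight: 10 },
  { corpus: "forum threads", tokens: 500e9, weight: 1 },
];

const total = mixture.reduce((sum, c) => sum + c.tokens * c.weight, 0);
for (const c of mixture) {
  const share = (c.tokens * c.weight) / total;
  console.log(`${c.corpus}: ${(share * 100).toFixed(1)}% of sampled training tokens`);
}
```

With a 10x weight, 130 billion tokens of “quality” sources end up supplying over 70% of what the model actually trains on.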
And this is all before going into the meaning myth and fact myth: that text compression magically derives meaning from symbols, or that symbols map to any external reality…