Which AI tools have been trained on licensed content (and which haven’t)
Want to avoid using AI tools that have been trained on unlicensed content?
I’ve got good news and bad news.
The bad news is that the majority of generative AI tools have been trained on unlicensed content their developers scraped from the internet and/or acquired as part of large data repositories such as Common Crawl and The Pile.
The good news is that the recently enacted EU AI Act will force companies to publish a detailed summary of the content used for training general-purpose AI models by August 2025, whilst proposed US legislation (the Generative AI Copyright Disclosure Act) would, if passed, mandate that companies disclose copyrighted material in their training datasets at least 30 days before releasing a new model.
In the meantime, here’s a snapshot of where some of today’s most popular AI products stand in relation to licensing of training data.
Key takeouts:
There isn’t a mainstream AI chatbot that has been trained exclusively on licensed content. KL3M bills itself as “the first clean LLM” but it’s aimed at enterprise legal users.
OpenAI, Anthropic, Microsoft & Google have all now licensed content for model training, but that content remains a tiny fraction of the training data on which their respective chatbots rely.
Whilst Meta and Grok may have technically licensed content from users of Facebook, Instagram and X by changing their respective privacy policies, I’m leaving them in the Unlicensed row as they opted users in by default and aren’t remunerating anyone.
There’s now a decent selection of image generators which have only been trained on owned, licensed or public domain images, although they still lag behind the likes of Midjourney, Ideogram and FLUX.1 when it comes to quality of output.
There’s a growing number of AI music generators that haven’t been trained on unlicensed material, although they tend to only produce instrumental tracks, leaving the vocal stylings to the likes of Suno and Udio (both of which are currently being sued by the major record labels).
No video model has thus far claimed to be trained on licensed material.