I vividly remember the mid-noughties excitement around Web 2.0. The promise of democratised publication, coupled with online network effects, heralded a new dawn for people wanting to share their creative output with the world, unmediated by gatekeepers…
Of course, what emerged was a handful of dominant platforms promising free global distribution to anyone with an Internet connection and a keyboard/webcam, in return for access to their data.
We thought the price of our free lunch was our data being used to deliver targeted advertising (with a mandatory side order of disinformation and toxicity salad).
And it was. On Facebook, on YouTube, on Twitter, on Instagram, on TikTok.
However, another bill has now arrived for payment. The price? Our content being used to train generative AI models.
Oh, and payment has already been taken.
The scale of data needed to train generative AI models, and the intensity of the AI arms race since OpenAI fired the starting gun by moving LLMs from research project to consumer product with the release of ChatGPT, have led most developers of AI models to adopt a ‘forgiveness rather than permission’ approach to using user-generated content as training data.
And once a model is trained, it’s disconnected from its training data. The horse has bolted: requesting that your content not be used as training data can only affect future models.
The phrase that keeps popping up from companies who’ve trained generative AI models, when asked about their training data sources, is ‘publicly available’. It’s in Google’s Privacy Policy, it’s in X’s Privacy Policy, and when the Wall Street Journal asked OpenAI’s CTO Mira Murati whether YouTube videos were used to train its video generation model, Sora, she feigned ignorance and repeated the phrase “publicly available” four times.
It’s a slippery term. All ‘publicly available’ really promises is that they didn’t hack into a private server to get access to it. It does not mean permission was granted for it to be used for that purpose.
I tweeted 3,284 times between Nov 2006 and Jul 2023. My account isn’t private, so I knew anyone could read my tweets or link to them, and that Twitter would put ads around them to cover its costs / try and make a bob or two.
However, I wasn’t tweeting with the expectation that those tweets would be used to train an AI model to write more like a human.
As it happens, xAI states that its Grok-1 LLM wasn’t trained on tweets and instead draws on them as a real-time data source. However, it’s reasonable to expect that a future version of Grok will be (X updated its privacy policy last September to explicitly enable this).
Mark Zuckerberg has been open about his plans to leverage the ‘publicly available’ content on Meta’s products:
“The next key part of our playbook is learning from unique data and feedback loops in our products… On Facebook and Instagram, there are hundreds of billions of publicly shared images and tens of billions of public videos, which we estimate is greater than the Common Crawl dataset and people share large numbers of public text posts in comments across our services as well.”
Amazon reportedly has similar plans for user-generated videos on Twitch.
And whilst Google has recently stated that OpenAI using YouTube videos to train Sora would be a violation of YouTube’s terms of use, I’ve got a hunch Google’s lawyers wouldn’t take the same view of Google using YouTube videos to train its own AI models, despite a lack of consent from the users who’ve uploaded over a billion videos to the platform.
As pressure mounts on AI companies to formally licence content, the money still isn’t flowing to individual content creators. Reddit reportedly struck a $60m p.a. content licensing deal with Google ahead of its recent $6.4bn IPO. Photobucket, meanwhile, is reportedly in talks with multiple tech companies to licence the 13 billion photos and videos uploaded to its servers since 2003, despite a privacy policy which declares, in BLOCK CAPS, that it “will never use your images…for any purpose other than to provide our services to you unless we have your express permission or as set forth herein”.
The payment that is flowing to individual creators is a little more, er, modest. Adobe recently offered photographers the princely sum of $60 for 500-1000 photos of “bananas in real life situations” (I kid you not) to help train its AI models.
The demand for more high-quality training data to differentiate competing AI models is also likely to put pressure on companies like Google to change their stance on not using private data (e.g. Gmail, Google Docs) for training.
So, what’s a self-respecting content creator to do if they don’t want their output used to train generative AI models without consent or recompense? Here are a few practical suggestions:
1.) Publish your work to a website you control rather than a social media platform / online forum which may very well licence your content to AI companies (if it hasn’t already).
2.) Stop OpenAI, Microsoft and Google using content published on your website as future training data by blocking GPTBot and opting out via Google-Extended in your site’s robots.txt. It’s also worth blocking Common Crawl’s CCBot (most LLMs were trained on some variant of Common Crawl’s gargantuan dataset). There’s a sample robots.txt after this list.
3.) Consider using Glaze to reduce the likelihood of AI models mimicking the style of your artwork.
4.) Submit a form to OpenAI to request images you’ve created be excluded from future training datasets.
5.) Stop your interactions with ChatGPT being used as training data (go to Settings > Data controls and toggle off Chat history & training).
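For suggestion 2, the blocking boils down to a few lines in the robots.txt file at the root of your site. A minimal sketch (bearing in mind that compliance is voluntary: these directives only deter crawlers that choose to honour them):

    # OpenAI's web crawler, used to gather training data for its GPT models
    User-agent: GPTBot
    Disallow: /

    # Google's opt-out token for AI training (Gemini etc.); doesn't affect Search indexing
    User-agent: Google-Extended
    Disallow: /

    # Common Crawl's crawler, whose dataset underpins many LLM training corpora
    User-agent: CCBot
    Disallow: /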
Alternatively, unplug your computer and dust down your quills and freshly torn vellum…
This post was written purely for your reading pleasure (not for the bots) so if you enjoyed it, please like it and/or share it. Thanks :)