Checking in on my AI predictions (and making a few more)
In February, I posted 11 AI predictions for 2024.
3 months on, I’m taking a look at progress against those predictions, revisiting my 2023 predictions and making a few new ones.
First up, my Feb 2024 predictions…
1.) We’ll see a major step change in the quality of AI-generated video
Two weeks after I posted this prediction, OpenAI unveiled Sora, causing many jaws (mine included) to drop at the quality of its output relative to currently available AI video generators and reportedly prompting media mogul Tyler Perry to put the planned expansion of his US studio space on ice.
The leaps forward included far greater stability between frames, closer (though by no means complete) adherence to real-world physics and much longer outputs (60 seconds rather than the 3 or 4 seconds that’s standard with current models).
OpenAI has very sensibly resisted the temptation to rush Sora to a general release, instead working with hand-picked creatives to put the model through its paces and provide reassuring quotes about it enabling, rather than replacing, flesh and blood creatives.
Elsewhere, Adobe re-announced, but still hasn’t released, Firefly Video, Microsoft showed off photo-to-talking-head model VASA-1 and last week Google demoed its latest video generation model, Veo, which looks good, but not Sora-good (the demo videos are suspiciously light on people).
It feels like we’re overdue a video model update from Meta, which teased its Make-A-Video model way back in September 2022. Plucky upstart Higgsfield also looks like one to watch.
Here’s my recent round-up of currently available AI video generators:
2.) Apple will seriously up its generative AI game, reimagining Siri and upgrading Apple Watch and AirPods to become the preeminent AI wearables
Whilst we’ll have to wait another few weeks to discover the detail (Apple’s Worldwide Developers Conference starts on the 10th June), the game-upping half of this prediction is now looking highly likely. Tim Cook, not known for hyping - or even talking about - product updates in advance, told shareholders in late Feb that he’s looking forward to “sharing the ways we will break new ground in generative AI” and Apple’s SVP of Marketing tweeted - sorry, posted on X - “Mark your calendars for #WWDC24, June 10-14. It’s going to be Absolutely Incredible!”. And it’s not just talk. In early February Apple released open-source natural language image editor MGIE and last month, a suite of open-source language models under the banner OpenELM.
I’d be astonished if a reimagined Siri isn’t part of the WWDC announcements, although it’s likely we’ll have to wait till Apple’s autumn hardware announcements to find out whether I’m right about the Apple Watch and AirPods.
3.) The New York Times and OpenAI / Microsoft will agree a data licensing deal and avoid going to trial
We’re still in the early stages of the legal dance. In late February, OpenAI filed a motion to dismiss parts of the lawsuit. It’s likely to be a while before this one gets resolved. In the meantime…
4.) There will be lots more training data licensing deals
And lo, there were. See Reddit (x2), StackOverflow, Le Monde & Prisa Media, Dotdash Meredith and The Financial Times.
An interesting open question is how many partners AI companies need to have licensing arrangements with in order to provide good enough coverage in a given domain/territory. As more licensing deals are struck with news providers, the New York Times’ bargaining power gets weaker, potentially to the point that OpenAI can afford to walk away and exclude NYT content from future training sets (see Meta and news publishers in Australia & Canada).
5.) LLMs will increasingly be used in combination with other AI tools
Most of OpenAI’s Spring Update this week was not about improvements to ChatGPT’s language capabilities (beyond dramatic improvements to its non-English language support), but about the power of pairing an LLM’s written language capabilities with capabilities in other modalities.
LLM + computer vision is looking like a pretty powerful combo (see 2024 prediction 6), as is LLM + speech recognition/generation (see 2023 prediction 7). Ditto LLMs + robotics (see 2024 prediction 11).
Meanwhile, Perplexity is doing a good job of showing the power of the LLM + search engine combination, which Google is trying to emulate at scale with its ‘AI Overviews’ (née Search Generative Experience) and which is reportedly also in OpenAI’s crosshairs.
6.) Advances in computer vision will unlock new consumer use cases
There have been lots of announcements about industry applications of computer vision in the last few months, including Apple’s acquisitions of DarwinAI (which uses computer vision to visually inspect components during the manufacturing process) and Datakalab (which worked with the French government during the pandemic to visually check whether people were wearing face masks on Paris’s transportation systems). Less so, consumer applications.
Until last week that is, when OpenAI revealed the computer vision capabilities of its new natively multimodal model, GPT-4o, and Google demoed its Project Astra prototypes.
OpenAI’s demos included GPT-4o acting as a second pair of eyes for the blind or partially-sighted and camera-based language learning.
Whilst Google was showing prototypes rather than a production product, some of the use cases were fairly compelling (AI being able to tell you where you left your glasses would be a huge boon in my household…)
7.) Increasingly egregious viral deepfakes will prompt federal legislation
Alas, the deepfakes keep a comin’ and it’s not just politicians and celebrities being targeted. A US high school teacher was arrested last month for allegedly creating an audio deepfake of his principal appearing to make racist comments and a couple of weeks ago a Labour campaigner in the West Midlands was similarly framed (frAImed?). Last week, a deepfake of WPP’s CEO, Mark Read, was used on a Teams call as part of a phishing scam.
On the legislation front, the Protecting Americans from Deceptive AI Act was introduced as a bipartisan bill in March which, if passed, would mandate the labelling of AI-generated content.
8.) Global elections will shine a brighter spotlight on some of the ways in which AI can be used for ill
‘Fraid so. See Indonesia and India.
Buckle up for the US Election, which Microsoft’s Threat Analysis Center reports Russia is already limbering up for.
9.) Use of generative AI in professional creative work will become more commonplace but stigma will remain
Last April, Drake’s record label was issuing takedown notices for a song which cloned his voice. This April, Drake was the one publishing a track with a cloned voice and the one receiving the takedown notice (from the estate of Tupac Shakur).
However, few artists are openly using AI in their work and even fewer have followed Grimes’ lead in making an AI clone of their voice available for users to create new songs.
A few weeks ago, FKA twigs revealed she had developed a deepfake of herself and was planning to delegate her fan and press engagement to it so she could focus on her music.
It may be that the disputes over creative works being used to train AI models need some resolution before professional artists adopt the technology more widely.
The spectre of generative AI also helps explain the strength of reaction to Apple’s tone-deaf ‘Crush!’ advert for its new iPads.
10.) More of us will be using AI as an assistant, delegate and/or companion
Much of what OpenAI and Google demoed last week focussed on assistant-like behaviour (help me with my math homework, do you remember where you saw my glasses?) and OpenAI’s release of a low-latency, natively multimodal (text, voice and vision) model in the form of GPT-4o feels certain to encourage more assistant-like interactions (with or without Scarlett Johansson’s sultry tones).
OpenAI opening up custom GPT creation to users of its free tier (and Google copying the capability with Gems) also feels significant, although they still have work to do on discovery/on-boarding (most people I speak to are unaware of this powerful capability).
On the delegation front, Cognition’s unveiling of Devin - billed as ‘the world’s first fully autonomous AI software engineer’ - feels like a watershed moment for AI agents.
Some of the OpenAI demos also showcased delegation scenarios, such as asking ChatGPT to arrange the return of a defective iPhone (“I want you to get them to send me a replacement device. Can you take care of this for me?”).
Meanwhile, Bumble founder Whitney Wolfe Herd outlined her vision for “AI dating concierges”.
On the companion front, Character.AI still appears to be going strong (625m visits in the last quarter, according to SimilarWeb). Whilst Replika continues to deal with the fallout of stopping its chatbots engaging in ‘intimate’ conversations, other apps - such as Chai AI - are filling the void.
11.) We will start to see more embodied AI
Humanoid robots have had a busy quarter.
I mentioned the Figure 01 robot in my February post. In March, Figure posted a demo of that same robot with OpenAI models plumbed in. I wrote about what makes the demo so impressive here.
Also in March, Chinese firm Unitree showed its “universal humanoid robot”, the H1, breaking the “full-size humanoid speed world record”, whilst moving a lot like my 2-year-old son when he’s got a full nappy (I’m not sure if that makes it more or less impressive).
In early April, part-time CEO, full-time shitposter, Elon Musk reframed Tesla as “an AI/robotics and sustainable energy company” and put humanoid robots at the top of its ecosystem diagram.
Later in April, Boston Dynamics played a PR blinder by announcing it was putting its 11-year-old HD Atlas hydraulic robot out to pasture one day and unveiling its supple electric successor the next.
So, that’s progress on my Feb 2024 predictions. How about my single word (“AI will become more X”) predictions from last June and July?
1.) Ubiquitous / Integrated
Yep. AI is popping up left, right and centre.
Adobe recently announced it’s adding AI tools to Premiere Pro and Google and Microsoft are jamming AI into every conceivable part of their product portfolios.
And it’s not just software. Microsoft has been trying to persuade OEMs to add a Copilot key to keyboards, those crazy cats at Logitech have launched a mouse with a dedicated AI button, Meta have sprinkled some AI fairy dust over their smart Ray-Bans and the Ai Pin and rabbit r1 have started shipping to, er, mixed reviews.
We’ll have to wait till Q4 to discover if the Limitless Pendant is equally underwhelming.
2.) Accessible
AI tools are slowly becoming more accessible.
Midjourney - the premier AI image generation service (see my recent round-up) - is opening its web interface to users who’ve generated more than 100 images via its Discord channel. It’s a huge leap forward in terms of ease of use: generating images via Discord means remembering to append various text parameters (such as --ar 16:9 for a widescreen aspect ratio) to control framing and visual effects, whereas the website is a simple point-and-click affair.
The shift towards multimodal models is also going to help make them more accessible, inviting more natural forms of interaction.
3.) Applied
More applied models are starting to appear.
In January, AIWaves shared details of Weaver, a family of LLMs for creative writing.
In February, Microsoft added Microsoft Copilot for Finance to the burgeoning Copilot family.
4.) Personal
Announced in February, ChatGPT memory is a small step towards a more personal experience of AI.
Windows Recall, announced yesterday, is potentially a bigger one, with its ability to “access virtually what you have seen or done on your PC in a way that feels like having photographic memory”.
I expect Apple to apply this to the iPhone with the release of iOS 18 and the iPhone 16.
5.) Refinable
Apple released MGIE in February, which enables image editing using natural language and OpenAI added the ability to regenerate portions of a ChatGPT/DALL-E generated image in April.
Sora also appears to be making strides in refining video output using text prompts.
6.) Real-time
OpenAI claim its new multimodal model, GPT-4o, “can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation”.
Further up the stack, Groq put the cat amongst the GPU pigeons in February with a demo of its superfast Language Processing Unit (LPU).
Taking the latency out of our interactions with AI is going to change when and how we use it.
7.) Vocal
I can’t say I was expecting OpenAI to beat Amazon, Apple and Google to the punch on bringing the voice assistant kicking and screaming into the generative AI age but they appear to have done just that with the release of GPT-4o.
As well as significantly reduced latency, it’s now possible to interrupt voice responses, which makes for a more natural conversation, all but removing those awkward waits for the voice assistant to start or finish speaking (“ALEXA, STOP!” is a common exclamation in my household).
They’ve also injected more personality into the default voice and nicked Hume’s USP by making it responsive to your mood/direction.
The voice element of GPT-4o is rolling out to ChatGPT Plus & Teams subscribers over the coming weeks, so we’ve currently only got demos to go on. This one’s pretty impressive (although I think they might need to tweak the default voice for UK users to be less relentlessly upbeat…)
8.) Transparent
The EU AI Act, passed by the European Parliament in March, obligates providers of general-purpose AI models to publish detailed summaries of the content used for training.
The UK Government is reportedly working on transparency legislation and, over the pond, the Generative AI Copyright Disclosure Act was introduced as a bill in April.
9.) Multi-media / multimodal
GPT-4o being natively multimodal is the big leap forward here. Rather than having to hand off between models optimised for different modalities, which introduces a time lag, GPT-4o is a one-stop shop for interacting using text, voice, image and/or video.
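To make the “one model, many modalities” point concrete, here’s a minimal sketch of what a single mixed text-and-image request looks like via the OpenAI Python SDK. The image URL is a placeholder, and the audio side of GPT-4o isn’t exposed through the public API at the time of writing, so this only covers text + vision:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# One request mixing text and an image - no hand-off to a separate vision model.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's going on in this picture?"},
            # Placeholder URL - swap in your own image.
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```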
10.) On-device
Launched in December, Google’s Gemini Nano is optimised for on-device use, as are Apple’s recently unveiled OpenELM models, which we can expect to see used in anger in iOS 18.
Meanwhile, running an open-source model, such as Llama 3, locally on a phone continues to get easier.
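As a rough illustration of how low the barrier now is (on a laptop at least - phone apps like MLC Chat follow a similar download-the-weights-then-chat pattern), here’s a sketch using the ollama Python client, assuming Ollama is installed and the model has been pulled with ollama pull llama3:

```python
# Assumes Ollama is installed locally and `ollama pull llama3` has been run.
import ollama

response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Give me a two-line summary of the Llama 3 release."}],
)
print(response["message"]["content"])
```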
Now, how about some new predictions? Here are 4 to round my 2024 predictions up to 15:
12.) We’ll see the first AI-generated immersive worlds/games
Non-generative AI has been used in game development for decades. Generative AI is prompting a more ambivalent reaction, with fears about the impact on designers and the reaction of fans to AI-generated characters, voices and game objects.
Nevertheless, developers are starting to use generative AI toolkits such as Scenario and Inworld, whilst Ludo.ai has positioned itself as a less threatening ‘research and ideation’ service.
AI-generated games have thus far been rudimentary 2D affairs (see Google Genie, Rosebud AI). However, I anticipate that the leaps forward manifest in Sora’s output, coupled with the power of natively multimodal models such as GPT-4o, will result in the first AI-generated immersive worlds/games appearing before too long.
13.) We’ll see machines processing inputs/learning more like humans
Most of the current crop of generative AI models were trained on a huge dump of data, which they processed over a period of months to build up a model of language (or images, or videos, or music) - the equivalent of a (super)human disappearing into a library for 6 months, reading everything and then emerging blinking into the sunlight to answer questions.
However, there are signs that there might be other ways.
Researchers showcased a model which learned language concepts solely from recordings from a headcam worn intermittently by a single child between the ages of 6 months and 2 years.
And Meta released V-JEPA - a model pre-trained on video in order to then learn new concepts about the physical world.
14.) We’ll see more AIs navigating GUIs
For my money the most innovative thing about the rabbit r1 is not the device itself (cute as it is) but the way in which it accesses other services - not through complex API integrations, but by aping human interactions with an app’s interface.
A recently released research paper about Ferret UI (“a new MLLM tailored for enhanced understanding of mobile UI screens”) suggests Apple is gearing up to apply these sorts of capabilities to iOS.
And Google appear to be thinking along similar lines - one of its I/O demos last week showed ‘Navigating Page…’ as one of the steps its agent was taking to assist you with a change of address.
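None of these implementations is public, but the general pattern - a model deciding what a human would do next on screen, and an automation layer carrying it out - can be sketched with off-the-shelf tools. Here’s a hypothetical skeleton using Playwright, where decide_next_action stands in for whatever model does the reasoning and the URL and selectors are made up for illustration:

```python
from playwright.sync_api import sync_playwright

def decide_next_action(page_text: str) -> dict:
    """Stand-in for a model call that reads the page and picks the next step.
    A real agent would return something like
    {"action": "fill", "selector": "#new-address", "value": "1 Example Street"}."""
    return {"action": "click", "selector": "text=Change address"}

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/account")  # placeholder URL

    # The agent loop (one step shown): read the screen, ask the model, act on its answer.
    step = decide_next_action(page.inner_text("body"))
    if step["action"] == "click":
        page.click(step["selector"])
    elif step["action"] == "fill":
        page.fill(step["selector"], step["value"])

    browser.close()
```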
15.) Perplexity will scale to 100 million users, primarily through word of mouth
The word-of-mouth recommendations I’ve given and received for Perplexity remind me of the conversations I was having about Google in the late 90s.
Not just, “here’s a cool thing”. But “here’s a thing that’s so useful that I’m no longer using Ask Jeeves/Google” (delete according to millennia).
It’s entirely possible Google and/or OpenAI will succeed in raining on Perplexity’s parade with a similarly compelling experience piggybacking on their existing scale properties (see Threads’ record-breaking journey to 100m users on Instagram’s coat tails). Google is certainly betting big on AI Overviews, which it says it will roll out to 1 billion users this year.
Personally, I’ll be rooting for Perplexity to put a dent in Google’s dominance of search, although it’s going to be tough when Google are willing and able to pay a cool $20 billion to be the default on Apple’s Safari browser.