It’s seven months since my last AI video generator round-up and the landscape has changed dramatically.
OpenAI and Adobe finally released their long-awaited models: Sora and Firefly Video. Google unveiled Veo 2. US startups Genmo, Pika, Luma and Runway all released new models: Mochi 1, Pika 2, Ray 2 and Runway Gen-4. UK startup Haiper released Haiper 2.5. Chinese tech behemoths Tencent, Alibaba, Kuaishou and ByteDance released Hunyuan Video, Wan 2.1, KLING 2.0 and Seaweed-7B. And Chinese startups MiniMax and Aishi released T2V-01-Director and PixVerse V4. Meta Movie Gen remains unreleased.
I decided to put all the models I could get my hands on[1] to the test with five challenges. We’ve all seen lots of impressive AI-generated videos, carefully selected by AI companies and creatives to showcase their models/prompting skills. I wanted to push the models and expose their current limitations, as well as their strengths.
Each model was allowed only a single attempt at each challenge and I avoided using advanced techniques to control the output - just simple text prompts, some with an accompanying image.
I tried to craft prompts that wouldn’t generate copyright-infringing output. Adobe Firefly refused to animate one of my images “because the reference image doesn't meet our User Guidelines”, though I’m unclear whether that was because the style of the Ideogram-generated image was deemed too close to existing IP or because the prompt raised some other flag.
For those of you worried about the environmental impact of generating all these videos (post on this topic coming soon), I carbon-offset all of my AI use (via Ecologi).
Challenge 1: Camera direction and unpredictable natural elements
My first challenge was to emulate a specific camera motion (an arc shot) capturing a human-made object (an oil rig) being ravaged by an unpredictable natural element (fire), surrounded by another unpredictable natural element (water).
The full prompt was ‘A cinematic arc shot capturing a massive offshore oil rig fully engulfed in flames against a dark night sky. The camera smoothly moves around the rig, highlighting the intense fire, billowing black smoke, and the reflection of the flames on the surrounding ocean waters. The scene is dramatic and intense, emphasizing the scale of the disaster.’
All of the models managed to render the key elements, although few, if any, of the outputs could be mistaken for real footage.
Seven of the twelve delivered a smooth cinematic arc shot (Runway Gen-4 and Wan 2.1 were pretty static, Pika 2.2 was too frenetic, Veo 2 rendered a tracking shot, and Sora conjured a lesser-seen ‘drunk camera operator’ shot).
PixVerse V4, MiniMax Video-01 and Hunyuan Video all did a decent job of rendering plausible elements and a smooth arc shot, although I think KLING 2.0 just edges it.
Winner: KLING 2.0
Challenge 2: Elements in relation to one another
My second challenge aimed to push the models on their ability to render elements in relation to one another - a challenge because AI video generators don’t have a good understanding of semantics or real-world physics.
The prompt was ‘A woman pushing a buggy across a zebra crossing whilst talking on her phone and walking her whippet’.
Whilst every model managed to render a woman, a buggy, a zebra crossing, a phone and a dog, none managed to position them correctly in relation to one another.
Veo 2 generated the most whippet-like dog but let itself down with a disappearing dog lead and depicting the woman jaywalking along the road rather than crossing it.
KLING 2.0 managed to render a woman walking in the correct direction across a plausibly oriented zebra crossing, although its dog didn’t look like a whippet and its buggy was disconcertingly self-propelling.
Runway Gen-4, Sora, PixVerse V4 and Genmo Mochi 1 all hallucinated a sinister alternative to a hands-free headset - a third arm.
Whilst imperfect, Hunyuan Video made the best fist of this challenge.
Winner: Hunyuan Video
Challenge 3: Complex human motion & emotion
I upped the difficulty another notch for this challenge, describing a complex and unusual interaction between two emoting protagonists: ‘A woman in a glamorous red dress dramatically throws a wig across a city rooftop at sunset. It spins through the air and lands perfectly on the head of a surprised bald man, who looks delighted. Slow motion, cinematic lighting, wind blowing.’
Whilst it rendered a more voluminous wig than I’d envisaged, KLING 2.0 pretty much nailed this peculiar brief, as did PixVerse V4, which was a wig spin away from full marks.
Pika technically met most of the brief, just in a typically zany fashion.
The other models all went full fever dream.
Winner: PixVerse V4
Challenge 4: Animated action and reaction in sequence
My fourth challenge included a reference image I’d created previously with Ideogram. The prompt I used attempted to pack in a lot for a 5-second clip and included a challenging series of actions/reactions in sequence: ‘The lion lets out a mighty roar. The man yelps, does a dramatic cartoon-style jump, and sprints away with flailing arms. The lion smirks and rolls its eyes’.
All of the models struggled with this.
KLING 2.0 once again fared the best, managing a roar, yelp, jump, sprint and lion reaction in sequence.
The other models all under- or over-animated.
Winner: KLING 2.0
Challenge 5: Comedic deep fake
My final challenge tested how the different models get on with animating the likeness of a real person in a sudden and dramatic fashion. I uploaded a photo of myself sat at a bar and added the prompt: ‘The man sat on the bar stool waves and then falls off the stool’.
Adobe Firefly initially refused to generate a video based on the photo, then depicted me melting rather than falling. Pika 2.2’s rendering was similarly Daliesque. Sora and Veo 2 quickly lost my likeness (in Sora’s case, by design) and failed to render a tumble. However, the accidental comic genius award goes to Luma Ray 2, which had me simply lie back and levitate.
PixVerse V4 and KLING 2.0 did a pretty decent job of both the wave and the fall.
Winner: KLING 2.0
Overall winner
On the strength of these five challenges, I’d say KLING 2.0 is currently the model to beat. However, as this supertest illustrates, different models can perform better or worse on different tasks.
In practice, it’s also rarely a matter of entering a single text prompt/reference image and clicking generate. The ability to provide additional input up front (e.g. end frames, camera controls, motion paths, negative prompts, seeds) and tools to refine outputs (e.g. inpainting, outpainting) also play into model selection.
Comparison matrix
I’ve pulled together a snapshot of how the twelve models I tested stack up against one another in terms of capabilities:
Take-outs
There are now a lot of highly capable AI video models available.
However, we’re a long way from reliably getting output that looks good, adheres to a prompt and respects the laws of physics.
Chinese and open-source models have caught up with proprietary US models.
The only widely available model which claims to be wholly trained on licensed material (Adobe Firefly) is some way off the pace on visual quality and motion realism.
Models have got better at rendering lots of elements within a prompt, but struggle to render complex interaction between those elements.
Getting usable output from these models requires creative vision, prompting skills, patience and, often, multiple tools.
The next leap in AI video generation requires models to gain a better understanding of natural language prompts and real-world physics. I suspect this may come from combining a multimodal model capable of native video generation (similar to ChatGPT & Gemini’s native image generation) with a world model like Genie 2.
So, which model should I use?
If you want to create high-quality videos with strong prompt adherence and realistic motion, then KLING 2.0 is currently the best model available, although it lacks some of the functionality of Runway Gen-4. PixVerse V4 is also a strong choice if you don’t need the ability to regenerate areas of a video.
If you just want to create some fun videos to share with friends and family, then Pika has some easy-to-use templates (e.g. a selfie with your younger self), although the free plan doesn’t give access to their latest model.
If you’re a paid ChatGPT subscriber, you already have access to Sora, although I’ve been decidedly underwhelmed by it.
Instead, I would recommend heading to Google AI Studio, which is currently offering free access to Veo 2.
If you want to experiment with multiple models, I’d recommend purchasing a month’s Freepik Premium subscription for £14, as their AI video generator provides access to many of the models reviewed here via a credits system.
[1] I wasn’t able to log in to Seaweed-7B, and Haiper 2.5 kept erroring. I also didn’t include Higgsfield, which doesn’t currently support text-to-video generation and so wouldn’t have been able to take part in three of the five challenges.