AI video supertest
15 models, 5 challenges, 1 winner
It’s been nine months since my last AI video supertest. All of the leading models have seen major new versions since then (Veo 3, Sora 2, KLING 3.0), and a couple of serious new contenders have entered the fray (Grok Imagine and Seedance), both with a heavy side-serving of controversy.
I gave 15 of the most capable video models five challenges to test different facets of their capabilities.
Here’s how they fared:
Challenge 1: real-world setting, action, text rendering and camera direction
This challenge looked to test the models’ ability to render action (a delivery scooter stopping at a red light) in a plausible real-world environment (the city of Bath), and to render and maintain accurate text whilst performing a specific camera movement (a 45-degree arc shot).
Full prompt: Street-level shot in Bath, UK. A delivery scooter stops at a red light. On the insulated box is printed: ‘BODY PARTS TO YOU’. The camera slowly circles 45 degrees around the scooter. The text must remain spelled correctly and readable throughout.
Google’s Veo 3.1 was the clear winner here, delivering on the location, action, text rendering and camera movement, with no obvious AI tells and realistic ambient sounds.
Seedance 2.0 was a respectable second, nailing the text but failing to show the scooter stopping and botching both the numberplate and the traffic light.
Sora 2, Wan 2.6, PixVerse V5.5 and Pika 2.5 were the only other models to render the text correctly, but all fell down on other aspects of the challenge.
Lots of the models struggled to render UK traffic lights accurately, and almost half depicted driverless scooters (Hailuo inexplicably served up a van).
Runway Gen-4.5 gave us an absurd number of red lights and the grammatically correct but nonsensical text ‘Body Pours to You’ (not quite as good as Seedance 1.5 Pro’s ‘Boy Pants to You’).
However, the wooden spoon must be shared by Moonvalley’s Marey and Adobe Firefly, which both managed to miss every element of the brief and depict some seriously possessed scooters.
Winner: Veo 3.1
Challenge 2: 3D animation style, event sequencing and simple dialogue
This challenge was designed to test the models’ ability to render a distinctive animation style (claymation), sequence some cause-and-effect action and perform some anthropomorphic amphibian lip-syncing, with a sound effect and a single word, delivered with emphasis.
Full prompt: Stop-motion claymation with visible fingerprints. A frog hops onto a stack of books; the books wobble, then settle. The frog burps, then says “greetings!”
Veo 3.1 managed a decent claymation style and cause-and-effect action, but delivered its greetings prematurely.
Wan 2.6 hit the key elements of the brief but blew it by having the burp and the greetings delivered by two different mouths.
Both Seedance 2.0 and KLING 3.0 nailed all the key elements, but KLING 3.0 just edges it for me, being more obviously claymation and delivering a more congruent ‘greetings!’.
Winner: KLING 3.0
Challenge 3: likeness of a real person, accented speech, improvising dialogue
For this image-to-video challenge, I uploaded a triptych image of me in different guises, generated by Nano Banana Pro for my recent comparison of AI image generators. I then gave the models a very short, open-ended prompt to see how they would improvise.
Full prompt: These 3 British men discuss the weather
Six of the 15 models aren’t capable of generating audio, which meant automatic disqualification from this challenge.
PixVerse V5.5, LTX-2 Pro and Grok Imagine all struggled to get the right dialogue coming out of the right Dan’s mouth.
Seedance 2.0 and Vidu Q3 ensured the speakers didn’t get muddled by cutting to each in turn and did a decent job of the accents, but it wasn’t what I’d had in mind, and Vidu’s lip-syncing was a bit off.
Sora 2 transported the three Dans to a Northern street, delivering plausible, if slightly rushed, lip-synced dialogue but completely abandoned my likeness from the reference image (possibly by design as Sora has a separate mechanism for managing likenesses).
Of the other talkers, only KLING 3.0 and Veo 3.1 managed to choreograph all three Dans talking in turn, and both conjured up some convincing British accents. Veo 3.1 just loses out thanks to some misplaced lip-syncing and ambient noise.
Another shared wooden spoon for Moonvalley’s Marey and Adobe Firefly, thanks to their respective flying cars and crazy eyeballs.
Winner: KLING 3.0
Challenge 4: hands, detail, scene composition, character consistency
This challenge looked to test how the models cope with hands (a bête noire of AI video models), high levels of detail, novel scene composition (I’d be surprised if this exact scene was in their training data) and transposing a character into a radically different context whilst retaining consistency.
Full prompt: Close up shot of an elderly woman's hands holding a china teacup. She has a gold signet ring on her middle ring finger with the initials 'SG' embossed. Zoom out to show her sitting in a small room chock-full of toy ducks and Christmas crackers. Cut to the same woman in scrubs performing an appendectomy. Her ring is on a necklace around her neck.
None of the models completely nailed this challenging prompt. Most struggled with the ring being on the protagonist’s middle finger and all failed to render it on her necklace in the second scene, converting it to a pendant (damn training data bias).
Marey fell at the first hurdle with malformed hands (but would also have been disqualified for performing surgery in a non-sterile environment).
Seedance 2.0 refused to generate a video for this prompt, reporting ‘Content flagged as potentially sensitive. This may involve copyrighted characters, brands, real people, or sensitive groups. Please try different prompts or images’ (presumably part of their recent safeguard strengthening), so I reverted to Seedance 1.5 Pro for this challenge.
Both Seedance 1.5 Pro and fellow Chinese model MiniMax Hailuo 2.3 decided on a Studio Ghibli-style animation in response to this prompt, but neither managed to render the initialled ring.
KLING 3.0 nailed the camera direction but failed on the hands, initialled ring and character consistency.
Sora 2 made a decent fist of most elements, but didn’t deliver the zoom out and gave its protagonist an implausibly bendy pinky finger. The two scenes also lacked aesthetic continuity, with the first scene looking like a Werther’s Original ad and the second looking like Silent Witness.
Veo 3.1 gave us two initialled rings rather than one and also missed the zoom out. However, it hit every other element of the brief and delivered the most realistic footage and the best character consistency and scene continuity.
PixVerse V5.5 deserves special mention for generating the most disquieting video (although the Being John Malkovich-esque second shot from Wan 2.6 is also pretty unsettling). Its single shot features one woman magically producing a slightly smaller teacup and a knife from her teacup, whilst another sits in the corner wearing scrubs, pushing red gloop around a bowl.
Winner: Veo 3.1
Challenge 5: unfilmable transitions, cinematic aesthetic
My final challenge was intended to push the models to render an unbroken shot that would be impossible to film, with cinematic colour grading and specific lighting (side note: I wish more AI video experimentation were directed at exploring the unfilmable).
Full prompt: A single unbroken shot begins on the eye of a honeybee resting on a rain-drenched wildflower. The camera pulls back continuously, through a meadow, over rolling hills, a coastline, the curve of the Earth, until the planet is a pale blue dot suspended in darkness. Cinematic colour grading throughout, golden hour light.
Most of the models struggled with this demanding brief, generating unrealistic visuals and resorting to dissolves to get from the coastline to the Earth.
Pika 2.5 attempted one continuous zoom out but couldn’t make the final transition work.
KLING 3.0 gave us a good-looking honeybee and handled the initial pull out well, but then had the Earth looming up from the bottom of the shot.
Veo 3.1 abandoned the pull out after a couple of seconds and struggled with all the subsequent transitions.
Whilst the bee and flowers could arguably have been a touch more realistic, Seedance 2.0 was the only model to nail every element of the brief, executing a seamless pull back from ommatidia to pale blue dot.
Winner: Seedance 2.0
Overall winner
Whilst Veo 3.1 and KLING 3.0 both netted two victories (and Seedance 2.0 one), I’m going to name Veo 3.1 the overall winner, as KLING 3.0’s text rendering made its Challenge 1 generation unusable and it fluffed almost all elements of Challenge 4.
Conclusions
The leading video generation models have come a long way in nine months in terms of realism, prompt adherence and consistency (compare the winning generations above with Challenge 2 from my May 2025 test).
However, there isn’t a single model that consistently outperforms the others in all scenarios.
Veo 3.1 is reliably strong when it comes to generating realistic, physics-respecting footage. It’s also best-in-class at incorporating text. However, it can struggle with less realistic scenarios and its audio generation can be inconsistent.
KLING 3.0 is a strong all-rounder and outperformed Veo 3.1 on the less real-world briefs. However, it struggled with text rendering, which is a common element of creative briefs.
Seedance 2.0 made a few small mistakes in Challenges 1 and 3 (and refused Challenge 4 altogether), but its prompt adherence on Challenge 5 was seriously impressive (as are some of the viral videos currently doing the rounds).
It’s fair to say that with Seedance 2.0 and KLING 3.0, China has very nearly closed the gap on the best US video model, although Veo 3.1 still has the edge when it comes to realism. I also expect Veo 4 to raise the bar again come Google I/O in May.
The upshot is that those generating varied videos will likely want access to more than one model via a creative platform (a comparison of these is coming soon).
Sadly, the two models trained exclusively on licensed footage, Adobe Firefly and Moonvalley’s Marey, are both some way behind the leading models when it comes to handling more complex prompts.