How Google’s new model addresses some classic AI image fails
Completely full wine glasses, analogue clocks not showing 10:10 and rooms truly devoid of elephants
There’s a Reddit thread, with 2,600 comments and 15,000 upvotes, about AI image generators not being able to render a completely full glass of wine.
There are lots of threads about AI generated images of analogue clocks/watches always showing the time to be 10:10.
And there are countless threads and LinkedIn posts about ChatGPT not being able to render an ‘empty room with absolutely no elephants inside’.
The first two come down to what dominates the models’ training data: most photos of ‘full’ glasses of wine aren’t full to the brim, and 10:10 is the most common time shown in clock and watch product shots for aesthetic reasons.
The third issue occurs because AI image generators don’t understand negatives in prompts without specific syntax: they see the word ‘elephant’ in the prompt and dutifully include one in the generated image. ChatGPT’s insistence that there are no elephants in the room comes from it handing image generation off to a separate model (the awkwardly named DALL-E), so it can’t actually check what was drawn.
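Diffusion-based generators typically work around this with an explicit negative prompt rather than trying to parse negation out of the prompt text. A minimal sketch using Hugging Face’s diffusers library (the model ID and settings here are my own illustrative choices, not anything from this post):

```python
# Rough sketch: diffusion pipelines suppress a concept via an explicit
# negative_prompt argument, rather than parsing "no elephants" in the prompt.
# Model ID and parameters are illustrative assumptions.
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="an empty living room, wide angle photo",
    negative_prompt="elephant",  # keep the concept out without naming it in the prompt
    num_inference_steps=30,
).images[0]

image.save("empty_room.png")
```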
However, help is at hand. This week Google made a new experimental version of its Gemini 2.0 Flash model available via its AI Studio. Previously only available to testers, the new model combines multimodal input, enhanced reasoning (see previous post on reasoning models) and natural language understanding to create images.
A blog post on Google’s Developers Blog provides some examples of what this means in practice:
“tell a story and it will illustrate it with pictures, keeping the characters and settings consistent throughout. Give it feedback and the model will retell the story or change the style of its drawings.”
“edit images through many turns of a natural language dialogue, great for iterating towards a perfect image, or to explore different ideas together.”
“Unlike many other image generation models, Gemini 2.0 Flash leverages world knowledge and enhanced reasoning to create the right image.”
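You can try all of this in AI Studio’s UI, but for the curious, calling the model through the Gemini API looks roughly like the sketch below. I’m assuming the google-genai Python SDK and the experimental model name; check Google’s docs for the current details.

```python
# Rough sketch of generating an image natively with Gemini 2.0 Flash via the
# google-genai SDK. Model name and config values are assumptions and may
# differ from the live API.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",  # experimental image-generation model (assumed name)
    contents="Draw a photorealistic glass of wine filled completely to the brim.",
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

# The response interleaves text and image parts; save any images it returns.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        with open("wine_glass.png", "wb") as f:
            f.write(part.inline_data.data)
```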
The model’s enhanced reasoning, coupled with the fact it’s natively generating the images rather than handing off to another model, means it doesn’t fail as spectacularly or as consistently as other models on the three challenges I outlined at the start of this post.
Whilst it took five iterations to get the wine glass completely full, the model was able to inspect the images it had generated and refine them based on my natural language instructions, rather than stubbornly insisting the glass was full, ChatGPT style.
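In API terms, each of those iterations amounts to passing the previous image back to the model alongside the next instruction. A rough sketch, again assuming the google-genai SDK and the same model name as above:

```python
# Rough sketch of one iteration step: feed the previously generated image back
# with a natural language correction. SDK usage and model name are assumptions.
from google import genai
from google.genai import types
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")
previous = Image.open("wine_glass.png")

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[previous, "The glass still isn't full. Fill it right to the brim."],
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        with open("wine_glass_v2.png", "wb") as f:
            f.write(part.inline_data.data)
```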
It managed to render wall clocks not showing 10:10 (although it gave me four rather than the three I asked for) and an elephant-free room right off the bat.
It’s still a long way from perfect (trying to get it to render a specific time on an analogue clock face requires the patience of a saint), but it’s a meaningful step forward in models being able to examine the images they’ve generated and in humans being able to iterate on images using natural language.