Chatting to ChatGPT with live video
One of my AI predictions for last year was that “advances in computer vision will unlock new consumer use cases”.
In May, OpenAI demoed the computer vision capabilities of its first natively multimodal model, GPT-4o, and Google teased similar capabilities as part of its research prototype, Project Astra.
However, the vision capabilities of their live products were limited to sharing static photos and screenshots and, in Gemini’s case, pre-recorded videos. When ChatGPT’s Advanced Voice Mode finally rolled out in late September, it didn’t include the ability to share live video.
That changed on the 11th and 12th December, when Google and OpenAI both launched combined voice + video modes for Gemini and ChatGPT.
How to access
To access ChatGPT’s prosaically named ‘Advanced Voice Mode with Video’, you need to be a paid subscriber. It’s then simply a matter of opening the ChatGPT mobile app and starting a voice conversation by tapping the soundwave icon (bottom right), then tapping the camera icon (bottom left). You can toggle between your rear and front-facing cameras by tapping the circular arrows.
To experience voice + video on Gemini, you currently need to head to Google’s AI Studio, select ‘Stream Realtime’ and tap the camera icon (to the left of the text input box).
With the camera enabled, ChatGPT or Gemini can ‘see’ whatever you point your phone at and chat about it. It’s also possible to share your screen and chat about that instead.
It’s going to take time for the valuable applications of this new capability to emerge. We’re currently at the experimentation and contrived demos/parlour tricks stage.
Hands-on testing
Google’s and OpenAI’s demo videos are obviously carefully choreographed, curated and edited, so I thought I’d share a couple of unedited clips of my first interactions with ChatGPT’s Advanced Voice Mode with Video, along with some observations.
Test 1: The veg box challenge
Observations:
Identifies an impressive number of fruit and veg from a quick pan of the box, although it mistakes blueberries for blackberries.
Genuinely quite helpful in identifying less common fruit and veg (I am notoriously bad at this), although please don’t use it for mushrooms.
Gives pretty sound dietary advice, although it misidentifies the Shreddies as Shredded Wheat (potential US bias).
Understands that my suggestion of a bowl of marshmallows is in jest and responds appropriately with a half-hearted “ha ha”.
It’s worth pausing on that final bullet for a moment to reflect on how far AI has come and what we’ve already started to take for granted. An AI model has identified a jar of marshmallows it’s never seen before, made sense of what I’ve said in my weird English accent and verbally responded in a cogent and appropriate fashion, all in a matter of milliseconds. What’s more, my 3-year-old will never find any of this remarkable.
Test 2: Recommend board games
Observations:
Its first recommendations were on the money (Splendor is indeed perfect for a quiet evening after the kids are in bed) and it got the age suitability of Tension right.
However, it finishes by recommending a game which isn’t visible in the video. It acknowledges this when challenged, although it styles it out as “a great addition if you’re looking for something new and challenging”.
Current limitations and future potential
Whilst the responses tend to be a bit formal, verbose and relentlessly upbeat (addressable through system prompting), and hallucination remains an issue (harder to address), it’s not difficult to conceive of some powerful product experiences being built on top of this capability.
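For the more technically minded, here’s a minimal sketch of what I mean by system prompting, using the OpenAI API rather than the ChatGPT app (the app only exposes something like this indirectly, via its custom instructions setting). The model name and prompt wording are illustrative choices of mine, not anything specific to voice + video mode:

```python
# Minimal sketch: using a system prompt to rein in tone and verbosity.
# Assumes the OpenAI Python SDK (v1.x); model name and prompt wording are
# illustrative assumptions, not a prescribed recipe.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "Answer in no more than two short sentences. "
                "Be matter-of-fact: no pep talk, no exclamation marks."
            ),
        },
        {"role": "user", "content": "What could I cook with leeks, kale and blueberries?"},
    ],
)
print(response.choices[0].message.content)
```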
Building on the cereal example, I would readily download an app which I could point at a supermarket shelf to identify the level of processing different foods had undergone, rather than trying to infer it from the length and perceived artificiality of the 4-point font ingredients list.
Voice + video also has the potential to move smart glasses beyond the early adopter. On 16th December, Meta announced that its Ray-Ban Meta glasses would be upgraded with ‘live AI’, which enables Meta AI to “see what you see continuously and converse with you more naturally than ever before”, and Halliday Glasses have got CES excited.
Regardless, this feels like another significant step in moving consumer AI beyond a ‘text-in, text-out’ box on a website towards something we can seamlessly interact with in all modalities (for better and for worse).
A message from our sponsors: Oh wait, there aren’t any sponsors. Dan’s Media & AI Sandwich is free of advertising and free to access. If you value my writing and want to help me dedicate more time to it, please consider becoming a paid subscriber (huge thanks to those who already have). Alternatively, you can spread the word.
I work with a wide range of organisations to help them make sense of AI. Whether it's delivering keynotes, running training sessions, drafting policy or designing experiments, I’d love to help. Drop me a line at mail@dantaylorwatt.com.