In a recent appearance on Possible, a podcast co-hosted by LinkedIn co-founder Reid Hoffman, Google DeepMind CEO Demis Hassabis said the search giant plans to eventually combine its Gemini AI models with its Veo video-generating models to improve the former’s understanding of the physical world.
“We’ve always built Gemini, our foundation model, to be multimodal from the beginning,” Hassabis said. “And the reason we did that [is because] we have a vision for this idea of a universal digital assistant, an assistant that […] actually helps you in the real world.”
The AI industry is moving gradually toward “omni” models, if you will — models that can understand and synthesize many forms of media. Google’s newest Gemini models can generate audio as well as images and text, while OpenAI’s default model in ChatGPT can now create images — including, of course, Studio Ghibli-style art. Amazon has also announced plans to launch an “any-to-any” model later this year.
These omni models require a lot of training data — images, videos, audio, text, and so on. Hassabis implied that the video data for Veo is coming mostly from YouTube, a platform that Google owns.
“Basically, by watching YouTube videos — a lot of YouTube videos — [Veo 2] can figure out, you know, the physics of the world,” Hassabis said.
Google previously told TechCrunch that its models “may be” trained on “some” YouTube content, in accordance with its agreement with YouTube creators. The company reportedly broadened its terms of service last year, in part to tap more data to train its AI models.