Google DeepMind’s robotics team is teaching robots to learn how a human intern would: by watching a video. The team has published a new paper demonstrating how Google’s RT-2 robots embedded with the Gemini 1.5 Pro generative AI model can absorb information from videos to learn how to get around and even carry out requests at their destination.
Thanks to the Gemini 1.5 Pro model’s long context window, training a robot like a new intern is possible. This window allows the AI to process extensive amounts of information simultaneously. The researchers would film a video tour of a designated area, such as a home or office. Then, the robot would watch the video and learn about the environment.
The details in the video tours let the robot complete tasks based on its learned knowledge, using both verbal and image outputs. It’s an impressive way of showing how robots might interact with their environment in ways reminiscent of human behavior. You can see how it works in the video below, as well as examples of different tasks the robot might carry out.
A limited context length makes it a challenge for many AI models to recall environments. 🌐Powered with 1.5 Pro’s 1 million token context length, our robots can use human instructions, video tours, and common sense reasoning to successfully find their way around a space. pic.twitter.com/eIQbtjHCbWJuly 11, 2024
Robot AI Expertise
Those demonstrations aren’t rare flukes, either. In practical tests, Gemini-powered robots operated within a 9,000-square-foot area and successfully followed over 50 different user instructions with a 90 percent success rate. This high level of accuracy opens up many potential real-world uses for AI-powered robots, helping out at home with chores or at work with menial or even more complex tasks.
That’s because one of the more notable aspects of the Gemini 1.5 Pro model is its ability to complete multi-step tasks. DeepMind’s research has found that the robots can work out how to answer questions like whether there’s a specific drink available by navigating to a refrigerator, visually processing what’s within, and then returning and answering the question.
The idea of planning and carrying out the entire sequence of actions demonstrates a level of understanding and execution that goes beyond the current standard of single-step orders for most robots.
Don’t expect to see this robot for sale any time soon, though. For one thing, it takes up to 30 seconds to process each instruction, which is way slower than just doing something yourself in most cases. The chaos of real-world homes and offices will be much harder for a robot to navigate than a controlled environment, no matter how advanced the AI model is.
Still, integrating AI models like Gemini 1.5 Pro into robotics is part of a larger leap forward in the field. Robots equipped with models like Gemini or its rivals could transform healthcare, shipping, and even janitorial duties.