Intelligent beings learn by interacting with the world. Artificial intelligence researchers have adopted a similar strategy to teach their virtual agents new skills.
In 2009, a computer scientist then at Princeton University named Fei-Fei Li created a data set that would change the history of artificial intelligence. Known as ImageNet, the data set included millions of labeled images that could train sophisticated machine-learning models to recognize the objects in a picture. The machines surpassed human recognition abilities in 2015. Soon after, Li began looking for what she called another of the “North Stars” that would give AI a different push toward true intelligence.
She found inspiration by looking back in time over 530 million years to the Cambrian explosion, when a profusion of new animal species appeared for the first time. An influential theory posits that the burst of new species was driven in part by the emergence of eyes that could see the world around them for the first time. Li realized that vision in animals never occurs by itself but instead is “deeply embedded in a holistic body that needs to move, navigate, survive, manipulate and change in the rapidly changing environment,” she said. “That’s why it was very natural for me to pivot towards a more active vision [for AI].”
Today, Li’s work focuses on AI agents that don’t simply accept static images from a data set but can move around and interact with their environments in simulations of three-dimensional virtual worlds.
This is the broad goal of a new field known as embodied AI, and Li’s not the only one embracing it. It overlaps with robotics, since robots can be the physical equivalent of embodied AI agents in the real world, and with reinforcement learning, which has always trained interactive agents to learn using long-term rewards as an incentive. But Li and others think embodied AI could power a major shift from machines learning straightforward abilities, like recognizing images, to learning how to perform complex humanlike tasks with multiple steps, such as making an omelet.
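To make that learning-by-interaction loop concrete, here is a minimal, self-contained sketch of the idea behind reinforcement learning: an agent acts, observes a reward, and gradually credits the actions that pay off in the long run. The tiny “corridor” world and tabular Q-learning below are invented stand-ins for illustration only; the embodied agents described in this article pair this kind of reward-driven loop with deep neural networks and far richer 3D simulators.

```python
import random

# Illustrative sketch only: a tiny "corridor" world and tabular Q-learning,
# standing in for the far richer simulators and neural networks used in
# embodied AI. The agent learns purely by acting and observing rewards.

N_STATES = 6          # positions 0..5; reaching position 5 earns the reward
ACTIONS = (-1, +1)    # step left or step right

def env_step(state, action):
    """Move in the corridor; reward 1 only when the far end is reached."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

# Q-values: the agent's running estimate of each action's long-term payoff.
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(500):
    state, done, steps = 0, False, 0
    while not done and steps < 100:
        action = random.choice(ACTIONS)                      # explore by acting
        next_state, reward, done = env_step(state, action)
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        # Credit the action with its immediate reward plus discounted future value.
        q[(state, action)] += 0.1 * (reward + 0.9 * best_next - q[(state, action)])
        state, steps = next_state, steps + 1

# After training, the greedy policy heads toward the reward from every position.
print({s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)})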
“Naturally, we get more ambitious, and we say, ‘Okay, how about building an intelligent agent?’ And at that point, you’re going to think of embodied AI,” said Jitendra Malik, a computer scientist at the University of California, Berkeley.
Work in embodied AI today includes any agent that can probe and change its own environment. While in robotics the AI agent always lives in a robotic body, modern agents in realistic simulations may have a virtual body, or they may sense the world through a moving camera vantage point that can still interact with their surroundings. “The meaning of embodiment is not the body itself, it is the holistic need and functionality of interacting and doing things with your environment,” said Li.
This interactivity gives agents a whole new — and in many cases, better — way of learning about the world. It’s the difference between observing a possible relationship between two objects and being the one to experiment and cause the relationship to happen yourself. Armed with this new understanding, the thinking goes, greater intelligence will follow. And with a suite of new virtual worlds up and running, embodied AI agents have already begun to deliver on this potential, making significant progress in their new environments.
“Right now, we don’t have any proof of intelligence that exists that is not learning through interacting with the world,” said Viviane Clay, an embodied AI researcher at the University of Osnabrück in Germany.
Toward a Perfect Simulation
While researchers had long wanted to create realistic virtual worlds for AI agents to explore, it was only in the past five years or so that they could start building them. The ability came from improvements in graphics driven by the movie and video game industries. In 2017, AI agents could make themselves at home in the first virtual worlds to realistically portray indoor spaces — in literal, albeit virtual, homes. A simulator called AI2-Thor, built by computer scientists at the Allen Institute for AI, lets agents wander through naturalistic kitchens, bathrooms, living rooms and bedrooms. Agents can study three-dimensional views that shift as they move, exposing new angles when they decide to take a closer look.
Such new worlds also gave agents the chance to reason about changes in a new dimension: time. “That’s the big difference,” said Manolis Savva, a computer graphics researcher at Simon Fraser University who has built multiple virtual worlds. “In the embodied AI setting … you have this temporally coherent stream of information, and you have control over it.”
These simulated worlds are now good enough to train agents to do entirely new tasks. Rather than just recognize an object, they can interact with it, pick it up and navigate around it — seemingly small steps but essential ones for any agent to understand its environment. And in 2020, virtual agents went beyond vision to hear the sounds virtual things make, providing another way to learn about objects and how they work in the world.
That’s not to say the work is finished. “It’s much less real than the real world, even the best simulator,” said Daniel Yamins, a computer scientist at Stanford University. With colleagues at MIT and IBM, Yamins co-developed ThreeDWorld, which puts a strong focus on mimicking real-life physics in virtual worlds — things like how liquids behave and how some objects are rigid in one area and soft in others.
“This is really hard to do,” said Savva. “It’s a big research challenge.”
Still, it’s enough for AI agents to start learning in new ways.
Comparing Neural Networks
So far, one easy way to measure embodied AI’s progress is to compare embodied agents’ performance with that of algorithms trained on simpler, static image tasks. Researchers note these comparisons aren’t perfect, but early results do suggest that embodied AI agents learn differently — and at times better — than their forebears.
In one recent paper, researchers found an embodied AI agent was more accurate at detecting specified objects, improving on the traditional approach by nearly 12%. “It took the object detection community more than three years to achieve this level of improvement,” said Roozbeh Mottaghi, a co-author and a computer scientist at the Allen Institute for AI. “Simply just by interacting with the world, we managed to gain that much improvement,” he said.
Other papers have shown that object detection improves among traditionally trained algorithms when you put them into an embodied form and allow them to explore a virtual space just once, or when you let them move around to gather multiple views of objects.
Researchers are also finding that embodied and traditional algorithms learn fundamentally differently. For evidence, consider the neural network — the essential ingredient behind the learning abilities of every embodied and many nonembodied algorithms. A neural network is a type of algorithm with many layers of connected nodes of artificial neurons, loosely modeled after the networks in human brains. In two separate papers, one led by Clay and the other by Grace Lindsay, an incoming professor at New York University, researchers found that the neural networks in embodied agents had fewer neurons active in response to visual information, meaning that each individual neuron was more selective about what it would respond to. Nonembodied networks were much less efficient and required many more neurons to be active most of the time. Lindsay’s group even compared the embodied and nonembodied neural networks to neuronal activity in a living brain — a mouse’s visual cortex — and found the embodied versions were the closest match.
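The sparsity difference the two teams reported can be pictured with a small, invented example: record each network’s activations over many inputs and ask what fraction of its neurons respond strongly to each one. The arrays and threshold below are made up for illustration and are not the measurements from Clay’s or Lindsay’s papers.

```python
import numpy as np

# Invented data for illustration: rows are visual inputs, columns are neurons.
# A "selective" network has most activations near zero, with only a few
# neurons responding strongly to any given input.
rng = np.random.default_rng(0)
selective_acts = rng.exponential(scale=0.2, size=(1000, 512))   # mostly quiet
broad_acts = rng.exponential(scale=1.0, size=(1000, 512))       # broadly active

def active_fraction(activations, threshold=0.5):
    """Average fraction of neurons whose activation exceeds the threshold."""
    return float((activations > threshold).mean())

print("selective network:", active_fraction(selective_acts))
print("broad network:    ", active_fraction(broad_acts))
```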
Lindsay is quick to point out that this doesn’t necessarily mean the embodied versions are better — they’re just different. Unlike the object detection papers, Clay’s and Lindsay’s work compares how the same kinds of neural networks behave when their agents are doing completely different tasks — so the networks may simply need to work differently to accomplish different goals.
But while comparing embodied neural networks to nonembodied ones is one measure of progress, researchers aren’t really interested in improving embodied agents’ performance on current tasks; that line of work will continue separately, using traditionally trained AI. The true goal is to learn more complicated, humanlike tasks, and that’s where researchers have been most excited to see signs of impressive progress, particularly in navigation tasks. Here, an agent must remember the long-term goal of its destination while forging a plan to get there without getting lost or walking into objects.
In just a few years, a team led by Dhruv Batra, a research director at Meta AI and a computer scientist at the Georgia Institute of Technology, rapidly improved performance on a specific type of navigation task called point-goal navigation. Here, an agent is dropped into a brand-new environment and must navigate to target coordinates relative to the starting position (“Go to the point that is 5 meters north and 10 meters east”) without a map. By giving the agents a GPS and a compass, and training them in Meta’s virtual world, called AI Habitat, “we were able to get greater than 99.9% accuracy on a standard data set,” said Batra. And this month, the team successfully expanded the results to a more difficult and realistic scenario where the agent doesn’t have GPS or a compass. The agent reached 94% accuracy purely by estimating its position based on the stream of pixels it sees while moving.
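As a toy illustration of what point-goal navigation asks of an agent (and not the Habitat API or the learned neural policies themselves), the sketch below hands an agent its position and heading — the GPS and compass — and has it greedily turn toward a goal specified relative to its start and step forward until it arrives.

```python
import math

# Toy illustration of the point-goal task, not the Habitat API or a learned
# policy: the agent knows its position and heading ("GPS and compass") and
# must reach a target given relative to its starting point.

def egocentric_goal(pos, heading, goal):
    """Distance to the goal and its bearing relative to the agent's heading."""
    dx, dy = goal[0] - pos[0], goal[1] - pos[1]
    rel = math.atan2(dy, dx) - heading
    rel = math.atan2(math.sin(rel), math.cos(rel))   # wrap angle to [-pi, pi]
    return math.hypot(dx, dy), rel

def greedy_step(pos, heading, goal, step=0.25, turn=math.radians(10)):
    """Turn toward the goal when needed, otherwise move straight ahead."""
    distance, rel = egocentric_goal(pos, heading, goal)
    if distance < step:
        return pos, heading, "STOP"
    if abs(rel) > turn:
        return pos, heading + math.copysign(turn, rel), "TURN"
    new_pos = (pos[0] + step * math.cos(heading), pos[1] + step * math.sin(heading))
    return new_pos, heading, "MOVE_AHEAD"

# "Go to the point 5 meters north and 10 meters east" of the start
# (x = east, y = north in this sketch).
pos, heading, goal = (0.0, 0.0), 0.0, (10.0, 5.0)
for _ in range(500):
    pos, heading, action = greedy_step(pos, heading, goal)
    if action == "STOP":
        break
print(action, pos)
```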
“This is fantastic progress,” said Mottaghi. “However, this does not mean that navigation is a solved task.” In part, that’s because many other types of navigation tasks that use more complex language instructions, such as “Go past the kitchen to retrieve the glasses on the nightstand in the bedroom,” remain at only around 30% to 40% accuracy.
But navigation still represents one of the simplest tasks in embodied AI, since the agents move through the environment without manipulating anything in it. So far, embodied AI agents are far from mastering any tasks with objects. Part of the challenge is that when the agent interacts with new objects, there are many ways it can go wrong, and mistakes can pile up. For now, most researchers get around this by choosing tasks with only a few steps, but most humanlike activities, like baking or doing the dishes, require long sequences of actions with multiple objects. To get there, AI agents will need a bigger push.
Here again, Li may be at the forefront, having developed a data set that she hopes will do for embodied AI what her ImageNet project did for AI object recognition. Where she once gave the AI community a huge set of labeled images that standardized input data across labs, her team has now released a standardized, simulated data set of 100 humanlike activities for agents to complete, which can be tested in any virtual world. By creating metrics that compare agents performing these tasks with real videos of humans doing the same things, Li’s new data set will let the community better evaluate the progress of virtual AI agents.
Once the agents are successful on these complicated tasks, Li sees the purpose of simulation as training for the ultimate maneuverable space: the real world.
“Simulation is one of the most, in my opinion, important and exciting areas of robotic research,” she said.
The New Robotic Frontier
Robots are, inherently, embodied intelligent agents. Inhabiting physical bodies in the real world, they represent the most extreme form of embodied AI. But many researchers are now finding that even these agents can benefit from training in virtual worlds.
“State-of-the-art algorithms [in robotics], like reinforcement learning and those types of things, usually require millions of iterations to learn something meaningful,” said Mottaghi. As a result, training real robots on difficult tasks can take years.
But training them in virtual worlds first offers the chance to learn much faster than in real time, since thousands of agents can train at once in thousands of slightly different rooms. Virtual training is also safer for the robot and for any humans in its path.
Many roboticists started taking simulators more seriously in 2018, when researchers at OpenAI proved that transferring skills from simulation to the real world was possible. They trained a robotic hand to manipulate a cube it had seen only in simulations. More recent successes have allowed flying drones to learn how to avoid collisions in the air, self-driving cars to deploy in urban settings across two different continents, and four-legged doglike robots to complete an hourlong hike in the Swiss Alps in the same time it takes humans.
In the future, researchers might also close the gap between simulations and the real world by sending humans into virtual space via virtual reality headsets. A key goal of robotics research, notes Dieter Fox, the senior director of robotics research at NVIDIA and a professor at the University of Washington, is to build robots that are helpful to humans in the real world. But to do that, they must first be exposed to and learn how to interact with humans.
“Using virtual reality to get humans into these simulated environments and enable them to demonstrate things and interact with the robots is going to be very powerful,” Fox said.
Whether they exist in simulations or the real world, embodied AI agents are learning more like us, on tasks that are more like the ones we do every day. And the field is progressing on all fronts at once — new worlds, new tasks and new learning algorithms.
“I see a convergence of deep learning, robotic learning, vision and also even language,” Li said. “And now I think through this moonshot or North Star towards embodied AI, we’re going to learn the foundational technology of intelligence, or AI, that can really lead to major breakthroughs.”