New Apple Study Teaches Robots How to Act by Watching First-Person Videos of Humans
2025-05-22
Posted by 3uTools

 

In a new paper called “Humanoid Policy ∼ Human Policy,” Apple researchers propose an interesting way to train humanoid robots. And it involves wearing an Apple Vision Pro.

 

Robot see, robot do

 

The project is a collaboration between Apple, MIT, Carnegie Mellon, the University of Washington, and UC San Diego. It explores how first-person footage of people manipulating objects can be used to train general-purpose robot models.

 

In total, the researchers gathered over 25,000 human demonstrations and 1,500 robot demonstrations (a dataset they called PH2D) and used them to train a unified AI policy that could then control a real humanoid robot in the physical world.

 

As the authors explain:

 

  • Training manipulation policies for humanoid robots with diverse data enhances their robustness and generalization across tasks and platforms. However, learning solely from robot demonstrations is labor-intensive and requires expensive teleoperated data collection, which is difficult to scale.
  • This paper investigates a more scalable data source, egocentric human demonstrations, to serve as cross-embodiment training data for robot learning.

 

Their solution? Let humans show the way.

 

Cheaper, faster training

 

To collect the training data, the team developed an Apple Vision Pro app that captures video from the device’s bottom-left camera, and uses Apple’s ARKit to track 3D head and hand motion.
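
Apple hasn't released the research app itself, but visionOS exposes the relevant tracking through ARKit's hand- and world-tracking providers. Here is a minimal sketch of the kind of capture loop such an app might run; the RecordedFrame type and the recording logic are placeholders of mine, not code from the paper:

```swift
import ARKit
import Foundation
import QuartzCore
import simd

// Placeholder container for one captured frame of a demonstration (not from the paper).
struct RecordedFrame {
    let timestamp: TimeInterval
    let headTransform: simd_float4x4      // device (head) pose in world space
    let wristTransform: simd_float4x4     // tracked hand pose in world space
    let chirality: HandAnchor.Chirality   // left or right hand
}

func recordDemonstration() async throws {
    let session = ARKitSession()
    let handTracking = HandTrackingProvider()
    let worldTracking = WorldTrackingProvider()

    // Start hand and head (world) tracking; on visionOS this has to run inside an
    // immersive space with the user's hand-tracking permission granted.
    try await session.run([handTracking, worldTracking])

    var frames: [RecordedFrame] = []

    for await update in handTracking.anchorUpdates {
        let hand = update.anchor
        guard hand.isTracked else { continue }

        let now = CACurrentMediaTime()
        // Query the head pose at (roughly) the same timestamp as the hand update.
        guard let device = worldTracking.queryDeviceAnchor(atTimestamp: now) else { continue }

        frames.append(RecordedFrame(
            timestamp: now,
            headTransform: device.originFromAnchorTransform,
            wristTransform: hand.originFromAnchorTransform,
            chirality: hand.chirality
        ))
    }
    // In a real app, `frames` would be serialized alongside the recorded video for training.
}
```

Per the paper, these head and wrist poses are recorded together with the camera footage, giving each video frame a matching 3D motion label.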

 

However, to explore a more affordable solution, they also 3D-printed a mount to attach a ZED Mini Stereo camera to other headsets, like the Meta Quest 3, offering similar 3D motion tracking at a lower cost.

 

The result was a setup that let them record high-quality demonstrations in seconds, a pretty big improvement over traditional robot tele-op methods, which are slower, more expensive, and harder to scale.

 

And here’s one last interesting detail: since people move way faster than robots, the researchers slowed down the human demos by a factor of four during training, just enough for the robot to keep up without needing further adjustments.
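
The article doesn't spell out exactly how that slowdown is applied, but the basic idea is easy to picture: stretch each human trajectory in time so the same motion unfolds at a speed the robot can physically follow. A minimal sketch, assuming demonstrations are stored as timestamped pose vectors (the DemoSample type and slowDown function are mine, not the paper's):

```swift
import Foundation

// One sample of a recorded demonstration: a timestamp plus a flat vector of pose
// values (head position, wrist positions, ...). Names are illustrative only.
struct DemoSample {
    var time: TimeInterval
    var values: [Double]
}

/// Slow a human demonstration down by `factor` (4x in the study) by stretching
/// its time axis, so the motion plays back at a pace the robot can keep up with.
func slowDown(_ demo: [DemoSample], by factor: Double = 4.0) -> [DemoSample] {
    demo.map { DemoSample(time: $0.time * factor, values: $0.values) }
}
```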

 

The Human Action Transformer (HAT)

 

The key to the whole study is the HAT, a model trained on both human and robot demonstrations in a shared format.

 

Instead of splitting the data by source (humans vs. robots), HAT learns a single policy that generalizes across both types of bodies, making the system more flexible and data-efficient.
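
The article doesn't reproduce the paper's exact data schema, but a rough sketch of what a shared, cross-embodiment training sample could look like helps make the idea concrete; all type and field names below are assumptions, not taken from the study:

```swift
import simd

// Which body produced the demonstration.
enum Embodiment {
    case human     // egocentric Vision Pro recording of a person
    case humanoid  // teleoperated robot demonstration
}

// Rough sketch of a unified sample: regardless of the source, the state is a head
// pose plus left/right "hand" poses (human wrists or robot end-effectors), and the
// action is the next pair of target hand poses. Field names are illustrative only.
struct CrossEmbodimentSample {
    let source: Embodiment
    let headPose: simd_float4x4
    let leftHandPose: simd_float4x4
    let rightHandPose: simd_float4x4
    let nextLeftHandPose: simd_float4x4   // action target for the left hand
    let nextRightHandPose: simd_float4x4  // action target for the right hand
}
```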

 

In some tests, this shared training approach outperformed more traditional methods, helping the robot handle more challenging tasks, including ones it hadn't seen before.

 

Overall, the study is pretty interesting and worth checking out if you are into robotics.

 

Is the idea of a house humanoid robot scary, exciting, or pointless to you? Let us know in the comments.

 

Source: 9to5Mac
