Learn to Teach
Sample-efficient privileged learning for humanoid locomotion over diverse terrains.
Overview
Learn to Teach (L2T) targets a practical bottleneck in humanoid reinforcement learning: policies can become robust in simulation, but training them across enough terrain, contact, and disturbance variation usually consumes enormous numbers of simulator samples.
The paper reframes teacher-student learning as a one-stage privileged learning problem. Rather than training a privileged teacher first and distilling it into a deployable student afterward, L2T trains the teacher and student together. The two policies share the same simulated dynamics, so samples collected for one policy can also be used to train the other. This synchronizes their learning trajectories, recycles simulator data, and reduces both sample complexity and training time.
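The one-stage idea can be illustrated with a toy sketch. This is not the paper's algorithm. It assumes linear policies, and it replaces the teacher's RL update with a simple least-squares step toward a known target action, purely so the example stays short and runnable. The names `joint_update` and `train`, the dimensions, and the learning rate are all hypothetical. What the sketch does show is the structure: every batch of simulated samples updates both the privileged teacher and the partially observing student, so there is no separate distillation stage.

```python
import numpy as np


def joint_update(teacher_w, student_w, priv_obs, student_obs, target_act, lr=0.1):
    """Update teacher and student from the same batch (toy linear policies)."""
    n = priv_obs.shape[0]

    # Teacher step. ASSUMPTION: a least-squares step toward target actions
    # stands in for the RL policy update used in real training.
    teacher_err = priv_obs @ teacher_w - target_act
    teacher_w = teacher_w - lr * priv_obs.T @ teacher_err / n

    # Student step: imitate the freshly updated teacher on the same samples,
    # using only the non-privileged part of the observation.
    teacher_act = priv_obs @ teacher_w
    student_err = student_obs @ student_w - teacher_act
    student_w = student_w - lr * student_obs.T @ student_err / n

    return teacher_w, student_w


def train(steps=500, batch=512, seed=0):
    """Run the one-stage loop on synthetic data and return the learned weights."""
    rng = np.random.default_rng(seed)
    priv_dim, obs_dim, act_dim = 6, 4, 2  # hypothetical sizes

    # Hypothetical "ideal" action map that the stand-in teacher objective targets.
    true_w = rng.normal(size=(priv_dim, act_dim))
    teacher_w = np.zeros((priv_dim, act_dim))
    student_w = np.zeros((obs_dim, act_dim))

    for _ in range(steps):
        # One shared batch of simulated samples per iteration.
        priv_obs = rng.normal(size=(batch, priv_dim))
        student_obs = priv_obs[:, :obs_dim]  # student sees a subset of the state
        target_act = priv_obs @ true_w
        teacher_w, student_w = joint_update(
            teacher_w, student_w, priv_obs, student_obs, target_act
        )

    return teacher_w, student_w, true_w


if __name__ == "__main__":
    teacher_w, student_w, true_w = train()
    print("teacher error:", np.max(np.abs(teacher_w - true_w)))
    print("student error:", np.max(np.abs(student_w - true_w[:4])))
```

In this toy setting the student tracks the teacher from the first iteration, which mirrors the synchronized learning trajectories described above. In a two-stage pipeline, the student loop would only begin after the teacher loop had finished and would need its own batch of samples.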
On the Digit humanoid, the RL variant (L2T-RL) demonstrates zero-shot sim-to-real transfer. The hardware videos cover outdoor and indoor walking, concrete, gravel, grass, sand, rocky terrain, slippery surfaces, elevation changes, payload carrying, turning, and recovery from pushes and pulls.
Why it matters
- It keeps the useful structure of privileged learning while removing the brittle handoff between a completed teacher and a later student.
- It makes simulation data work harder by updating teacher and student policies from shared rollouts.
- It shows that the resulting student policy transfers to real Digit hardware across 12+ terrain and disturbance settings.
- It complements my earlier humanoid locomotion work on reward learning from demonstrations: Infer and Adapt learns a reward for bipedal locomotion, while L2T focuses on sample-efficient privileged policy learning.