SUGAR: A Scalable Human-Video-Driven
Generalizable Humanoid Loco-Manipulation
Learning Framework

Tianshu Wu¹*, Xiangqi Kong²*, Yue Chen¹*,
Qize Yu¹, Hang Ye¹, Jia Li¹, Yizhou Wang¹, Hao Dong¹✉

¹CFCS, School of Computer Science, Peking University
²School of Computer Science and Engineering, Beihang University
*Equal Contribution  ✉Corresponding Author

Summary

Building humanoid robots that perform generalizable whole-body loco-manipulation in the real world remains a fundamental challenge: existing approaches rely on heavy task-specific reward engineering, rigidly replay reference motions that fail to generalize, or depend on costly teleoperation that limits scalability. Human videos capture diverse behaviors, but the motion priors inferred from them are inherently imperfect: occlusion, contact artifacts, and retargeting errors make them unsuitable for direct policy learning. To address this, we present SUGAR, a data-driven framework that converts diverse human videos into deployable humanoid loco-manipulation skills, without any task-specific reward engineering or reference-motion conditioning at inference.

Method Overview

SUGAR consists of three stages: (1) extracting kinematic interaction priors from unstructured human videos through a fully automated pipeline; (2) refining the priors into physically feasible skills with a privileged RL policy; and (3) training a hierarchical autonomous policy on the refined demonstrations for robust humanoid loco-manipulation.
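
The data flow between the three stages can be sketched as follows. This is only an illustrative skeleton, not SUGAR's implementation: every name here (`KinematicPrior`, `Demonstration`, `run_pipeline`, the `toy_*` stand-ins, and the video filenames) is a hypothetical placeholder, and the real stages involve video reconstruction, physics simulation, and policy training that are not modeled.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class KinematicPrior:
    """Stage 1 output (hypothetical type): a kinematic clip that may be infeasible."""
    source_video: str
    frames: List[List[float]]  # placeholder per-frame state vectors


@dataclass
class Demonstration:
    """Stage 2 output (hypothetical type): a refined state/action rollout."""
    source_video: str
    states: List[List[float]]
    actions: List[List[float]]


def run_pipeline(
    videos: Sequence[str],
    extract_prior: Callable[[str], KinematicPrior],
    refine_with_rl: Callable[[KinematicPrior], Demonstration],
    train_policy: Callable[[List[Demonstration]], object],
) -> object:
    """Chain the three stages; each stage is passed in as a callable."""
    priors = [extract_prior(v) for v in videos]   # (1) video -> kinematic prior
    demos = [refine_with_rl(p) for p in priors]   # (2) prior -> feasible demonstration
    return train_policy(demos)                    # (3) demonstrations -> autonomous policy


# Toy stand-ins so the skeleton runs end to end; they do no real work.
def toy_extract(video: str) -> KinematicPrior:
    return KinematicPrior(video, frames=[[0.0], [1.0]])


def toy_refine(prior: KinematicPrior) -> Demonstration:
    zero_actions = [[0.0] for _ in prior.frames]
    return Demonstration(prior.source_video, states=prior.frames, actions=zero_actions)


def toy_train(demos: List[Demonstration]) -> dict:
    return {"num_demos": len(demos)}


policy = run_pipeline(["carry_box.mp4", "kick_box.mp4"], toy_extract, toy_refine, toy_train)
```

Passing the stages as callables keeps the sketch agnostic about how each one is realized. The point it illustrates is that the final policy is trained only on the refined demonstrations, never directly on the raw video priors.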

Real-World Experimental Results

Carry Box

Kick Box

Pick Bottle

Sit Chair

Stand Bottle

Push Box

Robust Closed-Loop Execution & Autonomous Recovery

Our method ensures stable long-horizon interaction even under significant external perturbations and resumes tasks seamlessly after unexpected failures.