PriPG-RL: Privileged Planner-Guided Reinforcement Learning for Partially Observable Systems with Anytime-Feasible MPC
A training framework that uses a privileged planner to teach a robot to navigate obstacles with only partial sensor data — validated on a real quadruped robot.

The Thesis
Most real robots don't get to see the full state of the world — sensors are noisy, occluded, or simply absent. This paper proposes a training technique where a well-informed 'teacher' planner, which sees everything, coaches a 'student' policy that only sees a degraded slice of reality. The student policy, once trained, runs on the robot alone with no help from the teacher. The practical payoff is a legged robot that can pick its way through cluttered environments without requiring perfect sensing at deployment time. The catch is that the approach was tested in one specific simulation environment and on a single robot platform, so generality remains to be proven.
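To make the information asymmetry concrete, here is a minimal sketch of the two views. The field names are hypothetical, chosen for illustration rather than taken from the paper's actual observation spaces:

```python
import numpy as np

def full_state(sim):
    """Privileged teacher view: everything the simulator knows exactly."""
    return np.concatenate([
        sim["base_pose"],       # exact robot pose
        sim["base_velocity"],   # exact body velocity
        sim["obstacle_map"],    # ground-truth obstacle positions
    ])

def partial_observation(sim, rng):
    """Deployment view: noisy proprioception, no ground-truth map."""
    noisy_vel = sim["base_velocity"] + rng.normal(0.0, 0.05, size=3)
    return np.concatenate([
        sim["joint_angles"],    # onboard encoders
        noisy_vel,              # estimated, not exact
        sim["depth_scan"],      # occluded, limited field of view
    ])
```

During training the planner consumes something like full_state; the deployed policy only ever receives something like partial_observation.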
Catalyst
Two things converged. Simulation platforms like NVIDIA Isaac Lab now make it cheap to run thousands of physics-accurate training episodes. And anytime-feasible MPC has matured enough to serve reliably as a training supervisor: it is a flavor of model-predictive control (a real-time optimization method for trajectory planning) that guarantees a valid, safe plan is always available even if the solver is interrupted mid-computation. Neither piece was stable and accessible enough two years ago to build this kind of privileged-teacher pipeline around.
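A minimal sketch of the fallback pattern behind "anytime feasibility", assuming a solver callable that returns a list of planned actions or None on failure or timeout. This illustrates the general idea, not the paper's specific algorithm:

```python
class AnytimeFeasibleMPC:
    def __init__(self, solver):
        self.solver = solver      # hypothetical: (state, time_budget) -> plan or None
        self.last_plan = None     # most recent feasible trajectory

    def plan(self, state, budget_s):
        candidate = self.solver(state, time_budget=budget_s)
        if candidate is not None:
            # Solver finished in time and found a feasible plan.
            self.last_plan = list(candidate)
        elif self.last_plan is not None:
            # Fallback: drop the step just executed and repeat the terminal
            # action, so a full-horizon, previously-feasible plan remains.
            self.last_plan = self.last_plan[1:] + [self.last_plan[-1]]
        else:
            # No feasible plan has ever been found; nothing safe to return.
            raise RuntimeError("no feasible plan available")
        return self.last_plan
```

The design point is that the controller never returns nothing: a stale-but-feasible plan beats an unfinished optimization.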
What's New
Prior approaches to partial observability in reinforcement learning either trained the student policy entirely through trial and error (slow, fragile) or used imitation learning from a privileged expert but struggled to transfer that knowledge when the expert's observations differed sharply from the student's. This paper's Planner-to-Policy Soft Actor-Critic (P2P-SAC), a variant of the popular Soft Actor-Critic RL algorithm, adds a structured distillation step that explicitly bridges the information gap between the planner's privileged view and the robot's limited sensors. The authors also introduce an anytime-feasible MPC formulation, meaning the planner always provides a safe reference trajectory rather than occasionally failing to find one, which makes training more stable.
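To give a feel for what a distillation bridge can look like in code, here is a minimal sketch that layers a behavior-cloning term on a standard SAC actor loss. The interfaces (`policy.sample`, `q_net`) and the MSE distillation term with blend weight `lam` are our assumptions, not the paper's exact P2P-SAC formulation:

```python
import torch
import torch.nn.functional as F

def actor_loss(policy, q_net, obs, planner_action, alpha=0.2, lam=0.5):
    # Standard SAC actor objective: maximize Q minus entropy penalty.
    action, log_prob = policy.sample(obs)        # reparameterized sample (assumed API)
    sac_term = (alpha * log_prob - q_net(obs, action)).mean()
    # Distillation term: pull the student's action toward the privileged
    # planner's reference action for the same underlying state.
    distill_term = F.mse_loss(action, planner_action)
    return sac_term + lam * distill_term
```

During training the planner supplies planner_action; at deployment the distillation term disappears and the policy runs on its own sensors alone.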
The Counter
The real-world deployment consists of a single robot in a single lab environment — not a systematic benchmark against competing methods on standardized obstacle courses. The paper's theoretical guarantees apply to the MPC planner's feasibility, not to the end-to-end policy's safety or performance in unseen environments. Privileged-teacher approaches have been explored for years in robot learning (most notably in legged locomotion work from ETH Zurich and CMU), so the novelty claim is incremental, not foundational. The gap between Isaac Lab simulation and real warehouse or outdoor terrain is notoriously large, and the paper does not quantify how much performance degrades in that transfer. Finally, the approach still requires a reasonably accurate dynamics model for the planner — exactly the thing that is hard to get in practice for novel robots or environments.
Longs
- NVDA — Isaac Lab simulation infrastructure underpins the training pipeline directly
- BOTZ (Global X Robotics & AI ETF) — broad exposure to legged robot and autonomous navigation commercialization
- IRBT — iRobot and adjacent floor/mobile robot players benefit from improved navigation in unstructured spaces
- Unitree Robotics (private) — Go2 quadruped is the deployment hardware; a direct beneficiary if the technique scales
Shorts
- Classical MPC-only navigation vendors — if learned policies trained this way outperform pure MPC at deployment with lighter compute budgets, dedicated MPC stack companies lose their edge
- LiDAR-heavy autonomy stacks — systems that solve partial observability by adding more sensors become less attractive if this approach achieves comparable performance with cheaper, sparser sensing
Enablers (Picks & Shovels)
- NVIDIA Isaac Lab — the physics simulation environment used for training and validation
- Unitree Go2 — the real-world robot hardware platform used for deployment experiments
- Soft Actor-Critic (open-source RL algorithm) — the base learning algorithm extended by P2P-SAC
- CasADi (open-source optimization framework) — commonly used to implement MPC solvers of this type
Private Watchlist
- Unitree Robotics — the Go2 quadruped used in deployment experiments
- Boston Dynamics — competing quadruped platform that could adopt similar training frameworks
- Skild AI — building general robot policies that face the same partial observability problem
- Physical Intelligence (Pi) — robot foundation model startup facing identical sensor-limitation challenges
Resources
The Paper
This paper addresses the problem of training a reinforcement learning (RL) policy under partial observability by exploiting a privileged, anytime-feasible planner agent available exclusively during training. We formalize this as a Partially Observable Markov Decision Process (POMDP) in which a planner agent with access to an approximate dynamical model and privileged state information guides a learning agent that observes only a lossy projection of the true state. To realize this framework, we introduce an anytime-feasible Model Predictive Control (MPC) algorithm that serves as the planner agent. For the learning agent, we propose Planner-to-Policy Soft Actor-Critic (P2P-SAC), a method that distills the planner agent's privileged knowledge to mitigate partial observability and thereby improve both sample efficiency and final policy performance. We support this framework with rigorous theoretical analysis. Finally, we validate our approach in simulation using NVIDIA Isaac Lab and successfully deploy it on a real-world Unitree Go2 quadruped navigating complex, obstacle-rich environments.
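For readers who want the formalism behind "lossy projection", the setup can be written in generic POMDP notation (our notation, not necessarily the paper's):

```latex
% Generic POMDP notation; symbols may differ from the paper's.
% M = (S, A, T, R, Omega, O, gamma): states, actions, transition
% kernel, reward, observation space, observation map, discount.
\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, R, \Omega, O, \gamma),
\qquad o_t = O(s_t) \in \Omega
% The planner acts on the privileged state s_t; the learned policy
% \pi_\theta(a_t \mid o_{\le t}) conditions only on the lossy
% observations (and, typically, their history).
```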