Abstract
PMER is a lightweight replay mechanism for Dyna-style model-based reinforcement learning. It targets a practical instability in MBRL: the dynamics model is trained on replay data collected by many historical policies, but it is queried by the continually changing current policy. As a result, global validation loss can decrease while local, policy-relevant model error remains high and destabilizes model rollouts.
Instead of explicitly estimating policy distances, PMER uses dynamics prediction error as a proxy for policy mismatch. During model training, each batch mixes uniformly sampled transitions from the environment buffer with high-error transitions from a prioritized model experience buffer, repeatedly revisiting the regions most likely to affect current policy improvement.
Figure 1: Dynamics model learning mismatch between global and local model error.
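The batch mixing described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_mixed_batch`, the `prioritized_fraction` and `alpha` parameters, and the error-proportional sampling rule are all assumptions made for the example. The source states only that uniformly sampled environment transitions are mixed with high-error transitions from a prioritized model experience buffer.

```python
import numpy as np


def sample_mixed_batch(env_size, model_errors, batch_size,
                       prioritized_fraction=0.5, alpha=1.0, rng=None):
    """Pick indices for one model-training batch (hypothetical helper).

    env_size: number of transitions in the environment buffer.
    model_errors: most recent dynamics prediction error of each transition
        stored in the prioritized model experience buffer.
    prioritized_fraction, alpha: assumed hyperparameters, not from the source.
    Returns (uniform_idx, prioritized_idx), one index array per buffer.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_prio = int(round(batch_size * prioritized_fraction))
    n_uniform = batch_size - n_prio

    # Uniform part: every environment transition is equally likely.
    uniform_idx = rng.integers(0, env_size, size=n_uniform)

    # Prioritized part: sampling probability grows with prediction error.
    # Proportional prioritization is an assumption; other rules would fit too.
    errors = np.asarray(model_errors, dtype=np.float64)
    weights = (errors + 1e-8) ** alpha
    probs = weights / weights.sum()
    prioritized_idx = rng.choice(len(errors), size=n_prio, p=probs)

    return uniform_idx, prioritized_idx
```

In a training loop, the two index arrays would be used to gather transitions from their respective buffers, and the stored errors would be refreshed after each model update so that priorities track the current model rather than a stale one.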
Key Insight
Prior work provides strong empirical evidence that curiosity- or error-based replay can improve sample efficiency. In contrast, PMER focuses on the underlying mechanism and provides new insights into why error-based replay is effective: prediction error acts as a practical proxy for policy-induced mismatch, and replaying high-error transitions helps the dynamics model focus on underfitted, policy-relevant regions.
Figure 2: We use the 50k-step dynamics model to evaluate prediction error, and measure policy shift as the KL divergence between the 50k-step policy and historical policies. Prediction error is not identical to policy shift. However, high-error samples are concentrated in regions closer to the current policy, whereas samples from larger-shift historical policies often have lower error because they have been better covered during model training. This supports our view that prediction error highlights underfitted, policy-relevant regions without explicit policy-distance estimation.
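The two quantities compared in Figure 2 can be computed along the following lines. This is a hedged sketch under stated assumptions: policies are taken to be diagonal Gaussians (the source does not specify the policy class or the KL direction), prediction error is taken to be the squared L2 distance between predicted and observed next states, and the function names `prediction_error` and `diag_gaussian_kl` are hypothetical.

```python
import numpy as np


def prediction_error(pred_next_states, true_next_states):
    """Per-transition squared L2 error of the dynamics model (assumed metric)."""
    diff = np.asarray(pred_next_states, dtype=np.float64) \
        - np.asarray(true_next_states, dtype=np.float64)
    return np.sum(diff ** 2, axis=-1)


def diag_gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(p || q) between diagonal Gaussian action distributions, per state.

    Here p would be the current policy and q a historical policy; this
    direction is an assumption, since the source does not state it.
    """
    mu_p, std_p, mu_q, std_q = (
        np.asarray(x, dtype=np.float64) for x in (mu_p, std_p, mu_q, std_q)
    )
    per_dim = (
        np.log(std_q / std_p)
        + (std_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * std_q ** 2)
        - 0.5
    )
    # Independent action dimensions: the total KL is the sum over dimensions.
    return per_dim.sum(axis=-1)
```

Evaluating both functions on the same set of replay transitions gives one (policy shift, prediction error) pair per sample, which is the kind of data a comparison like Figure 2 would plot.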