Prioritized Model Experience Replay

ICML 2026
*corresponding author

Abstract

PMER is a lightweight replay mechanism for Dyna-style model-based reinforcement learning. It targets a practical instability in MBRL: the dynamics model is trained on replay data collected by many historical policies, but it is queried by the continually changing current policy. As a result, global validation loss can decrease while local, policy-relevant model error remains high and destabilizes model rollouts.

Instead of explicitly estimating policy distances, PMER uses dynamics prediction error as a proxy for policy mismatch. During model training, each batch mixes uniformly sampled transitions from the environment buffer with high-error transitions from a prioritized model experience buffer, repeatedly revisiting the regions most likely to affect current policy improvement.
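The batch construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_mixed_batch`, the `priority_fraction` mixing ratio, and sampling in proportion to prediction error are all assumptions made for the example.

```python
import random


def sample_mixed_batch(env_buffer, priority_buffer, errors, batch_size,
                       priority_fraction=0.5, rng=None):
    """Build one model-training batch from two sources (hypothetical helper).

    env_buffer:        list of transitions, sampled uniformly.
    priority_buffer:   list of transitions, sampled by prediction error.
    errors:            non-negative model prediction error per priority entry.
    priority_fraction: assumed share of the batch drawn from priority_buffer.
    """
    rng = rng or random.Random()

    # Fall back to purely uniform sampling while the priority buffer is empty.
    n_priority = int(batch_size * priority_fraction) if priority_buffer else 0
    n_uniform = batch_size - n_priority

    # Uniform part: every environment transition is equally likely.
    uniform_part = [rng.choice(env_buffer) for _ in range(n_uniform)]

    # Prioritized part: probability proportional to prediction error
    # (assumption). The small epsilon keeps zero-error entries sampleable.
    priority_part = []
    if n_priority > 0:
        weights = [e + 1e-6 for e in errors]
        priority_part = rng.choices(priority_buffer, weights=weights, k=n_priority)

    return uniform_part + priority_part
```

In a Dyna-style loop, a batch like this would feed each dynamics-model update, with the stored errors refreshed after the model changes; how often to refresh is a design choice this sketch leaves open.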

Visualization of model learning mismatch

Figure 1: Dynamics model learning mismatch between global and local model error.

Key Insight

Prior work provides strong empirical evidence that curiosity- or error-based replay can improve sample efficiency. PMER instead focuses on the underlying mechanism and offers new insight into why error-based replay is effective: prediction error acts as a practical proxy for policy-induced mismatch, and replaying high-error transitions helps the dynamics model focus on underfitted, policy-relevant regions.

Figure 2: We use the 50k-step dynamics model to evaluate prediction error, and measure policy shift as the KL divergence between the 50k-step policy and historical policies. Prediction error is not identical to policy shift. However, high-error samples are concentrated in regions closer to the current policy, whereas samples from larger-shift historical policies often have lower error because they have been better covered during model training. This supports our view that prediction error highlights underfitted, policy-relevant regions without explicit policy-distance estimation.
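The two quantities compared in Figure 2 can be illustrated with a short sketch. It assumes diagonal Gaussian policies, the KL direction KL(current || historical), and a squared-error measure of model prediction error; the source specifies none of these, and both function names are hypothetical.

```python
import math


def diag_gaussian_kl(mean_p, std_p, mean_q, std_q):
    """KL(p || q) between two diagonal Gaussian action distributions.

    Each argument is a sequence with one entry per action dimension.
    Here p stands for the current policy and q for a historical policy
    (an assumed direction).
    """
    kl = 0.0
    for mp, sp, mq, sq in zip(mean_p, std_p, mean_q, std_q):
        # Closed-form KL for one dimension, summed over dimensions.
        kl += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2) - 0.5
    return kl


def prediction_error(predicted_next_state, true_next_state):
    """Squared L2 distance between predicted and observed next state."""
    return sum((p - t) ** 2 for p, t in zip(predicted_next_state, true_next_state))
```

Evaluating both functions on the same stored transitions, with the policy outputs taken at each transition's state, would give one (policy shift, prediction error) pair per sample, which is the kind of comparison the figure describes.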

How it works