Prioritized Model Experience Replay

Muxi Tao

Tsinghua University

Jiangtao Wen ^*

New York University

Yuxing Han

Tsinghua University

ICML 2026

^*corresponding author

Abstract

PMER is a lightweight replay mechanism for Dyna-style model-based reinforcement learning. It targets a practical instability in MBRL: the dynamics model is trained on replay data collected by many historical policies, but it is queried by the continually changing current policy. As a result, global validation loss can decrease while local, policy-relevant model error remains high and destabilizes model rollouts.
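One way to make the global/local distinction concrete is to write the two errors side by side. The notation below is an illustrative assumption and is not taken from the paper: \hat{f}_\theta is the learned dynamics model, \mathcal{D} is the replay buffer filled by historical policies, and d^{\pi_k} is the state-action distribution visited by the current policy \pi_k under the true dynamics P.

```latex
% Global error: averaged over replay data from many historical policies.
\mathcal{L}_{\text{global}}(\theta)
  = \mathbb{E}_{(s,a,s') \sim \mathcal{D}}
    \left[ \lVert \hat{f}_\theta(s,a) - s' \rVert^2 \right]

% Local error: averaged over the region the current policy actually queries.
\mathcal{L}_{\text{local}}(\theta)
  = \mathbb{E}_{(s,a) \sim d^{\pi_k},\; s' \sim P(\cdot \mid s,a)}
    \left[ \lVert \hat{f}_\theta(s,a) - s' \rVert^2 \right]
```

When d^{\pi_k} places most of its mass on regions that are sparse in \mathcal{D}, the first quantity can keep falling while the second stays large, which is the mismatch described above.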

Instead of explicitly estimating policy distances, PMER uses dynamics prediction error as a proxy for policy mismatch. During model training, each batch mixes uniformly sampled transitions from the environment buffer with high-error transitions from a prioritized model experience buffer, repeatedly revisiting the regions most likely to affect current policy improvement.
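The batch mixing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_mixed_batch`, the parameters `prioritized_fraction` and `alpha`, and the choice to sample in proportion to error raised to a power are all assumptions made for the example.

```python
import numpy as np


def sample_mixed_batch(env_size, pmer_errors, batch_size,
                       prioritized_fraction=0.5, alpha=1.0, rng=None):
    """Pick indices for one model-training batch (hypothetical helper).

    env_size: number of transitions in the environment buffer.
    pmer_errors: per-transition dynamics prediction errors stored in the
        prioritized model experience buffer.
    Returns (env_idx, pmer_idx), indices into the two buffers.
    """
    rng = np.random.default_rng() if rng is None else rng
    pmer_errors = np.asarray(pmer_errors, dtype=np.float64)

    n_prio = int(round(batch_size * prioritized_fraction))
    n_unif = batch_size - n_prio

    # Uniform part: every environment transition is equally likely.
    env_idx = rng.integers(0, env_size, size=n_unif)

    # Prioritized part (assumed rule): probability proportional to
    # error**alpha, with a small constant so zero-error items stay valid.
    priorities = (pmer_errors + 1e-8) ** alpha
    probs = priorities / priorities.sum()
    pmer_idx = rng.choice(pmer_errors.shape[0], size=n_prio, p=probs)

    return env_idx, pmer_idx


# Usage: half of a 64-transition batch comes from each buffer.
rng = np.random.default_rng(0)
env_idx, pmer_idx = sample_mixed_batch(
    env_size=1000, pmer_errors=[0.0, 0.0, 0.0, 10.0],
    batch_size=64, rng=rng)
```

In this sketch `alpha` controls how sharply sampling concentrates on high-error transitions; `alpha=0` reduces the prioritized part to uniform sampling.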

Figure 1: Dynamics model learning mismatch between global and local model error.

Key Insight

Prior work provides strong empirical evidence that curiosity- or error-based replay can improve sample efficiency. In contrast, PMER focuses on the underlying mechanism and provides new insight into why error-based replay is effective: prediction error acts as a practical proxy for policy-induced mismatch, and replaying high-error transitions helps the dynamics model focus on underfitted, policy-relevant regions.

How it works