Provably Efficient Exploration for RL with Unsupervised Learning

Fei Feng, Ruosong Wang, Wotao Yin, Simon S. Du, and Lin F. Yang



Reinforcement learning (RL) learns to control a system through trial and error to maximize cumulative rewards. An RL algorithm takes observations and returns a policy, a mapping from observations to actions. To find a policy that earns high rewards, RL must explore enough observations to uncover all opportunities for high rewards. In modern RL applications, however, this exploration is prohibitively expensive, since the set of possible observations (e.g., those consisting of images or text) is enormous.

In this paper, we develop a method that uses unsupervised learning to improve the efficiency of RL exploration. We assume that a small number of latent states underlies the large set of observations: each observation maps to a single latent state, and the rewards and state transitions defined over the latent states match those defined over the observations. Under such a mapping, RL can identify high-reward opportunities by visiting just the latent states, even if many observations are never reached.
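This latent-state assumption can be illustrated with a toy example: many raw observations are emitted by a few latent states, and rewards depend only on the latent state. All names and sizes below are illustrative assumptions for the sketch, not notation from the paper.

```python
import random

random.seed(0)

NUM_LATENT = 3          # small latent state space
OBS_PER_LATENT = 1000   # many raw observations per latent state

def emit_observation(latent):
    """Sample one of the many raw observations tied to a latent state."""
    return latent * OBS_PER_LATENT + random.randrange(OBS_PER_LATENT)

def decode(observation):
    """Map an observation back to the latent state that emitted it."""
    return observation // OBS_PER_LATENT

# Rewards are defined on latent states, so two distinct observations
# emitted by the same latent state carry the same reward.
reward = [0.0, 0.5, 1.0]
obs_a, obs_b = emit_observation(2), emit_observation(2)
assert decode(obs_a) == decode(obs_b) == 2
assert reward[decode(obs_a)] == reward[decode(obs_b)]
```

With such a decoder, exploration only needs to cover the three latent states, not the three thousand observations.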

We integrate unsupervised learning into a no-regret tabular method for RL: the unsupervised component learns a mapping between latent states and observations, while the tabular component improves a control policy over the latent states. Our method builds on existing work that uses unsupervised learning for efficient RL exploration; unlike that work, however, we show that our method provably finds a near-optimal policy with sample complexity polynomial in the (small) number of latent states.
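The interplay between the two components can be sketched as follows. This is a minimal stand-in, not the paper's algorithm: the toy environment, the fixed decoder (standing in for a learned one), and the epsilon-greedy tabular Q-learning over decoded states are all illustrative assumptions.

```python
import random
from collections import defaultdict

random.seed(0)

NUM_LATENT = 3        # small latent state space
OBS_PER_LATENT = 100  # many raw observations per latent state
ACTIONS, HORIZON, GAMMA = 2, 10, 0.9

def env_step(latent, action):
    """Toy latent dynamics: action 1 advances toward the rewarding last state."""
    nxt = min(latent + action, NUM_LATENT - 1)
    rew = 1.0 if nxt == NUM_LATENT - 1 else 0.0
    obs = nxt * OBS_PER_LATENT + random.randrange(OBS_PER_LATENT)
    return obs, nxt, rew

def decode(obs):
    """Stand-in for an unsupervised-learned decoder (here a fixed grouping)."""
    return obs // OBS_PER_LATENT

Q = defaultdict(float)  # tabular values over (latent state, action) pairs
for _ in range(1000):
    latent = 0
    obs = random.randrange(OBS_PER_LATENT)  # a raw observation of latent state 0
    for _ in range(HORIZON):
        z = decode(obs)
        # epsilon-greedy exploration over the small decoded state space
        if random.random() < 0.2:
            a = random.randrange(ACTIONS)
        else:
            a = max(range(ACTIONS), key=lambda i: Q[(z, i)])
        obs, latent, r = env_step(latent, a)
        z2 = decode(obs)
        target = r + GAMMA * max(Q[(z2, i)] for i in range(ACTIONS))
        Q[(z, a)] += 0.1 * (target - Q[(z, a)])

# After training, advancing (action 1) from the start state should look
# better than staying, even though the raw observation space is much larger.
```

The tabular learner only ever indexes its table by the decoded latent state, which is why the sample complexity in the paper's analysis scales with the number of latent states rather than the number of observations.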


F. Feng, R. Wang, W. Yin, S. Du, and L. Yang, Provably efficient exploration for RL with unsupervised learning, arXiv:2003.06898, 2020.
