C-Learning: No reward function needed for Goal-Conditioned RL

Introduction

Fig. from the original author's presentation
  • Reframes goal-conditioned RL as estimating the probability density over future states.
  • C-learning is the novel algorithm proposed for this reframed problem.
  • Viewing the problem this way lets the authors derive a hypothesis for the optimal goal-relabeling ratio.
  • C-learning accurately estimates the density over future states and achieves success rates comparable to recent goal-conditioned RL methods on various robotic tasks.

Problem Set-Up

Given the standard definitions:

  • sₜ: state at time-step t
  • aₜ: action at time-step t
  • sₜ+: a future state (a state at some time-step after t)

Framing GCRL as Density Estimation

  • Future state density pπ(sₜ+ | sₜ, aₜ): the γ-discounted distribution over the states visited after taking action aₜ in state sₜ and then following the policy π (a state Δ steps ahead is weighted proportionally to γ^Δ).
  • Marginal state density p(sₜ+): the same future state density averaged over the states and actions on which it is conditioned.

The Classifier in C-Learning

As mentioned before, rather than estimating the future state density directly, we estimate it indirectly by learning a classifier. Classification is not only easier than density estimation; it also allows us to develop an off-policy algorithm.

  • The classifier takes (sₜ, aₜ) together with a candidate future state sₜ+ as input, and predicts whether sₜ+ was sampled from the future state density (labeled F = 1) or from the marginal state density (labeled F = 0).
  • The Bayes-optimal classifier satisfies Cπ(F = 1 | sₜ, aₜ, sₜ+) = pπ(sₜ+ | sₜ, aₜ) / (pπ(sₜ+ | sₜ, aₜ) + p(sₜ+)), so the learned classifier recovers the density ratio pπ(sₜ+ | sₜ, aₜ) / p(sₜ+) as C / (1 − C); a minimal sketch of such a classifier follows below.
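To make this concrete, here is a minimal PyTorch sketch of what such a classifier could look like. This is not the authors' implementation; the network architecture and the density_ratio helper are illustrative assumptions.

import torch
import torch.nn as nn

class FutureStateClassifier(nn.Module):
    """Predicts p(F = 1 | s_t, a_t, s_t+): was s_t+ drawn from the future-state density?"""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                     # logit of F = 1
        )

    def forward(self, s_t, a_t, s_future):
        return self.net(torch.cat([s_t, a_t, s_future], dim=-1)).squeeze(-1)

def density_ratio(classifier, s_t, a_t, s_future):
    # For the Bayes-optimal classifier, C / (1 - C) = p(s_t+ | s_t, a_t) / p(s_t+).
    c = torch.sigmoid(classifier(s_t, a_t, s_future))
    return c / (1.0 - c)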

On-Policy: Learning the Classifier

  1. Sample a state–action pair (sₜ, aₜ) ~ p(sₜ, aₜ)
  2. Sample a future state sₜ+ either:
  • from p(sₜ+ | sₜ, aₜ), the future state distribution conditioned on (sₜ, aₜ), and label it F = 1, or
  • from p(sₜ+), the marginal future state distribution, and label it F = 0
  3. Train the classifier on these labeled examples with the standard cross-entropy loss (a sketch follows below)
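Below is a rough sketch of this on-policy procedure, assuming we have access to complete trajectories collected with the current policy. Drawing the time offset Δ from a geometric distribution with parameter 1 − γ is the standard way to sample from a γ-discounted future-state distribution; the names trajectories and classifier, and the batch size, are placeholder assumptions rather than the paper's exact recipe.

import random
import numpy as np
import torch
import torch.nn.functional as F

GAMMA = 0.99

def sample_on_policy_batch(trajectories, batch_size):
    """Build a balanced batch: positives from the discounted future, negatives from the marginal."""
    s_t, a_t, s_future, labels = [], [], [], []
    for _ in range(batch_size):
        traj = random.choice(trajectories)          # a trajectory is a list of (state, action) pairs
        t = random.randrange(len(traj) - 1)
        state, action = traj[t]
        if random.random() < 0.5:
            # F = 1: a state Delta >= 1 steps ahead, Delta ~ Geometric(1 - GAMMA),
            # i.e. an (approximate) sample from the discounted future-state density.
            delta = min(np.random.geometric(1.0 - GAMMA), len(traj) - 1 - t)
            future, label = traj[t + delta][0], 1.0
        else:
            # F = 0: a state from a random trajectory, i.e. the marginal state density.
            future, label = random.choice(random.choice(trajectories))[0], 0.0
        s_t.append(state); a_t.append(action); s_future.append(future); labels.append(label)
    as_tensor = lambda xs: torch.as_tensor(np.array(xs), dtype=torch.float32)
    return as_tensor(s_t), as_tensor(a_t), as_tensor(s_future), torch.tensor(labels)

def on_policy_classifier_update(classifier, optimizer, trajectories, batch_size=256):
    s_t, a_t, s_future, labels = sample_on_policy_batch(trajectories, batch_size)
    loss = F.binary_cross_entropy_with_logits(classifier(s_t, a_t, s_future), labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()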

Off-Policy: Learning the Classifier

The on-policy procedure above depends on the current policy π and the commanded goal. Thus, even if we fix the policy parameters, experience collected for one goal cannot be used to learn a classifier for another, which precludes sharing experience amongst tasks. A bootstrapped version of the algorithm above is therefore proposed so that it works with off-policy data:

  1. Sample a transition (sₜ, aₜ, sₜ+1) from the dataset
  2. Sample a future state sₜ+ from the marginal p(sₜ+)
  3. Sample the next action aₜ+1 ~ π(aₜ+1 | sₜ+1, sₜ+)
  4. Compute the importance weight w = C / (1 − C) via the learned classifier
  5. Plug the importance weight into the objective in Eq. 7 of the paper
  6. Update the classifier using the gradient of that objective (one possible wiring is sketched below)
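Sketched below is one way these six steps could be wired up, reusing the classifier interface from the earlier sketch. The loss is my paraphrase of Eq. 7: a (1 − γ)-weighted positive term with sₜ+1 as the future state, a γ·w-weighted positive term at the randomly sampled sₜ+, and an unweighted negative term at that same sₜ+, with w = C / (1 − C) evaluated at (sₜ+1, aₜ+1, sₜ+) and treated as a constant. The weight clipping value and the policy(s, g) interface are assumptions; consult the paper for the exact objective.

import torch
import torch.nn.functional as F

GAMMA = 0.99

def off_policy_classifier_update(classifier, policy, optimizer, batch):
    # Steps 1-2: transitions (s_t, a_t, s_t+1) plus future states s_t+ drawn from the marginal p(s_t+)
    s_t, a_t, s_next, s_future = batch

    with torch.no_grad():
        # Step 3: sample the next action from the goal-conditioned policy
        a_next = policy(s_next, s_future)
        # Step 4: importance weight w = C / (1 - C) at (s_t+1, a_t+1, s_t+), treated as a constant
        c_next = torch.sigmoid(classifier(s_next, a_next, s_future))
        w = (c_next / (1.0 - c_next)).clamp(max=20.0)   # clipping is a stability heuristic, not from Eq. 7

    # Steps 5-6: plug w into the bootstrapped cross-entropy objective and take a gradient step
    logit_next = classifier(s_t, a_t, s_next)       # s_t+1 treated as the future state (F = 1)
    logit_rand = classifier(s_t, a_t, s_future)     # randomly sampled future state
    ones, zeros = torch.ones_like(w), torch.zeros_like(w)
    loss = ((1.0 - GAMMA) * F.binary_cross_entropy_with_logits(logit_next, ones, reduction="none")
            + GAMMA * w * F.binary_cross_entropy_with_logits(logit_rand, ones, reduction="none")
            + F.binary_cross_entropy_with_logits(logit_rand, zeros, reduction="none")).mean()

    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()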

Goal-Conditioned RL with C-Learning

Recall that the central idea of this paper is to view goal-conditioned RL as a problem of predicting and controlling the future. The resulting algorithm works as follows:

  1. Start from a dataset of transitions
  2. Alternate between:
  • estimating the future state density of the goal-conditioned policy, and
  • updating the policy to maximize the probability density of reaching the commanded goal (a high-level training loop is sketched below)
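A high-level training loop might look like the following. It reuses off_policy_classifier_update from the previous sketch; the replay-buffer API is hypothetical, and the actor step here maximizes the classifier logit (the estimated log density ratio for the commanded goal), which is a reasonable proxy but not necessarily the paper's exact actor objective.

def policy_update(policy, classifier, policy_optimizer, s_t, goals):
    # Actor step: choose actions that make the commanded goal a likely future state.
    # Maximizing the classifier logit maximizes log C/(1 - C), the estimated log-ratio
    # p(s_t+ = g | s_t, a_t) / p(g), a monotone proxy for the density of reaching g.
    actions = policy(s_t, goals)                     # assumes a reparameterized, differentiable actor
    loss = -classifier(s_t, actions, goals).mean()
    policy_optimizer.zero_grad(); loss.backward(); policy_optimizer.step()
    return loss.item()

def train(classifier, policy, clf_opt, pol_opt, replay_buffer, iterations=100_000):
    for _ in range(iterations):
        # Hypothetical buffer API: transitions plus relabeled goals drawn from the marginal.
        s_t, a_t, s_next, goals = replay_buffer.sample()
        # (a) Estimate the future-state density of the goal-conditioned policy.
        off_policy_classifier_update(classifier, policy, clf_opt, (s_t, a_t, s_next, goals))
        # (b) Update the policy to maximize the (estimated) density of reaching the goal.
        policy_update(policy, classifier, pol_opt, s_t, goals)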

Experiment and Results

The authors share their experiments and results by asking and answering the following questions:

  1. Do Q-learning and C-learning accurately estimate the future state density?
  2. Does Q-learning underestimate the future state density function?
  3. Is the predicted relabeling ratio λ = (1 + γ)/2 optimal for Q-learning?
  4. How does C-learning compare with prior goal-conditioned RL methods on benchmark tasks?

1. Do Q-learning and C-learning accurately predict the future, and how do they compare?

  • On-policy: MC C-learning and TD C-learning perform similarly, while Q-learning's prediction error is about 3 times worse than C-learning's.
  • Off-policy: TD C-learning is more accurate than Q-learning, whose KL divergence is about 14% worse.

2. Does Q-learning underestimate the future state density function?

3. Is the predicted relabeling ratio λ = (1 + γ)/2 optimal for Q-learning?

4. How does C-learning compare with prior goal-conditioned RL methods on benchmark tasks?

Conclusion

The paper C-Learning: Learning to Achieve Goals via Recursive Classification studies the problem of predicting and controlling the future state distribution of an autonomous agent. The method does not rely on a reward function, which allows a C-learning agent to solve this future prediction problem without human supervision and then apply it to many downstream tasks. The experiments support this: C-learning yields competitive results on difficult, high-dimensional continuous control tasks.

References and Citations

Eysenbach, Benjamin, Ruslan Salakhutdinov, and Sergey Levine. “C-Learning: Learning to Achieve Goals via Recursive Classification.” arXiv preprint arXiv:2011.08909 (2020).

@inproceedings{eysenbach2021clearning,
  title     = {C-Learning: Learning to Achieve Goals via Recursive Classification},
  author    = {Benjamin Eysenbach and Ruslan Salakhutdinov and Sergey Levine},
  booktitle = {International Conference on Learning Representations},
  year      = {2021},
  url       = {https://openreview.net/forum?id=tc5qisoB-C}
}
