C-Learning: No reward function needed for Goal-Conditioned RL

Introduction

Fig. from Original Author’s Presentation
  • Goal-conditioned RL is reframed as estimating the probability density over future states.
  • C-learning is the algorithm proposed for this estimation problem.
  • Framing the problem this way yields a hypothesis about the optimal ratio for sampling (relabeling) goals.
  • C-learning estimates the density over future states well and achieves success comparable to recent goal-conditioned RL methods on various robotic tasks.

Problem Set-Up

  • sₜ: state at time-step t
  • aₜ: action at time-step t
  • sₜ+: future state

Framing GCRL as Density Estimation

  • Future state density: p(sₜ+ | sₜ, aₜ)
  • Marginal state density: p(sₜ+)

The Classifier in C-Learning

  • The classifier takes (sₜ, aₜ) and sₜ+ as input and predicts whether sₜ+ was sampled from the future state density; such samples are labeled F = 1.
  • Samples of sₜ+ drawn from the marginal state density are labeled F = 0.
learned classifier
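
For reference, the Bayes-optimal form of such a binary classifier can be written out. The symbol C for the classifier output is notation assumed here, and the sketch assumes the F = 1 and F = 0 samples are drawn in equal proportion:

```latex
C^{*}(F = 1 \mid s_t, a_t, s_{t+})
  = \frac{p(s_{t+} \mid s_t, a_t)}{p(s_{t+} \mid s_t, a_t) + p(s_{t+})}
\quad\Longrightarrow\quad
\frac{p(s_{t+} \mid s_t, a_t)}{p(s_{t+})}
  = \frac{C^{*}(F = 1 \mid s_t, a_t, s_{t+})}{C^{*}(F = 0 \mid s_t, a_t, s_{t+})}
```

Under these assumptions, a well-trained classifier recovers the future state density up to the marginal: multiply p(sₜ+) by the ratio C / (1 − C).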

On-Policy: Learning the Classifier

  1. Sample state, action pair (sₜ, aₜ) ~ p(sₜ, aₜ)
  2. Sample the future state sₜ+ from one of:
  • p(sₜ+ | sₜ, aₜ), the future state distribution conditioned on (sₜ, aₜ); label F = 1
  • p(sₜ+), the marginal future state distribution; label F = 0
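
The two sampling steps above can be sketched in Python with NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the future state is assumed to be drawn at a geometrically distributed offset (parameter 1 − γ), and a uniformly chosen state from the same trajectory stands in for a draw from the marginal p(sₜ+).

```python
import numpy as np

def sample_pairs(states, actions, gamma, rng):
    """Draw one (s_t, a_t) with a positive and a negative future state."""
    T = len(states)
    t = int(rng.integers(0, T - 1))           # keep at least one later state
    # Assumption: "future state" means a geometrically discounted offset >= 1.
    delta = int(rng.geometric(1.0 - gamma))
    positive = states[min(t + delta, T - 1)]  # truncate at the trajectory end
    # Assumption: a uniformly drawn state stands in for the marginal p(s_t+).
    negative = states[int(rng.integers(0, T))]
    return states[t], actions[t], positive, negative

def classifier_loss(logits_positive, logits_negative):
    """Binary cross-entropy: positives carry F = 1, negatives carry F = 0."""
    loss_pos = np.logaddexp(0.0, -np.asarray(logits_positive))  # -log sigmoid(x)
    loss_neg = np.logaddexp(0.0, np.asarray(logits_negative))   # -log(1 - sigmoid(x))
    return float(np.mean(loss_pos) + np.mean(loss_neg))
```

In a full training loop the logits would come from a classifier network evaluated on (sₜ, aₜ, sₜ+), and the loss would be minimized by gradient descent.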

Off-Policy: Learning the Classifier

  1. Sample a transition (sₜ, aₜ, sₜ+1) from the dataset
  2. Sample sₜ+ from the marginal p(sₜ+)
  3. Sample the next action aₜ+1 ~ π(aₜ+1 | sₜ+1, sₜ+)
  4. Compute the importance weight from the classifier's prediction at (sₜ+1, aₜ+1, sₜ+)
  5. Plug the importance weight into the objective given as Eq. 7 in the paper
  6. Update the classifier using the gradient of that objective
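
Steps 4 to 6 can be sketched as follows. Eq. 7 is not reproduced in this post, so the three-term loss below is an assumed reading of the TD C-learning objective and should be checked against the paper; the function names are hypothetical. The importance weight is taken to be the classifier's odds C / (1 − C) at the next state and action, treated as a constant during the update.

```python
import numpy as np

def importance_weight(c_next, eps=1e-6):
    """Odds C / (1 - C) of the classifier output at (s_{t+1}, a_{t+1}, s_t+)."""
    c = np.clip(c_next, eps, 1.0 - eps)  # avoid division by zero
    return c / (1.0 - c)

def td_classifier_loss(c_next_state, c_random, w, gamma):
    """Assumed form of the TD objective, negated so it can be minimized.

    c_next_state: C(F=1 | s_t, a_t, s_{t+1}), the next state used as a positive
    c_random:     C(F=1 | s_t, a_t, s_t+) with s_t+ drawn from the marginal
    w:            importance weight, treated as a constant (no gradient)
    """
    c_next_state = np.asarray(c_next_state, dtype=float)
    c_random = np.asarray(c_random, dtype=float)
    objective = ((1.0 - gamma) * np.log(c_next_state)
                 + np.log(1.0 - c_random)
                 + gamma * np.asarray(w) * np.log(c_random))
    return float(-np.mean(objective))
```

An uninformative classifier (output 0.5 everywhere) gives a weight of 1, so the bootstrapped term neither raises nor lowers the effective label.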

Goal-Conditioned RL with C-Learning

  1. Start from a dataset of transitions
  2. Alternate between:
  • Estimating the future state density of the goal-conditioned policy
  • Updating the policy to maximize the probability density of reaching the goal
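
The policy-update half of this loop can be illustrated with a simplified stand-in: instead of training an actor by gradient ascent, the sketch below picks, from a small set of candidate actions, the one the classifier scores highest for the commanded goal. Both functions are hypothetical, and `toy_classifier` is a made-up scoring rule for a 1-D state, not a learned model.

```python
import numpy as np

def greedy_goal_action(classifier, state, goal, candidate_actions):
    """Pick the candidate action with the highest predicted C(F = 1 | s, a, goal)."""
    scores = [classifier(state, action, goal) for action in candidate_actions]
    return candidate_actions[int(np.argmax(scores))]

def toy_classifier(state, action, goal):
    """Hypothetical stand-in: higher score when state + action lands nearer the goal."""
    return 1.0 / (1.0 + abs((state + action) - goal))
```

For example, `greedy_goal_action(toy_classifier, 0.0, 3.0, [-1.0, 0.0, 1.0])` selects the action that moves the state toward the goal.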

Experiment and Results

  1. Do Q-learning and C-learning accurately estimate the future state density?
  2. Does Q-learning underestimate the future state density function?
  3. Is the predicted relabeling ratio λ = (1 + γ)/2 optimal for Q-learning?
  4. How does C-learning compare with prior goal-conditioned RL methods on benchmark tasks?

1. Do Q-learning and C-learning accurately predict the future? How do they compare?

  • On-policy: MC C-learning and TD C-learning perform similarly, while the prediction error for Q-learning is 3 times worse than for C-learning.
  • Off-policy: TD C-learning is more accurate than Q-learning, whose KL divergence is 14% worse.

2. Does Q-learning underestimate the future state density function?

3. Is the predicted relabeling ratio λ = (1 + γ)/2 optimal for Q-learning?
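
To get a feel for the formula, the predicted ratio can be evaluated for a few discount factors. The helper name is hypothetical; only the expression (1 + γ)/2 comes from the text.

```python
def predicted_relabeling_ratio(gamma):
    """Evaluate lambda = (1 + gamma) / 2 for a discount factor gamma in [0, 1]."""
    return (1.0 + gamma) / 2.0

for gamma in (0.0, 0.5, 0.9, 0.99):
    print(gamma, predicted_relabeling_ratio(gamma))
```

The ratio starts at 0.5 for γ = 0 and approaches 1 as γ approaches 1, so for the discount factors close to 1 that are typical in practice the predicted value sits just below 1.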

4. How does C-learning compare with prior goal-conditioned RL methods on benchmark tasks?

Conclusion

References and Citations

@inproceedings{
eysenbach2021clearning,
title={C-Learning: Learning to Achieve Goals via Recursive Classification},
author={Benjamin Eysenbach and Ruslan Salakhutdinov and Sergey Levine},
booktitle={International Conference on Learning Representations},
year={2021},
url={https://openreview.net/forum?id=tc5qisoB-C}
}

Dong Won (Don) Lee
