C-Learning: No reward function needed for Goal-Conditioned RL

I was fortunate to come across a fascinating paper on C-Learning: Learning to Achieve Goals via Recursive Classification (Eysenbach, et. al) from ICLR 2021. I truly believe that the ideas presented in the paper are unique and valuable for many researchers in reinforcement learning. I hope this article clarifies and highlights the important insights in this paper, so that more people could apply this idea for relevant problems!

Introduction

In order to do this, the authors propose reframing the goal-conditioned RL problem for predicting and controlling the future (i.e. learning a conditional probability density function over future states). The problem of “prediction the future” is equivalent to learning a probability density function over future states, irrelevant of when the future state is reached. A key insight is that the future depends on the actions taken by the policy, so our predictions should depend on the agent’s policy.

Fig. from Original Author’s Presentation

How do we do this exactly? Instead of directly estimating this conditional probability density function, they indirectly learn a to predict whether a state comes from the future or not.

Fig. from Original Author’s Presentation

Specifically, utilizing the Bayes’ Rule, predictions from the classifier can be transformed into a density over future states. This a crucial aspect of the paper, which I will elaborate more on soon.

In summary, the main contributions are the following:

  • Reframing of goal-conditioned RL as estimating the probability density over future states.
  • C-learning is the proposed novel algorithm.
  • Reframing problem in this way allows us to hypothesize on the optimal ratio for sampling.
  • C-learning is great for estimating the density over future states, and producing comparable success with recent goal-conditioned RL method for various robotic tasks.

Problem Set-Up

Given the standard definitions:

  • sₜ: state at time-step t
  • aₜ: action at time-step t
  • sₜ+: future state

The above density function reflects the state that an agent would visit, if we collect infinite-length trajectories, with near-term states weighted higher. Here, it highlights the key intuition that: we do not have to define a reward function to define the problems of predicting/controlling the future, but define it entirely in terms of the gamma-discounted state density function.

Then, the problem of estimating the future state distribution into an RL problem can be converted to defining a reward function as indicator when sₜ = sₜ+ and terminating the episode when the agent arrives at the goal.

Framing GCRL as Density Estimation

future state density
marginal state density

The main idea is to view goal-conditioned RL as a problem of estimating the following density, over future states that a policy π will visit. How would we estimate this following function:

If we have the policy, with integration, we can find the probability that a future state belongs to a set.

The Classifier in C-Learning

As mentioned before, rather than estimating the future state density directly, we estimate it indirectly by learning a classifier. Classification is easier than density estimation, but it also allow us to develop an off-policy algorithm.

The classifier does the following:

  • The classifier takes (s_t, a_t) as input, together with s_(t+), and predicts whether sₜ+ was sampled from the future state density. This is labeled as F = 1
  • If it was sampled from the marginal state density. This is labeled as F = 0.

Then, Bayes’ optimal classifier is:

Using the learned classifier:

learned classifier

We can get the estimate of the future state density as:

Here is the math (mostly for me to understand):

On-Policy: Learning the Classifier

  1. Sample state, action pair (sₜ, aₜ) ~ p(sₜ, aₜ)
  2. Sample future state sₜ+ from:
  • ~ p(sₜ+ | sₜ, aₜ) (the future state distribution conditioned on sₜ, aₜ)
  • ~ p(sₜ+) (the marginal future state distribution)

3. Then we train the classifier to maximize the following log likelihood:

Thus we get the final on-policy algorithm:

Off-Policy: Learning the Classifier

The on-policy setting depends on the current policy π and the commanded goal. Thus, even if we fix the policy parameters, we cannot use this experience to learn a classifier for another and precludes the ability to share experience amongst tasks. Thus, a bootstrapped version of the above mentioned algorithm is proposed to work with off-policy data.

Because of the off-policy nature of this setting, a new challenge emerges in sampling from p(sₜ+ | sₜ, aₜ) (the future state distribution conditioned on sₜ, aₜ). This is addressed by simply deconstructing the future state density using the recursive relationship between future state density at the current and next time step.

Then, naturally, the classification objective in Eq. 3 can be re-written as: (simply by substituting (4)).

But as you may have noticed, this is different from the MC objective because it depends on the new policy and the problem of sampling from the future state distribution conditioned on sₜ+1, aₜ+1 is still there!

The authors cleverly observe that we can estimate the expectations that use p(sₜ+ | sₜ, aₜ) by sampling from the marginal p(sₜ+) and then weighting the samples by an importance weight, via our classifier!

Theses weights account for the new policy on the future state density. And we can simply plug this into Eq. 5 for the p(s+| sₜ+1, aₜ+1) term in the expectation. Then, we get:

Phew! That might have been confusing but here is the summary:

  1. Sample (sₜ+1, sₜ, aₜ) transition from the dataset
  2. Sample sₜ+ from p(sₜ+)
  3. Sample next action aₜ+1 ~ π(aₜ+1 | sₜ+1, sₜ+)
  4. Compute the importance weight via the classifier
  5. Plug in importance weight as shown in Eq. 7
  6. Update classifier using gradient of the objective in Eq. 7

Here is the off-policy algorithm:

Great! We’ve now covered how we can learn the classifier in an on-policy and off-policy setting.

Goal-Conditioned RL with C-Learning

As you recall, the main point of this paper hinges around the central idea of viewing goal-conditioned RL as a problem for predicting and controlling the future.

We’ve shown above how to estimate future state density of a single policy, for GCRL we want to estimate the future state density of a conditional policy, which may be conditioned on many goals.

Thus, to learn a classifier for a goal-conditioned policy, our objective function must be applied to all policies π(a| s, g) by conditioning the classifier and the commanded goal g. But, we will only need to query the classifier on inputs where sₜ+ = g, to learn a goal-reaching policy.

Then naturally, the objective becomes:

The difference between this objective and previous one (Eq. 7) is that next action is sampled from a goal-conditioned policy. And the density function from this classifier represents the future state density of sₜ+ given that the policy was “commanded to reach goal g”.

Now, we can estimate the future state density of a goal-conditioned policy. How would we define a reward function that says how good a particular future state density is for reaching a particular goal?

We use the KL divergence between a Dirac density at a commanded goal and the future state density of the goal-conditioned policy.

Proof:

As the authors state, computing this KL only requires the future state density of the commanded goal. We can write the policy objective in terms of the classifier predictions:

Typing everything together, the approach looks like this:

  1. Given dataset of transitions
  2. Alternate between:
  • Estimating future state density of goal-conditioned policy
  • Updating policy to maximize the probability density of reaching the goal

The algorithm is shown below:

Experiment and Results

The authors share their experiments and results by asking and answering the following questions:

  1. Do Q-learning and C-learning accurately estimate the future state density ?
  2. Does Q-learning underestimate the future state density function?
  3. Is the predicted relabeling ratio λ = (1 + γ)/2 optimal for Q-learning?
  4. How does C-learning compare with prior goal-conditioned RL methods on benchmark tasks?

1. Does Q-Learning and C-learning accurately predict the future? How do they compare?

When tested on a continuous version of grid-world:

  • On-policy: MC C-learning and TD C-learning perform similarly, prediction error for Q-learning is 3 times worse than learning.
  • Off-policy: TD C-learning is more accurate than Q-learning, with KL divergence that is 14% worse.

2. Does Q-learning underestimate the future state density function?

The authors report that Q-learning will underestimate the future state density function, as the sum of predicted future state density function results in being less than 1 for Q-learning, whereas the predictions from C-learning summed to 1 consistently.

3. Is the predicted relabeling ratio λ = (1 + γ)/2 optimal for Q-learning?

Experiments show that the performance of Q-learning is highly dependent on relabeling ratio. Furthermore, it shows that the hypothesized relabeling ratio λ = (1 + γ)/2 almost exactly matches the optimal value. C-learning consistently performs better at estimating the future state density function (even for the best choice of λ for Q-Learning).

4. How does C-learning compare with prior goal-conditioned RL methods on benchmark tasks?

The authors apply goal conditioned C-learning to benchmark continuous control tasks, with varying difficulty. As shown in the plots, C-learning is competitive in all tasks. Furthermore, visualizing the learned policies C-learning has discovered re-grasping and fine-grained adjustment behaviors, which is super cool, since no manually designed reward functions were used to promote these behaviors.

Conclusion

The paper C-Learning: Learning to Achieve Goals via Recursive Classification study the problem of predicting and controlling future states distribution of an autonomous agent. It does not rely on reward functions, which allows the an agent in C-learning to solve this future prediction problem without any human supervision to solve many downstream tasks. This is further augmented in the experiments where C-Learning yields competitive results for difficult high-dimensional continuous control tasks.

I would strongly recommend readers to check out the original paper, and the presentation video (below). I hope this article helped people understand the key ideas behind C-Learning so that it could be used to solve interesting research problems.

References and Citations

Eysenbach, Benjamin, Ruslan Salakhutdinov, and Sergey Levine. “C-Learning: Learning to Achieve Goals via Recursive Classification.” (2020).

Paper:

@inproceedings{
eysenbach2021clearning,
title={C-Learning: Learning to Achieve Goals via Recursive Classification},
author={Benjamin Eysenbach and Ruslan Salakhutdinov and Sergey Levine},
booktitle={International Conference on Learning Representations},
year={2021},
url={https://openreview.net/forum?id=tc5qisoB-C}
}

Website: https://ben-eysenbach.github.io/c_learning/

Presentation: https://slideslive.com/38941367/clearning-learning-to-achieve-goals-via-recursive-classification?ref=speaker-16445-latest

Code: https://github.com/google-research/google-research/tree/master/c_learning

thankful to be able to study what I love :)