Inverse Reinforcement Learning

By Alexandre Gonfalonieri

This article is based on the work of Johannes Heidecke, Jacob Steinhardt, Owain Evans, Jordan Alexander, Prasanth Omanakuttan, Bilal Piot, Matthieu Geist, Olivier Pietquin and other researchers in the field of Inverse Reinforcement Learning. I quote their words throughout to help readers understand IRL.

Inverse reinforcement learning is a recently developed machine-learning framework that solves the inverse problem of reinforcement learning (RL).

Basically, IRL is about learning from humans.

Inverse reinforcement learning (IRL) is the field of learning an agent’s objectives, values, or rewards by observing its behavior.

Johannes Heidecke said “We might observe the behavior of a human in some specific task and learn which states of the environment the human is trying to achieve and what the concrete goals might be.” (source)

“IRL is a paradigm relying on Markov Decision Processes (MDPs), where the goal of the apprentice agent is to find a reward function from the expert demonstrations that could explain the expert behavior.” Bilal Piot, Matthieu Geist and Olivier Pietquin, Bridging the Gap between Imitation Learning and Inverse Reinforcement Learning
If some artificial intelligence one day reaches super-human capabilities, IRL might be one approach to understanding what humans want and, hopefully, to working towards these goals.

Jordan Alexander said “The goal is to learn a decision process to produce behavior that maximizes some predefined reward function. Basically, the goal is to extract the reward function from the observed behavior of an agent.

For instance, consider the task of autonomous driving. One approach would be to create a reward function that captures the desired behavior of a driver, like stopping at red lights, avoiding pedestrians, etc. However, this would require an exhaustive list of every behavior we’d want to consider, as well as a list of weights describing how important each behavior is.” (source)

Prasanth Omanakuttan, AI Researcher, said “However, through IRL, the task is to take a set of human-generated driving data and extract an approximation of that human’s reward function for the task. Still, much of the information necessary for solving a problem is captured within the approximation of the true reward function. Once we have the right reward function, the problem is reduced to finding the right policy, and can be solved with standard reinforcement learning methods.” (source)
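The last step Omanakuttan describes — solving for the policy once the reward is known — is standard dynamic programming. A minimal sketch using value iteration on a toy chain MDP (the transition model and the "recovered" reward below are illustrative assumptions, not from any of the quoted work):

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9

# T[a, s, s'] : probability of landing in s' after taking action a in state s
T = np.zeros((n_actions, n_states, n_states))
T[0] = np.eye(n_states)                       # action 0: stay put
T[1] = np.roll(np.eye(n_states), 1, axis=1)   # action 1: advance to the next state

# reward function assumed to have been recovered by IRL (hypothetical)
R = np.array([0.0, 0.0, 1.0])

def value_iteration(T, R, gamma, tol=1e-8):
    """Optimal state values and greedy policy for a *known* reward."""
    V = np.zeros(len(R))
    while True:
        Q = R + gamma * T @ V          # Q[a, s] = R(s) + gamma * sum_s' T(s'|s,a) V(s')
        V_new = Q.max(axis=0)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new

V, policy = value_iteration(T, R, gamma)
print(policy)   # greedy action per state: advance, advance, then stay
```

Here the greedy policy simply moves toward the rewarding state and stays there; swapping in a different R changes the behavior without touching the solver, which is the sense in which the problem "is reduced to finding the right policy."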

“The main problem when converting a complex task into a simple reward function is that a given policy may be optimal for many different reward functions. That is, even though we have the actions from an expert, there exist many different reward functions that the expert might be attempting to maximize.” Jordan Alexander, Stanford University, Learning from humans: what is inverse reinforcement learning?

Bilal Piot, Matthieu Geist and Olivier Pietquin have said “In other words, our goal is to model an agent taking actions in a given environment. We therefore suppose that we have a state space S (the set of states the agent and environment can be in), an action space A (the set of actions the agent can take), and a transition function T(s′|s,a), which gives the probability of moving from state s to state s′ when taking action a. For instance, for an AI learning to control a car, the state space would be the possible locations and orientations of the car, the action space would be the set of control signals that the AI could send to the car, and the transition function would be the dynamics model for the car. The tuple of (S,A,T) is called an MDP∖R, which is a Markov Decision Process without a reward function. (The MDP∖R will either have a known horizon or a discount rate γ but we’ll leave these out for simplicity.)


The inference problem for IRL is to infer a reward function R given an optimal policy π∗:S→A for the MDP∖R. We learn about the policy π∗ from samples (s,a) of states and the corresponding action according to π∗ (which may be random). Typically, these samples come from a trajectory, which records the full history of the agent’s states and actions in a single episode:

τ = ((s0,a0), (s1,a1), …, (sn,an))

In the car example, this would correspond to the actions taken by an expert human driver who is demonstrating desired driving behaviour (where the actions would be recorded as the signals to the steering wheel, brake, etc.).

Given the MDP∖R and the observed trajectory, the goal is to infer the reward function R. In a Bayesian framework, if we specify a prior on R we have:

P(R | (s0,a0), …, (sn,an)) ∝ P((s0,a0), …, (sn,an) | R) · P(R) = P(R) · ∏i P(ai | si, R)

The likelihood P(ai|si,R) is just πR(s)[ai], where πR is the optimal policy under the reward function R. Note that computing the optimal policy given the reward is in general non-trivial; except in simple cases, we typically approximate the policy using reinforcement learning. Due to the challenges of specifying priors, computing optimal policies and integrating over reward functions, most work in IRL uses some kind of approximation to the Bayesian objective.” (source)
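As a toy illustration of that Bayesian objective, the sketch below scores two candidate reward functions against a few demonstrated (s, a) pairs. Following common practice rather than anything in the quote, it approximates the likelihood P(ai|si,R) with a Boltzmann-rational policy πR(a|s) ∝ exp(β·QR(s,a)); the MDP, β, and the candidate rewards are all illustrative assumptions:

```python
import numpy as np

gamma, beta = 0.9, 5.0

# toy MDP\R: T[a, s, s'] gives P(s' | s, a) for 3 states and 2 actions
T = np.zeros((2, 3, 3))
T[0] = np.eye(3)                          # action 0: stay in place
T[1] = np.roll(np.eye(3), 1, axis=1)      # action 1: advance to the next state

def q_values(R, tol=1e-8):
    """Q-values of the optimal policy under reward R (value iteration)."""
    V = np.zeros(3)
    while True:
        Q = R + gamma * T @ V             # Q[a, s]
        V_new = Q.max(axis=0)
        if np.abs(V_new - V).max() < tol:
            return Q
        V = V_new

def log_likelihood(demos, R):
    """log prod_i P(a_i | s_i, R) under a Boltzmann-rational policy."""
    Q = q_values(R)
    pi = np.exp(beta * Q)
    pi /= pi.sum(axis=0, keepdims=True)   # pi[a, s] = pi_R(a | s)
    return sum(np.log(pi[a, s]) for s, a in demos)

# demonstrated (state, action) pairs: the expert heads for state 2, then stays
demos = [(0, 1), (1, 1), (2, 0)]

# two candidate rewards under a uniform prior, so the posterior is
# proportional to the likelihood alone
candidates = [np.array([1.0, 0.0, 0.0]),  # "state 0 is valuable"
              np.array([0.0, 0.0, 1.0])]  # "state 2 is valuable"
scores = [log_likelihood(demos, R) for R in candidates]
best = int(np.argmax(scores))
print(best)   # candidate 1: rewarding state 2 explains the demonstrations
```

Real IRL methods replace this brute-force scoring over a handful of candidates with the approximations mentioned in the quote, but the shape of the objective is the same.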

Johannes Heidecke said “In most reinforcement learning tasks there is no natural source for the reward signal. Instead, it has to be hand-crafted and carefully designed to accurately represent the task.

Often, it is necessary to manually tweak the rewards of the RL agent until desired behavior is observed. A better way of finding a well fitting reward function for some objective might be to observe a (human) expert performing the task in order to then automatically extract the respective rewards from these observations.” (source)
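Concretely, "observing an expert" in the simplest setting means recording state–action pairs from demonstration rollouts. A hypothetical sketch (the toy transition model and the deterministic expert policy are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_trajectory(T, policy, s0, horizon, rng):
    """Roll out `policy` in transition model T[s, a, s'] for `horizon` steps,
    recording the (state, action) pairs an IRL algorithm would consume."""
    traj, s = [], s0
    for _ in range(horizon):
        a = policy[s]                              # deterministic expert here
        traj.append((s, a))
        s = rng.choice(T.shape[2], p=T[s, a])      # draw s' ~ T(. | s, a)
    return traj

# toy 2-state, 2-action transition model: T[s, a] is a distribution over s'
T = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.3, 0.7]]])
policy = [1, 0]                                    # expert's action in each state
traj = sample_trajectory(T, policy, s0=0, horizon=5, rng=rng)
print(traj)
```

These (s, a) records play the role of the observed expert behavior from which a reward function is then extracted.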

The biggest motivation for IRL is that it is often immensely difficult to manually specify a reward function for a task.

Jacob Steinhardt said “IRL is a promising approach to learning human values in part because of the easy availability of data. For supervised learning, humans need to produce many labeled instances specialized for a task. IRL, by contrast, is an unsupervised/semi-supervised approach where any record of human behavior is a potential data source. Facebook’s logs of user behavior, YouTube videos etc. provide many data-points on human behavior.

However, while there is lots of existing data that is informative about human preferences, exploiting this data for IRL is difficult with current techniques.” (source)

Another element mentioned by Jacob Steinhardt is the issue of data. He said “The records of human behaviour in books and videos are difficult for IRL algorithms to use. However, data from Facebook seems promising: we can store the state and each human action (clicks and scrolling).

While this covers a broad range of tasks, there are obvious limitations. Some kinds of human preferences seem hard to learn about from behaviour on a computer.”

Human actions depend both on their preferences and their beliefs.

Owain Evans and Jacob Steinhardt said “The beliefs, like the preferences, are never directly observed. For narrow tasks (e.g. people choosing their favorite photos from a display), we can model humans as having full knowledge of the state. But for most real-world tasks, humans have limited information and their information changes over time. If IRL assumes the human has full information, then the model is misspecified and generalizing about what the human would prefer in other scenarios can be mistaken. Here are some examples:

  • Someone travels from their house to a restaurant, which has already closed. If they are assumed to have full knowledge, then IRL would infer an alternative preference (e.g. going for a walk) rather than a preference to get some food.
  • Suppose an IRL algorithm is inferring a person’s goals from key-presses on their laptop. The person repeatedly forgets their login passwords and has to reset them. This behavior is hard to capture with a POMDP-style model: humans forget some strings of characters and not others. IRL might infer that the person intends to repeatedly reset their passwords.

The above arises from humans forgetting information — even if the information is only a short string of characters. This is one way in which humans systematically deviate from rational Bayesian agents.” (source)

Another element raised by Owain Evans and Jacob Steinhardt is long-term plans. They said “Agents will often take long series of actions that generate negative utility for them in the moment in order to accomplish a long-term goal. Such long-term plans can make IRL more difficult for a few reasons. Let’s focus on two:

  • IRL systems may not have access to the right type of data for learning about long-term goals.
  • Needing to predict long sequences of actions can make algorithms more fragile in the face of model misspecification.

To make inferences based on long-term plans, it would be helpful to have coherent data about a single agent’s actions over a long period of time. But in practice, we will likely have substantially more data consisting of short snapshots of a large number of different agents (because many websites or online services already record user interactions, but it is uncommon for a single person to be exhaustively tracked and recorded over an extended period of time even while they are offline).

On the other hand, there are some services that do have extensive data about individual users across a long period of time. However, this data has another issue: it is incomplete in a very systematic way (since it only tracks online behaviour). For instance, someone might go online most days to read course notes and Wikipedia for a class; this is data that would likely be recorded. However, it is less likely that one would have a record of that person taking the final exam, passing the class and then getting an internship based on their class performance. Of course, some pieces of this sequence would be inferable based on some people’s e-mail records, etc., but it would likely be under-represented in the data relative to the record of Wikipedia usage. In either case, some non-trivial degree of inference would be necessary to make sense of such data.

Next, we discuss another potential issue — fragility to model misspecification.

Suppose someone spends 99 days doing a boring task to accomplish an important goal on day 100. A system that is only trying to correctly predict actions will be right 99% of the time if it predicts that the person inherently enjoys boring tasks. Of course, a system that understands the goal and how the tasks lead to the goal will be right 100% of the time, but even minor errors in its understanding could bring the accuracy back below 99%.

Basically, large changes in the model of the agent might only lead to small changes in the predictive accuracy of the model, and the longer the time horizon on which a goal is realized, the more this might be the case. This means that even slight misspecifications in the model could tip the scales back in favor of a (very) incorrect reward function. One solution could be to identify “important” predictions that seem closely tied to the reward function, and focus particularly on getting those predictions right.” (source)
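The 99-day argument can be made concrete with a little arithmetic (the 99/100 figures are the article's illustrative ones; the 2% error rate for the goal-directed model is an assumed number):

```python
# a model that wrongly concludes "this person enjoys boring tasks"
# predicts 99 of the 100 observed days correctly
days = 100
wrong_model_accuracy = 99 / days

# a goal-directed model with the right structure but a small
# misspecification-induced error rate on its day-by-day predictions
error_rate = 0.02
goal_model_accuracy = 1.0 - error_rate

# 0.99 > 0.98: raw predictive accuracy favors the structurally wrong model
print(wrong_model_accuracy, goal_model_accuracy)
```

The longer the horizon over which the goal pays off, the smaller the accuracy penalty for misreading it, which is exactly the fragility the quote describes.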

In the case of even slight model misspecification, the “correct” model might actually perform worse under typical metrics such as predictive accuracy. Therefore, more careful methods of constructing a model might be necessary.

Johannes Heidecke, AI Researcher, said “In IRL, we are given some agent’s policy or a history of behavior and we try to find a reward function that explains the given behavior. Under the assumption that our agent acted optimally, i.e. always picks the best possible action for its reward function, we try to estimate a reward function that could have led to this behavior.” (source)

The central question is how to find a reward function under which the observed behavior is optimal. This comes with two main problems:

  • For most observations of behavior there are many fitting reward functions. The set of solutions often contains degenerate solutions, e.g. the function that assigns zero reward to all states.
  • IRL algorithms assume that the observed behavior is optimal. This is a strong assumption, arguably too strong when we talk about human demonstrations.
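The first problem — degenerate solutions — is easy to demonstrate: under the all-zero reward, every Q-value is zero, so every policy is trivially "optimal". A toy check (the MDP itself is an illustrative assumption):

```python
import numpy as np

gamma = 0.9
T = np.zeros((2, 3, 3))                 # T[a, s, s']: 3 states, 2 actions
T[0] = np.eye(3)                        # action 0: stay
T[1] = np.roll(np.eye(3), 1, axis=1)    # action 1: advance
R = np.zeros(3)                         # the degenerate "explanation"

V = np.zeros(3)
for _ in range(100):                    # value iteration (converges instantly here)
    V = (R + gamma * T @ V).max(axis=0)

Q = R + gamma * T @ V
print(Q)    # all zeros: no action is ever better than any other
```

This is why practical IRL methods add structure — priors, margins, or entropy terms — to rule out such vacuous solutions.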

Important: IRL seeks the reward function that ‘explains’ the demonstrations. Do not confuse this with apprenticeship learning (AL), where the primary interest is a policy that can generate the observed demonstrations.

For Bilal Piot, Matthieu Geist and Olivier Pietquin, “IRL relies on the assumption that the expert’s policy is optimal with respect to an unknown reward function. In this case, the first aim of the apprentice is to learn a reward function that explains the observed expert behavior. Then, using direct reinforcement learning, it optimizes its policy according to this reward and hopefully behaves as well as the expert. Learning a reward has some advantages over learning a policy immediately. First, the reward can be analyzed so as to better understand the expert’s behavior. Second, it allows adapting to perturbations in the dynamics of the environment.

In other words, it is transferable to other environments. Third, it allows improving with time through real interactions and without requiring new demonstrations. However, a major issue is that an MDP must be solved to obtain the optimal policy with respect to the learned reward. Another issue is that the IRL problem is ill-posed as every policy is optimal for the null reward (which is obviously not the reward one is looking for).” (source)

For more information, I recommend these articles: