A method of reinforcement learning (RL) that takes into account the safety of an agent during training and deployment. In many sequential decision-making problems, it is important that while the agent optimizes its long-term performance, it also avoids generating unsafe policies, both during training and at convergence. In these problems, an unsafe policy could cause damage to the agent (e.g., a robot) or to the environment (the plant or the people working nearby). In this work, we first formulate safety by imposing constraints and then propose a class of reinforcement learning algorithms that 1) learn optimal safe policies, and 2) do not generate policies that violate the constraints (unsafe policies), even during training. The main characteristic of our algorithms is the use of Lyapunov functions, a concept that has been extensively studied in control theory to analyze the stability of dynamical systems, to guarantee safety.

## HOW IT WORKS:

We first formulate the problem of safe sequential decision-making as a constrained Markov decision process (CMDP). Our goal here is to learn an optimal feasible policy, while we ensure that only feasible policies are executed during training. We propose a linear-programming-based algorithm to construct Lyapunov functions with respect to the CMDP constraints. We then leverage the properties of these functions to develop dynamic programming (DP) algorithms to solve the CMDP.

**Contribution 1: **Assuming the CMDP is known and finite, we present two safe DP algorithms, called safe policy iteration and safe value iteration. We analyze these algorithms and prove that 1) the policies generated at the iterations of these algorithms are safe and monotonically improving, and 2) under certain technical assumptions, the algorithms achieve optimality. We evaluate our algorithms on planning tasks in a benchmark 2D maze and show that they outperform common baselines in terms of balancing performance and constraint satisfaction.

**Contribution 2: **To handle unknown and large (or continuous) CMDPs, we propose two approximate DP algorithms, called safe DQN and safe DPI. In order to evaluate these algorithms, we repeat the same experiments as above, but this time we assume the model is unknown and the state space is the aerial pixel image of the maze, and not the (x,y) coordinate of the agent. The results still indicate that our algorithms balance performance and constraint satisfaction better than the baselines.

The safe RL algorithms proposed in this work are all value-function-based and therefore cannot handle easily problems that have large or continuous actions. In a follow-up work, we extend our approach to algorithms that are more suitable for continuous action problems and evaluate their performance in several simulated robot locomotion tasks, as well as in a real-world indoor robot navigation problem.

## WHY IT MATTERS:

Our work represents a step toward using RL to solve real-world problems, where constraints on an agent's behavior are sometimes necessary for the sake of safety. This research could help remove an important obstacle hindering the widespread application of RL algorithms: the lack of algorithms that are able to keep an agent safe during training and also to return policies that are safe and perform well. We believe our work can help design RL algorithms that find a good balance between safety and performance.