2019-05-08
References on Optimal Control, Reinforcement Learning and Motion Planning

  • (book) Dynamic Programming, Bellman R. (1957).
  • (book) Dynamic Programming and Optimal Control, Volumes 1 and 2, Bertsekas D. (1995).
  • (book) Markov Decision Processes - Discrete Stochastic Dynamic Programming, Puterman M. (1994).


  • ExpectiMinimax Optimal strategy in games with chance nodes, Melkó E., Nagy B. (2007).
  • Sparse sampling A sparse sampling algorithm for near-optimal planning in large Markov decision processes, Kearns M. et al. (2002).
  • MCTS Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search, Coulom R. (2006).
  • UCT Bandit based Monte-Carlo Planning, Kocsis L., Szepesvári C. (2006).
  • Bandit Algorithms for Tree Search, Coquelin P-A., Munos R. (2007).
  • OPD Optimistic Planning for Deterministic Systems, Hren J., Munos R. (2008).
  • OLOP Open Loop Optimistic Planning, Bubeck S., Munos R. (2010).
  • LGP Logic-Geometric Programming: An Optimization-Based Approach to Combined Task and Motion Planning, Toussaint M. (2015). video
  • AlphaGo Mastering the game of Go with deep neural networks and tree search, Silver D. et al. (2016).
  • AlphaGo Zero Mastering the game of Go without human knowledge, Silver D. et al. (2017).
  • AlphaZero Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, Silver D. et al. (2017).
  • TrailBlazer Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning, Grill J-B., Valko M., Munos R. (2017).
  • MCTSnets Learning to search with MCTSnets, Guez A. et al. (2018).
  • ADI Solving the Rubik's Cube Without Human Knowledge, McAleer S. et al. (2018).
  • OPC/SOPC Continuous-action planning for discounted infinite-horizon nonlinear optimal control with Lipschitz values, Busoniu L., Pall E., Munos R. (2018).


  • (book) Constrained Control and Estimation, Goodwin G. (2005).
  • PI² A Generalized Path Integral Control Approach to Reinforcement Learning, Theodorou E. et al. (2010).
  • PI²-CMA Path Integral Policy Improvement with Covariance Matrix Adaptation, Stulp F., Sigaud O. (2012).
  • iLQG A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems, Todorov E. (2005).
  • iLQG+ Synthesis and stabilization of complex behaviors through online trajectory optimization, Tassa Y. (2012).


  • (book) Model Predictive Control, Camacho E. (1995).
  • (book) Predictive Control With Constraints, Maciejowski J. M. (2002).
  • Linear Model Predictive Control for Lane Keeping and Obstacle Avoidance on Low Curvature Roads, Turri V. et al. (2013).
  • MPCC Optimization-based autonomous racing of 1:43 scale RC cars, Liniger A. et al. (2014). Video 1 | Video 2
  • MIQP Optimal trajectory planning for autonomous driving integrating logical constraints: An MIQP perspective, Qian X., Altché F., Bender P., Stiller C., de La Fortelle A. (2016).



  • Minimax analysis of stochastic problems, Shapiro A., Kleywegt A. (2002).
  • Robust DP Robust Dynamic Programming, Iyengar G. (2005).
  • Robust Planning and Optimization, Laumanns M. (2011). (lecture notes)
  • Robust Markov Decision Processes, Wiesemann W., Kuhn D., Rustem B. (2012).
  • Coarse-Id On the Sample Complexity of the Linear Quadratic Regulator, Dean S., Mania H., Matni N., Recht B., Tu S. (2017).
  • Tube-MPPI Robust Sampling Based Model Predictive Control with Sparse Objective Information, Williams G. et al. (2018). Video


  • A Comprehensive Survey on Safe Reinforcement Learning, García J., Fernández F. (2015).
  • RA-QMDP Risk-averse Behavior Planning for Autonomous Driving under Uncertainty, Naghshvar M. et al. (2018).


  • ICS Will the Driver Seat Ever Be Empty?, Fraichard T. (2014).
  • RSS On a Formal Model of Safe and Scalable Self-driving Cars, Shalev-Shwartz S. et al. (2017).
  • HJI-reachability Safe learning for control: Combining disturbance estimation, reachability analysis and reinforcement learning with systematic exploration, Heidenreich C. (2017).
  • BFTQ A Fitted-Q Algorithm for Budgeted MDPs, Carrara N. et al. (2018).
  • MPC-HJI On Infusing Reachability-Based Safety Assurance within Probabilistic Planning Frameworks for Human-Robot Vehicle Interactions, Leung K. et al. (2018).


  • Simulation of Controlled Uncertain Nonlinear Systems, Tibken B., Hofer E. (1995).
  • Trajectory computation of dynamic uncertain systems, Adrot O., Flaus J-M. (2002).
  • Simulation of Uncertain Dynamic Systems Described By Interval Models: a Survey, Puig V. et al. (2005).
  • Design of interval observers for uncertain dynamical systems, Efimov D., Raïssi T. (2016).


Multi-Armed Bandit

  • UCB1/UCB2 Finite-time Analysis of the Multiarmed Bandit Problem, Auer P., Cesa-Bianchi N., Fischer P. (2002).
  • GP-UCB Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design, Srinivas N., Krause A., Kakade S., Seeger M. (2009).
  • kl-UCB The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond, Garivier A., Cappé O. (2011).
  • KL-UCB Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation, Cappé O. et al. (2013).
  • LUCB PAC Subset Selection in Stochastic Multi-armed Bandits, Kalyanakrishnan S. et al. (2012).
  • POO Black-box optimization of noisy functions with unknown smoothness, Grill J-B., Valko M., Munos R. (2015).
  • Track-and-Stop Optimal Best Arm Identification with Fixed Confidence, Garivier A., Kaufmann E. (2016).
  • M-LUCB/M-Racing Maximin Action Identification: A New Bandit Framework for Games, Garivier A., Kaufmann E., Koolen W. (2016).
  • LUCB-micro Structured Best Arm Identification with Fixed Confidence, Huang R. et al. (2017).
  • Bayesian Optimization in AlphaGo, Chen Y. et al. (2018).


  • Reinforcement learning: A survey, Kaelbling L. et al. (1996).


  • NFQ Neural fitted Q iteration - First experiences with a data efficient neural Reinforcement Learning method, Riedmiller M. (2005).
  • DQN Playing Atari with Deep Reinforcement Learning, Mnih V. et al. (2013). Video
  • DDQN Deep Reinforcement Learning with Double Q-learning, van Hasselt H., Silver D. et al. (2015).
  • DDDQN Dueling Network Architectures for Deep Reinforcement Learning, Wang Z. et al. (2015). Video
  • PDDDQN Prioritized Experience Replay, Schaul T. et al. (2015).
  • NAF Continuous Deep Q-Learning with Model-based Acceleration, Gu S. et al. (2016).
  • Rainbow Rainbow: Combining Improvements in Deep Reinforcement Learning, Hessel M. et al. (2017).
  • Ape-X DQfD Observe and Look Further: Achieving Consistent Performance on Atari, Pohlen T. et al. (2018). Video



  • REINFORCE Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, Williams R. (1992).
  • Natural Gradient A Natural Policy Gradient, Kakade S. (2002).
  • Policy Gradient Methods for Robotics, Peters J., Schaal S. (2006).
  • TRPO Trust Region Policy Optimization, Schulman J. et al. (2015). video
  • PPO Proximal Policy Optimization Algorithms, Schulman J. et al. (2017). video
  • DPPO Emergence of Locomotion Behaviours in Rich Environments, Heess N. et al. (2017). video

  • AC Policy Gradient Methods for Reinforcement Learning with Function Approximation, Sutton R. et al. (1999).
  • NAC Natural Actor-Critic, Peters J. et al. (2005).
  • DPG Deterministic Policy Gradient Algorithms, Silver D. et al. (2014).
  • DDPG Continuous Control With Deep Reinforcement Learning, Lillicrap T. et al. (2015). video 1 | 2 | 3 | 4
  • MACE Terrain-Adaptive Locomotion Skills Using Deep Reinforcement Learning, Peng X., Berseth G., van de Panne M. (2016). video 1 | video 2
  • A3C Asynchronous Methods for Deep Reinforcement Learning, Mnih V. et al. (2016). video 1 | 2 | 3
  • SAC Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Haarnoja T. et al. (2018). video

  • CEM Learning Tetris Using the Noisy Cross-Entropy Method, Szita I., Lőrincz A. (2006). video
  • CMAES Completely Derandomized Self-Adaptation in Evolution Strategies, Hansen N., Ostermeier A. (2001).
  • NEAT Evolving Neural Networks through Augmenting Topologies, Stanley K. (2002). video


  • Dyna Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming, Sutton R. (1990).
  • UCRL2 Near-optimal Regret Bounds for Reinforcement Learning, Jaksch T. (2010).
  • PILCO PILCO: A Model-Based and Data-Efficient Approach to Policy Search, Deisenroth M., Rasmussen C. (2011). (talk)
  • DBN Probabilistic MDP-behavior planning for cars, Brechtel S. et al. (2011).
  • GPS End-to-End Training of Deep Visuomotor Policies, Levine S. et al. (2015). video
  • DeepMPC DeepMPC: Learning Deep Latent Features for Model Predictive Control, Lenz I. et al. (2015). video
  • SVG Learning Continuous Control Policies by Stochastic Value Gradients, Heess N. et al. (2015). video
  • Optimal control with learned local models: Application to dexterous manipulation, Kumar V. et al. (2016). video
  • BPTT Long-term Planning by Short-term Prediction, Shalev-Shwartz S. et al. (2016). video 1 | 2
  • Deep visual foresight for planning robot motion, Finn C., Levine S. (2016). video
  • VIN Value Iteration Networks, Tamar A. et al. (2016). video
  • VPN Value Prediction Network, Oh J. et al. (2017).
  • An LSTM Network for Highway Trajectory Prediction, Altché F., de La Fortelle A. (2017).
  • DistGBP Model-Based Planning with Discrete and Continuous Actions, Henaff M. et al. (2017). video 1 | 2
  • Prediction and Control with Temporal Segment Models, Mishra N. et al. (2017).
  • Predictron The Predictron: End-To-End Learning and Planning, Silver D. et al. (2017). video
  • MPPI Information Theoretic MPC for Model-Based Reinforcement Learning, Williams G. et al. (2017). video
  • Learning Real-World Robot Policies by Dreaming, Piergiovanni A. et al. (2018).
  • PlaNet Learning Latent Dynamics for Planning from Pixels, Hafner D. et al. (2018). video
