Reinforcement Learning - A Technical Introduction
Journal of Autonomous Intelligence, DOI: 10.32629/jai.v2i2.45
Abstract
Reinforcement learning provides a cognitive science perspective on behavior and sequential decision making, since RL algorithms introduce a computational concept of agency into the learning problem. It therefore addresses an abstract class of problems that can be characterized as follows: an algorithm confronted with information from an unknown environment must discover, step by step, an optimal way to behave, guided only by sparse, delayed or noisy feedback from an environment that changes in response to the algorithm's own behavior. Reinforcement learning thus offers an abstraction of the problem of goal-directed learning from interaction. The paper offers an opinionated introduction to the algorithmic advantages and drawbacks of several approaches, so that the reader can follow recent developments and open problems in reinforcement learning.
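The interaction loop described above can be made concrete with a minimal sketch. The snippet below is illustrative only and is not taken from the paper: it uses a hypothetical 5-state corridor environment and tabular Q-learning (one classical answer to the problem the abstract states), showing an agent that improves its behavior from nothing but states and delayed numeric rewards.

```python
import random

# Hypothetical toy environment: a 5-state corridor; reaching the right end pays 1.
N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)  # step left or step right

def step(state, action):
    """One environment transition: next state, reward, episode-done flag."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

# Tabular Q-learning: learn action values from sparse, delayed feedback alone.
Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.1, 0.95, 0.1

for _ in range(500):                      # episodes of interaction
    s, done = 0, False
    while not done:
        # epsilon-greedy action choice: mostly exploit, sometimes explore
        a = random.randrange(2) if random.random() < epsilon else max((0, 1), key=lambda i: Q[s][i])
        s2, r, done = step(s, ACTIONS[a])
        # update the estimate of long-run return toward the bootstrapped target
        target = r + (0.0 if done else gamma * max(Q[s2]))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2

print([round(max(q), 2) for q in Q])      # learned values grow toward the goal state
```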
Keywords
classical reinforcement learning, Markov decision processes, prediction and adaptive control in unknown environments
Funding
Elements of Euclid
Copyright © 2019 Elmar Diederichs
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License