Markov Decision Process Transformations
Sources
- Reinforcement Learning: An Introduction, Sutton and Barto (2nd edition, 2018). Clear presentation that builds up from simple examples; the authors are major contributors to the field. David Silver (AlphaZero architect) says he read their 1st edition as a first step in learning about RL.
- How are reward functions R(s), R(s, a), and R(s, a, s’) equivalent? (2019)
- Equivalence notions and model minimization in Markov decision processes, R. Givan, T. Dean, M. Greig (2003)
- Approximate Equivalence of Markov Decision Processes, E. Even-Dar and Y. Mansour
TO DO: under construction
- Explain what equivalent MDPs are
- Why formulations using stochastic state-action-state rewards \(R(s, a, s')\), state-action rewards \(R(s, a)\), or state rewards \(R(s)\) are equivalent (see the numerical sketch after this list)
- What are morphisms of MDPs
- Why \(J_\pi\), \(V^\pi(s)\) and \(Q^\pi(s, a)\) in an MDP can each be interpreted as an action-value, a value, or a goal if we change the underlying MDP (see the sketches after this list)
- How an MDP \((\mathcal{S}, \mathcal{A}, d(s_0), p(s',r \vert s, a))\) with reward \(r\) and a discount factor \(\gamma\) is equivalent to an MDP with states \(\mathcal{S} \times \mathbb{N}\), reward \(\gamma^{n-1} r_n\) at step \(n\), and discount factor \(1\), and how we can formally replace \(r_n\) with \(\gamma^{n-1} r_n\), in a suitable sense, in MDP formulas like the Bellman equations or the policy gradient used in REINFORCE (see the sketches after this list)
- If two policies \(\pi_1, \pi_2\) on an MDP satisfy \(Q_{\pi_1}(s, a) \lt Q_{\pi_2}(s, a)\) for all \((s, a)\), we write \(\pi_1 \lt \pi_2\) and say that \(\pi_2\) is the better policy. If \(\mathcal{S}\) and \(\mathcal{A}\) are finite, an optimal policy always exists.
- Given an MDP, construct an MDP that models reward + risk (or variance of reward)
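
A numerical sketch of the reward-formulation equivalence, assuming a small made-up tabular MDP (the sizes, the random dynamics, and the helper name `policy_eval` below are illustrative, not from any of the sources): collapsing \(R(s, a, s')\) to \(R(s, a) = \sum_{s'} p(s' \vert s, a)\, R(s, a, s')\) leaves the policy-evaluation values \(V^\pi\) unchanged, because the expected return depends on the rewards only through this conditional expectation.

```python
import numpy as np

# Illustrative tabular MDP with made-up sizes and random dynamics.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] = p(s' | s, a)
R3 = rng.normal(size=(n_states, n_actions, n_states))             # R(s, a, s')
pi = np.full((n_states, n_actions), 1.0 / n_actions)              # uniform policy pi(a | s)

# Collapse the state-action-state reward into a state-action reward.
R2 = np.einsum("sap,sap->sa", P, R3)                               # R(s, a) = E_{s'}[R(s, a, s')]

def policy_eval(r_pi):
    """Exact policy evaluation: solve V = r_pi + gamma * P_pi V."""
    P_pi = np.einsum("sa,sap->sp", pi, P)                          # state-to-state kernel under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Expected one-step reward per state, computed from each reward formulation.
r_from_R3 = np.einsum("sa,sap,sap->s", pi, P, R3)
r_from_R2 = np.einsum("sa,sa->s", pi, R2)

assert np.allclose(policy_eval(r_from_R3), policy_eval(r_from_R2))  # same V^pi either way
```

Going in the other direction, from \(R(s)\) or \(R(s, a)\) to the richer signatures, may require enlarging the state (e.g. remembering the previous state and action), which is exactly the kind of MDP transformation this page is meant to collect.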
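
A sketch of how \(J_\pi\) and \(Q^\pi\) become plain state values of transformed MDPs (standard constructions, stated here without proof, assuming \(\gamma \lt 1\) and bounded rewards):

\[
\gamma\, J_\pi \;=\; \gamma \sum_{s_0} d(s_0)\, V^\pi(s_0) \;=\; \tilde V^\pi(\tilde s_0),
\]

where \(\tilde s_0\) is a new dummy initial state that moves, with probability \(d(s_0)\) and zero reward, to \(s_0\) whatever action is taken (the factor \(\gamma\) is the discount spent on this extra step); and

\[
Q^\pi(s, a) \;=\; \bar V\bigl((s, a)\bigr)
\]

in the MDP whose states are the pairs \((s, a)\) and whose transitions first apply \(p(s', r \vert s, a)\) and then let the environment itself sample \(a' \sim \pi(\cdot \vert s')\). Conversely, \(V^\pi(s)\) is the action-value \(Q(s, \ast)\) of a single dummy action \(\ast\) whose effect is to draw \(a \sim \pi(\cdot \vert s)\) and then apply the original dynamics.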
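
A sketch of the discount-absorbing construction, again stated without proof: augment each state with the time index and fold \(\gamma^{n-1}\) into the reward, so that the new MDP over \(\mathcal{S} \times \mathbb{N}\) uses discount factor \(1\). Its dynamics are \(\tilde p\bigl((s', n+1), \gamma^{n-1} r \,\vert\, (s, n), a\bigr) = p(s', r \vert s, a)\), and its value function satisfies

\[
\tilde V^\pi(s, n) \;=\; \sum_a \pi(a \vert s) \sum_{s', r} p(s', r \vert s, a)\,\bigl[\gamma^{n-1} r + \tilde V^\pi(s', n+1)\bigr] \;=\; \gamma^{n-1} V^\pi(s),
\]

so the undiscounted return \(\sum_n \gamma^{n-1} r_n\) of the augmented MDP equals the discounted return of the original one, and discounted formulas such as the Bellman equations or the REINFORCE policy gradient can be read as their undiscounted counterparts with \(\gamma^{n-1} r_n\) in place of \(r_n\).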
Open issues:
- What is the relation of MDPs and RL algorithms to constructive mathematics, given that the problem is often to construct a policy from incomplete information about the model?