
Sources

TO DO: under construction

  • Explain what equivalent MDPs are
  • Why formulations using stochastic state-action-state rewards, state-action reward functions, or state reward functions are equivalent (a sketch follows this list)
  • What are morphisms of MDPs
  • Why \(J_\pi\), \(V^\pi(s)\) and \(Q^\pi(s, a)\) in an MDP can each be reinterpreted as a goal, a value, or an action-value if we change the underlying MDP (see the sketch below)
  • How an MDP \((\mathcal{S}, \mathcal{A}, d(s_0), p(s',r \vert s, a))\) with reward \(r\) and discount factor \(\gamma\) is equivalent to an MDP with states \(\mathcal{S} \times \mathbb{N}\), reward \(\gamma^{n-1} r\) at step \(n\), and discount factor \(1\), and how we can formally replace \(r_n\) with \(\gamma^{n-1}r_n\), in a suitable sense, in MDP formulas such as the Bellman equations or the policy gradient used in REINFORCE (sketch below).
  • If two policies \(\pi_1, \pi_2\) on an MDP satisfy \(Q^{\pi_1}(s, a) \lt Q^{\pi_2}(s, a)\) for all \(s, a\), we say \(\pi_1 \lt \pi_2\); the policy \(\pi_2\) is strictly better. If \(\mathcal{S}\) and \(\mathcal{A}\) are finite, an optimal policy always exists (sketch below).
  • Given an MDP, construct an MDP that models reward plus risk (for instance, the variance of the reward); a possible starting point is sketched below.
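
A minimal sketch of the reward-equivalence item, assuming the four-argument dynamics \(p(s', r \vert s, a)\) used above: every value function depends on the rewards only through their expectations, so replacing a stochastic reward by the state-action reward

\[
r(s, a) = \sum_{s', r} r \, p(s', r \vert s, a)
\]

leaves the Bellman equation (and hence \(V^\pi\), \(Q^\pi\), \(J_\pi\)) unchanged:

\[
V^\pi(s) = \sum_a \pi(a \vert s) \sum_{s', r} p(s', r \vert s, a)\,\bigl[r + \gamma V^\pi(s')\bigr]
         = \sum_a \pi(a \vert s) \Bigl[ r(s, a) + \gamma \sum_{s'} p(s' \vert s, a)\, V^\pi(s') \Bigr].
\]

The converse direction (encoding \(r(s, a)\) by a state-only reward) needs a state augmentation, for instance remembering the previous state-action pair, which is presumably part of what this item is meant to spell out.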
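A sketch of the relations behind the \(J_\pi\) / \(V^\pi\) / \(Q^\pi\) item, using the standard definitions; the way I read the claim (an interpretation on my part) is that each quantity is the goal \(J\) of a suitably modified MDP:

\[
J_\pi = \mathbb{E}_{s_0 \sim d}\bigl[V^\pi(s_0)\bigr], \qquad
V^\pi(s) = \sum_a \pi(a \vert s)\, Q^\pi(s, a), \qquad
Q^\pi(s, a) = \sum_{s', r} p(s', r \vert s, a)\bigl[r + \gamma V^\pi(s')\bigr].
\]

For instance, \(V^\pi(s)\) is \(J_\pi\) for the MDP whose initial distribution is the point mass at \(s\), and \(Q^\pi(s, a)\) is \(J_\pi\) for an MDP with an extra initial state from which every action has the effect of taking \(a\) in \(s\).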
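A sketch of the time-augmentation construction, written with a state-action reward \(r(s, a)\) for brevity (an assumption; the four-argument form works the same way). The augmented MDP has states \((s, n) \in \mathcal{S} \times \mathbb{N}\), a deterministic step counter, the discount folded into the reward, and discount factor \(1\):

\[
\tilde{p}\bigl((s', n+1) \vert (s, n), a\bigr) = p(s' \vert s, a), \qquad
\tilde{r}\bigl((s, n), a\bigr) = \gamma^{\,n-1} r(s, a), \qquad \tilde{\gamma} = 1 .
\]

For any policy on the original MDP, lifted so that it ignores the counter, the returns agree trajectory by trajectory, \(\sum_{n \ge 1} \gamma^{\,n-1} r_n = \sum_{n \ge 1} \tilde{r}_n\) (the undiscounted sums still converge when \(\gamma \lt 1\) and the rewards are bounded, since \(\tilde{r}_n\) decays geometrically), and the value functions are related by \(\tilde{V}^\pi(s, n) = \gamma^{\,n-1} V^\pi(s)\). This is one way to make precise the replacement of \(r_n\) by \(\gamma^{\,n-1} r_n\) in formulas such as the Bellman equations or the REINFORCE policy gradient.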
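A sketch of the existence claim for finite \(\mathcal{S}\) and \(\mathcal{A}\), via the usual policy-improvement argument (the standard route; the item may intend a different proof). An optimal policy can be characterized through the Bellman optimality equation

\[
V^*(s) = \max_a \sum_{s', r} p(s', r \vert s, a)\bigl[r + \gamma V^*(s')\bigr],
\qquad
\pi^*(s) \in \arg\max_a Q^*(s, a),
\]

and each policy-improvement step (act greedily with respect to \(Q^\pi\)) yields a policy that is at least as good in every state. Since a finite MDP has only finitely many deterministic policies and the values never decrease, the improvement sequence must stop, and it can only stop at a policy whose value satisfies the optimality equation, which is therefore optimal.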
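A possible starting point for the risk item (an assumption on my part about the intended construction): for a fixed policy, the variance of the return \(G\) can be computed from a second-moment recursion that runs alongside the ordinary Bellman equation. Writing \(M^\pi(s) = \mathbb{E}\bigl[G^2 \vert s_0 = s\bigr]\) and using \(G = r_1 + \gamma G'\),

\[
M^\pi(s) = \sum_a \pi(a \vert s) \sum_{s', r} p(s', r \vert s, a)\bigl[r^2 + 2\gamma r\, V^\pi(s') + \gamma^2 M^\pi(s')\bigr],
\qquad
\operatorname{Var}\bigl[G \vert s_0 = s\bigr] = M^\pi(s) - V^\pi(s)^2 .
\]

One candidate construction (to be checked) augments the state with the reward accumulated so far, so that, in the episodic case, the whole return is a function of the terminal augmented state and risk-sensitive objectives such as \(\mathbb{E}[G] - \lambda \operatorname{Var}[G]\) become functionals of its distribution.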

Open issues:

  • What is the relation of MDPs and RL algorithms to constructive mathematics, given that the problem is often to construct a policy from incomplete information about the model?