
Sources

TO DO: under construction

  • Explain what equivalent MDPs are
  • Why formulations using stochastic state-action-state rewards, state-action reward functions, or state-only reward functions are equivalent
  • What are morphisms of MDPs
  • Why Jπ, Vπ(s) and Qπ(s,a) in an MDP can be interpreted as the goal, the value and the action-value respectively, and how each can be reinterpreted as one of the others by changing the underlying MDP (e.g. Vπ(s) is Jπ in the MDP whose initial distribution is concentrated at s)
  • How an MDP (S,A,d(s0),p(s,r|s,a)) with reward r and discount factor γ is equivalent to an MDP with states S×N, reward γ^n·r at step n, and discount factor 1, and how we can formally replace rn with γ^(n−1)·rn in MDP formulas such as the Bellman equations or the policy gradient used in REINFORCE
  • If two policies π1,π2 on an MDP satisfy Qπ1(s,a) ≤ Qπ2(s,a) for all s,a, we say π1 ≤ π2; the policy π2 is then at least as good. If S,A are finite, an optimal policy always exists.
  • Given an MDP, construct an MDP that models reward plus risk (e.g. the variance of the return)
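
The relations between Jπ, Vπ(s) and Qπ(s,a) sketched above can be checked numerically on a tiny finite MDP. The following is a minimal sketch, not a reference implementation; all concrete numbers (P, R, d0, pi, γ) are arbitrary choices for illustration. It evaluates a fixed policy by iterating the Bellman expectation equation and then verifies that Vπ(s) equals Jπ of the same MDP restarted with initial distribution concentrated at s:

```python
# Sketch: policy evaluation on a 2-state, 2-action MDP (all data made up).
import numpy as np

nS, nA, gamma = 2, 2, 0.9
# P[s, a, s']: transition probabilities p(s'|s,a)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
# R[s, a]: expected immediate reward (state-action rewards suffice,
# by the equivalence of reward formulations noted above)
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
d0 = np.array([0.5, 0.5])           # initial state distribution d(s0)
pi = np.array([[0.6, 0.4],          # pi[s, a]: a fixed stochastic policy
               [0.1, 0.9]])

# Iterate Q(s,a) = R(s,a) + gamma * sum_{s'} p(s'|s,a) V(s'),
# where V(s') = sum_{a'} pi(a'|s') Q(s',a').  Converges since gamma < 1.
Q = np.zeros((nS, nA))
for _ in range(1000):
    V = (pi * Q).sum(axis=1)
    Q = R + gamma * P @ V
V = (pi * Q).sum(axis=1)
J = d0 @ V                          # the scalar goal Jπ

# Vπ(s) is Jπ of the MDP whose initial distribution is the point mass at s:
for s in range(nS):
    assert np.isclose(V[s], np.eye(nS)[s] @ V)
print("J =", J, "V =", V)
```

The same pattern extends to Qπ(s,a): it is Jπ of the MDP in which the first action is forced to be a, which is the "change the underlying MDP" reinterpretation in the bullet above.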

Open issues:

  • What is the relation of MDPs and RL algorithms to constructive mathematics, given that the problem is often to construct a policy from incomplete model information?