[CS5446] Markov Decision Process

Markov Decision Process (MDP)#

How can we deal with uncertainty?

  • A sequential decision problem for a fully observable, stochastic environment with Markovian transition and additive rewards.

Formal definition#

  • MDP $M \triangleq (S, A, T, R, \gamma)$ (a minimal Python sketch of this tuple follows this list)
    • states $S$
    • actions $A$
    • Transition function $T: S, A, S' \rightarrow [0, 1]$
      • satisfies the Markov property, s.t. $\sum_{s'} T(s,a,s') = \sum_{s'} P(s'|s,a) = 1$ (sums to 1)
    • Reward function $R$
    • discount factor $0 \leq \gamma \leq 1$
    • solution is a policy: a function that recommends an action in each state
  • Markov property: Next state only determined by current state & action
    • $P(s_{t+1}|s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1}|s_t, a_t)$
    • Markov property simplifies real-world problems
  • Reward function $R(s,a,s')$: need to balance between risk & reward
  • Policy $\pi: S \rightarrow A$: maps every state to an action
    • Quality of $\pi$: expected utility of the state sequence obtained by following policy $\pi$
    • Optimal policy $\pi^*$: the policy with the highest expected utility
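
As a concrete illustration, here is a minimal sketch of how the tuple $(S, A, T, R, \gamma)$ could be represented in Python. The `MDP` dataclass and the two-state toy example are my own illustrative choices, not from the lecture.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    """Container for M = (S, A, T, R, gamma)."""
    states: List[str]
    actions: List[str]
    # T[(s, a)] is a list of (s', probability) pairs; the probabilities sum to 1
    T: Dict[Tuple[str, str], List[Tuple[str, float]]]
    # R[(s, a, s')] is the reward for that transition (missing keys mean reward 0)
    R: Dict[Tuple[str, str, str], float]
    gamma: float  # discount factor, 0 <= gamma <= 1

# Hypothetical two-state toy MDP, reused in the later sketches.
toy = MDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    T={
        ("s0", "stay"): [("s0", 0.9), ("s1", 0.1)],
        ("s0", "go"):   [("s1", 0.8), ("s0", 0.2)],
        ("s1", "stay"): [("s1", 1.0)],
        ("s1", "go"):   [("s0", 1.0)],
    },
    R={("s0", "go", "s1"): 1.0},
    gamma=0.9,
)
```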

Finite horizon:#

  • fixed time $N$, then terminate
  • Reward: sums up to $R_N$
  • optimal action may change over time $\Rightarrow$ nonstationary $\pi^*$

Infinite horizon:#

  • no fixed deadline, may run forever
  • Stationary $\pi^*$
  • But the total reward may be infinite, making it hard to compare policies. Possible remedies:
    1. use discount factor $\gamma < 1$, so the utility becomes finite
      • $U_h([s_0, a_0, s_1, a_1, s_2, \ldots]) \leq \frac{R_{max}}{1-\gamma}$ (see the short derivation after this list)
    2. Environment has terminal states, and the policy is guaranteed to reach a terminal state.
    3. Compute the average reward obtained per time step
      • hard to compute and analyze
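
The bound on the discounted utility follows from the geometric series. A short derivation, assuming every reward is bounded by $R_{max}$:

$$
U_h([s_0, a_0, s_1, a_1, \ldots]) = \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) \leq \sum_{t=0}^{\infty} \gamma^t R_{max} = \frac{R_{max}}{1-\gamma}, \qquad \gamma < 1.
$$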

Utility of States#

  • $U(s)$: expected utility of executing $\pi^*$ starting from state $s$
  • Bellman Equation: $U(s) = U^{\pi^*}(s) = \max_{a} \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma U(s')] = V(s)$
  • Optimal policy $\pi^*$ is the policy that maximizes $U(s)$: $\pi^*(s) = \operatorname{argmax}_{a} \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma U(s')]$
  • Q-Function $Q(s,a)$: expected utility of taking a given action in a given state
    • $U(s) = \max_a Q(s,a)$

Value Iteration#

  • repeatedly perform the Bellman Update
    • $U_{i+1}(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma U_i(s')]$, or
    • $U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s'|s,a) U_i(s')$
  • to find the utility value of each state
  • initialize all utilities to $0$, iterate until convergence (a minimal sketch follows this list)
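
A minimal sketch of value iteration in Python, assuming the hypothetical `MDP` container from the earlier sketch. The helper name `q_value` and the convergence threshold `eps` are my own choices, not from the notes.

```python
def q_value(mdp, U, s, a):
    """Expected utility of taking action a in state s under current estimates U."""
    return sum(p * (mdp.R.get((s, a, s2), 0.0) + mdp.gamma * U[s2])
               for s2, p in mdp.T[(s, a)])

def value_iteration(mdp, eps=1e-6):
    """Repeat the Bellman update until the utilities stop changing."""
    U = {s: 0.0 for s in mdp.states}  # initialize all utilities to 0
    while True:
        U_next = {s: max(q_value(mdp, U, s, a) for a in mdp.actions)
                  for s in mdp.states}
        if max(abs(U_next[s] - U[s]) for s in mdp.states) < eps:
            return U_next
        U = U_next

# Usage on the toy MDP, then extract the greedy policy via U(s) = max_a Q(s, a):
U = value_iteration(toy)
pi = {s: max(toy.actions, key=lambda a: q_value(toy, U, s, a)) for s in toy.states}
```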

Policy Iteration#

  • Begin with an initial policy $\pi_0$
  • Repeat until there is no change in the utility of any state (a sketch follows this list):
    1. Policy evaluation: use the current policy to calculate the utility of each state
    2. Policy improvement:
      • find the best action for each state via one-step lookahead based on $U_{\pi_i}$ obtained from the policy evaluation step
      • $\pi_{i+1}(s) = \operatorname{argmax}_a \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma U_{\pi_i}(s')]$
  • Complexity: policy evaluation solves $|S| = n$ linear equations with $n$ unknowns $\Rightarrow O(n^3)$
  • Policy iteration may be faster than value iteration.
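
A corresponding sketch of policy iteration, reusing the hypothetical `MDP` container and `q_value` helper from above. For simplicity the evaluation step here runs a fixed number of fixed-policy Bellman sweeps instead of solving the $n \times n$ linear system exactly (a common "modified policy iteration" variant), so treat it as an illustration rather than the lecture's exact algorithm.

```python
def policy_evaluation(mdp, pi, U, sweeps=50):
    """Approximate U_pi with repeated fixed-policy Bellman updates."""
    for _ in range(sweeps):
        U = {s: q_value(mdp, U, s, pi[s]) for s in mdp.states}
    return U

def policy_iteration(mdp):
    U = {s: 0.0 for s in mdp.states}
    pi = {s: mdp.actions[0] for s in mdp.states}  # arbitrary initial policy pi_0
    while True:
        U = policy_evaluation(mdp, pi, U)         # 1. policy evaluation
        unchanged = True
        for s in mdp.states:                      # 2. policy improvement
            best = max(mdp.actions, key=lambda a: q_value(mdp, U, s, a))
            if q_value(mdp, U, s, best) > q_value(mdp, U, s, pi[s]):
                pi[s], unchanged = best, False
        if unchanged:
            return pi, U

pi_star, U_star = policy_iteration(toy)
```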

Online Algorithms#

Decision-time planning

  • for large problems, the state space grows exponentially
  • we can use Online search with sampling
    • real time dynamic programming
    • Monte Carlo Tree Search
  • Online search tree
  • keep the best action at each state node
  • tree size: $A^D S^D$, where $A$ is the # of actions, $S$ is the # of states, and $D$ is the depth of the tree
  • With sparse sampling:
    • sample $k$ observations
    • tree size: $A^D k^D$
    • but still exponential in the search depth (a small illustrative calculation follows this list)
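
For a feel of the scale (the numbers are illustrative, not from the lecture): with $A = 4$ actions, $S = 100$ states, and depth $D = 5$, the full tree has $A^D S^D = 4^5 \cdot 100^5 \approx 10^{13}$ nodes, while sparse sampling with $k = 10$ gives $A^D k^D = 4^5 \cdot 10^5 \approx 10^8$ — much smaller, but still exponential in $D$.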

Online search with simulation

  • Monte Carlo Tree Search (MCTS)
  • Selection: select the leaf node to expand
    • Selection policy: Upper Confidence Bounds applied to Trees (UCT)
      • $\pi_{UCT}(n) = \operatorname{argmax}_{a} \left( \hat{Q}(n,a) + c \sqrt{\frac{\log N(n)}{N(n,a)}} \right)$
        • $\hat{Q}(n,a)$: average return of all trials (exploitation term)
        • $c$: constant balancing exploitation & exploration
        • $N(n)$: # of trials through node $n$
        • $N(n,a)$: # of trials through node $n$ starting with action $a$
      • Equivalently, the UCT score of a node $n$ is $\frac{U(n)}{N(n)} + C \sqrt{\frac{\log N(\mathrm{Parent}(n))}{N(n)}}$
        • $N(\mathrm{Parent}(n))$: # of times the parent node has been visited
        • $N(n)$: # of times node $n$ has been visited
        • $\frac{U(n)}{N(n)}$: number of wins / number of trials
        • $\sqrt{\frac{\log N(\mathrm{Parent}(n))}{N(n)}}$: square root of ($\log$ of the parent's visit count / number of trials through $n$)
  • Expansion: one or more child nodes are created
  • Simulation: estimate the value of the node by completing one simulation run (rollout to a leaf)
  • Backup: propagate the result back up to the root (a compact sketch of all four steps follows this list)
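
Below is a compact, hypothetical sketch of the four MCTS steps with UCT selection, reusing the toy `MDP` container from earlier. It simplifies several things (one sampled successor per action, a fixed-depth random rollout, rewards gathered during selection are ignored in the backup), so it illustrates the structure rather than reproducing the lecture's algorithm.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []   # one child per tried action
        self.N = 0           # number of trials through this node
        self.U = 0.0         # total return accumulated through this node

def uct_score(node, C=1.4):
    """Exploitation term + exploration bonus (UCB1 applied to the tree)."""
    if node.N == 0:
        return float("inf")
    return node.U / node.N + C * math.sqrt(math.log(node.parent.N) / node.N)

def sample_next(mdp, s, a):
    """Sample a successor state and its reward from the transition model."""
    succs, probs = zip(*mdp.T[(s, a)])
    s2 = random.choices(succs, weights=probs)[0]
    return s2, mdp.R.get((s, a, s2), 0.0)

def rollout(mdp, s, depth=20):
    """Simulation: estimate a state's value with one random run to a fixed depth."""
    ret, disc = 0.0, 1.0
    for _ in range(depth):
        s, r = sample_next(mdp, s, random.choice(mdp.actions))
        ret += disc * r
        disc *= mdp.gamma
    return ret

def mcts(mdp, root_state, n_trials=1000):
    root = Node(root_state)
    for _ in range(n_trials):
        node = root
        # 1. Selection: descend through fully expanded nodes using UCT
        while node.children and len(node.children) == len(mdp.actions):
            node = max(node.children, key=uct_score)
        # 2. Expansion: create one child for an untried action
        untried = [a for a in mdp.actions if a not in {c.action for c in node.children}]
        if untried:
            a = random.choice(untried)
            s2, _ = sample_next(mdp, node.state, a)
            node.children.append(Node(s2, parent=node, action=a))
            node = node.children[-1]
        # 3. Simulation: one rollout from the newly expanded node
        value = rollout(mdp, node.state)
        # 4. Backup: propagate the result back up to the root
        while node is not None:
            node.N += 1
            node.U += value
            node = node.parent
    return max(root.children, key=lambda c: c.N).action  # most-visited action

best_action = mcts(toy, "s0")
```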