[CS5446] Partially Observable Markov Decision Process

POMDP#

  • uncertainty in observations
    • agent doesn’t know exactly which state it is in
  • $M = (S, A, E, T, O, R)$ (a concrete example follows after this list)
    • $S$: state
    • $A$: action
    • $E$: evidence / observation
    • $T$: transition function $T: S \times A \times S' \rightarrow [0,1]$
      • $P(s'|s,a)$
    • $O$: observation function $O: S \times E \rightarrow [0,1]$
      • $P(e|s)$ (probability of observing $e$ in state $s$; defines the sensor model)
    • $R$: reward function $R: S \rightarrow \mathbb{R}$
  • history $h_t = \{ a_1, e_1, \dots, a_t, e_t \}$
  • policy:
    1. defined on the belief state: $a = \pi(b)$, or
    2. defined on the history: $a = \pi(h)$
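For concreteness, the sketch below writes out a tiny instance of this tuple, the classic tiger problem, as NumPy arrays. The array layout (`T[s, a, s']`, `O[s', e']`, `R[s, a]`), the 0.85 sensor accuracy, and the reward values are assumptions of this sketch, not something taken from the lecture notes.

```python
import numpy as np

# States: 0 = tiger-left, 1 = tiger-right
# Actions: 0 = listen, 1 = open-left, 2 = open-right
# Observations (evidence): 0 = hear-left, 1 = hear-right

# T[s, a, s'] = P(s'|s,a): listening keeps the tiger where it is;
# opening a door resets the problem with the tiger placed uniformly at random
T = np.zeros((2, 3, 2))
T[:, 0, :] = np.eye(2)
T[:, 1, :] = 0.5
T[:, 2, :] = 0.5

# O[s', e'] = P(e'|s'): the listen sensor is correct with probability 0.85
# (a simplified sensor model that depends only on the resulting state)
O = np.array([[0.85, 0.15],
              [0.15, 0.85]])

# R[s, a]: -1 for listening, -100 for opening the tiger's door, +10 for the other door
R = np.array([[-1.0, -100.0,   10.0],
              [-1.0,   10.0, -100.0]])
```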

Belief state#

  • the actual state is unknown, but we can track a probability distribution over possible states
  • $b(s)$: probability that the agent is currently in state $s$, under belief $b$
  • belief update (filtering): $b'(s') = \alpha \, p(e'|s') \sum_s p(s'|s,a) b(s)$
    • the agent executes $a$, receives evidence $e'$, then updates to $b'(s')$
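A minimal sketch of this filtering update, assuming a hypothetical array layout (described in the docstring) rather than any particular library:

```python
import numpy as np

def belief_update(b, a, e, T, O):
    """Filtering step: b'(s') = alpha * P(e'|s') * sum_s P(s'|s,a) b(s).

    Assumed array layout (not from the notes):
      b: (|S|,) current belief
      T: (|S|, |A|, |S|) with T[s, a, s'] = P(s'|s, a)
      O: (|S|, |E|) with O[s', e'] = P(e'|s')
    """
    predicted = b @ T[:, a, :]                # prediction: sum_s P(s'|s,a) b(s)
    unnormalized = O[:, e] * predicted        # correction: multiply by the sensor model
    return unnormalized / unnormalized.sum()  # alpha is the normalizing constant
```

With the tiger arrays above, `belief_update(np.array([0.5, 0.5]), 0, 0, T, O)` (listen, hear-left) shifts the belief to roughly (0.85, 0.15) toward tiger-left.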

Decision making of POMDP#

  1. given the current belief $b$, execute action $a = \pi^*(b)$
  2. receive evidence $e'$
  3. update the belief to $b'(s')$
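Put together, the decision loop might look like the sketch below; `policy` and `act` are placeholder callables (a solved policy and an environment interface, both assumptions of this sketch), and `belief_update` is the filtering sketch above.

```python
def run_pomdp_agent(b, policy, act, T, O, n_steps=20):
    """Hypothetical control loop: act on the belief, observe, filter, repeat."""
    for _ in range(n_steps):
        a = policy(b)                     # 1. execute a = pi*(b) on the current belief
        e = act(a)                        # 2. receive evidence e' from the environment
        b = belief_update(b, a, e, T, O)  # 3. update the belief (filtering)
    return b
```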

Belief space MDP#

reduces the POMDP to an MDP over belief states

  • transition model $P(b'|b,a)$
    • $P(b'|b,a) = \sum_{e'} p(b'|e',a,b)\, p(e'|a,b) = \sum_{e'} p(b'|e',a,b) \sum_{s'} p(e'|s') \sum_s p(s'|s,a) b(s)$
  • reward function $\rho(b,a)$ (see the sketch after this list)
    • $\rho(b,a) = \sum_{s} b(s) \sum_{s'} p(s'|s,a) R(s,a,s')$
  • each belief is a state
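The two belief-MDP quantities above translate directly into array operations. The layouts of `T`, `O`, and `R` below are assumptions of this sketch (matching the earlier filtering example, with a reward that may depend on $s$, $a$, and $s'$):

```python
import numpy as np

def observation_prob(b, a, T, O):
    """P(e'|a, b) = sum_{s'} P(e'|s') sum_s P(s'|s,a) b(s), returned for every e'."""
    predicted = b @ T[:, a, :]   # P(s'|b, a)
    return predicted @ O         # O[s', e'] = P(e'|s')

def belief_reward(b, a, T, R):
    """rho(b, a) = sum_s b(s) sum_{s'} P(s'|s,a) R(s, a, s'),
    with R assumed to be an (|S|, |A|, |S|) array here."""
    return float(np.sum(b[:, None] * T[:, a, :] * R[:, a, :]))
```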

Solution for POMDP#

  • Discretize the belief state space
    1. construct belief space MDP
    2. discretize belief state space
    3. solve the MDP
    4. map the solution back to POMDP
    • $\langle S,A,R,T,O,E \rangle \rightarrow \langle B,A,R,T \rangle$ ($B$ is the discretized belief space)
    • problem: curse of dimensionality (the state space is too large, even after discretization)
  • Utility function represented by a piecewise-linear function (see the sketch after this list)
    • conditional plan: the policy generates an action → receives an observation → updates the belief state → chooses a new action, and so on; the conditional plan is the policy
    • utility function of a belief state
      • $\alpha_p(s)$: utility of executing a conditional plan $p$ starting in actual state $s$
    • expected utility of executing $p$ in belief state $b$: $U_p(b) = \sum_s b(s)\, \alpha_p(s) = E_b[\alpha_p(s)]$
      • $U^{\pi^*}(b) = \max_p U_p(b)$
    • the maximum of a collection of hyperplanes $\Rightarrow$ piecewise linear and convex
    • the continuous belief space is divided into regions, each where one conditional plan dominates
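A small sketch of evaluating such a piecewise-linear utility: each conditional plan $p$ contributes an alpha vector, and $U(b)$ is the upper surface of the hyperplanes $b \cdot \alpha_p$. The alpha vectors here are purely illustrative numbers, not computed from any particular POMDP.

```python
import numpy as np

def utility(b, alpha_vectors):
    """U(b) = max_p U_p(b) = max_p sum_s b(s) alpha_p(s): the upper surface of a
    set of hyperplanes, hence piecewise linear and convex in b."""
    values = alpha_vectors @ b          # one U_p(b) per conditional plan p
    best = int(np.argmax(values))
    return best, float(values[best])

# Illustrative alpha vectors over a two-state belief (rows = plans, columns = states)
alphas = np.array([[ 3.0, -1.0],
                   [ 1.0,  1.0],
                   [-1.0,  3.0]])
best_plan, u = utility(np.array([0.4, 0.6]), alphas)
```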

Online methods (approximate solutions)#

  • to scale up
  • POMCP (Partially Observable Monte Carlo Planning); a reduced sketch follows this list
    • runs UCT on the POMDP, using the action-observation history
    • samples a state at the root from the initial belief
    • select → expand → simulate → backup
  • DESPOT (Determinized Sparse Partially Observable Tree)
    • similar to POMCP, but at every observation node only a single sampled observation is expanded, which keeps the tree much smaller
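A heavily reduced, POMCP-flavoured sketch of that loop (sample a root state from the belief, run UCT over action-observation histories, back up returns). The generative simulator `step(s, a) -> (s', e', r)` and the particle representation of the belief are assumptions of this sketch, not APIs from any library.

```python
import math, random
from collections import defaultdict

def pomcp_plan(belief_particles, actions, step, n_sims=1000, depth=20, gamma=0.95, c=1.0):
    """Reduced POMCP-style search over action-observation histories h."""
    N = defaultdict(int)     # visit counts per history h
    Na = defaultdict(int)    # visit counts per (h, a)
    Q = defaultdict(float)   # value estimates per (h, a)

    def rollout(s, d):       # random-policy rollout for leaf evaluation
        total, disc = 0.0, 1.0
        for _ in range(d):
            s, _, r = step(s, random.choice(actions))
            total += disc * r
            disc *= gamma
        return total

    def simulate(s, h, d):
        if d == 0:
            return 0.0
        if N[h] == 0:        # expand: first visit to this history
            N[h] += 1
            return rollout(s, d)
        # select: UCB1 over actions at history node h
        a = max(actions, key=lambda a: Q[(h, a)]
                + c * math.sqrt(math.log(N[h]) / (Na[(h, a)] + 1e-6)))
        s2, e, r = step(s, a)
        ret = r + gamma * simulate(s2, h + (a, e), d - 1)
        N[h] += 1            # backup: update counts and the running mean of Q
        Na[(h, a)] += 1
        Q[(h, a)] += (ret - Q[(h, a)]) / Na[(h, a)]
        return ret

    for _ in range(n_sims):
        s0 = random.choice(belief_particles)   # sample a root state from the belief
        simulate(s0, h=(), d=depth)
    return max(actions, key=lambda a: Q[((), a)])
```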