[CS5446] Partially Observable Markov Decision Process

POMDP#

  • uncertainty in observations
    • agent doesn’t know exactly which state it is in
  • $M = (S, A, E, T, O, R)$ (a concrete example follows after this list)
    • $S$: state
    • $A$: action
    • $E$: evidence / observation
    • $T$: transition function $T: S \times A \times S' \rightarrow [0,1]$
      • $P(s'|s,a)$
    • $O$: observation function $O: S \times E \rightarrow [0,1]$
      • $P(e|s)$ (probability of observing $e$ in state $s$; defines the sensor model)
    • $R$: reward function $R: S \rightarrow \mathbb{R}$
  • history $h_t = \{ a_1, e_1, \dots, a_t, e_t \}$
  • policy:
    1. defined on the belief state: $a = \pi(b)$, or
    2. defined on the history: $a = \pi(h)$
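For concreteness, the sketch below writes out a tiny instance of this tuple, the classic tiger problem, as NumPy arrays. The array layout (`T[s, a, s']`, `O[s', e']`, `R[s, a]`), the 0.85 sensor accuracy, and the reward values are assumptions of this sketch, not something taken from the lecture notes.

```python
import numpy as np

# States: 0 = tiger-left, 1 = tiger-right
# Actions: 0 = listen, 1 = open-left, 2 = open-right
# Observations (evidence): 0 = hear-left, 1 = hear-right

# T[s, a, s'] = P(s'|s,a): listening keeps the tiger where it is;
# opening a door resets the problem with the tiger placed uniformly at random
T = np.zeros((2, 3, 2))
T[:, 0, :] = np.eye(2)
T[:, 1, :] = 0.5
T[:, 2, :] = 0.5

# O[s', e'] = P(e'|s'): the listen sensor is correct with probability 0.85
# (a simplified sensor model that depends only on the resulting state)
O = np.array([[0.85, 0.15],
              [0.15, 0.85]])

# R[s, a]: -1 for listening, -100 for opening the tiger's door, +10 for the other door
R = np.array([[-1.0, -100.0,   10.0],
              [-1.0,   10.0, -100.0]])
```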

Belief state#

  • the actual state is unknown, but we can track a probability distribution over possible states
  • $b(s)$: probability that the agent is currently in state $s$, under belief $b$
  • belief update (filtering): $b'(s') = \alpha \, p(e'|s') \sum_s p(s'|s,a) b(s)$
    • the agent executes $a$, receives evidence $e'$, then updates to $b'(s')$
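A minimal sketch of this filtering update, assuming a hypothetical array layout (described in the docstring) rather than any particular library:

```python
import numpy as np

def belief_update(b, a, e, T, O):
    """Filtering step: b'(s') = alpha * P(e'|s') * sum_s P(s'|s,a) b(s).

    Assumed array layout (not from the notes):
      b: (|S|,) current belief
      T: (|S|, |A|, |S|) with T[s, a, s'] = P(s'|s, a)
      O: (|S|, |E|) with O[s', e'] = P(e'|s')
    """
    predicted = b @ T[:, a, :]                # prediction: sum_s P(s'|s,a) b(s)
    unnormalized = O[:, e] * predicted        # correction: multiply by the sensor model
    return unnormalized / unnormalized.sum()  # alpha is the normalizing constant
```

With the tiger arrays above, `belief_update(np.array([0.5, 0.5]), 0, 0, T, O)` (listen, hear-left) shifts the belief to roughly (0.85, 0.15) toward tiger-left.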

Decision making of POMDP#

  1. given the current belief $b$, execute action $a = \pi^*(b)$
  2. receive evidence $e'$
  3. update the belief to $b'(s')$
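Put together, the decision loop might look like the sketch below; `policy` and `act` are placeholder callables (a solved policy and an environment interface, both assumptions of this sketch), and `belief_update` is the filtering sketch above.

```python
def run_pomdp_agent(b, policy, act, T, O, n_steps=20):
    """Hypothetical control loop: act on the belief, observe, filter, repeat."""
    for _ in range(n_steps):
        a = policy(b)                     # 1. execute a = pi*(b) on the current belief
        e = act(a)                        # 2. receive evidence e' from the environment
        b = belief_update(b, a, e, T, O)  # 3. update the belief (filtering)
    return b
```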

Belief space MDP#

reduces the POMDP to an MDP over belief states

  • transition model $P(b'|b,a)$
    • $P(b'|b,a) = \sum_{e'} p(b'|e',a,b)\, p(e'|a,b) = \sum_{e'} p(b'|e',a,b) \sum_{s'} p(e'|s') \sum_s p(s'|s,a) b(s)$
  • reward function $\rho(b,a)$ (see the sketch after this list)
    • $\rho(b,a) = \sum_{s} b(s) \sum_{s'} p(s'|s,a) R(s,a,s')$
  • each belief is a state
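The two belief-MDP quantities above translate directly into array operations. The layouts of `T`, `O`, and `R` below are assumptions of this sketch (matching the earlier filtering example, with a reward that may depend on $s$, $a$, and $s'$):

```python
import numpy as np

def observation_prob(b, a, T, O):
    """P(e'|a, b) = sum_{s'} P(e'|s') sum_s P(s'|s,a) b(s), returned for every e'."""
    predicted = b @ T[:, a, :]   # P(s'|b, a)
    return predicted @ O         # O[s', e'] = P(e'|s')

def belief_reward(b, a, T, R):
    """rho(b, a) = sum_s b(s) sum_{s'} P(s'|s,a) R(s, a, s'),
    with R assumed to be an (|S|, |A|, |S|) array here."""
    return float(np.sum(b[:, None] * T[:, a, :] * R[:, a, :]))
```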

Solution for POMDP#

  • Discretize the belief state space
    1. construct belief space MDP
    2. discretize belief state space
    3. solve the MDP
    4. map the solution back to POMDP
    • $\langle S,A,R,T,O,E \rangle \rightarrow \langle B,A,R,T \rangle$ ($B$ is the discretized belief space)
    • problem: curse of dimensionality (the state space is too large, even after discretization)
  • Utility function represented by a piecewise-linear function (see the sketch after this list)
    • conditional plan: the policy generates an action → receives an observation → updates the belief state → chooses a new action, and so on; the conditional plan is the policy
    • utility function of a belief state
      • $\alpha_p(s)$: utility of executing a conditional plan $p$ starting in actual state $s$
    • expected utility of executing $p$ in belief state $b$: $U_p(b) = \sum_s b(s)\, \alpha_p(s) = E_b[\alpha_p(s)]$
      • $U^{\pi^*}(b) = \max_p U_p(b)$
    • the maximum of a collection of hyperplanes $\Rightarrow$ piecewise linear and convex
    • the continuous belief space is divided into regions, each where one conditional plan dominates
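A small sketch of evaluating such a piecewise-linear utility: each conditional plan $p$ contributes an alpha vector, and $U(b)$ is the upper surface of the hyperplanes $b \cdot \alpha_p$. The alpha vectors here are purely illustrative numbers, not computed from any particular POMDP.

```python
import numpy as np

def utility(b, alpha_vectors):
    """U(b) = max_p U_p(b) = max_p sum_s b(s) alpha_p(s): the upper surface of a
    set of hyperplanes, hence piecewise linear and convex in b."""
    values = alpha_vectors @ b          # one U_p(b) per conditional plan p
    best = int(np.argmax(values))
    return best, float(values[best])

# Illustrative alpha vectors over a two-state belief (rows = plans, columns = states)
alphas = np.array([[ 3.0, -1.0],
                   [ 1.0,  1.0],
                   [-1.0,  3.0]])
best_plan, u = utility(np.array([0.4, 0.6]), alphas)
```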

Online methods (approximate solutions)#

  • to scale up
  • POMCP (Partially Observable Monte Carlo Planning); a reduced sketch follows this list
    • runs UCT on the POMDP, using the action-observation history
    • samples a state at the root from the initial belief
    • select → expand → simulate → backup
  • DESPOT (Determinized Sparse Partially Observable Tree)
    • similar to POMCP, but at every observation node only a single sampled observation is expanded, which keeps the tree much smaller
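A heavily reduced, POMCP-flavoured sketch of that loop (sample a root state from the belief, run UCT over action-observation histories, back up returns). The generative simulator `step(s, a) -> (s', e', r)` and the particle representation of the belief are assumptions of this sketch, not APIs from any library.

```python
import math, random
from collections import defaultdict

def pomcp_plan(belief_particles, actions, step, n_sims=1000, depth=20, gamma=0.95, c=1.0):
    """Reduced POMCP-style search over action-observation histories h."""
    N = defaultdict(int)     # visit counts per history h
    Na = defaultdict(int)    # visit counts per (h, a)
    Q = defaultdict(float)   # value estimates per (h, a)

    def rollout(s, d):       # random-policy rollout for leaf evaluation
        total, disc = 0.0, 1.0
        for _ in range(d):
            s, _, r = step(s, random.choice(actions))
            total += disc * r
            disc *= gamma
        return total

    def simulate(s, h, d):
        if d == 0:
            return 0.0
        if N[h] == 0:        # expand: first visit to this history
            N[h] += 1
            return rollout(s, d)
        # select: UCB1 over actions at history node h
        a = max(actions, key=lambda a: Q[(h, a)]
                + c * math.sqrt(math.log(N[h]) / (Na[(h, a)] + 1e-6)))
        s2, e, r = step(s, a)
        ret = r + gamma * simulate(s2, h + (a, e), d - 1)
        N[h] += 1            # backup: update counts and the running mean of Q
        Na[(h, a)] += 1
        Q[(h, a)] += (ret - Q[(h, a)]) / Na[(h, a)]
        return ret

    for _ in range(n_sims):
        s0 = random.choice(belief_particles)   # sample a root state from the belief
        simulate(s0, h=(), d=depth)
    return max(actions, key=lambda a: Q[((), a)])
```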