[CS5446] Reinforcement Learning Model and Abstraction

Model-free vs. Model-based#

  • Model-free: No model, learn value function from experience
  • Model-based:
    • Model the environment
    • Learn/ Plan value function from model and/ or experience
    • Efficient use of data
    • Reason about model uncertainty
    • But may have model bias

Model#

  • Model $M_\eta = (S, A, T, R, \gamma)$, parameterized by $\eta$
  • state & action spaces are known
  • transition & reward functions are unknown
  • use model learning to estimate them

Model Learning#

  • learn from past experience $S_1, A_1, R_2, S_2, A_2, \ldots, S_T$
  • supervised learning
  • minimize a loss function (e.g. MSE, KL divergence)
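
As a concrete illustration, here is a minimal table-lookup model fitted from logged $(s, a, r, s')$ transitions. The empirical transition frequencies and mean rewards are exactly the estimates that minimize the KL / MSE losses above; the class and method names are my own, not from the lecture.

```python
from collections import defaultdict

class TableLookupModel:
    """Table-lookup model: empirical T(s'|s,a) and mean R(s,a) from experience."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.reward_sum = defaultdict(float)                 # (s, a) -> sum of observed rewards
        self.visits = defaultdict(int)                       # (s, a) -> number of visits

    def update(self, s, a, r, s_next):
        """Record one real transition (s, a, r, s')."""
        self.counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def transition_probs(self, s, a):
        """Empirical T(s' | s, a)."""
        n = self.visits[(s, a)]
        return {s_next: c / n for s_next, c in self.counts[(s, a)].items()}

    def reward(self, s, a):
        """Mean observed reward R(s, a)."""
        return self.reward_sum[(s, a)] / self.visits[(s, a)]

# usage on a few toy transitions
model = TableLookupModel()
for s, a, r, s_next in [(0, 1, 0.0, 1), (0, 1, 0.0, 0), (1, 0, 1.0, 0)]:
    model.update(s, a, r, s_next)
print(model.transition_probs(0, 1))  # {1: 0.5, 0: 0.5}
print(model.reward(0, 1))            # 0.0
```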

Model#

  1. Table Lookup Model
    • e.g. ADP
    • cannot scale
  2. Linear Model
    • represent the transition & reward functions as linear models
  3. Neural Network Model
    • e.g. VAE
  • Problem
    • learning transition function may be difficult
      • too many factors to learn from the environment
    • Sol: Value Equivalence Principle
      • Two models are value equivalent if they yield the same Bellman updates
      • learn a latent mapping function $h: s \rightarrow z$
        and a latent transition & reward function $g: z, a \rightarrow z', r$
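
Below is a minimal PyTorch sketch of the latent mapping $h$ and latent dynamics $g$; the layer sizes and the name `LatentModel` are illustrative assumptions. In a value-equivalence setup, this network would be trained so that planning in the latent space $z$ reproduces the same Bellman updates as planning on the real states.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentModel(nn.Module):
    """Sketch of a latent model: h maps observations to latent states,
    g maps (latent state, action) to the next latent state and a reward."""
    def __init__(self, obs_dim, n_actions, latent_dim=32):
        super().__init__()
        self.n_actions = n_actions
        # h: s -> z
        self.h = nn.Sequential(
            nn.Linear(obs_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim),
        )
        # g: (z, a) -> (z', r), predicted jointly by one linear head
        self.g = nn.Linear(latent_dim + n_actions, latent_dim + 1)

    def forward(self, s, a):
        z = self.h(s)                                   # encode observation
        a_onehot = F.one_hot(a, self.n_actions).float() # discrete action as one-hot
        out = self.g(torch.cat([z, a_onehot], dim=-1))
        z_next, r = out[..., :-1], out[..., -1]         # split into next latent and reward
        return z_next, r

# usage with made-up sizes
model = LatentModel(obs_dim=4, n_actions=2)
s = torch.randn(8, 4)              # batch of 8 observations
a = torch.randint(0, 2, (8,))      # batch of 8 discrete actions
z_next, r_pred = model(s, a)       # predicted next latent state and reward
```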

Now that we have a model, how can we use it?

Model based techniques#

  • Goal: find optimal policy/ value using the model and/or the environment
  1. Direct model solving
    • solve using MDP solvers
      • e.g. value/ policy iteration
    • uses the model fully (treats it as if it were the environment)
    • ADP: learn a model and solve the Bellman equations of the learned model (see the value-iteration sketch after this list)
    • Value Iteration Network: value iteration + NN
  2. Sample-based planning
    • use the model ONLY to generate samples
    • don’t consider the probability distribution of the model
      • $s' \sim P_\eta(s' \mid s, a)$
        $r = R_\eta(s, a, s')$
    • apply model-free RL on samples
      • Monte Carlo control
      • Q-Learning
        • but planning over all states is too time consuming
        • Sol: plan only for the states that are relevant NOW (planning around the current state)
  3. Model-based Data Generation
    • consider both
      • real experience from environment
      • simulated experience from model
    • train model free RL with both experience
    • e.g.
      • Dyna-Q (see the sketch after this list)
      • Dyna-Q+
        • Dyna-Q with an exploration bonus
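
For direct model solving (item 1 above), here is a minimal value-iteration sketch that treats the learned `TableLookupModel` from earlier as if it were the true MDP; the helper names and convergence threshold are assumptions, not lecture code.

```python
def value_iteration(model, states, actions, gamma=0.9, theta=1e-6):
    """Solve the learned model (TableLookupModel above) with standard value iteration."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q_values = []
            for a in actions:
                if model.visits[(s, a)] == 0:
                    continue  # no data for this (s, a): skip it
                # Q(s, a) = R(s, a) + gamma * sum_s' T(s'|s,a) V(s')
                q = model.reward(s, a) + gamma * sum(
                    p * V[s_next]
                    for s_next, p in model.transition_probs(s, a).items()
                )
                q_values.append(q)
            if q_values:
                new_v = max(q_values)
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
        if delta < theta:
            return V

# usage with the toy model fitted above
V = value_iteration(model, states=[0, 1], actions=[0, 1])
```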
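
And for model-based data generation (item 3), a minimal tabular Dyna-Q sketch: each real step performs a Q-learning update, updates the model, and then replays `n_planning` simulated transitions from the model. The environment is assumed to follow the Gymnasium `reset`/`step` interface; everything else (names, hyperparameters) is illustrative.

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=100, n_planning=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q: Q-learning on real experience plus simulated (planning) updates."""
    Q = defaultdict(float)   # (s, a) -> action value
    model = {}               # (s, a) -> (r, s'): last observed transition (deterministic model)
    n_actions = env.action_space.n

    def greedy(s):
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection in the real environment
            a = random.randrange(n_actions) if random.random() < epsilon else greedy(s)
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated

            # (a) direct RL: Q-learning update from the real transition
            bootstrap = 0.0 if terminated else max(Q[(s_next, b)] for b in range(n_actions))
            Q[(s, a)] += alpha * (r + gamma * bootstrap - Q[(s, a)])

            # (b) model learning: remember the observed transition
            model[(s, a)] = (r, s_next)

            # (c) planning: replay simulated transitions drawn from the model
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                p_bootstrap = max(Q[(ps_next, b)] for b in range(n_actions))
                Q[(ps, pa)] += alpha * (pr + gamma * p_bootstrap - Q[(ps, pa)])

            s = s_next
    return Q
```

Dyna-Q+ would differ only in step (c): the simulated reward gets an exploration bonus that grows with how long a state–action pair has gone untried, encouraging the agent to revisit stale parts of the model.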

Frontiers#

  • General Value Function (GVF)
    • generalize the value function to predict any signal (not just reward)
  • Temporal Abstractions via Options
    • hierarchical RL
    • option $O = (I, \pi, \beta)$ (see the data-structure sketch at the end of these notes)
      • $I \subseteq S$: initiation set, the states where the option can start
      • $\pi: S \times A \rightarrow [0, 1]$: the policy to follow while the option is active
      • $\beta: S \rightarrow [0, 1]$: probability of terminating at each state
  • Designing Reward Signals
    • Problem: agent cannot learn until it reaches the goal (where the reward is)
    • Sol: add extra (shaping) rewards to make learning easier
    • Extrinsic reward: rewards from the environment
    • Intrinsic reward: rewards generated by the agent itself, based on its internal state
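
As a small illustration of the option tuple $O = (I, \pi, \beta)$, here is a minimal data-structure sketch. For simplicity the intra-option policy is deterministic (state → action) rather than the stochastic $\pi: S \times A \rightarrow [0, 1]$ of the definition, and all names are illustrative.

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    """An option O = (I, pi, beta) for temporal abstraction."""
    initiation_set: Set[int]             # I ⊆ S: states where the option may be started
    policy: Callable[[int], int]         # pi: intra-option policy (deterministic here)
    termination: Callable[[int], float]  # beta: probability of terminating in each state

    def can_start(self, s) -> bool:
        return s in self.initiation_set

    def should_terminate(self, s) -> bool:
        return random.random() < self.termination(s)

# usage: a toy option that keeps taking action 1 ("move right") until state 10 is reached
go_right = Option(
    initiation_set=set(range(10)),
    policy=lambda s: 1,
    termination=lambda s: 1.0 if s >= 10 else 0.0,
)
```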