[CS5446] Reinforcement Learning Generalization
CAUTION

Problem: a tabular representation of utilities cannot scale to large state spaces

Two solutions:

  1. Function Approximation
  2. Policy Search

Function Approximation#

  • compact representation of true utility function
  • $\hat{U}_\theta(s) = \theta_1 f_1(s) + \theta_2 f_2(s) + \dots + \theta_n f_n(s)$
    • input: states
    • output: utility value
  • better generalization

Linear Function Approximation#

  • Use a linear function to approximate $U$
  • $\hat{U}(s) = \theta_0 + \theta_1 s_1 + \theta_2 s_2 + \dots + \theta_n s_n$, where $s_1, \dots, s_n$ are the state features
  1. Approximate Monte Carlo Learning

    • Supervised learning
    • update parameters after each trial
    • L2 loss: $\frac{1}{2}\big(\hat{U}_\theta(s) - u_j(s)\big)^2$
    • Update: $\theta_i \leftarrow \theta_i + \alpha \big(u_j(s) - \hat{U}_\theta(s)\big) \frac{\partial \hat{U}_\theta(s)}{\partial \theta_i}$
      • for each parameter
  2. Approximate Temporal Difference Learning

    • a.k.a semi-gradient
    • Utility: $\theta_i \leftarrow \theta_i + \alpha \big[R(s,a,s') + \gamma \hat{U}_\theta(s') - \hat{U}_\theta(s)\big] \frac{\partial \hat{U}_\theta(s)}{\partial \theta_i}$
    • Q-Learning: $\theta_i \leftarrow \theta_i + \alpha \big[R(s,a,s') + \gamma \max_{a'} \hat{Q}_\theta(s',a') - \hat{Q}_\theta(s,a)\big] \frac{\partial \hat{Q}_\theta(s,a)}{\partial \theta_i}$
NOTE

Approximate MC / TD are similar to MC / TD, except that the update is applied to each parameter $\theta_i$ separately, multiplied by the gradient of the approximation with respect to that parameter. A minimal sketch of both updates is given below.
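A minimal sketch of both updates for a linear $\hat{U}_\theta$, assuming a made-up 2-D grid world whose features are just the coordinates plus a bias term (the feature map, step size, and discount factor are illustrative, not from the lecture):

```python
import numpy as np

def features(state):
    x, y = state                       # hypothetical grid-world coordinates
    return np.array([1.0, x, y])       # bias term + raw coordinates

theta = np.zeros(3)                    # parameters of U_hat
alpha, gamma = 0.1, 0.99               # illustrative step size and discount

def u_hat(state):
    return theta @ features(state)     # linear approximation: theta . f(s)

def mc_update(state, observed_return):
    # Approximate MC: move theta toward the observed return u_j(s);
    # for a linear U_hat, the gradient w.r.t. theta_i is just f_i(s).
    global theta
    theta += alpha * (observed_return - u_hat(state)) * features(state)

def td_update(state, reward, next_state):
    # Approximate TD (semi-gradient): bootstrap with U_hat(s') instead of
    # waiting for the full return of the trial.
    global theta
    td_error = reward + gamma * u_hat(next_state) - u_hat(state)
    theta += alpha * td_error * features(state)
```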

Issues#

  • The deadly triad
    • may not converge if: function approx + bootstrapping + off-policy
  • Catastrophic forgetting:
    • happens when the agent is over-trained: after a long time it only ever trains on the optimal path, forgets the dangerous regions, and is likely to fail if it later wanders into them
    • Solution: Experience Replay
      • Replay past trials to ensure the utility function stays accurate for parts of the state space that are no longer visited; a sketch of a replay buffer follows this list
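A sketch of a uniform replay buffer (the class name and capacity are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # drop the oldest transitions first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive transitions
        # and keeps rarely revisited (e.g. dangerous) states in the training
        # distribution, which is what mitigates catastrophic forgetting.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```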

Non-Linear Function Approximation#

  • deep reinforcement learning: use gradient descent with backpropagation to train the network
  1. Deep Q-Network (DQN)
    • uses experience replay together with a fixed Q-target network (a sketch of the target computation follows)
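A minimal sketch of the fixed-Q-target computation, assuming a hypothetical `target_net` callable that returns per-action Q-values (the networks, optimizer, and replay sampling are omitted):

```python
import numpy as np

gamma = 0.99   # illustrative discount factor

def dqn_targets(batch, target_net):
    """Compute fixed Q-targets for a sampled minibatch of transitions."""
    targets = []
    for state, action, reward, next_state, done in batch:
        if done:
            y = reward                                   # no bootstrap at terminal states
        else:
            # Bootstrap from a frozen copy of the network ("fixed Q-target"),
            # which is only synced with the online network every C steps.
            y = reward + gamma * np.max(target_net(next_state))
        targets.append(y)
    return np.array(targets)
```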

Policy Search#

  • $\pi: S \rightarrow A$ (find a good policy directly)
  • $\pi_\theta(s) = \arg\max_a \hat{Q}_\theta(s,a)$
  • Problem:
    • $\hat{\pi}_\theta(s,a) = 1$ if $a$ is the maximizing action, else $0$
    • the policy is discrete, so gradients cannot be used
    • Solution: Stochastic Policy (output action probabilities)
      • use a softmax function $\Rightarrow$ differentiable (a minimal sketch follows the note below)
      • use gradient descent to update
NOTE

Policy search estimates the policy function directly; it only cares whether the result is a good (optimal) policy, not whether the underlying estimate is close to the true utility.
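A minimal sketch of such a stochastic (softmax) policy; the preference values below are made up for illustration:

```python
import numpy as np

def softmax_policy(preferences):
    """Turn action preferences into a differentiable probability distribution."""
    z = preferences - np.max(preferences)   # shift for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical preferences for three actions: sample instead of taking argmax.
probs = softmax_policy(np.array([2.0, 1.0, 0.1]))
action = np.random.choice(len(probs), p=probs)
```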

  1. REINFORCE
    • Monte-Carlo policy gradient
    • high variance $\Rightarrow$ use a baseline (center the returns to reduce variance)
    • Advantage function $A_{\pi_\theta}(s,a) = \hat{Q}_\theta(s,a) - \hat{U}_{\pi_\theta}(s)$, where $\hat{U}_{\pi_\theta}(s)$ is the baseline (a sketch follows)
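A sketch of REINFORCE with a baseline for a linear-softmax policy; the feature representation, step size, and constant baseline are illustrative assumptions:

```python
import numpy as np

alpha, gamma = 0.01, 0.99   # illustrative step size and discount

def policy_probs(theta, phi):
    # Linear-softmax policy: theta has one row of preferences per action.
    prefs = theta @ phi
    prefs -= prefs.max()
    e = np.exp(prefs)
    return e / e.sum()

def reinforce_update(theta, episode, baseline=0.0):
    """episode: list of (state_features, action, reward) from one trial."""
    # Compute discounted returns G_t for every step of the trial.
    G, returns = 0.0, []
    for _, _, reward in reversed(episode):
        G = reward + gamma * G
        returns.append(G)
    returns.reverse()

    for (phi, action, _), G in zip(episode, returns):
        probs = policy_probs(theta, phi)
        # Gradient of log pi(a|s) for the linear-softmax policy.
        grad_log_pi = -np.outer(probs, phi)
        grad_log_pi[action] += phi
        # Subtracting the baseline centers the return and reduces variance.
        theta = theta + alpha * (G - baseline) * grad_log_pi
    return theta
```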

Problem of policy gradient#

  • unstable returns $\Rightarrow$ bad updates $\Rightarrow$ FAIL!
  • want to restrict how much each update can change the policy
  • Solution: Minorize-Maximization
    • use a simpler objective $M_i(\theta_i)$ (a lower bound of the true objective) in place of the true objective $\rho(\theta_i)$
      • $M_i(\theta_i) \le \rho(\theta_i)$
      • since $M_i$ never exceeds $\rho$, any improvement found by maximizing $M_i(\theta_i)$ also improves $\rho(\theta_i)$
    • maximize the simpler objective
    • guarantees monotonic policy improvement (a short derivation follows this list)
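The improvement guarantee follows from the standard MM property that the surrogate also touches the true objective at the current parameters; that extra condition is an assumption added here to complete the argument:

```latex
% Assume M_i(\theta) \le \rho(\theta) for all \theta, and M_i(\theta_i) = \rho(\theta_i).
% Choosing \theta_{i+1} = \arg\max_\theta M_i(\theta) gives
\rho(\theta_{i+1}) \;\ge\; M_i(\theta_{i+1}) \;\ge\; M_i(\theta_i) \;=\; \rho(\theta_i),
% so the true objective never decreases: monotonic policy improvement.
```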
  1. Trust Region Policy Optimization (TRPO)

    • uses a lower-bound objective with a KL-divergence constraint to limit the change per update, so that each update is gradual and stable while still moving toward the optimal actions
    • the KL divergence between the old and new policy is constrained to be $\le \delta$
  2. Proximal Policy Optimization (PPO)

    • TRPO is too computationally complex
    • uses clipped objective to limit updates
    • the probability ratio $\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}$ must stay within $[1-\epsilon, 1+\epsilon]$
      • if $< 1-\epsilon$, clip to $1-\epsilon$
      • if $> 1+\epsilon$, clip to $1+\epsilon$
      • this restricts the magnitude of each update to the range $[1-\epsilon, 1+\epsilon]$ (a sketch of the clipped objective follows)
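A minimal sketch of the clipped surrogate for a single sample, with an illustrative $\epsilon$ and advantage value:

```python
import numpy as np

epsilon = 0.2   # illustrative clipping range

def ppo_clipped_objective(ratio, advantage):
    """ratio = pi_theta(a|s) / pi_theta_old(a|s); advantage from critic or returns."""
    clipped_ratio = np.clip(ratio, 1 - epsilon, 1 + epsilon)
    # Taking the minimum removes any incentive to push the ratio outside
    # [1 - eps, 1 + eps], which limits how far one update can move the policy.
    return np.minimum(ratio * advantage, clipped_ratio * advantage)

# Example: a ratio of 1.5 with positive advantage contributes 1.2 * 2.0 = 2.4, not 3.0.
print(ppo_clipped_objective(1.5, advantage=2.0))
```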

Function Approximation + Policy Search#

  1. Actor-Critic methods
    • Estimates both the utility function and the policy
    • Learns a policy (actor) that takes actions
    • Also learns a utility function (critic) for evaluating the actor’s decisions
    • The actor is running policy search
    • The critic is running value-function approximation
    • Actor adjusts policy based on feedback from critic (using policy gradient)
    • The advantage estimate shows how much better the chosen action is than the average under the current policy
      • $A(s,a) = R + \gamma V(s') - V(s)$
    • The policy gradient update is $\mathbb{E}\big[A(s,a)\, \nabla_\theta \ln \pi_\theta(a|s)\big]$ (a sketch of one actor-critic step follows)
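A one-step actor-critic sketch with a linear critic and a linear-softmax actor; the feature map and step sizes are illustrative assumptions:

```python
import numpy as np

alpha_actor, alpha_critic, gamma = 0.01, 0.1, 0.99   # illustrative step sizes and discount

def actor_probs(theta, phi):
    prefs = theta @ phi          # linear-softmax actor
    prefs -= prefs.max()
    e = np.exp(prefs)
    return e / e.sum()

def actor_critic_step(theta, w, phi, action, reward, phi_next, done):
    # Critic: linear value estimate V(s) = w . phi(s)
    v = w @ phi
    v_next = 0.0 if done else w @ phi_next
    advantage = reward + gamma * v_next - v          # A(s,a) = R + gamma V(s') - V(s)

    # Critic update: value-function approximation (TD on the advantage / TD error)
    w = w + alpha_critic * advantage * phi

    # Actor update: policy gradient weighted by the advantage
    probs = actor_probs(theta, phi)
    grad_log_pi = -np.outer(probs, phi)
    grad_log_pi[action] += phi
    theta = theta + alpha_actor * advantage * grad_log_pi
    return theta, w
```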