[CS5446] Reinforcement Learning Foundation

Reinforcement Learning#


  • Learning to behave in an unfamiliar environment
  • Environment is a Markov Decision Process (MDP)
    • relies on the Markov property (but the agent doesn't know the model)
  • Unknown transition function, unknown reward function
  • Use observed rewards to learn an optimal policy

Types of RL#

Model-based vs. Model-free#

  • Model-based: learns the transition model, then uses it to solve the problem
  • Model-free: doesn't learn the transition model; solves the problem directly

Passive vs. Active learning agent#

  • Passive: given a fixed policy → evaluates it and learns the utility function
  • Active: learns the optimal policy & utility function
    • needs to balance exploitation & exploration

Passive RL#

  • Execute a set of trials using a fixed (given) policy $\pi$
  • Learn the expected utility $U^{\pi}(s)$ for each non-terminal state $s$
    • $U^{\pi}(s) = E\left[\sum_{t} γ^{t} R(s_t, \pi(s_t), s_{t+1})\right]$

Model-based Passive RL#

  1. Adaptive Dynamic Programming (ADP)
  • Learn transition function through experience
    • Count outcomes of each state, action
    • Normalize to estimate $P(s'|s,a)$
  • e.g. ADP example
    • $P(s'=(3,2) | s=(3,3), a=\text{Right}) = \frac{1}{3}$
    • $P(s'=(4,3) | s=(3,3), a=\text{Right}) = \frac{2}{3}$
  • Now that we have the transition function, we can compute the utility of each state under the given policy $\pi$
    • $U^{\pi}(s) = \sum_{s'} P(s'|s, \pi(s))\,[R(s, \pi(s), s') + γ U^{\pi}(s')]$
    • Learn the reward function $R(s,a,s')$ upon entering state $s'$
    • Solving the $S$ linear equations with $S$ unknowns requires $O(S^3)$ time
  • Adjusts each state's utility to agree with ALL of its successors
  • ADP algorithm (a sketch is given below)
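A minimal Python sketch of passive ADP along these lines (the class name `PassiveADP` and its methods are illustrative assumptions, not from the lecture): it counts observed transitions, normalizes them into $P(s'|s,a)$, records $R(s,a,s')$, and evaluates the fixed policy by repeated Bellman updates on the learned model instead of solving the linear system directly.

```python
# Passive ADP sketch: learn the transition and reward model from observed
# transitions, then evaluate the fixed policy pi on that learned model.
from collections import defaultdict

class PassiveADP:
    def __init__(self, policy, gamma=0.9):
        self.policy = policy                                  # fixed policy: dict s -> a
        self.gamma = gamma
        self.counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
        self.rewards = {}                                     # (s, a, s') -> observed reward
        self.U = defaultdict(float)                           # utility estimates U^pi(s)

    def observe(self, s, a, r, s_next):
        """Record one observed transition (s, a, r, s')."""
        self.counts[(s, a)][s_next] += 1
        self.rewards[(s, a, s_next)] = r

    def transition_prob(self, s, a, s_next):
        """Estimate P(s'|s,a) by normalizing the outcome counts."""
        total = sum(self.counts[(s, a)].values())
        return self.counts[(s, a)][s_next] / total if total else 0.0

    def evaluate_policy(self, sweeps=100):
        """Iterative policy evaluation on the learned model
        (stands in for solving the |S| linear equations directly)."""
        states = {s for (s, _a) in self.counts}
        for _ in range(sweeps):
            for s in states:
                a = self.policy[s]
                self.U[s] = sum(
                    self.transition_prob(s, a, s2)
                    * (self.rewards[(s, a, s2)] + self.gamma * self.U[s2])
                    for s2 in self.counts[(s, a)]
                )
        return dict(self.U)
```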

Model-free Passive RL#

Directly learns the utility; doesn't learn the transition or reward functions

  1. Monte Carlo Learning
    • Maintain a running average of the observed return for each state
    • If run many times → converges to the true expected value
    • e.g. MC example
      • For the first trial,
        • observed return from (1,1) = 0.72
        • observed returns from (1,2) = 0.76, 0.84
      • Overall (averaged over trials),
        • $U^{\pi}(1,1) = \frac{0.72+0.72-1.24}{3} \approx 0.067$
        • $U^{\pi}(1,2) = \frac{0.76+0.84+0.76}{3} \approx 0.79$
    • Different variants: first-visit vs. every-visit (within a trial)
    • Slow convergence
      • Learning starts only at the end of each trial (episode)
  2. Temporal Difference (TD) Learning (a sketch follows this list)
    • TD target: $R(s, \pi(s), s') + γ U^{\pi}(s')$
    • TD error (term): TD target $-\, U^{\pi}(s)$
    • Update: $U^{\pi}(s) ← U^{\pi}(s) + α(\text{TD error})$
    • $n$-step TD: build the target by looking ahead up to $n$ steps
    • TD(λ): combine the returns from different $n$-step targets
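A small TD(0) prediction sketch for the update above, assuming each trial is given as a list of observed `(s, r, s')` transitions collected under the fixed policy (the function name and data layout are my own):

```python
# TD(0) passive prediction sketch: after each observed transition,
# move U(s) a step of size alpha toward the TD target R + gamma * U(s').
from collections import defaultdict

def td0_prediction(episodes, gamma=0.9, alpha=0.1):
    """episodes: list of trials, each a list of (s, r, s_next) tuples
    collected while following the fixed policy pi."""
    U = defaultdict(float)
    for trial in episodes:
        for s, r, s_next in trial:
            td_target = r + gamma * U[s_next]
            td_error = td_target - U[s]        # TD error term
            U[s] += alpha * td_error           # U(s) <- U(s) + alpha * (TD error)
    return dict(U)
```

Unlike Monte Carlo, the update happens after every transition rather than only at the end of a trial, but each update only pulls $U(s)$ toward the one successor that was actually observed.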

Active RL#

Acts based on its current predictions of the utilities of states, aiming to act optimally

  • Goal: learn the optimal policy $\pi^*$ that obeys the Bellman equation

Model-based Active RL#

Learn the transition and reward functions, then solve the MDP to obtain $\pi^*$

  1. Active Adaptive Dynamic Programming
    • Learn the model, with outcome probabilities for all actions
    • $\pi^*(s) = argmax_{a} \sum_{s'} P(s'|s,a)\,[R(s,a,s') + γ U(s')]$ (a sketch of this one-step lookahead follows the list)
    • The resulting algorithm is greedy
      • pure exploitation may not be optimal
    • We need a tradeoff between exploration & exploitation
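As a sketch of the greedy one-step lookahead, reusing the learned model from the `PassiveADP` sketch above (again, all names are illustrative):

```python
# Greedy policy extraction on a learned model: one-step lookahead using the
# estimated P(s'|s,a), the observed R(s,a,s'), and the current utilities U.

def greedy_action(model, U, s, actions, gamma=0.9):
    """pi*(s) = argmax_a sum_s' P(s'|s,a) [R(s,a,s') + gamma * U(s')]."""
    def expected_value(a):
        return sum(
            model.transition_prob(s, a, s2)
            * (model.rewards[(s, a, s2)] + gamma * U.get(s2, 0.0))
            for s2 in model.counts[(s, a)]
        )
    return max(actions, key=expected_value)
```

Acting purely on this argmax is exactly the greedy behavior criticized above; the exploration schemes below modify this action-selection step.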

Greedy in the Limit of Infinite Exploration (GLIE)#

  • a learning policy
  • scheme for balancing exploration & exploitation
  • ε-greedy exploration (see the sketch below)
    • choose the greedy action with probability $1-ε$
    • choose an exploratory (random) action with probability $ε$
    • lower ε over time, e.g. $ε = \frac{1}{t}$, so the policy eventually becomes greedy (takes the optimal action)
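A minimal ε-greedy selection sketch with the $ε = \frac{1}{t}$ GLIE schedule (the dict-based Q-table and the argument names are assumptions):

```python
# Epsilon-greedy action selection with a GLIE schedule (epsilon = 1/t):
# explore with probability epsilon, otherwise act greedily w.r.t. Q.
import random

def epsilon_greedy(Q, s, actions, t):
    epsilon = 1.0 / t                    # decays to 0 -> greedy in the limit
    if random.random() < epsilon:
        return random.choice(actions)    # explore: random action
    # exploit: greedy action w.r.t. the current Q estimates
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```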

Model-free Active RL#

Directly learn $\pi^*$

  1. Monte Carlo Control

    • $Q(s, a) = \text{avg}(Returns(s,a))$
    • $\pi(s) = argmax_{a} f(Q(s,a), N(s,a))$
      • Exploration function $f(u,n)$
        • determines how greed is traded off against curiosity
        • should increase with $u$ and decrease with $n$
        • e.g. $f(u,n) = R^+$ if $n < N_e$, else $u$
          • $R^+$: best possible (optimistic) reward
          • $N_e$: the agent tries each state-action pair at least $N_e$ times
  2. Active TD / TD Control

    • $Q(s,a) ← Q(s,a) + α(R(s,a,s') + γ Q(s', \pi(s')) - Q(s,a))$

    Q-Learning#

    • Use the optimal (greedy) policy to form the TD target for estimating Q
    • $Q(s,a) ← Q(s,a) + α(R(s,a,s') + γ \max_{a'} Q(s',a') - Q(s,a))$
    • off-policy: the estimated (target) policy ≠ the behavior policy
    • Q-function: expected total discounted reward if $a$ is taken in $s$ and the agent behaves optimally afterwards
      • $U(s) = \max_a Q(s,a)$

    SARSA (State-Action-Reward-State-Action)#

    • Uses TD for prediction, ε-greedy for action selection
    • $Q(s,a) ← Q(s,a) + α(R(s,a,s') + γ Q(s',a') - Q(s,a))$, where $a'$ is the action actually taken in $s'$
    • on-policy: the target uses the behavior policy's own next action
    • Waits until the next action $a'$ is taken, then updates the Q-function (a sketch of both updates follows below)
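A side-by-side sketch of the two tabular updates above (the dict-based Q-table and the `alpha`/`gamma` defaults are assumptions); the only difference is how the next-state value in the target is chosen:

```python
# Q-learning vs. SARSA: identical update shape, different TD target.

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy: the target maximizes over a', regardless of which action
    the behavior policy actually takes next."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
        r + gamma * best_next - Q.get((s, a), 0.0))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: the target uses a', the action actually selected
    (e.g. epsilon-greedily) in s'."""
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
        r + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0))
```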

Comparison#

|             | Passive         | Active                                            |
|-------------|-----------------|---------------------------------------------------|
| Model-Based | ADP             | Active ADP                                        |
| Model-Free  | Monte-Carlo; TD | Monte-Carlo Control; Active TD: Q-Learning, SARSA |