[CS5242] Training Deep Networks

Activation Function $g()$

  • perform non-linear feature transformation
    • e.g. via the activation function
  • must be non-linear, because a purely linear mapping is too limited
    • many linear layers would otherwise collapse into one single layer (see the sketch below)
  • $Z^i = w^i h^{i-1} + b^i$,
    $h^i = g^i(Z^i)$, where $g^i$ is the non-linear activation function
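
A minimal NumPy sketch of the collapse argument (the shapes, random weights, and the ReLU choice are mine, purely for illustration): without a non-linear $g$, two stacked linear layers are equivalent to a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two purely linear layers: h1 = W1 x + b1, h2 = W2 h1 + b2 (illustrative shapes).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

h2 = W2 @ (W1 @ x + b1) + b2

# They collapse into one linear layer W x + b, with W = W2 W1 and b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
assert np.allclose(h2, W @ x + b)

# A non-linear activation g (here ReLU) between the layers prevents this collapse,
# so the stacked network can represent non-linear feature transformations.
def g(z):
    return np.maximum(z, 0.0)  # ReLU

h2_nonlinear = W2 @ g(W1 @ x + b1) + b2  # no single linear layer reproduces this map
```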

Stochastic Gradient Descent

  • Problems with GD:

    • Local optimum: the gradient becomes 0, so the weights cannot be updated; since every iteration uses the same (whole) dataset, the next gradients are also 0 and GD has no way to escape
    • Efficiency: GD has to load all training samples at every iteration
  • SGD: randomly pick a (single) training sample to perform BP

    • more efficient per iteration
    • but convergence is much slower, so it needs more iterations to train
  • Mini-batch SGD: randomly pick $b$ training samples (see the training-loop sketch after this list)

  • Convergence rate: $R_{\text{GD}} > R_{\text{mini-SGD}} > R_{\text{SGD}}$

  • Time per iteration: $T_{\text{GD}} > T_{\text{mini-SGD}} > T_{\text{SGD}}$

  • Training time $= T/R$, so there is a trade-off between time per iteration and convergence rate

  • AdaGrad: adapts the learning rate according to the accumulated (squared) gradients

  • RMSProp: rescales the learning rate to remove the effect of the gradient magnitude and balance updates across different directions

    • $s = \beta s + (1-\beta) g^2$, where $s$ carries the history information,
      $w = w - \frac{\alpha g}{\sqrt{s + \epsilon}}$, where $g$ carries the current information and $\beta$ balances the two
  • Adam: combines momentum and RMSProp (see the update sketch after this list)

    • $v = \beta_1 v + (1-\beta_1) g$
      $s = \beta_2 s + (1-\beta_2) g^2$
      $w = w - \alpha \frac{\hat v}{\sqrt{\hat s + \epsilon}}$, where $\hat v = \frac{v}{1-\beta_1}$, $\hat s = \frac{s}{1-\beta_2}$
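
As referenced above, here is a minimal sketch of the mini-batch SGD loop in NumPy, assuming a linear least-squares model purely for illustration (the function name, loss, and default values are mine, not from the notes). Setting `batch_size = len(X)` recovers full-batch GD and `batch_size = 1` recovers plain SGD, which is exactly the $T$ vs. $R$ trade-off listed above.

```python
import numpy as np

def minibatch_sgd(X, y, w, alpha=0.1, batch_size=32, epochs=10, seed=0):
    """Mini-batch SGD on a least-squares loss ||Xw - y||^2 / n (illustrative choice).

    batch_size = len(X) gives full-batch GD; batch_size = 1 gives plain SGD.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        perm = rng.permutation(n)                            # shuffle, then take b samples per step
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)     # mini-batch gradient
            w = w - alpha * grad                             # gradient step
    return w
```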
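
And a sketch of one Adam step written directly from the update rule above (the default hyperparameter values are the commonly used choices, not from the notes); note that the standard Adam bias correction divides by $1-\beta^t$ at step $t$, whereas the notes use the simplified form.

```python
import numpy as np

def adam_step(w, g, v, s, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update combining momentum (v) with the RMSProp-style second moment (s)."""
    v = beta1 * v + (1 - beta1) * g          # momentum: running average of gradients
    s = beta2 * s + (1 - beta2) * g ** 2     # running average of squared gradients
    v_hat = v / (1 - beta1)                  # bias correction (standard Adam uses 1 - beta1**t)
    s_hat = s / (1 - beta2)
    w = w - alpha * v_hat / np.sqrt(s_hat + eps)
    return w, v, s
```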