[CS5242] Attention and Transformers

Limitations of RNN & LSTM#

Why use self-attention?

  • poor sequence parallelism
    • LSTM relies on sequential computation, making it hard to parallelize across the sequence
  • difficulty handling long-distance dependencies
    • LSTM struggles to capture long-range dependencies as sequences grow longer; vanishing or exploding gradients can occur during backpropagation through time
  • inefficient parameter usage
    • LSTM requires maintaining a large number of parameters

Self attention#


  • a.k.a. global attention
  • $q$: query (to match others)
    • $q^i = W^q a^i$
  • $k$: key (to be matched)
    • $k^i = W^k a^i$
  • $v$: value (to be extracted)
    • $v^i = W^v a^i$
  • attention score: $\alpha_{i,j} = \frac{q^i \cdot k^j}{\sqrt d}$, normalized with a softmax over $j$ to give $\hat\alpha_{i,j}$
  • $b^1 = \sum_i \hat\alpha_{1,i} \, v^i$: sums the attention of $x^1$ over all other inputs and itself (a NumPy sketch follows this list)
    • $b^1, b^2, \dots$ can be computed in parallel
  • computational cost can be reduced with local attention
    • focuses only on the positions near the current position within the input sequence
    • restricts the attention to a local neighborhood
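A minimal NumPy sketch of single-head scaled dot-product self-attention, following the $q$/$k$/$v$ definitions above; the shapes, random weights, and the `self_attention`/`softmax` names are illustrative assumptions, not the course's reference code:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(A, W_q, W_k, W_v):
    """A: (n, d_in) matrix with the input vectors a^1..a^n as rows."""
    Q = A @ W_q                          # queries  q^i = W^q a^i
    K = A @ W_k                          # keys     k^i = W^k a^i
    V = A @ W_v                          # values   v^i = W^v a^i
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # alpha_{i,j} = q^i . k^j / sqrt(d)
    weights = softmax(scores, axis=-1)   # hat{alpha}_{i,j}, normalized over j
    return weights @ V                   # b^i = sum_j hat{alpha}_{i,j} v^j

n, d_in, d = 5, 8, 4
rng = np.random.default_rng(0)
A = rng.normal(size=(n, d_in))
W_q, W_k, W_v = (rng.normal(size=(d_in, d)) for _ in range(3))
B = self_attention(A, W_q, W_k, W_v)     # all b^i computed in parallel
print(B.shape)                           # (5, 4)
```

Local attention would correspond to masking the `scores` matrix so each position only attends to a neighborhood before the softmax.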

Multi-Head self attention#

  • e.g. 2 heads
  • queries, keys, and values only communicate within the same head group
  • merge the outputs of all heads into a single output (concatenation)
  • can capture diverse patterns and relationships
  • highly scalable & parallelizable (see the sketch below)
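A hedged sketch of multi-head self-attention, reusing the `self_attention` helper from the previous sketch: each head has its own $W^q$/$W^k$/$W^v$, attends independently (communicating only within its own head), and the head outputs are concatenated. Head count and sizes are illustrative assumptions.

```python
import numpy as np

def multi_head_self_attention(A, heads):
    """heads: list of (W_q, W_k, W_v) tuples, one per head."""
    outputs = [self_attention(A, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
    return np.concatenate(outputs, axis=-1)   # merge heads into one output

rng = np.random.default_rng(1)
n, d_in, d_head, n_heads = 5, 8, 4, 2         # e.g. 2 heads, as above
A = rng.normal(size=(n, d_in))
heads = [tuple(rng.normal(size=(d_in, d_head)) for _ in range(3))
         for _ in range(n_heads)]
B = multi_head_self_attention(A, heads)
print(B.shape)                                # (5, 8) = (n, n_heads * d_head)
```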

Transformer#


  • the Transformer enhances self-attention into multi-head self-attention
    • in multi-head attention, we run several attention heads in parallel
    • each head learns a different representation of Query, Key & Value

Encoder#

(figure: Transformer encoder)

Layer norm#

  • layer norm uses the same normalization for all feature dimensions of one example
    • batch norm instead uses the same normalization for all training examples (per feature); see the sketch below
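A small NumPy illustration of the two normalization axes (an assumption-level sketch that ignores the learnable scale/shift parameters): layer norm uses the statistics of each example's feature vector, batch norm uses per-feature statistics computed across the batch of training examples.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize over the feature dimension, separately for each example
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    # normalize over the batch dimension, separately for each feature
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(2).normal(size=(4, 6))   # (batch, features)
print(layer_norm(x).mean(axis=-1))                 # ~0 for each example
print(batch_norm(x).mean(axis=0))                  # ~0 for each feature
```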

Decoder#

(figure: Transformer decoder)

Application#

  • Vision: ViT, Sora
  • Language: BERT, GPTs
  • Vision-Language: CLIP