[CS5242] BERT
BERT (Bidirectional Encoder Representations from Transformers)
Limitations of traditional NLP models
Why do we need BERT?
- unidirectional context: the model reads text in only one direction, so it may miss out on the full context of a word
- task-specific complexity: models are designed for a specific task, so generalizing them to other tasks adds complexity
- scarce pre-training resources: the lecturer's explanation was that small companies don't have the resources to train large models on big data, but Google does
- limited semantic grasp because of the unidirectional relationship between words
BERT
- unsupervised learning: pre-trained on extensive data, which boosts performance on various language-understanding tasks
- bidirectional understanding: understands words by considering both their left and right context
- transformer: BERT uses an encoder-only Transformer
Pretraining
- Masked LM (masked token prediction)
- mask 15% of the input tokens at random, then predict the masked tokens (see the sketch after this list)
- Next Sentence Prediction
- predicts whether a given sentence actually follows the current one
- builds an understanding of the relationships between sentences
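Below is a minimal sketch of the masking step in plain Python (toy whitespace tokens rather than BERT's WordPiece pieces; the `mask_tokens` helper is made up for illustration). Roughly 15% of positions are chosen at random, replaced with a [MASK] placeholder, and recorded as the prediction targets. The full BERT recipe additionally replaces a chosen token with a random token 10% of the time and leaves it unchanged 10% of the time; that detail is omitted here.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    """Corrupt a token sequence for Masked LM pretraining.

    Returns (corrupted, labels): labels hold the original token only at the
    masked positions (None elsewhere), which is where the loss is computed.
    """
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:   # each position masked with ~15% probability
            labels[i] = tok               # the model must recover this token
            corrupted[i] = MASK
    return corrupted, labels

tokens = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(tokens)
print(corrupted)  # original sentence with a few tokens replaced by [MASK]
print(labels)     # targets at the masked positions, None everywhere else
```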
Finetune
- Once we have the pretrained model, BERT can be fine-tuned to solve downstream tasks
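A minimal fine-tuning sketch, assuming PyTorch and the Hugging Face transformers library (neither is prescribed by the notes): a small classification head is stacked on top of the pretrained encoder, and the whole model is updated on labelled examples from the downstream task.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2   # pretrained encoder + new 2-class head
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One toy labelled example (1 = positive sentiment); a real loop iterates over a dataset.
inputs = tokenizer("a surprisingly delightful movie", return_tensors="pt")
labels = torch.tensor([1])

model.train()
outputs = model(**inputs, labels=labels)  # loss is computed internally from the logits
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```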
ViT (Vision Transformer)
- apply a BERT-style (encoder-only) Transformer directly to images
- partition the image into patches, then use the sequence of linear embeddings of these patches as the input tokens of the Transformer
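A short PyTorch sketch of that patch-embedding step, with illustrative ViT-Base-style values (patch size 16, embedding dimension 768, 224x224 images; none of these numbers are fixed by the notes): the image is cut into non-overlapping patches, each patch is flattened, and one linear layer maps it to the embedding the Transformer sees as a token.

```python
import torch
import torch.nn as nn

B, C, H, W = 2, 3, 224, 224      # a batch of RGB images
P, D = 16, 768                   # patch size and embedding dimension (ViT-Base values)

images = torch.randn(B, C, H, W)

# Extract non-overlapping P x P patches: (B, C, H, W) -> (B, C, H/P, W/P, P, P)
patches = images.unfold(2, P, P).unfold(3, P, P)
# Flatten each patch: -> (B, num_patches, C*P*P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)

embed = nn.Linear(C * P * P, D)  # the "linear embedding" of each patch
tokens = embed(patches)
print(tokens.shape)              # torch.Size([2, 196, 768]): 196 patch tokens per image
```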
ALBERT (A Lite BERT)
- parameter-reduction techniques (see the sketch after this list)
- factorized embedding layer
- decomposing the large vocabulary embedding matrix into two smaller matrices
- cross-layer parameter sharing
- the parameter count doesn't grow with the depth of the network
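A back-of-the-envelope sketch of both techniques (the numbers V, H, E are illustrative, roughly BERT-base scale, and the `shared_encoder` helper is made up for illustration): the factorized embedding replaces one V x H matrix with a V x E lookup plus an E x H projection, and cross-layer sharing applies the same layer weights at every depth.

```python
import torch
import torch.nn as nn

V, H, E = 30_000, 768, 128        # vocabulary size, hidden size, small embedding size

# Factorized embedding: one V x H matrix  vs  a V x E lookup plus an E x H projection.
bert_embedding = V * H            # 23,040,000 parameters
albert_embedding = V * E + E * H  #  3,938,304 parameters
print(f"{bert_embedding:,} vs {albert_embedding:,}")

# Cross-layer parameter sharing: the SAME layer (same weights) is applied at every depth,
# so the encoder's parameter count stays constant as the number of layers grows.
shared_layer = nn.TransformerEncoderLayer(d_model=H, nhead=12, batch_first=True)

def shared_encoder(x, num_layers=12):
    for _ in range(num_layers):
        x = shared_layer(x)
    return x

out = shared_encoder(torch.randn(2, 64, H))  # (batch, seq_len, H) in and out
```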