[CS5242] Stable Diffusion
Image Generation
Diffusion Model
- Forward Process:
  - gradually add Gaussian noise to a clean image until it becomes pure random noise
  - generates the training data & labels for the reverse process
- Reverse Process:
  - train a noise predictor (U-Net) to produce the denoised image
  - input: noisy image & step number (indicates how much noise the image contains at that step)
  - output: denoised image
- Why T steps?
  - breaks a hard problem down into many easier ones
  - easier to inject conditions (e.g. a text prompt) gradually
- Noise predictor loss function: distance between the predicted image and the ground truth from the forward process (see the training sketch after this list)
- Text-to-Image: in the reverse process, add the text prompt as an additional input
  - 3 inputs: noisy image, step number, text prompt
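
Below is a minimal PyTorch sketch of one training step, assuming a toy fully-connected predictor and a standard linear noise schedule (the model, sizes, and schedule are illustrative stand-ins, not the lecture's setup). The forward process creates the noisy training input in closed form; the reverse-process network is then trained to recover the clean image from the noisy image and its step number.

```python
import torch
import torch.nn as nn

class TinyNoisePredictor(nn.Module):
    """Toy stand-in for the U-Net: takes a noisy image and its step number."""
    def __init__(self, dim=3 * 32 * 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.ReLU(), nn.Linear(256, dim))

    def forward(self, x_t, t):
        b = x_t.shape[0]
        # Flatten the noisy image and append the (normalized) step number.
        inp = torch.cat([x_t.view(b, -1), t.view(b, 1).float() / 1000.0], dim=1)
        return self.net(inp).view_as(x_t)

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # assumed linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def forward_process(x0, t):
    """Forward process: jump directly to step t by mixing the clean image with Gaussian noise."""
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

model = TinyNoisePredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

x0 = torch.rand(8, 3, 32, 32)                  # pretend batch of clean training images
t = torch.randint(0, T, (8,))                  # a random step number per image
x_t = forward_process(x0, t)                   # forward process: noisy training data
pred_x0 = model(x_t, t)                        # reverse process: predict the denoised image
loss = nn.functional.mse_loss(pred_x0, x0)     # distance to the ground-truth clean image
loss.backward()
opt.step()
```

Most DDPM-style implementations actually train the network to predict the injected noise rather than the clean image; the two are equivalent reparameterizations of the same objective, so the sketch simply follows the notes' wording.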
Stable (Latent) Diffusion
Why is it needed?
- DMs operate directly in high-dimensional pixel space, so training and sampling are computationally costly
AutoEncoder
- implemented as a Variational Autoencoder (VAE)
- handles perceptual image compression (reduces the dimensionality of the input data)
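
A rough sketch of the compression this buys, using a tiny stand-in encoder (the factor-8 downsampling and the 4 latent channels mirror Stable Diffusion's usual configuration; everything else is illustrative):

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Toy stand-in for the VAE encoder: three stride-2 convolutions (factor-8 downsampling)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1),   # 512 -> 256
            nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),  # 256 -> 128
            nn.SiLU(),
            nn.Conv2d(64, 4, 3, stride=2, padding=1),   # 128 -> 64, 4 latent channels
        )

    def forward(self, x):
        return self.net(x)

x = torch.rand(1, 3, 512, 512)       # pixel-space image
z = TinyEncoder()(x)                 # latent: (1, 4, 64, 64)
print(x.numel(), "->", z.numel())    # 786432 -> 16384 values, ~48x fewer
```

The denoiser then runs entirely on the small latent, and the matching decoder maps the final latent back to pixels.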
Denoiser
- a diffusion model that denoises in the compressed latent space instead of pixel space
Condition Encoder
- an arbitrary encoder that maps the condition (e.g. a text prompt) to a sequence of token embeddings for the denoiser (see the cross-attention sketch below)
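
In latent diffusion this token sequence is injected into the denoiser through cross-attention layers; here is a minimal sketch with made-up dimensions (the 77-token length mirrors CLIP's text encoder, the rest is an assumption):

```python
import torch
import torch.nn as nn

d_model = 128
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

latent_tokens = torch.rand(1, 64 * 64, d_model)  # flattened latent feature map (queries)
text_tokens = torch.rand(1, 77, d_model)         # condition encoder output (keys & values)

conditioned, _ = cross_attn(latent_tokens, text_tokens, text_tokens)
print(conditioned.shape)                         # (1, 4096, 128), same shape as the queries
```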
Dall-E 2
- 4 parts:
  - text encoder & image encoder (pretrained with CLIP)
  - prior training (transforms the text embedding into an image embedding)
  - decoder training (produces an image conditioned on the CLIP image embedding and the text caption)
  - inference (discard the CLIP image encoder; keep the CLIP text encoder)
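
A toy sketch of the inference-time data flow through those parts, with simple linear layers standing in for what are really a frozen CLIP text encoder, a learned prior, and a diffusion decoder (all names and sizes below are hypothetical):

```python
import torch
import torch.nn as nn

emb_dim = 64
text_encoder = nn.Embedding(1000, emb_dim)        # stand-in for the frozen CLIP text encoder
prior = nn.Linear(emb_dim, emb_dim)               # stand-in for the prior: text emb -> image emb
decoder = nn.Linear(emb_dim, 3 * 32 * 32)         # stand-in for the decoder: image emb -> image

token_ids = torch.randint(0, 1000, (1, 8))        # a tokenized caption
text_emb = text_encoder(token_ids).mean(dim=1)    # (1, emb_dim) caption embedding
image_emb = prior(text_emb)                       # prior: text embedding -> image embedding
image = decoder(image_emb).view(1, 3, 32, 32)     # decoder: image conditioned on the image embedding
```

The CLIP image encoder only matters during training, where it supplies the target image embeddings for the prior and the conditioning signal for the decoder, which is why it can be dropped at inference.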