[CS5340] Mixture Models & EM Algorithm

Motivation#

  • How do we find the unknown parameters of clusters in unsupervised learning?
  • Which cluster does each data point belong to?

K-Means Algorithm#

Non-probabilistic: each data point is assigned 100% to a single cluster.

  • Given:
    • data set $\{x_1, ..., x_N\}$
    • number of clusters $K$
  • Find:
    • $K$ cluster centers $\{\mu_1, ..., \mu_K\}$
    • assignment of each data point $x_n$ to a cluster center

1-of-K coding:#

  • for each data point $x_n$, $r_{nk} \in \{0, 1\}$ s.t. $\sum_k r_{nk} = 1$
    • if data point $x_n$ is in cluster $k$, then $r_{nk} = 1$; otherwise $r_{nk} = 0$.
    • Hard assignment
    • each data point is assigned to exactly one cluster
  • goal is to minimize $J = \sum_{n=1}^N \sum_{k=1}^K r_{nk} \|x_n - \mu_k\|^2$ s.t. $\sum_k r_{nk} = 1$

Algorithm:#

  1. Initialization: randomly choose $\{\mu_k\}$
  2. Assignment: fix $\{\mu_k\}$, minimize $J$ w.r.t. $\{r_{nk}\}$
    • $\arg\min_{r_n} \sum_k r_{nk} \|x_n - \mu_k\|^2 \ \text{s.t.} \ \sum_k r_{nk} = 1$ (i.e. assign $x_n$ to its nearest center)
  3. Update: fix $\{r_{nk}\}$, minimize $J$ w.r.t. $\{\mu_k\}$
    • $\frac{\partial J}{\partial \mu_k} = -2 \sum_{n=1}^N r_{nk} (x_n - \mu_k) = 0$
    • $\mu_k = \frac{\sum_{n=1}^N r_{nk} x_n}{\sum_{n=1}^N r_{nk}}$ (mean of all points assigned to cluster $k$)
  • Repeat steps 2 and 3 until convergence (a minimal sketch follows below).
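
A minimal NumPy sketch of the three steps above, assuming Euclidean distance and data stored in an `(N, D)` array; the function name `kmeans` and its defaults are illustrative, not from the notes.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: alternating hard assignment and mean update."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # 1. Initialization: pick K random data points as the initial centers
    mu = X[rng.choice(N, size=K, replace=False)]
    for _ in range(n_iters):
        # 2. Assignment step: r_nk = 1 for the nearest center
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (N, K)
        assign = dists.argmin(axis=1)                                   # (N,)
        # 3. Update step: each center becomes the mean of its assigned points
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # converged: centers stopped moving
            break
        mu = new_mu
    return mu, assign
```

Empty clusters simply keep their previous center in this sketch; production implementations typically re-seed them instead.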

Can we use probability to assign? (Soft assignment)

Probabilistic Approach#

Gaussian Mixture Model#

  • linear superposition of multiple Gaussians
  • probability distribution:
    • $p(x) = \sum_{k=1}^K \pi_k N(x \mid \mu_k, \Sigma_k)$ (mixture of $K$ Gaussians)
      • $\mu_k$: mean
      • $\Sigma_k$: covariance
      • $\pi_k$: mixing coefficient (weight of each Gaussian in the model), with $\sum_{k=1}^K \pi_k = 1$
    • 1-of-K representation: $z_k = 1 \Rightarrow z_{j \neq k} = 0$
      • assignment of $x$ to the $k^{th}$ Gaussian
      • $z_k \in \{0, 1\}, \ \sum_k z_k = 1$
    • $p(z)$ is a categorical distribution
      • $p(z) = \prod_k \pi_k^{z_k} = \text{Cat}_z[\pi]$
        • if $z_k = 1$ ($x$ is assigned to the $k^{th}$ Gaussian), then $\pi_k^{z_k} = \pi_k$; otherwise $\pi_k^{z_k} = \pi_k^0 = 1$, so the product picks out the weight of the active component
        • $0 \leq \pi_k \leq 1, \ \sum_k \pi_k = 1$
    • probability of $x$ under each component:
      • $p(x \mid z_k = 1) = N(x \mid \mu_k, \Sigma_k)$
        $\Rightarrow p(x \mid z) = \prod_k N(x \mid \mu_k, \Sigma_k)^{z_k}$
      • $p(x, z) = p(z)\,p(x \mid z)$
        $\Rightarrow p(x) = \sum_z p(z)\,p(x \mid z) = \sum_z \prod_k \left[\pi_k N(x \mid \mu_k, \Sigma_k)\right]^{z_k} = \sum_{k=1}^K \pi_k N(x \mid \mu_k, \Sigma_k)$
  • Responsibility: $r(z_k) = p(z_k = 1 \mid x)$, the posterior probability that $x$ was generated by the $k^{th}$ Gaussian (a small numerical sketch of the density and responsibilities follows this list)
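
A small sketch, assuming SciPy is available, of evaluating the mixture density $p(x) = \sum_k \pi_k N(x \mid \mu_k, \Sigma_k)$ and the responsibilities defined above; the helper names `gmm_pdf` and `responsibilities` are illustrative, not from the notes.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(x, pis, mus, Sigmas):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=S)
               for pi, mu, S in zip(pis, mus, Sigmas))

def responsibilities(x, pis, mus, Sigmas):
    """r(z_k) = pi_k * N(x | mu_k, Sigma_k) / p(x), for all k."""
    comps = np.array([pi * multivariate_normal.pdf(x, mean=mu, cov=S)
                      for pi, mu, S in zip(pis, mus, Sigmas)])
    return comps / comps.sum()

# Example with a hypothetical two-component mixture in 2-D:
pis = [0.5, 0.5]
mus = [np.zeros(2), 3.0 * np.ones(2)]
Sigmas = [np.eye(2), np.eye(2)]
print(gmm_pdf(np.array([1.0, 1.0]), pis, mus, Sigmas))
print(responsibilities(np.array([1.0, 1.0]), pis, mus, Sigmas))
```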

Maximum log-likelihood#

  • we try to take the partial derivatives of the log-likelihood w.r.t. $\mu_k, \Sigma_k, \pi_k$
  • $\arg\max_\theta \ln p(x_1, ..., x_N \mid \theta) = \arg\max_{\pi, \mu, \Sigma} \sum_{n=1}^N \ln \sum_{k=1}^K \pi_k N(x_n \mid \mu_k, \Sigma_k)$; the sum over $k$ sits inside the $\ln$ $\Rightarrow$ no closed-form solution (a numerically stable evaluation is sketched below)
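
Since this log-likelihood is also what EM monitors for convergence, here is a sketch of evaluating it in log space with the usual log-sum-exp trick (the trick is standard practice to avoid underflow, not something from the notes):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k), computed in log space."""
    # log_comp[n, k] = ln pi_k + ln N(x_n | mu_k, Sigma_k)
    log_comp = np.column_stack([
        np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=S)
        for pi, mu, S in zip(pis, mus, Sigmas)
    ])
    return logsumexp(log_comp, axis=1).sum()
```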

EM for Gaussian Mixtures#

  • $\theta = \{\pi_k, \mu_k, \Sigma_k\}$
  1. Initialize $\pi_k, \mu_k, \Sigma_k$
  2. Expectation step: evaluate responsibilities $r(Z)$
    • $r(Z_{nk}) = \frac{\pi_k N(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j N(x_n \mid \mu_j, \Sigma_j)}$
  3. Maximization step: update $\pi_k, \mu_k, \Sigma_k$
    • $\mu_k = \frac{1}{N_k} \sum_{n=1}^N r(Z_{nk})\,x_n$
    • $\Sigma_k = \frac{1}{N_k} \sum_{n=1}^N r(Z_{nk})\,(x_n - \mu_k)(x_n - \mu_k)^T$
    • $\pi_k = \frac{N_k}{N}$
    • where $N_k = \sum_{n=1}^N r(Z_{nk})$ (effective number of points assigned to cluster $k$)
  4. Evaluate the log-likelihood
    • $\ln p(X \mid \mu, \Sigma, \pi) = \sum_{n=1}^N \ln \left( \sum_{k=1}^K \pi_k N(x_n \mid \mu_k, \Sigma_k) \right)$
    • check for convergence
  • K-means can be run first to initialize the EM algorithm (a full sketch of the loop follows below)
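
A compact NumPy/SciPy sketch of this loop, assuming the data is an `(N, D)` array with `D >= 2`; the function name `em_gmm`, its defaults, and the small ridge `1e-6 * I` added to the covariances for numerical stability are illustrative choices, not from the lecture notes.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    """EM for a Gaussian mixture: E-step responsibilities, M-step parameter updates."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # 1. Initialization: random means, shared data covariance, uniform weights
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # 2. E-step: r[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
        dens = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                                for k in range(K)])
        r = dens / dens.sum(axis=1, keepdims=True)
        # 3. M-step: weighted means, covariances, and mixing coefficients
        Nk = r.sum(axis=0)                      # effective counts N_k
        mus = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pis = Nk / N
        # 4. Log-likelihood (under the parameters used in this E-step) and convergence check
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pis, mus, Sigmas, r
```

As the notes suggest, `mus` could instead be initialized from a K-means run to speed up convergence.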

General EM algorithm#

  • Goal: find the maximum likelihood solution for models with latent variables
    • $\ln p(x \mid \theta) = \ln \left( \sum_Z p(x, Z \mid \theta) \right)$
    • no closed form
  • we don't have the complete data $\{X, Z\}$, so we instead maximize the expected value of $\ln p(x, Z \mid \theta)$ under $p(Z \mid x, \theta_{old})$
  1. Initialize the parameters $\theta_{old}$ (e.g. $\mu, \Sigma, \pi$)
  2. Expectation step
    • $Q(\theta, \theta_{old}) = \sum_Z p(Z \mid x, \theta_{old}) \ln p(x, Z \mid \theta) = \mathbb{E}_{Z \mid x, \theta_{old}}[\ln p(x, Z \mid \theta)]$
  3. Maximization step
    • $\theta_{new} = \arg\max_\theta Q(\theta, \theta_{old}) = \arg\max_\theta \sum_Z p(Z \mid x, \theta_{old}) \ln p(x, Z \mid \theta)$
    • the log now acts directly on the joint $p(x, Z \mid \theta)$, inside the sum $\Rightarrow$ closed-form updates
  4. Check for convergence of either the log-likelihood or the parameter values
    • if not converged, set $\theta_{old} \leftarrow \theta_{new}$ and return to step 2 (for a GMM, $Q$ takes the explicit form shown below)
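
For a Gaussian mixture specifically, $Q$ can be written out explicitly (a standard result, filled in here for illustration; it is what the M-step formulas above actually maximize):

```latex
Q(\theta, \theta_{old})
  = \sum_{n=1}^{N} \sum_{k=1}^{K} r(Z_{nk})
    \Big[ \ln \pi_k + \ln N(x_n \mid \mu_k, \Sigma_k) \Big],
\qquad r(Z_{nk}) = p(Z_{nk} = 1 \mid x_n, \theta_{old})
```

Setting the derivatives w.r.t. $\mu_k$ and $\Sigma_k$ to zero, and maximizing over $\pi_k$ with a Lagrange multiplier for the constraint $\sum_k \pi_k = 1$, recovers exactly the M-step updates listed under "EM for Gaussian Mixtures".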

Theory behind EM algorithm#

  • maximizing $\ln p(x \mid \theta) = \ln \sum_Z p(x, Z \mid \theta)$ directly is difficult, so we maximize its lower bound $L(q, \theta)$ instead
  • $\ln p(x \mid \theta) = L(q, \theta) + KL(q \,\|\, p)$, where
    • $L(q, \theta) = \sum_Z q(Z) \ln \frac{p(x, Z \mid \theta)}{q(Z)}$
    • $KL(q \,\|\, p) = -\sum_Z q(Z) \ln \frac{p(Z \mid x, \theta)}{q(Z)} \geq 0$
  • the E-step sets $q(Z) = p(Z \mid x, \theta_{old})$, which makes $KL(q \,\|\, p) = 0$ so the bound is tight; the M-step then maximizes $L(q, \theta)$ w.r.t. $\theta$, which can only increase $\ln p(x \mid \theta)$
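
The decomposition itself follows from Bayes' rule, $p(x, Z \mid \theta) = p(Z \mid x, \theta)\,p(x \mid \theta)$, together with $\sum_Z q(Z) = 1$; written out for completeness (not spelled out in the notes):

```latex
L(q, \theta)
  = \sum_Z q(Z) \ln \frac{p(Z \mid x, \theta)\, p(x \mid \theta)}{q(Z)}
  = \ln p(x \mid \theta) + \sum_Z q(Z) \ln \frac{p(Z \mid x, \theta)}{q(Z)}
  = \ln p(x \mid \theta) - KL(q \,\|\, p)
```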
