[CS5340] Parameter Learning with Complete Data
  • Goal: how to estimate the unknown parameter $\theta$ of $p(x_1, ..., x_M | \theta)$ from fully observed data ($x_1, ..., x_M$ are all observed (known))?

  • Given: a set of $N$ independent and identically distributed (i.i.d.) complete observations of each random variable $X$: $\{ x_{1,1}, ..., x_{1,N}, ..., x_{M,1}, ..., x_{M,N} \}$

    • $M$ random variables, each observed $N$ times
  • Method:

    1. Maximum likelihood estimate (MLE)
    2. Maximum a posteriori (MAP)

Maximum Likelihood Estimate (MLE)#

Find the unknown parameters $\theta$ that maximize the likelihood $p(x_1, ..., x_M | \theta)$.

  • $\hat\theta = argmax_\theta [p(x_1, ..., x_M | \theta)] = argmax_\theta [\prod_{i=1}^N p(x_{1,i}, ..., x_{M,i} | \theta)]$ (i.i.d.)
    • $\prod_{i=1}^N p(x_{1,i}, ..., x_{M,i} | \theta) = p(x_{1,1}, ..., x_{M,1} | \theta) \times p(x_{1,2}, ..., x_{M,2} | \theta) \times ... \times p(x_{1,N}, ..., x_{M,N} | \theta)$
      we can factorize like this because each observation of the set of random variables is independent of the others

Maximum a Posteriori (MAP)#

Find the unknown parameters $\theta$ that maximize the posterior probability $p(\theta | x_1, ..., x_M)$.

  • $\hat\theta = argmax_\theta [p(\theta | x_1, ..., x_M)] = argmax_\theta [\frac{p(x_1, ..., x_M | \theta)\, p(\theta)}{p(x_1, ..., x_M)}]$ (Bayes' rule)
    $= argmax_\theta [\prod_{i=1}^N p(x_{1,i}, ..., x_{M,i} | \theta)\, p(\theta)]$ (the evidence $p(x_1, ..., x_M)$ can be dropped because it does not depend on $\theta$)
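To make the two objectives concrete before the DGM examples below, here is a minimal sketch (not from the notes) for a single hypothetical Bernoulli parameter: MLE maximizes the sum of log-likelihoods over the i.i.d. observations, and MAP simply adds the log prior. The coin-flip data and the Beta(2, 2) prior are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import bernoulli, beta

# hypothetical i.i.d. observations of a single binary variable (illustration only)
x = np.array([1, 1, 0, 1, 1, 0, 1, 1])

thetas = np.linspace(0.01, 0.99, 99)                                # candidate values of theta
log_lik = np.array([bernoulli.logpmf(x, t).sum() for t in thetas])  # sum of logs = log prod_i p(x[i] | theta)
log_prior = beta.logpdf(thetas, 2, 2)                               # arbitrary Beta(2, 2) prior p(theta)

theta_mle = thetas[np.argmax(log_lik)]              # argmax_theta p(x | theta)
theta_map = thetas[np.argmax(log_lik + log_prior)]  # argmax_theta p(x | theta) p(theta)
print(theta_mle, theta_map)                         # 0.75 vs 0.7: MAP is pulled toward the prior mean 0.5
```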

Example: Single Random Variable DGM#

(figure: single random variable DGM)

Continuous DGM#

  • Univariate Normal Distribution

    • $p(x|\theta) = Norm_x[\mu, \sigma^2] = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
    • goal is to find the two unknown parameters $\theta = (\mu, \sigma^2)$
  • M1: Maximum Likelihood Estimation (MLE)

    • $\hat\theta = argmax_\theta [p(x | \theta)] = argmax_\theta [\prod_{i=1}^N p(x[i] | \theta)]$
    • $p(x | \theta) = p(x | \mu, \sigma^2) = \prod_{i=1}^N Norm_{x[i]}[\mu, \sigma^2]$
    • maximize the logarithm
      • $\hat\mu, \hat\sigma^2 = argmax_{\mu, \sigma^2} \sum_{i=1}^N log[Norm_{x[i]}[\mu, \sigma^2]] = argmax_{\mu, \sigma^2} [-0.5N log[2\pi] - 0.5N log \sigma^2 - 0.5 \sum_{i=1}^N \frac{(x[i]-\mu)^2}{\sigma^2}]$
      • $\frac{\partial L}{\partial \mu} = \sum_{i=1}^N \frac{x[i]-\mu}{\sigma^2} = \frac{\sum_{i=1}^N x[i]}{\sigma^2} - \frac{N\mu}{\sigma^2} = 0$ (set to $0$)
      • $\frac{\partial L}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2}\sum_{i=1}^N \frac{(x[i]-\mu)^2}{\sigma^4} = 0$
      • we get $\hat\mu = \frac{\sum_{i=1}^N x[i]}{N} = \bar x$ and $\hat\sigma^2 = \frac{\sum_{i=1}^N (x[i] - \hat\mu)^2}{N}$
    • Maximum likelihood for the normal distribution gives least squares fitting
      • $\hat\mu = argmax_\mu [-\sum_{i=1}^N (x[i] - \mu)^2] = argmin_\mu \sum_{i=1}^N (x[i] - \mu)^2$
  • M2: Maximum a Posteriori (MAP)

    • $\hat\theta = argmax_\theta [\prod_{i=1}^N p(x[i]|\theta)\, p(\theta)]$
      • $p(x[i]|\theta)$: likelihood, the univariate normal distribution
        • $p(x|\mu, \sigma^2) = \prod_{i=1}^N Norm_{x[i]}[\mu, \sigma^2]$
      • $p(\theta)$: prior, the conjugate prior, i.e. the normal inverse gamma distribution
        • $p(\mu, \sigma^2) = NormInvGam_{\mu, \sigma^2}[\alpha, \beta, \gamma, \delta]$
    • $\hat\mu, \hat\sigma^2 = argmax_{\mu, \sigma^2} [\prod_{i=1}^N Norm_{x[i]}[\mu, \sigma^2] \cdot NormInvGam_{\mu, \sigma^2}[\alpha, \beta, \gamma, \delta]]$
    • Maximize the logarithm
      • $\hat\mu, \hat\sigma^2 = argmax_{\mu, \sigma^2} [\sum_{i=1}^N log[Norm_{x[i]}[\mu, \sigma^2]] + log[NormInvGam_{\mu, \sigma^2}[\alpha, \beta, \gamma, \delta]]]$
    • Take the derivatives w.r.t. $\mu$ and $\sigma^2$ and set them to $0$ (numerical sketches of both the MLE and MAP estimates follow this list)
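Two small sketches (with synthetic data; not from the notes) make the two estimates above concrete. First, the MLE closed form is just the sample mean and the biased (divide-by-$N$) sample variance:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)    # synthetic draws from Norm[mu=2, sigma^2=2.25]

mu_hat = x.sum() / len(x)                        # mu_hat = (1/N) sum_i x[i]
sigma2_hat = ((x - mu_hat) ** 2).sum() / len(x)  # sigma^2_hat = (1/N) sum_i (x[i] - mu_hat)^2

# matches numpy's mean and *biased* variance (ddof=0), i.e. the MLE, not the unbiased ddof=1 estimator
assert np.isclose(mu_hat, x.mean()) and np.isclose(sigma2_hat, x.var(ddof=0))
print(mu_hat, sigma2_hat)
```

For MAP, instead of deriving the closed form, the log-posterior can be maximized numerically. The sketch below assumes the usual normal-inverse-gamma factorization $p(\mu, \sigma^2) = Norm_\mu[\delta, \sigma^2/\gamma] \cdot InvGam_{\sigma^2}[\alpha, \beta]$; this mapping of $(\alpha, \beta, \gamma, \delta)$ and the hyperparameter values are assumptions made for illustration.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=50)
alpha, beta_, gamma_, delta = 1.0, 1.0, 1.0, 0.0   # arbitrary hyperparameters for illustration

def neg_log_posterior(params):
    mu, log_sigma2 = params                        # optimize log(sigma^2) so sigma^2 stays positive
    sigma2 = np.exp(log_sigma2)
    log_lik = stats.norm.logpdf(x, mu, np.sqrt(sigma2)).sum()             # sum_i log Norm_x[i][mu, sigma^2]
    log_prior = (stats.norm.logpdf(mu, delta, np.sqrt(sigma2 / gamma_))   # assumed Norm(mu | delta, sigma^2/gamma)
                 + stats.invgamma.logpdf(sigma2, alpha, scale=beta_))     # assumed InvGam(sigma^2 | alpha, beta)
    return -(log_lik + log_prior)

res = minimize(neg_log_posterior, x0=[x.mean(), np.log(x.var())])
mu_map, sigma2_map = res.x[0], np.exp(res.x[1])
print(mu_map, sigma2_map)    # close to the MLE, shrunk slightly toward the prior
```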

Discrete DGM#

  • $p(X=e_k|\theta) = Cat_x[\lambda] = \prod_{k=1}^K \lambda_k^{x_k} = \lambda_k$
    • $K$ parameters (categories); each observation $x$ is a one-hot vector $e_k$
  • Goal: find the $K$ unknown parameters $\theta = \{ \lambda_1, ..., \lambda_K \}$ where $\lambda_k \in [0,1]$ and $\sum_k \lambda_k = 1$
  • M1: Maximum Likelihood Estimation (MLE)
    • $\hat\theta = argmax_\theta [p(x|\theta)] = argmax_\theta [\prod_{i=1}^N p(x[i]|\theta)]$
    • $p(x|\lambda) = \prod_{i=1}^N Cat_{x[i]}[\lambda_{1...K}] = \prod_{i=1}^N \prod_{k=1}^K \lambda_k^{x_{ik}}$
    • MLE: taking the logarithm gives $\sum_{k=1}^K N_k log[\lambda_k]$, where $N_k = \sum_{i=1}^N x_{ik}$ is the number of observations of category $k$
    • apply the Lagrange multiplier $v$ on the constraint: $L = \sum_{k=1}^K N_k log[\lambda_k] + v(\sum_{k=1}^K \lambda_k - 1)$
    • take the derivative of $L$ w.r.t. $\lambda_k$ and $v$, set to $0$, and solve for $\lambda_k$, giving $\hat\lambda_k = \frac{N_k}{N}$
  • M2: Maximum a Posteriori (MAP)
    • Likelihood $p(x|\lambda)$: categorical distribution
    • Prior: Dirichlet distribution
      • $p(\lambda_1, ..., \lambda_K) = Dir_{\lambda_{1...K}}[\alpha_1, ..., \alpha_K]$
    • MAP: taking the logarithm of likelihood $\times$ prior gives $\sum_{k=1}^K (N_k + \alpha_k - 1) log \lambda_k$ up to a constant
    • apply the Lagrange multiplier $v$ on the constraint: $L = \sum_{k=1}^K (N_k + \alpha_k - 1) log \lambda_k + v(\sum_{k=1}^K \lambda_k - 1)$
    • take the derivative of $L$ w.r.t. $\lambda_k$ and $v$, set to $0$, and solve for $\lambda_k$, giving $\hat\lambda_k = \frac{N_k + \alpha_k - 1}{\sum_{j=1}^K (N_j + \alpha_j - 1)}$ (a small numerical sketch of both estimates follows this list)
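A small numerical sketch (hypothetical one-hot data, not from the notes) of the two closed forms just derived: the MLE is the category frequencies, and the MAP with a Dirichlet prior adds pseudo-counts $\alpha_k - 1$ before normalizing.

```python
import numpy as np

# hypothetical one-hot observations of a K = 3 category variable (each row is one x[i])
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 1, 0],
              [0, 0, 1],
              [0, 1, 0]])

N_k = X.sum(axis=0)                # N_k = sum_i x_ik, the count of each category
lam_mle = N_k / N_k.sum()          # MLE: lambda_k = N_k / N
print(lam_mle)                     # [0.2 0.6 0.2]

alpha = np.array([2.0, 2.0, 2.0])  # hypothetical Dirichlet hyperparameters (alpha_k >= 1)
lam_map = (N_k + alpha - 1) / (N_k + alpha - 1).sum()  # MAP: (N_k + alpha_k - 1) / sum_j (N_j + alpha_j - 1)
print(lam_map)                     # [0.25 0.5 0.25], pulled toward uniform relative to the MLE
```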

Parameter Learning in UGM (MRF)#

  • for example, an MRF in log-linear form
    • $p(y|\theta) = \frac{1}{Z(\theta)} e^{\sum_c \theta_c^T \phi_c(y)}$
    • scaled log-likelihood: $l(\theta) = \frac{1}{N} \sum_i^N log\, p(y_i|\theta) = \frac{1}{N} \sum_i^N [\sum_c \theta_c^T \phi_c(y_i) - log\, Z(\theta)]$
  • Since MRFs are in the exponential family, this function is convex in $\theta$
    • it has a unique global maximum, so we can use gradient-based optimizers
  • the derivative of the log partition function w.r.t. $\theta_c$ is the expectation of the $c^{th}$ feature under the model
    • $\frac{\partial log\, Z(\theta)}{\partial \theta_c} = E[\phi_c(y)|\theta] = \sum_y \phi_c(y)\, p(y|\theta)$
    • the gradient of the log-likelihood is $\frac{\partial l}{\partial \theta_c} = \frac{1}{N} \sum_i^N \phi_c(y_i) - E[\phi_c(y)] = E_{p_{emp}}[\phi_c(y)] - E_{p(y|\theta)}[\phi_c(y)]$; setting it to $0$ cannot be solved in closed form because the model expectation requires inference over $p(y|\theta)$
    • solution: use gradient-based optimizers + approximate inference
  • M1: Stochastic Maximum Likelihood
    • stochastic gradient descent method
    • iteratively update the parameter $\theta_{k+1}$ at the $k^{th}$ step
      • $\theta_{k+1} \leftarrow \theta_k - \eta g_k$
    • (figure: stochastic gradient descent algorithm; a minimal sketch follows this list)
  • M2: Iterative Proportional Fitting (IPF)
    • alternative derivation of the gradient in terms of $\psi_c(y_c)$
      (figures: IPF derivation and algorithm; the resulting update is restated after this list)
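As a concrete version of the M1 idea, here is a minimal sketch (not the lecture's algorithm) for a hypothetical 3-node binary chain MRF with pairwise features $\phi_c(y) = y_i y_j$: the model expectation in the gradient is approximated with Gibbs samples, and the parameters follow $\theta_{k+1} \leftarrow \theta_k - \eta g_k$, where $g_k = E_{p(y|\theta_k)}[\phi_c(y)] - E_{p_{emp}}[\phi_c(y)]$ is the estimated negative gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def features(y):
    # pairwise features phi_c(y) = y_i * y_j for the two edges of a 3-node chain, y in {0,1}^3
    return np.array([y[0] * y[1], y[1] * y[2]])

def sample_model_expectation(theta, n_samples=500, n_burn=100):
    # Gibbs sampler for p(y | theta) ∝ exp(theta_1*y1*y2 + theta_2*y2*y3); returns a
    # Monte-Carlo estimate of E[phi_c(y) | theta], the term with no closed form
    y = rng.integers(0, 2, size=3)
    feats = []
    for t in range(n_burn + n_samples):
        y[0] = rng.random() < sigmoid(theta[0] * y[1])                    # p(y1 = 1 | y2)
        y[1] = rng.random() < sigmoid(theta[0] * y[0] + theta[1] * y[2])  # p(y2 = 1 | y1, y3)
        y[2] = rng.random() < sigmoid(theta[1] * y[1])                    # p(y3 = 1 | y2)
        if t >= n_burn:
            feats.append(features(y))
    return np.mean(feats, axis=0)

# hypothetical fully observed dataset: each row is one complete observation y_i
data = np.array([[1, 1, 1], [1, 1, 0], [0, 1, 1], [1, 1, 1], [0, 0, 0]])
emp_expectation = np.mean([features(y) for y in data], axis=0)            # E_{p_emp}[phi_c(y)]

theta, eta = np.zeros(2), 0.5
for k in range(50):
    g = sample_model_expectation(theta) - emp_expectation   # g_k = E_model[phi] - E_emp[phi] = -dl/dtheta
    theta = theta - eta * g                                  # theta_{k+1} <- theta_k - eta * g_k
print(theta)   # both parameters settle near log(3) ≈ 1.1, up to Monte-Carlo noise
```

For M2, the figures above are not reproduced here; the IPF update, as it is usually stated, rescales each clique potential by the ratio of the empirical clique marginal to the model's current clique marginal, $\psi_c^{new}(y_c) = \psi_c^{old}(y_c)\, \frac{p_{emp}(y_c)}{p(y_c|\theta)}$, so that at convergence the moment-matching condition $E_{p_{emp}}[\phi_c(y)] = E_{p(y|\theta)}[\phi_c(y)]$ holds.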

Parameter Learning in UGM (CRF)#

  • scaled log-likelihood: $l(w) = \frac{1}{N} \sum_i^N log\, p(y_i|x_i,w) = \frac{1}{N} \sum_i^N [\sum_c w_c^T \phi_c(y_i,x_i) - log\, Z(w,x_i)]$
  • gradients: $\frac{\partial l}{\partial w_c} = \frac{1}{N} \sum_i^N [\phi_c(y_i, x_i) - E[\phi_c(y, x_i)]]$
  • approximate the model expectation in the gradient with MCMC samples, then update $w$ with stochastic gradient descent, as in the MRF case
  • we can also do MAP estimation of the unknown parameters in a CRF
    • $argmax_w [\sum_i log\, p(y_i|x_i, w) + log\, p(w)]$
    • a Gaussian prior is often used for $p(w)$
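A zero-mean Gaussian prior gives $log\, p(w) = -\frac{1}{2\sigma^2} \|w\|^2 + const$, so this MAP objective is the CRF log-likelihood with an L2 (weight-decay) penalty: $\hat w = argmax_w [\sum_i log\, p(y_i|x_i, w) - \frac{1}{2\sigma^2} \|w\|^2]$. This is why CRF training is commonly described as L2-regularized maximum likelihood.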