\(\def \d \ensuremath{\mathbf{d}}\) \(\def \e \ensuremath{\mathbf{e}}\) \(\def \g \ensuremath{\mathbf{g}}\) \(\def \I \ensuremath{\mathbf{I}}\) \(\def \l \ensuremath{\mathbf{l}}\) \(\def \M \ensuremath{\mathbf{M}}\) \(\def \W \ensuremath{\mathbf{W}}\) \(\def \Omegab \ensuremath{\boldsymbol{\Omega}}\)

1 Introduction

The EM algorithm is a general-purpose optimization algorithm widely considered as a “workhorse” computing tool in statistics.

While it was originally introduced for the purpose of optimization problems in the presence of missing data, we will see that it can also be adapted to broader problems in statistics by a reformulation of the original problem. In such settings, the EM algorithm may offer a convenient alternative to the methods introduced in the previous lecture, where the analytical forms of the derivatives may be complex or difficult to evaluate.

Examples of such broader applications include random effect, finite mixture, hidden markov, and latent class models. The relative ease of its implementation to perform maximum likelihood estimation in such complex models is also an attractive feature of EM.

1.1 Example: Finite Mixture Models

In HW5 we look at an example of a finite mixture model. There, we assumed that the number of fish caught by 4075 individuals in a park, \(y_1\ldots y_{4075}\), came from a mixture of two subpopulations, for example, the non-fishing population (with probability \(\pi\) when \(y_i=0\)) and the fishing population (with probability \((1-\pi)\)). The PMF was given by \[p(Y = y_i|\pi,\lambda) = (\pi + (1-\pi)e^{-\lambda})^{I[y_i=0]}\left((1-\pi)\frac{e^{-\lambda}\lambda^{y_i}}{y_i!}\right)^{I[y_i>0]}\]

This model, called a zero-inflated Poisson model, accounts for the excess zeros that are due to the non-fishing population, and, assuming the model correctly represents the observed data, should provide a less biased estimate of \(\lambda\) when fitted to the data relative to the regular Poisson model. Here, the probability of observing a zero is simply the probability that an individual came from the non-fishing population (with probability \(\pi\)) or came from the fishing population and caught zero fish (with probability \((1-\pi)e^{-\lambda}\)).

Let’s take this problem a step further. Let us assume that at a bass fishing tournament, a statistician took a survey of participants at the end of the week, asking them how many fish they had caught during the tournament The statistician also noted the age of each individual, the length of their boat in feet, and the approximate volume of the cooler they had brought in liters (\(l \times w \times h\)). One of the things that the statistician wanted to determine was how these covariates were related to the number of fish an individual caught during the week.

For the purposes of illustration, we simulate this data. We provide the simulation code in the RMD file for this lecture, which you can run if you want to follow along. We will come back to this simulation setup later.

She ran a Poisson GLM on the data and found that none of the variables except for cooler seemed to associate with outcome:

fit = glm(y ~ X - 1, family = poisson())
summary(fit)
## 
## Call:
## glm(formula = y ~ X - 1, family = poisson())
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.8333  -1.6685  -0.1109   1.4540   4.7097  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## X(Intercept)  2.9391058  0.0874783  33.598  < 2e-16 ***
## Xage         -0.0008318  0.0013301  -0.625 0.531751    
## Xboat_length  0.0001603  0.0006356   0.252 0.800826    
## Xcooler       0.0045485  0.0013213   3.442 0.000577 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 47411  on 500  degrees of freedom
## Residual deviance:  1813  on 496  degrees of freedom
## AIC: 4233.7
## 
## Number of Fisher Scoring iterations: 4

When plotting the data, something seemed a little off with the distribution:

hist(y, breaks =  100)

Lets take a look at the pearson residuals to get a better look

hist(residuals(fit, type="pearson"), breaks = 20)

Looks like the distribution may be bimodal even after accounting for the predictor effects. Lets look at each of the predictors individually with respect to the log number of fish caught.

# create combined matrix object
mat = cbind(log(y + 1), X[, -1])

# name the first column y
colnames(mat)[1] = "y"

# plot each column against each other using pairs and smoothscatter
pairs(
  mat,
  panel = function(...)
    smoothScatter(..., nrpoints = 0, add = TRUE),
  gap = 0.2
)

In the first row we can see evidence that there may be different relationships between the cooler variable (x-axis) and the number of fish caught (y-axis) in two possible subgroups of participants. Each group also seems to have a different baseline numbers of fish caught and different trend with cooler. This could make sense, as both amateur and professional fisherman were allowed to register for the tournament, where we would assume the professional fisherman would catch more than the amateur ones.

The statistician wanted to go ahead and fit a mixture model to the data, again assuming there are two latent populations at hand. Let’s write the general form of a mixture model with \(K\) mixture components, sometimes referred to as “states”, where each component corresponds to a particular assumed subpopulation in the data. We can then write the general form for the density of this mixture as the following:

\[ f(y_i|\boldsymbol{\theta}) = \sum_{k =1}^K \pi_kf_k(y_i|\boldsymbol{X}_{ik}, \boldsymbol{\beta}_k) \]

where \(\boldsymbol{\theta} = (\boldsymbol{\pi}^T, \boldsymbol{\beta}^T)^T\), \(\boldsymbol{\pi} = (\pi_1,\ldots,\pi_K)^T\), \(\sum_{k=1}^K \pi_k = 1\), \(\boldsymbol{\beta} = (\boldsymbol{\beta}_1^T,\ldots,\boldsymbol{\beta}_K^T)^T\), and \(\boldsymbol{\beta}_k = (\beta_{1k},\ldots,\beta_{pk})^T\). Here, \(f_k(y_i|\boldsymbol{X}_{ik}, \boldsymbol{\beta}_k) = \frac{e^{-\lambda_{ik}}\lambda_{ik}^{y_i}}{y_i!}\) and \(\lambda_{ik} = \exp(\boldsymbol{X}_{ik}\boldsymbol{\beta}_k)\). That is, we assume that the data has been generated from a mixture of poisson regression models each with \(p_k\) dimensional predictor vector \(\boldsymbol{X}_{ik}\) and regression coefficients \(\boldsymbol{\beta}_k\). When \(\boldsymbol{\beta}_k = \beta_{1k}\) and \(\boldsymbol{X}_{ik} = 1\), then this problem reduces to a simple Poisson finite mixture model, where the mean of each mixture component \(\lambda_{ik} = \lambda_{k} = \exp(\beta_{1k})\).

One way to interpret this model is that we assume that the prior probability that an individual comes from component (subpopulation) \(k\) is \(\pi_k\), and, given that the individual is from subpopulation \(k\), the density of \(y_i\) conditional on membership to component \(k\) is \(f_k(y_i|\boldsymbol{X}_{ik}, \boldsymbol{\beta}_k)\). Since we do not know exactly which component an observation actually came from, we take the product of \(\pi_k\) and \(f_k(y_i|\boldsymbol{X}_{ik}, \boldsymbol{\beta}_k)\) across all \(k\) to get the overall density. In a sense, this can be thought of as taking a weighted average of all \(K\) component densities, as we do not know the component membership in advance, where the weights are the \(\pi_k\)’s.

If we somehow knew the component membership of each subject in advance, then we would not need a finite mixture model to begin with. Instead, we could just use each poisson regression model \(f_k\) to model the observations pertaining to component \(k\). When we discuss the “complete data log likelihood” later, we will come back to this idea.

The log likelihood can be written as \[l(\boldsymbol{\theta}) = \sum_{i=1}^n \log\left( \sum_{k =1}^K \pi_kf_k(y_i|\boldsymbol{X}_{ik}, \boldsymbol{\beta}_k)\right)\]

In the previous homework it was obvious that fitting these types of mixture models using NR was very tedious, as that the first and second derivatives become very complicated. The sum within the log complicates such derivatives especially.

We can show, however, the application of the EM algorithm can simplify the maximization procedure.

1.2 The Basic Idea

The general intuition behind the EM algorithm involves the maximization of a surrogate function in lieu of the original function/likelihood, which may be more amenable to maximization with standard approaches such as NR or BFGS.

In this manner, the problem at hand is transformed from a missing or “incomplete” data problem to a “complete” data problem, where the missing data is assumed to be known.

Assuming the missing data to be known reduces the complexity of the maximization problem and oftentimes has a much nicer form than the original likelihood. In cases where there is no actual “missing” data in the original likelihood (like in the FMR example above), we may introduce some missing data to make it amenable for maximization via EM.

One natural question is how one can actually maximize such a complete data model. Obviously we do not know what the actual values of the missing data are, so how can we actually work with a CDLL?

1.3 Algorithm Strategy

The surrogate function being maximized here is the expectation of the complete data log likelihood with respect to the “missing” or “latent” data, conditional on the observed data and the current estimates of the model parameters. This function is often simpler in form than the original log likelihood, and is more amenable to maximization.

The EM algorithm alternates between two main steps, the “Expectation step” or “E-step”, and the “Maximization step” or “M-step”. In a very general sense, during EM we “fill in” the missing data with an educated guess (E step). We then maximize the complete data log likelihood with the missing data “filled in” using standard optimization methods (M-step). We iterate between the E and M steps until model convergence in terms of the parameters or likelihood.

This way, we do not directly deal with missing data in the M-step, facilitating the application of existing maximization routines in the M-step. In this lecture we will first start with the formulation of the EM approach, its general properties, variants of this approach, and finally finish with some examples.

2 Algorithm Setup

2.1 General formulation

Let \(\boldsymbol{y}\) be the \(n\)-dimensional random vector pertaining to the vector of the observed data \(\boldsymbol{y}_o\). Let us assume that \(\boldsymbol{y}\) is distributed with PDF \(g(\boldsymbol{y} | \boldsymbol{\theta})\), where \(\boldsymbol{\theta} = (\theta_1,\ldots,\theta_d)\) is a \(d\)-dimensional vector of unknown parameters to be estimated and \(\boldsymbol{\theta} \in \Omegab\), \(\Omegab\) being some \(d\)-dimensional space for \(\boldsymbol{\theta}\).

To be clear, we are treating \(\boldsymbol{y}_o\) here in the general sense in that it pertains to all the observed data for a particular model. This is exactly how it sounds like, in that it pertains to the data that we can actually collect and have on hand in our problem.

In some situations, it may be helpful to reformulate a problem with no missing data into a missing data problem. This may be done, for example, by introducing a latent variable that may simplify the likelihood and thus computation. Such latent variables may be considered as hypothetical and never observable in some sense, but as we will see later, leads to nice form of the complete data likelihood that is more amenable to maximization.

In either case, we may term the observed data \(\boldsymbol{y}_o\) as the “incomplete data” in this setting, and the “complete” or “augmented” data as as \(\boldsymbol{y}_c\), where \(\boldsymbol{y}_c = (\boldsymbol{y}_o^T, \boldsymbol{z}^T)^T\), where \(\boldsymbol{z}\) pertains to the vector of missing or unobservable data.

2.1.1 Simple examples of “missing” data in this context

Examples of \(\boldsymbol{z}\) in the former case may pertain to the actual (unobserved) survival times of patients in the study who were censored in the clinical trial. For example, those patients who did not pass away by the end of the study or were lost to follow up. We only observe their survival time up until their last follow up (\(\boldsymbol{y}_o\)).

However, if we wait long enough, we should be able to observe the survival times for all patients in the trial, without any censoring. Or, if we were able to track down those patients that were lost to follow up, we could observe how long they actually had survived. Here, the data is truly missing in that in some scenarios there is a possibility that we can observe the data.

Examples of \(\boldsymbol{z}\) in the latter case may pertain to the set of class memberships in a finite mixture model, or state-membership in a hidden markov model. Unlike the survival example, there is no possibility to observe such states in reality. Assuming that each observation belongs to a single class/state greatly simplifies the problem and facilitates the maximization of the model. We will give examples of these types of models in a bit. Connecting this with the example at the beginning of this lecture, the statistician cannot not observe whether an individual is from the amateur or professional fishing group after the fact.

2.1.2 Defining the Complete Data Log Likelihood

Let us assume that the distribution for the complete data vector \(\boldsymbol{y}_c\) is given by the pdf \(g(\boldsymbol{y}_c | \boldsymbol{\theta})\). Given this setup, we can define the complete data log likelihood function as \[\log L_c(\boldsymbol{\theta}) = \log g(\boldsymbol{y}_c| \boldsymbol{\theta}).\] From this, it is clear that the likelihood can be simply obtained by integrating out the missing data from the complete data likelihood, such that \[g(\boldsymbol{y}_o,\boldsymbol{\theta}) = \int g(\boldsymbol{y}_c| \boldsymbol{\theta})d\boldsymbol{z} \]

2.1.3 Q-function and Initialization

The objective function to be maximized over can be considered a surrogate function of the likelihood, termed the “Q-function”. This function is defined at the \(t\)th step as \[Q(\boldsymbol{\theta}|\boldsymbol{\theta}^{(t)}) = E\left[ \log L_c(\boldsymbol{\theta}) | \boldsymbol{y}_o,\boldsymbol{\theta}^{(t)}\right],\] the expectation of the complete data log likelihood with respect to the missing data \(\boldsymbol{z}\), given the observed data and the current value of the parameter estimates.

Similar to the algorithms introduced in the prior lecture, the EM algorithm is iterative and begins at some starting value for \(\boldsymbol{\theta}\), which we denote as \(\boldsymbol{\theta}^{(0)}\). Alternatively, in some cases it may make sense to start with an initial value of the E-step, and then proceed to the M-step. We will give an example of this situation later in the finite mixture regression model example.

Better starting values may result in faster convergence, as well as a higher chance of converging to the global maximum as opposed to a local one. In a later section will be discuss some of the properties of the EM, one of which is that it does not guarantee global but only local convergence.

2.1.4 E-step

In the E-step, the expected value of the complete data likelihood is updated, given the current value of the parameter estimates and the observed data. In some sense, we fill in the missing data in the complete data log likelihood with their expected values at that step, given the observed data and current parameter estimates from the M-step. That is, at iteration \(t\), we calculate \(Q(\boldsymbol{\theta}|\boldsymbol{\theta}^{(t)})\), where \(Q(\boldsymbol{\theta}|\boldsymbol{\theta}^{(t)}) = E\left[ \log L_c(\boldsymbol{\theta}) | \boldsymbol{y}_o,\boldsymbol{\theta}^{(t)}\right].\)

Now, in many cases this analogy of “filling in” the missing data with their conditional expections may be an accurate description. However, in a more general sense we are simply computing the conditional expectation of the complete data log likelihood with respect to the missing data. For example, if we denote the missing data in a problem was some variable \(q\), and in the complete data log likelihood both the terms \(q\) and \(q^2\) appear, \(E[q^2| \boldsymbol{y}, \boldsymbol{\theta}] \neq E[q, \boldsymbol{y}, \boldsymbol{\theta}]^2\). That is, we cannot simply “fill in” the guess for \(q^2\) in the CDLL with the square of the guess for \(q\). We will still have to formally evaluate the conditional expectation for \(q^2\) and fill the value in for that particular term.

Another way to think about this is that we are integrating out the missing data from the complete data log likelihood weighted by the posterior distribution of the missing data, given the observed data and current parameter estimates. If complex functions of the missing data are present in the CDLL this can complicate the evaluation of this integral, and hence the E-step.

Luckily, in many cases this integral (such as in the current example), can reduce to simple forms. We will show examples of both common simple cases and more complex ones, and how one can address the latter cases with extensions of the EM algorithm later in this lecture.

2.1.5 M-step

At step \(t\), the M-step maximizes the Q-function \(Q(\boldsymbol{\theta}|\boldsymbol{\theta}^{(t)})\) with respect to \(\boldsymbol{\theta}\) over the parameter space \(\Omegab\). In other words, \(\boldsymbol{\theta}^{(t+1)}\) is chosen such that \(Q(\boldsymbol{\theta}^{(t+1)}|\boldsymbol{\theta}^{(t)}) > Q(\boldsymbol{\theta}|\boldsymbol{\theta}^{(t)})\) \(\forall \boldsymbol{\theta} \in \Omegab\).

Optimization methods such as those discussed in the prior lecture are now applicable in the M-step, simplifying the optimization procedure relative to before.

In this sense, the EM algorithm is modular, where one can apply existing maximization procedures to maximize the Q-function even in situations where the likelihood is quite complicated, for example necessitating the evaluation of multidimensional integrals (such as GLMMs) or require recursive computation (such as HMMs).

2.1.6 Convergence Criteria

The algorithm iterates between the E and M steps until the value of the likelihood, Q-function, or parameter estimates converge (one can pick any of the three). In cases where the likelihood is difficult to compute, such as in GLMMs, using the Q-function or parameter estimates may be preferable over using the likelihood function for convergence. The same principles regarding choosing informative starting points and convergence criteria apply to the EM algorithm as well.

We will see that some of the properties of the EM algorithm enables it to be quite robust regardless of selected starting points, but can be slower to converge relative to other methods. It is also not immune to converging to local maxima.

2.1.7 General comments

In the E-step one only needs to know the condition density of the missing data given the observed data. In cases where this density is unknown or intractable, approaches such as the monte-carlo EM may be utilized to approximate the E-step using sampling-based approaches.

2.2 Application of EM to the Poisson Finite Mixture Regression Model Example

So we introduced a fair bit of notation here, and to really illustrate how this approach works let’s start our original example. Here we reformulate the problem at the beginning of this lecture by introducing a latent variable that allows for maximization via EM.

In each case, we will break down the application of the EM algorithm using the same structure introduced above:

  • Define the log likelihood (observed data log likelihood)
  • Define the Complete Data Log Likelihood
  • Derive the Q-function
  • Compute the E-step
  • Compute the M-step
  • Define initialization and convergence criteria

When you are applying the EM algorithm in process, it is helpful to break down the problem into the above steps to clarify how each portion of the algorithm is being carried out.

2.2.1 Log Likelihood

From the general form of the FMR model given earlier, the log likelihood can be written as \[l(\boldsymbol{\theta}) = \sum_{i=1}^n \log\left( \sum_{k =1}^2 \pi_kf_k(y_i|\boldsymbol{X}_{ik}, \boldsymbol{\beta}_k)\right),\] where \(f_k(y_i|\boldsymbol{X}_{ik}, \boldsymbol{\beta}_k) = \frac{e^{-\lambda_{ik}}\lambda_{ik}^{y_i}}{y_i!}\), \(\lambda_{ik} = \exp(\boldsymbol{X}_{ik}\boldsymbol{\beta}_k)\), and \(\sum_{k = 1}^K \pi_k = 1\).

In this example \(\boldsymbol{X}_{ik} = \boldsymbol{X}_i\), where \(\boldsymbol{X}_i\) is a \(1 \times 4\) vector pertaining to the intercept, age, log boat length in feet, and log cooler size in liters, and \(\boldsymbol{\beta}_k = (\beta_{0k}, \beta_{1k}, \beta_{2k}, \beta_{3k})^T\) are the regression coefficients component \(k\) pertaining to \(\boldsymbol{X}_i\). Therefore, we assume that the mean of each mixture component is modeled with different sets of regression coefficients but share the same set of predictors. This allows for the same predictor to potentially have different effects in each component. We also assume here that an observation may come from either the amateur or professional fishing populations, however the component membership of each observation is not known or observed.

2.2.2 Complete Data Log Likelihood

Given that the derivatives based on this log likelihood are complicated, the application of NR or even gradient-based methods may be tedious.

Let us instead construct the CDLL in the following manner, assuming the original component membership of each observation was known:

\[\begin{align} L_c(\theta) &= \sum_{i=1}^n \log( \prod_{k =1}^2 \left[\pi_kf_k(y_i|\boldsymbol{X}_i, \boldsymbol{\beta}_k)\right]^{I[z_i = k]}) \\ &= \sum_{i=1}^n \sum_{k =1}^2 I[z_i = k]\left[\log(\pi_k)+\log(f_k(y_i|\boldsymbol{X}_i, \boldsymbol{\beta}_k))\right]\\ \end{align}\]

where \(\boldsymbol{z} = (z_1,\ldots,z_n)\) and \(z_i\), where \(z_i=1,2\), represents the component membership pertaining to observation \(i\), \(i = 1,\ldots,n\). By assuming the component membership is known, we can simplify the expression in the CDLL to the above, which no longer contains the log of a sum.

In reality, we do not know which component an individual came from. Thus, we still have the problem that we do not actually observe \(\boldsymbol{z}\), and we need to maximize the above expression. Lets now move on to the E and M-steps.

2.2.3 E-step

Now we must evaluate the expectation of the CDLL, \(Q(\boldsymbol{\theta}|\boldsymbol{\theta}^{(t)}) = E\left[ \log L_c(\boldsymbol{\theta}) | \boldsymbol{y}_o,\boldsymbol{\theta}^{(t)}\right]\). Here, this simply reduces to

\[\begin{align} Q(\boldsymbol{\theta}|\boldsymbol{\theta}^{(t)}) &=\sum_{i=1}^n \sum_{k =1}^2 E[I[z_i = k] | \boldsymbol{y}_o, \boldsymbol{\theta}^{(t)}]\left[\log(\pi_k)+\log(f_k(y_i|\boldsymbol{X}_i, \boldsymbol{\beta}_k))\right]\\ &= \sum_{i=1}^n \sum_{k =1}^2 p(z_i = k | \boldsymbol{y}_o, \boldsymbol{\theta}^{(t)})\left[\log(\pi_k)+\log(f_k(y_i|\boldsymbol{X}_i, \boldsymbol{\beta}_k))\right] \end{align}\]

We can show that

\[\begin{align} p(z_i = k | \boldsymbol{y}_o, \boldsymbol{\theta}^{(t)}) &= \frac{p(z_i = k,\boldsymbol{y}_o | \boldsymbol{\theta}^{(t)})}{p(\boldsymbol{y}_o | \boldsymbol{\theta}^{(t)})} \\ &= \frac{p(z_i = k,\boldsymbol{y}_o| \boldsymbol{\theta}^{(t)})}{\sum_{k =1}^2p(z_i = k,\boldsymbol{y}_o | \boldsymbol{\theta}^{(t)})}\\ &= \frac{\pi_k^{(t)}f_k(y_i|\boldsymbol{X}_i, \boldsymbol{\beta}^{(t)}_k)}{ \sum_{k =1}^2\pi_k^{(t)}f_k(y_i|\boldsymbol{X}_i, \boldsymbol{\beta}^{(t)}_k)} \end{align}\]

using the definition of conditional probability. So, evaluating the Q-function is facilitated by simply computing the expectation above. It is easy to see here that we are “filling in” the function of the latent data \(I[z_i = k]\) with \(p(z_i = k | \boldsymbol{y}_o, \boldsymbol{\theta}^{(t)})\), making it possible to maximize this function. This quantity is sometimes referred to as the “posterior probability” of subject \(i\) belonging to component \(k\) at iteration \(t\).

Upon convergence, we will see that we can actually use these values to classify subjects into components.

2.2.4 M-step

Here we now maximize \(Q(\boldsymbol{\theta}|\boldsymbol{\theta}^{(t)}) = \sum_{i=1}^n \sum_{k =1}^2 p(z_i = k | \boldsymbol{y}_o, \boldsymbol{\theta}^{(t)})\left[\log(\pi_k)+\log(f_k(y_i|\boldsymbol{X}_i, \boldsymbol{\beta}_k))\right]\). We can simplify this to maximizing \[\sum_{i=1}^n p(z_i = k | \boldsymbol{y}_o, \boldsymbol{\theta}^{(t)})\log(f_k(y_i|\boldsymbol{X}_i, \boldsymbol{\beta}_k))\] with respect to \(\boldsymbol{\beta}_k\), \(k = 1,2\), and \[\sum_{i=1}^np(z_i = k | \boldsymbol{y}_o, \boldsymbol{\theta}^{(t)})\log(\pi_k)\] with respect to \(\pi_k\), \(k = 1,2\). This is because the Q-function nicely separates with respect to each set of parameters.

For each \(\boldsymbol{\beta}_k\), we can implement one of the techniques described in the previous lecture to maximize the portion of the Q-function above that pertains to \(\boldsymbol{\beta}_k\). However, we can also simplify this step by recognizing that \(\sum_{i=1}^n p(z_i = k | \boldsymbol{y}_o, \boldsymbol{\theta}^{(t)})\log(f_k(y_i|\boldsymbol{X}_i, \boldsymbol{\beta}_k))\) takes on a familiar form.

For example, if we consider \(p(z_i = k | \boldsymbol{y}_o, \boldsymbol{\theta}^{(t)})\) as “prior weights” in the GLM framework, we can obtain \(\boldsymbol{\beta}_k^{(t+1)}\) by maximizing a weighted Poisson GLM with prior weights \(p(z_i = k | \boldsymbol{y}_o, \boldsymbol{\theta}^{(t)})\) for each observation, responses \(y_1\ldots y_n\), and predictor matrix \(\boldsymbol{X}\). From prior coursework, you may remember that these prior weights are incorporated into the IRLS algorithm by multiplying them with the IRLS weights. For example, we can define new IRLS weights \(w_i^{*(t)} =p(z_i = k | \boldsymbol{y}_o, \boldsymbol{\theta}^{(t)})w_i^{(t)}\). This can be easily done with off the shelf methods, or simply with the code given in the prior lecture. If \(p(z_i = k | \boldsymbol{y}_o, \boldsymbol{\theta}^{(t)}) = 1\) for all \(i = 1,\ldots,n\), then this problem just reduces to fitting a regular (unweighted) poisson regression model.

For \(\pi_k\), we can show that \(\pi_k^{(t+1)} = \sum_{i=1}^n p(z_i = k | \boldsymbol{y}_o, \boldsymbol{\theta}^{(t)})/n\), that is, the mean of the posterior probabilities of each individual belonging to component \(k\). How do we get this? We have the constraint that \(\sum_{i=1}^{K} \pi_k = 1\), therefore we can compute the Langrangian for this problem and set its partial derivative to zero (don’t need to worry about this). Another way to think about this is that if we actually did observe the states, we could simply calculate \(\hat{\pi}_k = \sum_{i = 1}^nI[z_i = k]/n\). Since they are not observed, we can take the conditional expectation of each \(z_i\) to give \(\pi_k^{(t+1)} = \sum_{i=1}^n p(z_i = k | \boldsymbol{y}_o, \boldsymbol{\theta}^{(t)})/n\).

Therefore, in the M-step maximizing the Q-function with respect to \(\boldsymbol{\beta}_k\) reduces to the problem of computing a weighted GLM in each component. For \(\pi_k\) the computation reduces to simply computing the mean of the component posterior probabilities of membership in each component. This is drastically simpler than the alternative approaches that try to maximize the log likelihood directly. Also, as we will see, this also simplifies the coding implementation by using existing maximization routines from the last lecture.

2.2.5 Initialization and Convergence

As we will see later, the EM is not guaranteed to converge to a global optimum but to a local optimum or saddle point. As a result, choosing multiple initializations is helpful, keeping the one with the best final likelihood after convergence.

In mixture models it may not be clear how to initialize \(\boldsymbol{\pi}\) and \(\boldsymbol{\beta}\), but it may be easier to initialize \(\boldsymbol{z}\) instead. For example, we may randomly assign the cluster assignment to either component 1 or component 2, and then initialize \(p(z_i = 1 | \boldsymbol{y}_o, \boldsymbol{\theta}^{(0)}) = 1\) for subjects assigned to component 1 and 0 otherwise, and set \(p(z_i = 2 | \boldsymbol{y}_o, \boldsymbol{\theta}^{(t)}) = 1- p(z_i = 1 | \boldsymbol{y}_o, \boldsymbol{\theta}^{(t)})\).

If an informative guess may be utilized for an initial partition, for example, assuming those with higher fish counts are more likely to be professional rather than amateur fishers, one may pick an arbitrary count cutoff for the count threshold for initial assignment. Since we do not know what threshold is best, we can vary this threshold over a range of possible values and rerun the algorithm.

For convergence, we can use parameter-based criteria as before, or simply use the updated value of the log likelihood at each iteration.

2.2.6 And finally, the code implementation

We wont wrap this in a function just yet so that its easier to explore and inspect some of the interior components of the algorithm.

### Initialization and setup

## starting parameters
# convergence threshold
tol = 10 ^ -5
# max number of EM iterations
maxit = 50
# iteration counter
iter = 0
# arbitrary large to avoid triggering the while loop for delta
eps = Inf
# arbitrary large value to set the inital ll to
ll = -10000
# list object for M-step
fit = list()
# Number of mixture components, K
K = 2

## create posterior probability matrix
# First col is p(z_i = 1 | y_o, theta^{(t)}, amateur
# Second col is p(z_i = 2 | y_o, theta^{(t)}, professional
pp = matrix(0, n, K)

# name the columns
colnames(pp) = c("amateur", "prof")

## use informative initialization,  assuming lower counts --> amateur
#  choose prop of samples that are amateur first
#  should vary this and re-run to evaluate multiple starting points/partitions
prop_ama = 0.6

# set everything greater than prop_ama'th percentile to be professional
pp[, 2] = (y > quantile(y, prop_ama)) ^ 2
pp[, 1] = 1 - pp[, 2]

### now start the EM algorithm
start = Sys.time()
while (eps > tol & iter < maxit) {
  ## save old ll
  ll0 = ll
  
  
  ## start M-step (Sec. 2.2.4)
  # pi, mean of each pp column
  pi = colMeans(pp)
  
  # beta_k, weighted glm's based on pp
  for (k in 1:K)
    fit[[k]] = glm(y ~ X - 1, family = poisson(), weights = pp[, k])
  
  
  ## start E-step (Sec. 2.2.3)
  # calculate numerator, fitted values are exp(X %*% beta_hat)
  for (k in 1:K)
    pp[, k] = pi[k] * dpois(y, lambda = fit[[k]]$fitted)
  
  # divide by denominator, the sum across components in each i
  pp = pp / rowSums(pp)
  
  
  ## calculate LL (Sec 2.2.1)
  # Calculate sum across K inside the log for each subject
  interior_sum = rep(0, length(y))
  for (k in 1:K)
    interior_sum = interior_sum + pi[k] * dpois(y, lambda = fit[[k]]$fitted)
  
  # Take the log of each, then sum across the n subjects to get the ll
  ll = sum(log(interior_sum))
  
  ## calculate relative change in log likelihood
  eps  = abs(ll - ll0) / abs(ll0)
  
  ## update iterator
  iter = iter + 1
  if (iter == maxit)
    warning("Iteration limit reached without convergence")
  
  
  ## print out info to keep track
  cat(sprintf("Iter: %d logL: %.2f pi1: %.3f  eps:%f\n", iter, ll, pi[1], eps))
}
## Iter: 1 logL: -1739.73 pi1: 0.618  eps:0.826027
## Iter: 2 logL: -1730.73 pi1: 0.581  eps:0.005174
## Iter: 3 logL: -1727.27 pi1: 0.556  eps:0.001998
## Iter: 4 logL: -1725.76 pi1: 0.539  eps:0.000870
## Iter: 5 logL: -1725.11 pi1: 0.529  eps:0.000379
## Iter: 6 logL: -1724.83 pi1: 0.522  eps:0.000164
## Iter: 7 logL: -1724.70 pi1: 0.517  eps:0.000071
## Iter: 8 logL: -1724.65 pi1: 0.514  eps:0.000030
## Iter: 9 logL: -1724.63 pi1: 0.512  eps:0.000013
## Iter: 10 logL: -1724.62 pi1: 0.511  eps:0.000005
end = Sys.time()
print(end - start)
## Time difference of 0.07759905 secs

Now lets looks at the results

# print the mixture proportion estimates
print(pi)
##   amateur      prof 
## 0.5108772 0.4891228
# print the coefficients pertaining to the first mixture component
print(fit[[1]]$coef)
## X(Intercept)         Xage Xboat_length      Xcooler 
##  3.114441448 -0.001714080  0.002282324 -0.012192209
# print the coefficients pertaining to the second mixture component
print(fit[[2]]$coef)
## X(Intercept)         Xage Xboat_length      Xcooler 
## 2.9890427275 0.0005340096 0.0002173138 0.0092431205

Lets also plot the fitted values for each component on a plot looking at fish vs cooler:

plot(cooler, log(y + 1), ylab = "Log Fish Caught", xlab = "Cooler Size (L)")
points(cooler, log(fit[[1]]$fitted + 1), col = "blue")
points(cooler, log(fit[[2]]$fitted + 1), col = "red")
legend(
  c("Amateur Fisher", "Professional Fisher"),
  col = c("blue", "red"),
  x = "bottomright",
  lty = c(1, 1)
) 

Looks like the relationship of cooler size is inversely related with number of fish caught among the amateur fishermen, but positively related with the number from professional fishermen. Hmm…

We can also use the final set of posterior probabilities to classify subjects:

# boxplot
par(mfrow = c(1, 2))
boxplot(pp[, 2] ~ factor(group_indicators, labels =  c("Amateur", "Professional")),
        ylab = "Post. Prob of Professional",
        xlab = "True Membership")
abline(h = 0.5)

# scatterplot
colvec = rep("lightblue", length(y))
colvec[pp[, 2] > .5] = "lightcoral"
plot(cooler,
     log(y + 1),
     ylab = "Log Fish Caught",
     xlab = "Cooler Size (L)",
     col = colvec)

# add fitted values by subpopulation
points(cooler, log(fit[[1]]$fitted + 1), col = "blue")
points(cooler, log(fit[[2]]$fitted + 1), col = "red")

In addition, given a new sample’s vector of covariates, say \(\boldsymbol{X}_{i,new}\), we can also compute their posterior probability of component membership as well, given the fitted model estimates.

3 General Properties of the EM Algorithm

In this section we will not focus on the proofs behind the results shown but instead highlight their implications on the properties of the EM.

3.1 Monotone Increase of the Likelihood

The seminal 1977 paper on the EM algorithm by Dempster, Laird, and Rubin demonstrated that the observed or incomplete data likelihood function (referred to as the likelihood function in other contexts) is guaranteed not to decrease with each EM iteration such that \(L(\boldsymbol{\theta}^{(t+1)}) \geq L(\boldsymbol{\theta}^{(t)})\) for each iteration \(t \geq 0\). Therefore, for some bounded sequence of likelihood values \({L(\boldsymbol{\theta}^{(t+1)})}\), we know that the likelihood at each iteration converges to some likelihood value \(L^*\), which may or may be either a local or global maximum.

The “self-consistency” property of the EM comes from this result, where we can show that if the MLE \(\hat{\boldsymbol{\theta}}\) is the global maximizer for the likelihood, this implies that it is also the global maximizer for the Q-function (the surrogate for the likelihood). In other words, if \(\hat{\boldsymbol{\theta}}\) is the global maximizer for \(L(\boldsymbol{\theta})\), then this implies that \[Q(\hat{\boldsymbol{\theta}}|\hat{\boldsymbol{\theta}}) \geq Q(\boldsymbol{\theta}|\hat{\boldsymbol{\theta}})\] and that \(\hat{\boldsymbol{\theta}}\) is the solution to the equation \[\frac{d}{d\boldsymbol{\theta}} Q(\boldsymbol{\theta}|\hat{\boldsymbol{\theta}})\|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}} = 0.\] The latter statement basically says that the optimal \(\boldsymbol{\theta}\) with respect to the likelihood function is also the optimal value with respect to the Q-function (hence showing why we can work with the easier Q-function for maximizing the likelihood).

In a more general sense, \(\boldsymbol{\theta}^{(t+1)}\) is chosen to maximize \(Q(\boldsymbol{\theta}|\boldsymbol{\theta}^{(t)})\) at iteration \(t\) rather than globally. Using similar reasoning as the prior result, we can show that this results in a sequence of estimates across iterations satisfying \(L(\boldsymbol{\theta}^{(t+1)}) \geq L(\boldsymbol{\theta}^{(t)})\), and therefore the likelihood will not decrease as the algorithm progresses.

Under certain conditions, such as if the likelihood function is unimodal in \(\boldsymbol{\theta}\) (one global maximum with no local maximums), then any sequence \({\boldsymbol{\theta}^{(t)}}\) will converge to the unique MLE of \(\boldsymbol{\theta}\), \(\boldsymbol{\theta}^*\). In cases where the parameter space is not unrestricted and may be constrained, such convergence may not be guaranteed but depends on the specific scenario.

3.2 Convergence Rates

Using similar arguments at the last lecture, we can use a Taylor Series Expansion of \(\boldsymbol{\theta}^{(t)}\) around \(\boldsymbol{\theta}^{(t)}\). We can show that around \(\boldsymbol{\theta}^*\), the rate of convergence of the EM depends on the problem at hand. Theoretically, the rate of convergence can be determined from the eigenvalues of the Jacobian Matrix at \(\theta^*\), where the component with the largest eigenvalues (indicating the slowest rate) determines the overall rate of convergence.

In general, we can view the EM algorithm as a conservative algorithm, guaranteeing improvements in the likelihood but at rates oftentimes slower than other algorithms. Again, how much slower is dependent on the problem at hand. A more in-depth discussion on this topic in the gaussian mixture model context can be found here

4 Computing Standard Errors of Parameter Estimates

For the NR algorithm, the hessian matrix is estimated with each update of the algorithm, and one nice byproduct of this is that standard errors can be directly computed for the parameter estimates after convergence using the hessian. For methods such as BFGS where no hessian is computed, its approximation (often used during maximization) can similarly be utilized for standard error computation. That is, the covariance matrix of the parameter estimates can be determined with relative ease.

For the EM algorithm, due to the fact that we do not directly take second derivatives of the log likelihood or approximate it, we do not have a ready-made hessian matrix as in the previous methods. As a result, quantities such as standard errors are harder to obtain. However, several approaches exist to obtain such estimates post-convergence.

4.1 Direct evaluation

Methods for direct evaluation of the covariance matrix have been developed, but in some cases may be tedious to derive and implement (Louis’ method, etc). Such approaches require one to derive the second derivatives of the CDLL, which in some cases may be tedious to calculate but much easier than working with the actually log likelihood. Here, after convergence of the EM algorithm, the MLE is utilized to calculate the observed Fisher Information matrix from these forms, which is then inverted to obtain the standard errors.

We will not go into too much detail on this topic, but further information on this approach in given in McLachlan and Krishnan (MK) Section 4.2.2, and an approximation for the i.i.d. case is given in Section 4.3, and for non-i.i.d data in 4.5. Several specific algorithms for implementing these direct calculation approaches are given in MK Section 4.7.

4.2 Parametric Bootstrap

A very general and popular approach that can be applied to all models is the bootstrap. There are multiple versions of this approach, however one approach commonly used is the parametric bootstrap. Essentially, one first fits the model to the data to obtain \(\hat{\boldsymbol{\theta}}\), and then simulates \(B\) bootstrap datasets of size \(n\) from the fitted model. For each of the \(B\) datasets, we refit the model and save the quantities of interest (for example, model parameter estimates). We can then utilize these values to approximate related quantities of interest that may be difficult to do analytically.

For the purpose of covariance matrix estimation for the model parameters, we simply calculate the covariance of the parameter estimates from each of the \(B\) fitted models. Simple, but also takes some additional computational expense to apply the EM algorithm \(B\) times.

4.3 Non-parametric Bootstrap

An alternative version is the nonparametric bootstrap, where instead of simulating \(B\) bootstrap samples from the fitted model, we instead resample observations with replacement from the original dataset to get out \(B\) bootstrap datasets. Then, similar to the parametric approach, we fit our model on each of the \(B\) datasets, and then calculate the covariance matrix of the \(B\) vectors of parameter estimates across the cases.

The nonparametric version may be preferred in cases where we have “true missing” data, as indicated earlier in the lecture. For example, in cases where we have censoring in our dataset. In such cases it is difficult to know and simulate the mechanism and factors driving the missingness in the data, and therefore resampling the original data is one robust way to avoid making such assumptions. Overall, it is relatively simpler to implement.

While we are using this in the context of mixture models fitted with EM, we can use the principle to estimate quantities in many other contexts, as the approach is quite general and simple. More detail on bootstrapping is given in GH Chapter 9. We may prefer to use direct methods for standard error computations in cases when the application of EM to a specific problem is computational burdensome, and therefore applying EM to each resampled datasets would be very time-prohibitive.

4.4 Application of nonparametric bootstrap to standard error estimation

Here we created a function that implements the while loop from the prior EM example to simplify the code below. The function returns the coefficient estimates from a given bootstrap dataset

pois.two.mix.reg.em = function(y, X, tol, maxit, prop1, trace = 0) {
  ## starting parameters
  # iteration counter
  iter = 0
  # arbitrary large to avoid triggering the while loop for delta
  eps = Inf
  # arbitrary large value to set the inital ll to
  ll = -10000
  # list object for M-step
  fit = list()
  # Number of mixture components, K
  K = 2
  
  ## create posterior probability matrix
  # First col is p(z_i = 1 | y_o, theta^{(t)}, amateur
  # Second col is p(z_i = 2 | y_o, theta^{(t)}, professional
  pp = matrix(0, n, K)
  
  # name the columns
  colnames(pp) = c("amateur", "prof")
  
  ## use informative initialization,  assuming lower counts --> amateur
  #  choose prop of samples that are amateur first
  #  should vary this and re-run to evaluate multiple starting points/partitions
  prop_ama = prop1
  
  # set everything greater than prop_ama'th percentile to be professional
  pp[, 2] = (y > quantile(y, prop_ama)) ^ 2
  pp[, 1] = 1 - pp[, 2]
  
  ### now start the EM algorithm
  start = Sys.time()
  while (eps > tol & iter < maxit) {
  ## save old ll
  ll0 = ll
  
  
  ## start M-step (Sec. 2.2.4)
  # pi, mean of each pp column
  pi = colMeans(pp)
  
  # beta_k, weighted glm's based on pp
  for (k in 1:K)
    fit[[k]] = glm(y ~ X - 1, family = poisson(), weights = pp[, k])
  
  
  ## start E-step (Sec. 2.2.3)
  # calculate numerator, fitted values are exp(X %*% beta_hat)
  for (k in 1:K)
    pp[, k] = pi[k] * dpois(y, lambda = fit[[k]]$fitted)
  
  # divide by denominator, the sum across components in each i
  pp = pp / rowSums(pp)
  
  
  ## calculate LL (Sec 2.2.1)
  # Calculate sum across K inside the log for each subject
  interior_sum = rep(0, length(y))
  for (k in 1:K)
    interior_sum = interior_sum + pi[k] * dpois(y, lambda = fit[[k]]$fitted)
  
  # Take the log of each, then sum across the n subjects to get the ll
  ll = sum(log(interior_sum))
  
  ## calculate relative change in log likelihood
  eps  = abs(ll - ll0) / abs(ll0)
  
  ## update iterator
  iter = iter + 1
  if (iter == maxit)
    warning("Iteration limit reached without convergence")
 
    ## print out info to keep track
    if (trace > 0)
      cat(sprintf("Iter: %d logL: %.2f pi1: %.3f  eps:%f\n", iter, ll, pi[1], eps))
  }
  
  ### return coefficients
  res = c(pi, fit[[1]]$coef, fit[[2]]$coef)
  names(res) = c(
    "pi1",
    "pi2",
    "int1",
    "age1",
    "boat_length1",
    "cooler1",
    "int2",
    "age2",
    "boat_length2",
    "cooler2"
  )
  return(res)
}

Now lets verify it works properly:

em_fit = pois.two.mix.reg.em(y=y, X= X,tol = tol,maxit = maxit, prop1 = prop_ama, trace = 1)
## Iter: 1 logL: -1739.73 pi1: 0.618  eps:0.826027
## Iter: 2 logL: -1730.73 pi1: 0.581  eps:0.005174
## Iter: 3 logL: -1727.27 pi1: 0.556  eps:0.001998
## Iter: 4 logL: -1725.76 pi1: 0.539  eps:0.000870
## Iter: 5 logL: -1725.11 pi1: 0.529  eps:0.000379
## Iter: 6 logL: -1724.83 pi1: 0.522  eps:0.000164
## Iter: 7 logL: -1724.70 pi1: 0.517  eps:0.000071
## Iter: 8 logL: -1724.65 pi1: 0.514  eps:0.000030
## Iter: 9 logL: -1724.63 pi1: 0.512  eps:0.000013
## Iter: 10 logL: -1724.62 pi1: 0.511  eps:0.000005
print(em_fit)
##           pi1           pi2          int1          age1  boat_length1 
##  0.5108771578  0.4891228422  3.1144414479 -0.0017140802  0.0022823240 
##       cooler1          int2          age2  boat_length2       cooler2 
## -0.0121922086  2.9890427275  0.0005340096  0.0002173138  0.0092431205

Everything looks correct, so lets move on to implementing the nonparametric bootstrap

## set number of bootstrap samples of size n to draw
B = 100

## create matrix to save estimates
est  = matrix(NA, B, length(em_fit))

# start loop
for (i in 1:B) {
  # get indices for resampled observations with replacement
  index = sample(1:n, replace = T)
  
  # fit EM to resampled data
  res = pois.two.mix.reg.em(
    y = y[index],
    X = X[index,],
    tol = tol,
    maxit = maxit,
    prop1 = prop_ama,
    trace = 0
  )
  
  # save values
  est[i,] = res
  
  # print first 5 parameter estimates every 200 iterations for illustration
  if (floor(i / 20) == ceiling(i / 20))
    print(round(est[i, 1:5], 4))
}
## [1]  0.5563  0.4437  3.0795 -0.0037  0.0000
## [1]  0.4739  0.5261  3.1739 -0.0036  0.0014
## [1]  0.5108  0.4892  3.2189 -0.0037  0.0026
## [1]  0.5429  0.4571  3.3750 -0.0019 -0.0003
## [1]  0.5610  0.4390  3.2977 -0.0030  0.0023

Now that we have the coefficient estimates for each nonparametric bootstrap sample, we can calculate the approximate covariance matrix for the parameters. Obviously, as we increase \(B\) we can increase the accuracy of the approximation.

cov(est)[1:5,1:5]
##               [,1]          [,2]          [,3]          [,4]          [,5]
## [1,]  9.295747e-04 -9.295747e-04  2.762876e-04  3.372238e-07  8.448758e-07
## [2,] -9.295747e-04  9.295747e-04 -2.762876e-04 -3.372238e-07 -8.448758e-07
## [3,]  2.762876e-04 -2.762876e-04  3.074595e-02 -3.399235e-04 -2.154657e-05
## [4,]  3.372238e-07 -3.372238e-07 -3.399235e-04  6.903614e-06 -4.357349e-08
## [5,]  8.448758e-07 -8.448758e-07 -2.154657e-05 -4.357349e-08  1.382719e-06

Given this, we can calculate the standard error of each parameter by square root of the diagonal of the bootstrap covariance matrix. Prior work has shown \(B\) = 50 to 100 is sufficient for standard error estimation.

print(sqrt(diag(cov(est))))
##  [1] 0.030488928 0.030488928 0.175345228 0.002627473 0.001175891
##  [6] 0.003183849 0.127560308 0.001768703 0.001193103 0.001795067

4.5 Pros and Cons of EM

  • Pros
    • Numerically stable, each iteration increases likelihood
    • Reliable convergence, depending on starting point
    • Easy to implement, modular
    • Avoids direct evaluation of likelihood and derivatives of the likelihood
    • In general memory efficient (does not need to store information matrix or its inverse)
    • M-step can be maximized with standard packages or simplified using extensions (ECM, etc) if needed
  • Cons
    • No direct way to obtain covariance matrix of parameter estimates as can be done with NR (strategies to do this later)
    • Can have relatively slow convergence, especially when there is a lot of missing information
    • Does not guarantee convergence to the global maximum if multiple maxima are present, but this is the case for other approaches as well.
    • In some cases, the E-step may be difficult to evaluate or intractable (for example when the evaluation of multidimensional integrals are needed). The MCEM algorithm that uses Monte Carlo sampling to approximate the E-step is one way around this.

5 Extensions of the EM Algorithm

Since the initial DLR paper, several extensions of the EM algorithm have been published. Each approach addresses potential issues and shortcomings of the EM algorithm in certain settings and aim to speed up its application or get around intractable numerical problems. As we will see, these modifications may occur in the M-step, where the maximization procedure can be simplified through various approaches, or in the E-step, where the conditional expectation may be difficult or impossible to evaluate analytically.

5.1 Expectation Condition Maximization (ECM, M-step Modification)

5.1.1 Why should we consider

In some cases, the M-step may be computationally complex or difficult to evaluate. For example, in cases when many parameters exist, or when the off-the-shelf maximization routine for the M-step is computationally intense of difficult to implement.

One solution to this problem: instead of maximizing all of the model parameters simultaneously, we instead maximize each parameter, or groups or parameters, in sequence, conditional on the prior values in the M-step. In contrast to the regular EM algorithm, we call this approach Expectation Conditional Maximization (ECM), as in the M step is now a series of sequential updates in terms of the parameters. These maximizations may require iterative approaches, such as those discussed in the previous lecture, or have closed forms.

We will see that in many cases this approach may result in more iterations of the EM algorithm, but less overall runtime, as the amount of time spent on each M-step is less. Other benefits include greater stability in maximizing over a simpler parameter space in each CM step. In general, we may find that utilizing smaller conditional maximization steps in lieu of simultaneous maximization may be easier in terms of implementation and may provide better stability for convergence. Examples of this include coordinate descent/ascent applications in high dimensional regression problems. We will touch in this in a later lecture.

5.1.2 Formulation

Let us now consider that we replace the original M-step with \(S\) CM steps. As we mentioned before, this may pertain to the maximization of \(S\) individual parameters or \(S\) groups of parameters.

Then, let us define \(\boldsymbol{\theta}^{(k+s/S)}\) as the value of \(\boldsymbol{\theta}\) on the \(s\)th step of the M-step, \(s = 1\ldots S\). Then, at each CM step \(s\), we maximize \(Q(\boldsymbol{\theta}|\boldsymbol{\theta}^{(k+(s-1)/S)})\) defined as the Q-function at the \(t\)th step following the \((s-1)\)th conditional maximization step with respect to \(\boldsymbol{\theta}\).

5.1.3 Speed of Convergence

It has been shown that the ECM reduces to a simple CM step in the absence of missing data. Prior work has shown that if a CM approach is expected to converge in the absence of missing data in a model, then we should expect similar convergence for an ECM approach.

It has been shown that the global speed of convergence is simply \[s_{ECM} = s_{EM}s_{CM},\] reflecting the product of the speed of the convergence of the regular EM and the CM step of the function in question in the absence of missing data. The speed on convergence of the ECM is in general slower than the EM (in terms of the number of iterations), as the Q-function is being improved conditionally only bit by bit in the M-step of the ECM. However, given that each CM in the M-step is much faster than the standard M-step, the total time it takes for all \(S\) CM steps to complete may be much faster.

Therefore, while the ECM may take more iterations to converge than EM, the faster and simper CM steps may result in lower overall run time, as the M-step will be faster to execute in each iteration. This is reflective of the main reason why one would pursue such an approach over standard EM when the M-step is complicated. Another benefit is that it may be easier to implement and apply this series of CM steps compared to the original maximization strategy, especially when the number of parameters is large.

5.2 Multicycle ECM

This variant is an extension of the ECM algorithm above, except that we perform an E-step update after each CM step. It can be shown that the multicycle ECM retains similar properties as the ECM algorithm in terms of convergence and the ascent property after each EM iteration. In some cases we may see a larger increase in the likelihood per EM iteration due to the Q function being updated more often. However, this may not always be the case, and in some cases the algorithm may converge more slowly than the ECM.

In general, this approach is best applied when the E-step computation is very quick and simple, as the additional computational burden of multiple E-step evaluations per EM iterations will be relatively low.

5.3 Example: Negative binomial mixture model

As mentioned in the previous lecture, the negative binomial distribution can be considered only part of the exponential family with fixed dispersion parameter. As a result, approaches based off of IRLS for fitting negative binomial regression models are iterative, first maximizing the set of regression coefficients \(\boldsymbol{\beta}\), and then the dispersion parameter \(\phi\). We assume the same log link used to model \(\lambda_{i}\) as a function of the linear predictor. Setting \(\phi = \infty\) (no overdispersion, depending on the parameterization of the model) will reduce the problem to standard Poisson regression.

Let’s generate some data similar to that from the fish example from earlier, but now adding in some overdispersion into each mixture component. We will then fit a two-component negative binomial mixture model, following the same blueprint of the earlier example. We will also up the sample size to 10,000 to better highlight the runtime differences between regular EM and the ECM version.

First, let’s generate the data.

## generate the data
set.seed(10)

## set sample size
n = 10000

## create design matrix
cooler = round(rt(n, 15, 35), 2)
boat_length =  round(rt(n, 5, 30), 2)
age = round(rt(n, 25, 50))
X = model.matrix( ~ 1 + age + boat_length + cooler)


## set coefficients
  # amateur proportion
  pi1 = 0.5 
  # professional proportion
  pi2 = 1 - pi1 
  # Ama coefs, larger boat cooler = less fish
  beta_1 = matrix(c(3, 0, 0,-.01), ncol = 1) 
  # Pro coefs, larger boat cooler = more fish
  beta_2 = matrix(c(3, 0, 0, .01), ncol = 1) 
  # set dispersion
  phi = 10


## simulate data
  # generate group membership indicator vector
  group_indicators = rbinom(n, 1, pi2) # 1 = P, 0 = NP
  
  # create empty y vector
  y = rep(0, n)
  
  # simulate responses for Amagroup
  y[group_indicators == 0] = rnbinom(
    n = sum(group_indicators == 0),
    mu = exp(X[group_indicators == 0, ] %*% beta_1),
    size = phi
  )
  
  # simulate responses for Pro group 
  y[group_indicators == 1] = rnbinom(
    n = sum(group_indicators == 1),
    mu = exp(X[group_indicators == 1, ] %*% beta_2),
    size = phi
  )

5.3.1 Vanilla EM

Now that we have the data, lets fit a regular EM using the glm.nb function in the M step. Again, we show the code here without wrapping it into a function so that its easier to inspect each element. Afterwards, we will rewrite this in a more modular form by splitting each step into a series of functions.

### Initialization and setup
library (MASS)

## starting parameters
  # convergence threshold
  tol = 10 ^ -10
  # max number of EM iterations
  maxit = 1000
  # iteration counter
  iter = 0
  # arbitrary large to avoid triggering the while loop for delta
  eps = Inf
  # arbitrary large value to set the inital ll to
  ll = -10000
  # list object for M-step
  fit = list()
  # Number of mixture components, K
  K = 2

## create posterior probability matrix
  # First col is p(z_i = 1 | y_o, theta^{(t)}, amateur
  # Second col is p(z_i = 2 | y_o, theta^{(t)}, professional
  pp = matrix(0, n, K)

  # name the columns
  colnames(pp) = c("amateur", "prof")

## use informative initialization,  assuming lower counts --> amateur
  #  choose prop of samples that are amateur first
  #  should vary this and re-run to evaluate multiple starting points/partitions
  prop_ama = 0.6

  # set everything greater than prop_ama'th percentile to be professional
  pp[, 2] = (y > quantile(y, prop_ama)) ^ 2
  pp[, 1] = 1 - pp[, 2]


### now start the EM algorithm
start = Sys.time()
while (eps > tol & iter < maxit) {
  ## save old ll
    ll0 = ll
  
  ## start M-step
    # pi, mean of component pp's
    pi = colMeans(pp)
  
    # beta_k and theta_k, weighted glm's based on pp
    for (k in 1:K)
      fit[[k]] = glm.nb(y ~ X - 1, weights = pp[, k])
  
  
  ## start E-step
    # calculate numerator
    for (k in 1:K)
      pp[, k] = pi[k] * dnbinom(y, mu = fit[[k]]$fitted, size = fit[[k]]$theta)
    
    # divide by denominator, the sum across components in each i
    pp = pp / rowSums(pp)
  
  
  ## calculate LL
    # Calculate sum across K inside the log for each subject
    interior_sum = rep(0, length(y))
    for (k in 1:K)
      interior_sum = interior_sum + pi[k] * dnbinom(y, mu = fit[[k]]$fitted, size = fit[[k]]$theta)
    
    # Take the log of each, then sum across the n subjects to get the ll
    ll = sum(log(interior_sum))
  
  ## calculate relative change in log likelihood
    eps  = abs(ll - ll0) / abs(ll0)
  
  ## update iterator
    iter = iter + 1
    if (iter == maxit)
      warning("Iteration limit reached without convergence")
  
  ## print out info to keep track
    if (floor(iter / 20) == ceiling(iter / 20))
      cat(
        sprintf(
          "Iter: %d logL: %.2f pi1: %.3f theta1: %.3f glm.nb1, eps:%f\n",
          iter,
          ll,
          pi[1],
          fit[[1]]$theta,
          eps
        )
      )
}
## Iter: 20 logL: -37528.39 pi1: 0.585 theta1: 8.470 glm.nb1, eps:0.000003
## Iter: 40 logL: -37527.09 pi1: 0.568 theta1: 8.632 glm.nb1, eps:0.000001
## Iter: 60 logL: -37526.55 pi1: 0.556 theta1: 8.764 glm.nb1, eps:0.000000
## Iter: 80 logL: -37526.32 pi1: 0.549 theta1: 8.849 glm.nb1, eps:0.000000
## Iter: 100 logL: -37526.23 pi1: 0.544 theta1: 8.904 glm.nb1, eps:0.000000
## Iter: 120 logL: -37526.19 pi1: 0.541 theta1: 8.940 glm.nb1, eps:0.000000
## Iter: 140 logL: -37526.17 pi1: 0.539 theta1: 8.963 glm.nb1, eps:0.000000
## Iter: 160 logL: -37526.17 pi1: 0.538 theta1: 8.979 glm.nb1, eps:0.000000
## Iter: 180 logL: -37526.16 pi1: 0.537 theta1: 8.988 glm.nb1, eps:0.000000
## Iter: 200 logL: -37526.16 pi1: 0.537 theta1: 8.995 glm.nb1, eps:0.000000
## Iter: 220 logL: -37526.16 pi1: 0.536 theta1: 8.999 glm.nb1, eps:0.000000
## Iter: 240 logL: -37526.16 pi1: 0.536 theta1: 9.002 glm.nb1, eps:0.000000
end = Sys.time()
print(end - start)
## Time difference of 56.75355 secs

In the M-step, the following process is being utilized

  • Initialization: Obtain \(\boldsymbol{\beta}^{(0)}\) using poisson regression
  • Obtain \(\phi^{(0)}\) given \(\boldsymbol{\beta}^{(0)}\) via NR
  • Obtain \(\boldsymbol{\beta}^{(1)}\) given \(\phi^{(1)}\) via IRLS
  • Repeat until convergence in Deviance

Lets simplify this a bit, leveraging the modularity of the EM, by splitting up the Initialization, E, and M steps into separate functions:

# function to perform above initialization steps to initialized pp matrix
pp.init = function(K, col_names, prop) {
  
  ## create posterior probability matrix
    # First col is p(z_i = 1 | y_o, theta^{(t)}, amateur
    # Second col is p(z_i = 2 | y_o, theta^{(t)}, professional
    n = length(y)
    pp = matrix(0, n, K)
  
    # name the columns
    colnames(pp) = col_names
  
  ## use informative initialization,  assuming lower counts --> amateur
    # note that this is only specific for K=2, but can generalize for higher K cases
    # also could just use random initilization
    pp[, 2] = (y > quantile(y, prop)) ^ 2
    pp[, 1] = 1 - pp[, 2]
  
  return(pp)
}


# M-step function
m.step = function(y, X, pp, fit, K = 2) {
  
  ## pi, mean of component pp's
  pi = colMeans(pp)
  
  ## beta_k and theta_k, weighted glm's based on pp
  for (k in 1:K)
    fit[[k]] = glm.nb(y ~ X - 1, weights = pp[, k])
  
  return(list(pi = pi, fit = fit))
}


# E-step function
e.step = function(y, pp, m) {
  
  ## extract m step objects from m
  pi = m$pi
  fit = m$fit
  
  ## calculate numerator
  for (k in 1:K)
    pp[, k] = pi[k] * dnbinom(y, mu = fit[[k]]$fitted, size = fit[[k]]$theta)
  
  ## divide by denominator, the sum across components in each i
  pp = pp / rowSums(pp)
  
  return(pp)
}


# log likelihood function
ll.nb.mix = function(y, m) {
  
  ## extract m step objects from m
  pi = m$pi
  fit = m$fit
  
  ## Calculate sum across K inside the log for each subject
  interior_sum = rep(0, length(y))
  for (k in 1:K)
    interior_sum = interior_sum + pi[k] * dnbinom(y, mu = fit[[k]]$fitted, size = fit[[k]]$theta)
  
  ## Take the log of each, then sum across the n subjects to get the ll
  ll = sum(log(interior_sum))
  
  return(ll)
}

Now lets rewrite the prior EM code into a modular function

nb.two.mix.reg.em = function(y,
                             X,
                             tol = 10 ^ -10,
                             maxit = 1000,
                             prop,
                             trace = 0) {
  ## starting parameters
    iter = 0
    eps = Inf
    ll = -10000
    fit = list()
    K = 2
  
  ## initialize posterior probability matrix
    pp = pp.init(
      prop = prop,
      K = K,
      col_names = c("amateur", "prof")
    )
  
  ## now start the EM algorithm
  start = Sys.time()
  while (eps > tol & iter < maxit) {
    
    ## save old ll
      ll0 = ll
    
    ## start M-step
      m = m.step(y = y,
                 X = X,
                 pp = pp,
                 fit = fit)
      
    ## start E-step
      pp = e.step(y = y, m = m, pp = pp)
    
    ## calculate LL
      ll = ll.nb.mix(y = y, m = m)
    
    ## calculate relative change in LL
      eps  = abs(ll - ll0) / abs(ll0)
    
    ## update iterator
      iter = iter + 1
      if (iter == maxit)
        warning("Iteration limit reached without convergence")
      
    ## print out info to keep track
      if (floor(iter / 20) == ceiling(iter / 20) &
          trace > 0)
        cat(
          sprintf(
            "Iter: %d logL: %.2f pi1: %.3f theta1: %.3f glm.nb1, eps:%f\n",
            iter,
            ll,
            m$pi[1],
            m$fit[[1]]$theta,
            eps
          )
        )
  }
  
  end = Sys.time()
  print(end - start)
}

Now lets run it

em_fit = nb.two.mix.reg.em(y = y,
                           X = X,
                           prop = prop_ama,
                           trace = 1)
## Iter: 20 logL: -37528.39 pi1: 0.585 theta1: 8.470 glm.nb1, eps:0.000003
## Iter: 40 logL: -37527.09 pi1: 0.568 theta1: 8.632 glm.nb1, eps:0.000001
## Iter: 60 logL: -37526.55 pi1: 0.556 theta1: 8.764 glm.nb1, eps:0.000000
## Iter: 80 logL: -37526.32 pi1: 0.549 theta1: 8.849 glm.nb1, eps:0.000000
## Iter: 100 logL: -37526.23 pi1: 0.544 theta1: 8.904 glm.nb1, eps:0.000000
## Iter: 120 logL: -37526.19 pi1: 0.541 theta1: 8.940 glm.nb1, eps:0.000000
## Iter: 140 logL: -37526.17 pi1: 0.539 theta1: 8.963 glm.nb1, eps:0.000000
## Iter: 160 logL: -37526.17 pi1: 0.538 theta1: 8.979 glm.nb1, eps:0.000000
## Iter: 180 logL: -37526.16 pi1: 0.537 theta1: 8.988 glm.nb1, eps:0.000000
## Iter: 200 logL: -37526.16 pi1: 0.537 theta1: 8.995 glm.nb1, eps:0.000000
## Iter: 220 logL: -37526.16 pi1: 0.536 theta1: 8.999 glm.nb1, eps:0.000000
## Iter: 240 logL: -37526.16 pi1: 0.536 theta1: 9.002 glm.nb1, eps:0.000000
## Time difference of 57.25656 secs

5.3.2 ECM, Passing prior M-steps estimates as starting values

Let’s do an ECM version, where instead of iterating until convergence in glm.nb in the M-step, we just do a single round. One thing that should always be done using EM is to start the next iteration’s M-step from the previous iteration’s parameter estimates. This will increase the stability of the fitting of subsequent iterations, and speed up the time to convergence.

First we grab the function for the NR maximization for \(\phi\), phi.ml() from the MASS package (will insert later into the m-step). This function basically applies NR to obtain the updated estimate of the dispersion parameter.

phi.ml <-
  function(y, mu, n = sum(weights), weights, limit = 10,
           eps = .Machine$double.eps^0.25,
           trace = FALSE,
           p0 = NULL #  added this option new to pass starting values
           ){
    lambda = 1E-50    
    # score 
    score <- function(n, ph, mu, y, w){
      sum(w*(digamma((1/ph) + y) - digamma(1/ph) + log(1/ph) +
               1 - log((1/ph) + mu) - (y + (1/ph))/(mu + (1/ph))))*(-1/ph^2) + 2*lambda*ph
    }
    # hessian
    info <- function(n, ph, mu, y, w){
      sum(w*( - trigamma((1/ph) + y) + trigamma((1/ph)) - ph +
                2/(mu + (1/ph)) - (y + (1/ph))/(mu + (1/ph))^2))*(1/ph^4) + 2*lambda
    }
    if(inherits(y, "lm")) {
      mu <- y$fitted.values
      y <- if(is.null(y$y)) mu + residuals(y) else y$y
    }
    
    if(missing(weights)) weights <- rep(1, length(y))
    #t0 <- n/sum(weights*(y/mu - 1)^2)
    if(is.null(p0)){ ## added this in new
      p0 <- sum(weights*(y/mu - 1)^2)/n
    }
    
    it <- 0
    del <- 1
    if(trace) message(sprintf("phi.ml: iter %d 'phi = %f'",
                              it, signif(p0)), domain = NA)
    while((it <- it + 1) < limit && abs(del) > eps) {
      p0 <- abs(p0)
      del <- score(n, p0, mu, y, weights)/(i <- info(n, p0, mu, y, weights))
      p0 <- p0 + del
      if(trace) message("phi.ml: iter", it," phi =", signif(p0))
    }
    
    if(p0 < 0) {
      p0 <- 0
      warning("estimate truncated at zero")
      attr(p0, "warn") <- gettext("estimate truncated at zero")
    }
    
    if(it == limit) {
      warning("iteration limit reached")
      attr(p0, "warn") <- gettext("iteration limit reached")
    }
    attr(p0, "SE") <- sqrt(1/i)
    res <- list(p0=p0)
    return(res)
  }

Then we embed this into the code below for the M step:

m.step = function(y, X, pp, fit, K = 2) {
  
  ## pi, mean of component pp's
    pi = colMeans(pp)
  
  
  ## beta_k given prior theta and beta, weighted glm's based on pp
    if (length(fit) == 0) {
      
      # if fit is uninitialized (no starting estimate for beta or theta), then
      # no e-step update
      
      # fit a simple poisson model for beta
        for (k in 1:K)
          fit[[k]] = glm(y ~ X - 1, weights = pp[, k], family = poisson())
        
    } else{
        
      #  fit a NB regression model for beta given prior theta_k and beta_k
        for (k in 1:K) {
          fit[[k]] = glm(
            y ~ X - 1,
            weights = pp[, k],
            family = negative.binomial(theta = fit[[k]]$theta),
            start = fit[[k]]$coefficients
          )
      }
  }
  
  ## theta_k given update for beta_k and prior theta_k
    # returns 1/phi actually so need to invert
    for (k in 1:K) {
      
      if (is.null(fit[[k]]$theta)) {
        
        # if no prior estimate of theta, then do regular phi.ml update
          fit[[k]]$theta = 1 / phi.ml(y = y,
                                      mu = fit[[k]]$fitted,
                                      weights = pp[, k])$p0
          
      } else{
        
        #otherwise, pass prior value of theta as starting
          fit[[k]]$theta = 1 / phi.ml(
            y = y,
            mu = fit[[k]]$fitted,
            weights = pp[, k],
            p0 = 1 / fit[[k]]$theta ## here
          )$p0
        
      }
    }
  
  return(list(pi = pi, fit = fit))
}



# now lets rerun
em_fit = nb.two.mix.reg.em(y = y,
                           X = X,
                           prop = prop_ama,
                           trace = 1)
## Iter: 20 logL: -37528.43 pi1: 0.585 theta1: 8.472 glm.nb1, eps:0.000003
## Iter: 40 logL: -37527.12 pi1: 0.568 theta1: 8.632 glm.nb1, eps:0.000001
## Iter: 60 logL: -37526.58 pi1: 0.557 theta1: 8.763 glm.nb1, eps:0.000000
## Iter: 80 logL: -37526.35 pi1: 0.550 theta1: 8.847 glm.nb1, eps:0.000000
## Iter: 100 logL: -37526.25 pi1: 0.545 theta1: 8.902 glm.nb1, eps:0.000000
## Iter: 120 logL: -37526.21 pi1: 0.542 theta1: 8.937 glm.nb1, eps:0.000000
## Iter: 140 logL: -37526.19 pi1: 0.540 theta1: 8.960 glm.nb1, eps:0.000000
## Iter: 160 logL: -37526.18 pi1: 0.539 theta1: 8.975 glm.nb1, eps:0.000000
## Iter: 180 logL: -37526.17 pi1: 0.538 theta1: 8.985 glm.nb1, eps:0.000000
## Iter: 200 logL: -37526.17 pi1: 0.537 theta1: 8.992 glm.nb1, eps:0.000000
## Iter: 220 logL: -37526.17 pi1: 0.537 theta1: 8.996 glm.nb1, eps:0.000000
## Iter: 240 logL: -37526.17 pi1: 0.537 theta1: 8.999 glm.nb1, eps:0.000000
## Iter: 260 logL: -37526.17 pi1: 0.536 theta1: 9.001 glm.nb1, eps:0.000000
## Iter: 280 logL: -37526.17 pi1: 0.536 theta1: 9.002 glm.nb1, eps:0.000000
## Time difference of 22.92006 secs

So, the ECM implementation here where we first maximize each parameter sequentially was much faster. This is because the original maximization routine was quite complex. The cycling within glm.nb to convergence is necessary to get accurate estimates of \(\beta\) and \(\phi\) in standard applications, but when embedded within the EM is not entirely necessary. If we think about it, the estimates for \(\beta\) and \(\phi\) change so much between the earlier EM iterations that it’s a lot of computational time trying to get accurate estimates for them at those points.

5.3.3 Multicycle ECM

Let’s if we can speed this up further with the multicycle ECM variant. Here we update the m.step function to include e-step updates in between each conditional maximization:

m.step = function(y, X, pp, fit, K = 2) {
  
  ## pi, mean of component pp's
    pi = colMeans(pp)
  
  
  ## beta_k given prior theta, weighted glm's based on pp
    if (length(fit) == 0) {
      
      # if fit is uninitialized (no starting estimate for beta or theta), then
      # no e-step update
      
      # fit a simple poisson model for beta
        for (k in 1:K)
          fit[[k]] = glm(y ~ X - 1, weights = pp[, k], family = poisson())
        
    } else{
      
      # estep update after pi_k, since prior fit exists
        pp = e.step(y = y,
                    m = list(pi = pi, fit = fit),
                    pp = pp)
        
      #  fit a NB regression model for beta given prior theta
        for (k in 1:K) {
          fit[[k]] = glm(
            y ~ X - 1,
            weights = pp[, k],
            family = negative.binomial(theta = fit[[k]]$theta),
            start = fit[[k]]$coefficients
          )
          
          # estep update after each beta_k
          pp = e.step(y = y,
                    m = list(pi = pi, fit = fit),
                    pp = pp)
      }
    
      
  }
  
  ## theta_k given update for beta_k
    # returns 1/phi actually so need to invert
    for (k in 1:K) {
      if (is.null(fit[[k]]$theta)) {
        
        # if no prior estimate of theta, then do the usual
          fit[[k]]$theta = 1 / phi.ml(y = y,
                                      mu = fit[[k]]$fitted,
                                      weights = pp[, k])$p0
          
      } else{
        #otherwise, pass prior value of theta as starting
          fit[[k]]$theta = 1 / phi.ml(
            y = y,
            mu = fit[[k]]$fitted,
            weights = pp[, k],
            p0 = 1 / fit[[k]]$theta
          )$p0
          
        # estep update after each theta_k
        pp = e.step(y = y,
                    m = list(pi = pi, fit = fit),
                    pp = pp)
      }
    }
  
  return(list(pi = pi, fit = fit))
}


# now lets rerun
em_fit = nb.two.mix.reg.em(y = y,
                           X = X,
                           prop = prop_ama,
                           trace = 1)   
## Iter: 20 logL: -37528.43 pi1: 0.585 theta1: 8.472 glm.nb1, eps:0.000003
## Iter: 40 logL: -37527.12 pi1: 0.568 theta1: 8.632 glm.nb1, eps:0.000001
## Iter: 60 logL: -37526.58 pi1: 0.557 theta1: 8.763 glm.nb1, eps:0.000000
## Iter: 80 logL: -37526.35 pi1: 0.550 theta1: 8.847 glm.nb1, eps:0.000000
## Iter: 100 logL: -37526.25 pi1: 0.545 theta1: 8.902 glm.nb1, eps:0.000000
## Iter: 120 logL: -37526.21 pi1: 0.542 theta1: 8.937 glm.nb1, eps:0.000000
## Iter: 140 logL: -37526.19 pi1: 0.540 theta1: 8.960 glm.nb1, eps:0.000000
## Iter: 160 logL: -37526.18 pi1: 0.539 theta1: 8.975 glm.nb1, eps:0.000000
## Iter: 180 logL: -37526.17 pi1: 0.538 theta1: 8.985 glm.nb1, eps:0.000000
## Iter: 200 logL: -37526.17 pi1: 0.537 theta1: 8.992 glm.nb1, eps:0.000000
## Iter: 220 logL: -37526.17 pi1: 0.537 theta1: 8.996 glm.nb1, eps:0.000000
## Iter: 240 logL: -37526.17 pi1: 0.537 theta1: 8.999 glm.nb1, eps:0.000000
## Iter: 260 logL: -37526.17 pi1: 0.536 theta1: 9.001 glm.nb1, eps:0.000000
## Iter: 280 logL: -37526.17 pi1: 0.536 theta1: 9.002 glm.nb1, eps:0.000000
## Time difference of 22.83824 secs

In this setting we do not see much benefit, despite the cheapness of the E-step update. As mentioned before, this may be helpful in certain settings. In my person experience, this is helpful when there the dimensionality of the missing data is larger, and when it takes much time/effort to perform an entire M-step update consisting of many conditional maximizations.

5.4 Monte Carlo EM (MCEM)

5.4.1 Why should we consider

In some cases, the E-step may be analytically or computationally intractable. For example, the expectation does not have a clear closed form (unlike the examples given prior), or hard to evaluate (for example involving multidimensional integrals).

To mitigate this, we replace the expectation in the E-step with an approximation using a “Monte Carlo” E-step, yielding the term MCEM or Monte Carlo EM.

5.4.2 Formulation

In a general sense, the modification over the standard EM approach is largely in the E-step, where we approximate the expectation using monte carlo sampling-based approaches. The M-step then maximizes the CDLL with the missing data filled in with these \(M\) drawn samples, weighting each filled-in case by \(1/M\). In other words, we are essentially maximizing the M-step as in the regular EM case, except we are averaging over each drawn sample. This will become clear in the setup below.
We can rewrite the expectation of the Q-function as an integral, where the integrand can be factored into two parts: \[ Q(\boldsymbol{\theta}|\boldsymbol{\theta}^{(t)}) = E\left[ \log L_c(\boldsymbol{\theta}) | \boldsymbol{y}_o,\boldsymbol{\theta}^{(t)}\right] = \int g(\boldsymbol{y}_c| \boldsymbol{\theta})d\boldsymbol{z} = \int f(\boldsymbol{y}_o| \boldsymbol{\theta}^{(t)}, \boldsymbol{z})f(\boldsymbol{z} | \boldsymbol{\theta}^{(t)}, \boldsymbol{y}_o) d\boldsymbol{z}.\]

That is, the conditional expectation of the CDLL can be factored into the form above, where the first part is the pdf of the observed data conditional on the missing data and current parameter estimates. The second part pertains to the posterior distribution of the missing data, conditional on the observed data and current parameter estimates.

As one can imagine, this integral may be difficult to evaluate analytically, especially in cases where \(g(\boldsymbol{y}_c| \boldsymbol{\theta})\) does not simplify to some standard distribution (and thus would have a known closed form for the integral). In addition, if the dimension of \(\boldsymbol{z}\) is large, it may be difficult to evaluate multiple integrals, especially if no closed forms for evaluating the integrals exist. Examples of this in the Bayesian setting are when non-conjugate priors are utilized in analyses.

5.4.3 A segway into the next lecture

One way to avoid this is issue is to approximate the integral by using numerical integration. There are multiple approaches to do this, and we cover these approaches in the next lecture. One flexible approach is to use monte carlo integration, where if we can say draw \(M\) samples from \(f(\boldsymbol{z} | \boldsymbol{\theta}^{(t)} , \boldsymbol{y}_o)\), say \(z^{1}\ldots z^{M}\), then it can be shown that \(\int f(\boldsymbol{y}_o| \boldsymbol{\theta}^{(t)}, \boldsymbol{z})f(\boldsymbol{z} | \boldsymbol{\theta}^{(t)}, \boldsymbol{y}_o) d\boldsymbol{z} = \frac{1}{M}\sum_{m = 1}^{M}f(\boldsymbol{y}_o| \boldsymbol{\theta}^{(t)}, \boldsymbol{z}^{m})\). That is, we approximate the integral by drawing many samples from the posterior distribution of the missing data at step \(t\), and then average over \(f(\boldsymbol{y}_o| \boldsymbol{\theta}^{(t)}, \boldsymbol{z}^{m})\), filling in the missing data with each draw (weighted by \(1/M\)). We will go into more detail about why this works in the next lecture.

Therefore, in the E-step, we can approximate the Q-function using the approach above. In application, the majority of the effort here is drawing samples from \(f(\boldsymbol{z} | \boldsymbol{\theta}^{(t)} , \boldsymbol{y}_o)\), which also may not have a good closed form itself. If we do not know what the form of this conditional distribution is, or if we know the form but do not know how to sample from it, how exactly can we compute this integral? We cover several of these approaches in our lecture on Numerical Integration and MCMC. For now, you just need to know that we can sample from this posterior in each step of the MCEM algorithm.

5.4.4 Some weaknesses

One thing that is clear in this case is that the larger the value of \(M\) is, the greater accuracy we will have in evaluating this expectation. Therein lies one of the weaknesses of this approach - many samples may need to be drawn, which increases the computational burden of the E-step The procedure for obtaining samples from this distribution through monte carlo approaches may also be computationally expensive as well (rejection sampling, metropolis-hastings, etc).

In addition, a large number of samples may also be needed for convergence. Given that the E-step is being approximated, and that this approximation is based upon draws of independent (or approximately independent in some cases) samples, the Likelihood, Q-function, or model parameters may vary randomly about some value for smaller values of \(M\) even if the model has truly converged. Increasing \(M\) decreases the monte carlo error of these values, and the random variability decreases as well, increases your chances of convergence. Some authors have put forward approaches for estimating this monte carlo error at each step to determine by how much we may need to increase \(M\) by at each step of the MCEM algorithm to facilitate eventual convergence. Others simply increase \(M\) by a predictable amount each step. In either case, we would like to use a smaller \(M\) at the beginning of the MCEM algorithm and a larger \(M\) closer to convergence to ensure the model convergences.

Various convergence criteria as defined earlier can be used, but given the random variability in the parameter estimates, likelihood, or Q-function in this case, oftentimes one may simply terminate the algorithm if the convergence criteria has been met say three times in a row so that it is likely not due to random chance. It may also be easier to compare estimates from possibly 5 to 10 iterations back to smooth our iteration to iteration variations (perhaps more depending on convergence rates).

Given that we do not have the tools yet to draw samples from the posterior in these cases described above, we will revisit this topic in the next lecture.

6 For more detail:

GH Chapter 4

McLachlan and Krishnan (The EM algorithm and Extensions), Chapter 1