I want to give a sense of how the EM algorithm can be used in a variety of computational biology problems, beyond the common mixture of distributions model. Here I will show how EM can be used to find the sequence motifs associated with DNA binding proteins, like transcription factors. A sequence motif is a series of DNA letters (nucleotides) that the DNA binding proteins can “recognize”. As we discussed in class, when we refer loosely to such “recognition”, we’re actually talking about the protein and the DNA sequence in a particular region having electrostatic interactions, producing favorable, low energy states. Recall the image of the amino acid residuals “reaching in” to the major and minor groove of the DNA double helix, and interacting with the 4 possible nucleotides.

The following will be an incredibly oversimplified simulation and model to show how EM might help us find sequence motifs. For a real reference on how EM might be used, please consult the following paper, which proposed an EM-based methodology for motif finding:

MEME: discovering and analyzing DNA and protein sequence motifs

For homework, you will work with ChIP-seq data for a particular transcription factor with a known motif, and you will try to adapt the following code to find the motif of this transcription factor from data.

Simulation of motif finding

The over-simplifications we will make here, which result in an unrealistic simulation and worse method, but are useful for pedagogical purposes:

I begin by defining some parameters and simulating our motif that we want to find. Here I use numbers 1-4 to represent nucleotides A,C,G,T.

n <- 300 # number of DNA sequences
l <- 100 # number of nucleotides for each sequence
motif.len <- 10 # the length of the true motif
set.seed(5)
motif <- sample(1:4, motif.len, TRUE)
motif
##  [1] 2 3 1 3 3 1 1 1 2 3

Now we will embed the motif into some DNA sequences, which here are rows of a matrix X. The Z variable keeps track of the start position of the motif in the DNA sequences in X. We therefore setup a problem that can be tackled with EM: if we knew the motif, we could find the locations Z, and if we knew the locations Z, we could easily determine the motif. We start with neither.

Before we get started, note that EM is not the only way to find these motifs. Another, seemingly more direct approach, would be to tabulate all the k-mers for a value of k that is large, but less than the length of the motif we are looking for. If we look across sequences which should occasionally contain the motif, we will likely find sets of k-mers that occur more than we would expect in “background” DNA sequences.

x <- matrix(sample(1:4, n * l, TRUE),nrow=n)
z <- sample(l - motif.len + 1, n, TRUE)
for (i in seq_len(n)) {
  x[i, z[i] + 0:(motif.len-1)] <- motif
}
head(z)
## [1] 39 82 34 86 31 63
x[1,]
##   [1] 3 3 1 1 4 4 3 4 4 4 3 1 4 1 1 1 4 1 2 1 4 4 3 4 2 4 2 4 3 2 3 3 2 1 1 2 3 2 2 3 1 3 3 1 1 1 2
##  [48] 3 4 1 3 3 3 4 4 3 3 1 1 4 3 2 2 3 3 3 3 1 4 3 4 4 4 4 4 3 4 2 3 2 4 2 4 3 4 2 2 2 4 3 2 3 1 2
##  [95] 4 1 3 1 2 4
x[1,z[1] + 0:(motif.len-1)]
##  [1] 2 3 1 3 3 1 1 1 2 3
motif
##  [1] 2 3 1 3 3 1 1 1 2 3

Finally, we describe the parameters \(\theta\) we will use to describe the motif. We will specify a matrix which has four rows for each of the DNA nucleotides, and is wider than the motif itself. This extension of the size of the matrix will reduce the chance that we hit upon the motif, but some of the nucleotides in the motif fall outside of the nucleotides modeled in \(\theta\). A column of the matrix \(\theta\) will represent the probability of observing a given nucleotide (A,C,G,T down the rows) at a given position in the motif. Note again that the positions in the motifs we recover are arbitrarily shifted: if the true motif starts at position 3 or position 7 of \(\theta\) will both get us to a local maximum of the log likelihood.

set.seed(5)
theta.len <- 20
theta <- matrix(runif(theta.len*4),ncol=theta.len)
theta <- sweep(theta, 2, colSums(theta), "/")
dim(theta)
## [1]  4 20
theta[,1:5]
##           [,1]       [,2]       [,3]      [,4]      [,5]
## [1,] 0.0959475 0.04886533 0.52246317 0.2372528 0.1450051
## [2,] 0.3283730 0.32735177 0.06033207 0.4166572 0.3322247
## [3,] 0.4393886 0.24652564 0.14927475 0.1956664 0.2076419
## [4,] 0.1362910 0.37725726 0.26793000 0.1504236 0.3151282

Let’s first step through one iteration before we run the EM algorithm. We start by calculating the conditional probabilities using X and our initial guess \(\theta^0\). We store these in a matrix, with one row per DNA sequence, and as many columns as potential starting positions for the motif. As I said above, for simplicity, here we just focus on the probability of the motif sequence, and ignore the background sequence. Let’s just calculate the first row, i=1, and the first position, p=1.

i <- 1
p <- 1
# move a window along row i of x
snip <- x[i,p + 0:(theta.len-1)]
snip
##  [1] 3 3 1 1 4 4 3 4 4 4 3 1 4 1 1 1 4 1 2 1
# calculate the conditional probability given X and current theta
prod(theta[1,snip == 1]) * prod(theta[2,snip == 2]) *
  prod(theta[3,snip == 3]) * prod(theta[4,snip == 4])
## [1] 5.149132e-14

Once we’ve stored these condition probabilities in a matrix, z.mat, we can use these to maximize our expectation of the log likelihood with respect to the condition distribution of Z given X and \(\theta^t\) and then update to \(\theta^{t+1}\). We will add the contribution from each sequence out of n, and from all positions p. As z.mat will be normalized so that each row sums to 1 (it forms a density along the positions), we only need to normalize the contribution of each sequence and position by n. Our update for \(\theta^{t+1}\) for each nucleotide is the weighted average of the number of times we saw that nucleotide at that position. Here we show the line of code, but it is not evaluated, as we haven’t yet computed z.mat.

1/n * z.mat[i,p] * (x[i,p + 0:(theta.len-1)] == j)

Now we will put these two pieces together, within a loop. We will also add some code which will plot the guesses at each iteration.

First define a function to draw the estimates at each iteration, t:

drawTheta <- function(theta, t) {
  theta.max <- apply(theta, 2, which.max)
  theta.max.p <- apply(theta, 2, max)
  cols.alpha <- sapply(1:theta.len, function(i) {
    adjustcolor(cols[theta.max[i]],(theta.max.p[i] - .25)/.75)
  })
  points(seq_len(theta.len), rep(t,theta.len), col=cols.alpha, pch=15, cex=2)
}

The following code chunk is too long for an Rmarkdown, but I put it here as one long chunk so that you can step through the code at each iteration and examine the values of Z and theta, as the EM is progressing. It has to be one long chunk to draw the iterations on the same plot.

library(rafalib)
niter <- 10
nullplot(0, theta.len, -1, niter,
         xlab="position in motif", ylab="iterations")
cols <- c("green3","blue","orange","red")
# draw the true motif at the bottom
points(seq_len(motif.len) + (theta.len-motif.len)/2, 
       rep(-1,motif.len), col=cols[motif], pch=15, cex=2)
# draw theta^0
drawTheta(theta, 0)

### the EM algorithm ###
for (t in seq(1,niter)) {
  # initialize conditional distribution of Z
  z.mat <- matrix(0,nrow=n,ncol=l-theta.len+1)
  for (i in seq_len(n)) {
    for (p in seq_len(l-theta.len+1)) {
      # move a window along row i of x
      snip <- x[i,p + 0:(theta.len-1)]
      # calculate the conditional distribution of Z given X and current theta
      z.mat[i,p] <- prod(theta[1,snip == 1]) * prod(theta[2,snip == 2]) *
        prod(theta[3,snip == 3]) * prod(theta[4,snip == 4])
    }
  }
  # normalize z.mat to make it a density along positions
  z.mat <- z.mat / rowSums(z.mat)
  
  # re-estimate theta by maximizing over E_{Z|X,theta^t} (log lik)
  theta <- matrix(0,nrow=4,ncol=theta.len)
  for (j in 1:4) {
    for (i in seq_len(n)) {
      for (p in seq_len(l-theta.len+1)) {
        # for nucleotide j, add the contribution of sequence i,
        # averaging over all positions p, weighted by Z.
        # MLE is just the frequency of when the sequence = j here
        theta[j,] <- theta[j,] + 1/n * z.mat[i,p] * (x[i,p + 0:(theta.len-1)] == j)
      }
    }
  }
  drawTheta(theta, t) # theta for iteration 't'
}

Looking at the estimates of \(\theta\) over iterations, we see that the random initialization contained four positions that aligned with the true motif. Over the next three iterations, the motif filled in, as the density in Z concentrated on the true positions. Finally, after iteration 5, we see the motif come into strong relief, with the other positions evening out to equal probability of all nucleotides.

round(theta, 2)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
## [1,] 0.24 0.27 0.22 0.24 0.19 0.28 0.02 0.01 0.95  0.01  0.01  0.97  0.98  0.96  0.01  0.02  0.30
## [2,] 0.23 0.23 0.27 0.25 0.26 0.26 0.94 0.02 0.02  0.02  0.02  0.02  0.01  0.01  0.97  0.01  0.23
## [3,] 0.25 0.25 0.24 0.23 0.28 0.25 0.01 0.95 0.01  0.94  0.96  0.01  0.00  0.01  0.01  0.96  0.22
## [4,] 0.28 0.25 0.27 0.28 0.26 0.21 0.03 0.01 0.02  0.03  0.01  0.00  0.02  0.02  0.01  0.01  0.25
##      [,18] [,19] [,20]
## [1,]  0.17  0.22  0.25
## [2,]  0.29  0.26  0.24
## [3,]  0.32  0.27  0.26
## [4,]  0.22  0.25  0.24

Note that, the four nucleotides are not in equal abundance in “background” sequence in our genomes, and so a better motif finding method would additionally model the “background” sequence.

We can also see that the positions with highest density in Z correspond linearly to the true locations (although with a shift). Note that we did not locate the motif for every sequence. In particular, when the motif started on the far left or far right side, the \(\theta\) matrix was too wide.

plot(z, apply(z.mat, 1, which.max), ylab="max density in z.mat")
abline(0,1)