Here we will present the Hidden Markov Model (HMM), and apply it to detect copy number variants in arrayCGH data. Array comparative genomic hybridization (arrayCGH) is a microarray technology which compares, at a set of regions tiling the genome, the amount of DNA from a sample compared to a reference. Glossing over the technological details, if the signal is around 0 it means the sample has the same number of copies of DNA as the reference; consistent stretches of higher or lower values than 0 indicate amplifications and deletions, respectively. Let’s dive in a take a look. We load an arrayCGH dataset from a Bioconductor package, DNAcopy. (Aside: DNAcopy implements another useful signal processing algorithm, called circular binary segmentation. If we have time we will also cover this algorithm.)

We selected a subset of the data set presented in Snijders et al. (2001). We are calling this data set coriell. The data correspond to two array CGH studies of fibroblast cell strains. In particular, we chose the studies GM05296 and GM13330. After selecting only the mapped data from chromosomes 1-22 and X, there are 2271 data points.

If we draw all the values, lined up along the genome, it looks like this:

library(DNAcopy)
data(coriell)
plot(coriell$Coriell.05296)

One can clearly see two amplifications and one deletion by eye. There are also many stray points, which are likely just noise. The authors of the DNAcopy have a simple method to smooth away the singleton points. We ignore the warning about repeated positions…

cna <- CNA(cbind(coriell$Coriell.05296),
           coriell$Chromosome,coriell$Position,
           data.type="logratio", sampleid="c05296")

## Warning in CNA(cbind(coriell$Coriell.05296), coriell$Chromosome, coriell$Position, : array has repeated maploc positions

smo <- smooth.CNA(cna)
y <- smo$c05296
y <- y[!is.na(y)]
plot(y)

As I said previously, we can clearly identify the three regions that correspond to large variants in this dataset. So why do we need to use a sophisticated model? A couple of reasons: (1) we would like to have an automated procedure, so we don’t have to check such plots by eye. Always imagine, suppose the dataset grows from one sample to 10,000 samples. Will the check-by-eye procedure still work? (2) Suppose the samples have different noise level. We want a procedure which works fairly well for various signal-to-noise ratios. Even given the current, fairly high signal-to-noise data, we cannot simply draw a line to separate, for example, the amplification in the middle of the sequence. If we try a line roughly in between, it works for the local region, but would also call lots of singletons outside of the region, as shown in the following two plots.

plot(y, xlim=c(1080,1220))
abline(h=.25)

plot(y)
abline(h=.25)

Markov Models

One intuitive model for segmenting this data into regions is the Hidden Markov Model (HMM). If you haven’t yet seen Markov chains, they are very useful and not too hard to understand. An order-1 Markov chain is a sequence, \(X_1, X_2, ...\) such that the probability the chain is in a given state at time t given all of the previous time points is equal to the probability the chain is in that state at time t given the only the t-1 time point. We can discard all the information before time t-1 and still have the complete information for knowing the probabilities the chain will be in a given state at time t.

A Hidden Markov Model then consists of a Markov chain, which we do not observe directly. The chain has some discrete set of states, and some probabilities for transitioning from one state to another. At time t, we write that the hidden chain is in state i by writing \(X_t = i\). Instead of \(X_t\), what we observe is an outcome \(Y_t = y_t\), and the probability density function for \(Y_t\) will depend on which state \(X_t\) is in. The most common formulation is that \(Y_t\) also takes on discrete values, but here we will have \(Y_t\) take on continuous values, with Normal distributions indexed by i.

As a reference, we note that the following paper proposed an HMM to segment arrayCGH data, although we will use different notation, and a more simple HMM approach here.

Hidden Markov models approach to the analysis of array CGH data - Fridlyand et al

The key components to modeling data as an HMM are: a starting distribution \(\pi\) for \(X_1\) over the states, a transition matrix A, with entries \(A_{i,j}\) indicating the probability that X will transition from state i to state j in one step in the chain, and the emission probabilities B. If Y takes on discrete values, then B can be represented as a matrix, but here we will write \(B_{i,y_t}\) to be the value of a density evaluated at \(y_t\) using a series of probability density functions indexed by i. This can be seen clearly in the code below. We specify three states for X: n=3. We will encapsulate all variables as \(\theta = (\pi, A, B)\).

The final line in the code chunk below only makes sense for this particular application, where we enforce a probability of 0 for transitioning from state 1 to 3 or vice versa. This means that, we expect the chain will go first to state 2 (normal copy number).

T <- length(y)
n <- 3
pi <- rep(1/n, n) # equal starting probabilities
A <- matrix(.01, nrow=n, ncol=n) # the transition matrix
diag(A) <- .99 # high density on diagonal usually stay in same state
A[2,2] <- .98
A[1,3] <- A[3,1] <- 0 # only makes sense for CNV here
rowSums(A)

## [1] 1 1 1

##      [,1] [,2] [,3]
## [1,] 0.99 0.01 0.00
## [2,] 0.01 0.98 0.01
## [3,] 0.00 0.01 0.99

The following encodes the emission probabilities. Note that we make some really bad guesses for the means of the three probability density functions. We do this on purpose, to show the improvement when we later use the HMM to update these parameters using the EM algorithm.

mu <- c(-.4,0,.4) # bad guesses! to show improvement later
sigma <- rep(.1, n)
B <- function(j, y) {
  dnorm(y,mu[j],sigma[j])
}

Useful variables for analyzing the HMM

Here we will use the definitions of variables from An Introduction to hidden Markov models by Lawrence Rabiner and Biing-Hwang Juang. We use a small notation change, using Y for the observed variables and X for the hidden variables instead of O and I, as it is difficult to see the difference between I and the condition bar. The Y and X notation can be found on the Wikipedia page for the Baum-Welch algorith.

The following variables denote (with phrasing from Rabiner and Juang):

\(\alpha_{i,t}\) - the probability of the partial observation sequence until time t and state i, given the model \(\theta\)
\(\beta_{i,t}\) - the probability of the partial observation sequence from t+1 to the end, given state i and the model \(\theta\)
\(\gamma_{i,t}\) - the probability of state i at time t given the entire observation sequence and the model \(\theta\)

It’s useful to see these as formula:

\[ \alpha_{i,t} = P(Y_1=y_1,...,Y_t=y_t,X_t=i|\theta) \]

\[ \beta_{i,t} = P(Y_{t+1}=y_{t+1},...,Y_T=y_T|X_t=i,\theta) \]

\[ \gamma_{i,t} = P(X_t=i|Y,\theta) \]

Note that the \(\alpha\) and \(\beta\) do not have a correspondence to the A and B matrices, these are just the fairly universal symbols that are used for the forward and backward probabilities, and the transition and emission matrices. \(\alpha\) and \(\beta\) will include both A and B in their calculation.

The magic of the HMM is that, we can calculate these probabilities quite efficiently using dynamic programming, and relying on the key property of the Markov chain. If we consider \(\alpha_{i,t}\), we need only to examine the probabilities at time t-1 to evaluate the probabilities at time t. All of the information about the chain that we need to calculate probabilities at time t is encoded in the probabilities one step behind in time. And likewise for \(\beta_{i,t}\), but running time forward instead of backward. These two sets of equations are called the Forward-Backward algorithm.

Let’s step through how we will calculate each column of \(\alpha_{i,t}\). We start with the initial probabilities \(\pi\), and the observed value \(Y_1 = y_1\). Then at time t=2 and for state i, we take the sum of the following vector: the pre-computed probabilities from all possible states at time t=1, multiplied by the probability of transitioning to state i, and then multiplied by the probability of having observed \(Y_2 = y_2\). Because of the Markov property, we can repeat this step until the end of the sequence, only looking back one time point at each step. This property, of only having to look back at the previous stage to know how to move forward, is part of dynamic programming. Typical dynamic programs use this efficiency to calculate, for example, the shortest path through a graph by first solving the shortest path through a smaller part of the graph.

alpha <- matrix(0,nrow=n,ncol=T)
alpha[,1] <- pi * B(1:n,y[1])
for (t in 2:T) {
  for (i in 1:n) {
    alpha[i,t] <- sum(alpha[,t-1] * A[,i]) * B(i,y[t])
  }
  alpha[,t] <- alpha[,t] / sum(alpha[,t]) # scale to 1 to prevent small probs
}

We can see the same pattern at work for \(\beta_{i,t}\)

beta <- matrix(0,nrow=n,ncol=T)
beta[,T] <- 1
for (t in (T-1):1) {
  for (i in 1:n) {
    beta[i,t] <- sum(beta[,t+1] * A[i,] * B(1:n,y[t+1]))
  }
  beta[,t] <- beta[,t] / sum(beta[,t])
}

Finally, note that we can obtain \(\gamma_{i,t}\) using \(\alpha_{i,t}\) and \(\beta_{i,t}\).

\[ \gamma_{i,t} = P(X_t=i|Y,\theta) \]

\[ = \frac{P(X_t=i,Y|\theta)}{P(Y|\theta)} \]

\[ = \frac{P(Y_1=y_1,...,Y_t=y_t,X_t=i|\theta) P(Y_{t+1}=y_{t+1},...,Y_T=y_T|X_t=i,\theta)}{P(Y|\theta)} \]

gamma <- alpha * beta
gamma <- t( t(gamma) / colSums(gamma) )

This is interesting to plot, the probability that we are in a given state at a given time point. Note that this does not tell us the most likely path of states, which is described below.

par(mfrow=c(2,1),mar=c(3,3,3,1))
plot(y, xlab="", ylab="", main="data: X_t")
plot(seq_len(T), gamma[2,], type="l", xlab="", ylab="",
     main="P(X_t = state 2 | Y, theta)")

The Viterbi sequence

The Viterbi sequence can be used to give the most likely path of states. We write \(V_{i,t}\) as the probability of the most probable sequence up to t, having i as its final state. While we calculate these probabilities of most probable sequences, we will also keep another variable Ptr which points us back along that path.

V <- matrix(0,nrow=n,ncol=T)
V[,1] <- pi * B(1:n,y[1])
Ptr <- matrix(0,nrow=n,ncol=T)
for (t in 2:T) {
  for (i in 1:n) {
    V[i,t] <- max(V[,t-1] * A[,i]) * B(i,y[t])
    Ptr[i,t] <- which.max(V[,t-1] * A[,i])
  }
  V[,t] <- V[,t] / sum(V[,t])
}

Once we are done, we can run backwards from T to 1, and re-construct the most probably sequence of states, here stored as x.hat.

x.hat <- numeric(T)
x.hat[T] <- which.max(V[,T])
for (t in (T-1):1) {
  x.hat[t] <- Ptr[x.hat[t+1],t+1]
}

Note that, even though we had pretty bad guesses for the mean values, we still accurately segment the amplifications and deletions, apart from the normal copy number state.

par(mfrow=c(2,1),mar=c(3,3,3,1))
plot(y, xlab="", ylab="", main="data: X_t")
plot(seq_len(T), x.hat, type="l", xlab="", ylab="",
     main="The Viterbi sequence")

Updating parameters: the Baum-Welch algorithm

Finally, we show the algorithm used for updating our estimates of the parameters \(\theta = (\pi,A,B)\), which is called the Baum-Welch algorith, a special case of the EM algorithm.

We already have all we need to update \(\pi\) and the parameters of B, but we need to define a new variable to give us the conditional probabilities of transitions given our current estimates for the model \(\theta\).

We define a variable, \(\xi_{i,j,t}\) as the probability of a path being in state i, and transitioning to state j at time t+1, given the observation sequence and the estimates for \(\theta\).

\[ \xi_{i,j,t} = P(X_t=i,X_{t+1}=j|Y,\theta) \]

\[ = \frac{\alpha_{i,t} A_{i,j} B_{j,y_{t+1}} \beta_{j,t+1}}{P(Y|\theta)} \]

We can compute these probabilities, for all i,j,t, using the Forward algorithm, multiplying one set of transition probabilities with A, an emission probability with B, followed by the Backward algorithm. As shown above, all of these values are already available, we just need to iterate through all the values for i,j,t, and then scale appropriately.

xi <- array(0, dim=c(n,n,T-1))
for (t in seq_len(T-1)) {
  for (i in 1:n) {
    for (j in 1:n) {
      xi[i,j,t] <- alpha[i,t]*A[i,j]*B(j,y[t+1])*beta[j,t+1]
    }
  }
  xi[,,t] <- xi[,,t] / sum(xi[,,t])
}

Finally, we can update parameters in the M step by maximizing their values using \(\gamma_{i,t}\) and \(\xi_{i,j,t}\) as the conditional probabilities for the chain of states and transitions. For means and variances, we use weighted sample mean and weighted sample variance calculations. If Y took discrete values, we would just count up the number of times it took each value, weighted by the probability for a given state.

pi.hat <- gamma[,1]
A.hat <- apply(xi, c(1,2), sum) / rowSums(gamma[,-T])
mu.hat <- rowSums(t(t(gamma) * y)) / rowSums(gamma)
sigma.hat <- sapply(1:n, function(i) {
  sqrt(sum(gamma[i,] * (y - mu.hat[i])^2) / sum(gamma[i,]))
})

We can see how we did with our new estimates for \(\mu\), which are greatly improved. Note that, we could improve even more by specifying a fourth state for the amplification on the far right.

library(RColorBrewer)
palette(brewer.pal(2*n, "Paired"))
par(mfrow=c(1,2))
plot(y, col=x.hat * 2 - 1)
abline(h=mu, col=1:n * 2, lwd=3)
plot(y, col=x.hat * 2 - 1)
abline(h=mu.hat, col=1:n * 2, lwd=3)

Our initial guess for \(\sigma\) has also been improved, see how close it is to the variance of the y values for the normal copy number state.

sigma

## [1] 0.1 0.1 0.1

sigma.hat[2]

## [1] 0.07975736

sd(y[x.hat == 2])

## [1] 0.08026075

Finally, the transition probabilities also make sense, and roughly correspond to what we would expect given the length of the runs in the predicted segmentation: the run length for state i can be considered a geometric random variable with success probability \(A_{i,i}\).

sum(x.hat == 1)

## [1] 15

1/A.hat[1,2]

## [1] 14.31937

sessionInfo()

## R version 4.0.4 (2021-02-15)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] RColorBrewer_1.1-2 DNAcopy_1.64.0     testthat_3.0.2     rmarkdown_2.7      devtools_2.3.2    
## [6] usethis_2.0.1     
## 
## loaded via a namespace (and not attached):
##  [1] highr_0.8           compiler_4.0.4      bslib_0.2.4         BiocManager_1.30.10 jquerylib_0.1.3    
##  [6] prettyunits_1.1.1   remotes_2.2.0       tools_4.0.4         digest_0.6.27       pkgbuild_1.2.0     
## [11] pkgload_1.2.0       jsonlite_1.7.2      memoise_2.0.0       evaluate_0.14       lifecycle_1.0.0    
## [16] rlang_0.4.10        cli_2.3.1           yaml_2.2.1          xfun_0.22           fastmap_1.1.0      
## [21] withr_2.4.1         stringr_1.4.0       knitr_1.31          desc_1.3.0          fs_1.5.0           
## [26] sass_0.3.1          rprojroot_2.0.2     glue_1.4.2          R6_2.5.0            processx_3.4.5     
## [31] sessioninfo_1.1.1   callr_3.5.1         purrr_0.3.4         magrittr_2.0.1      codetools_0.2-18   
## [36] ps_1.6.0            ellipsis_0.3.1      htmltools_0.5.1.1   assertthat_0.2.1    stringi_1.5.3      
## [41] cachem_1.0.4        crayon_1.4.1

Hidden Markov Models and Dynamic Programming

Michael Love

Markov Models

Useful variables for analyzing the HMM

The Viterbi sequence

Updating parameters: the Baum-Welch algorithm