
1 Introduction

In this lecture we will discuss several specific MCMC algorithms that address some of the shortcomings of the general methods discussed in the last lecture. Namely, we will focus on adaptive MCMC methods, as well as reversible jump MCMC methods for Bayesian variable selection.

2 Adaptive MCMC

In the previous lecture we discussed examples of MH and Gibbs sampling where the variance of the proposal distribution was fixed across iterations. For example, in the MH independence sampler we drew from a proposal distribution \(g \sim N(0,\sigma^2)\) where \(\sigma^2\) was fixed across iterations. For the MH random walk sampler, we drew \(\epsilon\) from a distribution \(h\), where \(h\) was, for example, \(U(0,1)\) or \(N(0,1)\).

As one can imagine, in certain situations the rejection rate of such samplers may be much higher or much lower than the recommended range, leading to decreased performance. In the last lecture we discussed manual tuning methods, where one updates the proposal distribution after several runs of MCMC and inspection of the output.

An alternative, and perhaps more convenient, approach is to tune the proposal distribution while the chain is running, given the observed rejection rate: for example, widening the proposal distribution if the rejection rate is too low, or narrowing it if the rejection rate is too high. Such approaches are called Adaptive MCMC (AMCMC) methods.

Implementations of AMCMC methods have been around for some time; however, as one can imagine, proving that such adaptive chains retain the desired stationary distribution was difficult. More recently, theoretical work has addressed this point and established the conditions under which we can expect stationary behavior of such chains. We will see how we must craft AMCMC algorithms to satisfy such conditions.

In general, we must take care in setting up such chains, as we do not want the chain to be too heavily influenced by the tuning. Under the Markov assumption, the current state depends only on the previous one; with too much tuning, however, we run the risk that the current state depends on the full history of the chain. Another issue to worry about is whether the tuning-induced dependence on previous iterations is too strong, resulting in the sampler not fully exploring the state space. In general, it is best to ease off the amount of tuning as the chain progresses to allow it to reach its stationary distribution.

2.1 Conditions for AMCMC algorithms to ensure stationarity

We can assume that an AMCMC algorithm is ergodic with respect to its target distribution \(f\) if the algorithm satisfies two conditions.

The first is that of diminishing adaptation: as \(t\rightarrow \infty\), the parameters of the chain must depend less and less on the earlier states of the chain. Practically speaking, this is met by ensuring that as the chain progresses, the modifications to the proposal distribution parameters occur less and less frequently, or in smaller and smaller amounts. This is an example of easing off the tuning as the chain progresses. One can imagine that frequent tuning updates as \(t \rightarrow \infty\) would heavily influence the chain and interfere with it reaching its stationary distribution.

The second condition is that of bounded convergence, or containment. Let \(D^{(t)}\) represent the largest possible distance between the target stationary distribution and the stationary distribution of the transition kernel of the MCMC algorithm at time \(t\). Let \(M^{(t)}(\epsilon)\) be the smallest \(t\) such that \(D^{(t)} < \epsilon\). The bounded convergence condition states that the stochastic process \(M^{(t)}(\epsilon)\) is bounded in probability for any \(\epsilon > 0\). In other words, the time it takes for the two distributions to get sufficiently close is bounded in probability. In practice these conditions lead to simpler, verifiable conditions that are relatively easy to check (we will talk more about this later).
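To get a feel for the diminishing adaptation condition, here is a small illustrative R sketch (a toy example, not from the lecture) plotting the adaptation factor \(\delta_{(b+1)} = \min(0.01, 1/\sqrt{T_b})\) that we will use later in this lecture, under an assumed batch schedule \(T_b = 50b\). The amount of tuning is capped early on and decays to zero as the chain progresses:

b = 1:2000
Tb = 50 * b

# adaptation factor at each batch time; pmin applies the cap elementwise
delta.b = pmin(0.01, 1 / sqrt(Tb))

# delta.b is capped at 0.01 early on, then decays to 0 as t grows
plot(Tb, delta.b, type = "l", xlab = "batch time Tb", ylab = "adaptation factor")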

2.2 Metropolis-within-Gibbs

In the last lecture we alluded to the scenario where one may not have the necessary conditional distributions in closed form to implement the Gibbs sampler, or where it may not be easy to sample from such conditional distributions. In such cases, we can apply the Metropolis-within-Gibbs (MwG) algorithm, where we replace each Gibbs step with a Metropolis step.

If we recall, we defined \(\boldsymbol{X} = (X_1,\ldots,X_p)^T\) and \(\boldsymbol{X}_{-i} = (X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_p)^T\), the elements of \(\boldsymbol{X}\) minus the \(i\)th component. Suppose that the univariate conditional density of \(X_i|\boldsymbol{X}_{-i}\), denoted as \(f(X_i|\boldsymbol{X}_{-i})\), is easily sampled for \(i = 1,\ldots,p\). Then, we can describe a general Gibbs sampling procedure as follows:

  1. Select starting values \(\boldsymbol{x}^{(0)}\), and set \(t = 0\).
  2. Generate sequentially

\[\begin{align} X_1^{(t+1)} &| \cdot \sim f(x_1 | x_2^{(t)},\ldots,x_p^{(t)}),\\ X_2^{(t+1)} &| \cdot \sim f(x_2 | x_1^{(t+1)},x_3^{(t)},\ldots,x_p^{(t)}),\\ &\ \ \vdots\\ X_p^{(t+1)} &| \cdot \sim f(x_p | x_1^{(t+1)},\ldots,x_{p-1}^{(t+1)}), \end{align}\] where \(|\cdot\) denotes conditioning on the most recent updates to \(\boldsymbol{X}\).
  3. Set \(t = t+1\) and go to step 2.
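As a quick refresher, here is a minimal sketch of this cycle (an illustrative toy example, not from the lecture) for a bivariate standard normal target with correlation \(\rho\), where both univariate conditional distributions are known normals:

rho = 0.8
M = 1000
x = matrix(0, M, 2)

for (t in 1:(M - 1)) {
  # X1 | X2 = x2 ~ N(rho * x2, 1 - rho^2)
  x[t + 1, 1] = rnorm(1, mean = rho * x[t, 2], sd = sqrt(1 - rho^2))
  
  # X2 | X1 = x1 ~ N(rho * x1, 1 - rho^2), conditioning on the newest X1 draw
  x[t + 1, 2] = rnorm(1, mean = rho * x[t + 1, 1], sd = sqrt(1 - rho^2))
}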

For MwG, we replace one or more steps in the Gibbs cycle in step (2) above with a Metropolis step. We can rewrite the Gibbs sampling algorithm with MwG using the following approach, where we assume for the sake of example that we wish to use a random walk sampler in each MH step:

  1. Select starting values \(\boldsymbol{x}^{(0)}\), and set \(t = 0\).
  2. MwG update: Obtain \(X_1^{(t+1)}\) via MH random walk update
    1. Generate proposal \(X_1^{*} = x_1^{(t)} + \epsilon\), where \(\epsilon \sim N(0, \sigma^2)\)
    2. Compute the MH ratio \[R({x}^{(t)},{X}^{*}) = \frac{f(X_1^{*})}{f(x_1^{(t)})}\]
    3. Then we sample a value for \(X_1^{(t+1)}\) by choosing
    \[\begin{align} {X}^{(t+1)} &= \begin{cases} {X}^{*} \text{ with probability min} (R({x}^{(t)},{X}^{*}),1)\\ {x}^{(t)} \text{ otherwise.} \end{cases} \end{align}\]
  3. Then, we perform the Gibbs updates for the remaining parameters that we have closed form univariate conditionals for, where we generate in turn

\[\begin{align} X_2^{(t+1)} &| \cdot \sim f(x_2 | x_1^{(t+1)},x_3^{(t)},\ldots,x_p^{(t)}),\\ X_3^{(t+1)} &| \cdot \sim f(x_3 | x_1^{(t+1)},x_2^{(t+1)},x_4^{(t)},\ldots,x_p^{(t)}),\\ &\ \ \vdots\\ X_p^{(t+1)} &| \cdot \sim f(x_p | x_1^{(t+1)},\ldots,x_{p-1}^{(t+1)}). \end{align}\] Here \(|\cdot\) denotes conditioning on the most recent updates to \(\boldsymbol{X}\).
  4. Set \(t = t+1\) and go to step 2.

Here we assume that the parameters were arranged such that the only parameter without a closed form conditional was first in the cycle. Clearly, we can generalize this to the case where several or all of the parameters lack closed form univariate conditional distributions.

Let's illustrate this with an example, where we extend the fitting of the Poisson GLMM problem via MCEM to a higher dimensional setting: we now assume our model has four fixed effects predictors, each with its own subject-level random slope, and also a random intercept.

2.2.1 Example: Fitting a higher dimensional Poisson GLMM via MCEM

For this example, let us assume a similar setup as the motivating problem in the last lecture, except now we assume that \(\log(\lambda_{ij}) = \boldsymbol{x}_{ij}\boldsymbol{\beta} + \boldsymbol{z}_{ij}\boldsymbol{\gamma}_i\), where \(\boldsymbol{x}_{ij} = (1, x_{ij1}, x_{ij2}, x_{ij3}, x_{ij4})\) with corresponding \(\boldsymbol{\beta} = (\beta_0,\beta_1,\beta_2, \beta_3,\beta_4)^T\), and \(\boldsymbol{\gamma}_i = (\gamma_{i0},\gamma_{i1},\gamma_{i2},\gamma_{i3},\gamma_{i4})^T\). Given that the random effects are now multivariate, we replace \(\phi(\boldsymbol{\gamma}_i| 0, \sigma^2_{\gamma})\) with \(\phi(\boldsymbol{\gamma}_i| 0, \Sigma_{\gamma})\), the \(MVN(0,\Sigma_{\gamma})\) density, where \(\Sigma_{\gamma}\) is a \(q \times q\) covariance matrix pertaining to the random effects vector \(\boldsymbol{\gamma}_i\) to be estimated. Since we assume that each predictor has its own random slope, as well as a random intercept, here \(\boldsymbol{z}_{ij} = \boldsymbol{x}_{ij}\).

That is, relative to the previous problem, we are increasing the dimension of both the fixed and random effects. The new likelihood takes a similar form

\[L(\boldsymbol{\beta}, \Sigma_{\gamma} | \boldsymbol{y}) = \prod_{i = 1}^{n}\int \left[f(\boldsymbol{y}_{i} | \boldsymbol{\lambda}_{i})\phi(\boldsymbol{\gamma}_i| 0, \Sigma_{\gamma}) \right] d\boldsymbol{\gamma}_i.\]

Clearly the integral here is multidimensional with respect to the random effect vector \(\boldsymbol{\gamma}_i\), which in this setting has dimension \(q = 5\); in the previous problem, \(q = 1\). The Q-function pertaining to the \(i\)th subject then takes the following form and Monte Carlo approximation:

\[\begin{align} Q_i(\boldsymbol{\theta}|\boldsymbol{\theta}^{(t)}) &= \int \left[\log f(\boldsymbol{y}_{i} | \boldsymbol{\lambda}_{i}) + \log\phi(\boldsymbol{\gamma}_i| 0, \Sigma_{\gamma})\right]f(\boldsymbol{\gamma}_i | \boldsymbol{y}_o, \boldsymbol{\theta}^{(t)})d\boldsymbol{\gamma}_i\\ &= \int \left[\sum_{j = 1}^{5}\log f(y_{ij} | \lambda_{ij}) + \log\phi(\boldsymbol{\gamma}_i| 0, \Sigma_{\gamma})\right]f(\boldsymbol{\gamma}_i | \boldsymbol{y}_o, \boldsymbol{\theta}^{(t)})d\boldsymbol{\gamma}_i\\ &\approx \frac{1}{M}\sum_{k = 1}^M\left[\sum_{j = 1}^{5}\log f(y_{ij} | \lambda_{ijk}) + \log\phi(\boldsymbol{\gamma}_{ik}| 0, \Sigma_{\gamma})\right], \end{align}\] where \(\boldsymbol{\gamma}_{ik}\) denotes the \(k\)th of \(M\) MCMC draws from \(f(\boldsymbol{\gamma}_i | \boldsymbol{y}_o, \boldsymbol{\theta}^{(t)})\), and \(\lambda_{ijk}\) is the rate implied by \(\boldsymbol{\gamma}_{ik}\).

Let's simulate some data from such a model below. We will use a similar structure as the previous Poisson GLMM example, with \(n_i = 5\) measurements per subject and a random intercept, but we will also add four fixed effects predictors, each with its own random slope. Given the increased dimension, however, we will bump \(n\) from 22 to 100 subjects.

library(mvtnorm)
library(lme4)
## Loading required package: Matrix
## set seed
set.seed(1)

## set variables
# number of samples
n = 100

# number of timepoints measured per sample
ni = 5

# total number of measurements
N = n * ni

# pick fixed effects coefficient vector
beta = matrix(c(0.5, .25, .25, .25, .25), ncol = 1)

# pick random effect variances (assuming 0 covariance)
Sigma_gamma = diag(c(0.5, 0.5, 0.5, 0.5, 0.5))

# set dimensions of fixed effects and random effects
p = length(beta)
q = ncol(Sigma_gamma)


## simulate data for p time varying covariates, measured per time point
mat = matrix(rnorm(N * (p - 1), mean = 0, sd = 0.5), nrow = N, ncol = p -
               1)

# specify subject ID for each row in mat
ID = rep(1:n, each = ni)

# add intercept
X = cbind(rep(1, N), mat)
colnames(X) = c("int", "x1", "x2", "x3", "x4")


## create y vector to hold observation
y = rep(0, N)

# now simulate yi, i = 1,..,100
for (i in 1:n) {
  # get subject  i indices
  subjecti = which(ID == i)
  
  # get time-varying covariates for subject i
  Xi = X[subjecti, ]
  
  # here we assume random effect matrix Zi for subject i same as Xi
  Zi = Xi
  
  # draw single q dimensional random effect vector for subject i, gammai
  gammai = rmvnorm(n = 1,
                   mean = rep(0, ncol(Sigma_gamma)),
                   sigma = Sigma_gamma)
  gammai = matrix(gammai, ncol = 1)
  
  # generate vector of ni observations for subject i
  lambdai = exp(Xi %*% beta + Zi %*% gammai)
  y[subjecti] = rpois(ni, lambdai)
}

## create simulated alzheimers data.table, month only used to index rep measurements here
library(data.table)
alz = data.table(
  y = y,
  x1 = X[, 2],
  x2 = X[, 3],
  x3 = X[, 4],
  x4 = X[, 5],
  subject = ID,
  month = rep(1:5, n)
)

Now that we have simulated the data, let's check that we did this properly by fitting the desired model using glmer. Note the run time.

start =  Sys.time()
fit.glmer = glmer(y ~ x1 + x2 + x3 + x4 + (x1 + x2 + x3 + x4 |subject) ,
                  family = poisson(),
                  data = alz)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model failed to converge with max|grad| = 0.0958471
## (tol = 0.001, component 1)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?
end =  Sys.time()

summary(fit.glmer)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: y ~ x1 + x2 + x3 + x4 + (x1 + x2 + x3 + x4 | subject)
##    Data: alz
## 
##      AIC      BIC   logLik deviance df.resid 
##   2072.4   2156.7  -1016.2   2032.4      480 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8114 -0.7263 -0.1314  0.4254  2.6370 
## 
## Random effects:
##  Groups  Name        Variance Std.Dev. Corr                   
##  subject (Intercept) 0.4148   0.6441                          
##          x1          0.6348   0.7967   -0.11                  
##          x2          0.2217   0.4709   -0.02 -0.52            
##          x3          0.3575   0.5979   -0.22  0.07  0.71      
##          x4          0.4946   0.7033   -0.22 -0.45  0.19  0.30
## Number of obs: 500, groups:  subject, 100
## 
## Fixed effects:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 0.6040093  0.0007647   789.9   <2e-16 ***
## x1          0.2727996  0.0007642   357.0   <2e-16 ***
## x2          0.3664818  0.0007644   479.5   <2e-16 ***
## x3          0.1806335  0.0007641   236.4   <2e-16 ***
## x4          0.3050672  0.0007893   386.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##    (Intr) x1    x2    x3   
## x1 0.000                   
## x2 0.000  0.000            
## x3 0.000  0.000 0.000      
## x4 0.000  0.000 0.000 0.000
## convergence code: 0
## Model failed to converge with max|grad| = 0.0958471 (tol = 0.001, component 1)
## Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?
print(end - start)
## Time difference of 6.754664 secs

Looks good, though the run time is noticeably longer than in our previous example! This is with the Laplace approximation in glmer; we can see that performance slows down even for the Laplace approximation in higher dimensions. In this case we also observed problems with the convergence of the model.

Now, let's try to fit the model via MCEM using an independence sampler. Here, using the MH independence sampler implies drawing from a multivariate proposal distribution. Let's keep track of the acceptance rate while it runs. We will set up the function for the independence sampler with a proposal density \(g \sim MVN(\boldsymbol{0}, \Sigma_\gamma^{(t)})\), where \(\Sigma_\gamma^{(t)}\) is the estimate of the \(q \times q\) random effects covariance matrix at step \(t\) of the MCEM algorithm. Some helper functions are below.

## function for the log likelihood for ith subject, x is the proposal or current value for gammai
f = function(x, yi, Xi, betat, Sigma_gammat) {
  # here we assume random effect matrix Zi for subject i same as Xi
  Zi = Xi
  
  # calculate lambdai
  lambdai = exp(Xi %*% betat + Zi %*% matrix(x, ncol = 1))
  
  # sum across repeated observations for poisson portion
  lli = sum(dpois(yi, lambdai, log = T))
  
  # don't forget to include the MVN log likelihood
  lli = lli  + dmvnorm(x, sigma = Sigma_gammat, log = T)
  
  return(lli)
}

## log proposal density function
g = function(x, Sigma_gammat) {
  dmvnorm(x, sigma = Sigma_gammat, log = T)
}

## proposal function, MVN(0, Sigma)
g.sim = function(Sigma_gammat) {
  rmvnorm(1, sigma = Sigma_gammat)
}

## calculate MH ratio given f and g, x is the proposal, xt is the current value from the chain
R = function(xt, x, f, g, yi, Xi, betat, Sigma_gammat) {
  # log numerator - log denominator
  logR = (f(x, yi, Xi, betat, Sigma_gammat) + g(xt, Sigma_gammat)) - (f(xt, yi, Xi, betat, Sigma_gammat) + g(x , Sigma_gammat))
  R = exp(logR)
  return(R)
}

Now that that's done, let's write the function for the MH algorithm here.

mh.independence.sampler = function(yi, Xi, betat, Sigma_gammat, M, prev.gamma.i = NULL){

  # get dimension of gammai
  q = ncol(Sigma_gammat)
  
  # initialize the chain vector
  x.indep.chain = matrix(0, M, q)
  
  if(is.null(prev.gamma.i)){
    # Simulate initial draw from proposal density g
    x.indep.chain[1,] = g.sim(Sigma_gammat)
  }else{
    # if last value from previous chain avail, start there
    x.indep.chain[1,] = prev.gamma.i    
  }
  
  
  # now start chain
  accept = 0
  for(i in 1:(M-1)){
    
    # set the value at current iteration of the chain to variable xt
    xt = x.indep.chain[i,]
    
    # draw a proposal from the proposal density
    x = g.sim(Sigma_gammat)
    
    # calculate MH ratio 
    r = min(
            R(xt, x, f, g, yi, Xi, betat, Sigma_gammat),
            1
          )
    
    # Generate draw from bernoulli(p).
    # Alternatively, can directly compare ratio to 
    # a U(0,1) draw as we did with Rejection Sampling
    keep = rbinom(1, 1, r)
    
    if(keep == 1){
      # if keep = 1, then set next iteration equal to the proposal
      x.indep.chain[i+1,] = x
      #  update number of acceptances
      accept = accept + 1
    }else{
      # otherwise, carry over value from the current iteration
      x.indep.chain[i+1,] = xt
    }
  }
  
  return(list(gammai = x.indep.chain, ar = accept/M))
}

To simplify things, let's just roll this into an E-step function to calculate the Q-function.

e.step= function(y, X, ID, betat, Sigma_gammat, M, n, ni, sampler, burn.in = 200,prev.gamma = NULL){
  
  # initialize Q-function
  Qfunction = 0
  
  # matrix to hold chains from each subject
  gamma = matrix(0, n*M, ncol(Sigma_gammat))
  
  # matrix to hold subject-specific acceptance rates
  ar = matrix(0, n, q)
  
  # vector to hold offset values Zi %*% gammai, yaug, Xaug in M step
  N = ni*n
  offset = yaug = rep(0, N*M)
  Xaug = matrix(0, nrow = N*M, ncol = ncol(X))
  
  # loop over observations
  for(i in 1:n){
      
      # subject i indices
      subjecti = which(ID == i)
      
      # grab subject i data
      yi = y[subjecti]
      Xi = X[subjecti,]
      
      # create chain of length M per observation 
      if(is.null(prev.gamma)){
        # if no previous chain available
        # start from scratch and remove burn.in
        chain = sampler(
                    yi = yi, 
                    Xi = Xi,
                    betat = betat, 
                    Sigma_gammat = Sigma_gammat,
                    M = M + burn.in)
        gammai = chain$gammai[-c(1:burn.in),]
      }else{
        # if chain available from previous EM
        # restart this chain from last draw in previous chain
        chain = sampler(
                      yi = yi, 
                      Xi = Xi,
                      betat = betat, 
                      Sigma_gammat = Sigma_gammat,
                      M = M, # no burn in
                      prev.gamma.i = prev.gamma[i,]
                ) 
        gammai = chain$gammai
      }

      ar[i,] = chain$ar
      
      # create augmented versions for Q function calculation
      # per-subject length is (ni*M) rows
      aug = rep(1:ni, M)
      yi_aug = yi[aug]
      Xi_aug = Xi[aug,]
      Zi_aug = Xi_aug
      
      # create augmented version of gammai to aid vectorization
      # repeated ni times per replicated subject 
      # per-subject length is (ni*M) rows
      augg = rep(1:M, each = ni)
      gammai_aug = gammai[augg,]
      
      # calculate Q function for subject i: Poisson portion (ni*M terms)
      XBaug = Xi_aug%*%betat
      Zgammaaug = rowSums(Zi_aug * gammai_aug)
      lambdai_aug = exp(XBaug + Zgammaaug)
      Qi = sum(dpois(yi_aug, lambda = lambdai_aug, log = T))
      
      # calculate Q function for subject i: MVN portion (M terms)
      Qi = Qi + sum(dmvnorm(gammai_aug, sigma = Sigma_gammat, log = T))
      
      # divide by M
      Qi = Qi/M
      
      # add to overall Q estimate
      Qfunction = Qfunction + Qi
      
      # save offset, yaug, xaug for later
      a = (i-1)*M*ni + 1
      b = i*M*ni
      offset[a:b] = Zgammaaug
      yaug[a:b] = yi_aug
      Xaug[a:b,] = Xi_aug
      
      # save gammai for later
      a = (i-1)*M + 1
      b = i*M
      gamma[a:b,] = gammai
    }
      
    return(list(Qfunction = Qfunction, gamma = gamma, ar = ar, offset = offset, yaug = yaug, Xaug = Xaug))
}

Now let's start the EM algorithm.

## set initial parameters
  tol = 10^-5
  maxit = 25
  iter = 0
  eps = Inf
  qfunction = -10000 # using Qfunction for convergence
  prev.gamma = NULL
  
## starting values
  beta = as.vector(glm(y ~ X-1, family = poisson())$coef)
  Sigma_gamma =  diag(rep(1, 5))

## fix chain length at 1000 in E-step
M = 1000

start = Sys.time()
while(eps > tol & iter < maxit){
  
  ## save old qfunction
  qfunction0 = qfunction
  
  ## obtain last chain value (Mth value) for each obs if iter > 0
  if(iter > 0){
    prev.gamma = gamma[seq(M,nrow(gamma), by = M),]
  }
  
  ## E-step
  estep = e.step(y = y, X = X, ID = ID, betat = beta, Sigma_gammat = Sigma_gamma, M = M, n = n, ni = ni, sampler = mh.independence.sampler, prev.gamma = prev.gamma)
  gamma = estep$gamma
  qfunction = estep$Qfunction
  offset = estep$offset
  yaug = estep$yaug
  Xaug = estep$Xaug
  
  ## Calculate relative change in qfunction from prior iteration
  eps  = abs(qfunction-qfunction0)/abs(qfunction0)
    
  ## Start M-step
    
  # Sigma_gamma: MLE of the random effect covariance matrix (mean-zero MVN),
  # averaged over M; closed form derived from the Q-function approximation
  Sigma_gamma = t(gamma) %*% gamma/(n*M)

  aug = rep(1:n,each = M)
  fit = glm(yaug ~ Xaug -1, 
              family = poisson(), 
              weights = rep(1/M, nrow(Xaug)), 
              offset = offset,
              # use starting value from previous step
              start = beta
  )
  beta = as.vector(fit$coefficients)
    
  ## update iterator
  iter = iter + 1
  if(iter == maxit) warning("Iteration limit reached without convergence")
    
  ## print out info to keep track
  cat(sprintf("Iter: %d Qf: %.3f sigma_gamma1: %f beta0: %.3f beta1:%.3f eps:%f\n",iter, qfunction,diag(Sigma_gamma)[1], beta[1],beta[2], eps))
}
## Iter: 1 Qf: -4112.496 sigma_gamma1: 0.823670 beta0: 1.026 beta1:0.479 eps:0.588750
## Iter: 2 Qf: -3914.505 sigma_gamma1: 0.719889 beta0: 0.992 beta1:0.450 eps:0.048144
## Iter: 3 Qf: -3733.784 sigma_gamma1: 0.653014 beta0: 0.960 beta1:0.439 eps:0.046167
## Iter: 4 Qf: -3606.619 sigma_gamma1: 0.602396 beta0: 0.933 beta1:0.412 eps:0.034058
## Iter: 5 Qf: -3483.227 sigma_gamma1: 0.541872 beta0: 0.916 beta1:0.402 eps:0.034213
## Iter: 6 Qf: -3408.264 sigma_gamma1: 0.510296 beta0: 0.902 beta1:0.382 eps:0.021521
## Iter: 7 Qf: -3382.641 sigma_gamma1: 0.525423 beta0: 0.873 beta1:0.392 eps:0.007518
## Iter: 8 Qf: -3314.316 sigma_gamma1: 0.503751 beta0: 0.852 beta1:0.386 eps:0.020199
## Iter: 9 Qf: -3286.955 sigma_gamma1: 0.472633 beta0: 0.833 beta1:0.361 eps:0.008255
## Iter: 10 Qf: -3258.370 sigma_gamma1: 0.467022 beta0: 0.813 beta1:0.341 eps:0.008697
## Iter: 11 Qf: -3239.405 sigma_gamma1: 0.456918 beta0: 0.789 beta1:0.337 eps:0.005820
## Iter: 12 Qf: -3197.103 sigma_gamma1: 0.446558 beta0: 0.765 beta1:0.328 eps:0.013059
## Iter: 13 Qf: -3178.757 sigma_gamma1: 0.437421 beta0: 0.745 beta1:0.325 eps:0.005738
## Iter: 14 Qf: -3144.555 sigma_gamma1: 0.407973 beta0: 0.735 beta1:0.297 eps:0.010759
## Iter: 15 Qf: -3111.318 sigma_gamma1: 0.400836 beta0: 0.726 beta1:0.281 eps:0.010570
## Iter: 16 Qf: -3098.704 sigma_gamma1: 0.422320 beta0: 0.696 beta1:0.287 eps:0.004054
## Iter: 17 Qf: -3096.123 sigma_gamma1: 0.415256 beta0: 0.679 beta1:0.286 eps:0.000833
## Iter: 18 Qf: -3075.318 sigma_gamma1: 0.409183 beta0: 0.667 beta1:0.286 eps:0.006720
## Iter: 19 Qf: -3077.873 sigma_gamma1: 0.401676 beta0: 0.661 beta1:0.266 eps:0.000831
## Iter: 20 Qf: -3106.973 sigma_gamma1: 0.414667 beta0: 0.651 beta1:0.271 eps:0.009455
## Iter: 21 Qf: -3101.050 sigma_gamma1: 0.423399 beta0: 0.647 beta1:0.267 eps:0.001906
## Iter: 22 Qf: -3099.750 sigma_gamma1: 0.422670 beta0: 0.635 beta1:0.263 eps:0.000419
## Iter: 23 Qf: -3060.993 sigma_gamma1: 0.412330 beta0: 0.633 beta1:0.277 eps:0.012503
## Iter: 24 Qf: -3048.865 sigma_gamma1: 0.396418 beta0: 0.636 beta1:0.274 eps:0.003962
## Warning: Iteration limit reached without convergence
## Iter: 25 Qf: -3057.023 sigma_gamma1: 0.401207 beta0: 0.634 beta1:0.270 eps:0.002676
  end = Sys.time()
  print(end - start)
## Time difference of 30.79622 mins

For the sake of illustration/speed we fixed \(M\) to be relatively small and terminated the algorithm after 25 iterations. In reality we may need a much larger M and more EM iterations to reach convergence.

Let's take a look at the final acceptance rates for each subject's chain.

par(mfrow = c(2,2))
hist(estep$ar)
plot(gamma[1:M,1], type = 'l')
acf(gamma[1:M,1])

This does not look that great; many of the acceptance rates pertaining to each observation's chain are below 0.1. Ideally, this would be closer to 0.23 in the multivariate setting.

Let's try rerunning this with MwG and compare the performance. All we will be doing here is swapping out the sampler and leaving everything else the same. The code for the MwG sampler is below. We update \(g\) to reflect the random walk sampler, and recall that \(g(\boldsymbol{x}^*|\boldsymbol{x}^{(t)}) = h(\boldsymbol{x}^*-\boldsymbol{x}^{(t)})\). If \(h\) is symmetric, which is clearly the case for a \(N(0,1)\) density, then terms related to \(g\) will cancel out in the MH ratio due to symmetry of the proposal distribution, as \(h(\boldsymbol{x}^*-\boldsymbol{x}^{(t)}) = h(\boldsymbol{x}^{(t)}-\boldsymbol{x}^*)\).

# rw density
h.sim = function(){
  rnorm(1)
}

# new MH ratio function (proposal density cancels out due to symmetry)
R = function(xt,x, f, yi, Xi, betat, Sigma_gammat){
  # log numerator - log denominator
  logR = f(x, yi, Xi, betat, Sigma_gammat) - f(xt,yi, Xi, betat, Sigma_gammat)
  R = exp(logR)
  return(R)
}


mwg.rw.sampler = function(yi, Xi, betat, Sigma_gammat, M, prev.gamma.i = NULL){

  # get dimension of gammai
  q = ncol(Sigma_gammat)
  
  # initialize the chain vector
  x.indep.chain = matrix(0, M, q)
  
  if(is.null(prev.gamma.i)){
    # Simulate initial draw from original proposal density g
    x.indep.chain[1,] = g.sim(Sigma_gammat)
  }else{
    # if last value from previous chain avail, start there
    x.indep.chain[1,] = prev.gamma.i    
  }
  
  # now start chain
  accept = rep(0,q)
  for(i in 1:(M-1)){
    
    # set the value at current iteration of the chain to variable xt
    xt = x.indep.chain[i,]
      
    #looping over chains drawing univariate proposal
    # conditions unknown here so performing MH at each gibbs step
    for(j in 1:q){  
      
      # set proposal equal to previous
      x = xt 
      
      # only update the jth component
      x[j] = x[j] + h.sim()
      
      # calculate MH ratio
      r = min(
              R(xt, x, f, yi, Xi, betat, Sigma_gammat),
              1
            )
      
      # Generate draw from bernoulli(p).
      # Alternatively, can directly compare ratio to 
      # a U(0,1) draw as we did with Rejection Sampling
      keep = rbinom(1, 1, r)
      
      if(keep == 1){
        # if keep = 1, then set next iteration equal to the proposal
        x.indep.chain[i+1,] = x
        
        # reset xt
        xt = x
        
        #  update number of acceptances
        accept[j] = accept[j] + 1
      }else{
        # otherwise, carry over value from the current iteration
        x.indep.chain[i+1,] = xt
      }
    }
    # end of a single gibbs cycle
  }
  # end chain 
  
  return(list(gammai = x.indep.chain, ar = accept/M))
}

Now let's see how this impacts performance. We simply swap out the sampler that we use in the E-step function and rerun the same code.

## set initial parameters
  tol = 10^-5
  iter = 0
  eps = Inf
  qfunction = -10000 # using Qfunction for convergence
  prev.gamma = NULL
  
## starting values
  beta = as.vector(glm(y ~ X-1, family = poisson())$coef)
  Sigma_gamma =  diag(rep(1, 5))

start = Sys.time()
while(eps > tol & iter < maxit){
  
  ## save old qfunction
  qfunction0 = qfunction
  
  ## obtain last chain value for each obs if iter > 0
  if(iter > 0){
    prev.gamma = gamma[seq(M,nrow(gamma), by = M),]
  }
  
  ## E-step
  estep = e.step(y = y, X = X, ID = ID, betat = beta, Sigma_gammat = Sigma_gamma, M = M, n = n, ni = ni, sampler = mwg.rw.sampler, prev.gamma = prev.gamma)
  gamma = estep$gamma
  qfunction = estep$Qfunction
  offset = estep$offset
  yaug = estep$yaug
  Xaug = estep$Xaug
  
  ## Calculate relative change in qfunction from prior iteration
  eps  = abs(qfunction-qfunction0)/abs(qfunction0)
    
  ## Start M-step
    
  # Sigma_gamma: MLE of the random effect covariance matrix (mean-zero MVN),
  # averaged over M; closed form derived from the Q-function approximation
  Sigma_gamma = t(gamma) %*% gamma/(n*M)

  aug = rep(1:n,each = M)
  fit = glm(yaug ~ Xaug -1, 
              family = poisson(), 
              weights = rep(1/M, nrow(Xaug)), 
              offset = offset,
              # use starting value from previous step
              start = beta
  )
  beta = as.vector(fit$coefficients)
    
    ## update iterator
    iter = iter + 1
    if(iter == maxit) warning("Iteration limit reached without convergence")
    
    ## print out info to keep track
    cat(sprintf("Iter: %d Qf: %.3f sigma_gamma1: %f beta0: %.3f beta1:%.3f eps:%f\n",iter, qfunction,diag(Sigma_gamma)[1], beta[1],beta[2], eps))
  }
## Iter: 1 Qf: -4120.814 sigma_gamma1: 0.829736 beta0: 1.015 beta1:0.459 eps:0.587919
## Iter: 2 Qf: -3911.302 sigma_gamma1: 0.717514 beta0: 0.978 beta1:0.456 eps:0.050842
## Iter: 3 Qf: -3758.658 sigma_gamma1: 0.632632 beta0: 0.946 beta1:0.446 eps:0.039026
## Iter: 4 Qf: -3638.550 sigma_gamma1: 0.578422 beta0: 0.916 beta1:0.432 eps:0.031955
## Iter: 5 Qf: -3533.634 sigma_gamma1: 0.545558 beta0: 0.883 beta1:0.420 eps:0.028835
## Iter: 6 Qf: -3452.523 sigma_gamma1: 0.498365 beta0: 0.859 beta1:0.403 eps:0.022954
## Iter: 7 Qf: -3392.279 sigma_gamma1: 0.478768 beta0: 0.833 beta1:0.393 eps:0.017449
## Iter: 8 Qf: -3343.845 sigma_gamma1: 0.467761 beta0: 0.810 beta1:0.381 eps:0.014278
## Iter: 9 Qf: -3302.542 sigma_gamma1: 0.454428 beta0: 0.783 beta1:0.375 eps:0.012352
## Iter: 10 Qf: -3262.480 sigma_gamma1: 0.445908 beta0: 0.763 beta1:0.362 eps:0.012131
## Iter: 11 Qf: -3232.788 sigma_gamma1: 0.441854 beta0: 0.744 beta1:0.353 eps:0.009101
## Iter: 12 Qf: -3204.194 sigma_gamma1: 0.427638 beta0: 0.725 beta1:0.348 eps:0.008845
## Iter: 13 Qf: -3181.673 sigma_gamma1: 0.423595 beta0: 0.710 beta1:0.335 eps:0.007028
## Iter: 14 Qf: -3157.246 sigma_gamma1: 0.421116 beta0: 0.698 beta1:0.325 eps:0.007678
## Iter: 15 Qf: -3142.854 sigma_gamma1: 0.425126 beta0: 0.685 beta1:0.318 eps:0.004558
## Iter: 16 Qf: -3133.293 sigma_gamma1: 0.427100 beta0: 0.675 beta1:0.313 eps:0.003042
## Iter: 17 Qf: -3114.247 sigma_gamma1: 0.422999 beta0: 0.665 beta1:0.316 eps:0.006078
## Iter: 18 Qf: -3109.811 sigma_gamma1: 0.420735 beta0: 0.656 beta1:0.312 eps:0.001425
## Iter: 19 Qf: -3096.562 sigma_gamma1: 0.416631 beta0: 0.651 beta1:0.306 eps:0.004260
## Iter: 20 Qf: -3087.211 sigma_gamma1: 0.410981 beta0: 0.647 beta1:0.300 eps:0.003020
## Iter: 21 Qf: -3076.499 sigma_gamma1: 0.407918 beta0: 0.643 beta1:0.293 eps:0.003470
## Iter: 22 Qf: -3067.363 sigma_gamma1: 0.413776 beta0: 0.637 beta1:0.291 eps:0.002969
## Iter: 23 Qf: -3064.289 sigma_gamma1: 0.405515 beta0: 0.632 beta1:0.293 eps:0.001002
## Iter: 24 Qf: -3062.285 sigma_gamma1: 0.416573 beta0: 0.627 beta1:0.291 eps:0.000654
## Warning: Iteration limit reached without convergence
## Iter: 25 Qf: -3062.491 sigma_gamma1: 0.415327 beta0: 0.620 beta1:0.291 eps:0.000067
  end = Sys.time()
  print(end - start)
## Time difference of 52.89467 mins

Again for the sake of illustration/speed, we did not increase M to facilitate convergence; a much larger value is needed to do so. Now let's compare these results with glmer.

# glmer
summary(fit.glmer)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: y ~ x1 + x2 + x3 + x4 + (x1 + x2 + x3 + x4 | subject)
##    Data: alz
## 
##      AIC      BIC   logLik deviance df.resid 
##   2072.4   2156.7  -1016.2   2032.4      480 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8114 -0.7263 -0.1314  0.4254  2.6370 
## 
## Random effects:
##  Groups  Name        Variance Std.Dev. Corr                   
##  subject (Intercept) 0.4148   0.6441                          
##          x1          0.6348   0.7967   -0.11                  
##          x2          0.2217   0.4709   -0.02 -0.52            
##          x3          0.3575   0.5979   -0.22  0.07  0.71      
##          x4          0.4946   0.7033   -0.22 -0.45  0.19  0.30
## Number of obs: 500, groups:  subject, 100
## 
## Fixed effects:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 0.6040093  0.0007647   789.9   <2e-16 ***
## x1          0.2727996  0.0007642   357.0   <2e-16 ***
## x2          0.3664818  0.0007644   479.5   <2e-16 ***
## x3          0.1806335  0.0007641   236.4   <2e-16 ***
## x4          0.3050672  0.0007893   386.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##    (Intr) x1    x2    x3   
## x1 0.000                   
## x2 0.000  0.000            
## x3 0.000  0.000 0.000      
## x4 0.000  0.000 0.000 0.000
## convergence code: 0
## Model failed to converge with max|grad| = 0.0958471 (tol = 0.001, component 1)
## Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?

Compare these to ours:

# MCEM fixed effects
print(beta)
## [1] 0.6202312 0.2914659 0.3703284 0.1792329 0.2946418
# MCEM RE variance estimates (diagonal of the covariance matrix)
print(diag(Sigma_gamma))
## [1] 0.4153266 0.6158234 0.2664291 0.3508778 0.5304137

Now let's take a look at the acceptance rates.

head(estep$ar)
##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,] 0.397 0.509 0.448 0.453 0.510
## [2,] 0.276 0.471 0.348 0.289 0.456
## [3,] 0.309 0.393 0.378 0.418 0.517
## [4,] 0.203 0.388 0.394 0.280 0.350
## [5,] 0.334 0.471 0.433 0.448 0.371
## [6,] 0.339 0.453 0.366 0.417 0.425
par(mfrow = c(2,2))
hist(estep$ar[,1], xlim = c(0,1));abline(v = 0.44);
hist(estep$ar[,2], xlim = c(0,1));abline(v = 0.44);
hist(estep$ar[,3], xlim = c(0,1));abline(v = 0.44);
hist(estep$ar[,4], xlim = c(0,1));abline(v = 0.44);

The acceptance rates are better than the first time around, and we see that they are in some cases close to the optimal value of 0.44 for univariate proposals. However, this is not the case across all subjects, nor across the various components of \(\boldsymbol{\gamma}_i\). We can, however, improve on this using adaptive methods, discussed in the next section.

Before doing so, let's look at some diagnostics.

par(mfrow = c(1,2))
plot(gamma[1:M,1], type = 'l')
acf(gamma[1:M,1])

2.3 Adaptive Metropolis-within-Gibbs

When we monitor the acceptance rate of the MH normal RW sampler in each Gibbs step, we can see that the acceptance rate is not in the optimal range in some cases. To address this issue, we can tune the variance of the proposal distribution within each Gibbs MH step so that we can get the acceptance rates closer to the region that we desire. To do this, we will add what is called an adaptation step to the MwG algorithm described above. Let us assume that we perform this adaptation step only at specific times \(t \in \{50, 100, 150, \ldots\}\), which we denote as “batch times” \(T_b\), \(b = 0,1,\ldots\); here \(T_1 = 50\), for example. At each batch time we tune the variance of the proposal given the acceptance rate observed over the iterations in that batch. We can increase the spacing between batch times given the mixing performance of the chain and the specific problem at hand, but 50 is a typical choice.

Then, we can perform AMwG using the following approach:

  1. Select starting values \(\boldsymbol{x}^{(0)}\), and set \(t = 0\). Set a batching schedule \(\{T_b\}\) for \(b = 0, 1, \ldots\) and set \(b = 0\). Let's initialize \(\sigma^2_{0} = 1\).
  2. MwG update: Obtain \(X_1^{(t+1)}\) via MH random walk update
    1. Generate proposal \(X_1^{*} = x_1^{(t)} + \epsilon\), where \(\epsilon \sim N(0,\sigma_b^2)\)
    2. Compute the MH ratio \[R({x}^{(t)},{X}^{*}) = \frac{f(X_1^{*})}{f(x_1^{(t)})}\]
    3. Then we sample a value for \(X_1^{(t+1)}\) by choosing
    \[\begin{align} {X}^{(t+1)} &= \begin{cases} {X}^{*} \text{ with probability min} (R({x}^{(t)},{X}^{*}),1)\\ {x}^{(t)} \text{ otherwise.} \end{cases} \end{align}\]
  3. Then we perform the Gibbs updates for the remaining parameters that we have closed form univariate conditionals for, where we generate in turn

\[\begin{align} X_2^{(t+1)} &| \cdot \sim f(x_2 | x_1^{(t+1)},x_3^{(t)},\ldots,x_p^{(t)}),\\ X_3^{(t+1)} &| \cdot \sim f(x_3 | x_1^{(t+1)},x_2^{(t+1)},x_4^{(t)},\ldots,x_p^{(t)}),\\ &\ \ \vdots\\ X_p^{(t+1)} &| \cdot \sim f(x_p | x_1^{(t+1)},\ldots,x_{p-1}^{(t+1)}). \end{align}\]
  4. Adaptation step: when \(t = T_{b+1}\),
    1. Update the variance of the proposal distribution via \[\log(\sigma_{b+1}) = \log(\sigma_{b}) \pm \delta_{(b+1)},\] where \(\delta_{(b+1)}\) is called the adaptation factor. We subtract \(\delta_{(b+1)}\) when the acceptance rate observed during the batch in step 2(c) above is smaller than 0.44 (the ideal rate for univariate normally distributed target and proposal distributions) and add it when the rate is larger than 0.44. In this manner we are able to automatically tune the variance of the proposal to keep the acceptance rate in the range that we would like. A common choice is \(\delta_{(b+1)} = \min(0.01, 1/\sqrt{T_b})\), where 0.01 is an arbitrary cap that limits the amount of adaptation for smaller values of \(T_b\).
    2. Set \(b = b + 1\).
  5. Increment \(t\) and return to step 2.

As discussed in the last lecture, we still need to check convergence criteria pertaining to the two conditions we laid out earlier. For AMwG, this simplifies to checking whether \(\delta_{(b+1)} \rightarrow 0\) as \(b \rightarrow \infty\) to satisfy the diminishing adaptation condition. The bounded convergence condition is met if \(\log(\sigma_b) \in [-Q, Q]\) for some \(Q<\infty\). As we can see, these conditions are relatively easy to check. The approach discussed above extends to almost any random walk distribution \(h\), with the caveat that the MH ratio would have to include the proposal distributions if \(h\) is not symmetric.
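To connect these checks to code, below is a hypothetical batch-update helper (written purely for illustration; not part of the samplers in this lecture) that enforces bounded convergence by clipping \(\log(\sigma_b)\) to a finite interval \([-Q, Q]\). With an adaptation factor that shrinks to zero, both conditions are then satisfied by construction:

## hypothetical batch update for one proposal standard deviation
update.prop.sd = function(prop.sd.j, delta.b, batch.ar, target = 0.44, Q = 5) {
  # widen if accepting too often, narrow if accepting too rarely
  log.sd = log(prop.sd.j) + ifelse(batch.ar > target, delta.b, -delta.b)
  
  # clip so that log(sigma_b) always stays in [-Q, Q], Q finite
  log.sd = max(min(log.sd, Q), -Q)
  
  exp(log.sd)
}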

In the example we gave above, we had only one parameter whose conditional was not known in closed form. As one can imagine, the AMwG algorithm will be particularly useful in cases where the dimension of the parameter vector is large. In cases where multiple parameters have unknown conditionals, it may be helpful to have separate adaptation steps for each parameter, each with its own adaptive proposal distribution variance. In this manner, we can tune the proposal in each MwG step separately from the others. As one can imagine, some parameters may have very different acceptance rates than others.

Let us now apply this to the problem at hand.

# rw density
h.sim = function(sd = 1){
  rnorm(1, mean = 0, sd = sd)
}

adaptive.mwg.rw.sampler = function(yi, Xi, betat, Sigma_gammat, M, prev.gamma.i = NULL, b = 50){

  # get dimension of gammai
  q = ncol(Sigma_gammat)
  
  # create the chain vector
  x.indep.chain = matrix(0, M, q)
      
  # initialize proposal variance
  prop.sd = rep(1, q)
  
  # initialize chain
  if (is.null(prev.gamma.i)) {
    # Simulate initial draw from original proposal density g
    x.indep.chain[1, ] = g.sim(Sigma_gammat)
  } else{
    # if last value from previous chain avail, start there
    x.indep.chain[1, ] = prev.gamma.i
  }
  
  #initialize batch index
  batch = 0
  
  # vector to keep track of random effect-specific overall acceptances
  accept = rep(0,q)
  
  # vector to  keep track of random effect-specific acceptances in a given batch
  b.accept = rep(0,q)
  
  # now start chain
  for(i in 1:(M-1)){
    
    # set the value at current iteration of the chain to variable xt
    xt = x.indep.chain[i,]
      
    # looping over each element of gammai, drawing univariate proposal for each
    # conditional dist unknown here so performing MH at each gibbs step
    for(j in 1:q){  
      
      # set proposal equal to previous first
      x = xt 
      
      # then only update the jth component by adding epsilon
      # variance of h distribution varies wrt to j
      x[j] = x[j] + h.sim(sd = prop.sd[j])
      
      # calculate MH ratio
      r = min(
              R(xt, x, f, yi, Xi, betat, Sigma_gammat),
              1
            )
      
      # Generate draw from bernoulli(p).
      # Alternatively, can directly compare ratio to 
      # a U(0,1) draw as we did with Rejection Sampling
      keep = rbinom(1, 1, r)
      
      if(keep == 1){
        
        # if keep = 1, then set next iteration equal to then proposal
        x.indep.chain[i+1,] = x
        
        # reset xt
        xt = x
        
        #  update number of acceptances
        accept[j] = accept[j] + 1
        b.accept[j] = b.accept[j] + 1 
        
      }else{
        
        # otherwise, carry over value from the current iteration
        x.indep.chain[i+1,] = xt
        
      }
    }
    # end of a single gibbs cycle
    
    # if at end of batch...
    if(floor(i/b) == ceiling(i/b)){
      
      # calculate increment for proposal variance
      delta.b = min(0.1, 1/sqrt(i))
      
      # loop over proposal density variance vector
      for(j in 1:q){
        
        if(b.accept[j]/b > 0.44){
          
          # if greater, add to sd
          prop.sd[j] = exp( log(prop.sd[j]) + delta.b )     
          
        }else if(b.accept[j]/b < 0.44){
          
          # otherwise, subtract
          prop.sd[j] = exp( log(prop.sd[j]) - delta.b )   
          
        }
      }
      
      # reset batch counter
      b.accept = rep(0, q)
      
      # increment batch index
      batch = batch + 1
      
    } 
    
  }
  # end chain 
  
  return(list(gammai = x.indep.chain, ar = accept/M, prop.sd = prop.sd))
}

Now, let's swap out the sampler and rerun.

## set initial parameters
  tol = 10 ^ -5
  iter = 0
  eps = Inf
  qfunction = -10000 # using Qfunction for convergence
  prev.gamma = NULL

## starting values
  beta = as.vector(glm(y ~ X - 1, family = poisson())$coef)
  Sigma_gamma =  diag(rep(1, 5))


start = Sys.time()
while (eps > tol & iter < maxit) {
  ## save old qfunction
  qfunction0 = qfunction
  
  ## obtain last chain value for each obs if iter > 0
  if (iter > 0) {
    prev.gamma = gamma[seq(M, nrow(gamma), by = M), ]
  }
  
  ## E-step
  estep = e.step(
    y = y,
    X = X,
    ID = ID,
    betat = beta,
    Sigma_gammat = Sigma_gamma,
    M = M,
    n = n,
    ni = ni,
    sampler = adaptive.mwg.rw.sampler, ###
    prev.gamma = prev.gamma
  )
  
  gamma = estep$gamma
  qfunction = estep$Qfunction
  offset = estep$offset
  yaug = estep$yaug
  Xaug = estep$Xaug
  
  ## Calculate relative change in qfunction from prior iteration
  eps  = abs(qfunction - qfunction0) / abs(qfunction0)
  
  ## Start M-step
  
  # Sigma_gamma: MLE of the random effect covariance matrix (mean-zero MVN),
  # averaged over M; closed form derived from the Q-function approximation
  
  Sigma_gamma = t(gamma) %*% gamma / (n * M)
  
  # update beta with glm
  aug = rep(1:n, each = M)
  fit = glm(
    yaug ~ Xaug - 1,
    family = poisson(),
    weights = rep(1 / M, nrow(Xaug)),
    offset = offset,
    # use starting value from previous step
    start = beta
  )
  beta = as.vector(fit$coefficients)
  
  ## update iterator
  iter = iter + 1
  if (iter == maxit)
    warning("Iteration limit reached without convergence")
  
  ## print out info to keep track
  cat(
    sprintf(
      "Iter: %d Qf: %.3f sigma_gamma1: %f beta0: %.3f beta1:%.3f eps:%f\n",
      iter,
      qfunction,
      diag(Sigma_gamma)[1],
      beta[1],
      beta[2],
      eps
    )
  )
}
## Iter: 1 Qf: -4133.076 sigma_gamma1: 0.820450 beta0: 1.015 beta1:0.455 eps:0.586692
## Iter: 2 Qf: -3943.452 sigma_gamma1: 0.731146 beta0: 0.982 beta1:0.446 eps:0.045880
## Iter: 3 Qf: -3777.995 sigma_gamma1: 0.649537 beta0: 0.950 beta1:0.431 eps:0.041957
## Iter: 4 Qf: -3658.157 sigma_gamma1: 0.595136 beta0: 0.919 beta1:0.420 eps:0.031720
## Iter: 5 Qf: -3547.671 sigma_gamma1: 0.551097 beta0: 0.891 beta1:0.403 eps:0.030203
## Iter: 6 Qf: -3473.617 sigma_gamma1: 0.518590 beta0: 0.862 beta1:0.391 eps:0.020874
## Iter: 7 Qf: -3406.961 sigma_gamma1: 0.490555 beta0: 0.835 beta1:0.380 eps:0.019189
## Iter: 8 Qf: -3339.156 sigma_gamma1: 0.473010 beta0: 0.808 beta1:0.373 eps:0.019902
## Iter: 9 Qf: -3300.098 sigma_gamma1: 0.459943 beta0: 0.783 beta1:0.363 eps:0.011697
## Iter: 10 Qf: -3263.761 sigma_gamma1: 0.446269 beta0: 0.759 beta1:0.351 eps:0.011011
## Iter: 11 Qf: -3246.697 sigma_gamma1: 0.440827 beta0: 0.740 beta1:0.339 eps:0.005228
## Iter: 12 Qf: -3215.484 sigma_gamma1: 0.425084 beta0: 0.722 beta1:0.336 eps:0.009614
## Iter: 13 Qf: -3191.880 sigma_gamma1: 0.420477 beta0: 0.708 beta1:0.327 eps:0.007341
## Iter: 14 Qf: -3172.101 sigma_gamma1: 0.419183 beta0: 0.698 beta1:0.323 eps:0.006197
## Iter: 15 Qf: -3161.332 sigma_gamma1: 0.417768 beta0: 0.688 beta1:0.314 eps:0.003395
## Iter: 16 Qf: -3154.950 sigma_gamma1: 0.419357 beta0: 0.678 beta1:0.316 eps:0.002019
## Iter: 17 Qf: -3137.221 sigma_gamma1: 0.416644 beta0: 0.669 beta1:0.306 eps:0.005619
## Iter: 18 Qf: -3116.756 sigma_gamma1: 0.423908 beta0: 0.658 beta1:0.305 eps:0.006523
## Iter: 19 Qf: -3099.569 sigma_gamma1: 0.415192 beta0: 0.648 beta1:0.298 eps:0.005514
## Iter: 20 Qf: -3098.398 sigma_gamma1: 0.418573 beta0: 0.643 beta1:0.296 eps:0.000378
## Iter: 21 Qf: -3093.176 sigma_gamma1: 0.421327 beta0: 0.637 beta1:0.298 eps:0.001686
## Iter: 22 Qf: -3075.079 sigma_gamma1: 0.423928 beta0: 0.635 beta1:0.295 eps:0.005850
## Iter: 23 Qf: -3074.436 sigma_gamma1: 0.420895 beta0: 0.631 beta1:0.294 eps:0.000209
## Iter: 24 Qf: -3060.434 sigma_gamma1: 0.421863 beta0: 0.626 beta1:0.292 eps:0.004554
## Warning: Iteration limit reached without convergence
## Iter: 25 Qf: -3046.928 sigma_gamma1: 0.410690 beta0: 0.623 beta1:0.287 eps:0.004413
end = Sys.time()
print(end - start)
## Time difference of 49.45305 mins

Now let's take a look at the acceptance rates.

head(estep$ar)
##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,] 0.427 0.468 0.439 0.452 0.458
## [2,] 0.388 0.440 0.407 0.377 0.437
## [3,] 0.405 0.425 0.412 0.413 0.457
## [4,] 0.348 0.453 0.440 0.389 0.421
## [5,] 0.409 0.437 0.460 0.436 0.427
## [6,] 0.415 0.456 0.416 0.423 0.431
par(mfrow = c(2,2))
hist(estep$ar[,1], xlim = c(0,1));abline(v = 0.44);
hist(estep$ar[,2], xlim = c(0,1));abline(v = 0.44);
hist(estep$ar[,3], xlim = c(0,1));abline(v = 0.44);
hist(estep$ar[,4], xlim = c(0,1));abline(v = 0.44);

The acceptance rates are better than the first time around, and we see that they are now mostly close to the optimal value of 0.44 for univariate proposals.
Finally, let's look at some traceplots.

par(mfrow = c(1,2))
plot(gamma[1:M,1], type = 'l')
acf(gamma[1:M,1])

2.4 Adaptive Metropolis Algorithm

In our previous discussion of the Metropolis algorithm, we mentioned that a good proposal distribution should draw samples that cover the support of the stationary distribution in a reasonable number of iterations, and also produces candidate values that are not accepted or rejected too frequently.

The Adaptive Metropolis (AM) algorithm is a variant of the Metropolis algorithm where one instead updates the variance of the proposal distribution as the algorithm runs. Similar to the previous section, the goal of this update is to ensure that the acceptance rate of the sampler stays in a suitable range. We can think of AM as a one-step RW Metropolis algorithm with a normal proposal distribution, where the variance of this proposal distribution is updated using the previous iterations of the chain.

Let's consider a RW MH sampler where the proposal density is \(N(\boldsymbol{X}^{(t)}, \Sigma)\). Clearly we assume here that \(\boldsymbol{X}^{(t)}\) is multivariate, and therefore we have a multivariate proposal distribution. For the AM sampler, at each iteration of the chain a candidate value \(\boldsymbol{X}^{*}\) is sampled from a proposal distribution \(N(\boldsymbol{X}^{(t)}, \lambda\Sigma^{(t)})\), where the goal is to tune the covariance matrix \(\Sigma^{(t)}\) to improve the acceptance rate.

For a \(p\)-dimensional spherical multivariate normal target distribution where \(\Sigma_\pi\) is the true covariance matrix of the target distribution, the proposal covariance \((2.38^2/p)\Sigma_\pi\) has been shown to be optimal, with an optimal acceptance rate of 44% when \(p = 1\) that decreases toward 23% as \(p\) increases.

Thus, in one version of the AM algorithm, \(\lambda\) is set to \(2.38^2/p\). Since \(\Sigma_\pi\) is unknown, it is estimated from previous iterations of the chain. An adaptation parameter \(\gamma^{(t)}\) is used to blend \(\Sigma^{(t)}\) and \(\Sigma^{(t+1)}\) in such a way that the diminishing adaptation condition is upheld. A parameter \(\mu^{(t)}\), also estimated adaptively, is used in estimating the covariance matrix \(\Sigma^{(t+1)}\).

This adaptive Metropolis algorithm begins at \(t = 0\) with the selection of \(\boldsymbol{x}^{(0)} = \boldsymbol{X}^{(0)}\) drawn at random from some starting distribution \(g\). Similarly, initialize \(\mu^{(0)}\) and \(\Sigma^{(0)}\). Common choices are \(\mu^{(0)} = 0\) and \(\Sigma^{(0)} = \boldsymbol{I}\). Given \(\boldsymbol{X}^{(t)} = \boldsymbol{x}^{(t)}\), \(\mu^{(t)}\), and \(\Sigma^{(t)}\), the algorithm generates \(\boldsymbol{X}^{(t+1)}\) as follows:

  1. Sample a candidate value \(\boldsymbol{X}^{*}\) from the proposal distribution \(N(\boldsymbol{X}^{(t)}, \lambda\Sigma^{(t)})\), where \(\lambda = 2.38^2/p\).
  2. Then we sample a value for \(\boldsymbol{X}^{(t+1)}\) by choosing

\[\begin{align} \boldsymbol{X}^{(t+1)} &= \begin{cases} \boldsymbol{X}^{*} \text{ with probability min} (R(\boldsymbol{x}^{(t)},\boldsymbol{X}^{*}),1)\\ \boldsymbol{x}^{(t)} \text{ otherwise.} \end{cases} \end{align}\]
  3. Adaptation step: Update the proposal distribution variance in two steps: \[\begin{align} \mu^{(t+1)} &= \mu^{(t)} + \gamma^{(t+1)}(\boldsymbol{X}^{(t+1)} -\mu^{(t)}),\\ \Sigma^{(t+1)} &= \Sigma^{(t)} + \gamma^{(t+1)}\left[(\boldsymbol{X}^{(t+1)} -\mu^{(t)})(\boldsymbol{X}^{(t+1)} -\mu^{(t)})^T - \Sigma^{(t)}\right]. \end{align}\]

Here \(\gamma^{(t+1)}\) is an adaptation parameter with values chosen by the user. For example, \(\gamma^{(t+1)} = 1/(t + 1)\) is a reasonable choice.

  4. Increment \(t\) and return to step 1.

The updating formula for \(\Sigma^{(t)}\) is constructed so that it is computationally quick to calculate and so that the adaptation diminishes as the number of iterations increases. To uphold the diminishing adaptation condition, it is required that \(\lim_{t\rightarrow \infty} \gamma^{(t)} = 0\). The additional condition that \(\sum_{t = 0}^{\infty}\gamma^{(t)} = \infty\) allows the sequence \(\Sigma^{(t)}\) to move an infinite distance from its initial value. It may make sense to adapt \(\lambda\) as well in step 3, such that \[\log(\lambda^{(t+1)}) = \log(\lambda^{(t)}) + \gamma^{(t+1)}\left(R(\boldsymbol{x}^{(t)},\boldsymbol{X}^{*}) - a\right), \] where \(a\) denotes the target acceptance rate (e.g., 0.234 for higher-dimensional problems).

One drawback of the AM algorithm above is that all components of the entire parameter vector must be accepted or rejected simultaneously. Clearly this is not efficient when the problem is high dimensional. An alternative is a componentwise hybrid Gibbs AM algorithm where each component is given its own scaling parameter \(\lambda^{(t)}\) and is accepted/rejected separately in step 2. In this case, the constant \(a\) is usually set to a higher value, for example \(a = 0.44\), since components are now updated individually. Another variation is to carry out the adaptation in batches instead of at every iteration; with the batching strategy, the adaptation (step 3 in the adaptive Metropolis algorithm) is only implemented at predetermined times.
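To make the algorithm above concrete, here is a minimal sketch of a single AM iteration in R (a hypothetical helper written for illustration, not code from the lecture; the function name am.update and the log-target argument logf are assumptions):

library(mvtnorm)

am.update = function(xt, mu, Sigma, logf, t) {
  p = length(xt)
  lambda = 2.38^2 / p
  
  # step 1: propose from N(xt, lambda * Sigma)
  xstar = as.vector(rmvnorm(1, mean = xt, sigma = lambda * Sigma))
  
  # step 2: accept/reject; the proposal is symmetric, so the MH ratio is a
  # ratio of target densities
  logR = logf(xstar) - logf(xt)
  xnew = if (log(runif(1)) < min(logR, 0)) xstar else xt
  
  # step 3: adaptation with gamma^(t+1) = 1/(t+1); note that both updates use
  # the old mu^(t), matching the formulas above
  gam = 1 / (t + 1)
  mu.new = mu + gam * (xnew - mu)
  Sigma.new = Sigma + gam * (tcrossprod(xnew - mu) - Sigma)
  # in practice a small ridge (e.g. 1e-6 * diag(p)) is often added to keep
  # Sigma.new positive definite
  
  list(x = xnew, mu = mu.new, Sigma = Sigma.new)
}

# e.g., a few iterations targeting a standard bivariate normal
logf = function(x) sum(dnorm(x, log = TRUE))
state = list(x = c(0, 0), mu = rep(0, 2), Sigma = diag(2))
for (t in 1:100) state = am.update(state$x, state$mu, state$Sigma, logf, t)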

3 Reversible Jump MCMC

So far we have discussed approaches to construct Markov chains where, for sufficiently large \(t\), the realizations from the chain are approximately distributed according to the target distribution \(f\). In each of these cases we assumed that the dimension of \(\boldsymbol{X}^{(t)}\) was fixed for each \(t\). In some situations, it may be of interest to allow the dimension of \(\boldsymbol{X}^{(t)}\) to change from one iteration to the next. This is of interest in situations such as Bayesian model selection problems, where one is considering many possible candidate models, each with parameter spaces of different dimensions. But how does one implement such an approach?

Reversible Jump MCMC (RJMCMC), introduced by Peter Green in 1995, is one such method, allowing for “transdimensional Markov chain Monte Carlo simulation”. Consider constructing a Markov chain to explore a space of candidate models, each of which might be used to fit observed data \(\boldsymbol{y}\). Let \(\mathcal{M}_1,\ldots,\mathcal{M}_K\) denote the countable set of models being evaluated, where \(\boldsymbol{\theta}_m\) denotes the parameters of the \(m\)th model and \(p_m\) the number of parameters in the \(m\)th model. In such a setting we can think of our set of parameters as \(\boldsymbol{X} = (M, \boldsymbol{\theta}_M)\), which indicates the model and the set of parameters pertaining to that model. In the prior Bayesian examples that we considered, \(M\) was essentially fixed and not considered, and our inference mainly concerned just \(\boldsymbol{\theta}_M\).

As with our other Bayesian examples, we can specify prior distributions on each of these parameters and then seek to simulate from their posterior distribution using MCMC methods. Here, the realization at step \(t\) of the chain is \(\boldsymbol{X}^{(t)} = (M^{(t)}, \boldsymbol{\theta}^{(t)}_{M^{(t)}})\), where \(\boldsymbol{\theta}^{(t)}_{M^{(t)}}\), the parameters drawn for the model indexed by \(M^{(t)}\), has dimension \(p_{M^{(t)}}\).

The target density here is now the joint posterior density \[f(m, \boldsymbol{\theta}_m|\boldsymbol{y}) \propto f(\boldsymbol{y}|m,\boldsymbol{\theta}_m)f(\boldsymbol{\theta}_m|m) f(m),\] where \(f(\boldsymbol{y}|m, \boldsymbol{\theta}_m)\) is the density of the observed data conditional on the \(m\)th model and its parameters, \(f(\boldsymbol{\theta}_m|m)\) denotes the prior density of the parameters in the \(m\)th model, and \(f(m)\) denotes the prior probability of the \(m\)th model. The prior weights \(f(m)\) may be assigned such that \(\sum_{m = 1}^{K} f(m) = 1\).

We can also factor the posterior into the following form, suggesting two avenues of inference that we can pursue in this setting \[f(m, \boldsymbol{\theta}_m|\boldsymbol{y}) = f(m|\boldsymbol{y}) f(\boldsymbol{\theta}_m|m, \boldsymbol{y}).\]

Here, \(f(m|\boldsymbol{y})\) is the posterior probability for the \(m\)th model given the data, relative to all models being considered. Also, \(f(\boldsymbol{\theta}_m|m, \boldsymbol{y})\) is the posterior pertaining to the parameters in the \(m\)th model given the data.

What is unique about RJMCMC is that it allows for the construction of a Markov chain for \(\boldsymbol{X}\) that can jump between models, each with its own parameter space and dimension. This is what makes the approach fundamentally different from the methods we have described so far. However, like the prior approaches, RJMCMC starts with a proposed step from the current value \(\boldsymbol{x}^{(t)}\) to \(\boldsymbol{X}^{*}\), and then decides whether to accept the proposal or carry over the prior value. Suppose the chain is constructed such that \[f(m_1, \boldsymbol{\theta}_{m_1}|\boldsymbol{y})a(m_2, \boldsymbol{\theta}_{m_2}|m_1, \boldsymbol{\theta}_{m_1} , \boldsymbol{y}) = f(m_2, \boldsymbol{\theta}_{m_2}|\boldsymbol{y})a(m_1, \boldsymbol{\theta}_{m_1}|m_2, \boldsymbol{\theta}_{m_2} , \boldsymbol{y})\]

for all \(m_1\) and \(m_2\), where \(a(\boldsymbol{x}_2|\boldsymbol{x}_1,\boldsymbol{y})\) denotes the density for the chain moving to state \(\boldsymbol{x}_2 = (m_2, \boldsymbol{\theta}_{m_2})\) at time \(t + 1\), given that it was in state \(\boldsymbol{x}_1 = (m_1, \boldsymbol{\theta}_{m_1})\) at time \(t\). Then the joint posterior \(f\) is a stationary distribution of the chain. Chains that meet this detailed balance condition are termed reversible because the direction of time does not matter in the dynamics of the chain: a chain run backward in time obeys the same probabilistic law as one run forward.
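To see, at least heuristically, why this condition yields \(f\) as a stationary distribution, treat \(a\) as a transition density and sum and integrate both sides over the current state \((m_1, \boldsymbol{\theta}_{m_1})\): \[\sum_{m_1}\int f(m_1, \boldsymbol{\theta}_{m_1}|\boldsymbol{y})\,a(m_2, \boldsymbol{\theta}_{m_2}|m_1, \boldsymbol{\theta}_{m_1}, \boldsymbol{y})\,d\boldsymbol{\theta}_{m_1} = f(m_2, \boldsymbol{\theta}_{m_2}|\boldsymbol{y})\sum_{m_1}\int a(m_1, \boldsymbol{\theta}_{m_1}|m_2, \boldsymbol{\theta}_{m_2}, \boldsymbol{y})\,d\boldsymbol{\theta}_{m_1} = f(m_2, \boldsymbol{\theta}_{m_2}|\boldsymbol{y}).\] That is, if the current state is distributed according to the posterior, then so is the next state.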

But how do we generate proposals in situations where the dimensions at iterations \(t\) and \(t + 1\) are different? We do this through the use of auxiliary random variables, with dimensions chosen so that the augmented variables (consisting of the original \(\boldsymbol{X}\) and the new auxiliary variables) have equal dimensions at \(t\) and \(t + 1\). In doing so, the time-reversibility condition described above can be met by using a suitable acceptance probability, thereby ensuring that the Markov chain converges to the joint posterior for \(\boldsymbol{X}\). We can think of these auxiliary variables as extra variables that we invent to facilitate transitions between parameter spaces of different dimensions in the chain.

To illustrate this, let us consider how one might propose parameters \(\boldsymbol{\theta}_2\) corresponding to a proposed move from a model \(\mathcal{M}_1\) with \(p_1\) parameters to a model \(\mathcal{M}_2\) with \(p_2\) parameters when \(p_2 > p_1\). A simple approach is to generate \(\boldsymbol{\theta}_2\) from \(\boldsymbol{\theta}_1\) and an auxiliary random variable, say \(\boldsymbol{U}_1\), which we assume to be an independent random variable generated separately. We assume that we can write \(\boldsymbol{\theta}_2\) directly as a function of \(\boldsymbol{\theta}_1\) and \(\boldsymbol{U}_1\) through some invertible deterministic function \(q_{1,2}\). We will see why invertibility is important in a bit.

Given these pieces, we can now write \(\boldsymbol{\theta}_2 = q_{1,2}(\boldsymbol{\theta}_1,\boldsymbol{U}_1)\). Since \(q_{1,2}\) is invertible, we can propose parameters for the reverse move via the inverse of this function, where \((\boldsymbol{\theta}_1,\boldsymbol{U}_1) = q^{-1}_{1,2}(\boldsymbol{\theta}_2) = q_{2,1}(\boldsymbol{\theta}_2)\). As with \(q_{1,2}\), here \(q_{2,1}\) is also an entirely deterministic way to propose \(\boldsymbol{\theta}_1\) from \(\boldsymbol{\theta}_2\). So, regardless of direction, through the introduction of \(\boldsymbol{U}_1\) and the choice of \(q_{1,2}\), we can propose \(\boldsymbol{\theta}_2\) given \(\boldsymbol{\theta}_1\) and vice versa, despite the fact that the dimensions of \(\boldsymbol{\theta}_1\) and \(\boldsymbol{\theta}_2\) are unequal.
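As a toy illustration of such a mapping (a sketch with hypothetical names, separate from the worked example later in this section), suppose \(\mathcal{M}_1\) has one parameter and \(\mathcal{M}_2\) has two:

# toy dimension-matching map for a move from a 1-parameter model to a
# 2-parameter model: dimensions satisfy 1 + 1 = 2 + 0
q12.toy = function(theta1, u){ c(theta1, exp(u)) }         # forward: (theta1, u) -> theta2
q21.toy = function(theta2){ c(theta2[1], log(theta2[2])) } # inverse: theta2 -> (theta1, u)

u = rnorm(1)                # auxiliary variable U ~ N(0, 1)
theta2 = q12.toy(0.3, u)    # propose the higher-dimensional parameter
q21.toy(theta2)             # recovers c(0.3, u), confirming invertibility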

Now let's generalize this idea to generate an augmented candidate parameter vector (\(\boldsymbol{\theta}^∗_{M^∗}\) together with auxiliary variables \(\boldsymbol{U}^∗\)), given a proposed move to \(M^∗\) from the current model \(m^{(t)}\). We can apply an invertible deterministic function \(q_{t,∗}\) to \(\boldsymbol{\theta}^{(t)}\) and some auxiliary random variables \(\boldsymbol{U}\) to generate the proposal \[(\boldsymbol{\theta}^∗_{M^*} ,\boldsymbol{U}^∗) = q_{t,∗}(\boldsymbol{\theta}^{(t)},\boldsymbol{U}),\] where \(\boldsymbol{U}\) is generated from a proposal density \(h(\cdot|m^{(t)}, \boldsymbol{\theta}^{(t)},m^∗)\). The auxiliary variables \(\boldsymbol{U}^∗\) and \(\boldsymbol{U}\) are used so that \(q_{t,∗}\) matches dimensions during the Markov chain transition at time \(t\); they are regenerated at each iteration of the chain.

When \(p_{M^∗} = p_{M^{(t)}}\) and there is no change in dimension, the approach above accommodates the proposal strategies that we discussed earlier. For example, a random walk is obtained using \((\boldsymbol{\theta}^∗_{M^∗},\boldsymbol{U}^∗) = (\boldsymbol{\theta}^{(t)} + \boldsymbol{U},\boldsymbol{U})\) with \(\boldsymbol{U} \sim N(0, \sigma^2\boldsymbol{I})\) having dimension \(p_{M^{(t)}}\). Alternatively, a Metropolis–Hastings chain can be constructed using \(\boldsymbol{\theta}^∗_{M^∗} = q_{t,∗}(\boldsymbol{U})\) when \(p_{\boldsymbol{U}} = p_{M^*}\), for an appropriate functional form of \(q_{t,∗}\) and suitable \(\boldsymbol{U}\). No \(\boldsymbol{U}^*\) would be required to equalize dimensions here.
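In code, the random walk case amounts to something like the following sketch (theta and sigma are assumed to be defined):

# within-model random walk expressed through the mapping q:
# (theta*, U*) = (theta + U, U), with no change in dimension
U = rnorm(length(theta), mean = 0, sd = sigma)  # U ~ N(0, sigma^2 I)
theta.star = theta + U                          # proposed parameter value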

When \(p_{M^{(t)}} < p_{M^∗}\), \(\boldsymbol{U}\) can be used to expand parameter dimensionality in order to match dimensions between iterations. When \(p_{M^{(t)}} > p_{M^∗}\), \(\boldsymbol{U}\) may be unnecessary: for example, the simplest dimension reduction is to deterministically reassign some elements of \(\boldsymbol{\theta}^{(t)}\) to \(\boldsymbol{U}^*\) and retain the rest for \(\boldsymbol{\theta}^∗_{M^∗}\). We will give an example of this later. In all these cases, the reverse proposal is again obtained from the inverse of \(q_{t,∗}\).

Assume that the chain is currently visiting model \(m^{(t)}\), so that the chain is in state \(\boldsymbol{x}^{(t)} = (m^{(t)}, \boldsymbol{\theta}^{(t)}_{m^{(t)}})\). The next iteration of the RJMCMC algorithm can be summarized as follows (a code sketch of one iteration is given after the list):

  1. Sample a candidate model \(M^* | m^{(t)}\) from a proposal distribution with conditional density \(g(\cdot|m^{(t)})\). The candidate model requires parameters \(\boldsymbol{\theta}_{M^∗}\) of dimension \(p_{M^∗}\).

  2. Given \(M^∗ = m^∗\), generate augmenting variables \(\boldsymbol{U} | m^{(t)}, \boldsymbol{\theta}^{(t)}, m^∗\) from a proposal distribution with density \(h(\cdot|m^{(t)}, \boldsymbol{\theta}^{(t)},m^∗)\). Let \[(\boldsymbol{\theta}^∗_{m^∗} ,\boldsymbol{U}^∗) = q_{t,∗}(\boldsymbol{\theta}^{(t)}_{m^{(t)}} ,\boldsymbol{U}),\] where \(q_{t,∗}\) is an invertible mapping from \((\boldsymbol{\theta}^{(t)}_{m^{(t)}} ,\boldsymbol{U})\) to \((\boldsymbol{\theta}^∗_{m^∗} ,\boldsymbol{U}^∗)\) and the auxiliary variables have dimensions satisfying \(p_{m^{(t)}} + p_{\boldsymbol{U}} = p_{m^{*}} + p_{\boldsymbol{U}^*}\).

  3. For a proposed model \(M^∗ = m^∗\) and the corresponding proposed parameter values \(\boldsymbol{\theta}^∗_{m^∗}\), compute the Metropolis–Hastings ratio given by

\[\frac{f(m^*, \boldsymbol{\theta}^{*}_{m^*}|\boldsymbol{y})\,g(m^{(t)} | m^*)\,h(\boldsymbol{u}^*|m^∗, \boldsymbol{\theta}^{*}_{m^*},m^{(t)})}{f(m^{(t)}, \boldsymbol{\theta}^{(t)}_{m^{(t)}}|\boldsymbol{y})\,g(m^* | m^{(t)})\,h(\boldsymbol{u}|m^{(t)}, \boldsymbol{\theta}^{(t)}_{m^{(t)}},m^{*})}|\boldsymbol{J}(t)| \]

where \(\boldsymbol{J}(t)\) is the Jacobian matrix \(\boldsymbol{J}(t) =\frac{d\,q_{t,∗}(\boldsymbol{\theta}, \boldsymbol{u})}{d(\boldsymbol{\theta}, \boldsymbol{u})}\Big\vert_{(\boldsymbol{\theta}, \boldsymbol{u}) = (\boldsymbol{\theta}^{(t)}_{m^{(t)}}, \boldsymbol{U})}\).

Then, we accept the move to model \(M^∗\) with probability equal to the minimum of 1 and the ratio above. If the proposal is accepted, set \(\boldsymbol{X}^{(t+1)} = (M^*, \boldsymbol{\theta}^*_{M^*})\). Otherwise, reject the candidate draw and set \(\boldsymbol{X}^{(t+1)} = \boldsymbol{x}^{(t)}\).

The last term in this ratio is the absolute value of the determinant of the Jacobian matrix arising from the change of variables from \((\boldsymbol{\theta}^{(t)}_{m^{(t)}},\boldsymbol{U})\) to \((\boldsymbol{\theta}^{*}_{m^{*}},\boldsymbol{U}^*)\). If \(p_{M^{(t)}} = p_{M^*}\), then this ratio simplifies to the standard Metropolis–Hastings ratio.

  4. Discard \(\boldsymbol{U}\) and \(\boldsymbol{U}^*\). Return to step 1.
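The following is a minimal sketch of one iteration of this algorithm in R. Every function it calls (g.sample, g.dens, h.sample, h.logdens, q.map, and log.post) is a hypothetical user-supplied piece, not part of any package: g.sample/g.dens give the model proposal, h.sample/h.logdens give the auxiliary proposal and its log density, q.map applies the dimension-matching transformation and returns the log Jacobian term, and log.post evaluates the log unnormalized posterior \(f(m, \boldsymbol{\theta}_m | \boldsymbol{y})\).

# one RJMCMC iteration (schematic sketch, assuming user-supplied pieces above)
rjmcmc.step = function(m, theta){
  m.star = g.sample(m)                     # step 1: propose a candidate model
  u = h.sample(m, theta, m.star)           # step 2: draw auxiliary variables U
  mv = q.map(m, m.star, theta, u)          # dimension-matching transformation:
                                           # list(theta.star, u.star, logjac)
  logR = log.post(m.star, mv$theta.star) - log.post(m, theta) +  # step 3
    log(g.dens(m, m.star)) - log(g.dens(m.star, m)) +
    h.logdens(mv$u.star, m.star, mv$theta.star, m) -
    h.logdens(u, m, theta, m.star) +
    mv$logjac                              # log of the MH ratio, incl. Jacobian
  if (log(runif(1)) < logR) {              # accept with probability min(1, R)
    list(m = m.star, theta = mv$theta.star)
  } else {
    list(m = m, theta = theta)             # step 4 (carry over current state)
  }
}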

3.1 Example

We introduce an example of a Bayesian model selection problem in which we would like to determine whether a Poisson or negative binomial distribution is more suitable for data pertaining to a set of counts. That is, we would like to assess whether there is sufficient overdispersion in the counts to merit the use of a negative binomial distribution. Here, we have data pertaining to 1,040 English Premiership soccer matches from the 2005/06 to 2007/08 seasons, where for each match we recorded the number of goals scored. For illustration's sake we will assume that we can treat these counts as if they were obtained as a simple random sample.

Let's plot the data:

hist(y, right = F)  # histogram of goals per match
abline(v = mean(y)) # mark the sample mean

It looks like the mean is around 2.5, but visually it is unclear whether there is significant overdispersion. Let's call \(\mathcal{M}_1\) the Poisson model, where \(y_i \sim\) Pois(\(\lambda\)) with likelihood \(L(\boldsymbol{y} | \lambda) = \prod_{i = 1}^n \frac{\lambda^{y_i}e^{-\lambda}}{y_i!}\), and \(\mathcal{M}_2\) the negative binomial model, where \(y_i \sim\) NB(\(\lambda\),\(\phi\)) with likelihood \(L(\boldsymbol{y} | \lambda, \phi) = \prod_{i = 1}^n \frac{\Gamma(1/\phi +y_i)}{y_i!\,\Gamma(1/\phi )}\left(\frac{\phi\lambda}{1+\phi\lambda}\right)^{y_i}(1+\phi\lambda)^{-1/\phi}\), parameterized so that \(E(y_i) = \lambda\) and \(Var(y_i) = \lambda(1+\phi\lambda)\).

Using the notation introduced earlier, we can write \(\theta_1 = \lambda\) and \(\boldsymbol{\theta}_2 = (\lambda,\phi)\). We assume prior model probabilities \(f(m_1) = f(m_2) = 0.5\), so that the two models are equally likely a priori. For \(\lambda\) in both models we assume a Gamma(\(\alpha\),\(\beta\)) prior with \(\alpha = 25\) and \(\beta = 10\). We assume a similar prior for \(\phi\) (independent of \(\lambda\)), except with \(\alpha = 1\) and \(\beta = 10\).

Clearly the dimensions of the two models differ. In order to propose transitions between the two models, we need to introduce an auxiliary variable and a deterministic invertible function \(q\) to facilitate this. If the chain is currently in state \((1, \theta_1)\) and model \(\mathcal{M}_2\) is proposed, then a random variable \(U \sim h(u| 1, \theta_1, 2)\) is generated from some proposal density \(h\). Let us assume here that \(U \sim N(0, \sigma^2)\), so \(h\) is the normal density with variance \(\sigma^2\) in this example. Then, we can let \(\boldsymbol{\theta}_2 = q_{1,2}(\theta_1,U) = (\lambda,\mu e^{U})\) for some fixed \(\mu\). In other words, we assume that \(\lambda\) is maintained between models, while the new parameter \(\phi\) is a log-normal random variable, multiplicatively centred around \(\mu\).

The Jacobian for this move is relatively easy to calculate: the mapping \((\lambda, u) \mapsto (\lambda, \mu e^{u})\) has Jacobian matrix \(\mathrm{diag}(1, \mu e^{u})\), whose determinant is simply \(\mu e^{U}\). The acceptance probability is then \(\min(1,R)\), where \[R = \frac{f(2, \boldsymbol{\theta}_2 | \boldsymbol{y})}{f(1, {\theta}_1 |\boldsymbol{y})h(u|1,\theta_1, 2)}\mu e^{U}.\]

For the reverse move from \(\mathcal{M}_2\) to \(\mathcal{M}_1\), we can invert \(q_{1,2}\): if the chain is in state \((2, \boldsymbol{\theta}_2)\) and \(\mathcal{M}_1\) is proposed, then \(q_{2,1}(\lambda, \phi) = (\lambda,\log(\phi/\mu))\). The acceptance probability is \(\min(1,R)\), where \[R = \frac{f(1, {\theta}_1 | \boldsymbol{y})h(\log(\phi/\mu)|1,\theta_1, 2)}{f(2, \boldsymbol{\theta}_2 |\boldsymbol{y})}\frac{1}{\mu e^{U}}\] which is the reciprocal of the ratio for the forward move (here \(u = \log(\phi/\mu)\), so that \(\mu e^{u} = \phi\)).

For our proposal, the choices of \(\mu\) and \(\sigma^2\) are critical, as poorly chosen values may lead to slow convergence. Here \(\mu\) can be chosen naturally: using \(Var(Y)/E(Y)\) as a crude estimate of the overdispersion factor \(1 + \phi\lambda\), and plugging in the sample mean of \(\boldsymbol{y}\) for the expectation and the sample variance for the variance, we get \(\mu = 0.015\). The results are less sensitive to the choice of \(\sigma\), which we fix at 1.5.
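In code, this crude moment-based choice might look like the following sketch, assuming the counts are stored in y and the negative binomial variance form \(Var(Y) = \lambda(1+\phi\lambda)\):

# method-of-moments estimate of phi, used to centre the phi proposal:
# Var(Y)/E(Y) = 1 + phi*lambda  =>  phi = (Var(Y)/E(Y) - 1)/E(Y)
mu = (var(y)/mean(y) - 1)/mean(y)  # approximately 0.015 for these data
sigma = 1.5                        # proposal spread; results less sensitive here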

Let's use the rjmcmc package to facilitate the sampling here:

library(rjmcmc)  # reversible jump MCMC via post-processing

# dimension-matching transformations and their inverses
q12 = function(theta1){ c(theta1[1], mu*exp(theta1[2])) }  # (lambda, u) -> (lambda, mu*e^u)
q21 = function(theta2){ c(theta2[1], log(theta2[2]/mu)) }  # (lambda, phi) -> (lambda, log(phi/mu))

# Poisson priors (on the augmented parameters (lambda, u))
p.prior1 = function(theta1){dgamma(theta1[1], lamprior[1], lamprior[2], log=T) + dnorm(theta1[2], 0, sigma, log=T)}

# NB priors (on (lambda, phi))
p.prior2 = function(theta2){dgamma(theta2[1], lamprior[1], lamprior[2], log=T)+ dgamma(theta2[2], phiprior[1], phiprior[2], log=T)}

# constants and hyperparameters
n = length(y)
mu = 0.015; sigma = 1.5
lamprior = c(25,10); phiprior = c(1,10) # hyperparameters for lambda & phi

# run the post-processing sampler
goals_post = rjmcmcpost(post.draw = list(draw1, draw2), g = list(q12, q21),
                        ginv = list(q21, q12), likelihood = list(L1, L2),
                        param.prior = list(p.prior1, p.prior2),
                        model.prior = c(0.5, 0.5), chainlength = 10000)
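The call above also references draw1, draw2, L1, and L2, which are not shown in the snippet. As a rough sketch of what they could look like — assuming that the likelihood functions return the log-likelihood of the data given the (augmented) parameter vectors, and that the posterior-draw functions resample rows of stored single-model MCMC output, held here in hypothetical matrices C1 and C2 — one possibility is:

# log-likelihoods for the two models (functions of the augmented parameters)
L1 = function(theta1){ sum(dpois(y, lambda = theta1[1], log = TRUE)) }
L2 = function(theta2){ sum(dnbinom(y, mu = theta2[1], size = 1/theta2[2], log = TRUE)) }

# draws from each model's individual (augmented) posterior; C1 and C2 are
# hypothetical matrices of stored single-model MCMC draws, with model 1's
# draw augmented by the auxiliary variable u ~ N(0, sigma^2)
draw1 = function(){ c(C1[sample(nrow(C1), 1), ], rnorm(1, 0, sigma)) }
draw2 = function(){ C2[sample(nrow(C2), 1), ] }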

Running this, we find that the posterior model probability of the Poisson model is 0.71, whereas for the NB model it is 0.29, suggesting that the former is a better fit to the data. The estimated posterior distributions are plotted below:

hist(y, right = F, freq = F)  # histogram of the counts on the density scale
x = 0:10
lines(x+.5, dpois(x, lambda = 2.552), col = "red")               # Poisson fit at posterior mean
lines(x+.5, dnbinom(x, mu = 2.552, size = 1/0.02), col = "blue") # NB fit at posterior means

The NB fit is very similar to the Poisson fit.