Statisticians often encounter situations where they need to evaluate an integral while fitting a statistical model. Such integrals arise, for example, in random effects models, in Bayesian methods, and in the evaluation of expectations.
In this section we wish to cover several strategies to evaluate such integrals in practice. In particular, we focus on situations where there is no clear analytical form for the integral. In such cases we resort to numerical integration to approximate the integral.
There are a variety of methods for performing numerical integration, some of which you may have already learned in high school calculus. We will review these methods as well as ones commonly utilized in statistics, such as Gaussian quadrature and Monte Carlo integration.
Let us take a simple example of an integral below:
\[\int_{0}^1 x^2dx\]
Here, we wish to integrate the area under the function \(x^2\) from \(x=0\) to \(x=1\). Graphically, we want to evaluate the area of the shaded region below:
## choose plotting range from -1.5 to 1.5
x = seq(-1.5, 1.5, , 1000)
## make a plot with this function
plot(x,
x ^ 2,
ylab = "f(x)",
xlab = "x",
type = 'l')
## add shaded region under the curve from 0 to 1
polygon(
c(0, x[x >= 0 & x <= 1], 1),
c(0, (x ^ 2)[x >= 0 & x <= 1], 0),
col = "grey"
)
So how do we evaluate this integral? Clearly there is an analytical form of the integral in this situation, and we can fall back to what we learned in high school calculus to evaluate it:
\[\int_{0}^1 x^2dx = \frac{x^3}{3}\Big|_0^1 = \frac{1}{3}\]
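As a quick sanity check, base R’s integrate() function (an adaptive quadrature routine) should recover essentially the same answer numerically; a minimal sketch:
## numerically check the toy integral with base R's adaptive quadrature
integrate(function(x) x ^ 2, lower = 0, upper = 1)
The reported value should be very close to 1/3, previewing the kinds of numerical approximations we develop by hand below.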
In statistics there are also situations where nice analytical forms of integrals can be found, for example through the use of conjugate priors in Bayesian methods. However, more often than not this will not be the case.
Let us use an example from Givens and Hoeting that illustrates an application of the Generalized Linear Mixed Model (GLMM) to a longitudinal Alzheimer’s patient study (example 5.1).
The example describes a set of 22 Alzheimer’s patients who were tracked over a period of 5 months. Once a month, patients were asked to recall words from a list that was provided at baseline. The outcome variable in this setting is the number of words recalled by each patient in a given month. These patients were also receiving an experimental therapy during the course of the study. The investigators sought to determine whether there is any improvement in word recall over time (assumed to be linear) compared to each patient’s baseline. Data for 25 control patients were also recorded but are not available.
Clearly, the repeated word count measurements within each subject are not independent, and there may be substantial between-patient variation in baseline recall. We can utilize a Generalized Linear Mixed Model (GLMM) to account for these issues.
Let’s visualize some of the patient data before going further. The following spaghetti plot shows the individual patient trajectories:
## read in the data
alz = read.table("alzheimers.dat", header = T)
## plot trajectory of each patient as separate lines
for (i in 1:max(alz$subject)) {
# select subject indices
index = which(alz$subject == i)
# plot trajectory
if (i == 1) {
plot(
alz$month[index],
alz$words[index],
type = 'l',
ylim = range(alz$words),
ylab = "Words",
xlab = "Month",
col = i
)
} else{
lines(alz$month[index], alz$words[index], type = 'l', col = i)
}
}
It appears that the number of words recalled per month by each patient is increasing over time. This may suggest the experimental treatment is improving recall performance relative to each patient’s untreated baseline. It is also apparent that substantial heterogeneity exists in patients’ baseline word recall.
We can more formally set up the problem as follows. Let \(y_{ij}\) be the number of words recalled by patient \(i\) in month \(j\), where \(i = 1\ldots 22\) and \(j = 1 \ldots 5\). We assume that each \(y_{ij} \sim Po(\lambda_{ij})\), so that the mean and variance of \(y_{ij}\) are both \(\lambda_{ij}\).
Also, let the covariate vector pertaining to the \(i\)th patient in the \(j\)th month be defined as \(\boldsymbol{x}_{ij} = (1, j)\), where we assume an intercept \(\beta_0\) and linear slope \(\beta_1\) for month (\(\boldsymbol{\beta} = (\beta_0,\beta_1)^T\)).
Then, we assume the counts can be modeled with a Poisson GLMM such that \[\log(\lambda_{ij}) = \boldsymbol{x}_{ij}\boldsymbol{\beta} + \gamma_i,\] where \(\gamma_i\) is the unobserved subject-level random effect such that \(\gamma_i \sim N(0,\sigma_{\gamma}^2)\).
So, how do we write the likelihood for this Poisson GLMM with a random intercept? If each \(\gamma_i\) were observable and known, we could write the likelihood as the following:
\[L(\boldsymbol{\beta}, \sigma^2_{\gamma} | \boldsymbol{y}, \boldsymbol{\gamma}) = \prod_{i = 1}^{22} \left(\prod_{j = 1}^{5}f(y_{ij} | \lambda_{ij})\right)\phi(\gamma_i| 0, \sigma^2_{\gamma}).\] Here \(f(y_{ij} | \lambda_{ij})\) is the Poisson PMF with mean \(\lambda_{ij}\) (defined above, dependent on \(\gamma_{i}\)), and \(\phi(\gamma_i| 0, \sigma^2_{\gamma})\) is the Normal PDF with mean 0 and variance \(\sigma^2_{\gamma}\). Also, \(\boldsymbol{y}\) is the vector of all observations in the study and \(\boldsymbol{\gamma} = (\gamma_1,\ldots,\gamma_{22})\).
However, as in all GLMMs, the random effects are actually unobservable. Therefore, in order to obtain the likelihood, we must integrate them out of the above expression.
We perform separate integrals for each \(\gamma_i\), 22 in total, which share a common distribution determined by \(\sigma^2_{\gamma}\). In some sense, we can interpret these \(\gamma_i's\) as 22 iid draws from \(\phi(\gamma_i| 0, \sigma^2_{\gamma})\), which reflects the baseline subject-level variability about \(\beta_0\). This likelihood can be represented as the following:
\[L(\boldsymbol{\beta}, \sigma^2_{\gamma} | \boldsymbol{y}) = \prod_{i = 1}^{22}\int \left[\left(\prod_{j = 1}^{5}f(y_{ij} | \lambda_{ij})\right)\phi(\gamma_i| 0, \sigma^2_{\gamma}) \right] d\gamma_i\]
The log-likelihood is therefore \[\ell(\boldsymbol{\beta}, \sigma^2_{\gamma} | \boldsymbol{y}) = \sum_{i = 1}^{22}\log\left[\int \left[\left(\prod_{j = 1}^{5}f(y_{ij} | \lambda_{ij})\right)\phi(\gamma_i| 0, \sigma^2_{\gamma}) \right] d\gamma_i\right].\]
Remember, the \(\lambda_{ij}\) is dependent on \(\gamma_i\) as well.
This is a bread and butter GLMM, specifically a Poisson GLMM with a subject-level random intercept. Now we would like to maximize this log-likelihood to obtain estimates for \(\boldsymbol{\beta}\) and \(\sigma^2_{\gamma}\). But how do we even do this?
Well, given that the joint density in the integrand does not simplify to a known distribution, there are no simple closed form solutions here to obtain the MLE. As a result, we will have to resort to the optimization tools discussed in the prior lectures (no surprise there).
More importantly, we have the additional complication of the presence of an integral with respect to each \(\gamma_i\) in the log-likelihood. Again, the joint density in the integrand does not simplify to some known distribution, so we must approximate this integral numerically during the maximization process. We also have to make some assumptions about switching the order of integration and differentiation when calculating derivatives.
If we assume that we can switch the order of integration and differentiation when computing the 1st and 2nd derivatives of the log-likelihood, we can obtain analytical forms for the first and second derivatives. Then, we can implement a few of the optimization schemes discussed in the first lecture of this module.
However, we still have to worry about evaluating these integrals in each iteration of whatever optimization method we choose. If the evaluation of the integral is complex or computationally burdensome, this may affect which optimization method we want to use in practice. For example, NR may not be an efficient choice in such situations, as we will have to integrate in both the score function (1st derivative) and the Hessian (2nd derivative) in each iteration. Some off-the-shelf software also evaluates the log-likelihood by default at each iteration, adding another round of integration.
The more basic question is how would we even perform this integral to begin with? The integral clearly does not have a closed form. In the next several sections we will introduce some general methods to approximate these integrals. We will then combine these approaches with the methods discussed in the previous lecture for maximization.
Let’s consider the general problem of evaluating a one-dimensional integral of the function \(f(x)\) with respect to \(x\) over the range \(a\) to \(b\). We can write this as \(\int_a^bf(x)dx\). As mentioned earlier, oftentimes in Bayesian analysis the posterior distribution in question does not have a clear analytical form for the integral, or in frequentist methods such as GLMMs the likelihood requires integration.
To approximate such integrals, we can first split the interval \([a,b]\) into \(n\) parts, say \([x_k,x_{k+1}]\) for \(k = 0, \ldots, n-1\), where \(x_0 = a\) and \(x_n = b\). Then we can write \[\int_a^bf(x)dx = \sum_{k=0}^{n-1}\int_{x_k}^{x_{k+1}}f(x)dx.\]
In this sense, we are breaking up the original integral into a sum of integrals over disjoint subintervals spanning the original range of integration. We then worry about approximating the integral on each individual subinterval and sum these approximations to obtain the estimate of the original full integral.
On an individual subinterval \([x_k,x_{k+1}]\), we initiate the approximation by inserting \(m+1\) nodes \(x_{kj}^*\) for \(j = 0, \ldots,m\) across the subinterval. Here \(m\) is the order of the approximation, which we will discuss in the next section. Then, we can approximate the integral on this subinterval in the following general form: \[\int_{x_k}^{x_{k+1}}f(x)dx \approx \sum_{j=0}^m A_{kj}f(x_{kj}^*)\] where the \(A_{kj}\) are some set of constants.
In other words, we approximate the integral in each subinterval by evaluating the function of interest \(f(x)\) at each node in the subinterval, and then multiplying by some constant prior to summing up over the nodes.
As a general rule, neither the lengths of the subintervals nor the spacing of the nodes within each subinterval needs to be equal. What may be apparent from the above approximation is that the accuracy should improve as we decrease the length of the subintervals and/or increase the number of nodes per subinterval.
For this general class of methods, the main differentiator between approximations is the choice of the nodes and the constants \(A_{kj}\).
This class of methods (the Newton-Cotes rules) is characterized by nodes equally spaced in \([x_k,x_{k+1}]\), with the same number of nodes used across all subintervals. In a general sense, the function to be integrated in each subinterval is replaced with a polynomial approximation that is easier to integrate. Three methods in this class are the Riemann, trapezoidal, and Simpson’s rules. The main difference between these rules is the order of the polynomial approximation of \(f(x)\) in each subinterval, ranging from 0 (constant, 1 node), to 1 (linear, 2 nodes), to 2 (quadratic, 3 nodes), respectively. This is illustrated in the figure below.
We discuss each of these methods briefly below.
The Riemann rule boils down to approximating the function in a given subinterval with a constant function (1 node per subinterval, order 0). For each subinterval, one may use the value of \(f(x)\) at a given point in the subinterval, say \(f(x_k)\), pertaining to a 0th degree polynomial. Then, the integral on this subinterval is approximated with \[\int_{x_k}^{x_{k+1}}f(x)dx \approx \int_{x_k}^{x_{k+1}}f(x_k)dx = (x_{k+1} - x_k)f(x_k)\]
In other words, the integral is approximated as the area of the rectangle formed by the width of the subinterval \((x_{k+1} - x_k)\) multiplied by the height of the rectangle \(f(x_k)\).
Then, the total approximation is obtained by summing over the individual subintervals. If the lengths of the subintervals are all equal, this overall sum simplifies to \(h\sum_{k=0}^{n-1}f(a+kh)\), where \(h = \frac{b-a}{n}\) is the width of each equally spaced subinterval. Note that we did not have to use \(f(x_k)\) as the constant; \(f(x_{k+1})\) could be used as well.
As \(n\rightarrow \infty\), the approximation approaches the true value of the integral. The approximation is also exact, as one would imagine, when \(f(x)\) itself is constant on the interval. Since it is not known in advance what \(n\) would be best to use, we can sequentially increase \(n\) and repeat the approximation.
Similar to the approaches discussed in the first lecture of this module, we can examine when the approximation tends to converge with respect to \(n\), stopping at that particular number of intervals. It is recommended to use \(n_{k+1} = 2n_k\), doubling the number of intervals relative to the prior evaluation, so that half of the subinterval endpoints at the next step correspond to the old endpoints from the previous step. This avoids redundant evaluations of the function of interest.
A drawback of this approximation is that it is slow to converge with respect to \(n\): you may need to increase the number of intervals greatly to achieve a good approximation. This is not surprising given that the approximation on each subinterval is a constant function, so shorter subintervals provide a better approximation to the function of interest. One advantage of this approach is that it is quite easy to calculate, particularly when the length of each subinterval is the same.
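Before applying this to the GLMM, here is a minimal sketch of the equal-width Riemann rule on the toy integral \(\int_0^1 x^2dx\) (true value 1/3), doubling \(n\) each time as suggested above; the helper name riemann_toy is ours, not from the text:
## equal-width (left endpoint) Riemann rule on the toy example
riemann_toy = function(n, a = 0, b = 1, f = function(x) x ^ 2) {
  h = (b - a) / n
  # single node per subinterval: the left endpoint
  h * sum(f(a + (0:(n - 1)) * h))
}
## double n each time and watch the approximation approach 1/3
sapply(2 ^ (1:6), riemann_toy)
The values should creep up toward 1/3 as \(n\) doubles, illustrating the (slow) convergence discussed above.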
Let’s go ahead and apply this to the problem of approximating the integral for the first subject in the Alzheimer’s dataset. We will vary the number of intervals so that we can observe how quickly the approximation converges with respect to \(n\). First we will define the function to be integrated, which in this case is \(f(\gamma_1) = \left(\prod_{j = 1}^{5}f(y_{1j} | \lambda_{1j})\right)\phi(\gamma_1| 0, \sigma^2_{\gamma})\), the likelihood contribution of the first subject.
# gammai: a vector of values in the support of \gamma_i
# yi: a vector of length 5 pertaining to single individual i
# xi: a 5 x 2 matrix pertaining to the intercept and longitudinal month value for each subject (it is the same for each subject, 1:5)
# beta: a length 2 vector
# s2gamma: is a scalar
# log: logical, returns logged version of function output if T
# this function takes each element of the gammai vector and returns \prod_{j = 1}^{5}f(y_{1j} | \lambda_{1j})\phi(\gamma_1| 0, \sigma^2_{\gamma})
inner = function(gammai, yi, xi = cbind(rep(1,5), 1:5),beta, s2gamma, log = F) {
## create vector holding return value f(gammai) with respect to each element of gammai
val = rep(NA, length(gammai))
## calculate product with respect to each element of gammai
for (i in 1:length(val)) {
# calculate lambda: x_{ij} %*% beta + gammai[i]
# 5 x 1 vector pertaining to each month's measurement
lambda = exp(xi %*% beta + gammai[i])
# calculate f(y_ij | lambda_ij) for j = 1,...,5
# 5 x 1 vector
val0 = dpois(x = yi, lambda = lambda)
# calculate the product of f(y_ij | lambda_ij) over
# j = 1,...,5 times phi(gamma_i | s2_gamma)
# \prod_{j = 1}^{5}f(y_{1j} | \lambda_{1j})\phi(\gamma_1| 0, \sigma^2_{\gamma})
val[i] = prod(val0) * dnorm(x = gammai[i],
mean = 0,
sd = sqrt(s2gamma))
}
## return likelihood or log likelihood pertaining to the i'th subject
if (log == F) {
return(val)
} else{
return(log(val))
}
}
Now that we have that, let’s go ahead and apply Riemann’s rule to the first subject in our dataset. Remember, we need to integrate out the random effect pertaining to the first subject to obtain that subject’s likelihood contribution.
To simplify the illustration of this problem, let us assume that we would like to integrate over the range [−0.07, 0.085], and that \(\boldsymbol{\beta} = (1.804, 0.165)^T\) and \(\sigma^2_{\gamma} = 0.000225\). Normally we would like to integrate over the real number line, but we will revisit this later.
Using what we have discussed, we can visualize the integrand with respect to \(\gamma_1\):
## Set values
range = c(-0.07, 0.085)
beta = c(1.804, 0.165)
s2gamma = 0.000225
y1 = alz$words[alz$subject == 1]
## Define plotting range across gamma1
gamma1 = seq(range[1], range[2], length.out = 1000)
## plot
plot(
# x axis
gamma1,
# yaxis
inner(
gammai = gamma1,
yi = y1,
# using default for xi
beta = beta,
s2gamma = s2gamma
),
ylab = "inner(gamma1)",
xlab = "gamma1",
type = 'l'
)
So, for the range that we picked, it looks like we lucked out in that most of the mass under the curve is concentrated in this region. Now let’s integrate this function using what we just learned. First, let’s write a function to do this:
# intervals: the number of intervals you want to use, integer
# range: the range you want to integrate over, vector of length 2
# trace: print intermediate output, boolean
# rest are same as inner function
reimann = function(intervals, range, yi, beta, s2gamma, trace = 0 ){
## set the number of intervals n
n = intervals
## evaluate Reimann’s rule, assuming intervals of equal length
# get h
h = (range[2] - range[1])/n
# now evaluate. The book says to sum from 0 to n-1.
# 0 below pertains to the left most interval boundary, single node per interval
k = 0:(n-1)
# the left endpoint of each subinterval serves as the single node
points = range[1] + k*h
# now evaluate the Reimann rule for equal intervals as described above (w x h)
val = h*sum(inner(gammai = points, yi = yi, beta = beta, s2gamma = s2gamma))
## print trace statement if > 0
if(trace > 0) cat(sprintf("t=%d\tn=%d\tvalue=%.10e\n", log2(n), n, val))
## return the result
return(val)
}
Now let’s evaluate this for \(n = 2\) intervals:
reimann(
intervals = 2,
range = range,
yi = y1,
beta = beta,
s2gamma = s2gamma
)
## [1] 5.179306e-07
We can imagine that with only two intervals this approximation is not very good. Let’s increase the number of intervals by a factor of two each time and reevaluate.
for(t in 1:10){
reimann(
intervals = 2^t,
range = range,
yi = y1,
beta = beta,
s2gamma = s2gamma,
trace = 1
)
}
## t=1 n=2 value=5.1793058591e-07
## t=2 n=4 value=2.7894492974e-07
## t=3 n=8 value=2.5450771897e-07
## t=4 n=16 value=2.5450520275e-07
## t=5 n=32 value=2.5450519735e-07
## t=6 n=64 value=2.5450518457e-07
## t=7 n=128 value=2.5450517495e-07
## t=8 n=256 value=2.5450516929e-07
## t=9 n=512 value=2.5450516623e-07
## t=10 n=1024 value=2.5450516465e-07
Clearly as we increase the number of intervals (and therefore the total number of nodes evaluated), the integral approximation improves. The question is whether there are more efficient ways to achieve a similar level of accuracy, for example by using a fewer number of intervals.
The trapezoidal rule can be written as the following:
\[\int_a^bf(x)dx \approx \sum_{k=0}^{n-1}\frac{(x_{k+1} - x_{k})}{2}\left(f(x_k) + f(x_{k+1})\right)\] which you may recognize as a sum of trapezoid areas, where each trapezoid has width equal to the subinterval, left-side height \(f(x_k)\), and right-side height \(f(x_{k+1})\). The approximating polynomial here is linear (order \(m = 1\)), using two nodes per subinterval (the left and right bounds), as can be seen in the figure above.
With this linear approximating function, the area under the curve on each subinterval is a trapezoid. We may also derive this expression from the integral of the “fundamental polynomials” (GH, Section 5.1.2) pertaining to order \(m = 1\) (linear). We won’t get into this in too much detail, but we note it here in case you are interested in learning more. Simpson’s rule is similar with \(m = 2\); we will skip its derivation for time purposes.
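For the curious, a minimal sketch of the composite Simpson’s rule (order \(m = 2\), three nodes per subinterval) applied to the toy integral \(\int_0^1 x^2dx\); because the integrand is a quadratic, the rule is exact here even with very few subintervals (the helper name simpson_toy is ours):
## composite Simpson's rule on the toy example
simpson_toy = function(n, a = 0, b = 1, f = function(x) x ^ 2) {
  h = (b - a) / n
  left = a + (0:(n - 1)) * h
  # three nodes per subinterval: left endpoint, midpoint, right endpoint
  sum((h / 6) * (f(left) + 4 * f(left + h / 2) + f(left + h)))
}
simpson_toy(n = 2)
This should return 1/3 (up to floating point) even with only two subintervals.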
Let’s now implement the trapezoidal rule:
# same setup as the previous function
trap = function(intervals, range, yi, beta, s2gamma, trace = 0 ){
## set the number of intervals n
n = intervals
## evaluate the trapezoidal rule, assuming intervals of equal length
# get h
h = (range[2] - range[1])/n
# now evaluate. The book says to sum from 0 to n-1.
# 0 below pertains to the left most interval boundary
k = 0:(n-1)
# two nodes per subinterval: the left and right boundaries
points_left = range[1] + k * h
points_right = range[1] + (k + 1) * h
# now evaluate the trap rule for equal intervals as described above
# f(left) first
left = inner(
gammai = points_left,
yi = yi,
beta = beta,
s2gamma = s2gamma
)
# then f(right)
right = inner(
gammai = points_right,
yi = yi,
beta = beta,
s2gamma = s2gamma
)
# calculate value, following book formula
val = (h / 2) * sum(left + right)
## print trace statement if > 0
if(trace > 0) cat(sprintf("t=%d\tn=%d\tvalue=%.10e\n", log2(n), n, val))
## return the result
return(val)
}
# now lets run it
for(t in 1:10){
trap(
intervals = 2^t,
range = range,
yi = y1,
beta = beta,
s2gamma = s2gamma,
trace = 1
)
}
## t=1 n=2 value=5.1792974859e-07
## t=2 n=4 value=2.7894451109e-07
## t=3 n=8 value=2.5450750964e-07
## t=4 n=16 value=2.5450509808e-07
## t=5 n=32 value=2.5450514501e-07
## t=6 n=64 value=2.5450515840e-07
## t=7 n=128 value=2.5450516187e-07
## t=8 n=256 value=2.5450516274e-07
## t=9 n=512 value=2.5450516296e-07
## t=10 n=1024 value=2.5450516302e-07
Looks like we are able to achieve accuracy similar to the prior approach but with fewer intervals. This is not surprising, as we are using a linear approximation to the integrand in each subinterval rather than a constant function.
In general, these approaches may require a large number of short intervals to converge, particularly when the order of the approximating polynomial is low. The higher the order of the approximating polynomial in each subinterval, the faster the integral approximation converges with respect to the number of intervals.
To evaluate an integral over an infinite range, we can perform a change of variables, replacing say \(\gamma_1\) with \(y = \frac{\gamma_1}{1+|\gamma_1|}\) and then integrating over the range of \(y\), which is \((-1,1)\). This does not by itself address the earlier issue of the mass of the integrand being concentrated in just a few intervals. Such transformations can also introduce other problems, such as singularities. Other choices of transformation exist, and one may choose the best one based on the problem at hand (GH 5.4.1).
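As a hedged illustration of this idea (using the \(y = \frac{\gamma_1}{1+|\gamma_1|}\) form of the transformation), the sketch below integrates a standard normal density over the whole real line by transforming to \((-1,1)\) and applying a simple midpoint rule; the object names are ours:
## transformed integrand: x = y / (1 - |y|), with Jacobian 1 / (1 - |y|)^2
g_trans = function(y) dnorm(y / (1 - abs(y))) / (1 - abs(y)) ^ 2
## midpoint rule on (-1, 1), avoiding the endpoints where the Jacobian blows up
n_mid = 1000
h_mid = 2 / n_mid
mids = seq(-1 + h_mid / 2, 1 - h_mid / 2, by = h_mid)
sum(h_mid * g_trans(mids))
The result should be very close to 1, the known value of \(\int_{-\infty}^{\infty}\phi(x)dx\).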
Alternatively, one may use the methods in the next section (particularly Gauss-Hermite quadrature) to approximate integrals over the entire real line with very good accuracy.
When the mass under the integrand is concentrated mostly in just a few intervals, the previous approach may not work well. In such a case, the majority of the intervals will not contribute to the overall approximation, potentially wasting a lot of computational time. In addition, the intervals that do span the area of relevance may not be fine enough to approximate the integral in this area. We could increase the number of intervals to increase the accuracy, but we are also increasing the number of intervals we need to evaluate that do not contribute at all to the integral approximation.
Instead of having intervals with equally spaced nodes, what if we instead got rid of intervals entirely and just placed the nodes in the range of \(x\) where the bulk of the mass is under the curve?
This is the general intuition behind Gaussian Quadrature (GQ), where the trick is to first determine two quantities: the best locations of the nodes where the function will be evaluated, as well as the weights associated with each node. With these two quantities in hand, we can approximate a given integral with relatively good accuracy compared to the methods from the prior section while using far fewer nodes. In fact, with \(M\) nodes, integration is exact when the function being integrated against the weight function is a polynomial of degree \(2M-1\) or less.
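As a quick hedged illustration of this exactness property, the sketch below uses the statmod package (which we will use again shortly) to integrate \(x^4\) over \([-1,1]\) with a three-node Gauss-Legendre rule, which should be exact for polynomials of degree up to \(2\times 3 - 1 = 5\):
## load package providing gauss.quad()
library(statmod)
## three-node Gauss-Legendre rule, for w(x) = 1 on (-1, 1)
gl = gauss.quad(n = 3, kind = "legendre")
## quadrature approximation of the integral of x^4 over (-1, 1)
sum(gl$weights * gl$nodes ^ 4)
The result should equal \(2/5 = 0.4\), the exact value of \(\int_{-1}^1 x^4dx\).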
Gaussian quadrature is very effective for integrals of the form \[ \int_a^b f(x)w(x)dx,\] where \(w(x)\) is nonnegative with finite moments, that is, \[ \int_a^b x^pw(x)dx < \infty\] for \(p \geq 0\). You may recall that this form is relevant in Bayesian applications where we have a density function \(f(x)\) and some prior \(w(x)\). It is also similar in form to the original integrand in our problem, where in place of \(f(x)\) we have \(\prod_{j = 1}^{5}f(y_{1j} | \lambda_{1j})\) and in place of \(w(x)\) we have \(\phi(\gamma_1| 0, \sigma^2_{\gamma})\), and we wish to integrate with respect to \(\gamma_1\).
The moment integrals \(\int_a^b x^pw(x)dx\) are themselves relevant to computing finite moments with respect to some pdf \(w(x)\) (such as expectations when \(p = 1\)). More generally, we can use quadrature to integrate any generic function \(f(x)\) via \[ \int_a^b f^*(x)w(x)dx,\] where \(f^{*}(x) = \frac{f(x)}{w(x)}\).
As before, we can write the integral approximation in the following manner: \[\int_a^b f(x)w(x)dx \approx \sum_{k = 1}^M A_kf(x_k),\] where \(x_1\ldots x_M\) are the selected quadrature nodes, and \(A_1 \ldots A_M\) are the weights associated with each node.
In the previous section, we chose nodes and associated “weights” in each subinterval that allowed exact integration of the approximating polynomial in that subinterval.
For Gaussian quadrature, the question is how do we pick the nodes and the weights for the function we wish to integrate? We do this through the use of orthogonal polynomials.
Let’s say that we can express our integrand in the form \(f(x)w(x)\). Depending on the form of \(w(x)\) in this expression, a corresponding set of orthogonal polynomials may be chosen (see Table 5.6 in GH). One of the biggest differences between the sets of orthogonal polynomials, besides the particular form of \(w(x)\) that they correspond to, is the range of possible node locations in \(x\).
For example, for the Hermite polynomials, the range is the entire real line. We will see how this is helpful later. The roots of these polynomials are essentially the “best” locations for the nodes pertaining to a given \(w(x)\). The weights are then chosen at each node to satisfy a series of conditions laid out in GH 5.3.2.
After choosing \(w(x)\), standard software can solve for the roots of the corresponding orthogonal polynomial, as well as the weight pertaining to each root, given a desired number of nodes. In practice, you will never need to solve these polynomials by hand. These values can then be utilized for integration later. For time purposes we will not go into more detail about how these orthogonal polynomials are solved, but for all practical purposes many R functions exist to do this once the choice of polynomial in Table 5.6 has been made.
More generally, given that \(w(x)\) is non-negative over the range of integration \([a,b]\), there exists a sequence of polynomials that are “orthogonal” with respect to \(w(x)\) over the range \([a^*,b^*]\), a transformed version of the original range of integration. In practical applications, the choice of polynomial may be easily recognizable from the form of \(w(x)\), sometimes requiring some factoring, rearranging of terms, or change of variables.
If we expand the expression for \(\phi(\gamma_i| 0, \sigma^2_{\gamma})\), we can see that we can rewrite the likelihood for the first sample as the following:
\[\int \left(\prod_{j = 1}^{5}f(y_{1j} | \lambda_{1j})\right) \times \frac{1}{\sqrt{2\pi\sigma_{\gamma}^2}}e^{-\frac{\gamma_1^2}{2\sigma_{\gamma}^2}} d\gamma_1\]
We can see that a rather natural choice is \(w(x) = e^{-x^2}\), pertaining to Gauss-Hermite quadrature, especially since we would ideally like to integrate over the real number line. If \(\gamma_i\) were instead assumed to follow a standard normal distribution, we would not have to do a lot of extra work here after obtaining our desired number of quadrature nodes and weights.
Since this is not the case here, we can achieve this form in a few ways. One way is through a change of variable to \(\gamma_i^* = \frac{\gamma_i}{\sqrt{2\sigma_{\gamma}^2}}\), integrating now with respect to \(\gamma_i^*\). We will do this in the next section.
Alternatively, a more general approach could be to simply do the following:
\[\int \left(\prod_{j = 1}^{5}f(y_{ij} | \lambda_{ij})\right) \times \frac{1}{\sqrt{2\pi\sigma_{\gamma}^2}}\left[\frac{e^{-\frac{\gamma_i^2}{2\sigma_{\gamma}^2}}}{e^{-\gamma_i^2}}\right]e^{-\gamma_i^2} d\gamma_i = \int f(\gamma_i)w(\gamma_i) d\gamma_i\]
where \(w(\gamma_i) = e^{-\gamma_i^2}\) and \(f(\gamma_i) = \left(\prod_{j = 1}^{5}f(y_{ij} | \lambda_{ij})\right) \frac{1}{\sqrt{2\pi\sigma_{\gamma}^2}}\left[\frac{e^{-\frac{\gamma_i^2}{2\sigma_{\gamma}^2}}}{e^{-\gamma_i^2}}\right]\).
We will show later that this may not work that well, and why the change of variable is the better way to go. At the end of this lecture, we will show that a more automated version of this change of variable procedure, adaptive GQ, is likely the more practical way to approach this problem.
Anyways, with this in hand, now we can apply Gauss Hermite Quadrature. But how to do this?
Once the weights and nodes have been determined, the integral is approximated simply using \[\sum_{k=1}^M A_kf(x_k),\] where \(x_k\) is the \(k\)th quadrature node and \(A_k\) is the weight associated with that node. In general, the more nodes that we select the more accurate the approximation becomes, however at the expense of more computing time.
Let’s start with \(M = 5\) nodes. Note that the total number of nodes here is much smaller than in the previous approaches (which used at minimum 1 node per interval). In our current example we have selected Gauss-Hermite quadrature, which has the advantage of being able to approximate the integral in question over the entire real line. Let’s first determine the nodes and associated weights for \(M = 5\). Available R functions will either calculate these directly or simply look them up in precomputed tables. Here we use the statmod package to print the nodes and associated weights for \(k = 1,\ldots, 5\):
## load the statmod library
library(statmod)
## get the nodes and associated weights up to M = 5
gh = gauss.quad(n = 5, kind = "hermite")
## print out the resulting list object
print(gh)
## $nodes
## [1] -2.020183e+00 -9.585725e-01 2.402579e-16 9.585725e-01 2.020183e+00
##
## $weights
## [1] 0.01995324 0.39361932 0.94530872 0.39361932 0.01995324
Now, let’s write a function for \(f(\gamma_i)\) as defined earlier, and then use this function to calculate the integral approximation with the associated nodes and weights.
# similar function setup as the inner function
f_gammai = function(gammai, yi, xi = cbind(rep(1,5), 1:5), beta, s2gamma) {
## create vector holding likelihood with respect to each gamma_i
val = rep(NA, length(gammai))
## calculate product with respect to each index of gamma
for (i in 1:length(val)) {
# calculate lambda: x_{ij} %*% beta + gammai[i]
# 5 x 1 vector pertaining to each month's measurement
lambda = exp(xi %*% beta + gammai[i])
# calculate f(y_ij | lambda_ij) for j = 1,...,5
# 5 x 1 vector
val0 = dpois(x = yi, lambda = lambda)
# calculate the product of f(y_ij | lambda_ij) over
# j = 1,...,5 times phi(gamma_i | s2_gamma)
# same as before, except now we divide by exp(-gammai^2)
val[i] = prod(val0) * dnorm(x = gammai[i],
mean = 0,
sd = sqrt(s2gamma)) / exp(-gammai[i] ^ 2)
}
## return value
return(val)
}
Now that we have that out of the way, let’s approximate the integral for the first subject
## get the nodes for M = 5
nodes_over_gamma1 = gh$nodes
## get the weights associated with each node for M = 5
w = gh$weights
## now approximate the integral using GH quadrature
val = sum(w *
f_gammai(
gammai = nodes_over_gamma1,
yi = y1,
beta = beta,
s2gamma = s2gamma
)
)
## print the value
print(val)
## [1] 6.132906e-06
Hmm, so our naive approach, where we just divided and multiplied by \(e^{-\gamma_1^2}\), didn’t work too well. If we take a look at the nodes, and then look at the original figure plotting the inner function from earlier in this lecture, the nodes seem to fall mostly outside the region where most of the mass under the curve lies:
print(nodes_over_gamma1)
## [1] -2.020183e+00 -9.585725e-01 2.402579e-16 9.585725e-01 2.020183e+00
## set plotting range over gamma1
xv = seq(range[1], range[2], length.out = 1000)
## now plot
plot(
xv,
f_gammai(
gammai = xv,
yi = y1,
beta = beta,
s2gamma = s2gamma
),
ylab = "inner(gamma1)",
xlab = "gamma1",
type = 'l',
xlim = c(-2.5, 2.5)
)
abline(v = nodes_over_gamma1, lty = 2, col = "grey")
Actually, almost all of the quadrature nodes fall outside this range, since in this example we chose \(\sigma^2_{\gamma} = 0.000225\).
Let’s try a change of variable instead, and show how this improves on the prior naive approach. Later on, we will consider a more “automatic” way of performing the following procedure to reduce the uncertainty of how best to approximate this integral using GQ.
For example, if say for some \(v\), \(\gamma_1 = h(v)\), then we know that
\[ \int_a^bf(\gamma_1)d\gamma_1 = \int_{h^{-1}(a)}^{h^{-1}(b)} f(h(v))h'(v)dv \]
Here, we want to use \(v = \frac{\gamma_1}{\sigma_{\gamma}\sqrt{2}}\), where then \(\gamma_1 = h(v) = \sigma_{\gamma}\sqrt{2}v\). More generally, if say \(\gamma_1 \sim N(\mu,\sigma^2)\) then we perform the transformation \(v = \frac{\gamma_1 -\mu}{\sigma\sqrt{2}}\) where \(\gamma_1 = \sigma\sqrt{2}v + \mu\).
With this, we have that
\[\begin{align} \int \left(\prod_{j = 1}^{5}f(y_{1j} | \lambda_{1j})\right) \times \frac{1}{\sqrt{2\pi\sigma_{\gamma}^2}}e^{-\frac{\gamma_1^2}{2\sigma_{\gamma}^2}} d\gamma_1 &= \int \left(\prod_{j = 1}^{5}f(y_{1j} | \lambda_{1j})\right) \frac{1}{\sqrt{2\pi\sigma_{\gamma}^2}}e^{-v^2}h'(v)dv\\ &=\int \left(\prod_{j = 1}^{5}f(y_{1j} | \lambda_{1j})\right)\frac{1}{\sqrt{2\pi\sigma_{\gamma}^2}} e^{-v^2}\sigma_{\gamma}\sqrt{2}dv\\ &=\int \left(\prod_{j = 1}^{5}f(y_{1j} | \lambda_{1j})\right)\frac{1}{\sqrt{\pi}} e^{-v^2}dv\\ &\approx \sum_{k = 1}^M \frac{w_k}{\sqrt{\pi}}\prod_{j = 1}^{5}f(y_{1j} | \lambda_{1jk}) \end{align}\]
where now \(\log(\lambda_{1jk}) = \boldsymbol{x}_{1j}^T\boldsymbol{\beta} + \sigma_{\gamma}\sqrt{2}v_k\) since \(\gamma_1 = h(v) = \sigma_{\gamma}\sqrt{2}v\). Let’s try it out:
# replacing gammai with h(v) = \sigma_{\gamma}\sqrt{2}v
f_v = function(v, yi, xi = cbind(rep(1,5), 1:5),beta, s2gamma){
## create empty vector
val = rep(NA, length(v))
## calculate product over multiple potential values of gammai
for(i in 1:length(val)){
# calculate lambda
lambda = exp(xi %*% beta + sqrt(2*s2gamma)*v[i])
# calculate f(y_ij | lambda_ij) for j = 1,...,5
val0 = dpois(x = yi, lambda = lambda)
# take the product over the vector, then divide by sqrt(pi)
val[i] = prod(val0)/(sqrt(pi))
}
return(val)
}
Now evaluate
## set node number
M = 5
## get the nodes for M = 5
gh = gauss.quad(n = M, kind = "hermite")
nodes_over_gamma1 = gh$nodes
## get the weights associated with each node for m = 5
w = gh$weights
## now approximate the integral using GH quadrature
val = sum(w * f_v(
v = nodes_over_gamma1,
yi = y1,
beta = beta,
s2gamma = s2gamma
))
## print the value
print(val)
## [1] 2.545052e-07
Looks much closer to what we obtained earlier, and we did so with only 5 nodes! Let’s loop over 1 to 10 nodes instead:
for(i in 1:10){
# set node number
M = i
# get the nodes
gh = gauss.quad(n = M, kind="hermite")
nodes_over_gamma1 = gh$nodes
# get the weights associated with each node for m = 5
w = gh$weights
# now approximate the integral using GH quadrature
val = sum(w*f_v(v = nodes_over_gamma1, yi = y1, beta = beta, s2gamma = s2gamma))
cat(sprintf("M=%d value=%.10e\n", M, val))
}
## M=1 value=2.4393483972e-07
## M=2 value=2.5444650351e-07
## M=3 value=2.5450570814e-07
## M=4 value=2.5450524438e-07
## M=5 value=2.5450524373e-07
## M=6 value=2.5450524375e-07
## M=7 value=2.5450524375e-07
## M=8 value=2.5450524375e-07
## M=9 value=2.5450524375e-07
## M=10 value=2.5450524375e-07
We can see that we achieve very high accuracy with just a small number of nodes and with no restriction on the range of integration (using Gauss-Hermite Quadrature).
One reason why we see a slightly higher value for the integral approximation in this application relative to the prior approaches is because we are now integrating over the real number line, where the others were limited to a particular range.
There are some limitations to this approach, which we address later on during the discussion of adaptive Gaussian Quadrature. We will show that we can improve on this accuracy by using such an approach.
As mentioned earlier, now that we have a means to integrate the likelihood, we also have the ability to maximize the likelihood or log-likelihood via NR, BFGS, etc. The only difference here is that we will also have to approximate the integral in all subsequent derivatives. We leave this exercise for a homework problem.
So far we have talked about several approaches that approximate the area under the curve based upon a set of fixed points. Gaussian Quadrature-based integration methods offer greater computational efficiency and higher accuracy than the Newton-Cotes methods, but take a little bit more work to set up.
However, both approaches will become computationally expensive as the dimension of the integral in question increases. For example, if we assumed in our current example that we have a random intercept and a random slope (with respect to month), the dimension of the random effect increases from 1 to 2, and now we have to integrate out both to obtain the likelihood. We will get to an example of how these methods fare with multidimensional integrals at the end of this section.
An alternative approach to these methods is Monte Carlo integration (MCI), which scales better as the dimension of the integral increases (at other costs).
Here, instead of evaluating the function of interest over a number of nodes at pre-determined positions, we instead randomly sample these points from a probability distribution. That is, we approximate the integral over a set of randomly drawn nodes that are sampled over the support of the function to be integrated. The dimension of the integral is therefore less of an issue here, provided that we have a means to sample nodes from a multivariate PDF instead.
Let’s assume again that we can factor our function of interest and wish to evaluate the following integral:
\[ \int h(\boldsymbol{x})f(\boldsymbol{x})d\boldsymbol{x}\]
If we assume that \(f(\boldsymbol{x})\) is the density for a random variable \(\boldsymbol{x}\), then we can approximate this integral such that:
\[ \int h(\boldsymbol{x})f(\boldsymbol{x})dx \approx \frac{1}{M}\sum_{k = 1}^M h(\boldsymbol{X}_k),\] where \(\boldsymbol{X}_1,\ldots,\boldsymbol{X}_M\) is an iid sample drawn from \(f\). As \(M \rightarrow \infty\), \(\frac{1}{M}\sum_{k = 1}^M h(\boldsymbol{X}_k) \rightarrow \int h(\boldsymbol{x})f(\boldsymbol{x})dx\) by the strong law of large numbers.
This is helpful for integrating joint probability distributions, such as in our GLMM example where \(f(\boldsymbol{x})\) is \(f(\gamma_i)=\phi(\gamma_i| 0,\sigma^2_{\gamma})\), or in calculating expectations of functions of random variables. Here, \(f\) is often called the “target distribution”, the density that we wish to sample from.
Due to the random nature of the way in which the nodes are drawn, some variance in the approximation is expected. This is called the Monte Carlo variance (MCV) of the estimate. For example, if you repeated the integration process, your estimate of the integral would vary from attempt to attempt.
As you can imagine, this variation would generally be larger with smaller values of \(M\) compared to when larger values of \(M\) are used.
If we define \(\hat{\mu}_{MC}\) such that \[\hat{\mu}_{MC} = \frac{1}{M}\sum_{k = 1}^M h(\boldsymbol{X}_k),\] assume that \(\hat{\mu}_{MC} \rightarrow \mu = \int h(\boldsymbol{x})f(\boldsymbol{x})dx\) as \(M \rightarrow \infty\), and \(v(\boldsymbol{x}) = (h(\boldsymbol{x}) - \mu)^2\), then using results from BIOS 760 we know that the sampling variance for \(\hat{\mu}_{MC}\) is given by \(E[v(\boldsymbol{X})/M] = \sigma^2/M\). Here \(\mu\) is the expectation of \(h(\boldsymbol{X})\), where \(\boldsymbol{X}\) is a random variable with density \(f\).
We can estimate \(\sigma^2\) using the following expression, which is similar to the sample variance that you have seen in your other classes: \[\hat{\sigma}^2 = \frac{1}{M-1} \sum_{k = 1}^M [h(\boldsymbol{X}_k) - \hat{\mu}_{MC}]^2,\] so that the estimated variance of the MC estimator is \(\widehat{var}(\hat{\mu}_{MC}) = \hat{\sigma}^2/M\).
If we assume that \(\sigma^2\) exists, then we know that the asymptotic distribution for \(\hat{\mu}_{MC}\) is approximately normal, so we can derive approximate confidence bounds for \(\mu\) and perform statistical inference later.
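For instance, a minimal sketch of approximate 95% confidence bounds for an MC estimate, here for the mean of a \(Po(5)\) random variable (previewing the example below); the object names are ours:
## draw MC samples and form an approximate 95% interval for E[Y]
set.seed(123)
m_mc = 1000
y_mc = rpois(m_mc, lambda = 5)
mu_hat = mean(y_mc)
se_hat = sqrt(var(y_mc) / m_mc)
## normal-approximation interval based on the estimated MC variance
mu_hat + c(-1.96, 1.96) * se_hat
The resulting interval should be centered near the true mean of 5.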
In later lectures, we will talk about situations where we cannot exactly sample from the target distribution in question (common in Bayesian applications). In other situations, the evaluation of the target function may alternatively be very difficult (integration/optimization needed), and therefore we would want to go the simulation route to avoid evaluating \(f\) as much as possible. In either case, we will use simulation strategies to approximately sample from such distributions. Here in this lecture, we will discuss the simpler scenario where we can exactly sample from the known parametric densities in question.
Let’s say that some random variable \(Y \sim Po(5)\) and we want to calculate \(E[Y]\). Normally we would just use the fact that the expected value of a random variable distributed \(Po(\lambda)\) is simply \(\lambda\), or evaluate the expectation analytically. However, we can also obtain this through MCI.
Here we wish to compute \[E[Y] = \int yf(y)dy.\] Using what we have already discussed, we know that we can approximate the expected value by generating \(M\) samples \(Y_1,\ldots,Y_M\) from the \(Po(5)\) distribution, and then calculating \(\frac{1}{M}\sum_{k=1}^M h(Y_k) = \frac{1}{M}\sum_{k=1}^M Y_k\).
Using the following, we can do this easily:
## set the seed
set.seed(1)
## set the number of MC samples to draw
M = 100
## draw the samples from f(Y)
y = rpois(M, lambda = 5)
## since h(y) = y, we can simply take the mean of y
print(mean(y))
## [1] 5.1
## estimate the MCV
print(var(y) / M)
## [1] 0.03909091
This is pretty intuitive in that we are simply drawing samples from the known distribution for \(Y\) and taking their mean to estimate \(\lambda\), where in this case the true value is \(\lambda = 5\).
We can see that this estimate is not exactly equal to 5 and the MCV in this case is relatively large at 0.0391. So, to increase the accuracy of the integral, let’s bump up \(M\):
## set the number of MC samples to draw
M = 10 ^ 4
## draw the samples
y = rpois(M, lambda = 5)
## since h(y) = y, we can simply take the mean of y
print(mean(y))
## [1] 5.0009
## estimate the MCV
print(var(y) / M)
## [1] 0.000510761
Now we can see that the error is much lower, at the cost of many more samples.
Let us now try calculating higher order moments using MCI. If you recall, this amounts to calculating the expected value of \(Y^p\). Here we evaluate \(p = 2, 3, 4\):
## set the number of MC samples to draw
M = 10^4
## draw the samples from f(Y)
y = rpois(M, lambda = 5)
## h(y) = y^2
print(mean(y^2))
## [1] 29.9827
## h(y) = y^3
print(mean(y^3))
## [1] 204.7043
## h(y) = y^4
print(mean(y^4))
## [1] 1550.105
We know for example that the second moment of \(Y\) is \(\lambda + \lambda^2 = 5 + 25 = 30\), which is very close to what we get here.
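For reference, the exact third and fourth raw moments of a \(Po(5)\) random variable are \(\lambda^3 + 3\lambda^2 + \lambda = 205\) and \(\lambda^4 + 6\lambda^3 + 7\lambda^2 + \lambda = 1555\), both close to the MC estimates above.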
We can also use MCI to estimate the area of a circle with radius \(r = 0.5\). Let’s say that we did not know the value of \(\pi\), or that the area of a circle is \(\pi r^2\). How do we calculate the area of the circle in this situation?
Let’s reformulate the problem in the following manner. First, let’s assume that the circle sits perfectly inside of a square that we know how to calculate the area of, where \(A_{square} = (2 \times r)^2 = 1\). Lets also assume that the center of the circle is located at point (0,0) in a two dimensional coordinate plane. Given this, we can write the following: \[A_{circle} = A_{square} \left[\frac{A_{circle}}{A_{square}}\right] = 1 \times p_{A_{circle}}\]
where \(p_{A_{circle}}\) is the proportion of the square’s area that falls within the circle.
If we randomly select a point \(\boldsymbol{z} = (z_x,z_y)\) over the two-dimensional support of the square, where \(z_x\) and \(z_y \sim U(-1/2,1/2)\), then we can rewrite this problem in a probabilistic framework as the following:
\[\begin{align} A_{circle} &= 1 \times p(\boldsymbol{z} \in circle)\\ &= 1 \times E[I[\boldsymbol{z} \in circle]]\\ &= 1\int I[\boldsymbol{z} \in circle]d\boldsymbol{z}\\ &\approx \frac{1}{M}\sum_{k=1}^M I[\boldsymbol{Z_k} \in circle] \end{align}\]
where \(\boldsymbol{Z}_k = (Z_{x,k},Z_{y,k})\), \(Z_{x,k}\) and \(Z_{y,k} \sim U(-1/2,1/2)\), and \(k =1 \ldots M\).
Let’s implement this below:
## Set number of MC samples to draw
M = 10 ^ 7
## set the radius
r = 0.5
## Draw x and y coordinates from uniform distribution over square with side = 2r
zx = runif(M,-r, r)
zy = runif(M,-r, r)
## indicator of whether each point falls inside the circle of radius r
inside = zx ^ 2 + zy ^ 2 < r ^ 2
## calculate area
A_square = (2 * r)^2
print(mean(inside) * A_square)
## [1] 0.7854093
which is similar to \(\pi(0.5)^2 = 0.78539\). The area of the circle here is simply the area of the square \(\times\) the proportion of points falling into the circle, which is our MC estimator for \(1\times p(\boldsymbol{z} \in circle) = E[I[\boldsymbol{z} \in circle]]\). We can visualize this below:
## color points inside circle red
# create vector of length M,
col = rep("black", M)
# set points inside to red
col[inside] = "red"
## plot circle with points
# just use a subset of points to save plotting time
subset = 1:1000
# now plot
plot(zx[subset],
zy[subset],
asp = 1,
xlim = c(-r, r),
col = col[subset])
So, in other words, if we have a known quantity, such as the area of the surrounding square, and have a means to estimate the proportion of that area belonging to your feature of interest, then you can estimate the area of any arbitrary shape within the square. This is similar to randomly throwing darts at a board, and simply computing the proportion of times you hit the region of interest.
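Since the indicator draws are iid Bernoulli, we can also attach a Monte Carlo standard error to the area estimate above using the MCV formula from earlier; a small sketch reusing the objects already defined:
## MC standard error of the area estimate (inside is a vector of 0/1 indicators)
print(sqrt(var(inside) / M) * A_square)
With \(M = 10^7\) draws this standard error should be on the order of \(10^{-4}\).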
Let’s try to evaluate the integral for the first subject in the model using MCI and then compare the approach to quadrature.
## function h(gammai)
h = function(gammai, y, beta) {
# calculate lambda give gammai
lambda = exp(beta[1] + beta[2] * 1:5 + gammai)
# get poisson PMF given lambda, y
val0 = dpois(x = y, lambda = lambda)
# calculate product
return(prod(val0))
}
## starting values
# choose num draws
M = 1000
# get 1st observations data
y1 = alz$words[alz$subject == 1]
# set beta and s2gamma as before
beta = c(1.804, 0.165)
s2gamma = 0.000225
## draw samples from the target density f(gamma)
gammai = rnorm(M, 0, sd = sqrt(s2gamma))
## evaluate integral
mean(sapply(gammai, h, y = y1, beta = beta))
## [1] 2.550638e-07
Pretty close to what we had before! But, compared to quadrature, we had to use many more draws to get this estimate. Also, the estimate appears to be a little bit off. We could try increasing \(M\), but just to be safe let’s visualize this first:
plot(
xv,
inner(
gammai = xv,
yi = y1,
beta = beta,
s2gamma = s2gamma
),
ylab = "inner(gamma1)",
xlab = "gamma1",
type = 'l'
)
segments(
x0 = gammai,
x1 = gammai,
y0 = 0,
y1 = inner(
gammai = gammai,
yi = y1,
beta = beta,
s2gamma = s2gamma
),
col = "lightblue"
)
abline(v = 0, col = "blue")
Hmm, a few things are off here. In MCI, we are averaging over draws from the target distribution for \(\gamma_1\), which is \(N(0,\sigma_{\gamma}^2)\). But the joint distribution that we are trying to integrate is not centered around 0 with respect to \(\gamma_1\). This is an important distinction: \(\phi(\gamma_1| 0, \sigma^2_{\gamma})\) and \(\left(\prod_{j = 1}^{5}f(y_{1j} | \lambda_{1j})\right)\phi(\gamma_1| 0, \sigma^2_{\gamma})\) are two different functions and are not guaranteed to have the same modes with respect to \(\gamma_1\)!
The practical impact of this can be observed in the above figure. When we draw MC samples from \(\phi(\gamma_1| 0, \sigma^2_{\gamma})\), some portions of the support are not as well covered as others (the right-hand side of the figure, for example). This is because the mode of the integrand is to the right of 0, while our target distribution is centered at 0.
We can get around this problem by simply increasing the number of draws. However, our procedure is less efficient in this situation, because we need relatively more samples to effectively cover the support of \(\gamma_1\) pertaining to where most of the mass under the curve sits.
We can see below that even when we increase the number of samples substantially, we cannot match the accuracy of some of the prior approaches:
## Increasing M to 10 million
M = 10^7
## draw samples from the target density f(gamma)
gammai = rnorm(M, 0, sd = sqrt(s2gamma))
## evaluate integral
mean(sapply(gammai, h, y = y1, beta = beta))
## [1] 2.544802e-07
To address this issue, let’s first circle back and revisit Gaussian quadrature. We will discuss an adaptive version of the approach that can address this issue, improve accuracy, and also provide a general means of applying GQ to many different cases. The problem described above also impacts the performance of GQ in our GLMM application, but there it could be overcome by simply using many more quadrature nodes.
In general, this has implications for being able to “automatically” fit various functions accurately using GQ, and it also allows better generalization to low-to-moderate dimensions using a special case of Adaptive Gaussian Quadrature called the “Laplace approximation”. We will also update the MCI approach above with a similar adaptive approach via importance sampling.
Our first attempt to apply GQ simply refactored the integrand, and we found that most of the quadrature nodes did not overlap the parts of the integrand with the most mass under the curve. With the change of variable approach, the nodes being passed to the integrand in this example were effectively rescaled to \(\sqrt{2\sigma^2_\gamma}v_k\) rather than just \(v_k\). This rescaling helped ensure that the nodes hit the area with most of the mass under the curve (visualized below):
# plot function over xv
plot(
xv,
inner(
gammai = xv,
yi = y1,
beta = beta,
s2gamma = s2gamma
),
ylab = "inner(gamma1)",
xlab = "gamma1",
type = 'l',
xlim = range
)
# plot nodes
gh = gauss.quad(n = 5, kind="hermite")
abline(v = sqrt(2 * s2gamma) * gh$nodes,
lty = 2,
col = "grey")
However, as we saw earlier, the mode of the integrand does not correspond to the mode of \(\phi(x)\) (or \(w(x)\)). The quadrature nodes utilized here are therefore centered around 0, whereas the mode of the integrand is not. This can contribute to a lack of accuracy in the approximated integral. In practice, it is difficult to evaluate how much of a problem this is prior to performing the integration.
For example, in our GLMM application we would have to evaluate this for each subject, as each subject \(i\) has its own \(\gamma_i\) that needs to be integrated out. For the first subject, we see that ignoring this shift in the posterior mode will not have a large impact given enough quadrature points, but for other subjects this may not be the case. The impact is largest when using a small number of quadrature nodes. To get around this issue, we use AGQ.
The basic idea behind Adaptive Gaussian Quadrature (AGQ) is that we take the original integrand and find its mode with respect to the variable to be integrated over (again assuming a univariate integral). We also determine the curvature around the mode by calculating the second derivative of the log integrand and evaluating it at the mode. We then use these two pieces of information to recenter and rescale the set of quadrature nodes that we get from standard software.
In this manner, we are better able to have the nodes cover the regions of high mass in the integrand. We are able to find this mode, oftentimes called the “posterior mode” of the variable to be integrated out, through standard numerical optimization routines.
Let us assume that we would like to use Gauss-Hermite Quadrature where \(w(x) = e^{-x^2}\). Then as before in standard GQ, we can evaluate integrals of the form
\[\int f(x)e^{-x^2}dx \approx \sum_{k = 1}^M A_kf(x_k)\] where the nodes \(x_k\) and associated weights \(A_k\), \(k = 1\ldots M\), are calculated from the Hermite polynomials as discussed earlier. We can trivially re-express the above integral in the following form
\[\int f(t)\phi(t; \mu, \sigma^2)dt\approx \sum_{k = 1}^M A^*_kf(t_k)\]
where \(\phi(t; \mu, \sigma^2)\) is the \(N(\mu, \sigma^2)\) density, \(t_k=\mu + \sqrt{2\sigma^2}x_k\), and \(A^*_k = A_k/\sqrt{\pi}\). You might notice that we performed a change of variable here similar in nature to what was done earlier. Now, we choose \(\mu\) and \(\sigma^2\) so that the new quadrature nodes \(t_1,\ldots,t_M\), a recentered and rescaled version of the original nodes \(x_1,\ldots,x_M\), are placed in the regions of highest mass of the original integrand, say \(g(t)\). Here, we can choose \(\hat{\mu}\) to correspond to the mode of the integrand \(g(t)\), which we can find through the standard optimization routines from our first lecture. For \(\hat{\sigma}^2\), we can calculate \[\hat{\sigma}^2 = \left(-\frac{\partial^2}{\partial t^2}\log(g(t))\rvert_{t=\hat{\mu}}\right)^{-1},\] that is, the inverse of the negative second derivative of the log integrand evaluated at the posterior mode.
Then, we can write the integral of \(g(t)\) in the following form:
\[\begin{align} \int g(t)dt &= \int h(t)\phi(t; \hat{\mu}, \hat{\sigma}^2)dt \\ &\approx \sum_{k = 1}^M A^*_kh(t_k)\\ &=\sum_{k = 1}^M A^*_kh\left(\hat{\mu} + \sqrt{2\hat{\sigma}^2}x_k\right) \end{align}\]
where \(h(t) = \frac{g(t)}{\phi(t; \hat{\mu}, \hat{\sigma}^2)}\). We demonstrate this approach below for the first subject in the Alzheimer’s dataset. Here, \(h(t) = \frac{\prod_{j = 1}^{5}f(y_{1j} | \lambda_{1j})\phi(t| 0, \sigma^2_{\gamma})}{\phi(t; \hat{\mu}, \hat{\sigma}^2)}\), \(\log(\lambda_{1j}) = \boldsymbol{x}_{1j}^T\boldsymbol{\beta} + t\), and \(\hat{\mu}\) and \(\hat{\sigma}^2\) are chosen in the manner described above.
Lets do this below:
## find posterior mode of the integrand wrt to t = gamma_1
# starting value for gamma_1 set to 0
# use Nelder-Mead for simplicity (low-dimensional problem)
library(optimx)
pm = suppressWarnings(
optimx(
par = 0,
fn = function(x, y1, beta, s2gamma) {
inner(
gammai = x,
yi = y1,
beta = beta,
s2gamma = s2gamma
)
},
y1 = y1,
beta = beta,
s2gamma = s2gamma,
method = "Nelder-Mead",
control = list(maximize = T)
)
)
## Maximizing -- use negfn and neggr
print(pm$p1)
## [1] 0.004626465
## now estimate the second derivative of the log integrand at the posterior mode
# you could write out the analytical form and plug in the value of the posterior mode
# for time purposes we instead approximate it numerically using the hessian() function
library(numDeriv)
hess_pm = hessian(
func = inner,
x = pm$p1,
yi = y1,
beta = beta,
s2gamma = s2gamma,
log = T
)
print(sqrt(-1 / hess_pm)) # estimated sd, pretty similar to sqrt(s2gamma)
## [,1]
## [1,] 0.01491397
Now that we have these pieces, let’s write the function that we need to evaluate the integral
# similar setup as before, now passing pm and hess_pm
h_t = function(t, yi, xi = cbind(rep(1,5), 1:5), beta, s2gamma, pm, hess_pm) {
# calculate product over multiple potential values of t
val = rep(NA, length(t))
for (i in 1:length(val)) {
# calculate lambda: x_{ij} %*% beta + t[i]
# 5 x 1 vector pertaining to each month's measurement
lambda = exp(xi %*% beta + t[i])
# calculate f(y_ij | lambda_ij) for j = 1,...,5
# 5 x 1 vector
val0 = dpois(x = yi, lambda = lambda)
# calculate product
# now dividing by N(muhat,sigma2hat)
val[i] = prod(val0) *
dnorm(x = t[i],
mean = 0,
sd = sqrt(s2gamma)) /
dnorm(x = t[i],
mean = pm,
sd = sqrt(-1 / hess_pm))
}
return(val)
}
Now let’s evaluate the integral
## set node number
M = 5
## get the nodes for M = 5
gh = gauss.quad(n = M, kind = "hermite")
nodes_over_gamma1 = gh$nodes
## recenter and rescale the nodes
t = pm$p1 + sqrt(2 * (-1 / as.numeric(hess_pm))) * nodes_over_gamma1
## get the weights associated with each node for M = 5
w = gh$weights / sqrt(pi)
## now approximate the integral using GH quadrature
val = sum(w * h_t(
t = t,
yi = y1,
beta = beta,
s2gamma = s2gamma,
pm = pm$p1,
hess_pm = hess_pm
))
print(val)
## [1] 2.545052e-07
We can visualize the new nodes below:
plot(
xv,
inner(
gammai = xv,
yi = y1,
beta = beta,
s2gamma = s2gamma
),
ylab = "inner(gamma1)",
xlab = "gamma1",
type = 'l',
xlim = range(t) # zoom in around the recentered nodes
)
# plot nodes
abline(v = t, lty = 2, col = "grey")
This looks well centered. Now let’s loop over 1 to 10 nodes instead:
for(i in 1:10){
# set node number
M = i
# get the nodes
gh = gauss.quad(n = M, kind="hermite")
nodes_over_gamma1 = gh$nodes
# recenter and rescale the nodes
t = pm$p1 + sqrt(2*(-1/as.numeric(hess_pm)))*nodes_over_gamma1
# get the weights associated with each node
w = gh$weights/sqrt(pi)
# now approximate the integral using AGQ
val = sum(w*h_t(t = t, yi = y1, beta = beta, s2gamma = s2gamma,pm = pm$p1,hess_pm = hess_pm))
# print value
cat(sprintf("M=%d value=%.10e\n", M, val))
}
## M=1 value=2.5450532272e-07
## M=2 value=2.5450529900e-07
## M=3 value=2.5450524313e-07
## M=4 value=2.5450524375e-07
## M=5 value=2.5450524375e-07
## M=6 value=2.5450524375e-07
## M=7 value=2.5450524375e-07
## M=8 value=2.5450524375e-07
## M=9 value=2.5450524375e-07
## M=10 value=2.5450524375e-07
Here we are able to get pretty good accuracy, converging slightly faster than regular GQ. One advantage over the previous approach is that we are able to center the nodes around the posterior mode of the integrand with respect to \(\gamma_1\).
While this didn’t make a large difference here, in other cases (for example, some of the other subjects in this dataset) we may see a larger shift of the posterior mode away from zero, or curvature around the mode that differs from what \(\sigma^2_{\gamma}\) alone would suggest.
For these reasons, functions that perform “automatic” integration often employ something similar to AGQ.
We can think of the Laplace approximation as AGQ with a single quadrature node (\(M=1\)). As the loop above shows, even one node can achieve relatively good accuracy under the Laplace approximation.
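To make this connection concrete, here is a minimal sketch of the Laplace approximation for the first subject, using the mode and curvature computed above (and again assuming inner() returns the integrand on the original scale). With \(M=1\), the single node sits at \(\hat{\mu}\) with weight 1, and the approximation reduces to \(g(\hat{\mu})\sqrt{2\pi\hat{\sigma}^2}\):
## Laplace approximation for subject 1
sigma2_hat = -1 / as.numeric(hess_pm)
lap = inner(gammai = pm$p1, yi = y1, beta = beta, s2gamma = s2gamma) * sqrt(2 * pi * sigma2_hat)
print(lap) # should closely match the M = 1 value in the loop above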
In addition, for multidimensional integrals of low to moderate dimension, it avoids the explosion in computation time that using AGQ with \(M>1\) would require. For this reason, it is often employed by default in popular software such as glmer() in the lme4 R package. For higher-dimensional integrals (say dimension \(\geq 5\)), alternative approaches such as MCI or importance sampling (discussed below) are preferred computationally.
So far, we have introduced the MC estimator of the integral, \(\hat{\mu}_{MC} = \frac{1}{M}\sum_{k = 1}^M h(\boldsymbol{X}_k)\). However, alternative strategies may be employed to reduce the variance of the MC estimator. A side benefit of these approaches is that one may be able to achieve sufficient accuracy with fewer samples. We will also connect this approach to the AGQ approach discussed above.
The main idea behind importance sampling is that we oversample the regions of \(\boldsymbol{x}\) where most of the mass under \(h(\boldsymbol{x})f(\boldsymbol{x})\) is contained, and then adjust for the fact that we are oversampling in these regions relative to the original target density.
This is helpful in situations where the majority of the mass is concentrated in a small region, or in a region far away from the mean of the target distribution. In such situations, most of the samples drawn will not meaningfully contribute to the overall integral, and the one or two samples that do hit the region of high mass will not allow us to estimate the integral accurately. This also leaves the MC estimator with high variance.
For example, let’s say that we want to estimate the probability that some random variable \(Y\) is greater than 4, where \(Y \sim N(0,1)\). We can estimate this directly in R:
p = 1-pnorm(4)
print(p)
## [1] 3.167124e-05
We can see that this probability is small, as expected. We can also use MCI to obtain this estimate. Here, we are essentially trying to compute \(E[I[Y > 4]] = \int I[Y > 4]f(y)dy = p(Y > 4)\), where \(f\) is the \(N(0,1)\) density. We can approximate this integral as \(\int I[Y > 4]f(y)dy \approx \frac{1}{M}\sum_{k = 1}^{M}I[Y_k > 4]\), where \(Y_k \sim N(0,1)\) for \(k = 1,\ldots,M\). Let’s compute this value using \(M = 10{,}000\) samples:
## Num of MC samples
M = 10 ^ 4
## draw from standard normal
y = rnorm(M)
## compute MC estimate of p(Y > 4)
print(mean(y > 4))
## [1] 1e-04
## compute MC variance for p(Y > 4)
print(var(y > 4) / M)
## [1] 1e-08
So, the variance of this estimate is still quite large relative to the size of the probability we are estimating, despite using 10,000 samples. To illustrate this variability, rerun the code above a few times and see how much the estimate varies around the true value.
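For example, a quick sketch (the seed below is arbitrary):
## repeat the M = 10^4 estimate several times to see how much it bounces around
set.seed(1) # arbitrary seed, for reproducibility
print(replicate(8, mean(rnorm(10 ^ 4) > 4)))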
We can also increase \(M\) to 1,000,000 and see if that helps:
## Num of MC samples
M = 10 ^ 6
## draw from standard normal
y = rnorm(M)
## compute MC estimate of p(Y > 4)
print(mean(y > 4))
## [1] 3.1e-05
## compute MC variance for p(Y > 4)
print(var(y > 4) / M)
## [1] 3.099907e-11
It does improve, but at the cost of many more samples. So, how do we improve the efficiency of this approach and reduce the variance of the MC estimator?
To reduce the variance of the estimator, we can instead sample from a different distribution \(g(\boldsymbol{x})\) that oversamples the portion of the space where most of the mass under \(h(\boldsymbol{x})f(\boldsymbol{x})\) is located, a region that typically has low probability of receiving samples from \(f(\boldsymbol{x})\).
By turning what was a rare event into a common one, we reduce the MC variance and improve the efficiency with which we estimate \(\int h(\boldsymbol{x})f(\boldsymbol{x})d\boldsymbol{x}\). However, since we are no longer sampling from \(f(\boldsymbol{x})\), we must adjust for this, otherwise we are not estimating the same integral.
Here, we are rewriting the original integral as the following:
\[ \mu = \int h(\boldsymbol{x})f(\boldsymbol{x})d\boldsymbol{x} =\int h(\boldsymbol{x}) \frac{f(\boldsymbol{x})}{g(\boldsymbol{x})}g(\boldsymbol{x})d\boldsymbol{x}\]
where \(g(\boldsymbol{x})\) is the alternative distribution that we are sampling from, sometimes called the “importance sampling function” or “envelope”. Using similar arguments as before, we can estimate \(\mu\) in this case via \[\hat{\mu}_{IS}^* = \frac{1}{M}\sum_{k = 1}^M h(\boldsymbol{X}_k)w^*(\boldsymbol{X}_k),\] where \(w^*(\boldsymbol{X}_k) = \frac{f(\boldsymbol{X}_k)}{g(\boldsymbol{X}_k)}\) can be thought of as weights, and \(\boldsymbol{X}_1,\ldots,\boldsymbol{X}_M\) are an iid sample from \(g\). For this approach to work well, it must be easy to sample from \(g\), and \(f\) must be easy to evaluate, otherwise you are just adding more computational work to evaluate this integral.
We may also use the estimator \[\hat{\mu}_{IS} = \sum_{k = 1}^M h(\boldsymbol{X}_k)w(\boldsymbol{X}_k)\] where \(w(\boldsymbol{X}_k) = \frac{ w^*(\boldsymbol{X}_k)}{\sum_{k = 1}^M w^*(\boldsymbol{X}_k)}\) are standardized weights. We may choose to use standardized weights when \(f\) is only known up to a proportionality constant, as is typical in Bayesian applications (only the kernel of the posterior is known). By a similar rationale as for the standard MC estimator, the IS estimator will likewise converge to the true value of the integral as the number of drawn samples tends to infinity. It can be shown, however, that this estimator is biased, although the bias is typically small.
Some notes regarding the choice of \(g(\boldsymbol{x})\): we want the support of \(g(\boldsymbol{x})\) to cover the support of \(f(\boldsymbol{x})\), and we want \(f(\boldsymbol{x})/g(\boldsymbol{x})\) to be bounded to avoid excess variability in \(\hat{\mu}_{IS}\). The latter can be ensured by choosing a \(g(\boldsymbol{x})\) with heavier tails than \(f(\boldsymbol{x})\) (we will give an example of this later). Otherwise, the importance ratios may become quite large and inflate the variance, which is the opposite of what this approach intends. One simple way of limiting the damage is to choose a \(g(\boldsymbol{x})\) that samples in the region where \(h(\boldsymbol{x})\) is large, so that large importance ratios occurring elsewhere, where \(h(\boldsymbol{x})\) contributes little, matter much less.
Summing up, choosing an importance function \(g\) such that \(|h|f/g\) is approximately constant and has finite variance (the latter condition to avoid poor performance) is often preferable. Monitoring the range of the unstandardized weights in practice can help detect issues in the choice of \(g\).
An example of this is the problem given earlier, where we are trying to estimate the probability that \(Y > 4\), where \(Y\) is standard normal. Here, if we choose \(g(Y) \sim N(4,1)\), we oversample the tail region of interest and undersample the regions that are not of interest, i.e. where \(I[Y > 4] = 0\). We show this below:
## Num of MC samples
M = 10 ^ 6
## draw from importance sampling function N(4,1), instead of N(0,1)
y_imp = rnorm(M, 4, 1)
## compute importance weights
wstar = dnorm(y_imp, 0, 1) / dnorm(y_imp, 4, 1)
## compute MC estimate of p(Y > 4)
print(mean((y_imp > 4) * wstar))
## [1] 3.158884e-05
## compute MC variance for p(Y > 4)
print(var((y_imp > 4) * wstar) / M)
## [1] 4.505807e-15
So, with the same number of samples we are able to achieve much better accuracy (and a much smaller MC variance) than with the original approach.
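For completeness, we can also compute the standardized-weight (self-normalized) estimator \(\hat{\mu}_{IS}\) from earlier using the same draws, and monitor the range of the unstandardized weights as suggested above. A short sketch:
## self-normalized IS estimate using the same draws
w = wstar / sum(wstar)
print(sum((y_imp > 4) * w))
## check the unstandardized weights for extreme values
print(range(wstar))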
One way to measure the quality of a choice of \(g(\boldsymbol{x})\) is by computing the “effective sample size”, defined as \[\hat{M}(g, f) = \frac{M}{1+\hat{var}(w^*(X))},\] where \(w^*(X)\) is the vector of unstandardized importance weights. One way to interpret this is that \(M\) samples drawn from \(g(\boldsymbol{x})\) via importance sampling are equivalent to \(\hat{M}(g, f)\) samples drawn directly from \(f(\boldsymbol{x})\) under the original scheme. This is one way to measure the loss in precision from approximating \(f\) with \(g\), and it is a useful measure for comparing different importance functions. Here we have
# if we drew M samples from g, this would be the equivalent number of samples under directly sampling f
print(M/(1+var(wstar)))
## [1] 118.0328
Using the same posterior mode and variance as in the AGQ example, we can improve the approximation further by simply sampling from the importance sampling function \(\phi(t; \hat{\mu}, \hat{\sigma}^2)\):
## Num of MC samples
M = 10 ^ 3
## draw from importance sampling function
y_imp = rnorm(M, pm$p1, sqrt(-1 / hess_pm))
## compute importance weights
wstar = dnorm(y_imp, 0, sqrt(s2gamma)) / dnorm(y_imp, pm$p1, sqrt(-1 / hess_pm))
## compute MC estimate of integral
print(mean(sapply(y_imp, h, y = y1, beta = beta) * wstar))
## [1] 2.545051e-07
We can see that we improved on the accuracy of the regular MCI approach with only \(M = 10^3\) samples, compared to \(M = 10^7\) earlier! We can also visualize this below:
plot(xv,inner(gammai = xv, yi = y1, beta = beta, s2gamma = s2gamma),
ylab = "inner(gamma1)", xlab = "gamma1", type = 'l')
segments(x0 = y_imp,x1 = y_imp,y0 = 0, y1 = inner(gammai = y_imp, yi = y1, beta = beta, s2gamma = s2gamma), col = "lightblue")
abline(v = pm$p1, col = "blue")
We see more equal coverage of both tails of the integrand.
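We can also compute the MC variance and the effective sample size of this importance sampling estimator, mirroring the earlier calculations (a sketch reusing the same h() as in the estimate above):
## MC variance of the IS estimate and effective sample size
terms = sapply(y_imp, h, y = y1, beta = beta) * wstar
print(var(terms) / M)
print(M / (1 + var(wstar)))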
Monte Carlo Integration (MCI) is quite general and can be applied to many functions. The main drawback is the number of points that may need to be sampled in order to obtain accuracy similar to, say, Gaussian quadrature. In addition, since the nodes are randomly sampled from some distribution, there will be random variability in the estimate of the integral. That is, if one were to repeat the procedure and resample the nodes, the value of the approximation would change. Therefore, one may have to sample more nodes in order to obtain accuracy similar to GQ and also to reduce the variance of the integral approximation.
As a result, one may prefer MCI for complex or non-smooth integrands, or for multidimensional integrals (discussed below).
So, to summarize: quadrature-based methods are highly accurate for smooth, low-dimensional integrands, while MCI trades some of that accuracy for generality, at the price of random variability in the estimate, and scales much better with dimension.
Keep in mind that so far we have primarily dealt with evaluating one-dimensional integrals. If we were dealing with a two-dimensional integral, Newton-Cotes type methods would require a two-dimensional grid of evaluation points, one dimension per variable. The number of function evaluations then grows as the square of the number of intervals per dimension (and exponentially with dimension in general), which results in a dramatic increase in computation time.
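As a toy illustration of this (not tied to the Alzheimer’s example), a product trapezoidal rule over an \(n \times n\) grid requires \((n+1)^2\) function evaluations:
## two-dimensional trapezoidal rule via a product grid (sketch)
trap2d = function(f, ax, bx, ay, by, n = 100) {
x = seq(ax, bx, length.out = n + 1)
y = seq(ay, by, length.out = n + 1)
# composite trapezoid weights in each dimension
wx = c(1, rep(2, n - 1), 1) * (bx - ax) / (2 * n)
wy = c(1, rep(2, n - 1), 1) * (by - ay) / (2 * n)
# f must be vectorized in both arguments; (n + 1)^2 function evaluations
fxy = outer(x, y, f)
as.numeric(t(wx) %*% fxy %*% wy)
}
## toy check: the integral of x*y over the unit square is 1/4
print(trap2d(function(x, y) x * y, 0, 1, 0, 1))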
For Gaussian quadrature (assuming \(M>1\)), we would have to determine the nodes and weights with respect to each variable, and the integral is then approximated with a double summation. That is, in one-dimensional integration, if \(M\) were chosen to be 5, we would have 5 evaluations of the function \(f\); for a two-dimensional integral, this jumps to \(5^2 = 25\) evaluations of \(f\). We again see this explosion in computational time for \(M > 1\).
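Here is a minimal sketch of this product rule for Gauss-Hermite quadrature in two dimensions, using a toy integrand with a known answer (\(E[X_1^2 X_2^2] = 1\) for independent standard normals); note the \(M^2\) node pairs:
## product Gauss-Hermite rule in two dimensions (M^2 = 25 node pairs)
library(statmod)
M = 5
gh = gauss.quad(n = M, kind = "hermite")
nodes2d = expand.grid(k1 = 1:M, k2 = 1:M)
t1 = sqrt(2) * gh$nodes[nodes2d$k1]
t2 = sqrt(2) * gh$nodes[nodes2d$k2]
w2 = (gh$weights[nodes2d$k1] / sqrt(pi)) * (gh$weights[nodes2d$k2] / sqrt(pi))
## toy integrand: E[X1^2 * X2^2] under independent standard normals equals 1
print(sum(w2 * t1 ^ 2 * t2 ^ 2))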
Adaptive quadrature is another approach to get around some of these issues: the intervals are adjusted so that they are placed in the portions of the integrand where more of the mass is located, allowing fewer nodes to achieve similar accuracy. Such approaches are worth considering in general, and especially for multidimensional integrals. However, they do not avoid the significant increase in computation time for \(M > 1\) in general.
Clearly, as the dimension increases, the above approaches become more and more computationally demanding. Software implementations for estimating GLMMs typically use the Laplace approximation (equivalent to AGQ with \(M=1\)) to ease some of the computational burden. The lme4 package in R contains the function glmer() to fit GLMMs and uses some additional tricks to improve accuracy. Its nAGQ argument controls how many quadrature points to use (the default of 1 corresponds to the Laplace approximation), but values greater than 1 are not available for models in which the dimension of the random effects vector exceeds 1.
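As a sketch (assuming the alz data frame read in earlier), the random-intercept Poisson GLMM for these data could be fit as follows; nAGQ = 1 gives the Laplace fit, and larger values switch to adaptive Gauss-Hermite quadrature:
## fit the Poisson GLMM with a random intercept per subject
library(lme4)
fit_laplace = glmer(words ~ month + (1 | subject), data = alz,
family = poisson, nAGQ = 1) # Laplace approximation (default)
fit_agq = glmer(words ~ month + (1 | subject), data = alz,
family = poisson, nAGQ = 25) # adaptive GH quadrature with 25 nodes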
The benefit of using MCI becomes apparent as the dimension of the integral increases. For the trapezoidal rule, for example, the convergence rate in terms of the total number of nodes \(M\) is \(O(M^{-2/d})\), whereas for MCI it is \(O(M^{-1/2})\), independent of the dimension \(d\). For the other methods discussed, a similar degradation of the convergence order is observed as \(d\) increases. The rate for MCI is slower than that of the other approaches in low dimensions, but it does not change with the dimension \(d\), making MCI attractive for problems where \(d \geq 5\), among other benefits.
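As a quick empirical sketch of the \(O(M^{-1/2})\) rate, using the tail probability example from above, increasing \(M\) by a factor of 100 should shrink the standard error of the MC estimate by roughly a factor of 10:
## standard deviation of the MC estimate of p(Y > 4) at two sample sizes
set.seed(3) # arbitrary seed
for (M in c(10 ^ 4, 10 ^ 6)) {
est = replicate(20, mean(rnorm(M) > 4))
cat(sprintf("M = %.0e, sd of estimate = %.2e\n", M, sd(est)))
}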
So far, we have discussed ways to perform MCI when the target distribution is known and can easily be simulated from. Going back to the MCEM example from the prior lecture, the posterior distribution of the missing data, conditional on the observed data and the current parameter estimates, may not be a known or simple distribution. In most cases, we are not able to directly sample from such distributions with standard statistical software. In the next lecture, we will discuss several approaches for sampling from such distributions. Later, we will compare the approaches discussed in this lecture with those in the next to evaluate an MCEM example applied to our Poisson GLMM, as well as another example with a higher-dimensional integral.