BIOS 667 - Lecture 4: Estimation and Inference (Ch. 4)

Fitzmaurice, Laird & Ware (2011) - Applied Longitudinal Data Analysis

Naim Rashid

Lecture Objectives

By the end of this lecture, you will be able to:

Understand the role of covariance structure in longitudinal data modeling
Apply maximum likelihood (ML) and restricted maximum likelihood (REML) estimation
Classify missing data mechanisms (MCAR, MAR, MNAR) and their implications
Compare inference approaches: Wald, likelihood ratio, and score tests
Choose appropriate estimation methods based on research goals

Roadmap

Section	Topic
1	Introduction to Linear Models for Longitudinal Data
2	Maximum Likelihood Estimation
3	Missing Data Mechanisms
4	Statistical Inference
5	Restricted Maximum Likelihood (REML)

Spans 2 class periods. Day 1 covers covariance structure and ML estimation; Day 2 covers missing data and inference (including REML).

Notation reference (so far)

Symbol	Meaning
\(i,\ j\)	subject index \(i = 1,\dots,N\) and occasion/time index \(j = 1,\dots,n_i\)
\(N,\ n_i\)	number of subjects; number of occasions for subject \(i\) (\(n\) when balanced)
\(Y_{ij},\ \mathbf{Y}_i\)	response for subject \(i\) at occasion \(j\) (scalar); response vector \((n_i \times 1)\)
\(t_{ij}\)	measurement time for subject \(i\) at occasion \(j\)
\(X_{ij},\ \mathbf{X}_i\)	covariate row vector \(1 \times p\); design matrix \(n_i \times p\)
\(\boldsymbol\beta\)	fixed-effect (population-average) coefficients
\(\varepsilon_{ij},\ \boldsymbol\varepsilon_i\)	residual error (scalar; vector)
\(\Sigma_i = \operatorname{Cov}(\mathbf{Y}_i)\)	marginal (total) covariance of \(\mathbf{Y}_i\) (decomposes into random-effect + residual parts once random effects appear, Ch. 8)
\(\rho\)	a correlation parameter (e.g. exchangeable, or AR(1) where corr at lag \(k\) is \(\rho^k\))

Bold = vector or matrix. This is a running reference card: later lectures add random-effect and GEE notation.

Notation (this lecture)

This deck adds:

Symbol	Meaning
\(\Sigma_i\)	within-subject covariance of \(\mathbf{Y}_i\) (FLW eq (4.2); pre-Ch. 8 it is just \(\operatorname{Cov}(\mathbf{Y}_i)\))
\(\boldsymbol{\theta}\)	covariance parameters: \(\Sigma_i = \Sigma_i(\boldsymbol{\theta})\)
\(R_i\)	missingness indicator (Part II), not the Ch. 8 residual covariance

Part I (Day 1): Covariance Structure and ML Estimation

Timing cue: this Part is one 75-minute block (covariance structure and ML estimation).

Introduction

From Conceptual to Estimation

Until now: models for longitudinal data were conceptual (mean + covariance)

Starting here: focus on estimation of unknown parameters

Parameter	Description
\(\beta\)	Regression coefficients
\(\boldsymbol{\theta}\)	Covariance parameters
\(\Sigma_i\)	Within-subject covariance matrix

The Framework

Conditional multivariate normal model:

\[ \mathbb{E}(\mathbf{Y}_i \mid \mathbf{X}_i) = \mathbf{X}_i \beta \]

\[ \text{Cov}(\mathbf{Y}_i \mid \mathbf{X}_i) = \Sigma_i = \Sigma_i(\boldsymbol{\theta}) \]

Here \(\Sigma_i\) is the within-subject covariance for subject \(i\) (FLW’s \(\Sigma_i(\boldsymbol\theta)\)), and \(\boldsymbol{\theta}\) collects its covariance parameters.

Interpreting the Covariance Matrix

\(\Sigma_i\) encodes within-subject covariance of repeated measures

Parameterized by a vector \(\boldsymbol{\theta}\) of covariance parameters

Covariance Structure Options

Structure	Parameters (\(q\))	Description
Unstructured (UN)	\(n_i(n_i+1)/2\)	All variances and covariances estimated
Compound Symmetry (CS)	2	One variance + one covariance
AR(1)	2	Correlation decays with time lag
Independence	1	Only variance estimated

Why Covariance Structure Matters

Choice of covariance structure directly impacts:

Aspect	Consequence
Efficiency	Are SEs valid and efficient?
Interpretability	Do correlation patterns make sense?
Feasibility	UN explodes when \(n_i\) (number of occasions) is large

Looking Ahead: Chapters 5-8

Chapters 5-8 will:

Focus on models for the mean and covariance
Explore estimation under different covariance assumptions
Develop tools for inference when \(\Sigma_i\) is known vs estimated

Next up, Lecture 5 (Ch. 5, response profiles) applies this estimation and inference machinery (GLS/ML, REML, the LRT) to the mean response over time.

Discussion Question

What might happen if we assume compound symmetry when the true structure is autoregressive (AR(1))?

Example: Simulating Covariance Structures

set.seed(667)

# Parameters
n_subj <- 25   # enough subjects to SEE constant vs decaying correlation
n_time <- 5
time <- 1:n_time

# Compound symmetry covariance (rho constant)
rho_cs <- 0.8
Sigma_cs <- matrix(rho_cs, n_time, n_time)
diag(Sigma_cs) <- 1

# AR(1) covariance (rho decays with lag)
# row()/col() return the row and column index of each matrix entry, so
# abs(row - col) is the time-lag |j - k| at each cell; rho^lag gives the
# AR(1) correlation that decays as the lag grows.
rho_ar1 <- 0.8
Sigma_ar1 <- rho_ar1 ^ abs(row(Sigma_cs) - col(Sigma_cs))

# Independence covariance (rho is 0)
rho_in <- 0
Sigma_in <- matrix(rho_in, n_time, n_time)
diag(Sigma_in) <- 1

# Simulate data
library(MASS)
sim_cs  <- MASS::mvrnorm(n_subj, mu = rep(0, n_time), Sigma = Sigma_cs)
sim_ar1 <- MASS::mvrnorm(n_subj, mu = rep(0, n_time), Sigma = Sigma_ar1)
sim_in  <- MASS::mvrnorm(n_subj, mu = rep(0, n_time), Sigma = Sigma_in)

# Put in long format
to_long <- function(mat, type){
  data.frame(id = rep(1:nrow(mat), each = n_time),
             time = rep(time, nrow(mat)),
             y = as.vector(t(mat)),
             cov_type = type)
}
df <- rbind(to_long(sim_cs, "CS"), to_long(sim_ar1, "AR1"), to_long(sim_in, "Indep"))

Example: Simulating Covariance Structures

Visualization: CS vs AR(1) vs Independence

library(ggplot2)
ggplot(df, aes(x = time, y = y, group = id, color = cov_type)) +
  geom_line(alpha = 0.75) +
  facet_wrap(~cov_type) +
  labs(title = "Simulated Trajectories under CS vs AR(1) vs Identity Covariance",
       x = "Time", y = "Response") +
  theme_minimal()

Visualization: CS vs AR(1) vs Independence

Key Takeaways from Simulation

Structure	Pattern
Compound Symmetry	Correlation constant across time; lines look “parallel”
AR(1)	Correlation decays with lag; trajectories diverge over time
Independence	No correlation; erratic patterns

If mis-specified, standard errors are incorrect and inference is invalid (point estimates stay consistent)

Check Your Understanding: Covariance Structures

If you have 8 time points and use an unstructured covariance, how many covariance parameters must be estimated?
When would AR(1) be preferred over compound symmetry?
What happens to standard errors if you assume independence when observations are actually correlated?

Answers:

With \(n_i = 8\) occasions, \(n_i(n_i+1)/2 = 8(9)/2 = 36\) parameters (8 variances + 28 covariances)
AR(1) is preferred when correlation decays with time lag (e.g., measurements closer in time are more correlated)
Standard errors will typically be underestimated, leading to inflated Type I error rates (false positives)

Maximum Likelihood Estimation

ML Estimation Goal

Goal: Jointly estimate both:

Regression coefficients \(\beta\)
Covariance parameters \(\boldsymbol{\theta}\)

The Log-Likelihood

Under the multivariate normal model:

\[ \ell(\beta, \boldsymbol{\theta}) = -\tfrac{1}{2} \sum_{i=1}^N \Big[ \log |\Sigma_i| + (\mathbf{Y}_i - \mathbf{X}_i \beta)^\top \Sigma_i^{-1} (\mathbf{Y}_i - \mathbf{X}_i \beta) \Big] \]

Case 1: Independent Observations

When responses are independent (e.g., cross-sectional data):

\[\Sigma_i = \sigma^2 I\]

Log-likelihood simplifies to OLS framework:

\[ \hat{\beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \]

Visualization: Independent Case

set.seed(667)
n <- 50
x <- rnorm(n)
y <- 1 + 2*x + rnorm(n, sd = 1)
fit <- lm(y ~ x)
plot(x, y, pch = 19, col = "steelblue")
abline(fit, col = "darkred", lwd = 2)

Visualization: Independent Case

Independence: Key Point

With independence:

OLS line = ML estimate
All observations contribute equally

Case 2: Non-Independent Observations

For longitudinal data: responses within subject \(i\) are correlated

\(\Sigma_i\) not proportional to identity matrix
Need to model covariance explicitly (e.g., CS, AR(1), UN)
ML must jointly estimate \(\beta\) and \(\boldsymbol{\theta}\)

Key Result: GLS/ML Estimator of \(\beta\)

For a known (or estimated) covariance \(\Sigma_i\), the ML estimator of \(\beta\) is the generalized least squares (GLS) form:

\[ \hat{\beta} = \left( \sum_{i=1}^N \mathbf{X}_i^\top \Sigma_i^{-1} \mathbf{X}_i \right)^{-1} \left( \sum_{i=1}^N \mathbf{X}_i^\top \Sigma_i^{-1} \mathbf{Y}_i \right) \]

Setting \(\Sigma_i = \sigma^2 I\) (independence) drops \(\Sigma_i^{-1}\) to a constant and recovers the OLS form \(\hat{\beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}\).

Implications of Non-Independence

If correlation is ignored:

Aspect	Consequence
\(\hat{\beta}\)	Still unbiased (under correct mean model)
Standard errors	Incorrect - invalid inference

Proper ML uses \(\Sigma_i^{-1}\) weighting to account for correlation

Example: CS vs AR(1) Model Comparison

library(nlme)
data(Orthodont, package = "nlme")
Orthodont <- data.frame(Orthodont)

# Fit ML with compound symmetry vs AR1
fit_cs  <- gls(distance ~ age,
               correlation = corCompSymm(value = 0.5, form = ~1|Subject),
               data = Orthodont, method = "ML")

fit_ar1 <- gls(distance ~ age,
               correlation = corAR1(value = 0.5, form = ~age|Subject),
               data = Orthodont, method = "ML")

AIC(fit_cs, fit_ar1)

Example: CS vs AR(1) Model Comparison

        df      AIC
fit_cs   4 451.3895
fit_ar1  4 462.2225

Model Comparison: Interpretation

Compare model fits using AIC
AR(1) often better when correlation decays with time
Demonstrates sensitivity of ML to covariance choice

In plain language: When you see an AIC comparison, the model with the lower AIC provides a better balance of fit and parsimony. If AR(1) has lower AIC than CS, this suggests that correlations between measurements do indeed decay as time between measurements increases - a common pattern in longitudinal data.

SAS Equivalent: PROC MIXED (Reference Only)

The same CS vs AR(1) comparison in SAS uses PROC MIXED with a REPEATED statement:

/* Compound symmetry */
proc mixed data=orthodont method=reml;
  class subject;
  model distance = age / solution;
  repeated / type=cs subject=subject;
run;

/* First-order autoregressive */
proc mixed data=orthodont method=reml;
  class subject;
  model distance = age / solution;
  repeated / type=ar(1) subject=subject;
run;

Note

R is the runnable language for this course (used in homework); SAS is shown for reference only. Compare fits across the two runs via the AIC reported in each PROC MIXED output.

Why ML Matters

ML provides a unified framework for:

Task	Application
Estimation	Regression parameters
Estimation	Covariance structure
Testing	Competing models (LRT, AIC)

Caveat: Can be biased for variance components in small samples - motivates REML

Discussion Prompt

Suppose you analyze repeated measures with independence assumed, but the true covariance is AR(1).

What happens to \(\hat{\beta}\)?
What happens to its standard errors?
How might conclusions differ?

Check Your Understanding: Maximum Likelihood

What two sets of parameters does ML simultaneously estimate in longitudinal models?
If you fit a model assuming compound symmetry but the true structure is AR(1), will \(\hat{\beta}\) be biased?
Why might we use AIC to compare different covariance structures?

Answers:

Regression coefficients (\(\beta\)) and covariance parameters (\(\boldsymbol{\theta}\))
No, \(\hat{\beta}\) remains unbiased under a correctly specified mean model. However, standard errors will be incorrect.
AIC balances model fit against complexity, allowing fair comparison of non-nested covariance structures (e.g., CS vs AR(1))

Part II (Day 2): Missing Data and Inference

Day 2 Roadmap

Day 1 covered covariance structure and ML estimation. Today:

Missing data mechanisms (MCAR, MAR, MNAR)
Statistical inference (Wald, LRT, score tests)
Restricted maximum likelihood (REML)

Missing Data Mechanisms

The Missing Data Problem

Missing data is ubiquitous in longitudinal studies

The mechanism driving missingness is critical to:

Validity of parameter estimates
Interpretation of results

Rubin’s Classification

Mechanism	Abbreviation	Description
Missing Completely at Random	MCAR	Missingness independent of data
Missing at Random	MAR	Missingness depends on observed data
Missing Not at Random	MNAR	Missingness depends on unobserved data

Core question: Can we ignore the missingness mechanism?

Note

FLW Ch. 4 (Section 4.3) reviews only MCAR and MAR plus the likelihood-based handling of MAR. MNAR (and the selection / pattern-mixture models, multiple imputation, and inverse probability weighting introduced below) is a forward look to FLW Ch. 17-18 (course L17-L18), not Ch. 4.

MCAR: Missing Completely at Random

Probability of missingness does not depend on data (neither observed nor unobserved)

Let \(R_i\) be the missingness indicator for subject \(i\) (which responses are observed vs missing). Note this \(R_i\) is the missing-data indicator and is distinct from the residual-covariance matrix \(R_i\) in the master notation.

\[ P(R_i \mid Y_i, X_i) = P(R_i) \]

Example: Lab sample lost due to shipping error

Implication: Complete case analyses remain unbiased, but less efficient

Rare in practice, but a useful baseline

MAR: Missing at Random

Probability of missingness depends only on observed data

Split \(Y_i\) into its observed and missing parts: \(Y_{i,\text{obs}}\) is the responses actually recorded for subject \(i\), and \(Y_{i,\text{mis}}\) is the responses that are missing.

\[ P(R_i \mid Y_i, X_i) = P(R_i \mid Y_{i,\text{obs}}, X_i) \]

Example: Dropout depends on age or baseline health status (observed)

Implication: Valid inference possible if model conditions on observed predictors

MAR: Methods for Valid Inference

Method	Description
Likelihood-based	Model includes predictors of missingness
Multiple imputation	Impute missing values under MAR assumption
Inverse probability weighting	Weight observations by probability of being observed

Note

FLW Ch. 4 (Section 4.3) covers only MCAR/MAR and the likelihood-based handling of MAR (first row). Multiple imputation and inverse probability weighting are a forward look to FLW Ch. 17-18 (course L17-L18).

MAR: A Critical Caveat

MAR does not mean automatically safe

Valid inference requires that the variables driving missingness are part of your analysis (or imputation) model

Otherwise: bias creeps in, even if the true mechanism is MAR

MAR Examples: Correct vs Incorrect

Scenario	Modeling Approach	Validity
Dropout depends on gender	Gender included as covariate	Valid
Dropout depends on baseline depression	Depression omitted from model	Biased

Takeaway: Always include predictors of missingness in your model if they are observed

MNAR: Missing Not at Random

Note

Beyond FLW Ch. 4: MNAR and its models are a forward look to FLW Ch. 17-18 (course L17-L18). FLW Ch. 4 covers only MCAR/MAR.

Probability of missingness depends on unobserved data

\[ P(R_i \mid Y_i, X_i) = P(R_i \mid Y_{i,\text{mis}}, Y_{i,\text{obs}}, X_i) \]

Example: Patients with worsening symptoms more likely to drop out

MNAR: Implications

Ignorability fails; specialized models needed:

Model Type	Approach
Selection models	Model missingness mechanism jointly
Pattern-mixture models	Stratify by missingness patterns

Assumptions must be explicit and untestable

Note

Selection and pattern-mixture models are a forward look to FLW Ch. 17-18 (course L17-L18), not FLW Ch. 4.

Visual Illustration of Mechanisms

library(ggplot2)
set.seed(667)

n <- 200
x <- rnorm(n)
y <- 0.5 + 1.2*x + rnorm(n)

# Create missingness
mar  <- ifelse(x > 0.5, NA, y)           # MAR: depends on observed X
mcar <- ifelse(runif(n) < 0.3, NA, y)    # MCAR: random
mnar <- ifelse(y > 1, NA, y)             # MNAR: depends on unobserved Y

df <- data.frame(x, y, mcar, mar, mnar)
df_long <- tidyr::pivot_longer(df, cols = c(y, mcar, mar, mnar),
                               names_to = "Type", values_to = "Value")

ggplot(df_long, aes(x, Value, color = Type)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~Type) +
  theme_minimal() +
  labs(title = "Illustrating MCAR, MAR, and MNAR")

Visual Illustration of Mechanisms

Consequences of Ignoring Mechanisms

Mechanism	Consequence if Ignored
MCAR	Unbiased estimates but less power
MAR	Valid if appropriately modeled
MNAR	Bias if treated as MAR (biggest risk)

Incorrect assumption can change clinical conclusions

Discussion Prompt

Suppose a trial of a new cancer therapy has 20% dropout:

Healthier patients drop due to travel burden

Sicker patients drop due to toxicity

Which mechanism is most plausible?
How would this affect analysis?

Practical Takeaways for Missing Data

Action	Recommendation
Ask	Why is data missing?
Document	Dropout patterns carefully
Analyze	Perform sensitivity analyses for MNAR scenarios
Default	Start with MAR, but assess robustness

Check Your Understanding: Missing Data

A patient drops out because they felt better and no longer wanted to come in for visits. Is this MCAR, MAR, or MNAR?
Under MAR, what must be true for likelihood-based methods to yield valid inference?
Why is MNAR the most problematic mechanism?

Answers:

This is likely MNAR - the dropout depends on the unobserved outcome (how they would have felt at future visits). However, if feeling better was captured by observed variables (e.g., last observed score), it could be MAR.
The variables that predict missingness must be included in the analysis model (e.g., if dropout depends on baseline severity, include baseline severity as a covariate)
MNAR is problematic because the missingness depends on unobserved data, so we cannot verify the mechanism from the observed data alone. Special models are needed and assumptions are untestable.

Statistical Inference

Goals of Inference

After estimation, we want to make inferences about:

Regression coefficients \(\beta\)
Covariance parameters \(\boldsymbol{\theta}\)

Why Longitudinal Inference Differs

Inference in longitudinal models differs from OLS because:

Issue	Consequence
Within-subject correlation	Observations are not independent
Covariance modeling	\(\Sigma_i\) must be estimated

Standard OLS Inference (Review)

In cross-sectional linear regression:

\[\hat{\beta} \sim N(\beta, \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1})\]

Assumes:

Independent, identically distributed errors
Homoscedasticity

Inference with Correlated Data

With correlated data:

\[\text{Var}(\hat{\beta}) = \left( \sum_i \mathbf{X}_i^\top \Sigma_i^{-1} \mathbf{X}_i \right)^{-1}\]

Key difference: Dependence on \(\Sigma_i\)

If \(\Sigma_i\) mis-specified: incorrect standard errors, invalid inference (the point estimate \(\hat{\beta}\) stays consistent)

Three Approaches to Hypothesis Testing

Test	Description
Likelihood Ratio Test (LRT)	Compares nested models via likelihood difference
Wald Test	Based on asymptotic normality of \(\hat{\beta}\)
Score Test	Based on derivative of likelihood at restricted model

Note

Beyond FLW Ch. 4: FLW Ch. 4 (Section 4.4) presents only the Wald test and the LRT. The score test is an instructor extension included here for completeness; it does not appear in FLW Ch. 4.

Likelihood Ratio Tests (LRT)

For testing hypotheses about parameters in nested models:

\[\Lambda = -2(\ell_{\text{restricted}} - \ell_{\text{full}})\]

Asymptotically \(\chi^2\)-distributed

Used for testing fixed effects and comparing covariance structures

LRT: Pros and Cons

Aspect	Details
Pros	More robust than Wald when testing multiple parameters
Cons	Requires both restricted and full model fits
Best for	Fixed-effects comparisons using ML (not REML)

Wald Tests

Based on asymptotic normality of \(\hat{\beta}\):

\[W = (\hat{\beta} - \beta_0)^\top \text{Var}(\hat{\beta})^{-1} (\hat{\beta} - \beta_0)\]

Aspect	Details
Pros	Flexible; can test single parameters or linear combinations
Cons	Relies on SE accuracy; sensitive to covariance misspecification

Score Tests

Based on derivative of likelihood at restricted model

Note

Beyond FLW Ch. 4: the score test is an instructor extension, not part of FLW Ch. 4 (which covers only the Wald test and the LRT).

Aspect	Details
Pros	Requires only restricted fit
Cons	Less commonly applied in practice
Best for	Testing variance components (e.g., random effects)

Boundary caveat for variance components (beyond FLW Ch. 4)

This caveat is correct but is an extension: FLW Ch. 4 (p. 100) recommends against Wald-testing covariance parameters and defers variance-component testing to FLW Ch. 7 (no mixture is given in Ch. 4). The boundary mixture below is the Stram-Lee result.

A variance component being tested under \(H_0\) sits on the boundary of its parameter space (a variance cannot be negative). The usual chi-square reference is then wrong. For testing a single variance component the correct null distribution is a 50:50 mixture, \(\tfrac{1}{2}\chi^2_0 + \tfrac{1}{2}\chi^2_1\), not a naive \(\chi^2_1\). Using the naive distribution is conservative (p-values too large).

Comparison of Inference Approaches

The two tests of FLW Ch. 4 are the LRT and the Wald test. The score test (last row) is an instructor extension, not Ch. 4 content.

Test	Requires Full Model?	Small-Sample Properties	Notes
LRT	Yes	Better for fixed effects	Not valid under REML for fixed effects
Wald	Yes	Sensitive to SE estimation	Flexible
Score (beyond Ch. 4)	No	Can be conservative	Good for covariance terms

When to Use Which Test?

Situation	Recommended Test
Quick, single-parameter, exploratory	Wald
Nested models, multiple fixed effects	LRT
Variance components, unstable full model	Score (beyond FLW Ch. 4)

LRT and Wald are the FLW Ch. 4 tests; the score-test row is the instructor extension.

Satterthwaite Degrees of Freedom

In finite samples, \(\chi^2\) and \(t\) approximations may be too liberal

Satterthwaite adjustment: estimates a fractional degrees of freedom from the variance of the estimated variance components, so the \(t\) reference distribution matches the small-sample behavior of \(\hat{\beta} / \text{SE}(\hat{\beta})\). lmerTest computes it automatically; you do not specify the formula by hand.

Package note

Every model in this deck is fit with nlme::lme, which reports containment degrees of freedom only, not Satterthwaite or Kenward-Roger. To get Satterthwaite (or Kenward-Roger) df you need lme4::lmer with lmerTest, or pbkrtest for Kenward-Roger.

Satterthwaite: Key Features

Feature	Benefit
Fractional df	More accurate Type I error control
Automatic	Available in `lmerTest::lmer()`
Small samples	Critical for validity

Inference Summary

Method	Key Characteristic
Wald	Convenient but SE-sensitive
LRT	Preferred for nested model comparisons
Score (beyond FLW Ch. 4)	Good for variance component testing
Satterthwaite df	Critical for small-sample validity

The FLW Ch. 4 toolkit is the Wald test and the LRT; the score test is an instructor extension.

Restricted Maximum Likelihood (REML)

Motivation for REML

Estimating variance components by ML can be biased

Because ML does not account for loss of degrees of freedom from estimating \(\beta\)

The REML Solution

REML modifies the likelihood to use only error contrasts of the data

This eliminates \(\beta\) from the variance estimation

Mathematical Definition: ML

Suppose \(\mathbf{Y} \sim N(\mathbf{X}\beta, \boldsymbol\Sigma)\)

ML log-likelihood:

\[ \ell(\beta, \boldsymbol{\theta}) = -\tfrac{1}{2}\left\{\log|\boldsymbol\Sigma| + (\mathbf{Y}-\mathbf{X}\beta)^\top \boldsymbol\Sigma^{-1}(\mathbf{Y}-\mathbf{X}\beta) \right\} \]

The OLS Intuition: \(n\) vs \(n-p\)

Start with the familiar OLS case (here \(n\) is the generic OLS sample size, the number of data points, not the longitudinal occasion count):

Method	Variance Estimate
ML	\(\hat{\sigma}^2 = \frac{1}{n}\sum (y_i - \hat{y}_i)^2\) (biased)
REML	\(\hat{\sigma}^2 = \frac{1}{n-p}\sum (y_i - \hat{y}_i)^2\) (unbiased)

Dividing by \(n\) ignores that we spent \(p\) degrees of freedom estimating \(\beta\). The \(-p\) correction is what makes the estimate unbiased. REML carries this same correction into the general covariance setting.

Mathematical Definition: REML

REML log-likelihood integrates out \(\beta\):

\[ \ell_R(\boldsymbol{\theta}) = -\tfrac{1}{2}\left\{ \log|\boldsymbol\Sigma| + \underbrace{\log|\mathbf{X}^\top \boldsymbol\Sigma^{-1}\mathbf{X}|}_{\text{multivariate analogue of the } -p \text{ correction}} + (\mathbf{Y}-\mathbf{X}\hat{\beta})^\top \boldsymbol\Sigma^{-1}(\mathbf{Y}-\mathbf{X}\hat{\beta}) \right\} \]

The \(\log|\mathbf{X}^\top \boldsymbol\Sigma^{-1}\mathbf{X}|\) term is the matrix generalization of subtracting \(p\) from the denominator: it debits the likelihood for the \(p\) fixed effects that were estimated.

Intuition: Why REML Works

Aspect	ML	REML
Treatment of \(\beta\)	Fixed	Integrated out
Variance estimates	Biased downward	Less biased
Equivalent to	N/A	Linear combinations that eliminate \(\beta\)

Advantages of REML

Advantage	Details
Less biased	Especially in small samples
Standard default	`nlme::lme`, `lme4::lmer`
Random effects	Works well for LME models

Disadvantages of REML

Disadvantage	Details
Design dependence	REML likelihood depends on \(\mathbf{X}\)
Model comparison	Cannot compare different fixed effects via REML LRT
Computation	Slightly heavier than ML

ML vs REML: Decision Guide

Use Case	Recommended Method
Comparing models with different fixed effects	ML
Estimating variance components	REML
Small sample sizes	REML

Visualization: Bias in Variance Estimates

set.seed(667)
library(nlme)

# Simulate small dataset
dat <- data.frame(id = rep(1:20, each = 3),
                  time = rep(1:3, times = 20))
# rnorm(20) draws one random intercept per subject; [dat$id] then repeats
# each subject's intercept across that subject's 3 rows (a subject-level effect).
dat$y <- 1 + 0.5*dat$time + rnorm(60, sd = 1) + rnorm(20)[dat$id]

# Fit ML and REML
fit_ml   <- lme(y ~ time, random = ~1|id, data = dat, method = "ML")
fit_reml <- lme(y ~ time, random = ~1|id, data = dat, method = "REML")

c("ML Var(Intercept)"   = VarCorr(fit_ml)[1, "Variance"],
  "REML Var(Intercept)" = VarCorr(fit_reml)[1, "Variance"])

Visualization: Bias in Variance Estimates

  ML Var(Intercept) REML Var(Intercept) 
         "1.369898"          "1.451962"

ML vs REML: Interpreting the Output

In plain language: Notice that REML typically produces a larger variance estimate than ML. This is because REML corrects for the downward bias inherent in ML estimation. In small samples, this difference can be substantial. The REML estimate is analogous to using \(n-p\) instead of \(n\) in the denominator when estimating residual variance in ordinary regression.

REML Takeaways

Point	Summary
Bias correction	REML corrects small-sample bias in variance estimates
Fixed effects	Use ML for fixed effects comparisons
Variance components	Use REML for variance components
Software	Always check software defaults

Check Your Understanding: Inference and REML

When comparing two models with different fixed effects (e.g., with vs. without an interaction term), should you use ML or REML for the likelihood ratio test?
Why does ML tend to underestimate variance components?
In what situation would Satterthwaite degrees of freedom be particularly important?

Answers:

ML - REML likelihoods are not comparable across models with different fixed effects because the error contrasts used depend on the design matrix \(\mathbf{X}\)
ML does not account for the degrees of freedom lost by estimating the fixed effects \(\beta\). It treats \(\beta\) as known when estimating variances.
Small sample sizes (small number of clusters/subjects) - the \(\chi^2\) and standard \(t\) approximations may be too liberal, leading to inflated Type I error rates

Common Mistakes to Avoid

Using REML for fixed effects LRT: Use ML when comparing models with different fixed effects; REML likelihoods are not comparable
Ignoring correlation structure entirely: Even though \(\hat{\beta}\) stays consistent, standard errors are incorrect and inference is invalid - potentially leading to false conclusions
Assuming MAR when dropout depends on unobserved outcomes: If sicker patients drop out due to worsening symptoms (unobserved), treating this as MAR leads to bias
Not including predictors of missingness in your model: Under MAR, validity requires conditioning on observed variables that predict missingness
Using overly complex covariance structures with small samples: Unstructured covariance with \(n_i(n_i+1)/2\) parameters can fail to converge or produce unstable estimates

Lecture Summary

Topic	Key Point
Covariance structure	Choice affects efficiency and inference validity
ML estimation	Joint estimation of \(\beta\) and \(\boldsymbol{\theta}\)
Missing data	MCAR, MAR, MNAR have different implications
Inference	Wald, LRT, Score tests have different strengths
REML	Preferred for variance estimation in small samples