| R1 | R2 | R3 | R4 | ... Rn | Y1 | Y2 | Y3 | Y4 | ... Yn |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | Y1 | Y2 | Y3 | Y4 | Yn |
| 1 | 0 | 1 | 1 | 1 | Y1 | * | Y3 | Y4 | Yn |
| 1 | 1 | 0 | 1 | 1 | Y1 | Y2 | * | Y4 | Yn |
| 1 | 1 | 1 | 0 | 1 | Y1 | Y2 | Y3 | * | Yn |
| 1 | 0 | 0 | 0 | 0 | Y1 | * | * | * | Yn |
| 1 | 0 | 0 | 0 | 0 | Y1 | * | * | * | * |
Fitzmaurice, Laird & Ware (2011) - Applied Longitudinal Data Analysis
By the end of this lecture, you will be able to:
| Symbol | Meaning |
|---|---|
| \(Y_i = (Y_{i1},\dots,Y_{in})'\) | intended (complete) response vector, subject \(i\) |
| \(Y_i^O,\; Y_i^M\) | observed and missing components of \(Y_i\) |
| \(R_i = (R_{i1},\dots,R_{in})'\) | response (missingness) indicator: \(R_{ij}=1\) if \(Y_{ij}\) observed, \(0\) if missing |
| \(X_i\) | covariates (assumed fully observed) |
| \(\Sigma_i\) | marginal covariance of \(Y_i\) (canonical, as in L08-L10) |
| \(V_i\) | GEE working covariance (Ch. 12-13; appears in the GEE estimating equation) |
| \(D_i = k\) | dropout time: subject leaves between occasions \(k-1\) and \(k\) |
Chapter-local override: \(R_i\)
In this chapter \(R_i\) / \(R_{ij}\) denotes the response (missingness) indicator vector (FLW Ch. 17, also Ch. 4), NOT the within-subject residual covariance \(R_i\) from L08-L10. The override is local to L17.
Part I: Introduction & Foundations
Part II: Hierarchy of Missing Data Mechanisms
Part III: Dropout as Monotone Missingness
Part IV: Methods for Handling Dropout
Part V: Practical Guidance
Missing data are the rule, not the exception in longitudinal studies.
Three critical implications:
| Issue | Description |
|---|---|
| Imbalance | Not all individuals have same number of measurements at common occasions |
| Loss of precision | Missing data reduce information; greater missingness leads to decreased precision |
| Potential for bias | Missing data can lead to misleading inferences depending on the mechanism |
The missing data mechanism is the critical issue we must address.
Study Design:
| Feature | Description |
|---|---|
| Population | Children enrolled in grades 1-2 (ages 6-7) |
| Outcome | Annual spirometry measurements |
| Follow-up | Until high school graduation |
| Measurements/child | 1 to 12 (highly variable) |
Main Reason for Missing Data: Moving in/out of school district
| Scenario | Mechanism | Reason for Relocation | Creates Bias? |
|---|---|---|---|
| A (MCAR) | Random | Parent’s job transfer | No |
| B (MAR/MNAR) | Informative | Child’s respiratory problems | Potentially |
Question: How do we determine which scenario applies and handle each appropriately?
Study Design:
| Feature | Description |
|---|---|
| Design | Coronary Risk Factor study in school-age children |
| Cohorts | Ages 5-7, 7-9, 9-11, 11-13, 13-15 |
| Timing | Biennial measurements 1977-1981 |
| Outcome | Obesity status (age-gender-specific weight norms) |
Missing Data Patterns:
| Scenario | Mechanism | Examples |
|---|---|---|
| A (Unrelated) | MCAR | Family relocation, random absence |
| B (Related) | MAR/MNAR | Parents of obese children more/less likely to consent; obese children more likely absent |
Key Insight: Same study, different reasons for missing data leads to different analysis strategies needed.
Golden Rule of Missing Data
When data are missing, we must carefully consider WHY they are missing.
Some types of missing data are relatively benign; others can introduce serious bias.
The missing data mechanism determines:
Consider a study designed to collect \(n\) measurements per subject.
Complete response vector (intended): \[Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{in})'\]
Response indicator vector: \[R_i = (R_{i1}, R_{i2}, \ldots, R_{in})'\]
where \(R_{ij} = 1\) if \(Y_{ij}\) is observed, and \(R_{ij} = 0\) if \(Y_{ij}\) is missing.
Example: For a 6-visit study (visits 0-5), if a subject drops out after visit 2: \[R_i = (1, 1, 1, 0, 0, 0)'\]
Covariates: \(X_i\) (assumed fully observed)
Given the response indicators \(R_i\), partition \(Y_i\) into:
| Component | Description |
|---|---|
| \(Y_i^O\) | Observed components of \(Y_i\) |
| \(Y_i^M\) | Missing components of \(Y_i\) |
Key observation: \(R_i\) is recorded for all individuals and stratifies the population into distinct sub-populations defined by missing data patterns.
Response Indicators
|
Response Vector
|
||||||||
|---|---|---|---|---|---|---|---|---|---|
| R1 | R2 | R3 | R4 | ... Rn | Y1 | Y2 | Y3 | Y4 | ... Yn |
| 1 | 1 | 1 | 1 | 1 | Y1 | Y2 | Y3 | Y4 | Yn |
| 1 | 0 | 1 | 1 | 1 | Y1 | * | Y3 | Y4 | Yn |
| 1 | 1 | 0 | 1 | 1 | Y1 | Y2 | * | Y4 | Yn |
| 1 | 1 | 1 | 0 | 1 | Y1 | Y2 | Y3 | * | Yn |
| 1 | 0 | 0 | 0 | 0 | Y1 | * | * | * | Yn |
| 1 | 0 | 0 | 0 | 0 | Y1 | * | * | * | * |
Each row represents a different missing data pattern. \(R_i\) divides the target population into subpopulations.
Quick Self-Check
Scenario Analysis: In the Six Cities Study, a child’s family moves because the parent got a job promotion. Is this MCAR, MAR, or MNAR?
Notation Practice: A subject has measurements at visits 1, 2, and 3, then drops out. Write their response indicator vector \(R_i\) for a 5-visit study.
Conceptual: Why does missing data potentially lead to bias, not just loss of precision?
Answers
MCAR - The reason for moving (job promotion) is unrelated to the child’s respiratory health outcomes.
\(R_i = (1, 1, 1, 0, 0)'\) - The first three visits are observed (1), the last two are missing (0).
Missing data leads to bias when the observed data are not representative of the full population. If subjects with certain outcome values are more/less likely to be observed, sample statistics will be systematically different from population parameters.
We classify missing data by considering how \(R_i\) is related to \(Y_i\):
| Mechanism | Abbreviation |
|---|---|
| Missing Completely at Random | MCAR |
| Missing at Random | MAR |
| Not Missing at Random | MNAR |
Terminology Warning: The nomenclature is NOT intuitive!
Formal Definition:
\[\Pr(R_i \mid Y_i^O, Y_i^M, X_i) = \Pr(R_i \mid X_i)\]
Intuition: Missingness is unrelated to both observed AND unobserved responses (given covariates).
Bivariate Example: \(Y_i = (Y_{i1}, Y_{i2})'\), \(Y_{i1}\) fully observed, \(Y_{i2}\) sometimes missing.
If \(Y_{i2}\) is MCAR: \[\Pr(R_{i2} = 1 \mid Y_{i1}, Y_{i2}, X_i) = \Pr(R_{i2} = 1 \mid X_i)\]
Probability that \(Y_{i2}\) is missing does NOT depend on \(Y_{i1}\) or \(Y_{i2}\).
Example 1: Rotating Panel Design
Example 2: Six Cities Study (Scenario A)
Subtle but important distinction:
| Type | Definition |
|---|---|
| General MCAR | \(\Pr(R_i \mid Y_i^O, Y_i^M, X_i) = \Pr(R_i \mid X_i)\) (conditional on covariates) |
| Strict MCAR | \(\Pr(R_i \mid Y_i^O, Y_i^M, X_i) = \Pr(R_i)\) (no dependence on anything) |
Key: MCAR requires conditional independence given all relevant covariates in \(X_i\).
Example 1: Clinical trial dropout by treatment arm
Example 2: Side effects
Essential Feature: Observed data are a random sample of complete data.
Consequences:
| Property | Status |
|---|---|
| Distribution of \(Y_i\) same across sub-populations | Yes |
| Distribution of \(Y_i^O\) matches target population | Yes |
| Sample means, variances unbiased | Yes |
| “Completers” are random subsample | Yes |
| Distribution of \(Y_i^M\) same for dropouts and completers | Yes |
Under MCAR, almost any method yields valid inferences:
| Method | Valid? |
|---|---|
| Complete-case analysis | Yes (inefficient) |
| Available-data analysis | Yes |
| GLS / GEE methods | Yes |
| ML / REML likelihood-based methods | Yes |
| Linear mixed models / GLMMs | Yes |
Caveat: Even though methods are unbiased under MCAR, there is still loss of precision (efficiency).
Formal Definition:
\[\Pr(R_i \mid Y_i^O, Y_i^M, X_i) = \Pr(R_i \mid Y_i^O, X_i)\]
Intuition: Missingness depends on observed responses and covariates, but NOT on unobserved responses (given \(Y_i^O\) and \(X_i\)).
Bivariate Example: If \(Y_{i2}\) is MAR: \[\Pr(R_{i2} = 1 \mid Y_{i1}, Y_{i2}, X_i) = \Pr(R_{i2} = 1 \mid Y_{i1}, X_i)\]
Probability that \(Y_{i2}\) is missing depends on observed \(Y_{i1}\), but NOT on unobserved \(Y_{i2}\).
Example 1: Protocol-Driven Removal
Example 2: Six Cities Study (Scenario B - MAR version)
Non-Example (MNAR): If relocation based on unobserved variable (e.g., parent’s assessment of future health)
If subjects are stratified on values of \(Y_i^O\), then within strata, missingness in \(Y_i^M\) is like a chance mechanism.
Visual analogy:
Mathematical consequence: \[\Pr(R_i \mid Y_i^O, Y_i^M, X_i) = \Pr(R_i \mid Y_i^O, X_i)\]
Key Difference from MCAR:
Because missingness now depends on \(Y_i^O\):
| Property | Status |
|---|---|
| Distribution of \(Y_i\) same across sub-populations | No |
| “Completers” are a random sample | No |
| Sample means/variances from available data are unbiased | No |
| Distribution of \(Y_i^M\) given \(Y_i^O\) same as completers | Yes |
Critical Implication: Valid inferences require correctly modeling the joint distribution of \(Y_i\) (both mean AND covariance).
Under MAR, missing values can be validly predicted from observed data.
For multivariate normal responses:
\[E(Y_i^M \mid Y_i^O) = \mu_i^M + \Sigma_i^{MO} (\Sigma_i^O)^{-1} (Y_i^O - \mu_i^O)\]
This is the conditional expectation formula from multivariate normal theory (recall properties of partitioned multivariate normal from earlier lectures on covariance structures).
The prediction formula requires:
Mean vector: \[\mu_i = \begin{pmatrix} \mu_i^O \\ \mu_i^M \end{pmatrix}\]
Covariance matrix: \[\Sigma_i = \begin{pmatrix} \Sigma_i^O & \Sigma_i^{OM} \\ \Sigma_i^{MO} & \Sigma_i^M \end{pmatrix}\]
Critical insight: The off-diagonal block \(\Sigma_i^{MO}\) determines how observed values inform missing values!
Therefore: Under MAR, correct covariance specification is essential for valid inference.
Under MAR, method validity depends on assumptions:
| Method | Valid? | Condition |
|---|---|---|
| Likelihood-based (ML/REML, LMM, GLMM) | Yes | IF joint distribution correctly specified |
| Complete-case analysis | No | Generally biased |
| Standard GEE | No | Biased (uses biased sample moments) |
| GLS (without correct covariance) | No | Biased |
| Weighted GEE | Yes | IF weights correctly specified |
| Multiple Imputation | Yes | IF imputation model correct |
MAR (and MCAR) are often called “ignorable” mechanisms.
What “ignorable” means:
What “ignorable” does NOT mean:
Bottom line: We still must carefully model \(f(Y_i \mid X_i)\), including covariance!
Confusion is common! Let’s clarify with an example:
Scenario: In longitudinal trial, subjects with worsening symptoms more likely to drop out
| Question | Answer |
|---|---|
| Is missingness unrelated to \(Y_i\) at all? | No, so NOT MCAR |
| Is missingness predictable from observed \(Y_i^O\) only? | If yes, then MAR |
| Does missingness depend on unobserved \(Y_i^M\)? | If yes, then MNAR |
Our recommendation:
MAR should be the default assumption for longitudinal data analysis unless:
Reasons:
Formal Definition:
\[\Pr(R_i \mid Y_i^O, Y_i^M, X_i)\]
depends on \(Y_i^M\) (the unobserved values).
Equivalently: MNAR means the MAR assumption is violated: \[\Pr(R_i \mid Y_i^O, Y_i^M, X_i) \neq \Pr(R_i \mid Y_i^O, X_i)\]
Intuition: Missingness related to what the values would have been had they been observed.
Example 1: Quality-of-Life Instruments
Example 2: Muscatine Study (Scenario B)
Example 3: Six Cities Study (MNAR version)
Under MNAR:
| Property | Consequence |
|---|---|
| Distribution of \(Y_i^M\) given \(Y_i^O\) | Differs between dropouts and completers |
| Distribution of \(Y_i^M\) | Depends on \(Y_i^O\) AND \(\Pr(R_i \mid Y_i, X_i)\) |
| Model for missingness mechanism | Cannot be ignored |
Critical consequence: The specific model chosen for \(\Pr(R_i \mid Y_i, X_i)\) drives the results.
Fundamental issue: MNAR assumptions are unverifiable from observed data.
Without external information:
Implication: Under MNAR, sensitivity analysis is essential
Under MNAR, standard methods are all biased:
| Method | Valid? |
|---|---|
| Complete-case analysis | No |
| Available-data / GEE | No |
| ML/REML (without modeling missingness) | No |
| Standard multiple imputation | No |
Required approach: Joint modeling of outcome AND missingness
| Aspect | MCAR | MAR | MNAR |
|---|---|---|---|
| Missingness depends on... | Covariates only | Observed Y + covariates | Unobserved Y |
| Observed data are... | Random sample | NOT random sample | NOT random sample |
| Complete-case valid? | Yes (inefficient) | No (biased) | No (biased) |
| GEE valid? | Yes | No* (need weights) | No (biased) |
| ML/REML valid? | Yes | Yes* (if Sigma correct) | No (unless modeled) |
| Must model missingness? | No | No (ignorable) | Yes (required) |
* = conditional on correct model specification
Quick Self-Check
MCAR vs MAR: A clinical trial subject drops out because their most recent blood pressure reading (which you have recorded) was very high. Is this MCAR, MAR, or MNAR?
The “Ignorable” Trap: A colleague says “The data are MAR, so we can ignore the missing data problem.” What’s wrong with this statement?
Side-by-Side: Complete this table:
| Scenario | Mechanism |
|---|---|
| Equipment malfunction loses data | ? |
| Subjects with improving symptoms stop coming | ? |
| Subjects drop out based on how they feel today (unrecorded) | ? |
Answers
MAR - The dropout depends on an observed value (the recorded blood pressure), not on unobserved future values.
“Ignorable” means we can ignore the missingness MODEL, not the missing data itself. Under MAR, we still must correctly specify the joint distribution of \(Y_i\) (both mean AND covariance) for valid inference. Complete-case analysis is still biased under MAR!
| Scenario | Mechanism |
|---|---|
| Equipment malfunction loses data | MCAR |
| Subjects with improving symptoms stop coming | MAR (if improvement is based on recorded trajectory) |
| Subjects drop out based on how they feel today (unrecorded) | MNAR |
Key Distinction: MCAR vs MAR
MCAR: \(\Pr(R_i \mid Y_i, X_i) = \Pr(R_i \mid X_i)\) - Missingness independent of ALL outcomes
MAR: \(\Pr(R_i \mid Y_i^O, Y_i^M, X_i) = \Pr(R_i \mid Y_i^O, X_i)\) - Missingness independent of UNOBSERVED outcomes, given observed
The difference: Under MAR, dropout CAN depend on observed \(Y\) values; under MCAR, it cannot.
Definition: Dropout refers to the special case where:
\[\text{If } R_{ik} = 0, \text{ then } R_{ik+1} = \cdots = R_{in} = 0\]
Characteristics:
| Feature | Description |
|---|---|
| Pattern | Once out, stays out |
| Type | Monotone missing data |
| Contrast | Intermittent missingness has gaps with returns |
Dropout indicator: \(D_i = k\) means subject drops out between occasions \(k-1\) and \(k\)
library(ggplot2)
library(dplyr)
# Create monotone dropout pattern
set.seed(667)
n_subjects <- 20
n_times <- 5
dropout_data <- tibble(
id = rep(1:n_subjects, each = n_times),
time = rep(1:n_times, n_subjects),
dropout_time = rep(sample(2:(n_times+1), n_subjects, replace = TRUE), each = n_times)
) %>%
mutate(
observed = time < dropout_time,
dropout_group = factor(dropout_time,
levels = 2:(n_times+1),
labels = paste("Dropout after Y", 1:n_times, sep=""))
)
ggplot(dropout_data, aes(x = time, y = reorder(id, dropout_time), fill = observed)) +
geom_tile(aes(alpha = observed), color = "white", linewidth = 0.5) +
geom_tile(data = dropout_data %>% filter(observed),
aes(fill = dropout_group), color = "white", linewidth = 0.5, alpha = 1) +
scale_fill_manual(values = c("FALSE" = "gray85",
"Dropout after Y1" = "#08519c",
"Dropout after Y2" = "#3182bd",
"Dropout after Y3" = "#6baed6",
"Dropout after Y4" = "#9ecae1",
"Dropout after Y5" = "#c6dbef"),
name = "Status") +
scale_alpha_manual(values = c("FALSE" = 1, "TRUE" = 1), guide = "none") +
scale_x_continuous(breaks = 1:n_times, labels = paste0("Y", 1:n_times)) +
scale_y_discrete(breaks = seq(5, 20, 5)) +
labs(x = "Measurement Occasion",
y = "Individual (ordered by dropout time)",
title = "Monotone Missing Data Pattern",
subtitle = "Color gradient shows dropout timing (darker = earlier dropout)") +
theme_minimal(base_size = 14) +
theme(legend.position = "bottom",
panel.grid.minor = element_blank())Notice: More observations at \(Y_j\) than at \(Y_{j+1}\) for all \(j\).
Applying the MCAR/MAR/MNAR hierarchy to dropout:
| Type | Mechanism | Probability Model |
|---|---|---|
| Completely Random | MCAR | \(\Pr(D_i = k \mid D_i \geq k, Y_i, X_i) = \Pr(D_i = k \mid D_i \geq k, X_i)\) |
| Random | MAR | Depends on previously observed outcomes |
| Informative | MNAR | Depends on current/future unobserved outcomes |
Central issue: Do those who drop out differ from those who stay?
| If they… | Mechanism | Consequence |
|---|---|---|
| Do NOT differ | MCAR | Any reasonable method valid |
| DO differ (predictable from \(Y_i^O\)) | MAR | Need likelihood-based or weighted methods |
| DO differ (depends on \(Y_i^M\)) | MNAR | Must model missingness mechanism |
We’ll illustrate the three mechanisms via simulation.
Data Generating Process:
Dropout Model:
\[\text{logit}\{\Pr(D_i = k \mid D_i \geq k, Y_{i,k-1}, Y_{ik})\} = \theta_1 + \theta_2(Y_{i,k-1} - \mu_{k-1}) + \theta_3(Y_{ik} - \mu_k)\]
| Scenario | \(\theta_2\) | \(\theta_3\) | Mechanism |
|---|---|---|---|
| A | 0 | 0 | MCAR |
| B | 1.0 | 0 | MAR |
| C | 0 | 0.8 | MNAR |
Computational Note
The following simulations generate N=1000 subjects x 5 timepoints and fit multiple models. With caching enabled, slides render quickly.
library(MASS)
library(dplyr)
library(ggplot2)
simulate_dropout <- function(N = 1000, n_time = 5,
beta0 = 5, beta1 = 0.25, rho = 0.7,
theta1 = -0.5, theta2 = 0, theta3 = 0) {
# Generate AR(1) covariance
times <- 1:n_time
Sigma <- outer(times, times, function(s, t) rho^abs(s - t))
# Generate complete data
mu <- beta0 + beta1 * times
Y_complete <- mvrnorm(N, mu = mu, Sigma = Sigma)
# Generate dropout
dropout_time <- rep(n_time + 1, N) # Initialize as completers
for (i in 1:N) {
for (k in 2:n_time) {
# Logistic dropout probability
lin_pred <- theta1 +
theta2 * (Y_complete[i, k-1] - mu[k-1]) +
theta3 * (Y_complete[i, k] - mu[k])
prob_dropout <- plogis(lin_pred)
if (runif(1) < prob_dropout) {
dropout_time[i] <- k
break
}
}
}
# Create observed data (set to NA after dropout)
Y_obs <- Y_complete
for (i in 1:N) {
if (dropout_time[i] <= n_time) {
Y_obs[i, dropout_time[i]:n_time] <- NA
}
}
# Return data frame
data.frame(
Y_complete = Y_complete,
Y_obs = Y_obs,
dropout_time = dropout_time,
subject = 1:N
)
}set.seed(667)
beta0 <- 5
beta1 <- 0.25
n_time <- 5
# Simulate complete data (no dropout)
dat_complete <- simulate_dropout(N = 1000, theta2 = 0, theta3 = 0)
# Calculate means at each time
means_complete <- colMeans(dat_complete[, paste0("Y_complete.", 1:n_time)])
# Plot
data.frame(
time = 1:n_time,
mean_obs = means_complete,
true_mean = beta0 + beta1 * (1:n_time)
) %>%
ggplot(aes(x = time)) +
geom_line(aes(y = true_mean), linewidth = 1, color = "black") +
geom_point(aes(y = mean_obs), size = 3, color = "steelblue") +
scale_x_continuous(breaks = 1:n_time) +
labs(x = "Time", y = "Y",
title = "Complete Data: Sample Means vs Population Line",
subtitle = "No missing data - means coincide with population values") +
theme_minimal(base_size = 14)Sample means virtually coincide with population regression line.
About 38% dropout at each occasion (constant hazard).
data.frame(
time = 1:n_time,
mean_obs = means_mcar,
true_mean = beta0 + beta1 * (1:n_time)
) %>%
ggplot(aes(x = time)) +
geom_line(aes(y = true_mean), linewidth = 1, color = "black") +
geom_point(aes(y = mean_obs), size = 3, color = "steelblue") +
scale_x_continuous(breaks = 1:n_time) +
labs(x = "Time", y = "Y",
title = "Scenario A: MCAR Dropout",
subtitle = "Observed means unbiased despite heavy dropout") +
theme_minimal(base_size = 14)Key Observation: Even with 85% missing at time 5, sample means are unbiased!
Dropout depends on previous response: those with high \(Y_{i,k-1}\) more likely to drop out.
data.frame(
time = 1:n_time,
mean_obs = means_mar,
true_mean = beta0 + beta1 * (1:n_time)
) %>%
ggplot(aes(x = time)) +
geom_line(aes(y = true_mean), linewidth = 1, color = "black") +
geom_point(aes(y = mean_obs), size = 3, color = "coral") +
scale_x_continuous(breaks = 1:n_time) +
labs(x = "Time", y = "Y",
title = "Scenario B: MAR Dropout (depends on past Y)",
subtitle = "Observed means BIASED downward - high responders drop out preferentially") +
theme_minimal(base_size = 14)Because those with large \(Y_{i,k-1}\) drop out more, observed means are biased low.
Dropout depends on current (unobserved) response: those with high \(Y_{ik}\) more likely to drop out at time \(k\).
data.frame(
time = 1:n_time,
mean_obs = means_nmar,
true_mean = beta0 + beta1 * (1:n_time)
) %>%
ggplot(aes(x = time)) +
geom_line(aes(y = true_mean), linewidth = 1, color = "black") +
geom_point(aes(y = mean_obs), size = 3, color = "firebrick") +
scale_x_continuous(breaks = 1:n_time) +
labs(x = "Time", y = "Y",
title = "Scenario C: MNAR Dropout (depends on current unobserved Y)",
subtitle = "Observed means STRONGLY BIASED - cannot predict from observed data alone") +
theme_minimal(base_size = 14)Because those with large unobserved \(Y_{ik}\) drop out, bias is substantial.
library(tidyr)
# Calculate sample sizes at each timepoint for annotations
n_obs_mcar <- colSums(!is.na(dat_mcar[, paste0("Y_obs.", 1:n_time)]))
n_obs_mar <- colSums(!is.na(dat_mar[, paste0("Y_obs.", 1:n_time)]))
n_obs_nmar <- colSums(!is.na(dat_nmar[, paste0("Y_obs.", 1:n_time)]))
combined_data <- data.frame(
time = rep(1:n_time, 3),
mean_obs = c(means_mcar, means_mar, means_nmar),
mechanism = rep(c("(a) MCAR", "(b) MAR", "(c) MNAR"), each = n_time),
true_mean = rep(beta0 + beta1 * (1:n_time), 3),
n_obs = c(n_obs_mcar, n_obs_mar, n_obs_nmar)
)
ggplot(combined_data, aes(x = time)) +
geom_line(aes(y = true_mean), linewidth = 0.8, color = "black") +
geom_point(aes(y = mean_obs, color = mechanism), size = 3) +
geom_line(aes(y = mean_obs, color = mechanism), linewidth = 0.6, linetype = "dashed") +
geom_text(aes(label = paste0("n=", n_obs), y = mean_obs),
vjust = -0.8, size = 2.5, color = "gray30") +
scale_color_manual(values = c("(a) MCAR" = "steelblue",
"(b) MAR" = "coral",
"(c) MNAR" = "firebrick")) +
scale_x_continuous(breaks = 1:n_time) +
facet_wrap(~ mechanism, ncol = 3) +
labs(x = "Time", y = "Y",
title = "Comparison of Observed Means Under Different Dropout Mechanisms",
subtitle = "Black line = true population mean; Points = observed sample means (n = sample size)") +
theme_minimal(base_size = 13) +
theme(legend.position = "none",
strip.text = element_text(face = "bold"))Now let’s fit models and compare estimates.
library(nlme)
# Reshape MCAR data to long format
dat_mcar_long <- dat_mcar %>%
dplyr::select(subject, starts_with("Y_obs")) %>%
pivot_longer(cols = starts_with("Y_obs"),
names_to = "time",
values_to = "y") %>%
mutate(time = as.numeric(gsub("Y_obs.", "", time))) %>%
filter(!is.na(y))
# ML fit (with AR(1) covariance)
fit_ml_mcar <- gls(y ~ time, data = dat_mcar_long,
correlation = corAR1(form = ~ time | subject),
method = "REML")
# GEE fit (working independence)
fit_gee_mcar <- geepack::geeglm(y ~ time, data = dat_mcar_long,
id = subject,
corstr = "independence")
# Extract coefficients
ml_mcar <- summary(fit_ml_mcar)$tTable
gee_mcar <- summary(fit_gee_mcar)$coefficients
ML (AR1)
|
GEE (Independence)
|
|||
|---|---|---|---|---|
| Parameter | Estimate | SE | Estimate | SE |
| Intercept | 4.987 | 0.040 | 4.998 | 0.043 |
| time | 0.273 | 0.016 | 0.272 | 0.018 |
Both methods provide unbiased estimates under MCAR.
ML (AR1)
|
GEE (Independence)
|
|||
|---|---|---|---|---|
| Parameter | Estimate | SE | Estimate | SE |
| Intercept | 5.024 | 0.040 | 5.128 | 0.040 |
| time | 0.235 | 0.016 | 0.105 | 0.016 |
Key result: ML remains unbiased (correct covariance), GEE substantially biased.
ML (AR1)
|
GEE (Independence)
|
|||
|---|---|---|---|---|
| Parameter | Estimate | SE | Estimate | SE |
| Intercept | 5.115 | 0.039 | 5.151 | 0.042 |
| time | 0.139 | 0.016 | 0.073 | 0.018 |
Both methods biased under MNAR! Slope estimates far from truth (0.25).
How to Read These Results
When interpreting results from missing data analysis, ask:
1. What mechanism did I assume?
2. What does the slope estimate mean?
3. How do I report uncertainty?
ML (AR1)
|
GEE (Indep)
|
||||
|---|---|---|---|---|---|
| Dropout | Parameter | Estimate | SE | Estimate | SE |
| MCAR: Both unbiased | |||||
| MCAR | Intercept | 4.987 | 0.040 | 4.998 | 0.043 |
| MCAR | time | 0.273 | 0.016 | 0.272 | 0.018 |
| MAR: ML unbiased, GEE biased | |||||
| MAR | Intercept | 5.024 | 0.040 | 5.128 | 0.040 |
| MAR | time | 0.235 | 0.016 | 0.105 | 0.016 |
| MNAR: Both biased | |||||
| MNAR | Intercept | 5.115 | 0.039 | 5.151 | 0.042 |
| MNAR | time | 0.139 | 0.016 | 0.073 | 0.018 |
| Mechanism | Sample Means | ML | GEE |
|---|---|---|---|
| MCAR | Unbiased | Unbiased | Unbiased |
| MAR | Biased | Unbiased (with correct covariance) | Biased |
| MNAR | Strongly biased | Biased | Biased |
The motivating Six Cities study is in our fev1 data: annual lung-function (log FEV1) on children, measured from about age 6 to 19. Missing data here are real, not imposed: children enter and leave the district at different ages.
# Real FLW longitudinal data (Six Cities pulmonary thread); download fallback
fev1_url <- "https://content.sph.harvard.edu/fitzmaur/ala2e/fev1.txt"
fev1_file <- "../../data/fev1.txt"
if (!file.exists(fev1_file)) {
fev1_file <- "data/fev1.txt"
if (!file.exists(fev1_file)) {
dir.create("data", showWarnings = FALSE)
download.file(fev1_url, fev1_file)
}
}
fev1 <- read.table(fev1_file, header = TRUE)
# Real imbalance: how many measurement occasions does each child have?
occ_per_child <- table(fev1$id)
cat("Children:", length(unique(fev1$id)), "\n")Children: 300
Total observations: 1994
Occasions per child (median): 7
Children with all 12 occasions: 25
Children with 3 or fewer: 91
library(nlme)
library(dplyr)
fev1 <- fev1 %>% mutate(age_c = age - 10) # center age at 10 years
# Available-data fit (REML): uses ALL children, including partial records
fit_avail <- gls(logfev1 ~ age_c, data = fev1,
correlation = corCAR1(form = ~ age | id), method = "REML")
# Complete-case: only children with all 12 occasions
cc_ids <- as.integer(names(which(table(fev1$id) == 12)))
fev1_cc <- fev1 %>% filter(id %in% cc_ids)
fit_cc <- gls(logfev1 ~ age_c, data = fev1_cc,
correlation = corCAR1(form = ~ age | id), method = "REML")
avail <- summary(fit_avail)$tTable["age_c", ]
cc <- summary(fit_cc)$tTable["age_c", ]Available-data (REML, all 300 children): slope 0.084 (SE 0.001).
Complete-case (only 25 children): slope 0.079 (SE 0.004).
Reading it
The two slope estimates are close here, but complete-case keeps only 25 of 300 children and its standard error more than doubles (0.004 vs 0.001). When dropout is close to ignorable, the dominant cost of discarding data is lost precision, exactly the MCAR lesson. The available-data fit retains every observed measurement.
Quick Self-Check
Simulation Interpretation: In the MAR simulation (Scenario B), why are the observed means biased downward?
Method Selection: You have MAR data. Which of these methods will give unbiased estimates?
Dropout Mechanism: A subject’s probability of dropping out at visit \(k\) depends on their depression score at visit \(k-1\) (which you observed). What mechanism is this?
Answers
In the MAR simulation, subjects with high \(Y_{i,k-1}\) are more likely to drop out. This means the remaining (observed) subjects at later times have systematically lower values than the full population would have had, leading to downward-biased observed means.
Only (c) ML with correctly specified AR(1) covariance gives unbiased estimates under MAR.
MAR - The dropout depends on the observed previous value, not on the current unobserved value. In the Diggle-Kenward dropout taxonomy this is “random dropout” (MNAR is their “informative dropout”).
| Method | Description |
|---|---|
| Complete-case | Use only subjects with full data |
| Available-data | Use all observed data |
| Likelihood-based | ML/REML with correct model |
| Imputation | Fill in missing values (LVCF, MI) |
| Weighting | Inverse probability weighting |
Each method makes different assumptions and has different validity conditions.
Approach: Exclude all subjects with any missing data
When valid: MCAR only
| Pros | Cons |
|---|---|
| Simple to implement | Highly inefficient |
| Standard software works | Reduced statistical power |
| Biased under MAR or MNAR | |
| Can exclude most of the sample |
Recommendation: Rarely acceptable in practice
Under MAR, complete-case analysis is biased even though it uses “high quality” data!
Original sample: 1000 subjects
Complete cases: 177 subjects
Percent retained: 17.7 %
Complete-case estimate of slope: 0.271
True value: 0.250
Bias: 0.021
Approach: Use all observed data (not just complete cases)
Examples: GLS, Standard GEE
When valid: MCAR only
Why it fails under MAR:
Available-data methods assume sample means/covariances of \(Y_i^O\) are unbiased for population parameters. This is TRUE under MCAR but FALSE under MAR!
Intuition:
Imagine a trial where sicker patients drop out more often.
Mathematical reason:
GEE estimating equations use: \[\sum_{i=1}^N D_i^T V_i^{-1} (Y_i^O - \mu_i(X_i; \beta)) = 0\]
Under MAR, \(E(Y_i^O \mid X_i) \neq \mu_i(X_i; \beta)\) in general.
Approach: Maximize likelihood based on observed data:
\[L(\beta, \Sigma) = \prod_{i=1}^N f(Y_i^O \mid X_i; \beta, \Sigma)\]
Examples: GLS with correctly specified covariance, LMMs (lme, lmer), GLMMs
When valid: MCAR or MAR (if model is correct!)
Critical requirement under MAR: Must correctly specify both mean model and covariance model.
Under MAR, prediction of missing values depends critically on the covariance structure.
Recall the conditional expectation formula:
\[E(Y_i^M \mid Y_i^O) = \mu_i^M + \Sigma_i^{MO} (\Sigma_i^O)^{-1} (Y_i^O - \mu_i^O)\]
Depends on covariance!
Misspecified covariance leads to wrong conditional mean and biased \(\hat{\beta}\).
Correct covariance leads to unbiased; Misspecified leads to biased (under MAR)!
True slope: 0.25
Correct covariance (AR1):
Slope estimate: 0.235
Bias: -0.015
Misspecified covariance (Independence):
Slope estimate: 0.105
Bias: -0.145
The likelihood-based MAR analysis (our primary approach) in SAS. PROC MIXED with REML uses all available occasions, so it is valid under MAR if the mean and covariance are correct (no imputation needed).
/* Likelihood-based analysis valid under MAR (data in long form) */
proc mixed data = long method = reml;
class id;
model y = time / solution;
repeated time / subject = id type = un; /* unstructured covariance */
run;
For the multiple-imputation thread (a Chapter 18 preview), the SAS pipeline is PROC MI then PROC MIANALYZE:
proc mi data = wide nimpute = 20 seed = 667 out = imp;
var y1 y2 y3 y4 y5; /* monotone or MCMC imputation */
run;
proc mixed data = imp; by _Imputation_;
model y = time / solution covb;
ods output SolutionF = mixparms;
run;
proc mianalyze parms = mixparms; /* Rubin's rules pooling */
modeleffects intercept time;
run;
Basic idea: Fill in (impute) missing values, then analyze complete data.
| Approach | Description |
|---|---|
| Ad hoc | LVCF, baseline carried forward |
| Model-based single | Conditional mean imputation |
| Multiple imputation | Accounts for uncertainty |
We’ll focus heavily on LVCF because it’s widely used (and widely misunderstood).
Procedure:
Assumption: \[E(Y_{ij} \mid \text{dropout at } k, Y_{i,k-1}) = Y_{i,k-1} \text{ for all } j \geq k\]
“No change after dropout”
Statistical folklore: “LVCF gives conservative estimate of treatment effect”
Reality: FALSE!
Bias can go either direction depending on:
We will prove this mathematically.
Rare scenario where LVCF is reasonable: Dropout due to cure or recovery
In most clinical trials: Dropout often due to:
| Reason | LVCF Assumption |
|---|---|
| Lack of efficacy | Not plausible |
| Side effects | Not plausible |
| Administrative | Unrelated |
Beyond bias, LVCF has other serious flaws:
| Problem | Consequence |
|---|---|
| Underestimates variance | Treats imputed values as observed; SEs too small |
| Violates correlation structure | Creates artificial “plateaus” in trajectories |
| Inflates Type I error | Variability often increases over time |
Simple two-timepoint design:
| Component | Description |
|---|---|
| \(Y_{i1}\) | Baseline (always observed) |
| \(Y_{i2}\) | Follow-up (sometimes missing) |
| \(\text{trt}_i\) | Treatment (1) vs Control (0) |
| \(R_{i2}\) | 1 if \(Y_{i2}\) observed, 0 if missing |
| \(\pi_0\) | \(\Pr(R_{i2} = 1 \mid \text{trt}_i = 0)\) (control retention) |
| \(\pi_1\) | \(\Pr(R_{i2} = 1 \mid \text{trt}_i = 1)\) (treatment retention) |
Stratify by dropout status (completer vs dropout):
Change from baseline for controls:
\[E(Y_{i2} - Y_{i1} \mid \text{trt}_i = 0, R_{i2} = 0) = \alpha_1 \quad \text{(dropouts)}\]
\[E(Y_{i2} - Y_{i1} \mid \text{trt}_i = 0, R_{i2} = 1) = \gamma_1 \quad \text{(completers)}\]
Interpretation:
Change from baseline for treatment:
\[E(Y_{i2} - Y_{i1} \mid \text{trt}_i = 1, R_{i2} = 0) = \alpha_1 + \alpha_2 \quad \text{(dropouts)}\]
\[E(Y_{i2} - Y_{i1} \mid \text{trt}_i = 1, R_{i2} = 1) = \gamma_1 + \gamma_2 \quad \text{(completers)}\]
Parameter interpretation:
Target parameter (what we want to estimate):
\[\delta = E(Y_{i2} - Y_{i1} \mid \text{trt}_i = 1) - E(Y_{i2} - Y_{i1} \mid \text{trt}_i = 0)\]
Average over dropout status:
For treatment group: \[E(Y_{i2} - Y_{i1} \mid \text{trt}_i = 1) = (\alpha_1 + \alpha_2)(1 - \pi_1) + (\gamma_1 + \gamma_2)\pi_1\]
For control group: \[E(Y_{i2} - Y_{i1} \mid \text{trt}_i = 0) = \alpha_1(1 - \pi_0) + \gamma_1\pi_0\]
Therefore: \[\delta = \alpha_2 + (\pi_1 - \pi_0)(\gamma_1 - \alpha_1) + \pi_1(\gamma_2 - \alpha_2)\]
LVCF assumption: \(E(Y_{i2} \mid \text{trt}_i, R_{i2} = 0) = E(Y_{i1} \mid \text{trt}_i, R_{i2} = 0)\)
Equivalently: \(E(Y_{i2} - Y_{i1} \mid \text{trt}_i, R_{i2} = 0) = 0\)
This means LVCF assumes \(\alpha_1 = \alpha_2 = 0\).
LVCF treatment effect estimand:
\[\delta_{\text{LVCF}} = (\gamma_1 + \gamma_2)\pi_1 - \gamma_1\pi_0 = (\pi_1 - \pi_0)\gamma_1 + \pi_1\gamma_2\]
Bias = LVCF estimate - True effect:
\[\text{Bias} = \delta_{\text{LVCF}} - \delta\]
After algebra:
\[\text{Bias} = (\pi_1 - \pi_0)\alpha_1 - (1 - \pi_1)\alpha_2\]
Key insight: Bias depends on:
Special case: MCAR dropout
Under MCAR, dropouts and completers have same expected change: \[\gamma_k = \alpha_k \quad \text{for } k = 1, 2\]
True treatment effect: \(\delta = \gamma_2 = \alpha_2\)
But LVCF gives: \[\delta_{\text{LVCF}} = (\pi_1 - \pi_0)\gamma_1 + \pi_1\gamma_2\]
Therefore: \[\text{Bias}_{\text{MCAR}} = (\pi_1 - \pi_0)\gamma_1 + (\pi_1 - 1)\gamma_2\]
Shocking Conclusion
Even under MCAR, LVCF is biased!
Why does this happen?
Scenario: No treatment effect, but response changes over time
| Parameter | Value |
|---|---|
| \(\gamma_1\) | 2.0 (both groups improve) |
| \(\gamma_2\) | 0 (no treatment effect) |
| \(\pi_0\) | 0.6 (60% control completes) |
| \(\pi_1\) | 0.8 (80% treatment completes) |
True treatment effect: \(\delta = 0\)
LVCF estimand: \[\delta_{\text{LVCF}} = (0.8 - 0.6) \times 2.0 + 0.8 \times 0 = 0.4\]
Bias: 0.4 units (positive bias!)
LVCF creates spurious treatment effect where none exists!
Scenario: Same as before but swap dropout rates
| Parameter | Value |
|---|---|
| \(\gamma_1\) | 2.0 |
| \(\gamma_2\) | 0 (no treatment effect) |
| \(\pi_0\) | 0.8 (80% control completes) |
| \(\pi_1\) | 0.6 (60% treatment completes) |
True treatment effect: \(\delta = 0\)
LVCF estimand: \[\delta_{\text{LVCF}} = (0.6 - 0.8) \times 2.0 + 0.6 \times 0 = -0.4\]
Bias: -0.4 units (negative bias!)
set.seed(667)
N <- 1000
gamma1 <- 2.0 # Both groups improve
gamma2 <- 0.0 # No treatment effect
# Scenario 1: Higher retention in treatment
pi0_scenario1 <- 0.6
pi1_scenario1 <- 0.8
# Scenario 2: Higher retention in control
pi0_scenario2 <- 0.8
pi1_scenario2 <- 0.6
simulate_lvcf_bias <- function(N, gamma1, gamma2, pi0, pi1) {
# Generate data
trt <- rbinom(N, 1, 0.5)
y1 <- rnorm(N, 10, 2) # Baseline
# True change
true_change <- ifelse(trt == 0, gamma1, gamma1 + gamma2) + rnorm(N, 0, 1)
y2_true <- y1 + true_change
# Dropout
r2 <- rbinom(N, 1, ifelse(trt == 0, pi0, pi1))
# LVCF: If dropout, use y1 for y2
y2_lvcf <- ifelse(r2 == 1, y2_true, y1)
# Estimate treatment effect
delta_true <- mean(y2_true[trt == 1] - y1[trt == 1]) - mean(y2_true[trt == 0] - y1[trt == 0])
delta_lvcf <- mean(y2_lvcf[trt == 1] - y1[trt == 1]) - mean(y2_lvcf[trt == 0] - y1[trt == 0])
c(true = delta_true, lvcf = delta_lvcf, bias = delta_lvcf - delta_true)
}
result1 <- simulate_lvcf_bias(N, gamma1, gamma2, pi0_scenario1, pi1_scenario1)
result2 <- simulate_lvcf_bias(N, gamma1, gamma2, pi0_scenario2, pi1_scenario2)
cat("Scenario 1: pi0=0.6, pi1=0.8 (higher treatment retention)\n")Bias direction depends on dropout rates! LVCF is NOT conservative.
Scenario 1: pi0=0.6, pi1=0.8 (higher treatment retention)
True delta: 0.08
LVCF delta: 0.58
Bias: 0.51 (favors treatment)
Scenario 2: pi0=0.8, pi1=0.6 (higher control retention)
True delta: 0.06
LVCF delta: -0.25
Bias: -0.3 (favors control)
Three fatal flaws:
| Flaw | Consequence |
|---|---|
| Biased in either direction | Can favor treatment or control; creates spurious effects |
| Underestimates variance | SEs too small; p-values anti-conservative |
| Unrealistic assumptions | “No change after dropout” rarely plausible |
Recommendation
Avoid LVCF except in rare justified cases (e.g., cure scenarios).
Key idea: Model outcomes differently for different dropout patterns
Pattern-mixture factorization:
\[f(Y_i, R_i \mid X_i) = f(Y_i \mid R_i, X_i) \times \Pr(R_i \mid X_i)\]
Why useful?
Challenge: Unobserved outcomes in dropout groups require assumptions or sensitivity analysis.
Step 1: Create \(m\) complete datasets by imputing missing values \(m\) times
Step 2: Analyze each of the \(m\) datasets separately
Step 3: Pool results using Rubin’s rules
Combined estimate: \[\hat{\beta}_{\text{MI}} = \frac{1}{m}\sum_{j=1}^m \hat{\beta}_j\]
Combined variance: \[\text{Var}(\hat{\beta}_{\text{MI}}) = \bar{W} + \frac{m+1}{m}B\]
Where:
| Component | Formula | Interpretation |
|---|---|---|
| \(\bar{W}\) | \(\frac{1}{m}\sum_{j=1}^m \text{SE}_j^2\) | Within-imputation variance |
| \(B\) | \(\frac{1}{m-1}\sum_{j=1}^m (\hat{\beta}_j - \hat{\beta}_{\text{MI}})^2\) | Between-imputation variance |
library(mice)
# Use MAR data from earlier (small subset for demo)
dat_mi <- dat_mar %>%
dplyr::select(subject, Y_obs.1, Y_obs.2, Y_obs.3, Y_obs.4, Y_obs.5) %>%
rename(y1 = Y_obs.1, y2 = Y_obs.2, y3 = Y_obs.3, y4 = Y_obs.4, y5 = Y_obs.5)
# Perform MI (m=10 imputations)
# method = "pmm": Predictive mean matching (preserves data distribution)
imp <- mice(dat_mi[, -1], m = 10, method = "pmm", printFlag = FALSE, seed = 667)
# Fit model to each imputed dataset
fit_mi <- with(imp, lm(y5 ~ y1))
# Pool results using Rubin's rules
pooled <- pool(fit_mi)
summary(pooled, conf.int = TRUE)[2, ]MI properly accounts for imputation uncertainty. (Details in Chapter 18!)
term estimate std.error statistic df p.value 2.5 % 97.5 % conf.low
2 y1 0.375 0.0309 12.1 111 4.27e-22 0.313 0.436 0.313
conf.high
2 0.436
Key idea: Weight observed data to correct for under-representation
Weight for subject \(i\):
\[w_i = \frac{1}{\Pr(\text{subject } i \text{ remains in study})}\]
Subjects with low probability of remaining get high weights (represent many similar dropouts).
Calculation for complete-case:
\[w_i = \frac{1}{\pi_{i1} \times \pi_{i2} \times \cdots \times \pi_{in}}\]
where \(\pi_{ik} = \Pr(D_i > k \mid D_i \geq k, Y_{i1}, \ldots, Y_{i,k-1}, X_i)\)
Step 1: Estimate dropout probabilities
For each time \(k = 2, \ldots, n\):
Step 2: Calculate weights
\[\hat{w}_i = \frac{1}{\hat{\pi}_{i1} \times \hat{\pi}_{i2} \times \cdots \times \hat{\pi}_{in}}\]
Step 3: Weighted analysis
# Using MAR data - estimate dropout at each time
dat_mar_ipw <- dat_mar %>%
dplyr::select(subject, dropout_time, starts_with("Y_obs")) %>%
pivot_longer(cols = starts_with("Y_obs"),
names_to = "time",
values_to = "y",
values_drop_na = FALSE) %>%
mutate(time = as.numeric(gsub("Y_obs.", "", time)),
observed = !is.na(y)) %>%
arrange(subject, time) %>%
group_by(subject) %>%
mutate(y_lag = lag(y)) %>%
ungroup()
# Fit dropout model: probability of observing Y at time t given Y at t-1
dropout_model <- glm(observed ~ y_lag + time,
data = dat_mar_ipw %>% filter(time > 1),
family = binomial)
# Predict probability of staying observed
dat_mar_ipw$prob_stay <- predict(dropout_model,
newdata = dat_mar_ipw,
type = "response")
# Calculate weights
dat_mar_ipw <- dat_mar_ipw %>%
mutate(prob_stay = if_else(time == 1, 1, prob_stay)) %>%
group_by(subject) %>%
mutate(weight = 1 / cumprod(prob_stay)) %>%
ungroup() %>%
filter(observed)
# Weighted GEE
fit_ipw <- geepack::geeglm(y ~ time,
data = dat_mar_ipw,
id = subject,
weights = weight,
corstr = "independence")
# Compare results
cat("Comparison of slopes:\n")IPW corrects bias! (Full implementation in Chapter 18)
Comparison of slopes:
Unweighted GEE slope: 0.105 (biased - ignores dropout)
Weighted GEE slope: 0.254 (corrected for dropout)
True slope: 0.250
| Method | MCAR | MAR | MNAR | Efficiency | Notes |
|---|---|---|---|---|---|
| Complete-case | Valid | Biased | Biased | Low | Throws away data |
| Available-data (GEE) | Valid | Biased | Biased | Medium | Uses biased moments |
| ML/REML | Valid | Valid (dagger) | Biased | High | Needs correct Sigma |
| LVCF | Biased* | Biased | Biased | NA | Never recommended |
| Multiple Imputation | Valid | Valid (dagger) | Biased | High | Accounts for uncertainty |
| IPW (Weighted GEE) | Valid | Valid (dagger) | Biased | Medium-High | Needs correct dropout model |
| Notes: | |||||
| * Even under MCAR! (dagger) If model correctly specified |
Quick Self-Check
LVCF Bias: A treatment arm has 80% retention and a control arm has 60% retention. Both groups naturally improve by 3 units over time (no treatment effect). What bias will LVCF introduce?
IPW Concept: Subject A has a 90% probability of remaining in the study; Subject B has a 30% probability. What are their IPW weights, and why does B get a larger weight?
Multiple Imputation: Why do we impute multiple times (e.g., \(m=20\)) instead of just once?
Answers
Using the LVCF bias formula: \(\text{Bias} = (\pi_1 - \pi_0)\gamma_1 = (0.8 - 0.6) \times 3 = 0.6\) units. LVCF will show a spurious treatment effect of 0.6 units favoring treatment, even though the true effect is zero!
Subject A: weight = \(1/0.9 = 1.11\); Subject B: weight = \(1/0.3 = 3.33\). Subject B gets a larger weight because they are “rare” - their type of subject (low probability of remaining) is underrepresented in the observed data. B’s observation needs to count for ~3 similar subjects who dropped out.
Single imputation treats imputed values as if they were observed, underestimating variance. Multiple imputation captures the uncertainty in the imputation process through the between-imputation variance term in Rubin’s rules. This gives correct standard errors and valid confidence intervals.
The LVCF Fallacy
Never assume LVCF is “conservative.” The direction of bias depends on:
LVCF can create false positive results, false negative results, or attenuate/exaggerate true effects depending on these factors.
Missing Data in Longitudinal Study
|
v
+-------------------------------------------+
| Missingness completely unrelated to Y? |
+-----------------------+-------------------+
|
+-------------+----------------+
| |
YES (MCAR) NO
| |
v v
+--------------+ +-------------------------------+
| MCAR | | Predictable from observed Y? |
| | +---------------+---------------+
| Any method: | |
| - ML/REML | +--------+----------+
| - GEE | YES (MAR) NO (MNAR)
| - CC valid | | |
+--------------+ v v
+--------------+ +-----------------+
| MAR | | MNAR |
| | | |
| Methods: | | Joint models: |
| - ML/REML | | - Selection |
| - MI | | - Pattern-mix |
| - IPW-GEE | | - Sensitivity |
+--------------+ +-----------------+
| Recommendation | Details |
|---|---|
| Default: MAR | MCAR too restrictive; MAR allows likelihood-based inference |
| Model covariance carefully | Try UN, AR(1), CS; use AIC/BIC for selection |
| Avoid LVCF | Biased even under MCAR; only use if scientifically justified |
| Use modern methods | ML/REML with correct model; MI; IPW for marginal models |
Always conduct sensitivity analyses!
Under MAR:
Under MNAR (or suspicion thereof):
Question: How strong would MNAR have to be to overturn our conclusion?
Approach:
Start with MAR analysis (e.g., treatment effect = 5 units, p < 0.05)
Introduce sensitivity parameter \(\delta\): \[E(Y_{\text{missing}} \mid Y_{\text{obs}}, X) = E_{\text{MAR}}(Y_{\text{missing}} \mid Y_{\text{obs}}, X) + \delta\]
Re-analyze for range of \(\delta\) values
Find tipping point: smallest \(|\delta|\) where conclusion changes
Interpretation:
Describe missingness patterns:
State assumption about missing data mechanism:
Describe method used:
Report sensitivity analyses:
Interpret appropriately:
Example statement:
“We analyzed data using maximum likelihood under the assumption that data are missing at random (MAR), conditional on observed outcomes and baseline covariates. We modeled the covariance using an unstructured matrix and confirmed robustness using multiple imputation with \(m=20\) imputations.”
| Pitfall | Reality |
|---|---|
| “The data are MCAR because dropout seems random” | MCAR is a strong assumption, rarely justified |
| “LVCF is conservative” | NO! Bias can go either direction |
| “GEE handles missing data” | Standard GEE valid under MCAR only |
| “Using all available data is always better than complete-case” | Under MAR, both are biased! |
| Pitfall | Reality |
|---|---|
| “My model is correct so MAR inference is valid” | Must check covariance specification! |
| “MI means I can ignore missing data” | MI requires correct imputation model |
| “Small amount of missing data means not a problem” | Even 10% missing can cause substantial bias |
Note
The numbers below are an instructor illustration (not a real analysis); they show how to reason from dropout reasons to a mechanism and an analysis plan.
Hypothetical trial: Depression treatment, 4 visits over 12 weeks
| Visit | Sample Size | Cumulative Dropout |
|---|---|---|
| Baseline | 200 | 0% |
| Week 4 | 180 | 10% |
| Week 8 | 155 | 22.5% |
| Week 12 | 130 | 35% |
Main reasons for dropout (from exit interviews):
Assessment:
| Reason | Likely Mechanism |
|---|---|
| Lack of improvement | MAR (predictable from observed trajectory) |
| Side effects | MAR (predictable from observed symptoms) |
| Moved/lost contact | MCAR (unrelated to outcome) |
| Improved and stopped | MAR or MNAR |
Overall: Likely MAR
Recommended approach:
Warning
The treatment effects, SEs, and p-values below are fabricated for illustration (not estimated from data). They show the shape of a sensitivity-analysis table: MAR methods agreeing, MNAR scenarios spanning a range.
| Method | Treatment Effect | SE | P-value |
|---|---|---|---|
| Complete-case | -2.8 | 0.80 | 0.001 |
| ML (UN) | -3.5 | 0.90 | <0.001 |
| MI (m=20) | -3.4 | 0.95 | <0.001 |
| IPW-GEE | -3.6 | 1.00 | <0.001 |
| MNAR (worse) | -2.9 | 1.10 | 0.008 |
| MNAR (better) | -4.1 | 1.00 | <0.001 |
| Note: | |||
| MAR methods (rows 2-4) give consistent results; MNAR scenarios show range of uncertainty |
Conclusion: Robust effect under MAR; some sensitivity to MNAR assumptions but qualitative conclusion unchanged.
Three critical insights:
| Insight | Details |
|---|---|
| Mechanism matters | MCAR: any method; MAR: need likelihood/weighted; MNAR: must model missingness |
| LVCF is problematic | Biased even under MCAR; direction depends on dropout rates |
| Modern methods are accessible | ML/REML (nlme, lme4); MI (mice, Amelia); IPW (geepack) |
Practical wisdom:
Next lecture will cover:
Multiple Imputation in detail
Inverse Probability Weighting in detail
Advanced Topics
Key Papers:
Critiques of LVCF:
Top 10 Errors to Avoid
| # | Mistake | Correct Approach |
|---|---|---|
| 1 | Assuming MCAR without justification | Assume MAR as default; provide evidence if claiming MCAR |
| 2 | “LVCF is conservative” | LVCF can bias in either direction; avoid unless scientifically justified |
| 3 | “GEE handles missing data automatically” | Standard GEE only valid under MCAR; use IPW-GEE for MAR |
| 4 | Ignoring covariance under MAR | ML under MAR requires correct covariance specification |
| 5 | “More data is always better than complete-case” | Both are biased under MAR; neither is valid |
| 6 | Single imputation | Use multiple imputation to account for uncertainty |
| 7 | Not reporting missing data details | Always document rates, patterns, and assumed mechanism |
| 8 | Skipping sensitivity analysis | Always assess robustness, especially if >10% missing |
| 9 | “Small amount of missing is okay to ignore” | Even 10-15% missing can cause substantial bias under MAR/MNAR |
| 10 | Treating “ignorable” as “can ignore” | “Ignorable” means ignore the missingness MODEL, not the problem |
Missing data are ubiquitous in longitudinal studies and require careful handling.
Framework:
Remember:
Next: Chapter 18 - Deep dive into MI and IPW methods
BIOS 667 · UNC Gillings - Lecture 17 (Ch.17)