# Define the response vector for subject i
Y_i <- c(12, 10, 9)
print(Y_i)Fitzmaurice, Laird & Ware (2011) - Applied Longitudinal Data Analysis
| Objective | Where you use it |
|---|---|
| Vector/matrix notation | Every later lecture; HW1 setup of \(\mathbf{Y}_i\), \(\mathbf{X}_i\) |
| General linear model (mean + covariance) | LME and GEE on HW3-HW4; final project model specification |
| Descriptive visualization | HW1 exploratory plots; first step of the final-project analysis |
| Three covariance approaches | Motivates covariance-pattern (Ch. 7) and random-effects (Ch. 8) choices on HW3 |
| Limits of historical approaches | Justifies why the final project uses modern methods, not ANOVA/MANOVA |
| Section | Topics |
|---|---|
| Part 1 | Notation and distributional assumptions |
| Part 2 | Descriptive methods for visualization |
| Part 3 | Modeling the covariance structure |
| Part 4 | Historical approaches and their limitations |
| Part 5 | Summary and next steps |
You will get the most out of this lecture if you are comfortable with:
We build directly on the \(\mathbf{X}\boldsymbol\beta\) matrix form from L02; today we add the within-subject covariance.
| Symbol | Meaning |
|---|---|
| \(i,\ j\) | subject index \(i = 1,\dots,N\) and occasion/time index \(j = 1,\dots,n_i\) |
| \(N,\ n_i\) | number of subjects; number of occasions for subject \(i\) (\(n\) when balanced) |
| \(Y_{ij},\ \mathbf{Y}_i\) | response for subject \(i\) at occasion \(j\) (scalar); response vector \((n_i \times 1)\) |
| \(t_{ij}\) | measurement time for subject \(i\) at occasion \(j\) |
| \(X_{ij},\ \mathbf{X}_i\) | covariate row vector \(1 \times p\); design matrix \(n_i \times p\) |
| \(\boldsymbol\beta\) | fixed-effect (population-average) coefficients |
| \(\varepsilon_{ij},\ \boldsymbol\varepsilon_i\) | residual error (scalar; vector) |
| \(\Sigma_i = \operatorname{Cov}(\mathbf{Y}_i)\) | marginal (total) covariance of \(\mathbf{Y}_i\) (decomposes into random-effect + residual parts once random effects appear, Ch. 8) |
| \(\rho\) | a correlation parameter (e.g. exchangeable, or AR(1) where corr at lag \(k\) is \(\rho^k\)) |
Bold = vector or matrix. This is a running reference card: later lectures add random-effect and GEE notation.
\(Y_{ij}\) denotes the response for subject \(i\) at occasion \(j\).
We group the \(n_i\) repeated measures for subject \(i\) into a column vector:
\[ \mathbf{Y}_{i} = \begin{pmatrix} Y_{i1} \\ Y_{i2} \\ \vdots \\ Y_{in_{i}} \end{pmatrix}_{n_i \times 1} \]
| Level | Independence? |
|---|---|
| Between subjects | Yes (conditional on covariates) |
| Within subject | No (repeated measures are correlated) |
This within-subject correlation is the defining feature of longitudinal data.
\(X_{ij}\) is the \(1 \times p\) row vector of covariates for subject \(i\) at occasion \(j\). Stacking these rows over the \(n_i\) occasions gives the design matrix for subject \(i\):
\[ \mathbf{X}_{i} = \begin{pmatrix} X_{i1} \\ X_{i2} \\ \vdots \\ X_{in_{i}} \end{pmatrix}_{n_i \times p} \]
So \(\mathbf{X}_i\) is \(n_i \times p\), and the model is written \(\mathbf{X}_i\boldsymbol\beta\).
Rows are occasions, columns are covariates. Intercept is a column of 1s; Sex is constant (time-constant); Time increases down the rows (time-varying).
| Type | Description | Example |
|---|---|---|
| Time-constant | Does not change over time | Sex, race |
| Time-varying | Changes at each occasion | Age, weight |
The model for the observation vector of subject \(i\) is:
\[ \mathbf{Y}_{i} = \mathbf{X}_{i}\boldsymbol{\beta} + \boldsymbol{\varepsilon}_{i} \]
The mean response is linear in \(\boldsymbol{\beta}\).
| Component | Notation | Description |
|---|---|---|
| Response vector | \(\mathbf{Y}_i\) | Observed outcomes |
| Design matrix | \(\mathbf{X}_i\) | Covariates |
| Regression parameters | \(\boldsymbol{\beta}\) | Fixed effects |
| Random errors | \(\boldsymbol{\varepsilon}_i\) | Zero-mean errors |
A subject measured at three time points with responses 12, 10, and 9.
Interpretation: This shows the response vector \(\mathbf{Y}_i\) for a single subject with 3 measurements. The values (12, 10, 9) represent the outcome at times 1, 2, and 3 respectively. Notice the downward trend: this subject’s response is decreasing over time. In R, this is stored as a numeric vector of length 3.
[1] 12 10 9
Covariates include an intercept, sex (time-constant), and time (time-varying).
Interpretation: This \(3 \times 3\) design matrix corresponds to the same subject with 3 visits. Each row is one occasion, so this is \(X_{i1}, X_{i2}, X_{i3}\) stacked into \(\mathbf{X}_i\). The intercept column is all 1s (standard for regression). The Sex column is 0 for all rows because sex is time-constant (this subject is male). The Time column increases (1, 2, 3) because time is time-varying. When we compute \(\mathbf{X}_i\boldsymbol{\beta}\), we get the predicted mean response at each visit.
Intercept Sex Time
[1,] 1 0 1
[2,] 1 0 2
[3,] 1 0 3
We can stack data for all \(N\) subjects into one model:
\[ \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \]
Where \(\mathbf{Y} = (\mathbf{Y}_1^{\top}, \dots, \mathbf{Y}_N^{\top})^{\top}\) and \(\boldsymbol{\varepsilon} = (\boldsymbol{\varepsilon}_1^{\top}, \dots, \boldsymbol{\varepsilon}_N^{\top})^{\top}\).
\(\mathbf{X}\) is formed by stacking the subject-specific design matrices \(\mathbf{X}_i\).
The covariance matrix \(\boldsymbol\Sigma\) for the full data set is block-diagonal:
\[ \boldsymbol\Sigma = \mathrm{diag}(\Sigma_1, \Sigma_2, \dots, \Sigma_N) \]
Here \(\Sigma_i = \operatorname{Cov}(\mathbf{Y}_i)\) is the within-subject covariance of subject \(i\)’s response (unstructured for now). The off-diagonal blocks are zero because different subjects are independent.
Dense blocks on the diagonal are the within-subject covariances \(\Sigma_1, \Sigma_2, \Sigma_3\). The white off-diagonal blocks are zero: different subjects are independent.
| Component | Expression | Meaning |
|---|---|---|
| Mean structure | \(E(\mathbf{Y}_{i}) = \mathbf{X}_{i}\boldsymbol{\beta}\) | How the average response changes |
| Covariance structure | \(\mathrm{Cov}(\mathbf{Y}_{i}) = \Sigma_i\) | How measures are correlated |
\(\Sigma_i = \operatorname{Cov}(\mathbf{Y}_i)\) is the within-subject covariance of the response (unstructured for now); we decompose it once random effects appear in Ch. 8.
We assume the vector of responses for subject \(i\) is multivariate normal:
\[ \mathbf{Y}_i \sim N_{n_i}(\mathbf{X}_i\boldsymbol{\beta}, \Sigma_i) \]
Unlike the univariate case, it is very difficult to formally verify the multivariate normal assumption.
Formal statistical tests exist but are often unhelpful, as they can detect departures that are not substantively important.
| Check | Purpose |
|---|---|
| Histograms of residuals | Assess marginal normality at each time |
| Boxplots of residuals | Identify outliers and skewness |
| Scatterplots of residual pairs | Check for linear trends between times |
Caveat: These checks can only provide evidence of departures. They cannot prove the assumption holds.
The assumption of multivariate normality is not as critical for valid inferences about \(\boldsymbol{\beta}\) as the assumptions about:
Unless departures are very extreme (e.g., highly skewed data), correctly modeling dependence is more important.
| Pitfall | Why It Matters |
|---|---|
| Confusing mean and covariance misspecification | Misspecifying the mean model is a bigger problem |
| Overstating the role of normality | It is a working assumption, not a strict requirement |
For simplicity, we often use:
We are always referring to the conditional distribution given the covariates.
Check Your Understanding: Part 1a (Dimensions and structure)
Answers:
Check Your Understanding: Part 1b (Normality)
Answer:
Before modeling, we need to understand the patterns in our data.
Goal: To see individual trajectories and overall trends.
A spaghetti plot connects successive repeated measures for each individual.
Purpose: Shows individual trajectories, the spread of starting points (between-subject variability), and any outlying subjects or unbalanced/irregular data.
Caution (FLW Section 3.3): The joined-segments plot is not very informative about trends in the mean response; between- and within-subject variability is often almost completely obscured. For reading the mean trend, use the mean-response (mean-trajectory) plot instead, not the spaghetti plot.
# Load packages
library(dplyr)
library(ggplot2)
# Create mock longitudinal data
set.seed(123)
data_long <- tibble(
subject = rep(1:20, each = 5),
time = rep(1:5, times = 20),
response = rnorm(100, mean = 50 + 2*time + 0.5*subject, sd = 5)
)
# Create the plot
ggplot(data_long, aes(x = time, y = response, group = subject)) +
geom_line(alpha = 0.6) +
geom_point(alpha = 0.6) +
labs(title = "Spaghetti Plot", x = "Time", y = "Response") +
theme_minimal()Interpretation: Each line represents one subject’s trajectory over 5 time points. The plot is good for seeing the spread of starting points (substantial between-subject variability, with some subjects high and some low), how individuals maintain their relative positions over time (a subject who starts high tends to stay high), and any outlying subjects or unbalanced data. It is not the right tool for reading the mean trend: with many overplotted lines the average drift is easily obscured. For the mean trend, read the mean-response plot on the next slide. The persistence of relative position here is the visual cue for the within-subject correlation our models must accommodate.
A mean trajectory plot reveals the overall trend in the data.
Purpose: Shows the overall trend of the group and a confidence interval for that trend.
# Summarize data
summary_data <- data_long %>%
group_by(time) %>%
summarize(
mean_response = mean(response),
sd_response = sd(response),
n = n(),
se_response = sd_response / sqrt(n)
)
# Create the plot with a ribbon for uncertainty
ggplot(summary_data, aes(x = time, y = mean_response)) +
geom_line(color = "blue", linewidth = 1.5) +
geom_point(color = "blue", size = 3) +
geom_ribbon(aes(ymin = mean_response - 1.96 * se_response,
ymax = mean_response + 1.96 * se_response),
alpha = 0.2, fill = "blue") +
labs(title = "Mean Trajectory with 95% CI", x = "Time", y = "Mean Response") +
theme_minimal()Interpretation: The blue line shows the average response at each time point across all 20 subjects. The shaded ribbon represents a 95% confidence interval for the mean. The upward slope indicates that, on average, the response increases by about 2 units per time point. The narrow confidence interval reflects relatively precise estimation of the mean trajectory. This plot complements the spaghetti plot by summarizing the overall trend while hiding individual variation.
A correlation heatmap shows pairwise correlations between repeated measures.
Purpose: Shows how the correlation decays over time. Do measurements far apart in time have a lower correlation?
Note
This heatmap is an instructor-added exploratory device, not one of the FLW Section 3.3 descriptive plots (which are time plots of the raw data, mean-response plots, and smoothing via moving averages or lowess). FLW introduces correlation and correlation decay later, under Section 3.5 “Modeling the Covariance”; we show the heatmap here to bridge into that section.
# Load packages
library(tidyr)
# Simulated illustration: genuine AR(1) serial correlation within each subject.
# Each subject's 5 errors follow an AR(1) process, so correlation decays with
# time separation. The population lag-k correlation is 0.8^k, but the finite-
# sample estimate is weaker at longer lags. rho = 0.8 matches the running AR(1)
# example in L02.
set.seed(667)
n_sub <- 200
phi <- 0.8
ar1_data <- do.call(rbind, lapply(1:n_sub, function(s) {
err <- as.numeric(arima.sim(model = list(ar = phi), n = 5))
tibble(subject = s, time = 1:5, response = 50 + 2 * (1:5) + err)
}))
# Pivot data to a wide format
wide_data <- ar1_data %>%
select(subject, time, response) %>%
pivot_wider(names_from = time, values_from = response) %>%
select(-subject)
# Calculate correlation matrix
cor_matrix <- cor(wide_data, use = "pairwise.complete.obs")
# Tidy matrix for plotting
cor_long <- cor_matrix %>%
as.data.frame() %>%
tibble::rownames_to_column("Time1") %>%
pivot_longer(-Time1, names_to = "Time2", values_to = "Correlation")
# Create the heatmap
ggplot(cor_long, aes(x = Time1, y = Time2, fill = Correlation)) +
geom_tile() +
scale_fill_gradient2(low = "red", high = "blue", mid = "white",
midpoint = 0, limit = c(-1,1), name = "Correlation") +
theme_minimal() +
labs(title = "Correlation Heatmap (AR(1) data)")Interpretation: This heatmap displays the correlation matrix for the 5 time points. The diagonal is darkest blue (correlation = 1, since each time point is perfectly correlated with itself). Because the errors were simulated from an AR(1) process, the off-diagonal correlations fade as the time separation grows: adjacent times (lag 1) are strongly correlated, while the most separated pair (times 1 and 5, lag 4) is much weaker. That visible fade away from the diagonal is the signature of serial correlation, exactly the pattern an AR(1) covariance structure is built to capture.
Check Your Understanding: Part 2a (Reading the plots)
Answers:
Check Your Understanding: Part 2b (Why visualize first)
Answer:
The plots above used simulated data to isolate each idea. Now we run the same three descriptive views on a real FLW dataset.
Data: Treatment of Lead-Exposed Children (TLC) trial, the running example from Lecture 2 (FLW Ch. 2, Tables 2.2-2.3). We use the placebo group: \(N = 50\) children, blood-lead level measured at weeks 0, 1, 4, and 6 (\(n = 4\) occasions).
Why placebo: an untreated group lets us see the natural decay and the within-subject correlation structure without a treatment effect confounding the trend.
library(dplyr)
library(tidyr)
library(ggplot2)
# tlc.csv is WIDE: ID, Treatment Group, then lead level at weeks 0, 1, 4, 6.
# Local-first with a download fallback (FLW ships a header-free, space-delim file).
tlc_file <- "../../data/tlc.csv"
if (!file.exists(tlc_file)) tlc_file <- "data/tlc.csv"
if (file.exists(tlc_file)) {
tlc <- read.csv(tlc_file)
} else {
tlc <- read.table("https://content.sph.harvard.edu/fitzmaur/ala2e/tlc-data.txt",
header = FALSE,
col.names = c("ID", "Treatment.Group",
"Lead.Level.Week.0", "Lead.Level.Week.1",
"Lead.Level.Week.4", "Lead.Level.Week.6"))
}
# Placebo group, one column per occasion -> long format for plotting
weeks <- c(0, 1, 4, 6)
# Select the four weekly lead-level columns BY NAME (robust to column order);
# fall back to the last four columns if a local file uses non-standard names.
lead_cols <- grep("Lead.*Level|Week", names(tlc), ignore.case = TRUE, value = TRUE)
if (length(lead_cols) != 4) lead_cols <- tail(names(tlc), 4)
pbo_wide <- tlc[tlc$Treatment.Group == "P", lead_cols]
colnames(pbo_wide) <- paste0("w", weeks)
pbo_wide$id <- seq_len(nrow(pbo_wide))
pbo_long <- pbo_wide %>%
pivot_longer(cols = starts_with("w"), names_to = "wk", values_to = "lead") %>%
mutate(week = weeks[match(wk, paste0("w", weeks))])
nrow(pbo_wide) # number of placebo childrenThe 50 placebo trajectories spread widely at baseline (the between-subject heterogeneity), and children who start high tend to stay high: relative position persists across weeks. That persistence is the visual signature of within-subject correlation the spaghetti plot exposes but cannot quantify.
Mean blood-lead drifts down only mildly under placebo, from about 26.3 mcg/dL at week 0 to about 23.6 mcg/dL at week 6. This is the mean-structure picture \(E(\mathbf{Y}_i) = \mathbf{X}_i\boldsymbol\beta\) made visual.
Warning
The 95% ribbon uses a naive standard error that ignores within-subject correlation, so treat it as descriptive only. Proper standard errors require modeling the marginal covariance \(\Sigma_i\) (Part 3).
Off-diagonal correlations are high and positive throughout, ranging from about 0.76 to 0.87 (FLW Table 2.3). The dominant signal is strong positive correlation everywhere (between-subject heterogeneity), with a modest decay as the gap between weeks widens. These are exactly the two facts (non-constant structure, decaying correlation) the covariance models in Part 3 are built to capture.
A defining feature of longitudinal data is that repeated responses on the same individual are correlated.
This correlation cannot be ignored.
| Benefit | Description |
|---|---|
| Valid inferences | Correct standard errors and p-values |
| Efficiency | More precise estimates |
| Missing data | Valid estimates under certain conditions |
| Approach | Description |
|---|---|
| Unstructured | Completely arbitrary pattern |
| Covariance pattern models | Specific parsimonious structure |
| Random effects | Individual-specific effects induce structure |
This approach assumes no explicit structure for the covariance.
It estimates the unique variance for each time point and the covariance for every possible pair.
Key Feature: Provides a completely flexible and arbitrary pattern.
| Issue | Description |
|---|---|
| Parameter-heavy | \(\frac{n_i(n_i+1)}{2}\) parameters, where \(n_i\) = number of occasions (55 for 10 occasions!) |
| Unstable estimates | Small samples lead to instability |
| Data limitations | Cannot accommodate irregular timing or missing data |
This approach places a specific, parsimonious structure on the covariance matrix.
Inspired by time series analysis.
Key Idea: The correlation between repeated measures decays as the time separation increases.
An alternative, indirect way to model the covariance.
It accounts for correlation by introducing individual-specific random effects into the model.
Example: A random intercept represents all unobserved factors that make some subjects naturally higher or lower responders.
| Benefit | Description |
|---|---|
| Flexibility and parsimony | Few parameters, realistic patterns |
| Handles irregular data | Natural handling of missing observations |
| Multiple effects | Random intercepts and slopes for richer patterns |
| Approach | Pros | Cons |
|---|---|---|
| Unstructured | Most flexible | Parameter-heavy, requires complete data |
| Covariance Pattern | Parsimonious, handles irregular timing | Assumes a specific pattern |
| Random Effects | Flexible, parsimonious, handles missing data | Structure can be too restrictive |
Compound Symmetry (CS): Constant correlation (\(\rho\)) for any pair of times.
\[ \begin{pmatrix} 1 & \rho & \rho & \rho \\ \rho & 1 & \rho & \rho \\ \rho & \rho & 1 & \rho \\ \rho & \rho & \rho & 1 \end{pmatrix} \]
AR(1): Correlation decays exponentially as time separation increases (\(\rho^k\)).
\[ \begin{pmatrix} 1 & \rho & \rho^2 & \rho^3 \\ \rho & 1 & \rho & \rho^2 \\ \rho^2 & \rho & 1 & \rho \\ \rho^3 & \rho^2 & \rho & 1 \end{pmatrix} \]
Check Your Understanding: Part 3a (Unstructured and pattern models)
Answers:
Check Your Understanding: Part 3b (Random effects)
Answer:
| Method | Description |
|---|---|
| Summary Measures | Reduce to a single value |
| Univariate Repeated Measures ANOVA | Extension of ANOVA |
| MANOVA | Multivariate analysis of variance |
| Method | Advantage | Disadvantage |
|---|---|---|
| Summary Measures | Simple, easy to use | Discards trajectory information |
| Univariate ANOVA | Easy to perform | Requires compound symmetry |
| MANOVA | No covariance structure required | Cannot handle missing data |
The simplest approach: reduce repeated measures to a single summary measure.
Examples: Mean response or regression slope.
Limitation: Discards information about individual trajectories.
An extension of ANOVA.
Assumption: Compound symmetry (equal variances and constant correlation).
This is often inappropriate for longitudinal data where correlation typically decays with time separation.
Solution: Greenhouse and Geisser (1959) developed an adjustment to correct the F-test.
Treats repeated measures as a single, multi-dimensional outcome.
Advantage: Does not assume a specific covariance structure; it estimates an unstructured matrix.
Limitation: Requires a complete set of measurements for every subject.
MANOVA requires complete measurements for every subject.
| Subject | Time 1 | Time 2 | Time 3 | Time 4 |
|---|---|---|---|---|
| 1 | 12 | 10 | 9 | 11 |
| 2 | 15 | 14 | - | 13 |
| 3 | 11 | 12 | 10 | 12 |
Subject 2 would be dropped from the analysis.
| Method | Assumption | Parameters |
|---|---|---|
| Univariate ANOVA | Compound symmetry | 2 |
| MANOVA | Unstructured | Many |
| Modern methods | Realistic and parsimonious | Few |
Focusing only on the mean structure: The covariance structure is equally important. Misspecifying it leads to incorrect standard errors and invalid inference, even if your mean model is correct.
Using unstructured covariance by default: While flexible, unstructured covariance requires many parameters and cannot handle missing data or irregular timing. Consider parsimonious alternatives (CS, AR(1), random effects).
Applying MANOVA with incomplete data: MANOVA drops subjects with any missing observations. This wastes data and can introduce bias if missingness is related to the outcome.
Assuming compound symmetry without checking: Univariate repeated measures ANOVA assumes equal correlations across all pairs. This is often unrealistic, since correlations typically decay with time separation.
Confusing the mean and covariance misspecification consequences: Misspecifying the mean model biases \(\hat{\beta}\). Misspecifying the covariance structure affects standard errors but not point estimates (if the mean is correct).
The general linear model has two parts:
| Component | Expression |
|---|---|
| Mean structure | \(E(\mathbf{Y}_i) = \mathbf{X}_i\boldsymbol{\beta}\) |
| Covariance structure | \(\mathrm{Cov}(\mathbf{Y}_i) = \Sigma_i\) |
The covariance structure makes longitudinal analysis unique.
We must account for the non-independence of repeated measures.
Choosing a model for \(\Sigma_i\) that is both flexible and parsimonious is the main challenge.
Upcoming chapters will cover modern approaches that are more flexible and robust.
| Topic | Description |
|---|---|
| Serial correlation | Accounting for time-ordered dependence |
| Random variation | Modeling individual-specific effects |
| Incomplete data | Handling missing observations |
BIOS 667 · UNC Gillings - Lecture 3 (Ch. 3)