BIOS 667 - Lecture 1: Introduction to Longitudinal Data (Ch. 1)

Fitzmaurice, Laird & Ware (2011) - Applied Longitudinal Analysis (2nd ed.)

Naim Rashid

Lecture Objectives

By the end of this lecture, you will be able to:

Recognize the structure of longitudinal data
Distinguish longitudinal vs. cross-sectional designs
Explain why correlation matters for valid inference
Describe the advantages and tradeoffs of longitudinal designs
Identify the scientific questions longitudinal studies answer

Roadmap

What is longitudinal data analysis?
Why use longitudinal designs?
Summary and next steps

Prerequisites / recall

This lecture assumes you are comfortable with:

Intro regression: fitting a linear model, reading coefficients and standard errors
Correlation and covariance: what each measures and how they differ (correlation is covariance rescaled to lie between -1 and 1)

We build on these to motivate why repeated, correlated measurements need special handling.

Part I: What is Longitudinal Data?

Definition

Longitudinal data: Repeated measurements collected from the same subjects over time.

Key goal: Understand both:

Within-person changes (trajectories)
Between-person differences in those changes
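To make the two goals concrete, here is a minimal R sketch on a made-up long-format data set (one row per subject per occasion). The data frame `toy`, its values, and its column names are hypothetical, chosen only to mirror an id/age/outcome layout; it is not the course data.

```r
# Hypothetical long-format data: one row per subject per measurement occasion.
toy <- data.frame(
  id  = c(1, 1, 1, 2, 2, 2, 3, 3),
  age = c(8, 9, 10, 8, 9, 10, 8, 9),
  y   = c(0.50, 0.62, 0.71, 0.40, 0.45, 0.52, 0.55, 0.70)
)

# Number of repeated measurements per subject (these need not be equal).
n_obs <- table(toy$id)

# Within-person change: a separate least-squares slope of y on age per subject.
slopes <- sapply(split(toy, toy$id),
                 function(d) coef(lm(y ~ age, data = d))[["age"]])

# Between-person differences in change: how much those slopes vary.
slope_sd <- sd(slopes)

print(n_obs)
print(slopes)
print(slope_sd)
```

Per-subject slopes are only a descriptive device here; the models later in the course estimate the average slope and its between-subject variation in one step.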

Example: Six Cities Study (FEV1 Data)

Tracked annual lung function in children
This is the Topeka, Kansas subset (girls) of the Six Cities Study
300 girls, 1-12 repeated measurements per child
FEV1 = forced expiratory volume in one second; modeled on the log scale (logfev1)
Available: data/fev1.txt
FEV1/Six Cities is an instructor-chosen course-wide running example (formally introduced in FLW Ch. 9, Table 9.1), not one of FLW Chapter 1’s four motivating studies (TLC lead trial, Muscatine obesity, anti-epileptic drug trial, Connecticut Child Surveys; FLW Section 1.3)

Visualizing Repeated Measures

library(ggplot2)  # plotting functions used below

fev <- read.table("../../data/fev1.txt", header = TRUE)

# Sample 15 children for clarity
set.seed(667)
ids <- sample(unique(fev$id), 15)
fev_sub <- fev[fev$id %in% ids, ]

ggplot(fev_sub, aes(x = age, y = logfev1, group = id, color = factor(id))) +
  geom_line(linewidth = 0.8) +
  geom_point(size = 2) +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(title = "Lung Function Trajectories (15 Children)",
       x = "Age (years)", y = "log(FEV1)")

Key Observation

Measurements on the same child are correlated: each trajectory moves smoothly along the child's own level rather than bouncing randomly around the population mean.

Cross-sectional: one dot per child
Longitudinal: connected trajectory per child

Cross-Sectional vs. Longitudinal

Feature	Cross-Sectional	Longitudinal
Design	One measurement/subject	Repeated measurements
Focus	Population snapshot	Trajectories, change
Correlation	Observations independent	Observations correlated within subject
Efficiency	Lower for questions about change	Higher: each subject serves as own control

Check Your Understanding: design type

A study measures blood pressure in 500 patients once. Is this longitudinal or cross-sectional data?

Answer: Cross-sectional: each patient is measured only once.

Check Your Understanding: correlation and plots

In the FEV1 study, why are the measurements from the same child correlated?
What is the key difference between a spaghetti plot and a scatterplot of all observations?

Answers:

Measurements from the same child share common characteristics (genetics, environment, baseline lung capacity) that make them more similar to each other than to measurements from other children.
A spaghetti plot connects observations within each individual, revealing trajectories; a scatterplot treats all points as independent, hiding the within-person structure.
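One simple way to see the first answer is a simulation in which each child has a stable personal level plus independent visit-to-visit noise. This is an illustrative sketch, not the FEV1 data: the variances (3 for the child-level term, 1 for the noise), the number of children, and the names `level`, `visit1`, and `visit2` are all assumptions made for the example.

```r
set.seed(667)

n_child <- 2000
var_b   <- 3   # assumed variance of the stable child-level component
var_e   <- 1   # assumed variance of independent visit-level noise

# Each child's stable level (stands in for genetics, environment, baseline capacity).
level <- rnorm(n_child, 0, sqrt(var_b))

# Two visits per child: same level, fresh noise at each visit.
visit1 <- level + rnorm(n_child, 0, sqrt(var_e))
visit2 <- level + rnorm(n_child, 0, sqrt(var_e))

# Correlation between two visits of the SAME child.
r_within <- cor(visit1, visit2)

# Value implied by the variance components: var_b / (var_b + var_e).
icc_theory <- var_b / (var_b + var_e)

# Correlation after shuffling visit 2, i.e. pairing visits from DIFFERENT children.
r_between <- cor(visit1, sample(visit2))

print(c(within = r_within, theory = icc_theory, between = r_between))
```

The shared component is the only thing linking the two visits, so removing the pairing (the shuffled version) removes the correlation.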

Part II: Why Longitudinal Designs?

Advantages

Efficiency: Each subject serves as their own control
Separates aging from cohort effects: longitudinal designs isolate the within-person change over time from cohort effects that confound cross-sectional comparisons (FLW Section 1.2, body-fat/menarche example)
Temporal ordering: exposure/measurement precedes outcome (a structural feature, not proof of causation)
Change rates: Directly estimate slopes and trajectories
Precision: Baseline adjustment improves power (we return to this in Ch. 5)
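The own-control efficiency claim has a simple two-occasion form. As a sketch, assume each subject is measured twice with common variance \(\sigma^2\) and within-subject correlation \(\rho\) (symbols introduced here for illustration, not taken from the slides):

\[
\operatorname{Var}(Y_{i2} - Y_{i1}) = \sigma^2 + \sigma^2 - 2\rho\sigma^2 = 2\sigma^2(1 - \rho)
\]

A cross-sectional comparison of two different, independent subjects \(i\) and \(j\) would instead have \(\operatorname{Var}(Y_{j2} - Y_{i1}) = 2\sigma^2\). With \(\rho = 0.75\), for example, the within-subject difference has variance \(0.5\sigma^2\), one quarter of the cross-sectional value.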

Tradeoffs

Dropout: Missing data patterns (missing at random, MAR; missing not at random, MNAR)
Irregular timing: Unequal visit schedules
Complex dependence: Requires covariance modeling (Ch. 7-8)

Why Correlation Matters

Ignoring within-subject correlation leads to:

Incorrect standard errors (the point estimates stay consistent; only the SEs are wrong)
Invalid hypothesis tests
Overstated significance

Proper models must distinguish within- vs. between-subject effects.

Clarification on Standard Errors (FLW pp. 43-44)

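Point estimates (\(\hat{\beta}\)) remain consistent under correct mean specification, even when correlation is ignored. The problem lies in the naive standard errors, which are inconsistent: they do not converge to the true sampling variability of \(\hat{\beta}\). With positive within-subject correlation, ignoring it typically underestimates the standard errors of between-subject (time-invariant) effects, leading to inflated test statistics and spuriously significant results. For within-subject (time-varying) effects, the naive standard errors can instead be too large.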
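Both claims can be written out for ordinary least squares. This is a sketch under assumed notation that the lecture has not introduced yet: subject \(i\) (of \(N\)) has outcome vector \(Y_i\), fixed design matrix \(X_i\), and error covariance \(\Sigma_i\), with a correctly specified mean \(E(Y_i) = X_i\beta\).

\[
\hat{\beta} = B^{-1} \sum_{i=1}^{N} X_i^\top Y_i,
\qquad
B = \sum_{i=1}^{N} X_i^\top X_i,
\qquad
E(\hat{\beta}) = \beta
\]

\[
\operatorname{Cov}(\hat{\beta}) = B^{-1} \Big( \sum_{i=1}^{N} X_i^\top \Sigma_i X_i \Big) B^{-1}
\]

The naive formula reports \(\sigma^2 B^{-1}\), which agrees with the expression above only when \(\Sigma_i = \sigma^2 I\), that is, when the repeated measures are uncorrelated with constant variance. The mean of \(\hat{\beta}\) does not involve \(\Sigma_i\) at all, which is why the point estimate survives while its reported uncertainty does not.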

See It: Naive vs. Cluster SE

The cluster bootstrap here is a pedagogical illustration of the concept. The methods FLW and this course actually use to fix correlation-induced SE problems are correctly specified covariance/mixed models (Ch. 7-8) and the robust sandwich/empirical estimator (GEE, Ch. 13), not the bootstrap.

set.seed(667)

# Simulate clustered data: 50 subjects, 8 visits each,
# strong positive within-subject correlation (shared random intercept).
# The covariate of interest is a BETWEEN-subject (time-invariant) exposure:
# one value per subject, so it is constant within each cluster and
# ignoring within-subject correlation understates its SE.
n_sub <- 50; n_vis <- 8; var_b <- 3   # subject-level variance
sim <- do.call(rbind, lapply(1:n_sub, function(i) {
  b_i <- rnorm(1, 0, sqrt(var_b))      # subject random intercept
  x_i <- rnorm(1)                      # subject-level exposure (one value per subject)
  y   <- 1 + 0.5 * x_i + b_i + rnorm(n_vis)
  data.frame(id = i, x = x_i, y = y)
}))

fit <- lm(y ~ x, data = sim)
naive_se <- summary(fit)$coefficients["x", "Std. Error"]

# Cluster bootstrap: resample whole subjects (keeps within-subject correlation)
boot_b <- replicate(500, {
  ids <- sample(unique(sim$id), replace = TRUE)
  d   <- do.call(rbind, lapply(ids, function(j) sim[sim$id == j, ]))
  coef(lm(y ~ x, data = d))["x"]
})
cluster_se <- sd(boot_b)

se_tab <- data.frame(SE = c("Naive (independence)", "Cluster bootstrap"),
                     value = c(naive_se, cluster_se))

ggplot(se_tab, aes(x = SE, y = value, fill = SE)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = round(value, 3)), vjust = -0.4) +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(title = "Naive SE understates true uncertainty",
       subtitle = "Independence SE is the smaller bar; the cluster bootstrap respects within-subject correlation and is the honest one",
       x = NULL, y = "Standard error of slope")

See It: The Takeaway

For the subject-level exposure x, at set.seed(667):

Naive (independence) SE: 0.105
Cluster bootstrap SE: 0.215

The honest cluster SE (0.215) is about twice as large as the naive SE (0.105).

Warning

Ignoring the positive within-subject correlation makes the naive SE too small, so test statistics are inflated and results look more significant than they are. Because the exposure is fixed within a subject, the 8 visits give far less independent information than 400 rows suggest.
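For comparison with the bootstrap, the cluster-robust (sandwich) standard error mentioned earlier can be computed by hand in a few lines of base R. This is a minimal sketch, not the course's GEE code: it re-creates data of the same form as the simulation, applies no small-sample correction, and the object names `bread`, `meat`, and `scores` are just conventional labels chosen here.

```r
set.seed(667)

# Clustered data of the same form as the simulation above:
# 50 subjects, 8 visits, shared random intercept, subject-level exposure x.
n_sub <- 50; n_vis <- 8; var_b <- 3
sim <- do.call(rbind, lapply(seq_len(n_sub), function(i) {
  b_i <- rnorm(1, 0, sqrt(var_b))   # subject random intercept
  x_i <- rnorm(1)                   # one exposure value per subject
  data.frame(id = i, x = x_i, y = 1 + 0.5 * x_i + b_i + rnorm(n_vis))
}))

fit <- lm(y ~ x, data = sim)
X   <- model.matrix(fit)   # 400 x 2 design matrix: intercept, x
e   <- resid(fit)          # OLS residuals

# "Bread": (X'X)^{-1}.
bread <- solve(crossprod(X))

# "Meat": sum over subjects of (X_i' e_i)(X_i' e_i)'.
# X * e scales row r of X by residual r; rowsum() then adds rows within subject.
scores <- rowsum(X * e, group = sim$id)   # one row per subject
meat   <- crossprod(scores)

V_robust  <- bread %*% meat %*% bread
robust_se <- sqrt(V_robust[2, 2])          # column 2 is the slope on x
naive_se  <- summary(fit)$coefficients["x", "Std. Error"]

print(c(naive = naive_se, robust = robust_se))
```

Residuals are summed within subject before squaring, so any within-subject correlation left in the residuals feeds into the variance estimate. For a linear model this is the same sandwich form that GEE reports under a working-independence correlation, though packaged implementations may add finite-sample adjustments.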

Walk through the numbers live, but read them off the rendered slide (they are inline R from the chunk objects, so they cannot drift from the code).

The mechanism: with a time-invariant exposure, all 8 observations on a subject share both the exposure value and the random intercept, so the effective sample size is closer to the 50 subjects than the 400 rows. The naive lm treats all 400 rows as independent and divides by too large an n, producing an SE that is roughly half the honest one here. This is the concrete, correct version of “ignoring correlation understates the SE” that the earlier slide stated.

Contrast with a within-subject covariate that varies occasion to occasion and is orthogonal to the random intercept, where clustering need not inflate the slope SE at all. That distinction (between- vs within-subject effects) is exactly the within-vs-between theme of the prior slide.
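The effective-sample-size argument can be made roughly quantitative with the standard design-effect approximation for a covariate that is constant within subject. This is a back-of-envelope sketch using the simulation's own settings (random-intercept variance 3, residual variance 1, \(n = 8\) visits, 50 subjects); the symbols \(\rho\), \(\sigma_b^2\), \(\sigma_e^2\), \(n_{\text{eff}}\), and DEFF are notation introduced here, not from the slides.

\[
\rho = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_e^2} = \frac{3}{3 + 1} = 0.75,
\qquad
\text{DEFF} = 1 + (n - 1)\rho = 1 + 7(0.75) = 6.25
\]

\[
\frac{\text{SE}_{\text{cluster}}}{\text{SE}_{\text{naive}}} \approx \sqrt{6.25} = 2.5,
\qquad
n_{\text{eff}} = \frac{400}{6.25} = 64
\]

An effective sample size of 64 is much closer to the 50 subjects than to the 400 rows. The ratio of about 2 seen at this seed is in the same range as the 2.5 the approximation suggests; a single simulated data set and a 500-replicate bootstrap both add sampling noise, so exact agreement should not be expected.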

Scientific Questions

Longitudinal analysis addresses:

Describe: Mean trajectories, individual variation
Explain: How covariates shift trajectories
Predict: Future outcomes for individuals/populations
Compare: Treatment effects over time
Account for dependence: Valid inference with correlation

Longitudinal vs. Clustered Data

Type	Definition	Example
Longitudinal	Repeated measures over time	FEV1 across visits
Clustered	Units within higher-level groups	Students in classrooms

Techniques overlap, but time structure adds constraints and opportunities.

Summary

What: repeated measures on the same subjects over time.
Why: efficient (own control), temporal ordering, direct change rates.
Catch: ignoring within-subject correlation corrupts SEs (estimates stay consistent).
Use: describe, explain, predict, and compare trajectories.

Canonical Resources

Author slides: content.sph.harvard.edu/fitzmaur/ala2e/
Datasets: content.sph.harvard.edu/fitzmaur/ala2e/datasets.html
Sample R code: content.sph.harvard.edu/fitzmaur/ala2e/SampleR.html

For Next Time

Read Chapter 2 of Fitzmaurice, Laird & Ware (2011)

Lecture 2 (Ch. 2): longitudinal data structures, notation, and the sources of within-subject correlation