BIOS 667 - Lecture 1: Introduction to Longitudinal Data (Ch. 1)

Fitzmaurice, Laird & Ware (2011) - Applied Longitudinal Analysis (2nd ed.)

Naim Rashid

Lecture Objectives

By the end of this lecture, you will be able to:

  1. Recognize the structure of longitudinal data
  2. Distinguish longitudinal vs. cross-sectional designs
  3. Explain why correlation matters for valid inference
  4. Describe the advantages and tradeoffs of longitudinal designs
  5. Identify the scientific questions longitudinal studies answer

Roadmap

  1. What is longitudinal data analysis?
  2. Why use longitudinal designs?
  3. Summary and next steps

Prerequisites / recall

This lecture assumes you are comfortable with:

  • Intro regression: fitting a linear model, reading coefficients and standard errors
  • Correlation and covariance: what they measure and that they are not the same thing

We build on these to motivate why repeated, correlated measurements need special handling.

Part I: What is Longitudinal Data?

Definition

Longitudinal data: Repeated measurements collected from the same subjects over time.

Key goal: Understand both:

  • Within-person changes (trajectories)
  • Between-person differences in those changes
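As a minimal sketch of this structure, the toy data frame below stores one row per subject-visit ("long" format) and computes each subject's change from first to last visit. The values are made up for illustration (they are not the Six Cities data), and the names toy, y, and change are assumptions.

```r
# Hypothetical toy data: 3 subjects, 3 visits each, one row per subject-visit
toy <- data.frame(
  id  = rep(1:3, each = 3),
  age = rep(c(8, 9, 10), times = 3),
  y   = c(0.50, 0.62, 0.71,
          0.40, 0.55, 0.69,
          0.65, 0.70, 0.74)
)

# Within-person change: last visit minus first visit, computed per subject
change <- tapply(toy$y, toy$id, function(v) v[length(v)] - v[1])

# Between-person differences: the changes are not the same across subjects
print(change)
print(range(change))
```

Here subject 2 changes the most and subject 3 the least, which is the kind of between-person difference in within-person change that longitudinal models aim to describe.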

Example: Six Cities Study (FEV1 Data)

  • Tracked annual lung function in children
  • This is the Topeka, Kansas subset (girls) of the Six Cities Study
  • 300 girls, 1-12 repeated measurements per child
  • FEV1 = forced expiratory volume in one second; modeled on the log scale (logfev1)
  • Available: data/fev1.txt
  • Note: FEV1/Six Cities is the instructor's course-wide running example (formally introduced in FLW Ch. 9, Table 9.1). It is not one of the four motivating studies in FLW Section 1.3: the TLC lead trial, the Muscatine obesity study, the anti-epileptic drug trial, and the Connecticut Child Surveys.

Visualizing Repeated Measures

library(ggplot2)

# One row per child-visit: id, age, logfev1 (among other columns)
fev <- read.table("../../data/fev1.txt", header = TRUE)

# Sample 15 children for clarity
set.seed(667)
ids <- sample(unique(fev$id), 15)
fev_sub <- fev[fev$id %in% ids, ]

ggplot(fev_sub, aes(x = age, y = logfev1, group = id, color = factor(id))) +
  geom_line(linewidth = 0.8) +
  geom_point(size = 2) +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(title = "Lung Function Trajectories (15 Children)",
       x = "Age (years)", y = "log(FEV1)")

Key Observation

Measurements on the same child are correlated: each trajectory moves smoothly rather than bouncing randomly from visit to visit.

  • Cross-sectional: one dot per child
  • Longitudinal: connected trajectory per child

Cross-Sectional vs. Longitudinal

Feature       Cross-Sectional              Longitudinal
Design        One measurement/subject      Repeated measurements
Focus         Population snapshot          Trajectories, change
Correlation   Independent                  Within-subject correlation
Efficiency    Less for dynamic questions   Each subject = own control

Check Your Understanding: design type

  1. A study measures blood pressure in 500 patients once. Is this longitudinal or cross-sectional data?

Answer: Cross-sectional: each patient is measured only once.

Check Your Understanding: correlation and plots

  1. In the FEV1 study, why are the measurements from the same child correlated?
  2. What is the key difference between a spaghetti plot and a scatterplot of all observations?

Answers:

  1. Measurements from the same child share common characteristics (genetics, environment, baseline lung capacity) that make them more similar to each other than to measurements from other children.
  2. A spaghetti plot connects observations within each individual, revealing trajectories; a scatterplot treats all points as independent, hiding the within-person structure.
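The first answer can be sketched with a small simulation: if every measurement on a subject includes the same stable subject-level component, two visits from the same subject are correlated even though the visit-specific noise is independent. The variance values and the names var_b, var_e, icc_theory, and icc_sample are illustrative assumptions, not estimates from the FEV1 data.

```r
set.seed(667)
n_sub <- 2000   # many subjects so the sample correlation is stable
var_b <- 3      # between-subject variance (assumed)
var_e <- 1      # within-subject (visit-level) noise variance (assumed)

b  <- rnorm(n_sub, 0, sqrt(var_b))       # stable subject-level component
y1 <- b + rnorm(n_sub, 0, sqrt(var_e))   # visit 1
y2 <- b + rnorm(n_sub, 0, sqrt(var_e))   # visit 2

# Implied correlation between two visits on the same subject
icc_theory <- var_b / (var_b + var_e)    # 0.75
icc_sample <- cor(y1, y2)

print(c(theory = icc_theory, sample = icc_sample))
```

The larger the shared component is relative to the visit-level noise, the stronger the within-subject correlation.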

Part II: Why Longitudinal Designs?

Advantages

  • Efficiency: Each subject serves as their own control
  • Separates aging from cohort effects: longitudinal designs isolate the within-person change over time from cohort effects that confound cross-sectional comparisons (FLW Section 1.2, body-fat/menarche example)
  • Temporal ordering: exposure/measurement precedes outcome (a structural feature, not proof of causation)
  • Change rates: Directly estimate slopes and trajectories
  • Precision: Baseline adjustment improves power (we return to this in Ch. 5)
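The "own control" efficiency gain can be illustrated with a two-visit simulation. With common variance \(\sigma^2\) and within-subject correlation \(\rho\), the variance of a within-person difference is \(2\sigma^2(1-\rho)\), compared with \(2\sigma^2\) when the two time points are measured on different people. The sketch below assumes \(\sigma^2 = 1\), \(\rho = 0.8\), and a true change delta = 0.3; all of these values and names are hypothetical.

```r
set.seed(1)
n     <- 5000
rho   <- 0.8    # assumed within-subject correlation
delta <- 0.3    # assumed true mean change between visits

# Longitudinal design: the same subjects are measured twice
b  <- rnorm(n, 0, sqrt(rho))                     # shared subject effect
y1 <- b + rnorm(n, 0, sqrt(1 - rho))             # visit 1, total variance 1
y2 <- delta + b + rnorm(n, 0, sqrt(1 - rho))     # visit 2, total variance 1
var_within <- var(y2 - y1)                       # theory: 2 * (1 - rho) = 0.4

# Cross-sectional design: different subjects at each time point
z1 <- rnorm(n)
z2 <- delta + rnorm(n)
var_between <- var(z1) + var(z2)                 # theory: 2

print(c(within = var_within, between = var_between))
```

Because the subject effect cancels in y2 - y1, the within-person comparison estimates the same change with much less noise per subject.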

Tradeoffs

  • Dropout: Missing data, which may be missing at random (MAR) or missing not at random (MNAR)
  • Irregular timing: Unequal visit schedules
  • Complex dependence: Requires covariance modeling (Ch. 7-8)

Why Correlation Matters

Ignoring within-subject correlation leads to:

  • Incorrect standard errors (the point estimates stay consistent; only the SEs are wrong)
  • Invalid hypothesis tests
  • Overstated significance

Proper models must distinguish within- vs. between-subject effects.

Clarification on Standard Errors (FLW pp. 43-44)

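Point estimates (\(\hat{\beta}\)) remain consistent when the mean model is correctly specified, even if correlation is ignored. The problem lies with the naive (independence-based) standard errors, which are inconsistent: they do not converge to the true sampling variability of \(\hat{\beta}\). With positive within-subject correlation, the naive standard error is typically too small for between-subject (time-invariant) covariates, which inflates test statistics and produces spuriously significant results. For within-subject (time-varying) effects, the naive standard error can instead be too large.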
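A small Monte Carlo sketch can make both halves of this statement concrete. It repeatedly simulates clustered data with a subject-level covariate, fits ordinary least squares, and compares the average estimate with the true slope and the average naive SE with the actual spread of the estimates. The settings (50 subjects, 8 visits, random-intercept variance 3, true slope 0.5) and the helper one_rep are assumptions chosen for illustration.

```r
set.seed(2024)
n_sub <- 50; n_vis <- 8
var_b <- 3      # assumed random-intercept variance
beta  <- 0.5    # assumed true slope

one_rep <- function() {
  id  <- rep(seq_len(n_sub), each = n_vis)
  x_i <- rnorm(n_sub)                      # one covariate value per subject
  b_i <- rnorm(n_sub, 0, sqrt(var_b))      # subject random intercept
  x   <- x_i[id]
  y   <- 1 + beta * x + b_i[id] + rnorm(n_sub * n_vis)
  s   <- summary(lm(y ~ x))$coefficients
  c(est = s["x", "Estimate"], naive_se = s["x", "Std. Error"])
}

res <- replicate(1000, one_rep())          # 2 x 1000 matrix

mean_est   <- mean(res["est", ])           # close to beta: estimate is fine
emp_sd     <- sd(res["est", ])             # true sampling variability
mean_naive <- mean(res["naive_se", ])      # what the naive SE reports

print(c(mean_est = mean_est, emp_sd = emp_sd, mean_naive = mean_naive))
```

In this setup the average estimate sits near 0.5, while the naive SE is well below the empirical standard deviation of the estimates.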

See It: Naive vs. Cluster SE

The cluster bootstrap here is a pedagogical illustration of the concept. The methods FLW and this course actually use to fix correlation-induced SE problems are correctly specified covariance/mixed models (Ch. 7-8) and the robust sandwich/empirical estimator (GEE, Ch. 13), not the bootstrap.

library(ggplot2)

set.seed(667)

# Simulate clustered data: 50 subjects, 8 visits each,
# strong positive within-subject correlation (shared random intercept).
# The covariate of interest is a BETWEEN-subject (time-invariant) exposure:
# one value per subject, so it is confounded with the cluster structure and
# ignoring within-subject correlation understates its SE.
n_sub <- 50; n_vis <- 8; var_b <- 3   # subject-level variance
sim <- do.call(rbind, lapply(1:n_sub, function(i) {
  b_i <- rnorm(1, 0, sqrt(var_b))      # subject random intercept
  x_i <- rnorm(1)                      # subject-level exposure (one value per subject)
  y   <- 1 + 0.5 * x_i + b_i + rnorm(n_vis)
  data.frame(id = i, x = x_i, y = y)
}))

fit <- lm(y ~ x, data = sim)
naive_se <- summary(fit)$coefficients["x", "Std. Error"]

# Cluster bootstrap: resample whole subjects (keeps within-subject correlation)
boot_b <- replicate(500, {
  ids <- sample(unique(sim$id), replace = TRUE)
  d   <- do.call(rbind, lapply(ids, function(j) sim[sim$id == j, ]))
  coef(lm(y ~ x, data = d))["x"]
})
cluster_se <- sd(boot_b)

se_tab <- data.frame(SE = c("Naive (independence)", "Cluster bootstrap"),
                     value = c(naive_se, cluster_se))

ggplot(se_tab, aes(x = SE, y = value, fill = SE)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = round(value, 3)), vjust = -0.4) +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(title = "Naive SE understates true uncertainty",
       subtitle = "Independence SE is the smaller bar; the cluster bootstrap respects within-subject correlation and is the honest one",
       x = NULL, y = "Standard error of slope")

See It: The Takeaway

For the subject-level exposure x, at set.seed(667):

  • Naive (independence) SE: 0.105
  • Cluster bootstrap SE: 0.215

The honest cluster SE (0.215) is about twice the naive SE (0.105).

Warning

Ignoring the positive within-subject correlation makes the naive SE too small, so test statistics are inflated and results look more significant than they are. Because the exposure is fixed within a subject, the 8 visits give far less independent information than 400 rows suggest.
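A back-of-the-envelope check on "far less independent information" is sketched below. It assumes a balanced design, an exchangeable (random-intercept) correlation, and a covariate that is constant within subject; under those assumptions the variance inflation relative to independence is \(1 + (n - 1)\rho\), often called the design effect. The values come from the simulation settings (var_b = 3, unit residual variance, 8 visits); the names deff, se_ratio, and n_eff are illustrative.

```r
n_sub <- 50
n_vis <- 8
var_b <- 3   # subject-level variance used in the simulation
var_e <- 1   # residual variance: rnorm(n_vis) has unit variance

rho      <- var_b / (var_b + var_e)   # within-subject correlation: 0.75
deff     <- 1 + (n_vis - 1) * rho     # variance inflation: 6.25
se_ratio <- sqrt(deff)                # true SE / naive SE: 2.5
n_eff    <- n_sub * n_vis / deff      # effective sample size: 64

print(c(rho = rho, deff = deff, se_ratio = se_ratio, n_eff = n_eff))
```

Under these assumptions, the 400 rows carry roughly the information of 64 independent observations for this exposure. The ratio of about 2 seen in the simulation is in the same range; a single data set with 50 clusters and a 500-replicate bootstrap will not reproduce the theoretical value exactly.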

Scientific Questions

Longitudinal analysis addresses:

  • Describe: Mean trajectories, individual variation
  • Explain: How covariates shift trajectories
  • Predict: Future outcomes for individuals/populations
  • Compare: Treatment effects over time
  • Account for dependence: Valid inference with correlation

Longitudinal vs. Clustered Data

Type           Definition                         Example
Longitudinal   Repeated measures over time        FEV1 across visits
Clustered      Units within higher-level groups   Students in classrooms

Techniques overlap, but time structure adds constraints and opportunities.
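One way to see what time structure adds is to compare two textbook correlation patterns for four measurements. These two structures are chosen here for illustration and are not prescribed by this lecture: an exchangeable pattern, where every pair of units in a cluster is equally correlated, and a first-order autoregressive (AR(1)) pattern, where correlation decays as visits get further apart. The value rho = 0.6 is an arbitrary assumption.

```r
n_vis <- 4
rho   <- 0.6   # assumed correlation parameter

# Exchangeable: order within the cluster carries no information
exch <- matrix(rho, n_vis, n_vis)
diag(exch) <- 1

# AR(1): correlation depends on the time lag |j - k| between visits
lag <- abs(outer(seq_len(n_vis), seq_len(n_vis), "-"))
ar1 <- rho^lag

print(exch)
print(round(ar1, 3))
```

Students in a classroom have no natural ordering, so a lag-based pattern such as AR(1) has no meaning there. Visits in a longitudinal study are ordered, which makes such patterns available to the analyst.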

Summary

  • What: repeated measures on the same subjects over time.
  • Why: efficient (own control), temporal ordering, direct change rates.
  • Catch: within-subject correlation corrupts SEs (estimates stay consistent).
  • Use: describe, explain, predict, and compare trajectories.

Canonical Resources

  • Author slides: content.sph.harvard.edu/fitzmaur/ala2e/
  • Datasets: content.sph.harvard.edu/fitzmaur/ala2e/datasets.html
  • Sample R code: content.sph.harvard.edu/fitzmaur/ala2e/SampleR.html

For Next Time

Read Chapter 2 of Fitzmaurice, Laird & Ware (2011)

  • Lecture 2 (Ch. 2): longitudinal data structures, notation, and the sources of within-subject correlation