Collaboration Policy: You may discuss problems with classmates, but all submitted work must be your own.
AI Use Policy: AI tools may be used for code assistance. Disclose any AI use in your submission.
Submission: Submit a rendered HTML file to Gradescope by the due date.
Due Date: See Canvas for due date
Grading Rubric
Category
Weight
Description
Interpretation & Reasoning
40%
Conceptual understanding, correct interpretations
Analysis & Setup
40%
Code correctness, model specification
Clarity & Presentation
20%
Writing quality, formatting, organization
Setup
Data Description: Epilepsy Clinical Trial
Throughout this homework, we will use data from a placebo-controlled clinical trial of 59 epileptics (Thall and Vail, 1990). Patients with partial seizures were randomized to receive either progabide or placebo as an adjuvant to standard anti-epileptic chemotherapy. Seizure counts were recorded during the 8-week baseline period and during four subsequent 2-week intervals.
# Load and prepare the epilepsy data with download fallbackdata_url <-"https://content.sph.harvard.edu/fitzmaur/ala2e/epilepsy.txt"data_file <-if (file.exists("../../data/epilepsy.txt")) "../../data/epilepsy.txt"else"data/epilepsy.txt"if (!file.exists(data_file)) {dir.create("data", showWarnings =FALSE)download.file(data_url, data_file)}epilepsy_raw <-read.table( data_file,skip =36, # Skip the header textcol.names =c("id", "trt", "age", "baseline", "y1", "y2", "y3", "y4"))# Convert to long format for longitudinal analysisepilepsy <- epilepsy_raw %>%pivot_longer(cols = y1:y4, names_to ="visit", values_to ="seizures") %>%mutate(visit_num =as.numeric(gsub("y", "", visit)),trt =factor(trt, labels =c("Placebo", "Progabide")),log_baseline =log(baseline /4+0.5), # log rate per 2 weekslog_age =log(age) ) %>%arrange(id, visit_num)# Quick look at the data structurehead(epilepsy, 12)
Part (a) - WRITE: Marginal vs. Conditional Residuals (5 pts)
Explain the difference between marginal residuals and conditional residuals in the context of linear mixed-effects models. When would you use each type of residual for diagnostic purposes?
Your answer here
Part (b) - CODE: Fit LMM and Extract Residuals (5 pts)
Fit a random intercept model predicting log_seizures from visit_num, trt, their interaction, and log_baseline. Extract both marginal and conditional residuals.
# Your code here
Part (c) - WRITE: Interpret Residual Plots (5 pts)
Examine the residual plots below and describe what each plot reveals about model adequacy. Are there any concerns?
# Your code here to create residual diagnostic plots# Include: residuals vs fitted, residuals vs time, QQ plot, histogram
Your answer here
Part (d) - CODE: Check Random Effects Normality (5 pts)
Extract the random effects (BLUPs) and create a QQ plot to assess normality. Interpret the results.
# Your code here
Your answer here
Question 2: Influence Diagnostics for Mixed Models (20 points)
Part (a) - WRITE: Level-1 vs. Level-2 Outliers (5 pts)
In mixed models, we distinguish between Level-1 outliers (unusual observations) and Level-2 outliers (unusual clusters/subjects). Explain this distinction and why cluster-level influence diagnostics are particularly important in longitudinal data analysis.
Your answer here
Part (b) - CODE: Cook’s Distance-Like Measure (5 pts)
Implement a cluster deletion diagnostic by refitting the model without each subject and computing the change in fixed effect estimates. Identify the most influential subjects.
TipHint: Cluster Deletion Diagnostic Algorithm
This involves iteratively removing each subject and measuring the impact on parameter estimates. Here is the general approach:
Pseudocode:
1. Fit the full model to all data and extract fixed effects (beta_full)
2. Get list of unique subject IDs
3. Create empty storage for results (e.g., a matrix)
4. For each subject i in the data:
a. Create dataset excluding subject i
b. Refit the model on this reduced dataset
c. Extract fixed effects (beta_minus_i)
d. Calculate change: delta_i = beta_full - beta_minus_i
e. Store the change (or a summary like Cook's D-like statistic)
5. Identify subjects with largest influence
R code structure:
# Get unique subject IDssubjects <-unique(epilepsy$id)# Storage for influence measuresinfluence_results <-data.frame(id = subjects,delta_intercept =NA,delta_trt =NA,cooks_d =NA)# Loop over subjectsfor (i inseq_along(subjects)) {# Exclude subject i data_minus_i <- epilepsy[epilepsy$id != subjects[i], ]# Refit model (use try() to handle any fitting issues) fit_minus_i <-try(lme(...), silent =TRUE)if (!inherits(fit_minus_i, "try-error")) {# Calculate influence measures# Store results... }}
Note: This can take some time to run since you are refitting the model ~59 times.
# Your code here
Part (c) - WRITE: Interpret Influential Subjects (5 pts)
Examine the data for the 2-3 most influential subjects identified above. What characteristics make them influential?
# Your code here to examine the most influential subjects
Your answer here
Part (d) - CODE: Sensitivity Analysis (5 pts)
Refit the model excluding the most influential subject and compare the fixed effect estimates to the full model. Discuss whether conclusions change.
# Your code here
Your answer here
Question 3: Generalized Linear Models Review and Marginal Models (20 points)
Part (a) - WRITE: Link Functions and Variance Functions (5 pts)
For count data like seizure counts, explain why a Poisson regression model might be appropriate. Specify the:
Canonical link function
Variance function
Mean-variance relationship
Also discuss what overdispersion means and why it is a concern for the epilepsy data.
Your answer here
Part (b) - CODE: Fit Marginal Model with GEE (5 pts)
Fit a marginal model using GEE for the seizure count data with a Poisson family, log link, and exchangeable working correlation structure. Include visit_num, trt, their interaction, and log_baseline as predictors.
# Your code here
Part (c) - WRITE: Interpret GEE Coefficients (5 pts)
Interpret the following coefficients from your GEE model in terms of seizure rates:
The treatment main effect (trtProgabide)
The treatment-by-time interaction (visit_num:trtProgabide)
The baseline effect (log_baseline)
# Your code here to extract and display coefficients
Your answer here
Part (d) - CODE: Compare Working Correlation Structures (5 pts)
Fit GEE models with independence, AR(1), and unstructured working correlation structures. Compare the estimated coefficients and robust standard errors across models.
# Your code here
Your answer here
Question 4: Robust Standard Errors and GEE Inference (20 points)
Part (a) - WRITE: Sandwich Estimator Derivation (5 pts)
The robust (sandwich) variance estimator for GEE is: \[\widehat{\text{Var}}(\hat{\beta}) = B_0^{-1} B_1 B_0^{-1}\]
Explain what each component represents:
\(B_0\) (the “bread”)
\(B_1\) (the “meat”)
Why is this estimator called “robust” and what is it robust to?
Your answer here
Part (b) - CODE: Model-Based vs. Robust Standard Errors (5 pts)
For the exchangeable GEE model from Question 3, extract and compare the model-based (naive) and robust standard errors. Discuss the implications of any differences.
TipHint: Extracting Standard Errors from geepack
The geepack::geeglm() function stores both naive and robust variance estimates. After fitting a model called gee_fit, you can access them via:
# Summary gives robust SEs by defaultsummary(gee_fit)# For naive (model-based) SEs, access the geese object:# Naive variance-covariance matrixnaive_vcov <- gee_fit$geese$vbeta.naiv# Robust variance-covariance matrixrobust_vcov <- gee_fit$geese$vbeta# Convert to standard errorsnaive_se <-sqrt(diag(naive_vcov))robust_se <-sqrt(diag(robust_vcov))
# Your code here
Your answer here
Part (c) - WRITE: Small Sample Corrections (5 pts)
With only 59 subjects in the epilepsy study, small sample bias in the sandwich estimator may be a concern. Describe two approaches to address this issue and explain when they might be preferred.
Your answer here
Part (d) - CODE: Apply Small Sample Correction (5 pts)
Apply a small sample correction to the GEE analysis using an appropriate R package or manual implementation. Compare the corrected and uncorrected results.
# Your code here
Your answer here
Question 5: Model Selection and Comprehensive Analysis (20 points)
Part (a) - WRITE: GEE vs. GLMM (5 pts)
Compare and contrast GEE (marginal models) and GLMMs (conditional models) for analyzing the epilepsy data. Address:
Interpretation of coefficients
Assumptions required
When each approach is preferred
Your answer here
Part (b) - CODE: Fit GLMM for Comparison (5 pts)
Fit a Poisson GLMM with random intercepts using lme4::glmer(). Compare the fixed effect estimates to the GEE results and explain any differences.
# Your code here
Your answer here
Part (c) - CODE: Model Diagnostics for GLMM (5 pts)