tlc <- tlc |>
dplyr::mutate(elevated_w6 = as.integer(w6 >= 20)) # 1 = elevated lead at wk 6
table(Treatment = tlc$trt, Elevated_wk6 = tlc$elevated_w6)Fitzmaurice, Laird & Ware (2011) - Applied Longitudinal Data Analysis
By the end of this lecture, you will be able to:
Note
Assessment map. This lecture is assessed on HW4 (logistic and Poisson interpretation, overdispersion remedies) and Quiz 4 (GLM components, odds ratios, rate ratios).
Before we generalize, recall what we already have:
| From | What carries forward |
|---|---|
| LM (Ch. 3-6) | Linear predictor \(X_{ij}\boldsymbol\beta\); least squares; residual diagnostics |
| L10 (Diagnostics) | Residuals, transformed/Mahalanobis residual diagnostics, cluster-influence ideas |
| LM (Ch. 3-6) | Ordinary-regression diagnostics: leverage, Cook’s distance |
| Notation box | \(Y_{ij}\), \(\mathbf{X}_i\), \(\boldsymbol\beta\), indices \(i=1,\dots,N\) subjects |
The leap today: the response need not be Normal, and effects act on a transformed (link) scale, not directly on the mean.
| Part | Topic | Day |
|---|---|---|
| 1 | Introduction to GLMs | 1 |
| 2 | Logistic regression (binary) | 1 |
| 3 | Poisson regression (counts) | 1 |
| 4 | Ordinal and multinomial models (extension) | 2 |
| 5 | GLM diagnostics | 2 |
| 6 | Extensions and advanced topics (extension) | 2 |
| 7 | Appendices (technical details) | - |
| Symbol | Meaning |
|---|---|
| \(Y_{ij}\) | response for subject \(i\) at occasion \(j\) (scalar) |
| \(\mathbf{Y}_i\) | response vector for subject \(i\) \((n_i \times 1)\) |
| \(\mathbf{X}_i,\ \boldsymbol\beta\) | fixed-effect design \(n_i \times p\) and coefficients |
| \(\mathbf{Z}_i,\ \mathbf{b}_i\) | random-effect design and random effects, \(\mathbf{b}_i \sim N(0, G)\) |
| \(G\) | between-subject random-effects covariance (FLW’s letter, Ch. 8) |
| \(R_i\) | within-subject residual covariance, \(\boldsymbol\varepsilon_i \sim N(0, R_i)\) |
| \(\Sigma_i = \mathbf{Z}_i G \mathbf{Z}_i^\top + R_i\) | marginal (total) covariance \(\operatorname{Cov}(\mathbf{Y}_i)\) for subject \(i\) |
| \(\rho\) | correlation parameter (AR(1), exchangeable) |
Indices: \(i\) subject \(1,\dots,N\); \(j\) occasion/time \(1,\dots,n_i\) (\(n\) when balanced); \(k\) state (transition models); \(g\) group. Bold = vector or matrix.
| Symbol | Meaning |
|---|---|
| \(\pi_{ij}\) | probability of a binary outcome, subject \(i\) occasion \(j\) |
| \(\theta_i\) | natural parameter (a re-parameterization of the mean) |
| \(\eta_{ij} = X_{ij}\boldsymbol\beta\) | linear predictor (row form \(X_{ij}\), \(1\times p\)) |
| \(g(\mu)=\eta\) | link function; \(\mu=g^{-1}(\eta)\) is the inverse link |
| \(\phi\) | dispersion (scale) parameter; \(\mathrm{Var}(Y)=\phi\,v(\mu)\) |
| \(v(\mu)\) | GLM mean-variance function (NOT a covariance matrix) |
| \(\kappa_k\) | ordinal cumulative-logit cutpoints, \(k=1,\dots,K-1\) (FLW writes \(\alpha_k\); we use \(\kappa_k\) to avoid the GEE \(\alpha\) and the natural parameter \(\theta\)) |
Warning
\(v(\mu)\) here is the scalar variance function of the mean (e.g. \(v(\mu)=\mu\) for Poisson), NOT the GEE working covariance \(V_i\) from the notation box.
Framework, exponential family, link functions, estimation. Pace: ~20-25 min.
| Linear Model (LM) Assumption | GLM Extension |
|---|---|
| Normally distributed \(Y\) | Exponential-family distributions |
| Constant variance | Variance depends on the mean |
| Additive effects on the mean | Additive effects on a transformed (link) scale |
Non-normal outcomes: binary, counts, proportions, rates.
Two plain sentences before any density:
Everything you need follows from one cumulant function:
| Quantity | Formula |
|---|---|
| Mean | \(\mathbb{E}(Y_i) = \mu_i\) |
| Variance | \(\mathrm{Var}(Y_i) = \phi\,V(\mu_i)\) |
Each response \(Y_i\) has a density in the exponential family. Using FLW’s notation (eq. 11.3, p. 328):
\[ f(y_i \mid \theta_i,\phi) = \exp\!\left[ \frac{y_i\theta_i - a(\theta_i)}{\phi} + b(y_i,\phi) \right] \]
with \(\mathbb{E}(Y_i)=\mu_i=a'(\theta_i)\) and \(\mathrm{Var}(Y_i)=\phi\,a''(\theta_i)\).
Note
Notation caution. FLW (eq. 11.3-11.4) writes the cumulant function as \(a(\theta)\) and the data-only term as \(b(y,\phi)\), with the scale entering as \(\phi\) directly. Many texts (McCullagh & Nelder) instead write \(\{y\theta - b(\theta)\}/a(\phi) + c(y,\phi)\). The roles are the same; only the letters differ. We follow FLW here.
For Poisson with mean \(\mu\) (here \(\phi=1\)):
Insight: for Poisson, mean \(=\) variance. This falls out of the family automatically.
| Component | Description | Example |
|---|---|---|
| Random | \(Y_i\) from exponential family | binary, count |
| Systematic | Linear predictor \(\eta_{ij} = X_{ij}\boldsymbol{\beta}\) | \(\eta=\beta_0+\beta_1 x\) |
| Link | \(g(\mu_i)=\eta_i\) | logit, log, identity |
Here \(X_{ij}\) is the row of covariates (\(1\times p\)), so the mean model is \(X_{ij}\boldsymbol\beta\).
| Distribution | Mean-Variance | Canonical Link |
|---|---|---|
| Normal | \(\mu,\;\sigma^2\) | identity |
| Binomial | \(\pi,\;\pi(1-\pi)\) | logit |
| Poisson | \(\mu,\;\mu\) | log |
| Gamma | \(\mu,\;\mu^2\) | inverse |
GLM coefficients \(\boldsymbol\beta\) are estimated by maximum likelihood (FLW Section 11.7). FLW states only this; the algorithm below is standard GLM theory.
Note
Beyond FLW Ch. 11 (instructor extension). The IRLS algorithm and the score / Fisher-information formulas are standard GLM theory (McCullagh & Nelder 1989; Dobson & Barnett 2018). FLW Ch. 11 only states that estimation is by ML. The intuition slides that follow are supplementary; full details are in Appendix A-B.
Think of IRLS as “smart” weighted regression that learns as it goes:
| Step | What Happens | Analogy |
|---|---|---|
| 1. Initialize | Start with a guess (e.g. \(\hat\mu_i=\bar y\)) | First draft |
| 2. Compute weights | Upweight observations we are more certain about | Focus attention |
| 3. Fit weighted regression | Solve linear system with current weights | Refine estimate |
| 4. Update predictions | New \(\hat\mu_i\) from updated \(\hat{\boldsymbol\beta}\) | Revised draft |
| 5. Repeat | Until estimates stabilize | Final answer |
Key insight: higher-variance observations get lower weights, so uncertain points have less influence.
Unlike OLS, GLMs have a nonlinear link between parameters and predictions.
OLS: one-shot solution \((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y}\).
GLM: weights depend on \(\mu_i\), which depends on \(\boldsymbol\beta\), which depends on the weights…
Solution: iterate to convergence. Start somewhere reasonable, improve step by step. In R, glm() handles this automatically (it warns if there are convergence problems).
Check Your Understanding: GLM Fundamentals
Answers:
Binary outcomes: model, odds ratios, fitted probabilities, real-data case study. Pace: ~25-30 min.
Model on the log-odds scale:
\[ \log\!\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p \]
| Quantity | Interpretation |
|---|---|
| \(\beta_j\) | Change in log-odds per unit increase in \(x_j\) |
| \(e^{\beta_j}\) | Odds ratio (OR) |
| \(e^{\beta_j} > 1\) | Odds of \(Y=1\) increase |
| \(e^{\beta_j} < 1\) | Odds of \(Y=1\) decrease |
Note
Real FLW dataset (instructor-defined binary endpoint). TLC is FLW’s lead-exposure trial (Ch. 5/6): 100 children randomized to succimer or placebo, blood lead measured at weeks 0, 1, 4, 6. FLW Ch. 5/6 analyze it as a continuous longitudinal response. Here we define an instructor binary endpoint – elevated lead at week 6 (\(\ge 20\ \mu g/dL\)) – to illustrate logistic regression on real data. The cutoff is a teaching choice, not an FLW analysis.
Elevated_wk6
Treatment 0 1
Placebo 13 37
Succimer 27 23
Call:
glm(formula = elevated_w6 ~ trt + w0, family = binomial, data = tlc)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.07274 1.64543 -3.691 0.000224 ***
trtSuccimer -1.70470 0.52826 -3.227 0.001251 **
w0 0.28754 0.06802 4.227 2.37e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 134.602 on 99 degrees of freedom
Residual deviance: 99.233 on 97 degrees of freedom
AIC: 105.23
Number of Fisher Scoring iterations: 5
# A tibble: 3 × 6
term estimate OR OR_low OR_high p.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -6.07 0.00230 0.0000702 0.0467 0.000224
2 trtSuccimer -1.70 0.182 0.0606 0.490 0.00125
3 w0 0.288 1.33 1.18 1.54 0.0000237
| Term | OR | Interpretation |
|---|---|---|
(Intercept) |
– | Baseline log-odds when covariates are 0 |
trtSuccimer |
0.18 | Succimer multiplies the odds of elevated lead at wk 6 by 0.18 vs placebo |
w0 |
1.33 | Each 1 \(\mu g/dL\) higher baseline lead multiplies the odds by 1.33 |
In a sentence: adjusting for baseline lead, the succimer odds ratio for remaining elevated at week 6 is 0.18 (below 1, i.e. lower odds than placebo), and the baseline-lead odds ratio is 1.33 (above 1, i.e. higher baseline raises the odds).
Logistic inverse link (row form):
\[ \hat\pi_{ij} = \frac{e^{X_{ij}\hat{\boldsymbol\beta}}}{1 + e^{X_{ij}\hat{\boldsymbol\beta}}} \]
In R, predict(fit, type = "response") returns \(\hat\pi_{ij}\).
trt w0 elevated_w6 pred_prob
1 Placebo 30.8 1 0.9417841
2 Succimer 26.5 1 0.4607010
3 Succimer 25.8 1 0.4112497
4 Placebo 24.7 1 0.7368379
5 Succimer 20.4 0 0.1288079
6 Succimer 20.4 0 0.1288079
newdat <- expand.grid(
trt = levels(tlc$trt),
w0 = seq(min(tlc$w0), max(tlc$w0), length = 100)
)
newdat$pred_prob <- predict(fit_logit, newdata = newdat, type = "response")
ggplot(newdat, aes(w0, pred_prob, color = trt)) +
geom_line(linewidth = 1.2) +
labs(y = "Predicted P(elevated at wk 6)", x = "Baseline lead (wk 0)",
title = "Logistic Fitted Probabilities: TLC Trial",
color = "Treatment") +
theme_minimal()The same TLC logistic model in SAS (FLW Section 11.6 uses PROC GENMOD/LOGISTIC):
proc logistic data=tlc descending;
class trt (ref="Placebo") / param=ref;
model elevated_w6 = trt w0;
/* odds ratios printed automatically; oddsratio for custom contrasts */
run;
Note
Shown for reference only, not executed. descending models \(P(Y=1)\); class ... param=ref matches R’s reference coding.
Note
Beyond FLW Ch. 11 (instructor extension). Classification thresholds, confusion matrices, sensitivity/specificity, ROC/AUC, calibration plots, the Hosmer-Lemeshow test, and residual/influence diagnostics for GLMs are not in FLW Ch. 11 (which covers odds-ratio interpretation, ML, and Wald tests for binary data). They are standard applied-logistic tools: see Hosmer, Lemeshow & Sturdivant (2013) and Pepe (2003) for ROC/calibration; Agresti (2015) for the modeling context. Recognize them; they are enrichment, not examinable Ch. 11 content.
Choose a threshold \(t\) and predict:
\[ \widehat Y = \begin{cases} 1, & \hat\pi \ge t \\ 0, & \text{otherwise} \end{cases} \]
Default \(t=0.5\), but it can be tuned to the application’s costs.
| Metric | Definition |
|---|---|
| Sensitivity (TPR) | \(P(\widehat Y=1 \mid Y=1)\) |
| Specificity (TNR) | \(P(\widehat Y=0 \mid Y=0)\) |
| False positive rate (FPR) | \(1 - \text{Specificity} = P(\widehat Y=1 \mid Y=0)\) |
Varying the threshold traces an ROC curve (TPR on the \(y\)-axis vs FPR on the \(x\)-axis).
Note
Reading it: a curve hugging the top-left corner = good discrimination; the diagonal = no better than chance (AUC 0.5). AUC = probability a random positive is ranked above a random negative.
Note
Beyond FLW (rule-of-thumb thresholds). These cutoffs follow Hosmer, Lemeshow & Sturdivant (2013); they are conventions, not FLW content, and are sensitive to context.
| AUC Range | Discrimination (Hosmer-Lemeshow convention) |
|---|---|
| 0.5 | No discrimination |
| 0.7-0.8 | Acceptable |
| 0.8-0.9 | Excellent |
| > 0.9 | Outstanding (rare) |
dd <- tlc |>
dplyr::mutate(dec = cut(pred_prob,
breaks = quantile(pred_prob, probs = seq(0, 1, 0.1), na.rm = TRUE),
include.lowest = TRUE)) |>
dplyr::group_by(dec) |>
dplyr::summarise(mean_pred = mean(pred_prob),
obs_rate = mean(elevated_w6),
n = dplyr::n(), .groups = "drop")
ggplot(dd, aes(mean_pred, obs_rate, size = n)) +
geom_point(alpha = 0.8) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
labs(title = "TLC Logistic Calibration (deciles of predicted risk)",
x = "Mean predicted probability", y = "Observed event rate") +
theme_minimal()Note
Reading it: points on the dashed 45-degree line = well calibrated (predicted matches observed). Points above the line = model under-predicts risk; below = over-predicts. A high AUC does NOT guarantee good calibration.
res_df <- data.frame(
fitted = fitted(fit_logit),
dev = residuals(fit_logit, type = "deviance"),
pear = residuals(fit_logit, type = "pearson")
)
p1 <- ggplot(res_df, aes(fitted, dev)) + geom_point(alpha = .6) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Deviance Residuals vs Fitted") + theme_bw()
p2 <- ggplot(res_df, aes(fitted, pear)) + geom_point(alpha = .6) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Pearson Residuals vs Fitted") + theme_bw()
print(p1); print(p2)Note
Reading it: for binary data residuals split into two bands (one per outcome), so do not expect a random cloud. Look for points far from 0 (poorly fit observations) and any systematic curvature (possible link misspecification), not for constant spread.
lev <- hatvalues(fit_logit)
cd <- cooks.distance(fit_logit)
par(mfrow = c(1, 2))
plot(lev, type = "h", main = "Leverage (hat values)", ylab = "hat")
abline(h = 2 * mean(lev), col = "red", lty = 2)
plot(cd, type = "h", main = "Cook's Distance", ylab = "Cook's D")
abline(h = 4 / length(cd), col = "red", lty = 2)Note
Reading it: the red lines are common rules of thumb – leverage above \(2\bar h\) (twice the average hat value) is high-leverage; Cook’s D above \(4/n\) flags an influential point. Points crossing both deserve a look (refit without them to gauge sensitivity).
Note
Reading it: large p-value = no evidence of poor fit; small p-value = misfit. Caveat: the result is sensitive to the number of groups \(g\) (try a few). This is an enrichment GOF tool; FLW’s native fit assessment is deviance / likelihood-ratio tests.
Hosmer and Lemeshow goodness of fit (GOF) test
data: tlc$elevated_w6, fitted(fit_logit)
X-squared = 5.4331, df = 8, p-value = 0.7104
| Step | Action |
|---|---|
| Coefficients | Exponentiate to odds ratios |
| Predictions | Use type="response" for probabilities |
| Classification* | Choose a threshold by context |
| Discrimination* | ROC / AUC |
| Calibration* | Predicted vs observed |
| Diagnostics* | Residuals, leverage, influence |
* enrichment beyond FLW Ch. 11.
Check Your Understanding: Logistic Regression
Answers:
Count outcomes: rates, offsets, overdispersion, quasi-Poisson and NB remedies, on real data. Pace: ~25-30 min.
Counts (number of events) call for the Poisson family.
| Aspect | Consideration |
|---|---|
| Distribution | Poisson: \(\mathrm{Var}(Y)=\mu\) |
| Rate modeling | use an offset(log(exposure)) |
| Common issue | overdispersion (\(\mathrm{Var}>\mu\)) |
| Remedies | quasi-Poisson, negative binomial |
Note
Real FLW dataset (Thall & Vail 1990; FLW Ch. 11/13/14). 59 epileptics randomized to progabide or placebo. Baseline = seizure count over an 8-week run-in; outcome = total seizures over the four 2-week post-randomization visits (8 weeks). We model the rate with an offset(log(8)) for the 8-week exposure window.
'data.frame': 59 obs. of 5 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ trtf : Factor w/ 2 levels "Placebo","Progabide": 1 1 1 1 1 1 1 1 1 1 ...
$ age : int 31 30 25 36 22 29 31 36 37 28 ...
$ base : int 11 11 6 8 66 27 12 52 23 10 ...
$ total: int 14 14 11 13 55 22 12 95 22 33 ...
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 11.50 16.00 33.08 36.00 302.00
Call:
glm(formula = total ~ trtf + base + age + offset(logexp), family = poisson,
data = ep)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0749927 0.1198240 0.626 0.531
trtfProgabide -0.1928933 0.0464638 -4.151 3.30e-05 ***
base 0.0225023 0.0005121 43.945 < 2e-16 ***
age 0.0162527 0.0035126 4.627 3.71e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2119.60 on 58 degrees of freedom
Residual deviance: 569.28 on 55 degrees of freedom
AIC: 860.71
Number of Fisher Scoring iterations: 5
Interpretation: adjusting for baseline count and age, progabide is associated with a seizure rate 0.82 times that of placebo (a 18% lower rate point estimate). Each extra baseline seizure multiplies the rate by 1.023.
Warning
We will see the SEs here are too small (invalid inference) because of overdispersion; do not over-read the significance until we correct it.
| Term | Rate Ratio | Interpretation |
|---|---|---|
(Intercept) |
– | Baseline log-rate (placebo, base=0, age=0) |
trtfProgabide |
0.82 | Progabide vs placebo seizure-rate ratio |
base |
1.023 | Each extra baseline seizure scales the rate |
age |
1.016 | Per-year multiplicative change in rate |
offset(logexp) |
fixed | Adjusts for the 8-week exposure window |
Note on the offset: the offset makes us model rates (events per unit time), not raw counts, so subjects with different follow-up are compared fairly.
newd <- expand.grid(
age = seq(min(ep$age), max(ep$age), length.out = 100),
trtf = levels(ep$trtf),
base = median(ep$base),
logexp = log(8)
)
newd$pred <- predict(fit_pois, newdata = newd, type = "response")
ggplot(newd, aes(age, pred, color = trtf)) +
geom_line(linewidth = 1.1) +
labs(y = "Predicted seizures / 8 weeks", x = "Age",
title = "Poisson GLM: Predicted Seizure Count (at median baseline)",
color = "Treatment") +
theme_minimal()Note
Reading it: under a correct Poisson model the Pearson residuals should sit roughly within \(\pm 2\) with constant spread. A fanning pattern (spread growing with the fitted value) and many residuals far beyond \(\pm 2\) signal overdispersion – the variance grows faster than the mean.
If \(\hat\phi > 1\), the variance exceeds the mean (overdispersion). Here \(\hat\phi\) is far above 1, so the Poisson SEs are too small (invalid inference).
Remedy: quasi-Poisson or negative binomial.
[1] 12.04789
Call:
glm(formula = total ~ trtf + base + age + offset(logexp), family = quasipoisson,
data = ep)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.074993 0.415910 0.180 0.858
trtfProgabide -0.192893 0.161276 -1.196 0.237
base 0.022502 0.001777 12.660 <2e-16 ***
age 0.016253 0.012192 1.333 0.188
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 12.04789)
Null deviance: 2119.60 on 58 degrees of freedom
Residual deviance: 569.28 on 55 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 5
| Model | Dispersion | SE behavior | Consequence |
|---|---|---|---|
| Poisson | fixed \(=1\) | too small if \(\mathrm{Var}>\mu\) | inflated significance (invalid inference) |
| Quasi-Poisson | estimated \(\hat\phi\) | scaled by \(\sqrt{\hat\phi}\) | coefficients unchanged, SEs corrected |
Note
NB models the variance as \(\mu + \mu^2/k\) (an extra parameter \(k\)), a more flexible mean-variance relationship than quasi-Poisson’s constant scaling. Both correct the SEs; they assume different variance forms.
Call:
glm.nb(formula = total ~ trtf + base + age + offset(logexp),
data = ep, init.theta = 3.328492494, link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.019042 0.358599 0.053 0.958
trtfProgabide -0.214854 0.153943 -1.396 0.163
base 0.027621 0.002835 9.744 <2e-16 ***
age 0.011383 0.010683 1.066 0.287
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(3.3285) family taken to be 1)
Null deviance: 179.749 on 58 degrees of freedom
Residual deviance: 63.647 on 55 degrees of freedom
AIC: 477.23
Number of Fisher Scoring iterations: 1
Theta: 3.328
Std. Err.: 0.701
2 x log-likelihood: -467.225
The epilepsy Poisson rate model in SAS (FLW Section 11.6, Table 11.9 style):
data ep; set ep; logexp = log(8); run;
proc genmod data=ep;
class trtf (ref="Placebo") / param=ref;
model total = trtf base age / dist=poisson link=log offset=logexp;
/* for overdispersion, add dscale to scale the SEs by sqrt(Pearson/df) */
run;
/* negative binomial alternative */
proc genmod data=ep;
class trtf (ref="Placebo") / param=ref;
model total = trtf base age / dist=negbin link=log offset=logexp;
run;
Note
Shown for reference only, not executed. offset=logexp mirrors R’s offset(log(8)); dscale is SAS’s overdispersion scale factor.
Check Your Understanding: Poisson Regression
Answers:
Note
Beyond FLW Ch. 11 (instructor extension). Proportional-odds (ordinal) and multinomial regression are introduced briefly in FLW only via the cumulative-logit example; the modeling depth here (polr, multinom, PO assumption) follows McCullagh (1980) and Agresti (2015). Recognize the models; treat as enrichment. Pace: ~20 min.
Ordered categories (e.g. symptom severity: Low \(<\) Med \(<\) High).
Proportional-odds model (POM), R/polr convention:
\[ \log\!\left(\frac{P(Y\le k)}{P(Y>k)}\right)=\kappa_k - X_{ij}\boldsymbol{\beta}, \qquad k=1,\dots,K-1 \]
Note
Sign convention (read carefully). The \(\kappa_k\) are \(K-1\) increasing cutpoints (FLW eq. (11.1), p. 313, writes these as \(\alpha_k\); we use \(\kappa_k\) to avoid the GEE working-correlation \(\alpha\) and the GLM natural parameter \(\theta\)). R’s polr uses the minus-sign form above, so a positive \(\beta\) pushes probability toward HIGHER categories. FLW’s eq. (11.1) uses the plus-sign form \(\mathrm{logit}\,P(Y\le k)=\alpha_k + X_{ij}\boldsymbol\beta\), under which the sign is reversed. FLW (p. 313) explicitly warns the two conventions differ. We use the polr (minus) form so code and slides agree.
Note
Simulated to isolate the proportional-odds mechanism (no real FLW ordinal example is loaded here; the real case studies are TLC and epilepsy above).
set.seed(667)
N <- 250
dat_ord <- tibble(id = 1:N, trt = rbinom(N, 1, 0.5), time = rnorm(N)) %>%
mutate(eta = -0.5 + 1.1 * trt + 0.6 * time,
p_low = plogis(1 - eta),
p_med = plogis(3 - eta) - plogis(1 - eta),
u = runif(N),
ord = case_when(u < p_low ~ "L",
u < p_low + p_med ~ "M",
TRUE ~ "H"),
trt = factor(trt, labels = c("Ctl", "Tx")),
ord = factor(ord, ordered = TRUE, levels = c("L", "M", "H")))
head(dat_ord)# A tibble: 6 × 8
id trt time eta p_low p_med u ord
<int> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <ord>
1 1 Ctl 0.761 -0.0437 0.740 0.215 0.909 M
2 2 Ctl 1.37 0.320 0.664 0.272 0.532 L
3 3 Tx -1.46 -0.278 0.782 0.182 0.249 L
4 4 Tx 0.302 0.781 0.554 0.347 0.916 H
5 5 Tx -1.16 -0.0989 0.750 0.207 0.251 L
6 6 Tx 0.457 0.874 0.531 0.362 0.128 L
Call:
MASS::polr(formula = ord ~ trt * time, data = dat_ord, Hess = TRUE)
Coefficients:
Value Std. Error t value
trtTx 0.45565 0.2943 1.5481
time 0.68829 0.2327 2.9576
trtTx:time -0.07056 0.2972 -0.2374
Intercepts:
Value Std. Error t value
L|M 1.1134 0.2233 4.9870
M|H 2.9010 0.3039 9.5453
Residual Deviance: 377.5479
AIC: 387.5479
Models the log-odds of lower categories (\(\le k\)) vs higher (\(>k\)). Under the minus-sign (polr) convention:
| Sign of \(\beta_j\) | Effect |
|---|---|
| \(\beta_j > 0\) | Increases odds of HIGHER categories |
| \(\beta_j < 0\) | Increases odds of LOWER categories |
Example: if \(\hat\beta_{\text{Tx}}=0.8\), then OR \(=e^{0.8}=2.23\): treatment has 2.23x the odds of being in a higher category, and one \(\beta\) applies across all cutpoints (the proportional-odds assumption).
suppressPackageStartupMessages(library(nnet))
fit_multi <- multinom(ord ~ trt * time, data = dat_ord, trace = FALSE)
# AIC comparison: POM (shared beta) vs multinomial (separate beta per category)
data.frame(model = c("Ordinal POM", "Multinomial"),
params = c(length(coef(fit_polr)) + length(fit_polr$zeta),
length(as.vector(coef(fit_multi)))),
AIC = c(AIC(fit_polr), AIC(fit_multi)))Note
Multinomial treats the categories as UNORDERED: it estimates a separate \(\boldsymbol\beta\) for each category (relative to a reference) and discards the ordering \(L<M<H\). That flexibility costs parameters (the count jumps, shown above). The POM is more parsimonious if proportional odds holds.
model params AIC
1 Ordinal POM 5 387.5479
2 Multinomial 8 389.6573
A quick check: fit separate binary logits at each cutpoint and compare slopes (if PO holds, the slopes should be similar). A formal Brant test is also available.
# Cutpoint 1: P(Y > L); Cutpoint 2: P(Y > M)
ge_M <- as.integer(dat_ord$ord >= "M") # not Low
ge_H <- as.integer(dat_ord$ord >= "H") # High
s1 <- coef(glm(ge_M ~ trt + time, data = dat_ord, family = binomial))
s2 <- coef(glm(ge_H ~ trt + time, data = dat_ord, family = binomial))
round(rbind(`cut: >L` = s1, `cut: >M` = s2), 3)Note
Reading it: similar slope coefficients across the two cutpoints support proportional odds; very different slopes argue for multinomial (or a partial-PO model). Compare the trt and time columns across rows.
(Intercept) trtTx time
cut: >L -1.062 0.362 0.617
cut: >M -3.386 1.080 0.735
newo <- expand.grid(trt = c("Ctl", "Tx"), time = seq(-2, 2, length = 100))
preds <- predict(fit_polr, newdata = newo, type = "probs")
pred_df <- cbind(newo, preds) %>%
pivot_longer(cols = c(L, M, H), names_to = "level", values_to = "prob")
ggplot(pred_df, aes(time, prob, color = level)) +
geom_line(linewidth = 1) + facet_wrap(~trt) +
labs(title = "POM Predicted Category Probabilities", y = "Probability") +
theme_minimal()Note
Beyond FLW Ch. 11 (instructor extension). Deviance/Pearson residuals, leverage, Cook’s distance, and influence plots for GLMs are not defined in FLW Ch. 11; the chapter’s only fit tool is the overdispersion scale factor (Section 11.5). Leverage and Cook’s distance are the standard ordinary-regression diagnostics (LM, Ch. 3-6 / standard logistic-regression literature) extended to GLMs here; the \(2\bar h\) and \(4/n\) rules are stated locally below. Pace: ~15 min.
Single reference table (used by both the logistic and Poisson diagnostics above):
| Goal | What to Check | Tool |
|---|---|---|
| Link adequacy | Does \(g(\mu)\) linearize the relationship? | residual-vs-fitted curvature |
| Mean-variance | Does variance match the assumed \(v(\mu)\)? | Pearson residual spread, \(\hat\phi\) |
| Influential points | Outliers affecting estimates? | leverage (\(2\bar h\)), Cook’s D (\(4/n\)) |
| Overdispersion | Variance exceeds model expectation? | \(\hat\phi=\) Pearson \(\chi^2/\)df |
| Goodness-of-fit | Does the model fit overall? | deviance / LRT (FLW); H-L (enrichment) |
\[ r_i^{(P)} = \frac{y_i - \hat\mu_i}{\sqrt{\hat V(\hat\mu_i)}}, \qquad r_i^{(D)} = \operatorname{sign}(y_i-\hat\mu_i)\sqrt{2\left[\ell(y_i)-\ell(\hat\mu_i)\right]} \]
| Residual Type | Based On |
|---|---|
| Pearson | scaled by estimated variance |
| Deviance | model log-likelihood contribution |
Residual deviance measures fit relative to the saturated model (the model with one parameter per observation, i.e. a perfect fit – the deviance is twice the log-likelihood gap between your model and that perfect fit).
Compare nested models via the likelihood-ratio test:
\[ D = 2(\ell_{\text{full}} - \ell_{\text{reduced}}) \sim \chi^2_{df} \]
For nested models, smaller deviance indicates better fit. (This is FLW-native, Section 11.7.)
Note
Beyond FLW Ch. 11 (instructor extension): complementary log-log and probit link choices are general GLM theory (McCullagh & Nelder 1989), not part of the Ch. 11 review.
| Link | Interpretation | Common Use |
|---|---|---|
| Identity | direct mean | Normal |
| Logit | log-odds | binary |
| Log | log-rate | count |
| Probit | latent normal | binary / ordinal |
| Complementary log-log | asymmetric, rare events | survival-type data |
Note
Beyond FLW Ch. 11 (instructor extension). Everything in Part 6 is enrichment: skim it, recognize the names, and know where to look. Pace: ~10 min, lecture-light.
Note
Beyond FLW. FLW (p. 321) mentions zero-inflated Poisson (ZIP) but states a detailed treatment is “beyond the scope of this chapter”; hurdle models are not in FLW at all (see Zeileis, Kleiber & Jackman 2008).
When counts have far more zeros than Poisson allows, use a two-part model:
| Part | Purpose |
|---|---|
| 1 | binary: structural zero vs “at risk” |
| 2 | count: conditional on being at risk |
# Illustrative ZIP on the epilepsy counts (note: these are not zero-inflated,
# so this is a demonstration of syntax, not a recommended model here).
if (requireNamespace("pscl", quietly = TRUE)) {
zi <- pscl::zeroinfl(total ~ trtf + base | 1, data = ep, dist = "poisson")
round(coef(zi), 3)
} else {
cat("Install 'pscl' for zero-inflated models.\n")
} count_(Intercept) count_trtfProgabide count_base zero_(Intercept)
2.679 -0.206 0.022 -4.060
Note
Beyond FLW Ch. 11 (instructor extension). Sandwich/robust SEs are the bridge to GEE (Ch. 12-13); here they are previewed (see Zeger & Liang 1986).
For misspecified-variance or correlated data, robust (sandwich) SEs give valid inference even if the variance model is wrong.
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0749927 0.3150986 0.2380 0.8119
trtfProgabide -0.1928933 0.1717332 -1.1232 0.2613
base 0.0225023 0.0012499 18.0034 <2e-16 ***
age 0.0162527 0.0099523 1.6331 0.1025
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Single-level GLMs assume independent observations. Longitudinal data have within-subject correlation; Ch. 12-13 relax independence using GEE.
The GEE working covariance (NOT the true \(\mathrm{Var}(\mathbf{Y}_i)\)) is
\[ V_i = \mathbf{A}_i^{1/2}\,\operatorname{Corr}(\mathbf{Y}_i)\,\mathbf{A}_i^{1/2}, \]
where \(\mathbf{A}_i\) is the diagonal of marginal variances and \(\operatorname{Corr}(\mathbf{Y}_i)\) is a working correlation.
Note
Robust (sandwich) inference stays valid even if the working correlation is wrong. Standard GEE requires MCAR for valid inference; IPW-GEE extends to MAR (developed in the missing-data chapters, course L17-L18).
Note
Beyond FLW Ch. 11 (instructor extension): AIC-based model selection is general practice (Akaike 1974), not part of the Ch. 11 review, which uses deviance/LRT for nested comparisons.
\[ \text{AIC} = -2\ell + 2p \]
Smaller AIC = better fit/complexity tradeoff. Use the LRT (deviance) for nested models; AIC for non-nested comparison.
| Pitfall | Issue |
|---|---|
| Misinterpreting coefficients | odds/log-odds are not probabilities |
| Ignoring overdispersion | SEs too small, inflated significance |
| Log link with zero/negative values | model fails |
| Forgetting the offset | compares counts, not rates |
| Confusing nested vs non-nested comparison | wrong test (LRT vs AIC) |
| Misreading the ordinal sign convention | direction flips between polr and FLW |
| Aspect | Key Point |
|---|---|
| Framework | GLMs extend LMs via a link + a mean-variance structure |
| Outcomes | binary, count, ordinal handled in one framework |
| FLW core | logistic ORs, Poisson rate ratios, overdispersion scale factor, deviance/LRT |
| Enrichment | ROC/AUC, calibration, H-L, ordinal/multinomial, ZIP, sandwich, AIC |
Next Lecture: Marginal Models (Ch. 12); GEE estimation follows in Ch. 13.
Beyond FLW Ch. 11; reference only.
Note
Beyond FLW Ch. 11. Standard GLM theory (McCullagh & Nelder 1989).
Each iteration solves a weighted least-squares problem:
\[ \boldsymbol{\beta}^{(t+1)} = (\mathbf{X}^\top W^{(t)} \mathbf{X})^{-1} \mathbf{X}^\top W^{(t)} z^{(t)} \]
with working response and weights
\[ z^{(t)} = \eta^{(t)} + (y - \mu^{(t)}) \frac{d\eta}{d\mu}, \qquad W^{(t)} = \operatorname{diag}\!\left[\left(\tfrac{d\mu}{d\eta}\right)^2 \big/ v(\mu)\right]. \]
Here \(d\mu/d\eta\) is the slope of the inverse link; \(v(\mu)\) is the variance function.
Note
Beyond FLW Ch. 11. Standard GLM theory; below assumes the canonical link with unit dispersion (\(\phi=1\)). The general forms carry \(\phi\) and the \((\partial\mu_i/\partial\eta_i)^2\) weight.
For the canonical link (rows \(X_{ij}\), \(1\times p\)):
\[ U(\boldsymbol{\beta}) = \sum_i \sum_j X_{ij}^\top (Y_{ij} - \mu_{ij}), \qquad I(\boldsymbol{\beta}) = \sum_i \sum_j \frac{X_{ij}^\top X_{ij}}{\operatorname{Var}(Y_{ij})} . \]
Used for standard errors and likelihood inference.
Using FLW’s labeling: cumulant function \(a(\theta)\), data-only term \(b(y,\phi)\), scale \(\phi\).
| Distribution | \(\phi\) | \(a(\theta)\) (cumulant) | \(b(y,\phi)\) (data-only) |
|---|---|---|---|
| Normal | \(\sigma^2\) | \(\theta^2/2\) | \(-y^2/(2\sigma^2) - \tfrac12\log(2\pi\sigma^2)\) |
| Binomial | 1 | \(\log(1+e^\theta)\) | \(\log\binom{n}{y}\) |
| Poisson | 1 | \(e^\theta\) | \(-\log(y!)\) |
Recall \(\mathbb{E}(Y)=a'(\theta)\) and \(\mathrm{Var}(Y)=\phi\,a''(\theta)\).
BIOS 667 · UNC Gillings - Lecture 11 (Ch.11)