BIOS 667 - Lecture 11: Generalized Linear Models (Ch. 11)

Fitzmaurice, Laird & Ware (2011) - Applied Longitudinal Data Analysis

Naim Rashid

Lecture Objectives

By the end of this lecture, you will be able to:

Describe the three components of a GLM (random, systematic, link)
Fit and interpret logistic regression for binary outcomes
Fit and interpret Poisson regression for counts, including overdispersion
Apply ordinal and multinomial regression for categorical outcomes (instructor extension)
Diagnose GLM fit using residuals, influence measures, and deviance

Note

Assessment map. This lecture is assessed on HW4 (logistic and Poisson interpretation, overdispersion remedies) and Quiz 4 (GLM components, odds ratios, rate ratios).

Prerequisites and Recall

Before we generalize, recall what we already have:

From	What carries forward
LM (Ch. 3-6)	Linear predictor \(X_{ij}\boldsymbol\beta\); least squares; residual diagnostics
L10 (Diagnostics)	Residuals, transformed/Mahalanobis residual diagnostics, cluster-influence ideas
LM (Ch. 3-6)	Ordinary-regression diagnostics: leverage, Cook’s distance
Notation box	\(Y_{ij}\), \(\mathbf{X}_i\), \(\boldsymbol\beta\), indices \(i=1,\dots,N\) subjects

The leap today: the response need not be Normal, and effects act on a transformed (link) scale, not directly on the mean.

Roadmap

Part	Topic	Day
1	Introduction to GLMs	1
2	Logistic regression (binary)	1
3	Poisson regression (counts)	1
4	Ordinal and multinomial models (extension)	2
5	GLM diagnostics	2
6	Extensions and advanced topics (extension)	2
7	Appendices (technical details)	-

Notation reference

Symbol	Meaning
\(Y_{ij}\)	response for subject \(i\) at occasion \(j\) (scalar)
\(\mathbf{Y}_i\)	response vector for subject \(i\) \((n_i \times 1)\)
\(\mathbf{X}_i,\ \boldsymbol\beta\)	fixed-effect design \(n_i \times p\) and coefficients
\(\mathbf{Z}_i,\ \mathbf{b}_i\)	random-effect design and random effects, \(\mathbf{b}_i \sim N(0, G)\)
\(G\)	between-subject random-effects covariance (FLW’s letter, Ch. 8)
\(R_i\)	within-subject residual covariance, \(\boldsymbol\varepsilon_i \sim N(0, R_i)\)
\(\Sigma_i = \mathbf{Z}_i G \mathbf{Z}_i^\top + R_i\)	marginal (total) covariance \(\operatorname{Cov}(\mathbf{Y}_i)\) for subject \(i\)
\(\rho\)	correlation parameter (AR(1), exchangeable)

Indices: \(i\) subject \(1,\dots,N\); \(j\) occasion/time \(1,\dots,n_i\) (\(n\) when balanced); \(k\) state (transition models); \(g\) group. Bold = vector or matrix.

Notation (this lecture)

Symbol	Meaning
\(\pi_{ij}\)	probability of a binary outcome, subject \(i\) occasion \(j\)
\(\theta_i\)	natural parameter (a re-parameterization of the mean)
\(\eta_{ij} = X_{ij}\boldsymbol\beta\)	linear predictor (row form \(X_{ij}\), \(1\times p\))
\(g(\mu)=\eta\)	link function; \(\mu=g^{-1}(\eta)\) is the inverse link
\(\phi\)	dispersion (scale) parameter; \(\mathrm{Var}(Y)=\phi\,v(\mu)\)
\(v(\mu)\)	GLM mean-variance function (NOT a covariance matrix)
\(\kappa_k\)	ordinal cumulative-logit cutpoints, \(k=1,\dots,K-1\) (FLW writes \(\alpha_k\); we use \(\kappa_k\) to avoid the GEE \(\alpha\) and the natural parameter \(\theta\))

Warning

\(v(\mu)\) here is the scalar variance function of the mean (e.g. \(v(\mu)=\mu\) for Poisson), NOT the GEE working covariance \(V_i\) from the notation box.

Part 1 (Day 1): Introduction to GLMs

Framework, exponential family, link functions, estimation. Pace: ~20-25 min.

Why Generalized Linear Models?

Linear Model (LM) Assumption	GLM Extension
Normally distributed \(Y\)	Exponential-family distributions
Constant variance	Variance depends on the mean
Additive effects on the mean	Additive effects on a transformed (link) scale

Non-normal outcomes: binary, counts, proportions, rates.

GLM Framework: the idea first

Two plain sentences before any density:

\(\theta_i\) is a re-parameterization of the mean, called the natural parameter.
\(\phi\) is a scale / dispersion constant that sets the variance level.

Everything you need follows from one cumulant function:

Quantity	Formula
Mean	\(\mathbb{E}(Y_i) = \mu_i\)
Variance	\(\mathrm{Var}(Y_i) = \phi\,V(\mu_i)\)

GLM Framework: the exponential family

Each response \(Y_i\) has a density in the exponential family. Using FLW’s notation (eq. 11.3, p. 328):

\[ f(y_i \mid \theta_i,\phi) = \exp\!\left[ \frac{y_i\theta_i - a(\theta_i)}{\phi} + b(y_i,\phi) \right] \]

with \(\mathbb{E}(Y_i)=\mu_i=a'(\theta_i)\) and \(\mathrm{Var}(Y_i)=\phi\,a''(\theta_i)\).

Note

Notation caution. FLW (eq. 11.3-11.4) writes the cumulant function as \(a(\theta)\) and the data-only term as \(b(y,\phi)\), with the scale entering as \(\phi\) directly. Many texts (McCullagh & Nelder) instead write \(\{y\theta - b(\theta)\}/a(\phi) + c(y,\phi)\). The roles are the same; only the letters differ. We follow FLW here.

Exponential Family: a concrete example

For Poisson with mean \(\mu\) (here \(\phi=1\)):

Natural parameter: \(\theta = \log(\mu)\)
Cumulant function: \(a(\theta) = e^\theta = \mu\)
\(a'(\theta) = e^\theta = \mu\) (the mean)
\(a''(\theta) = e^\theta = \mu\) (the variance)

Insight: for Poisson, mean \(=\) variance. This falls out of the family automatically.

GLM Components

Component	Description	Example
Random	\(Y_i\) from exponential family	binary, count
Systematic	Linear predictor \(\eta_{ij} = X_{ij}\boldsymbol{\beta}\)	\(\eta=\beta_0+\beta_1 x\)
Link	\(g(\mu_i)=\eta_i\)	logit, log, identity

Here \(X_{ij}\) is the row of covariates (\(1\times p\)), so the mean model is \(X_{ij}\boldsymbol\beta\).

Canonical Links

Distribution	Mean-Variance	Canonical Link
Normal	\(\mu,\;\sigma^2\)	identity
Binomial	\(\pi,\;\pi(1-\pi)\)	logit
Poisson	\(\mu,\;\mu\)	log
Gamma	\(\mu,\;\mu^2\)	inverse

Estimation: maximum likelihood

GLM coefficients \(\boldsymbol\beta\) are estimated by maximum likelihood (FLW Section 11.7). FLW states only this; the algorithm below is standard GLM theory.

Note

Beyond FLW Ch. 11 (instructor extension). The IRLS algorithm and the score / Fisher-information formulas are standard GLM theory (McCullagh & Nelder 1989; Dobson & Barnett 2018). FLW Ch. 11 only states that estimation is by ML. The intuition slides that follow are supplementary; full details are in Appendix A-B.

IRLS: The Intuition

Think of IRLS as “smart” weighted regression that learns as it goes:

Step	What Happens	Analogy
1. Initialize	Start with a guess (e.g. \(\hat\mu_i=\bar y\))	First draft
2. Compute weights	Upweight observations we are more certain about	Focus attention
3. Fit weighted regression	Solve linear system with current weights	Refine estimate
4. Update predictions	New \(\hat\mu_i\) from updated \(\hat{\boldsymbol\beta}\)	Revised draft
5. Repeat	Until estimates stabilize	Final answer

Key insight: higher-variance observations get lower weights, so uncertain points have less influence.

IRLS: Why Iteration?

Unlike OLS, GLMs have a nonlinear link between parameters and predictions.

OLS: one-shot solution \((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y}\).

GLM: weights depend on \(\mu_i\), which depends on \(\boldsymbol\beta\), which depends on the weights…

Solution: iterate to convergence. Start somewhere reasonable, improve step by step. In R, glm() handles this automatically (it warns if there are convergence problems).

Check Your Understanding: GLM Fundamentals

What are the three components of a GLM, and what does each do?
Why does the Poisson distribution have mean equal to variance? (Hint: the exponential family.)
In IRLS, why iterate rather than compute a one-shot solution like OLS?

Answers:

1. Random component: the distribution of \(Y_i\) (binomial, Poisson, …). (2) Systematic component: the linear predictor \(\eta_{ij}=X_{ij}\boldsymbol\beta\). (3) Link: \(g(\mu_i)=\eta_i\) connects the mean to the linear predictor.
For Poisson the cumulant function is \(a(\theta)=e^\theta\). Mean \(=a'(\theta)=e^\theta=\mu\) and variance \(=a''(\theta)=e^\theta=\mu\), so they are equal.
The weights depend on \(\mu_i\), which depends on \(\boldsymbol\beta\). There is no closed form, so we alternate between updating weights and updating \(\boldsymbol\beta\) until convergence.

Part 2 (Day 1): Logistic Regression

Binary outcomes: model, odds ratios, fitted probabilities, real-data case study. Pace: ~25-30 min.

Interpreting Logistic Coefficients

Model on the log-odds scale:

\[ \log\!\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p \]

Quantity	Interpretation
\(\beta_j\)	Change in log-odds per unit increase in \(x_j\)
\(e^{\beta_j}\)	Odds ratio (OR)
\(e^{\beta_j} > 1\)	Odds of \(Y=1\) increase
\(e^{\beta_j} < 1\)	Odds of \(Y=1\) decrease

Case Study: TLC Lead-Exposure Trial (real data)

Note

Real FLW dataset (instructor-defined binary endpoint). TLC is FLW’s lead-exposure trial (Ch. 5/6): 100 children randomized to succimer or placebo, blood lead measured at weeks 0, 1, 4, 6. FLW Ch. 5/6 analyze it as a continuous longitudinal response. Here we define an instructor binary endpoint – elevated lead at week 6 (\(\ge 20\ \mu g/dL\)) – to illustrate logistic regression on real data. The cutoff is a teaching choice, not an FLW analysis.

tlc <- tlc |>
  dplyr::mutate(elevated_w6 = as.integer(w6 >= 20))   # 1 = elevated lead at wk 6
table(Treatment = tlc$trt, Elevated_wk6 = tlc$elevated_w6)

Case Study: TLC Lead-Exposure Trial (real data)

          Elevated_wk6
Treatment   0  1
  Placebo  13 37
  Succimer 27 23

Logistic GLM: Fit

fit_logit <- glm(elevated_w6 ~ trt + w0, data = tlc, family = binomial)
summary(fit_logit)

Logistic GLM: Fit


Call:
glm(formula = elevated_w6 ~ trt + w0, family = binomial, data = tlc)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -6.07274    1.64543  -3.691 0.000224 ***
trtSuccimer -1.70470    0.52826  -3.227 0.001251 ** 
w0           0.28754    0.06802   4.227 2.37e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 134.602  on 99  degrees of freedom
Residual deviance:  99.233  on 97  degrees of freedom
AIC: 105.23

Number of Fisher Scoring iterations: 5

Logistic: Odds Ratios and CIs

or_tab <- broom::tidy(fit_logit, conf.int = TRUE) |>
  dplyr::mutate(OR = exp(estimate),
                OR_low = exp(conf.low),
                OR_high = exp(conf.high)) |>
  dplyr::select(term, estimate, OR, OR_low, OR_high, p.value)
or_tab

Logistic: Odds Ratios and CIs

# A tibble: 3 × 6
  term        estimate      OR    OR_low OR_high   p.value
  <chr>          <dbl>   <dbl>     <dbl>   <dbl>     <dbl>
1 (Intercept)   -6.07  0.00230 0.0000702  0.0467 0.000224 
2 trtSuccimer   -1.70  0.182   0.0606     0.490  0.00125  
3 w0             0.288 1.33    1.18       1.54   0.0000237

Plain-Language Interpretation: Logistic Output

Term	OR	Interpretation
`(Intercept)`	–	Baseline log-odds when covariates are 0
`trtSuccimer`	0.18	Succimer multiplies the odds of elevated lead at wk 6 by 0.18 vs placebo
`w0`	1.33	Each 1 \(\mu g/dL\) higher baseline lead multiplies the odds by 1.33

In a sentence: adjusting for baseline lead, the succimer odds ratio for remaining elevated at week 6 is 0.18 (below 1, i.e. lower odds than placebo), and the baseline-lead odds ratio is 1.33 (above 1, i.e. higher baseline raises the odds).

From Coefficients to Predicted Probabilities

Logistic inverse link (row form):

\[ \hat\pi_{ij} = \frac{e^{X_{ij}\hat{\boldsymbol\beta}}}{1 + e^{X_{ij}\hat{\boldsymbol\beta}}} \]

In R, predict(fit, type = "response") returns \(\hat\pi_{ij}\).

tlc$pred_prob <- predict(fit_logit, type = "response")
head(tlc[, c("trt", "w0", "elevated_w6", "pred_prob")])

From Coefficients to Predicted Probabilities

       trt   w0 elevated_w6 pred_prob
1  Placebo 30.8           1 0.9417841
2 Succimer 26.5           1 0.4607010
3 Succimer 25.8           1 0.4112497
4  Placebo 24.7           1 0.7368379
5 Succimer 20.4           0 0.1288079
6 Succimer 20.4           0 0.1288079

Visualize Fitted Probabilities

newdat <- expand.grid(
  trt = levels(tlc$trt),
  w0  = seq(min(tlc$w0), max(tlc$w0), length = 100)
)
newdat$pred_prob <- predict(fit_logit, newdata = newdat, type = "response")
ggplot(newdat, aes(w0, pred_prob, color = trt)) +
  geom_line(linewidth = 1.2) +
  labs(y = "Predicted P(elevated at wk 6)", x = "Baseline lead (wk 0)",
       title = "Logistic Fitted Probabilities: TLC Trial",
       color = "Treatment") +
  theme_minimal()

Visualize Fitted Probabilities

Static SAS: PROC LOGISTIC (reference only)

The same TLC logistic model in SAS (FLW Section 11.6 uses PROC GENMOD/LOGISTIC):

proc logistic data=tlc descending;
   class trt (ref="Placebo") / param=ref;
   model elevated_w6 = trt w0;
   /* odds ratios printed automatically; oddsratio for custom contrasts */
run;

Note

Shown for reference only, not executed. descending models \(P(Y=1)\); class ... param=ref matches R’s reference coding.

Part 2b (Day 2 enrichment): Classification and Calibration

Note

Beyond FLW Ch. 11 (instructor extension). Classification thresholds, confusion matrices, sensitivity/specificity, ROC/AUC, calibration plots, the Hosmer-Lemeshow test, and residual/influence diagnostics for GLMs are not in FLW Ch. 11 (which covers odds-ratio interpretation, ML, and Wald tests for binary data). They are standard applied-logistic tools: see Hosmer, Lemeshow & Sturdivant (2013) and Pepe (2003) for ROC/calibration; Agresti (2015) for the modeling context. Recognize them; they are enrichment, not examinable Ch. 11 content.

Using Probabilities for Classification

Choose a threshold \(t\) and predict:

\[ \widehat Y = \begin{cases} 1, & \hat\pi \ge t \\ 0, & \text{otherwise} \end{cases} \]

Default \(t=0.5\), but it can be tuned to the application’s costs.

tlc$pred_class_05 <- ifelse(tlc$pred_prob >= 0.5, 1, 0)
table(Predicted = tlc$pred_class_05, Observed = tlc$elevated_w6)

Using Probabilities for Classification

         Observed
Predicted  0  1
        0 25 15
        1 15 45

mean(tlc$pred_class_05 == tlc$elevated_w6)   # accuracy at t = 0.5

[1] 0.7

Sensitivity, Specificity, and FPR

Metric	Definition
Sensitivity (TPR)	\(P(\widehat Y=1 \mid Y=1)\)
Specificity (TNR)	\(P(\widehat Y=0 \mid Y=0)\)
False positive rate (FPR)	\(1 - \text{Specificity} = P(\widehat Y=1 \mid Y=0)\)

Varying the threshold traces an ROC curve (TPR on the \(y\)-axis vs FPR on the \(x\)-axis).

ROC Curve and AUC

suppressPackageStartupMessages(library(pROC))
roc_obj <- roc(tlc$elevated_w6, tlc$pred_prob, quiet = TRUE)
plot(roc_obj, main = paste0("TLC Logistic ROC (AUC = ", round(auc(roc_obj), 3), ")"))

Note

Reading it: a curve hugging the top-left corner = good discrimination; the diagonal = no better than chance (AUC 0.5). AUC = probability a random positive is ranked above a random negative.

ROC Curve and AUC

Interpreting AUC

Note

Beyond FLW (rule-of-thumb thresholds). These cutoffs follow Hosmer, Lemeshow & Sturdivant (2013); they are conventions, not FLW content, and are sensitive to context.

AUC Range	Discrimination (Hosmer-Lemeshow convention)
0.5	No discrimination
0.7-0.8	Acceptable
0.8-0.9	Excellent
> 0.9	Outstanding (rare)

Calibration: Predicted vs Observed

dd <- tlc |>
  dplyr::mutate(dec = cut(pred_prob,
                          breaks = quantile(pred_prob, probs = seq(0, 1, 0.1), na.rm = TRUE),
                          include.lowest = TRUE)) |>
  dplyr::group_by(dec) |>
  dplyr::summarise(mean_pred = mean(pred_prob),
                   obs_rate  = mean(elevated_w6),
                   n = dplyr::n(), .groups = "drop")
ggplot(dd, aes(mean_pred, obs_rate, size = n)) +
  geom_point(alpha = 0.8) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "TLC Logistic Calibration (deciles of predicted risk)",
       x = "Mean predicted probability", y = "Observed event rate") +
  theme_minimal()

Note

Reading it: points on the dashed 45-degree line = well calibrated (predicted matches observed). Points above the line = model under-predicts risk; below = over-predicts. A high AUC does NOT guarantee good calibration.

Calibration: Predicted vs Observed

Logistic Diagnostics: Residuals

res_df <- data.frame(
  fitted = fitted(fit_logit),
  dev    = residuals(fit_logit, type = "deviance"),
  pear   = residuals(fit_logit, type = "pearson")
)
p1 <- ggplot(res_df, aes(fitted, dev)) + geom_point(alpha = .6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Deviance Residuals vs Fitted") + theme_bw()
p2 <- ggplot(res_df, aes(fitted, pear)) + geom_point(alpha = .6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Pearson Residuals vs Fitted") + theme_bw()
print(p1); print(p2)

Note

Reading it: for binary data residuals split into two bands (one per outcome), so do not expect a random cloud. Look for points far from 0 (poorly fit observations) and any systematic curvature (possible link misspecification), not for constant spread.

Logistic Diagnostics: Residuals

Influence and Leverage

lev <- hatvalues(fit_logit)
cd  <- cooks.distance(fit_logit)
par(mfrow = c(1, 2))
plot(lev, type = "h", main = "Leverage (hat values)", ylab = "hat")
abline(h = 2 * mean(lev), col = "red", lty = 2)
plot(cd, type = "h", main = "Cook's Distance", ylab = "Cook's D")
abline(h = 4 / length(cd), col = "red", lty = 2)

Note

Reading it: the red lines are common rules of thumb – leverage above \(2\bar h\) (twice the average hat value) is high-leverage; Cook’s D above \(4/n\) flags an influential point. Points crossing both deserve a look (refit without them to gauge sensitivity).

Influence and Leverage

par(mfrow = c(1, 1))

Goodness-of-Fit: Hosmer-Lemeshow

if (requireNamespace("ResourceSelection", quietly = TRUE)) {
  hl <- ResourceSelection::hoslem.test(tlc$elevated_w6, fitted(fit_logit), g = 10)
  hl
} else {
  cat("Install 'ResourceSelection' for the Hosmer-Lemeshow test.\n")
}

Note

Reading it: large p-value = no evidence of poor fit; small p-value = misfit. Caveat: the result is sensitive to the number of groups \(g\) (try a few). This is an enrichment GOF tool; FLW’s native fit assessment is deviance / likelihood-ratio tests.

Goodness-of-Fit: Hosmer-Lemeshow


    Hosmer and Lemeshow goodness of fit (GOF) test

data:  tlc$elevated_w6, fitted(fit_logit)
X-squared = 5.4331, df = 8, p-value = 0.7104

Logistic Models: Quick Checklist

Step	Action
Coefficients	Exponentiate to odds ratios
Predictions	Use `type="response"` for probabilities
Classification*	Choose a threshold by context
Discrimination*	ROC / AUC
Calibration*	Predicted vs observed
Diagnostics*	Residuals, leverage, influence

* enrichment beyond FLW Ch. 11.

Check Your Understanding: Logistic Regression

If \(\hat\beta = 0.7\) for treatment, what is the odds ratio and how do you read it?
If \(P(Y=1)=0.2\), what are the odds?
An AUC of 0.75 means what in plain English?

Answers:

OR \(=e^{0.7}=2.01\): the odds of the outcome in the treatment group are about twice the odds in control, adjusting for other covariates.
Odds \(=0.2/0.8=0.25\) (“1 to 4”).
Pick one positive and one negative at random; there is a 75% chance the model gives the positive the higher predicted probability (acceptable discrimination).

Part 3 (Day 1): Poisson Regression

Count outcomes: rates, offsets, overdispersion, quasi-Poisson and NB remedies, on real data. Pace: ~25-30 min.

Transition to Count Outcomes

Counts (number of events) call for the Poisson family.

Aspect	Consideration
Distribution	Poisson: \(\mathrm{Var}(Y)=\mu\)
Rate modeling	use an `offset(log(exposure))`
Common issue	overdispersion (\(\mathrm{Var}>\mu\))
Remedies	quasi-Poisson, negative binomial

Case Study: Epilepsy Seizure-Count Trial (real data)

Note

Real FLW dataset (Thall & Vail 1990; FLW Ch. 11/13/14). 59 epileptics randomized to progabide or placebo. Baseline = seizure count over an 8-week run-in; outcome = total seizures over the four 2-week post-randomization visits (8 weeks). We model the rate with an offset(log(8)) for the 8-week exposure window.

str(ep[, c("id", "trtf", "age", "base", "total")])

Case Study: Epilepsy Seizure-Count Trial (real data)

'data.frame':   59 obs. of  5 variables:
 $ id   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ trtf : Factor w/ 2 levels "Placebo","Progabide": 1 1 1 1 1 1 1 1 1 1 ...
 $ age  : int  31 30 25 36 22 29 31 36 37 28 ...
 $ base : int  11 11 6 8 66 27 12 52 23 10 ...
 $ total: int  14 14 11 13 55 22 12 95 22 33 ...

summary(ep$total)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   11.50   16.00   33.08   36.00  302.00

Fit Poisson GLM (with offset)

ep$logexp <- log(8)   # 8 post-randomization weeks of observation
fit_pois <- glm(total ~ trtf + base + age + offset(logexp),
                data = ep, family = poisson)
summary(fit_pois)

Fit Poisson GLM (with offset)


Call:
glm(formula = total ~ trtf + base + age + offset(logexp), family = poisson, 
    data = ep)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)    0.0749927  0.1198240   0.626    0.531    
trtfProgabide -0.1928933  0.0464638  -4.151 3.30e-05 ***
base           0.0225023  0.0005121  43.945  < 2e-16 ***
age            0.0162527  0.0035126   4.627 3.71e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2119.60  on 58  degrees of freedom
Residual deviance:  569.28  on 55  degrees of freedom
AIC: 860.71

Number of Fisher Scoring iterations: 5

Poisson GLM: Interpret Rate Ratios

\(e^{\hat\beta_{\text{Progabide}}} =\) 0.82
\(e^{\hat\beta_{\text{base}}} =\) 1.023 per additional baseline seizure

Interpretation: adjusting for baseline count and age, progabide is associated with a seizure rate 0.82 times that of placebo (a 18% lower rate point estimate). Each extra baseline seizure multiplies the rate by 1.023.

Warning

We will see the SEs here are too small (invalid inference) because of overdispersion; do not over-read the significance until we correct it.

Plain-Language Interpretation: Poisson Output

Term	Rate Ratio	Interpretation
`(Intercept)`	–	Baseline log-rate (placebo, base=0, age=0)
`trtfProgabide`	0.82	Progabide vs placebo seizure-rate ratio
`base`	1.023	Each extra baseline seizure scales the rate
`age`	1.016	Per-year multiplicative change in rate
`offset(logexp)`	fixed	Adjusts for the 8-week exposure window

Note on the offset: the offset makes us model rates (events per unit time), not raw counts, so subjects with different follow-up are compared fairly.

Visualize Fitted Rates

newd <- expand.grid(
  age   = seq(min(ep$age), max(ep$age), length.out = 100),
  trtf  = levels(ep$trtf),
  base  = median(ep$base),
  logexp = log(8)
)
newd$pred <- predict(fit_pois, newdata = newd, type = "response")
ggplot(newd, aes(age, pred, color = trtf)) +
  geom_line(linewidth = 1.1) +
  labs(y = "Predicted seizures / 8 weeks", x = "Age",
       title = "Poisson GLM: Predicted Seizure Count (at median baseline)",
       color = "Treatment") +
  theme_minimal()

Visualize Fitted Rates

Poisson Diagnostics: Pearson Residuals

pp <- tibble(fitted = fitted(fit_pois),
             rpear  = residuals(fit_pois, type = "pearson"))
ggplot(pp, aes(fitted, rpear)) +
  geom_point(alpha = .6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Epilepsy Poisson: Pearson Residuals vs Fitted") +
  theme_bw()

Note

Reading it: under a correct Poisson model the Pearson residuals should sit roughly within \(\pm 2\) with constant spread. A fanning pattern (spread growing with the fitted value) and many residuals far beyond \(\pm 2\) signal overdispersion – the variance grows faster than the mean.

Poisson Diagnostics: Pearson Residuals

Checking for Overdispersion

phi_hat <- sum(residuals(fit_pois, type = "pearson")^2) / fit_pois$df.residual
phi_hat

If \(\hat\phi > 1\), the variance exceeds the mean (overdispersion). Here \(\hat\phi\) is far above 1, so the Poisson SEs are too small (invalid inference).

Remedy: quasi-Poisson or negative binomial.

Checking for Overdispersion

[1] 12.04789

Quasi-Poisson Model

fit_qp <- glm(total ~ trtf + base + age + offset(logexp),
              data = ep, family = quasipoisson)
summary(fit_qp)

Quasi-Poisson Model


Call:
glm(formula = total ~ trtf + base + age + offset(logexp), family = quasipoisson, 
    data = ep)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    0.074993   0.415910   0.180    0.858    
trtfProgabide -0.192893   0.161276  -1.196    0.237    
base           0.022502   0.001777  12.660   <2e-16 ***
age            0.016253   0.012192   1.333    0.188    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 12.04789)

    Null deviance: 2119.60  on 58  degrees of freedom
Residual deviance:  569.28  on 55  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

Compare Poisson vs Quasi-Poisson

Model	Dispersion	SE behavior	Consequence
Poisson	fixed \(=1\)	too small if \(\mathrm{Var}>\mu\)	inflated significance (invalid inference)
Quasi-Poisson	estimated \(\hat\phi\)	scaled by \(\sqrt{\hat\phi}\)	coefficients unchanged, SEs corrected

Negative Binomial

fit_nb <- glm.nb(total ~ trtf + base + age + offset(logexp), data = ep)
summary(fit_nb)

Note

NB models the variance as \(\mu + \mu^2/k\) (an extra parameter \(k\)), a more flexible mean-variance relationship than quasi-Poisson’s constant scaling. Both correct the SEs; they assume different variance forms.

Negative Binomial


Call:
glm.nb(formula = total ~ trtf + base + age + offset(logexp), 
    data = ep, init.theta = 3.328492494, link = log)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    0.019042   0.358599   0.053    0.958    
trtfProgabide -0.214854   0.153943  -1.396    0.163    
base           0.027621   0.002835   9.744   <2e-16 ***
age            0.011383   0.010683   1.066    0.287    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(3.3285) family taken to be 1)

    Null deviance: 179.749  on 58  degrees of freedom
Residual deviance:  63.647  on 55  degrees of freedom
AIC: 477.23

Number of Fisher Scoring iterations: 1

              Theta:  3.328 
          Std. Err.:  0.701 

 2 x log-likelihood:  -467.225

Static SAS: PROC GENMOD with offset (reference only)

The epilepsy Poisson rate model in SAS (FLW Section 11.6, Table 11.9 style):

data ep; set ep; logexp = log(8); run;

proc genmod data=ep;
   class trtf (ref="Placebo") / param=ref;
   model total = trtf base age / dist=poisson link=log offset=logexp;
   /* for overdispersion, add  dscale  to scale the SEs by sqrt(Pearson/df) */
run;

/* negative binomial alternative */
proc genmod data=ep;
   class trtf (ref="Placebo") / param=ref;
   model total = trtf base age / dist=negbin link=log offset=logexp;
run;

Note

Shown for reference only, not executed. offset=logexp mirrors R’s offset(log(8)); dscale is SAS’s overdispersion scale factor.

Check Your Understanding: Poisson Regression

What does \(\hat\phi = 2.5\) tell you about a Poisson model?
If \(\hat\beta=0.68\) for “chronic condition,” what is the rate ratio and its reading?
Key difference between quasi-Poisson and negative binomial?

Answers:

Overdispersion: observed variance about 2.5x the Poisson assumption (variance = mean). The Poisson SEs are too small, inflating significance.
Rate ratio \(=e^{0.68}=1.97\): the event rate is about 1.97x for those with the condition, holding other covariates fixed.
Quasi-Poisson scales the variance by a constant \(\phi\) (SEs \(\times\sqrt\phi\), coefficients unchanged); NB adds a parameter so \(\mathrm{Var}=\mu+\mu^2/k\). Both fix the SEs but assume different variance structures.

Part 4 (Day 2): Ordinal and Multinomial Models

Note

Beyond FLW Ch. 11 (instructor extension). Proportional-odds (ordinal) and multinomial regression are introduced briefly in FLW only via the cumulative-logit example; the modeling depth here (polr, multinom, PO assumption) follows McCullagh (1980) and Agresti (2015). Recognize the models; treat as enrichment. Pace: ~20 min.

Ordinal Response Setup

Ordered categories (e.g. symptom severity: Low \(<\) Med \(<\) High).

Proportional-odds model (POM), R/polr convention:

\[ \log\!\left(\frac{P(Y\le k)}{P(Y>k)}\right)=\kappa_k - X_{ij}\boldsymbol{\beta}, \qquad k=1,\dots,K-1 \]

Note

Sign convention (read carefully). The \(\kappa_k\) are \(K-1\) increasing cutpoints (FLW eq. (11.1), p. 313, writes these as \(\alpha_k\); we use \(\kappa_k\) to avoid the GEE working-correlation \(\alpha\) and the GLM natural parameter \(\theta\)). R’s polr uses the minus-sign form above, so a positive \(\beta\) pushes probability toward HIGHER categories. FLW’s eq. (11.1) uses the plus-sign form \(\mathrm{logit}\,P(Y\le k)=\alpha_k + X_{ij}\boldsymbol\beta\), under which the sign is reversed. FLW (p. 313) explicitly warns the two conventions differ. We use the polr (minus) form so code and slides agree.

Simulate Ordinal Data

Note

Simulated to isolate the proportional-odds mechanism (no real FLW ordinal example is loaded here; the real case studies are TLC and epilepsy above).

set.seed(667)
N <- 250
dat_ord <- tibble(id = 1:N, trt = rbinom(N, 1, 0.5), time = rnorm(N)) %>%
  mutate(eta = -0.5 + 1.1 * trt + 0.6 * time,
         p_low = plogis(1 - eta),
         p_med = plogis(3 - eta) - plogis(1 - eta),
         u = runif(N),
         ord = case_when(u < p_low ~ "L",
                         u < p_low + p_med ~ "M",
                         TRUE ~ "H"),
         trt = factor(trt, labels = c("Ctl", "Tx")),
         ord = factor(ord, ordered = TRUE, levels = c("L", "M", "H")))
head(dat_ord)

Simulate Ordinal Data

# A tibble: 6 × 8
     id trt     time     eta p_low p_med     u ord  
  <int> <fct>  <dbl>   <dbl> <dbl> <dbl> <dbl> <ord>
1     1 Ctl    0.761 -0.0437 0.740 0.215 0.909 M    
2     2 Ctl    1.37   0.320  0.664 0.272 0.532 L    
3     3 Tx    -1.46  -0.278  0.782 0.182 0.249 L    
4     4 Tx     0.302  0.781  0.554 0.347 0.916 H    
5     5 Tx    -1.16  -0.0989 0.750 0.207 0.251 L    
6     6 Tx     0.457  0.874  0.531 0.362 0.128 L

Fit Ordinal Logistic Model

fit_polr <- MASS::polr(ord ~ trt * time, data = dat_ord, Hess = TRUE)
summary(fit_polr)

Fit Ordinal Logistic Model

Call:
MASS::polr(formula = ord ~ trt * time, data = dat_ord, Hess = TRUE)

Coefficients:
              Value Std. Error t value
trtTx       0.45565     0.2943  1.5481
time        0.68829     0.2327  2.9576
trtTx:time -0.07056     0.2972 -0.2374

Intercepts:
    Value   Std. Error t value
L|M  1.1134  0.2233     4.9870
M|H  2.9010  0.3039     9.5453

Residual Deviance: 377.5479 
AIC: 387.5479

What the Coefficients Mean

Models the log-odds of lower categories (\(\le k\)) vs higher (\(>k\)). Under the minus-sign (polr) convention:

Sign of \(\beta_j\)	Effect
\(\beta_j > 0\)	Increases odds of HIGHER categories
\(\beta_j < 0\)	Increases odds of LOWER categories

Example: if \(\hat\beta_{\text{Tx}}=0.8\), then OR \(=e^{0.8}=2.23\): treatment has 2.23x the odds of being in a higher category, and one \(\beta\) applies across all cutpoints (the proportional-odds assumption).

Comparing Ordinal vs Multinomial

suppressPackageStartupMessages(library(nnet))
fit_multi <- multinom(ord ~ trt * time, data = dat_ord, trace = FALSE)
# AIC comparison: POM (shared beta) vs multinomial (separate beta per category)
data.frame(model = c("Ordinal POM", "Multinomial"),
           params = c(length(coef(fit_polr)) + length(fit_polr$zeta),
                      length(as.vector(coef(fit_multi)))),
           AIC = c(AIC(fit_polr), AIC(fit_multi)))

Note

Multinomial treats the categories as UNORDERED: it estimates a separate \(\boldsymbol\beta\) for each category (relative to a reference) and discards the ordering \(L<M<H\). That flexibility costs parameters (the count jumps, shown above). The POM is more parsimonious if proportional odds holds.

Comparing Ordinal vs Multinomial

        model params      AIC
1 Ordinal POM      5 387.5479
2 Multinomial      8 389.6573

Is Proportional Odds Reasonable?

A quick check: fit separate binary logits at each cutpoint and compare slopes (if PO holds, the slopes should be similar). A formal Brant test is also available.

# Cutpoint 1: P(Y > L);  Cutpoint 2: P(Y > M)
ge_M <- as.integer(dat_ord$ord >= "M")   # not Low
ge_H <- as.integer(dat_ord$ord >= "H")   # High
s1 <- coef(glm(ge_M ~ trt + time, data = dat_ord, family = binomial))
s2 <- coef(glm(ge_H ~ trt + time, data = dat_ord, family = binomial))
round(rbind(`cut: >L` = s1, `cut: >M` = s2), 3)

Note

Reading it: similar slope coefficients across the two cutpoints support proportional odds; very different slopes argue for multinomial (or a partial-PO model). Compare the trt and time columns across rows.

Is Proportional Odds Reasonable?

        (Intercept) trtTx  time
cut: >L      -1.062 0.362 0.617
cut: >M      -3.386 1.080 0.735

Predicted Category Probabilities

newo <- expand.grid(trt = c("Ctl", "Tx"), time = seq(-2, 2, length = 100))
preds <- predict(fit_polr, newdata = newo, type = "probs")
pred_df <- cbind(newo, preds) %>%
  pivot_longer(cols = c(L, M, H), names_to = "level", values_to = "prob")
ggplot(pred_df, aes(time, prob, color = level)) +
  geom_line(linewidth = 1) + facet_wrap(~trt) +
  labs(title = "POM Predicted Category Probabilities", y = "Probability") +
  theme_minimal()

Predicted Category Probabilities

Part 5 (Day 2): GLM Diagnostics

Note

Beyond FLW Ch. 11 (instructor extension). Deviance/Pearson residuals, leverage, Cook’s distance, and influence plots for GLMs are not defined in FLW Ch. 11; the chapter’s only fit tool is the overdispersion scale factor (Section 11.5). Leverage and Cook’s distance are the standard ordinary-regression diagnostics (LM, Ch. 3-6 / standard logistic-regression literature) extended to GLMs here; the \(2\bar h\) and \(4/n\) rules are stated locally below. Pace: ~15 min.

Diagnostic Goals for GLMs

Single reference table (used by both the logistic and Poisson diagnostics above):

Goal	What to Check	Tool
Link adequacy	Does \(g(\mu)\) linearize the relationship?	residual-vs-fitted curvature
Mean-variance	Does variance match the assumed \(v(\mu)\)?	Pearson residual spread, \(\hat\phi\)
Influential points	Outliers affecting estimates?	leverage (\(2\bar h\)), Cook’s D (\(4/n\))
Overdispersion	Variance exceeds model expectation?	\(\hat\phi=\) Pearson \(\chi^2/\)df
Goodness-of-fit	Does the model fit overall?	deviance / LRT (FLW); H-L (enrichment)

Deviance and Pearson Residuals

\[ r_i^{(P)} = \frac{y_i - \hat\mu_i}{\sqrt{\hat V(\hat\mu_i)}}, \qquad r_i^{(D)} = \operatorname{sign}(y_i-\hat\mu_i)\sqrt{2\left[\ell(y_i)-\ell(\hat\mu_i)\right]} \]

Residual Type	Based On
Pearson	scaled by estimated variance
Deviance	model log-likelihood contribution

Assessing Model Fit with Deviance

Residual deviance measures fit relative to the saturated model (the model with one parameter per observation, i.e. a perfect fit – the deviance is twice the log-likelihood gap between your model and that perfect fit).

Compare nested models via the likelihood-ratio test:

\[ D = 2(\ell_{\text{full}} - \ell_{\text{reduced}}) \sim \chi^2_{df} \]

For nested models, smaller deviance indicates better fit. (This is FLW-native, Section 11.7.)

Choosing the Link Function

Note

Beyond FLW Ch. 11 (instructor extension): complementary log-log and probit link choices are general GLM theory (McCullagh & Nelder 1989), not part of the Ch. 11 review.

Link	Interpretation	Common Use
Identity	direct mean	Normal
Logit	log-odds	binary
Log	log-rate	count
Probit	latent normal	binary / ordinal
Complementary log-log	asymmetric, rare events	survival-type data

Part 6 (Day 2): Extensions (read on your own)

Note

Beyond FLW Ch. 11 (instructor extension). Everything in Part 6 is enrichment: skim it, recognize the names, and know where to look. Pace: ~10 min, lecture-light.

Zero-Inflated and Hurdle Models

Note

Beyond FLW. FLW (p. 321) mentions zero-inflated Poisson (ZIP) but states a detailed treatment is “beyond the scope of this chapter”; hurdle models are not in FLW at all (see Zeileis, Kleiber & Jackman 2008).

When counts have far more zeros than Poisson allows, use a two-part model:

Part	Purpose
1	binary: structural zero vs “at risk”
2	count: conditional on being at risk

# Illustrative ZIP on the epilepsy counts (note: these are not zero-inflated,
# so this is a demonstration of syntax, not a recommended model here).
if (requireNamespace("pscl", quietly = TRUE)) {
  zi <- pscl::zeroinfl(total ~ trtf + base | 1, data = ep, dist = "poisson")
  round(coef(zi), 3)
} else {
  cat("Install 'pscl' for zero-inflated models.\n")
}

Zero-Inflated and Hurdle Models

  count_(Intercept) count_trtfProgabide          count_base    zero_(Intercept) 
              2.679              -0.206               0.022              -4.060

Quasi-Likelihood and Robust (Sandwich) SEs

Note

Beyond FLW Ch. 11 (instructor extension). Sandwich/robust SEs are the bridge to GEE (Ch. 12-13); here they are previewed (see Zeger & Liang 1986).

For misspecified-variance or correlated data, robust (sandwich) SEs give valid inference even if the variance model is wrong.

if (requireNamespace("sandwich", quietly = TRUE) &&
    requireNamespace("lmtest", quietly = TRUE)) {
  lmtest::coeftest(fit_pois, vcov = sandwich::sandwich)
} else {
  cat("Install 'sandwich' and 'lmtest' for robust SEs.\n")
}

Quasi-Likelihood and Robust (Sandwich) SEs


z test of coefficients:

                Estimate Std. Error z value Pr(>|z|)    
(Intercept)    0.0749927  0.3150986  0.2380   0.8119    
trtfProgabide -0.1928933  0.1717332 -1.1232   0.2613    
base           0.0225023  0.0012499 18.0034   <2e-16 ***
age            0.0162527  0.0099523  1.6331   0.1025    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

GEE Connection (Preview)

Single-level GLMs assume independent observations. Longitudinal data have within-subject correlation; Ch. 12-13 relax independence using GEE.

The GEE working covariance (NOT the true \(\mathrm{Var}(\mathbf{Y}_i)\)) is

\[ V_i = \mathbf{A}_i^{1/2}\,\operatorname{Corr}(\mathbf{Y}_i)\,\mathbf{A}_i^{1/2}, \]

where \(\mathbf{A}_i\) is the diagonal of marginal variances and \(\operatorname{Corr}(\mathbf{Y}_i)\) is a working correlation.

Note

Robust (sandwich) inference stays valid even if the working correlation is wrong. Standard GEE requires MCAR for valid inference; IPW-GEE extends to MAR (developed in the missing-data chapters, course L17-L18).

Model Selection and AIC

Note

Beyond FLW Ch. 11 (instructor extension): AIC-based model selection is general practice (Akaike 1974), not part of the Ch. 11 review, which uses deviance/LRT for nested comparisons.

\[ \text{AIC} = -2\ell + 2p \]

Smaller AIC = better fit/complexity tradeoff. Use the LRT (deviance) for nested models; AIC for non-nested comparison.

Common Pitfalls

Pitfall	Issue
Misinterpreting coefficients	odds/log-odds are not probabilities
Ignoring overdispersion	SEs too small, inflated significance
Log link with zero/negative values	model fails
Forgetting the offset	compares counts, not rates
Confusing nested vs non-nested comparison	wrong test (LRT vs AIC)
Misreading the ordinal sign convention	direction flips between polr and FLW

GLM Summary: Key Takeaways

Aspect	Key Point
Framework	GLMs extend LMs via a link + a mean-variance structure
Outcomes	binary, count, ordinal handled in one framework
FLW core	logistic ORs, Poisson rate ratios, overdispersion scale factor, deviance/LRT
Enrichment	ROC/AUC, calibration, H-L, ordinal/multinomial, ZIP, sandwich, AIC

Next Lecture: Marginal Models (Ch. 12); GEE estimation follows in Ch. 13.

Part 7: Appendices (Technical Details)

Beyond FLW Ch. 11; reference only.

Appendix A: IRLS Algorithm Details

Note

Beyond FLW Ch. 11. Standard GLM theory (McCullagh & Nelder 1989).

Each iteration solves a weighted least-squares problem:

\[ \boldsymbol{\beta}^{(t+1)} = (\mathbf{X}^\top W^{(t)} \mathbf{X})^{-1} \mathbf{X}^\top W^{(t)} z^{(t)} \]

with working response and weights

\[ z^{(t)} = \eta^{(t)} + (y - \mu^{(t)}) \frac{d\eta}{d\mu}, \qquad W^{(t)} = \operatorname{diag}\!\left[\left(\tfrac{d\mu}{d\eta}\right)^2 \big/ v(\mu)\right]. \]

Here \(d\mu/d\eta\) is the slope of the inverse link; \(v(\mu)\) is the variance function.

Appendix B: Score Function and Fisher Information

Note

Beyond FLW Ch. 11. Standard GLM theory; below assumes the canonical link with unit dispersion (\(\phi=1\)). The general forms carry \(\phi\) and the \((\partial\mu_i/\partial\eta_i)^2\) weight.

For the canonical link (rows \(X_{ij}\), \(1\times p\)):

\[ U(\boldsymbol{\beta}) = \sum_i \sum_j X_{ij}^\top (Y_{ij} - \mu_{ij}), \qquad I(\boldsymbol{\beta}) = \sum_i \sum_j \frac{X_{ij}^\top X_{ij}}{\operatorname{Var}(Y_{ij})} . \]

Used for standard errors and likelihood inference.

Appendix C: Exponential Family Examples

Using FLW’s labeling: cumulant function \(a(\theta)\), data-only term \(b(y,\phi)\), scale \(\phi\).

Distribution	\(\phi\)	\(a(\theta)\) (cumulant)	\(b(y,\phi)\) (data-only)
Normal	\(\sigma^2\)	\(\theta^2/2\)	\(-y^2/(2\sigma^2) - \tfrac12\log(2\pi\sigma^2)\)
Binomial	1	\(\log(1+e^\theta)\)	\(\log\binom{n}{y}\)
Poisson	1	\(e^\theta\)	\(-\log(y!)\)

Recall \(\mathbb{E}(Y)=a'(\theta)\) and \(\mathrm{Var}(Y)=\phi\,a''(\theta)\).