2. Generalized linear models¶
Linear regression is the corner case of a generalized linear model (Nelder & Wedderburn, 1972) with an identity link and a normal likelihood. Loosening either gives you a much wider family: logistic regression for binary outcomes, Poisson regression for counts, ordinal regression for graded responses. This chapter does the first two end to end in QVR, with calibration checks on the posterior predictive.
The model template is the same in every case:
where \(g\) is the link function and \(\mathcal{F}\) is the response family. We pick \((g^{-1}, \mathcal{F})\) together.
Logistic regression¶
Binary response, sigmoid link, Bernoulli likelihood.
object Item : FinSet 200
program logistic : Item -> Item [effects=[Sample, Score]]
sample beta_0 <- Normal(0.0, 5.0)
sample beta_1 <- Normal(0.0, 2.0)
let logit = beta_0 + beta_1 * x_design
let p = sigmoid(logit)
observe y : Item <- Bernoulli(p)
return y
export logistic
with pm.Model() as model:
beta_0 = pm.Normal("beta_0", 0.0, 5.0)
beta_1 = pm.Normal("beta_1", 0.0, 2.0)
p = pm.invlogit(beta_0 + beta_1 * x_data)
y = pm.Bernoulli("y", p, observed=y_data)
def logistic(x, y=None):
beta_0 = numpyro.sample("beta_0", dist.Normal(0.0, 5.0))
beta_1 = numpyro.sample("beta_1", dist.Normal(0.0, 2.0))
logit = beta_0 + beta_1 * x
numpyro.sample("y", dist.Bernoulli(logits=logit), obs=y)
QVR's let-arithmetic exposes the standard nonlinearities so you can name the inverse link explicitly. The full builtin set:
| Builtin | Domain | Codomain | Use |
|---|---|---|---|
sigmoid(x) |
\(\mathbb{R}\) | \((0, 1)\) | logistic inverse link |
softmax(x) |
\(\mathbb{R}^K\) | simplex on \(K\) | categorical inverse link |
log_softmax(x) |
\(\mathbb{R}^K\) | log-simplex | numerically stable categorical link |
exp(x) |
\(\mathbb{R}\) | \(\mathbb{R}_{>0}\) | log inverse link (Poisson, Gamma) |
log(x) |
\(\mathbb{R}_{>0}\) | \(\mathbb{R}\) | forward log link |
softplus(x) |
\(\mathbb{R}\) | \(\mathbb{R}_{>0}\) | positive output without the exp blowup |
tanh(x) |
\(\mathbb{R}\) | \((-1, 1)\) | bounded nonlinearity |
relu(x) |
\(\mathbb{R}\) | \(\mathbb{R}_{\ge 0}\) | rectified linear |
clamp(x, lo, hi) |
\(\mathbb{R}\) | \([\text{lo}, \text{hi}]\) | constrain |
sum(x), prod(x), cumsum(x), logsumexp(x) |
tensor | scalar / tensor | reductions over an axis |
(A link function is the deterministic map \(g\) in \(g(\theta_i) = \beta^\top x_i\). We compose with its inverse \(g^{-1}\) to turn the linear predictor into the natural parameter of the response family.)
Fitting is the same as chapter 1:
import torch
from quivers.dsl import loads
from quivers.inference import AutoNormalGuide, ELBO, SVI
LOGISTIC_SRC = """
object Item : FinSet 200
program logistic : Item -> Item
sample beta_0 <- Normal(0.0, 5.0)
sample beta_1 <- Normal(0.0, 2.0)
let logit = beta_0 + beta_1 * x_design
let p = sigmoid(logit)
observe y : Item <- Bernoulli(p)
return y
export logistic
"""
program = loads(LOGISTIC_SRC)
model = program.morphism
torch.manual_seed(0)
x_data = torch.randn(200)
true_p = torch.sigmoid(0.3 + 1.5 * x_data)
y_data = torch.bernoulli(true_p)
guide = AutoNormalGuide(model, observed_names={"y", "x_design"})
elbo = ELBO(num_particles=1)
optimizer = torch.optim.Adam(
list(model.parameters()) + list(guide.parameters()), lr=5e-3,
)
svi = SVI(model, guide, optimizer, elbo)
x_tensor = torch.zeros(1, 1)
observations = {"x_design": x_data, "y": y_data}
for step in range(50): # bump to ~3000 for real fits
svi.step(x_tensor, observations)
Posterior-predictive calibration¶
A logistic regression is well-calibrated if, of all observations where the model predicts \(p = 0.7\), roughly 70% actually have \(y = 1\).
from scipy.stats import binom
# Average the success probability across guide draws of (beta_0, beta_1).
draws = [guide.rsample(x_tensor) for _ in range(500)]
b0 = torch.stack([d["beta_0"] for d in draws]).reshape(-1) # (500,)
b1 = torch.stack([d["beta_1"] for d in draws]).reshape(-1) # (500,)
p_hat = torch.sigmoid(b0[:, None] + b1[:, None] * x_data).mean(dim=0) # (200,)
# Bin and check empirical frequency with a 95% Wilson-style band.
bins = torch.linspace(0, 1, 11)
for lo, hi in zip(bins[:-1], bins[1:]):
mask = (p_hat >= lo) & (p_hat < hi)
n = int(mask.sum())
if n == 0:
continue
empirical = y_data[mask].float().mean().item()
midpoint = (lo + hi).item() / 2
band_lo, band_hi = binom.interval(0.95, n, midpoint)
in_band = "ok" if band_lo / n <= empirical <= band_hi / n else "miss"
print(f" [{lo:.1f}, {hi:.1f}) n={n:3d} predicted={midpoint:.2f}"
f" observed={empirical:.2f} [{band_lo/n:.2f}, {band_hi/n:.2f}] {in_band}")
Poisson regression¶
Count response, log link, Poisson likelihood.
object Item : FinSet 150
program poisson_reg : Item -> Item [effects=[Sample, Score]]
sample beta_0 <- Normal(0.0, 5.0)
sample beta_1 <- Normal(0.0, 2.0)
let log_rate = beta_0 + beta_1 * x_design
let rate = exp(log_rate)
observe y : Item <- Poisson(rate)
return y
export poisson_reg
The fit code is identical to the logistic case; only the model file changes. The Predictive machinery handles the count-valued likelihood with no special configuration.
Common mistakes¶
Most "why does this fail at runtime" questions trace to one of three slips:
observed_namesmismatch. If the body referencesx_designbut you passobserved_names={"y"}, the guide's trace has an unbound free name. The compiler reports it as a free-name error atloadstime only when the name is also free in the program signature. Otherwise the failure is delayed to the firstsvi.step. Rule of thumb: every name that appears in the body but isn't introduced by<-,let, ormarginalizebelongs inobserved_names.- Shape mismatch in host data. If
x_designhas shape(200,)but the objectItemwas declared: 100, the per-row broadcast overItemfails with a torch shape error. The compiler doesn't knowx_design's runtime shape, so this one surfaces at the first forward pass. Sanity-check tensor lengths against object cardinalities. - Re-using an
<-name on the LHS oflet. Once a name is bound by a sample step, you can't reassign it. To name a deterministic transform, pick a fresh name (let mu = beta_0 + beta_1 * x, notlet beta_0 = beta_0 + 1).
What the QVR surface gives you here¶
Three observations from this chapter you may have already noticed:
- No
pyro.plate/numpyro.platewrapping. Plates are inferred from object cardinalities and the domain/codomain typing. If you need an explicit indexed family (say a per-group intercept), chapter 3 introduces the plate-draw syntax (v : G <- Normal(0, sigma)). letis not sampling. PyMCpm.Deterministic, NumPyronumpyro.deterministic, Pyropyro.deterministic: every PPL has a different name for "this is a function of random variables, not itself random." QVR useslet. The compiler tracks the dependency for autograd.- One module, one program. Each
.qvrfile declares zero or more objects, optionalalgebra, and zero or more programs. To fit several models, write several files (or several program blocks in one file;loadsreturns the one tagged withexport).
Try this¶
- Add a constant offset (
offset) to the Poisson model:let log_rate = beta_0 + beta_1 * x + offset. The compiler treatsoffsetas a free variable and exposes it as a model input alongsidex. - Switch the Bernoulli likelihood to a Binomial with a known trial count
n_trials. The Binomial constructor takes(n, p). - Run NUTS instead of SVI. Skip ahead to chapter 6, line "Calling NUTS"; it's a five-line change.
Next¶
Chapter 3 covers hierarchical models: the place where mean-field VI starts to misbehave and where NUTS earns its keep.
References¶
- John A. Nelder and Robert W. M. Wedderburn. 1972. Generalized linear models. Journal of the Royal Statistical Society Series A: General, 135(3):370–384.