6. Choosing an inference algorithm¶
Quivers ships nine variational guides (Hoffman, Blei, Wang & Paisley, 2013 for the SVI substrate), four objectives, two MCMC kernels (HMC, Neal, 2011; NUTS, Hoffman & Gelman, 2014), two hybrid samplers, and four gradient estimators. Some combinations are obviously sensible; some are obvious mistakes; most fall between, and the right call depends on the shape of your posterior. This chapter is a field guide.
The mental model: you pick (a) a family (variational, MCMC, hybrid), then (b) a configuration within that family, then (c) gradient-estimator and tightness knobs. The choices roughly factor; we'll walk down the tree.
Step 1: VI, MCMC, or hybrid?¶
| Posterior shape | First-pass choice |
|---|---|
| Smooth, broadly unimodal, < 1000 latents | Variational inference. |
| Multimodal or pathologically curved | MCMC (NUTS). |
| Hierarchical with funnel-shaped posteriors | NUTS, or VI on the non-centered reparameterisation. |
| High-dim (>1000 latents) and continuous | VI, unless you have a lot of compute. |
| Mixed discrete-continuous with small discrete support | VI with a marginalize block (chapter 4); HMC over the continuous remainder. |
| You don't know yet | Start with AutoNormalGuide + ELBO. It's cheap; the failure modes are diagnostic. |
VI is fast and scales; MCMC is unbiased and respects curvature. Hybrids try to get both. The cost is real: NUTS on a 600-latent hierarchical model takes minutes per chain; AutoNormalGuide on the same model takes seconds. If the VI fit is good enough for your downstream task, ship it.
Step 2: Which variational guide?¶
flowchart TD
start(["Posterior shape"]) --> id{"Identifiable,<br/>weak correlation?"}
id -- yes --> normal["AutoNormalGuide<br/>default"]
id -- no --> corr{"Strong correlations?"}
corr -- yes --> dim{"Cholesky fits in memory?"}
dim -- yes --> mvn["AutoMultivariateNormalGuide"]
dim -- no --> lr["AutoLowRankMultivariateNormalGuide"]
corr -- no --> multi{"Multimodal?"}
multi -- yes --> mix["AutoMixtureGuide<br/>or AutoNormalizingFlow<br/>or AutoIAFGuide"]
multi -- no --> tails{"Non-Gaussian tails?"}
tails -- yes --> flow["AutoIAFGuide<br/>or AutoNeuralSplineGuide"]
tails -- no --> normal2["back to AutoNormalGuide"]
The shipped guides:
| Guide | Family | When |
|---|---|---|
AutoNormalGuide |
Diagonal Normal | Default. Cheap, scales, often good enough. |
AutoDeltaGuide |
Dirac at MAP | Quick MAP; no posterior uncertainty. |
AutoMultivariateNormalGuide |
Full-rank Normal (Cholesky) | Strong posterior correlations and total latent dimension small enough for a dense Cholesky to fit in memory. Memory and per-step cost both scale as \(O(D^2)\). |
AutoLowRankMultivariateNormalGuide |
Low-rank + diagonal | Hierarchical models with localised correlations; specify rank (typically 5-20). |
AutoLaplaceApproximation |
Gaussian centered at MAP with Hessian inverse | Post-hoc quadratic-around-MAP; cheap upgrade from AutoDeltaGuide. |
AutoNormalizingFlow |
Composed bijector over Normal base | Multimodal or heavy-tailed posteriors; you supply the flow stack. |
AutoIAFGuide |
Inverse autoregressive flow | Flagship NF default; good general-purpose non-Gaussian guide. |
AutoNeuralSplineGuide |
Rational-quadratic spline coupling | Sharper than IAF on bounded support. |
AutoMixtureGuide |
K-component mixture of guides | Multimodal posteriors; specify num_components. |
The Tier-1 through Tier-6 benchmark grid in tests/benchmarks/ walks these against synthetic posteriors with known answers; the report at docs/developer/inference-benchmarks.md is the empirical truth-table for which guide passes which kind of problem.
Step 3: Which objective?¶
| Objective | Bound | When |
|---|---|---|
ELBO(num_particles=1) |
Reparameterized ELBO | Default. |
ELBO(num_particles=K) |
K-sample ELBO | Slightly tighter; rarely worth the cost. |
IWAEBound(K) |
Importance-weighted bound | Tighter than ELBO for K > 1. |
RenyiBound(alpha, K) |
Rényi divergence bound | Interpolates ELBO (α=1) and IWAE (α=0). |
VRIWAEBound(alpha, K) |
Variational Rényi-IWAE | Mass-covering vs mode-seeking knob via α; tight via K. |
For most workflows, the default ELBO() is the right call. IWAEBound matters when the posterior is multimodal and the guide is flow-shaped (the importance weighting tightens the bound; coupling layers exploit it). RenyiBound is useful when you specifically want a mass-covering posterior (e.g. for Bayesian-optimization acquisition functions).
Step 4: Which gradient estimator?¶
| Estimator | What it does | When |
|---|---|---|
Reparameterized |
Standard pathwise gradient | Default. |
StickingTheLanding |
Detaches variational params in log q |
When q → p* and you're getting noisy gradients at convergence. |
DoublyReparameterized |
DReG for IWAE | With IWAEBound(K) at large K. |
ScoreFunction |
REINFORCE | Discrete (non-reparameterizable) latents you can't marginalize. |
Pair IWAEBound(num_particles=16, estimator=DoublyReparameterized()) for the flagship "tight bound, low-variance gradients" combination. K (the num_particles) trades wall-clock for tightness: \(K = 1\) recovers the ELBO; the bound is monotonically tighter in \(K\) but with diminishing returns past \(K = 32\). Memory scales linearly in \(K\) so for large models you'll want \(K \in [4, 16]\).
Calling SVI¶
from quivers.inference import (
AutoIAFGuide, IWAEBound, DoublyReparameterized, SVI,
)
guide = AutoIAFGuide(model, observed_names={"y"}, num_flows=4)
objective = IWAEBound(num_particles=16, estimator=DoublyReparameterized())
optimizer = torch.optim.Adam(
list(model.parameters()) + list(guide.parameters()), lr=1e-3,
)
svi = SVI(model, guide, optimizer, objective)
for step in range(5000):
loss = svi.step(x_data, {"y": y_data})
Calling NUTS¶
For the times you want unbiased posterior samples:
from quivers.inference import NUTSKernel, MCMC
kernel = NUTSKernel(
target_accept=0.85,
max_tree_depth=10,
mass_matrix="diagonal",
)
mcmc = MCMC(
kernel,
num_warmup=1000,
num_samples=2000,
num_chains=4,
init_strategy="prior",
)
result = mcmc.run(model, x_data, {"y": y_data})
print(result.r_hat) # per-site split-Rhat
print(result.ess) # per-site effective sample size
print("divergences:", result.total_divergences)
Diagnostics to check before trusting the run:
- Rhat < 1.01 for every site (otherwise: chains haven't mixed; run longer warmup).
- ESS > 100 per chain for every site (otherwise: samples are too correlated; tighten step size).
- Divergences near zero (otherwise: posterior has curvature the integrator missed; bump
target_acceptto 0.95 or reparameterize).
Hybrid samplers¶
Two configurations sit between VI and pure MCMC:
WarmupThenHMC(guide, kernel, svi_steps, mcmc_warmup, mcmc_samples)runssvi_stepsof variational warm-up and then launches MCMC chains initialized from the trained guide. Useful when HMC's warmup wastes a lot of compute exploring tails the guide already knows are empty.AutoDAIS(base, model, observations, num_steps=..., init_step_size=...)wraps a base guide withnum_stepsannealed HMC trajectories. It closes the parity gap with the Pyro / NumPyro AutoDAIS on multimodal posteriors where pure VI collapses.
from quivers.inference import AutoDAIS, ELBO
base = AutoNormalGuide(model, observed_names={"y"})
guide = AutoDAIS(
base, model, observations={"y": y_data},
num_steps=8, init_step_size=0.05,
)
elbo = ELBO()
svi = SVI(model, guide, optimizer, elbo)
Predictive¶
Predictive is the same regardless of which inference you ran:
from quivers.inference import Predictive
# `posterior` accepts either a trained Guide or an MCMCResult
pred = Predictive(model, posterior=guide, num_samples=1000)
pred = Predictive(model, posterior=mcmc_result, num_samples=1000)
posterior_predictive = pred(x_test) # dict of (num_samples, ...) tensors
Reading the benchmark grid¶
Whenever you're unsure which guide handles a model shape, look at
docs/developer/inference-benchmarks.md. The Tier-1 through Tier-6 grid runs every shipped guide against synthetic posteriors with known answers (conjugate models, hierarchical structures, pathological geometries, multimodal mixtures, latent-variable models, constrained supports), reports the metrics, and gives you an empirical answer. If a guide is missing for your model shape, the failing-cell pattern often suggests the right reparameterisation or a different guide.
Try this¶
- Take the eight-schools model from chapter 3 and run every guide on it, comparing posterior
taurecovery against the NUTS reference. - Take the GMM from chapter 4 and run
AutoMixtureGuide(num_components=4)againstAutoNormalGuide. The mode-recovery difference is dramatic. - Run
AutoDAISon the donut posterior (a non-simply-connected support); compare toAutoIAFGuide.
Next¶
Chapter 7 (optional reading) peeks under the hood at the categorical machinery: algebras, change-of-base, enriched composition. Useful if you want to extend the library or understand the type-error messages fluently. If you're happy with the DSL surface, you can stop here.
References¶
- Radford M. Neal. 2011. MCMC using Hamiltonian dynamics. In Steve Brooks, Andrew Gelman, Galin L. Jones, and Xiao-Li Meng, editors, Handbook of Markov Chain Monte Carlo, chapter 5. Chapman & Hall/CRC.