6. Choosing an inference algorithm

Quivers ships nine variational guides (Hoffman, Blei, Wang & Paisley, 2013 for the SVI substrate), four objectives, two MCMC kernels (HMC, Neal, 2011; NUTS, Hoffman & Gelman, 2014), two hybrid samplers, and four gradient estimators. Some combinations are obviously sensible; some are obvious mistakes; most fall between, and the right call depends on the shape of your posterior. This chapter is a field guide.

The mental model: you pick (a) a family (variational, MCMC, hybrid), then (b) a configuration within that family, then (c) gradient-estimator and tightness knobs. The choices roughly factor; we'll walk down the tree.

Step 1: VI, MCMC, or hybrid?

Posterior shape First-pass choice
Smooth, broadly unimodal, < 1000 latents Variational inference.
Multimodal or pathologically curved MCMC (NUTS).
Hierarchical with funnel-shaped posteriors NUTS, or VI on the non-centered reparameterisation.
High-dim (>1000 latents) and continuous VI, unless you have a lot of compute.
Mixed discrete-continuous with small discrete support VI with a marginalize block (chapter 4); HMC over the continuous remainder.
You don't know yet Start with AutoNormalGuide + ELBO. It's cheap; the failure modes are diagnostic.

VI is fast and scales; MCMC is unbiased and respects curvature. Hybrids try to get both. The cost is real: NUTS on a 600-latent hierarchical model takes minutes per chain; AutoNormalGuide on the same model takes seconds. If the VI fit is good enough for your downstream task, ship it.

Step 2: Which variational guide?

flowchart TD
    start(["Posterior shape"]) --> id{"Identifiable,<br/>weak correlation?"}
    id -- yes --> normal["AutoNormalGuide<br/>default"]
    id -- no  --> corr{"Strong correlations?"}
    corr -- yes --> dim{"Cholesky fits in memory?"}
    dim -- yes --> mvn["AutoMultivariateNormalGuide"]
    dim -- no  --> lr["AutoLowRankMultivariateNormalGuide"]
    corr -- no  --> multi{"Multimodal?"}
    multi -- yes --> mix["AutoMixtureGuide<br/>or AutoNormalizingFlow<br/>or AutoIAFGuide"]
    multi -- no  --> tails{"Non-Gaussian tails?"}
    tails -- yes --> flow["AutoIAFGuide<br/>or AutoNeuralSplineGuide"]
    tails -- no  --> normal2["back to AutoNormalGuide"]

The shipped guides:

Guide Family When
AutoNormalGuide Diagonal Normal Default. Cheap, scales, often good enough.
AutoDeltaGuide Dirac at MAP Quick MAP; no posterior uncertainty.
AutoMultivariateNormalGuide Full-rank Normal (Cholesky) Strong posterior correlations and total latent dimension small enough for a dense Cholesky to fit in memory. Memory and per-step cost both scale as \(O(D^2)\).
AutoLowRankMultivariateNormalGuide Low-rank + diagonal Hierarchical models with localised correlations; specify rank (typically 5-20).
AutoLaplaceApproximation Gaussian centered at MAP with Hessian inverse Post-hoc quadratic-around-MAP; cheap upgrade from AutoDeltaGuide.
AutoNormalizingFlow Composed bijector over Normal base Multimodal or heavy-tailed posteriors; you supply the flow stack.
AutoIAFGuide Inverse autoregressive flow Flagship NF default; good general-purpose non-Gaussian guide.
AutoNeuralSplineGuide Rational-quadratic spline coupling Sharper than IAF on bounded support.
AutoMixtureGuide K-component mixture of guides Multimodal posteriors; specify num_components.

The Tier-1 through Tier-6 benchmark grid in tests/benchmarks/ walks these against synthetic posteriors with known answers; the report at docs/developer/inference-benchmarks.md is the empirical truth-table for which guide passes which kind of problem.

Step 3: Which objective?

Objective Bound When
ELBO(num_particles=1) Reparameterized ELBO Default.
ELBO(num_particles=K) K-sample ELBO Slightly tighter; rarely worth the cost.
IWAEBound(K) Importance-weighted bound Tighter than ELBO for K > 1.
RenyiBound(alpha, K) Rényi divergence bound Interpolates ELBO (α=1) and IWAE (α=0).
VRIWAEBound(alpha, K) Variational Rényi-IWAE Mass-covering vs mode-seeking knob via α; tight via K.

For most workflows, the default ELBO() is the right call. IWAEBound matters when the posterior is multimodal and the guide is flow-shaped (the importance weighting tightens the bound; coupling layers exploit it). RenyiBound is useful when you specifically want a mass-covering posterior (e.g. for Bayesian-optimization acquisition functions).

Step 4: Which gradient estimator?

Estimator What it does When
Reparameterized Standard pathwise gradient Default.
StickingTheLanding Detaches variational params in log q When q → p* and you're getting noisy gradients at convergence.
DoublyReparameterized DReG for IWAE With IWAEBound(K) at large K.
ScoreFunction REINFORCE Discrete (non-reparameterizable) latents you can't marginalize.

Pair IWAEBound(num_particles=16, estimator=DoublyReparameterized()) for the flagship "tight bound, low-variance gradients" combination. K (the num_particles) trades wall-clock for tightness: \(K = 1\) recovers the ELBO; the bound is monotonically tighter in \(K\) but with diminishing returns past \(K = 32\). Memory scales linearly in \(K\) so for large models you'll want \(K \in [4, 16]\).

Calling SVI

from quivers.inference import (
    AutoIAFGuide, IWAEBound, DoublyReparameterized, SVI,
)

guide = AutoIAFGuide(model, observed_names={"y"}, num_flows=4)
objective = IWAEBound(num_particles=16, estimator=DoublyReparameterized())
optimizer = torch.optim.Adam(
    list(model.parameters()) + list(guide.parameters()), lr=1e-3,
)
svi = SVI(model, guide, optimizer, objective)

for step in range(5000):
    loss = svi.step(x_data, {"y": y_data})

Calling NUTS

For the times you want unbiased posterior samples:

from quivers.inference import NUTSKernel, MCMC

kernel = NUTSKernel(
    target_accept=0.85,
    max_tree_depth=10,
    mass_matrix="diagonal",
)
mcmc = MCMC(
    kernel,
    num_warmup=1000,
    num_samples=2000,
    num_chains=4,
    init_strategy="prior",
)
result = mcmc.run(model, x_data, {"y": y_data})
print(result.r_hat)                       # per-site split-Rhat
print(result.ess)                         # per-site effective sample size
print("divergences:", result.total_divergences)

Diagnostics to check before trusting the run:

  • Rhat < 1.01 for every site (otherwise: chains haven't mixed; run longer warmup).
  • ESS > 100 per chain for every site (otherwise: samples are too correlated; tighten step size).
  • Divergences near zero (otherwise: posterior has curvature the integrator missed; bump target_accept to 0.95 or reparameterize).

Hybrid samplers

Two configurations sit between VI and pure MCMC:

  • WarmupThenHMC(guide, kernel, svi_steps, mcmc_warmup, mcmc_samples) runs svi_steps of variational warm-up and then launches MCMC chains initialized from the trained guide. Useful when HMC's warmup wastes a lot of compute exploring tails the guide already knows are empty.
  • AutoDAIS(base, model, observations, num_steps=..., init_step_size=...) wraps a base guide with num_steps annealed HMC trajectories. It closes the parity gap with the Pyro / NumPyro AutoDAIS on multimodal posteriors where pure VI collapses.
from quivers.inference import AutoDAIS, ELBO

base   = AutoNormalGuide(model, observed_names={"y"})
guide  = AutoDAIS(
    base, model, observations={"y": y_data},
    num_steps=8, init_step_size=0.05,
)
elbo   = ELBO()
svi    = SVI(model, guide, optimizer, elbo)

Predictive

Predictive is the same regardless of which inference you ran:

from quivers.inference import Predictive

# `posterior` accepts either a trained Guide or an MCMCResult
pred = Predictive(model, posterior=guide,       num_samples=1000)
pred = Predictive(model, posterior=mcmc_result, num_samples=1000)

posterior_predictive = pred(x_test)           # dict of (num_samples, ...) tensors

Reading the benchmark grid

Whenever you're unsure which guide handles a model shape, look at docs/developer/inference-benchmarks.md. The Tier-1 through Tier-6 grid runs every shipped guide against synthetic posteriors with known answers (conjugate models, hierarchical structures, pathological geometries, multimodal mixtures, latent-variable models, constrained supports), reports the metrics, and gives you an empirical answer. If a guide is missing for your model shape, the failing-cell pattern often suggests the right reparameterisation or a different guide.

Try this

  • Take the eight-schools model from chapter 3 and run every guide on it, comparing posterior tau recovery against the NUTS reference.
  • Take the GMM from chapter 4 and run AutoMixtureGuide(num_components=4) against AutoNormalGuide. The mode-recovery difference is dramatic.
  • Run AutoDAIS on the donut posterior (a non-simply-connected support); compare to AutoIAFGuide.

Next

Chapter 7 (optional reading) peeks under the hood at the categorical machinery: algebras, change-of-base, enriched composition. Useful if you want to extend the library or understand the type-error messages fluently. If you're happy with the DSL surface, you can stop here.

References

  • Radford M. Neal. 2011. MCMC using Hamiltonian dynamics. In Steve Brooks, Andrew Gelman, Galin L. Jones, and Xiao-Li Meng, editors, Handbook of Markov Chain Monte Carlo, chapter 5. Chapman & Hall/CRC.