Inference benchmark results
This page reports how each posterior inference algorithm shipped in quivers.inference recovers known posterior moments on a deterministic suite of synthetic problems. The grid is regenerated by tests/benchmarks/runner.py from the seeded data factories and analytical references in tests/benchmarks/.
What the suite tests
Every benchmark is an (algorithm, problem) cell. A problem fixes:
- A generative model written in QVR and loaded from tests/benchmarks/models/*.qvr.
- A deterministic data generator (fixed torch.manual_seed) that produces the observations the model is conditioned on.
- A reference posterior moment for one latent site, computed analytically (conjugate problems), by quadrature on a dense grid (constrained-support problems), or by a long cached NUTS run (Eight Schools).
- A scalar metric (almost always \(|\mathbb{E}_q[\cdot] - \mathbb{E}_{\text{ref}}[\cdot]|\)) and a tolerance.
A cell runs the algorithm on the problem, draws posterior samples for the target site, and compares the recovered moment against the reference.
Throughput is reported as SVI iterations per second for the variational guides and as posterior draws per second (summed across chains) for the MCMC kernels.
Cell statuses
- PASS: recovered moment is within tolerance of the reference.
- FAIL: algorithm runs cleanly but the moment is outside tolerance.
- ERROR: algorithm raised during execution (NaN gradient, support-boundary explosion, divergent trajectory, etc.).
- capture problems invert the convention: PASS means the metric exceeds the tolerance, confirming a documented failure mode.
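The status rules above can be sketched as a small classifier. This is a minimal illustration, not the runner's code: the function name `cell_status`, its signature, and the choice of a non-strict comparison at the tolerance boundary are all assumptions.

```python
def cell_status(metric: float, tolerance: float, capture: bool = False) -> str:
    """Classify a finished cell from its scalar metric (hypothetical helper).

    ERROR is not handled here: in this sketch the caller would assign it
    when the algorithm raises before a metric exists.
    """
    # Assumption: "within tolerance" includes the boundary itself.
    within = metric <= tolerance
    if capture:
        # Capture problems pass when the documented failure is reproduced,
        # i.e. when the metric lands outside the tolerance.
        return "FAIL" if within else "PASS"
    return "PASS" if within else "FAIL"
```

For example, `cell_status(0.0398, 0.05)` and `cell_status(40.6, 20.25, capture=True)` would both be classified as PASS under these rules.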
Determinism: every cell calls torch.manual_seed(0) before constructing the problem, so the same (algorithm, problem) pair reproduces across runs given fixed PyTorch and NumPy versions.
Algorithms
All algorithms are evaluated on every problem. Hyperparameters are fixed per algorithm rather than tuned per problem (the one exception is a longer SVI schedule for positive-support sites), so that the grid measures the algorithms, not a per-problem tuning effort.
| Algorithm | Family | Key hyperparameters |
|---|---|---|
| AutoNormal | Mean-field SVI, factorised diagonal Normal in unconstrained space | Adam, lr=0.05, 800 steps (1500 for positive-support sites), 1500 posterior draws |
| AutoMVN | Full-covariance SVI, single MVN in unconstrained space | Adam, lr=0.05, 800 steps (1500 for positive-support sites), init_scale=0.3, 1500 draws |
| AutoLaplace | MAP plus a Gaussian centred at the mode with Hessian covariance | Adam, lr=0.05, 500 steps, 1500 draws |
| HMC | Hamiltonian Monte Carlo with fixed integrator length | step_size=0.1 (adapted), num_steps=10, diagonal mass matrix (adapted), 200 warmup, 400 samples, 2 chains |
| NUTS | No-U-Turn HMC | target_accept=0.8, max_tree_depth=8, diagonal mass matrix, 200 warmup, 400 samples, 2 chains |
Variational guides operate in unconstrained space via the bijector attached to each latent's support, so positive-support and bounded-support sites are exercised through exp / softplus / sigmoid transforms rather than through constrained Gaussian families.
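To make the unconstrained-space idea concrete, the sketch below samples a Normal guide on the real line and pushes each draw through a bijector. It uses only the standard library; the helper names (`softplus`, `sigmoid`, `constrained_draws`) and the guide parameters are illustrative assumptions, not the quivers.inference API.

```python
import math
import random


def softplus(u: float) -> float:
    """Numerically stable log(1 + exp(u)); maps the real line to (0, inf)."""
    return max(u, 0.0) + math.log1p(math.exp(-abs(u)))


def sigmoid(u: float) -> float:
    """Logistic function; maps the real line to (0, 1) without overflow."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def constrained_draws(loc, scale, transform, num_draws=1500, seed=0):
    """Sample a Normal guide in unconstrained space and push each draw
    through the bijector, giving draws on the constrained support."""
    rng = random.Random(seed)
    return [transform(rng.gauss(loc, scale)) for _ in range(num_draws)]


positive = constrained_draws(0.0, 1.0, softplus)  # half-line support
bounded = constrained_draws(0.0, 1.0, sigmoid)    # unit-interval support
```

Because the guide is Gaussian only before the transform, the implied distribution on the constrained scale is generally skewed, which is one reason positive-support sites get the longer SVI schedule in the table above.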
Tier 1: conjugate posteriors
Five textbook problems with closed-form posteriors. They establish a floor: every algorithm should match the analytical moment to within a tight tolerance.
Beta-Bernoulli
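Model. Conjugate Beta prior on a Bernoulli rate:
\[\theta \sim \mathrm{Beta}(\alpha_0, \beta_0), \qquad y_i \mid \theta \sim \mathrm{Bernoulli}(\theta), \quad i = 1, \dots, N.\]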
Data. \(N = 50\) Bernoulli draws at \(\theta^\star = 0.7\).
Reference. Conjugacy gives \(\theta \mid y \sim \mathrm{Beta}\bigl(\alpha_0 + \sum_i y_i,\ \beta_0 + N - \sum_i y_i\bigr)\) with closed-form mean \(\alpha / (\alpha + \beta)\).
Metric. |E[theta]_q - E[theta]_true|, tolerance 0.05.
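The conjugate update in the reference can be sketched in a few lines. The function name is hypothetical, and the prior hyperparameters are left as arguments because this page does not state the values the benchmark uses.

```python
def beta_bernoulli_posterior_mean(y, alpha0, beta0):
    """Posterior mean of theta under a Beta(alpha0, beta0) prior.

    y is a sequence of 0/1 outcomes; the prior values are caller-supplied
    assumptions, not the benchmark's settings.
    """
    successes = sum(y)
    alpha = alpha0 + successes           # prior pseudo-successes plus observed
    beta = beta0 + len(y) - successes    # prior pseudo-failures plus observed
    return alpha / (alpha + beta)
```

With a flat Beta(1, 1) prior and 35 successes in 50 trials, the mean is 36/52, slightly shrunk from the sample rate of 0.7 toward 0.5.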
Normal-Normal
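Model. Conjugate Normal prior on a Normal mean with known variance:
\[\mu \sim \mathcal{N}(\mu_0, \tau_0^{-1}), \qquad y_i \mid \mu \sim \mathcal{N}(\mu, \sigma^2), \quad i = 1, \dots, N.\]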
Data. \(N = 30\) Normal draws at \(\mu^\star = 1.5\), \(\sigma = 1\).
Reference. Posterior precision \(\tau_N = \tau_0 + N / \sigma^2\) gives a Normal posterior with mean \((\tau_0 \mu_0 + N \bar{y} / \sigma^2) / \tau_N\).
Metric. |E[mu]_q - E[mu]_true|, tolerance 0.15.
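A minimal sketch of the precision-weighted update in the reference, with a hypothetical function name; `mu0` and `tau0` are the prior mean and prior precision, whose benchmark values are not given on this page.

```python
def normal_normal_posterior(y, mu0, tau0, sigma):
    """Return (posterior mean, posterior variance) for a Normal mean
    with known observation standard deviation sigma."""
    n = len(y)
    ybar = sum(y) / n
    tau_n = tau0 + n / sigma**2                       # posterior precision
    mean = (tau0 * mu0 + n * ybar / sigma**2) / tau_n  # precision-weighted mean
    return mean, 1.0 / tau_n
```

The posterior mean is a weighted average of the prior mean and the sample mean, with weights proportional to their precisions.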
Normal-Inverse-Gamma
Model. Joint conjugate prior on unknown mean and variance:
Data. \(N = 60\) Normal draws at \(\mu^\star = 0.3\), \(\sigma^{2\star} = 1.5\).
Reference. NIG posterior updates (Murphy 2007 §5) give marginal mean \(\mu_N = (\kappa_0 \mu_0 + N \bar{y}) / (\kappa_0 + N)\).
Stress test for guides handling two latents with mixed supports: the unconstrained \(\mu\) and the positive \(\sigma^2\) (whose bijector is \(\exp\) / softplus).
Metric. |E[mu]_q - E[mu]_true|, tolerance 0.2.
Gamma-Exponential
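Model. Conjugate Gamma prior (shape-rate parameterisation) on an Exponential rate:
\[r \sim \mathrm{Gamma}(a_0, b_0), \qquad y_i \mid r \sim \mathrm{Exponential}(r), \quad i = 1, \dots, N.\]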
Data. \(N = 80\) Exponential draws at \(r^\star = 2\).
Reference. \(r \mid y \sim \mathrm{Gamma}\bigl(a_0 + N,\ b_0 + \sum_i y_i\bigr)\), with mean \(a / b\).
Metric. |E[rate]_q - E[rate]_true|, tolerance 0.3.
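The same update written out, assuming the shape-rate parameterisation implied by the posterior mean \(a / b\). The function name and the prior values in the usage note are illustrative only.

```python
def gamma_exponential_posterior_mean(y, a0, b0):
    """Posterior mean of the rate under a Gamma(a0, b0) shape-rate prior."""
    a = a0 + len(y)   # shape grows by one per observation
    b = b0 + sum(y)   # rate grows by the total observed waiting time
    return a / b
```

For instance, two observations of 0.5 under a Gamma(1, 1) prior give a posterior mean of 3 / 2.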
Bayesian linear regression
Model. Two-parameter linear regression with iid standard-Normal design and known observation noise:
with \(\sigma = 0.3\), \(a^\star = 0.7\), \(b^\star = -0.5\).
Reference. Closed-form Gaussian posterior with precision \(I + X^\top X / \sigma^2\) and mean \(\Sigma X^\top y / \sigma^2\).
Metric. |E[a]_q - E[a]_true|, tolerance 0.1.
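The closed-form mean can be sketched without a linear-algebra library because the system is only 2x2. The sketch assumes an \(N \times 2\) design given as a list of row pairs and the standard-Normal coefficient prior implied by the identity term in the precision; the function name is hypothetical.

```python
def blr_posterior_mean(X, y, sigma):
    """Posterior mean of two regression coefficients.

    Solves (I + X^T X / sigma^2) m = X^T y / sigma^2 by explicit
    2x2 inversion. X is a list of (x1, x2) rows, y a list of targets.
    """
    s2 = sigma**2
    # Entries of the posterior precision matrix P = I + X^T X / sigma^2.
    p11 = 1.0 + sum(r[0] * r[0] for r in X) / s2
    p12 = sum(r[0] * r[1] for r in X) / s2
    p22 = 1.0 + sum(r[1] * r[1] for r in X) / s2
    # Right-hand side g = X^T y / sigma^2.
    g1 = sum(r[0] * t for r, t in zip(X, y)) / s2
    g2 = sum(r[1] * t for r, t in zip(X, y)) / s2
    det = p11 * p22 - p12 * p12
    # m = P^{-1} g via the closed-form inverse of a symmetric 2x2 matrix.
    return (p22 * g1 - p12 * g2) / det, (p11 * g2 - p12 * g1) / det
```

The identity term keeps the determinant strictly positive, so the solve is well defined even when the design columns are collinear.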
Results
Posterior accuracy (metric / tolerance):
| Problem | AutoNormal | AutoMVN | AutoLaplace | HMC | NUTS |
|---|---|---|---|---|---|
| Beta-Bernoulli | PASS 0.0398 / 0.05 | PASS 0.0402 / 0.05 | PASS 0.00926 / 0.05 | PASS 0.000594 / 0.05 | PASS 0.00157 / 0.05 |
| Normal-Normal | PASS 0.123 / 0.15 | PASS 0.124 / 0.15 | PASS 4.77e-07 / 0.15 | PASS 0.000607 / 0.15 | PASS 0.0225 / 0.15 |
| Normal-Inverse-Gamma | PASS 0.0345 / 0.2 | PASS 0.0298 / 0.2 | PASS 5.96e-08 / 0.2 | PASS 0.00715 / 0.2 | PASS 0.0068 / 0.2 |
| Gamma-Exponential | PASS 0.0513 / 0.3 | PASS 0.057 / 0.3 | PASS 0.0249 / 0.3 | PASS 0.00806 / 0.3 | PASS 0.0163 / 0.3 |
| Bayesian linear regression | PASS 0.0113 / 0.1 | PASS 0.00401 / 0.1 | PASS 1.79e-07 / 0.1 | PASS 0.000134 / 0.1 | PASS 0.00127 / 0.1 |
Throughput (iters/s for SVI, draws/s for MCMC):
| Problem | AutoNormal | AutoMVN | AutoLaplace | HMC | NUTS |
|---|---|---|---|---|---|
| Beta-Bernoulli | 1149.7 | 868.0 | 1863.5 | 114.2 | 167.9 |
| Normal-Normal | 2013.3 | 1395.7 | 3451.5 | 236.8 | 256.7 |
| Normal-Inverse-Gamma | 883.0 | 571.8 | 1574.8 | 102.6 | 51.0 |
| Gamma-Exponential | 1803.4 | 1153.3 | 3276.4 | 233.1 | 180.0 |
| Bayesian linear regression | 1222.2 | 710.6 | 2305.4 | 163.5 | 40.8 |
Tier 2: hierarchical posteriors
The Eight Schools problem (Rubin 1981) in both parameterisations. It tests how each algorithm handles the funnel geometry that arises when the group-level scale \(\tau\) shrinks toward zero.
Eight Schools (centered)
Model.
for \(j = 1, \dots, 8\) on the canonical Rubin (1981) effect sizes \(y = (28, 8, -3, 7, -1, 1, 18, 12)\).
Reference. Cached NUTS moments (4 chains, 5000 post-warmup draws): \(\mathbb{E}[\mu] \approx 5.4\), posterior standard deviation \(\approx 4\).
Tolerance is set at three reference standard deviations: a loose target reflecting how hard the funnel geometry is for VI.
Metric. |E[mu]_q - mu_ref|, tolerance 12.
Eight Schools (non-centered)
Model. Same priors as the centered model, with the group-level draws reparameterised:
decoupling \(\tau\) from \(\theta_j\) and eliminating the funnel in the prior.
Reference. Same cached NUTS moments as the centered model.
Tolerance is tightened to two reference standard deviations, since the reparameterisation should make the posterior easier to fit.
Metric. |E[mu]_q - mu_ref|, tolerance 8.
Results
Posterior accuracy (metric / tolerance):
| Problem | AutoNormal | AutoMVN | AutoLaplace | HMC | NUTS |
|---|---|---|---|---|---|
| Eight Schools (centered) | PASS 5.4 / 12 | PASS 5.51 / 12 | PASS 5.4 / 12 | PASS 4.38 / 12 | PASS 5.75 / 12 |
| Eight Schools (non-centered) | PASS 0.891 / 8 | PASS 1.06 / 8 | PASS 2.01 / 8 | PASS 1.74 / 8 | PASS 1.45 / 8 |
Throughput (iters/s for SVI, draws/s for MCMC):
| Problem | AutoNormal | AutoMVN | AutoLaplace | HMC | NUTS |
|---|---|---|---|---|---|
| Eight Schools (centered) | 776.5 | 559.8 | 1498.0 | 113.0 | 28.0 |
| Eight Schools (non-centered) | 729.2 | 572.4 | 1448.6 | 110.3 | 34.1 |
Tier 3: hard posterior geometry
Problems chosen to expose specific failure modes of mean-field VI and of HMC under poor preconditioning.
Correlated regression
Model. Linear regression as in Tier 1, but with a near-collinear design:
with \(\rho = 0.95\) and \(N = 50\).
Reference. Closed-form Gaussian posterior whose off-diagonal correlation between the two coefficients is about 0.95 or more in magnitude.
The mean-field guide ignores this correlation; the first-moment metric below still passes (the documented underfit lives in the second moment).
Metric. |E[a]_q - E[a]_true|, tolerance 0.2.
Neal's funnel (under-estimation capture) (capture)
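Model. Neal's funnel:
\[v \sim \mathcal{N}(0, 3^2), \qquad x_i \mid v \sim \mathcal{N}(0, e^{v}), \quad i = 1, \dots, N.\]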
Data. Condition on \(x_i = 0\) (inference target is \(p(v \mid x = 0)\)).
Reference. The log-likelihood is linear in \(v\): \(\log p(x_i = 0 \mid v) = -\tfrac{1}{2}\log(2\pi) - v / 2\), so the conditional posterior is Gaussian with mean \(-9 N / 2 = -40.5\) and variance \(9\) at \(N = 9\). The joint posterior over \((v, x)\) remains funnel-shaped; only the conditional given \(x = 0\) is tractable.
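A short derivation of the stated moments, assuming the prior \(v \sim \mathcal{N}(0, 9)\) that the quoted posterior variance implies: summing the \(N\) log-likelihood terms and completing the square in \(v\) gives

\[
\log p(v \mid x = 0) = -\frac{v^2}{18} - \frac{N v}{2} + \text{const} = -\frac{1}{18}\Bigl(v + \frac{9N}{2}\Bigr)^2 + \text{const},
\]

so \(v \mid x = 0 \sim \mathcal{N}(-9N/2,\ 9)\), which evaluates to a mean of \(-40.5\) at \(N = 9\).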
Capture semantics. All five algorithms under-estimate the magnitude of \(v\). PASS means the metric exceeds the tolerance, confirming the documented underfit.
Metric. |E[v]_q - E[v]_true|, tolerance 20.25.
Ill-conditioned product Gaussian
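Model. Five-dimensional product Gaussian with prior scales spanning four orders of magnitude and a fixed observation noise:
\[x_d \sim \mathcal{N}\bigl(0, (\sigma^{\text{prior}}_d)^2\bigr), \qquad y_d \mid x_d \sim \mathcal{N}(x_d, 0.1^2), \quad d = 1, \dots, 5,\]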
with \(\sigma^{\text{prior}} = (100, 10, 1, 0.1, 0.01)\).
Reference. Per-dimension Gaussian: \(x_d \mid y_d \sim \mathcal{N}\bigl(y_d / (1 + (0.1 / \sigma_d)^2),\ (1 / \sigma_d^2 + 1 / 0.01)^{-1}\bigr)\).
The metric tracks the middle-scale dimension \(x_3\), where the diagonal mass matrix is roughly correct but the gradient signal is dwarfed by the larger-scale dimensions.
Metric. |E[x_3]_q - E[x_3]_true|, tolerance 0.3.
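The per-dimension reference formula is simple enough to evaluate directly. The sketch below uses a hypothetical function name and takes the observation noise scale of 0.1 from the formula above; the observed value in the usage note is made up for illustration.

```python
def product_gaussian_posterior(y_d, prior_scale, noise_scale=0.1):
    """Return (mean, variance) of x_d given y_d for one dimension."""
    # Shrinkage toward the zero prior mean grows as the prior scale
    # drops below the noise scale.
    mean = y_d / (1.0 + (noise_scale / prior_scale) ** 2)
    var = 1.0 / (1.0 / prior_scale**2 + 1.0 / noise_scale**2)
    return mean, var
```

With an observation of 1.0, the middle dimension (prior scale 1) keeps almost all of the signal (mean 1 / 1.01), while the smallest dimension (prior scale 0.01) shrinks it to 1 / 101.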
Results
Posterior accuracy (metric / tolerance):
| Problem | AutoNormal | AutoMVN | AutoLaplace | HMC | NUTS |
|---|---|---|---|---|---|
| Correlated regression | PASS 0.0365 / 0.2 | PASS 0.0308 / 0.2 | PASS 7.41e-05 / 0.2 | PASS 0.0512 / 0.2 | PASS 0.0569 / 0.2 |
| Neal's funnel (under-estimation capture) (capture) | PASS 40.6 / 20.2 | PASS 40.6 / 20.2 | PASS 40.5 / 20.2 | PASS 42 / 20.2 | PASS 40.9 / 20.2 |
| Ill-conditioned product Gaussian | PASS 0.00643 / 0.3 | PASS 0.00333 / 0.3 | PASS 7.15e-07 / 0.3 | PASS 0.086 / 0.3 | PASS 0.0424 / 0.3 |
Throughput (iters/s for SVI, draws/s for MCMC):
| Problem | AutoNormal | AutoMVN | AutoLaplace | HMC | NUTS |
|---|---|---|---|---|---|
| Correlated regression | 1150.7 | 681.6 | 2314.7 | 167.3 | 67.1 |
| Neal's funnel (under-estimation capture) (capture) | 1795.3 | 1279.6 | 2807.7 | 281.0 | 1027.6 |
| Ill-conditioned product Gaussian | 482.0 | 461.1 | 876.7 | 68.3 | 54.5 |
Tier 6: constrained-support stress
Latents on a half-line or in a bounded interval. References come from dense-grid quadrature; variational guides must traverse a non-linear bijector to reach the constrained scale.
HalfNormal scale
Model.
Reference. No conjugate form exists, so the reference moments come from integrating the unnormalised posterior density of \(\sigma\) on a 4096-point grid in \([0.05, 6]\).
Metric. |E[sigma]_q - E[sigma]_true|, tolerance 0.15.
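Grid quadrature of this kind can be sketched as follows. The trapezoid rule, the function name, and the toy log-density in the usage note are assumptions; the page does not say which rule tests/benchmarks/references.py uses.

```python
import math


def grid_posterior_mean(log_density, low, high, num_points=4096):
    """Posterior mean from an unnormalised log-density on a uniform grid.

    Uses the trapezoid rule; the grid spacing cancels in the ratio, so
    only the relative weights matter.
    """
    step = (high - low) / (num_points - 1)
    grid = [low + i * step for i in range(num_points)]
    log_p = [log_density(s) for s in grid]
    shift = max(log_p)  # subtract the max before exponentiating for stability
    weights = [math.exp(lp - shift) for lp in log_p]
    weights[0] *= 0.5   # trapezoid rule: endpoints get half weight
    weights[-1] *= 0.5
    total = sum(weights)
    return sum(s * w for s, w in zip(grid, weights)) / total
```

As a sanity check, a toy log-density such as `lambda s: -0.5 * ((s - 2.0) / 0.3) ** 2` on \([0.05, 6]\) should return a mean very close to 2, because almost all of its mass lies inside the grid.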
TruncatedNormal recovery
Model.
Reference. Evaluate the truncated-Normal log-likelihood on a 4096-point \(\mu\)-grid in \((0, 1)\) with stable log-CDF differences for the truncation constant; normalise for the posterior moments.
Metric. |E[mu]_q - E[mu]_true|, tolerance 0.05.
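One way to compute a stable log-CDF difference for the truncation constant is sketched below, using only the standard library. The helper names, and the choice to pass the truncation bounds and scale as arguments, are assumptions; finite bounds are required.

```python
import math


def log_ndtr(x: float) -> float:
    """log Phi(x) for the standard Normal CDF, via the complementary error
    function. Adequate for moderate x; erfc underflows for x below about -38."""
    return math.log(0.5 * math.erfc(-x / math.sqrt(2.0)))


def log_cdf_diff(a: float, b: float) -> float:
    """log(Phi(b) - Phi(a)) for finite a < b, avoiding upper-tail cancellation."""
    if a > 0.0:
        # Reflect into the lower tail, where Phi is small and well resolved:
        # Phi(b) - Phi(a) = Phi(-a) - Phi(-b).
        a, b = -b, -a
    log_b, log_a = log_ndtr(b), log_ndtr(a)
    # log(e^log_b - e^log_a) = log_b + log(1 - e^(log_a - log_b))
    return log_b + math.log1p(-math.exp(log_a - log_b))


def truncated_normal_logpdf(y, mu, sigma, low, high):
    """Log-density at y (with low <= y <= high) of Normal(mu, sigma)
    truncated to [low, high]."""
    z = (y - mu) / sigma
    log_norm = log_cdf_diff((low - mu) / sigma, (high - mu) / sigma)
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi)) - log_norm
```

Summing `truncated_normal_logpdf` over the observations at each grid value of \(\mu\) gives the unnormalised log-posterior that is then normalised on the grid.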
Results
Posterior accuracy (metric / tolerance):
| Problem | AutoNormal | AutoMVN | AutoLaplace | HMC | NUTS |
|---|---|---|---|---|---|
| HalfNormal scale | PASS 0.0265 / 0.15 | PASS 0.0458 / 0.15 | PASS 0.0247 / 0.15 | PASS 0.00324 / 0.15 | PASS 0.0124 / 0.15 |
| TruncatedNormal recovery | PASS 0.0318 / 0.05 | PASS 0.0309 / 0.05 | PASS 0.000249 / 0.05 | PASS 0.00095 / 0.05 | PASS 0.00257 / 0.05 |
Throughput (iters/s for SVI, draws/s for MCMC):
| Problem | AutoNormal | AutoMVN | AutoLaplace | HMC | NUTS |
|---|---|---|---|---|---|
| HalfNormal scale | 1561.4 | 1045.9 | 2725.6 | 190.0 | 90.7 |
| TruncatedNormal recovery | 1355.7 | 1038.6 | 2152.0 | 145.2 | 105.1 |
Reproducing the grid
QVR_USE_LOCAL_GRAMMAR=1 python -m tests.benchmarks.runner
The runner accepts --algorithms and --problems flags for partial runs and writes the regenerated table back to this file by default. See tests/benchmarks/runner.py for the cell definitions and tests/benchmarks/references.py for the reference posteriors.