Estimators
Gradient-estimator strategies plugged into Objectives: Reparameterized (the default pathwise gradient), StickingTheLanding (variance reduction near convergence), DoublyReparameterized (DReG, for IWAEBound at large K), and ScoreFunction (REINFORCE, for non-reparameterizable sites).
estimators
Gradient estimators for variational objectives.
A GradientEstimator is the strategy that takes (latent
samples, model log-density, guide log-density) and returns a
scalar loss whose backward() produces the chosen gradient
estimator. Different strategies trade variance against
applicability:
- Reparameterized: pathwise gradient (the standard SVI reparameterization trick). Lowest variance for reparameterizable families; requires rsample.
- StickingTheLanding: detaches the variational-parameter dependence in :math:`\log q_\phi(z)` so the gradient variance asymptotically vanishes as :math:`q \to p^*` (Roeder-Wu-Duvenaud 2017, `doi:10.48550/arXiv.1703.09194 <https://doi.org/10.48550/arXiv.1703.09194>`_).
- DoublyReparameterized: the DReG estimator for IWAE (Tucker-Lawson-Gu-Maddison 2019, `doi:10.48550/arXiv.1810.04152 <https://doi.org/10.48550/arXiv.1810.04152>`_). Removes the score-function term whose variance grows with the particle count :math:`K`.
- ScoreFunction: REINFORCE / black-box VI. The fallback for non-reparameterizable sites (discrete latents, reject-sampled families). Highest variance; pair with a baseline whenever possible.
Estimators are strategies held by Objective
implementations; they don't store any state themselves and
operate on tensors only. The Reparameterized instance
is a singleton — every objective defaults to it.
GradientEstimator
Bases: ABC
Strategy for computing :math:`\nabla_\phi \mathcal{L}` from
samples and densities.
Subclasses implement negative_objective: given the
per-particle log_p and log_q tensors (and any
estimator-specific auxiliary data), return the negated
objective whose backward() produces the desired gradient
estimator.
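The strategy interface described above can be sketched in plain Python. This is only an illustration under stated assumptions: the class names `GradientEstimatorSketch` and `NegativeElboSketch` are hypothetical, and lists of floats stand in for tensors, so the sketch shows the shape of the contract (per-particle log-densities in, one scalar loss out) rather than any autograd behaviour.

```python
from abc import ABC, abstractmethod
from math import fsum
from typing import Optional, Sequence


class GradientEstimatorSketch(ABC):
    """Hypothetical stand-in for the GradientEstimator base class."""

    @abstractmethod
    def negative_objective(
        self,
        log_p: Sequence[float],
        log_q: Sequence[float],
        log_q_detached: Optional[Sequence[float]] = None,
    ) -> float:
        """Return the scalar loss for one batch of particles."""


class NegativeElboSketch(GradientEstimatorSketch):
    """Pathwise-style strategy: the loss is the negated Monte Carlo ELBO."""

    def negative_objective(self, log_p, log_q, log_q_detached=None):
        # Average log p - log q over particles, then negate so that
        # minimising the loss maximises the ELBO.
        terms = [lp - lq for lp, lq in zip(log_p, log_q)]
        return -fsum(terms) / len(terms)


# Illustrative values only; a real objective would pass tensors.
loss = NegativeElboSketch().negative_objective(
    log_p=[-1.0, -2.0, -1.5],
    log_q=[-0.5, -1.0, -1.0],
)
```

Because the strategy holds no state, an objective can share one instance across calls, which is consistent with the singleton default mentioned earlier.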
negative_objective
abstractmethod
negative_objective(log_p: Tensor, log_q: Tensor, log_q_detached: Tensor | None = None) -> Tensor
Return the scalar loss whose gradient is the chosen estimator.
| PARAMETER | DESCRIPTION |
|---|---|
| log_p | Model log-joint. TYPE: Tensor |
| log_q | Guide log-density. TYPE: Tensor |
| log_q_detached | TYPE: Tensor \| None. DEFAULT: None |
Source code in src/quivers/inference/estimators.py
Reparameterized
Bases: GradientEstimator
Standard pathwise gradient.
For the ELBO with num_particles = 1 this is the textbook
reparameterization trick (Kingma-Welling 2013,
`doi:10.48550/arXiv.1312.6114 <https://doi.org/10.48550/arXiv.1312.6114>`_).
For higher num_particles it differentiates the importance-weighted
bound directly through the reparameterized samples, i.e. the IWAE
bound under the naive gradient.
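To make the two regimes concrete, the following minimal sketch computes both bounds from the same per-particle log-densities. The numbers are made up for illustration and `log_mean_exp` is a hypothetical helper, not a function from the library; in practice an autograd framework would differentiate the same expressions through the sampled values.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))


# Hypothetical per-particle log-densities for K = 4 reparameterized samples.
log_p = [-3.2, -2.9, -4.1, -3.0]
log_q = [-1.1, -1.4, -0.9, -1.3]
log_w = [lp - lq for lp, lq in zip(log_p, log_q)]

elbo = sum(log_w) / len(log_w)  # average of single-particle ELBO terms
iwae = log_mean_exp(log_w)      # K-particle importance-weighted bound
```

By Jensen's inequality `iwae` is never smaller than `elbo` on the same samples, and with a single particle the two coincide.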
StickingTheLanding
Bases: GradientEstimator
Roeder-Wu-Duvenaud 2017 sticking-the-landing estimator.
Replaces log q(z) in the loss with
log q_{detach(phi)}(z): the log-density is evaluated at the
same sample but the variational parameters are detached
from the autograd graph. The total derivative loses its
direct dependence on the variational parameters through
log q (the score term), leaving only the indirect dependence
through the sampled z. As :math:`q \to p^*` the latter vanishes and
so does the gradient variance.
Use when training with a guide that's already close to the
true posterior — typically after a warm-up phase. May
increase variance early in training when q is far
from p.
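The vanishing-variance claim can be checked numerically on a toy problem. The sketch below assumes a 1-D Gaussian guide that exactly matches a Gaussian target and uses hand-derived derivatives in place of autograd; all names and values are illustrative, not part of the library.

```python
import random
import statistics

# Assumed toy setting: guide q = N(mu, sigma^2), target p = N(m, s^2),
# and z = mu + sigma * eps with eps ~ N(0, 1). Here q equals p exactly.
mu, sigma = 0.7, 1.3
m, s = mu, sigma


def dlogp_dz(z):
    return -(z - m) / s**2


def dlogq_dz(z):
    return -(z - mu) / sigma**2


def dlogq_dmu(z):
    # Direct dependence of log q on mu with z held fixed (the score).
    return (z - mu) / sigma**2


rng = random.Random(0)
total_grads, stl_grads = [], []
for _ in range(1000):
    z = mu + sigma * rng.gauss(0.0, 1.0)
    path = (dlogp_dz(z) - dlogq_dz(z)) * 1.0  # dz/dmu = 1
    stl_grads.append(path)                    # score term dropped
    total_grads.append(path - dlogq_dmu(z))   # score term kept

stl_var = statistics.pvariance(stl_grads)
total_var = statistics.pvariance(total_grads)
```

With the guide at the optimum, every sticking-the-landing sample of the gradient with respect to `mu` is zero, while the total-derivative estimator is still noisy (its mean is zero but its per-sample values are not).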
DoublyReparameterized
Bases: GradientEstimator
Doubly-reparameterized IWAE gradient (Tucker-Lawson-Gu-Maddison 2019).
Specialised for the IWAE bound at K particles. Reweights the
per-particle terms so the signal-to-noise ratio of the guide
gradient no longer collapses as :math:`K \to \infty`. The objective
itself is the standard IWAE bound; only the gradient is reweighted.
The estimator's mathematical content is the gradient:

.. math::

   \nabla_\phi \mathcal{L}_{\mathrm{IWAE}} \;=\;
   \mathbb{E}\Bigl[\sum_k w_k^2 \, \nabla_\phi
   \bigl[\log p(z_k) - \log q_\phi(z_k)\bigr]\Bigr]

where :math:`w_k = \exp(\log p_k - \log q_k) / \sum_j \exp(\log p_j - \log q_j)`,
the expectation is over the sampling noise, and the inner gradient
flows only through the reparameterized sample :math:`z_k`, with the
direct dependence of :math:`\log q_\phi` on :math:`\phi` held fixed.
Implementing this as a surrogate loss whose backward() yields the
right gradient is the standard trick: detach the importance weights
from the autograd graph and use them as a non-differentiable scaling
on the per-particle reparameterized difference.
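The squared-weight combination in the formula above can be sketched without an autograd library. In this illustration the helper names and all numbers are hypothetical, and `path_grads` is assumed to already hold the per-particle gradients taken through the samples only; the weights are plain floats, which plays the role of detaching them.

```python
import math


def normalized_weights(log_p, log_q):
    """Self-normalized importance weights w_k, computed stably."""
    log_w = [lp - lq for lp, lq in zip(log_p, log_q)]
    top = max(log_w)
    unnorm = [math.exp(v - top) for v in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def dreg_gradient(log_p, log_q, path_grads):
    """Combine per-particle path gradients with squared weights.

    path_grads[k] is assumed to be the gradient of
    log p(z_k) - log q(z_k) taken through z_k only.
    """
    w = normalized_weights(log_p, log_q)
    return sum(wk**2 * gk for wk, gk in zip(w, path_grads))


# Hypothetical values for K = 3 particles and a scalar parameter.
log_p = [-2.0, -1.0, -3.0]
log_q = [-1.0, -1.0, -1.0]
path_grads = [0.4, -0.2, 1.0]

w = normalized_weights(log_p, log_q)
grad = dreg_gradient(log_p, log_q, path_grads)
```

With a single particle the weight is 1 and the result reduces to that particle's path gradient.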
ScoreFunction
Bases: GradientEstimator
REINFORCE / black-box VI gradient
(Ranganath-Gerrish-Blei 2014,
`doi:10.48550/arXiv.1401.0118 <https://doi.org/10.48550/arXiv.1401.0118>`_).
Uses the log-derivative identity instead of the reparameterization trick. Required when sampling is not differentiable (discrete latents, hard-truncated families, accept-reject samplers). Variance is typically orders of magnitude higher than that of the reparameterized estimator; combine with a control-variate baseline whenever possible.
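The effect of a baseline is easy to see on a small discrete example. The sketch below assumes a single Bernoulli latent with a logit parameter `theta` and a made-up reward table `f`; it is a toy illustration of the log-derivative identity, not the library's implementation.

```python
import math
import random
import statistics

# Assumed toy problem: z ~ Bernoulli(p) with p = sigmoid(theta), and a
# reward f(z). Goal: estimate d/dtheta E[f(z)] without differentiating
# through the sampler.
theta = 0.0
p = 1.0 / (1.0 + math.exp(-theta))
f = {0: 10.0, 1: 11.0}

exact = p * (1.0 - p) * (f[1] - f[0])   # closed form, for comparison
baseline = p * f[1] + (1.0 - p) * f[0]  # E[f(z)], a simple control variate

rng = random.Random(0)
plain, with_baseline = [], []
for _ in range(20000):
    z = 1 if rng.random() < p else 0
    score = z - p  # d/dtheta log Bernoulli(z; sigmoid(theta))
    plain.append(f[z] * score)
    with_baseline.append((f[z] - baseline) * score)

mean_plain = statistics.fmean(plain)
var_plain = statistics.pvariance(plain)
var_baseline = statistics.pvariance(with_baseline)
```

Both estimators are unbiased for `exact`. In this symmetric two-outcome case the baseline happens to remove the variance entirely; in general it only reduces it.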