Estimators

Gradient-estimator strategies plugged into Objectives: Reparameterized (default pathwise gradient), StickingTheLanding (variance reduction near convergence), DoublyReparameterized (DReG for IWAEBound at large K), and ScoreFunction (REINFORCE, for non-reparameterizable sites).

estimators

Gradient estimators for variational objectives.

A GradientEstimator is the strategy that takes (latent samples, model log-density, guide log-density) and returns a scalar loss whose backward() produces the chosen gradient estimator. Different strategies trade variance against applicability:

  • Reparameterized: pathwise gradient (the standard SVI reparameterization trick). Usually the lowest-variance option for reparameterizable families; requires rsample.
  • StickingTheLanding: detaches the variational-parameter dependence in :math:`\log q_\phi(z)` so the gradient variance asymptotically vanishes as :math:`q \to p^*` (Roeder-Wu-Duvenaud 2017, `doi:10.48550/arXiv.1703.09194 <https://doi.org/10.48550/arXiv.1703.09194>`_).
  • DoublyReparameterized: the DReG estimator for IWAE (Tucker-Lawson-Gu-Maddison 2019, `doi:10.48550/arXiv.1810.04152 <https://doi.org/10.48550/arXiv.1810.04152>`_). Removes the score-function term whose variance grows with the particle count :math:`K`.
  • ScoreFunction: REINFORCE / black-box VI. The fallback for non-reparameterizable sites (discrete latents, rejection-sampled families). Highest variance; pair with a baseline whenever possible.

Estimators are strategies held by Objective implementations; they don't store any state themselves and operate on tensors only. The Reparameterized instance is a singleton — every objective defaults to it.
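The variance trade-off in the list above can be seen on a one-dimensional toy problem. The sketch below is pure Python and independent of the library: it assumes a guide q = N(mu, 1) and a hypothetical test function f(z) = z^2, and estimates d/dmu E_q[f(z)] = 2 * mu with both the pathwise and the score-function estimator. All function names are illustrative.

```python
import random
import statistics

def pathwise_grad(mu, eps):
    # z = mu + eps, so d f(z) / d mu = f'(z) * dz/dmu = 2 * z * 1.
    z = mu + eps
    return 2.0 * z

def score_function_grad(mu, eps):
    # f(z) * d/dmu log N(z; mu, 1) = z**2 * (z - mu).
    z = mu + eps
    return z * z * (z - mu)

rng = random.Random(0)
mu = 1.0
noise = [rng.gauss(0.0, 1.0) for _ in range(20000)]

pathwise = [pathwise_grad(mu, e) for e in noise]
score = [score_function_grad(mu, e) for e in noise]

# Both estimators target the same gradient, 2 * mu.
print(statistics.fmean(pathwise), statistics.fmean(score))
# The score-function samples are far more dispersed.
print(statistics.variance(pathwise), statistics.variance(score))
```

Both sample means land near 2 * mu, but the per-sample variance of the score-function estimates is several times larger, which is why it is treated as a fallback.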

GradientEstimator

Bases: ABC

Strategy for computing :math:`\nabla_\phi \mathcal{L}` from samples and densities.

Subclasses implement negative_objective: given the per-particle log_p and log_q tensors (and any estimator-specific auxiliary data), return the negated objective whose backward() produces the desired gradient estimator.

negative_objective abstractmethod

negative_objective(log_p: Tensor, log_q: Tensor, log_q_detached: Tensor | None = None) -> Tensor

Return the scalar loss whose gradient is the chosen estimator.

PARAMETER DESCRIPTION
log_p

Model log-joint log p(z, y) at the sampled latents. Shape (K, batch) where K is the particle axis (K = 1 for plain ELBO) and batch is the program-input batch axis.

TYPE: Tensor

log_q

Guide log-density log q_phi(z) at the sampled latents. Same shape as log_p. Gradients flow back to the variational parameters through this tensor.

TYPE: Tensor

log_q_detached

log q_{stop_grad(phi)}(z) — the guide log-density with the variational parameters detached from the autograd graph. Required by sticking-the-landing and DReG; ignored by the basic estimators.

TYPE: Tensor or None DEFAULT: None

Source code in src/quivers/inference/estimators.py
@abstractmethod
def negative_objective(
    self,
    log_p: torch.Tensor,
    log_q: torch.Tensor,
    log_q_detached: torch.Tensor | None = None,
) -> torch.Tensor:
    """Return the scalar loss whose gradient is the chosen
    estimator.

    Parameters
    ----------
    log_p : torch.Tensor
        Model log-joint ``log p(z, y)`` at the sampled latents.
        Shape ``(K, batch)`` where ``K`` is the particle axis
        (``K = 1`` for plain ELBO) and ``batch`` is the
        program-input batch axis.
    log_q : torch.Tensor
        Guide log-density ``log q_phi(z)`` at the sampled
        latents. Same shape as ``log_p``. Gradients flow back
        to the variational parameters through this tensor.
    log_q_detached : torch.Tensor or None
        ``log q_{stop_grad(phi)}(z)`` — the guide log-density
        with the variational parameters detached from the
        autograd graph. Required by sticking-the-landing and
        DReG; ignored by the basic estimators.
    """
    ...

Reparameterized

Bases: GradientEstimator

Standard pathwise gradient.

For the ELBO with num_particles = 1 this is the textbook reparameterization trick (Kingma-Welling 2013, `doi:10.48550/arXiv.1312.6114 <https://doi.org/10.48550/arXiv.1312.6114>`_). For larger num_particles it is the importance-weighted bound evaluated on reparameterized samples, i.e. the IWAE bound differentiated with the naive pathwise gradient.
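To make the K-particle case concrete, the sketch below evaluates the bound on plain nested lists laid out as (K, batch), the shape documented for log_p and log_q. It is a pure-Python illustration with made-up numbers, not the library's implementation; averaging over the batch axis is an assumption, since the reduction is not stated here.

```python
import math

def log_mean_exp(values):
    # Numerically stable log(mean(exp(values))).
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def iwae_bound(log_p, log_q):
    # log_p, log_q: nested lists with shape (K, batch).
    num_particles, batch = len(log_p), len(log_p[0])
    per_item = []
    for b in range(batch):
        log_w = [log_p[k][b] - log_q[k][b] for k in range(num_particles)]
        per_item.append(log_mean_exp(log_w))
    # Assumed reduction: mean over the batch axis.
    return sum(per_item) / batch

# Hypothetical log-densities for K = 3 particles and a batch of 2.
log_p = [[-1.0, -2.0], [-1.5, -0.5], [-3.0, -1.0]]
log_q = [[-0.8, -1.1], [-1.0, -0.9], [-1.2, -1.3]]

iwae = iwae_bound(log_p, log_q)
# With K = 1 the log-mean-exp is the identity, so each row alone is a plain ELBO.
elbo = sum(iwae_bound([lp], [lq]) for lp, lq in zip(log_p, log_q)) / len(log_p)
print(iwae, elbo)
```

By Jensen's inequality the K-particle value is never below the average of the single-particle ELBOs computed from the same samples.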

StickingTheLanding

Bases: GradientEstimator

Roeder-Wu-Duvenaud 2017 sticking-the-landing estimator.

Replaces log q(z) in the loss with log q_{stop_grad(phi)}(z): the log-density is evaluated at the same sample but the variational parameters are detached from the autograd graph. The total derivative loses its direct dependence on the variational parameters through log q, leaving only the indirect dependence through the sampled z. As :math:`q \to p^*` the latter vanishes for every sample, and so does the gradient variance.

Use when training with a guide that's already close to the true posterior — typically after a warm-up phase. May increase variance early in training when q is far from p.
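The vanishing-variance claim can be checked by hand on a Gaussian toy problem. The sketch below is pure Python with analytic derivatives and rests on assumptions that are not part of the library: the target is p(z) = N(0, 1), the guide is q = N(mu, sigma^2) with z = mu + sigma * eps, and only the gradient with respect to mu is considered. The function names are hypothetical.

```python
import random
import statistics

def total_grad_mu(mu, sigma, eps):
    # Total derivative of log p(z) - log q_phi(z) w.r.t. mu with z = mu + sigma * eps.
    # log q_phi(z(phi)) = -eps**2 / 2 - log(sigma) + const does not depend on mu,
    # so only d log p / dz = -z survives.
    z = mu + sigma * eps
    return -z

def stl_grad_mu(mu, sigma, eps):
    # Sticking-the-landing: phi inside log q is held fixed, so only the path
    # through z contributes: (d log p/dz - d log q/dz) * dz/dmu.
    z = mu + sigma * eps
    dlogp_dz = -z
    dlogq_dz = -(z - mu) / sigma**2
    return (dlogp_dz - dlogq_dz) * 1.0

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(5000)]

# Guide equal to the target: mu = 0, sigma = 1.
total = [total_grad_mu(0.0, 1.0, e) for e in noise]
stl = [stl_grad_mu(0.0, 1.0, e) for e in noise]
print(statistics.variance(total))   # stays of order one
print(max(abs(g) for g in stl))     # every sticking-the-landing sample is zero
```

Away from the optimum the two functions still agree in expectation; they differ only by a zero-mean term, which is the part sticking-the-landing drops.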

DoublyReparameterized

Bases: GradientEstimator

Doubly-reparameterized IWAE gradient (Tucker-Lawson-Gu-Maddison 2019).

Specialized for the IWAE bound at K particles. Reweights the per-particle terms so the signal-to-noise ratio of the guide gradient no longer collapses as :math:`K \to \infty`. The objective itself is the standard IWAE bound; only the gradient is reweighted.

The estimator's mathematical content is the gradient:

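.. math::

   \nabla_\phi \mathcal{L}_{\mathrm{IWAE}} \;=\;
   \sum_k w_k^2 \, \nabla_\phi
       \bigl[\log p(z_k) - \log q_\phi(z_k)\bigr]

where :math:`w_k = \exp(\log p_k - \log q_k) / \sum_j \exp(\log p_j - \log q_j)` and :math:`\nabla_\phi` acts only through the reparameterized samples :math:`z_k`, with the parameters inside :math:`\log q_\phi` held fixed (the log_q_detached input). The identity holds in expectation over the sampling noise. Implementing this as a surrogate loss whose backward() yields the right gradient is the standard trick: detach the importance weights from the autograd graph and use them as a non-differentiable scaling on the per-particle reparameterized difference.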
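As a concrete reading of the formula, the sketch below computes the normalized weights with a stable log-sum-exp and forms the squared-weight path derivative for a Gaussian toy problem. It is pure Python with analytic derivatives; the target log p(z) = -(z - 2)^2 / 2, the guide N(mu, sigma^2), and all function names are assumptions made for illustration, not the library's code.

```python
import math

def normalized_weights(log_p, log_q):
    # w_k = exp(log_p_k - log_q_k) / sum_j exp(log_p_j - log_q_j), computed stably.
    log_w = [lp - lq for lp, lq in zip(log_p, log_q)]
    m = max(log_w)
    unnorm = [math.exp(lw - m) for lw in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def dreg_grad_mu(mu, sigma, eps_list, target_mean=2.0):
    zs = [mu + sigma * e for e in eps_list]
    log_p = [-0.5 * (z - target_mean) ** 2 for z in zs]
    log_q = [-0.5 * ((z - mu) / sigma) ** 2 - math.log(sigma) for z in zs]
    w = normalized_weights(log_p, log_q)
    grad = 0.0
    for w_k, z in zip(w, zs):
        # Path derivative of log p(z_k) - log q(z_k) w.r.t. mu, with the
        # parameters inside log q held fixed (dz_k/dmu = 1).
        dlogw_dz = -(z - target_mean) + (z - mu) / sigma**2
        grad += w_k**2 * dlogw_dz
    return grad, w

grad, w = dreg_grad_mu(mu=0.0, sigma=1.0, eps_list=[-0.5, 0.1, 1.2])
print(w, grad)
```

With the guide centred at 0 and the target at 2, the estimate is positive and pushes mu toward the target; once the guide matches the target, every per-particle term is zero.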

ScoreFunction

Bases: GradientEstimator

REINFORCE / black-box VI gradient (Ranganath-Gerrish-Blei 2014, `doi:10.48550/arXiv.1401.0118 <https://doi.org/10.48550/arXiv.1401.0118>`_).

Uses the log-derivative identity instead of the reparameterization trick. Required when sampling is not differentiable (discrete latents, hard-truncated families, accept-reject samplers). Variance is typically orders of magnitude higher than that of the reparameterized estimator, so combine it with a control-variate baseline whenever possible.
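The effect of a baseline can be worked out exactly for a single Bernoulli latent, where both outcomes can be enumerated. The sketch below is pure Python and purely illustrative: the per-sample objective f, the parameter value theta = 0.3, and the choice of E_q[f] as a constant baseline are all assumptions. Subtracting a constant leaves the expectation unchanged because the expected score is zero.

```python
def score_estimates(theta, f, baseline=0.0):
    # Single-sample estimates (f(z) - baseline) * d/dtheta log q(z) for
    # z ~ Bernoulli(theta), returned as (probability, estimate) pairs.
    score_one = 1.0 / theta             # d/dtheta log(theta)
    score_zero = -1.0 / (1.0 - theta)   # d/dtheta log(1 - theta)
    return [
        (theta, (f(1) - baseline) * score_one),
        (1.0 - theta, (f(0) - baseline) * score_zero),
    ]

def mean_and_variance(outcomes):
    # Exact moments of a two-outcome estimator.
    mean = sum(p * g for p, g in outcomes)
    var = sum(p * (g - mean) ** 2 for p, g in outcomes)
    return mean, var

def f(z):
    # Hypothetical per-sample objective.
    return (z - 0.45) ** 2

theta = 0.3
expected_f = theta * f(1) + (1.0 - theta) * f(0)

plain = mean_and_variance(score_estimates(theta, f))
with_baseline = mean_and_variance(score_estimates(theta, f, baseline=expected_f))
print(plain, with_baseline)
```

Both estimators have mean f(1) - f(0), the exact gradient of E_q[f] with respect to theta, while the baseline version has a much smaller variance.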