StochTree 0.0.1
Classes / functions for implementing leaf models.
Classes
class StochTree::GaussianConstantSuffStat
    Sufficient statistic and associated operations for the Gaussian homoskedastic constant leaf outcome model.
class StochTree::GaussianConstantLeafModel
    Marginal likelihood and posterior computation for the Gaussian homoskedastic constant leaf outcome model.
class StochTree::GaussianUnivariateRegressionSuffStat
    Sufficient statistic and associated operations for the Gaussian homoskedastic univariate regression leaf outcome model.
class StochTree::GaussianUnivariateRegressionLeafModel
    Marginal likelihood and posterior computation for the Gaussian homoskedastic univariate regression leaf outcome model.
class StochTree::GaussianMultivariateRegressionSuffStat
    Sufficient statistic and associated operations for the Gaussian homoskedastic multivariate regression leaf outcome model.
class StochTree::GaussianMultivariateRegressionLeafModel
    Marginal likelihood and posterior computation for the Gaussian homoskedastic multivariate regression leaf outcome model.
class StochTree::LogLinearVarianceSuffStat
    Sufficient statistic and associated operations for the heteroskedastic log-linear variance model.
class StochTree::LogLinearVarianceLeafModel
    Marginal likelihood and posterior computation for the heteroskedastic log-linear variance model.
Enumerations
enum StochTree::ModelType
    Leaf models for the forest sampler.
Classes / functions for implementing leaf models.
Stochastic tree algorithms are all essentially hierarchical models with an adaptive group structure defined by an ensemble of decision trees. Each novel model is governed by two components: a LeafModel class, which defines the integrated likelihood and posterior conditional on a particular tree structure, and a SuffStat class, which tracks and accumulates the sufficient statistics that a LeafModel needs.
To provide a thorough overview of this interface (and, importantly, how to extend it), we must introduce some mathematical notation. Any forest-based regression model involves an outcome, which we'll call \(y\), and features (or "covariates"), which we'll call \(X\). Our goal is to predict \(y\) as a function of \(X\), which we'll call \(f(X)\).
NOTE: if we have a more complicated, but still additive, model, such as \(y = X\beta + f(X)\), then we can just model \(y - X\beta = f(X)\), treating the residual \(y - X\beta\) as the outcome data, and we are back to the general setting above.
Now, since \(f(X)\) is an additive tree ensemble, we can think of it as the sum of \(b\) separate decision tree functions, where \(b\) is the number of trees in an ensemble, so that
\[ f(X) = f_1(X) + \dots + f_b(X) \]
and each decision tree function \(f_j\) has the property that features \(X\) are used to determine which leaf node an observation falls into, and then the parameters attached to that leaf node are used to compute \(f_j(X)\). The exact mechanics of this process are model-dependent, so now we introduce the "leaf node" models that stochtree supports.
The most standard and common tree ensemble is a sum of "constant leaf" trees, in which a leaf node's parameter uniquely determines the prediction for all observations that fall into that leaf. For example, if leaf 2 for a tree is reached by the conditions that \(X_1 < 0.4 \; \& \; X_2 > 0.6\), then every observation whose first feature is less than 0.4 and whose second feature is greater than 0.6 will receive the same prediction. Mathematically, for an observation \(i\) this looks like
\[ f_j(X_i) = \sum_{\ell \in L} \mathbb{1}(X_i \in \ell) \mu_{\ell} \]
where \(L\) denotes the indices of every leaf node, \(\mu_{\ell}\) is the parameter attached to leaf node \(\ell\), and \(\mathbb{1}(X_i \in \ell)\) indicates whether observation \(X_i\) falls into leaf node \(\ell\).
The way that we make such a model "stochastic" is by attaching to the leaf node parameters \(\mu_{\ell}\) a "prior" distribution. This leaf model corresponds to the "classic" BART model of Chipman et al (2010) as well as its "XBART" extension (He and Hahn (2023)). We assign each leaf node parameter a prior
\[ \mu \sim \mathcal{N}\left(0, \tau\right) \]
Assuming a homoskedastic Gaussian outcome likelihood (i.e. \(y_i \sim \mathcal{N}\left(f(X_i),\sigma^2\right)\)), the log marginal likelihood in this model, for the outcome data in node \(\ell\) of tree \(j\) is given by
\[ L(y) = -\frac{n_{\ell}}{2}\log(2\pi) - n_{\ell}\log(\sigma) + \frac{1}{2} \log\left(\frac{\sigma^2}{n_{\ell} \tau + \sigma^2}\right) - \frac{s_{yy,\ell}}{2\sigma^2} + \frac{\tau s_{y,\ell}^2}{2\sigma^2(n_{\ell} \tau + \sigma^2)} \]
where
\[ n_{\ell} = \sum_{i : X_i \in \ell} 1 \]
\[ s_{y,\ell} = \sum_{i : X_i \in \ell} r_i \]
\[ s_{yy,\ell} = \sum_{i : X_i \in \ell} r_i^2 \]
\[ r_i = y_i - \sum_{k \neq j} \mu_k(X_i) \]
In words, this model depends on the data for a given leaf node only through three sufficient statistics, \(n_{\ell}\), \(s_{y,\ell}\), and \(s_{yy,\ell}\), and it only depends on the other trees in the ensemble through the "partial residual" \(r_i\). The posterior distribution for node \(\ell\)'s leaf parameter is similarly defined as:
\[ \mu_{\ell} \mid - \sim \mathcal{N}\left(\frac{\tau s_{y,\ell}}{n_{\ell} \tau + \sigma^2}, \frac{\tau \sigma^2}{n_{\ell} \tau + \sigma^2}\right) \]
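As a concrete illustration of how this posterior maps to a sampling step, below is a small self-contained sketch that draws a leaf parameter given \(n_{\ell}\), \(s_{y,\ell}\), \(\tau\), and \(\sigma^2\); the function name and signature are hypothetical and not part of the stochtree API:

#include <cmath>
#include <random>

// Illustrative only: draw mu_ell | - ~ N(tau * s_y / (n * tau + sigma^2),
//                                        tau * sigma^2 / (n * tau + sigma^2)),
// i.e. the posterior given above for a leaf's parameter.
double SampleConstantLeafParameter(int n_ell, double s_y_ell, double tau,
                                   double sigma2, std::mt19937& gen) {
  double denom = n_ell * tau + sigma2;
  std::normal_distribution<double> leaf_dist((tau * s_y_ell) / denom,
                                             std::sqrt((tau * sigma2) / denom));
  return leaf_dist(gen);
}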
Now, consider the possibility that each observation carries a unique weight \(w_i\). These could be "case weights" in a survey context or individual-level variances ("heteroskedasticity"). These case weights transform the outcome distribution (and associated likelihood) to
\[ y_i \mid - \sim \mathcal{N}\left(\mu(X_i), \frac{\sigma^2}{w_i}\right) \]
This gives a modified log marginal likelihood of
\[ L(y) = -\frac{n_{\ell}}{2}\log(2\pi) - \frac{1}{2} \sum_{i : X_i \in \ell} \log\left(\frac{\sigma^2}{w_i}\right) + \frac{1}{2} \log\left(\frac{\sigma^2}{s_{w,\ell} \tau + \sigma^2}\right) - \frac{s_{wyy,\ell}}{2\sigma^2} + \frac{\tau s_{wy,\ell}^2}{2\sigma^2(s_{w,\ell} \tau + \sigma^2)} \]
where
\[ s_{w,\ell} = \sum_{i : X_i \in \ell} w_i \]
\[ s_{wy,\ell} = \sum_{i : X_i \in \ell} w_i r_i \]
\[ s_{wyy,\ell} = \sum_{i : X_i \in \ell} w_i r_i^2 \]
Finally, note that when we consider splitting leaf \(\ell\) into new left and right leaves, or pruning two nodes into a single leaf node, we compute the log marginal likelihood of the combined data and the log marginal likelihoods of the left and right leaves and compare these three values.
The terms \(\frac{n_{\ell}}{2}\log(2\pi)\), \(\sum_{i : X_i \in \ell} \log\left(\frac{\sigma^2}{w_i}\right)\), and \(\frac{s_{wyy,\ell}}{2\sigma^2}\) are such that their left and right node values will always sum to the respective value in the combined log marginal likelihood, so they can be ignored when evaluating splits or prunes and thus the reduced log marginal likelihood is
\[ L(y) \propto \frac{1}{2} \log\left(\frac{\sigma^2}{s_{w,\ell} \tau + \sigma^2}\right) + \frac{\tau s_{wy,\ell}^2}{2\sigma^2(s_{w,\ell} \tau + \sigma^2)} \]
So the GaussianConstantSuffStat class tracks a generalized version of these three statistics (which allows for each observation to have a weight \(w_i \neq 1\)):
data_size_t n
double sum_w
double sum_yw
And these values are used by the GaussianConstantLeafModel class in the SplitLogMarginalLikelihood, NoSplitLogMarginalLikelihood, PosteriorParameterMean, and PosteriorParameterVariance methods.
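To give one example, here is a minimal sketch of how SplitLogMarginalLikelihood can be computed from these sufficient statistics, assuming the leaf prior variance \(\tau\) and the global error variance \(\sigma^2\) are passed in; the free-function signatures below are illustrative, not the actual GaussianConstantLeafModel method signatures:

#include <cmath>

// Reduced log marginal likelihood of a single node, per the formula above:
//   0.5 * log(sigma^2 / (s_w * tau + sigma^2)) + tau * s_wy^2 / (2 * sigma^2 * (s_w * tau + sigma^2))
// sum_w and sum_yw correspond to GaussianConstantSuffStat's sum_w and sum_yw members for the node.
double SuffStatLogMarginalLikelihood(double sum_w, double sum_yw,
                                     double tau, double global_variance) {
  double denom = sum_w * tau + global_variance;
  return 0.5 * std::log(global_variance / denom) +
         (tau * sum_yw * sum_yw) / (2.0 * global_variance * denom);
}

// A split's log marginal likelihood is the sum of the left and right nodes' values.
double SplitLogMarginalLikelihood(double left_sum_w, double left_sum_yw,
                                  double right_sum_w, double right_sum_yw,
                                  double tau, double global_variance) {
  return SuffStatLogMarginalLikelihood(left_sum_w, left_sum_yw, tau, global_variance) +
         SuffStatLogMarginalLikelihood(right_sum_w, right_sum_yw, tau, global_variance);
}

Grow (split) and prune proposals compare this quantity against the analogous no-split value evaluated on the combined sufficient statistics, as described above.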
In the Gaussian multivariate regression leaf model, the tree defines a "partitioned linear model" in which leaf node parameters define regression weights that are multiplied by a "basis" \(\Omega\) to determine the prediction for an observation:
\[ f_j(X_i) = \sum_{\ell \in L} \mathbb{1}(X_i \in \ell) \Omega_i \vec{\beta_{\ell}} \]
and we assign \(\beta_{\ell}\) a prior of
\[ \vec{\beta_{\ell}} \sim \mathcal{N}\left(\vec{\beta_0}, \Sigma_0\right) \]
where \(\vec{\beta_0}\) is typically a vector of zeros. The outcome likelihood is still
\[ y_i \sim \mathcal{N}\left(f(X_i), \sigma^2\right) \]
This gives a reduced log integrated likelihood of
\[ L(y) \propto - \frac{1}{2} \log\left(\textrm{det}\left(I_p + \frac{\Sigma_0\Omega'\Omega}{\sigma^2}\right)\right) + \frac{1}{2}\frac{y'\Omega}{\sigma^2}\left(\Sigma_0^{-1} + \frac{\Omega'\Omega}{\sigma^2}\right)^{-1}\frac{\Omega'y}{\sigma^2} \]
where \(\Omega\) is the matrix whose rows are the bases of every observation in leaf \(\ell\) and \(p\) is the number of columns of \(\Omega\) (the dimension of the leaf basis). The posterior for \(\vec{\beta_{\ell}}\) is
\[ \vec{\beta_{\ell}} \sim \mathcal{N}\left(\left(\Sigma_0^{-1} + \frac{\Omega'\Omega}{\sigma^2}\right)^{-1}\left(\frac{\Omega'y}{\sigma^2}\right),\left(\Sigma_0^{-1} + \frac{\Omega'\Omega}{\sigma^2}\right)^{-1}\right) \]
This is an extension of the single-tree model of Chipman et al (2002).
We can also enable heteroskedasticity by defining a (diagonal) covariance matrix for the outcome likelihood
\[ \Sigma_y = \text{diag}\left(\sigma^2 / w_1,\sigma^2 / w_2,\dots,\sigma^2 / w_n\right) \]
This updates the reduced log integrated likelihood to
\[ L(y) \propto - \frac{1}{2} \log\left(\textrm{det}\left(I_p + \Sigma_{0}\Omega'\Sigma_y^{-1}\Omega\right)\right) + \frac{1}{2}y'\Sigma_{y}^{-1}\Omega\left(\Sigma_{0}^{-1} + \Omega'\Sigma_{y}^{-1}\Omega\right)^{-1}\Omega'\Sigma_{y}^{-1}y \]
and a posterior for \(\vec{\beta_{\ell}}\) of
\[ \vec{\beta_{\ell}} \sim \mathcal{N}\left(\left(\Sigma_{0}^{-1} + \Omega'\Sigma_{y}^{-1}\Omega\right)^{-1}\left(\Omega'\Sigma_{y}^{-1}y\right),\left(\Sigma_{0}^{-1} + \Omega'\Sigma_{y}^{-1}\Omega\right)^{-1}\right) \]
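To make the matrix algebra concrete, below is a small sketch (assuming Eigen for the linear algebra and \(\vec{\beta_0} = 0\), as in the formulas above) that forms the posterior mean and covariance of \(\vec{\beta_{\ell}}\) from a leaf's basis matrix, partial residuals, weights, and prior covariance; the struct and function names are hypothetical, not the GaussianMultivariateRegressionLeafModel API:

#include <Eigen/Dense>

// Illustrative only: posterior moments of the leaf regression coefficients,
//   beta | - ~ N( (Sigma0^{-1} + Omega' Sigma_y^{-1} Omega)^{-1} (Omega' Sigma_y^{-1} y),
//                 (Sigma0^{-1} + Omega' Sigma_y^{-1} Omega)^{-1} )
// with Sigma_y = diag(sigma^2 / w_i), assuming a zero prior mean.
struct LeafPosterior {
  Eigen::VectorXd mean;
  Eigen::MatrixXd covariance;
};

LeafPosterior MultivariateLeafPosterior(const Eigen::MatrixXd& omega,   // n_leaf x p basis matrix
                                        const Eigen::VectorXd& resid,   // partial residuals in the leaf
                                        const Eigen::VectorXd& weights, // case / variance weights w_i
                                        const Eigen::MatrixXd& sigma0,  // p x p prior covariance
                                        double sigma2) {
  // Sigma_y^{-1} * Omega, exploiting the diagonal structure of Sigma_y
  Eigen::MatrixXd weighted_omega = (weights.asDiagonal() * omega) / sigma2;
  Eigen::MatrixXd precision = sigma0.inverse() + omega.transpose() * weighted_omega;
  Eigen::MatrixXd covariance = precision.inverse();
  Eigen::VectorXd mean = covariance * (weighted_omega.transpose() * resid);
  return {mean, covariance};
}

In practice one would factor the precision matrix (e.g. via an LDLT decomposition) rather than forming explicit inverses; the direct inverses here simply mirror the formulas above.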
The Gaussian univariate regression leaf model specializes the Gaussian multivariate regression leaf model to a univariate leaf basis, which allows for several computational speedups (replacing generalized matrix operations with simple summation or sum-product operations). We simplify \(\Omega\) to \(\omega\), a univariate basis for every observation, so that \(\Omega'\Omega = \sum_{i : X_i \in \ell}\omega_i^2\) and \(\Omega'y = \sum_{i : X_i \in \ell}\omega_i r_i\). Similarly, the prior for the leaf parameter becomes univariate normal, as in the Gaussian constant leaf model above:
\[ \beta \sim \mathcal{N}\left(0, \tau\right) \]
Allowing for case / variance weights \(w_i\) as above, we derive a reduced log marginal likelihood of
\[ L(y) \propto \frac{1}{2} \log\left(\frac{\sigma^2}{s_{wxx,\ell} \tau + \sigma^2}\right) + \frac{\tau s_{wyx,\ell}^2}{2\sigma^2(s_{wxx,\ell} \tau + \sigma^2)} \]
where
\[ s_{wxx,\ell} = \sum_{i : X_i \in \ell} w_i \omega_i^2 \]
\[ s_{wyx,\ell} = \sum_{i : X_i \in \ell} w_i r_i \omega_i \]
and a posterior of
\[ \beta_{\ell} \mid - \sim \mathcal{N}\left(\frac{\tau s_{wyx,\ell}}{s_{wxx,\ell} \tau + \sigma^2}, \frac{\tau \sigma^2}{s_{wxx,\ell} \tau + \sigma^2}\right) \]
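The univariate specialization reduces the matrix operations above to scalar accumulations. The sketch below accumulates \(s_{wxx,\ell}\) and \(s_{wyx,\ell}\) over a leaf's observations and forms the posterior mean and variance of \(\beta_{\ell}\); the names and signature are hypothetical, not the GaussianUnivariateRegressionSuffStat / GaussianUnivariateRegressionLeafModel API:

#include <cstddef>
#include <vector>

// Illustrative only: accumulate s_wxx = sum_i w_i * omega_i^2 and s_wyx = sum_i w_i * r_i * omega_i
// over the observations in a leaf, then form the posterior moments of beta_ell as given above.
struct UnivariateLeafPosterior {
  double mean;
  double variance;
};

UnivariateLeafPosterior UnivariateRegressionLeafPosterior(
    const std::vector<double>& basis,    // omega_i for observations in the leaf
    const std::vector<double>& resid,    // partial residuals r_i
    const std::vector<double>& weights,  // case / variance weights w_i
    double tau, double sigma2) {
  double s_wxx = 0.0, s_wyx = 0.0;
  for (std::size_t i = 0; i < basis.size(); ++i) {
    s_wxx += weights[i] * basis[i] * basis[i];
    s_wyx += weights[i] * resid[i] * basis[i];
  }
  double denom = s_wxx * tau + sigma2;
  return {tau * s_wyx / denom, tau * sigma2 / denom};
}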
Each of the above models is a variation on a theme: a conjugate, partitioned Gaussian leaf model. The log-linear variance leaf model, by contrast, allows for forest-based heteroskedasticity modeling using an inverse gamma prior on the exponentiated leaf parameter, as discussed in Murray (2021). Define a variance function based on an ensemble of \(b\) trees as
\[ \sigma^2(X) = \exp\left(s_1(X) + \dots + s_b(X)\right) \]
where each tree function \(s_j(X)\) is defined as
\[ s_j(X_i) = \sum_{\ell \in L} \mathbb{1}(X_i \in \ell) \lambda_{\ell} \]
We reparameterize \(\lambda_{\ell} = \log(\mu_{\ell})\) and we place an inverse gamma prior on \(\mu_{\ell}\)
\[ \mu_{\ell} \sim \text{IG}\left(a, b\right) \]
As noted in Murray (2021), this model no longer enables the "Bayesian backfitting" simplification of conjugate Gaussian leaf models, in which sampling updates for a given tree depend on the other trees in the ensemble only through their imprint on the partial residual \(r_i = y_i - \sum_{k \neq j} \mu_k(X_i)\). However, this model is part of a broader class of models with convenient "blocked MCMC" sampling updates (another important example being multinomial classification).
Under an outcome model
\[ y \sim \mathcal{N}\left(f(X), \sigma_0^2 \sigma^2(X)\right) \]
updates to \(\mu_{\ell}\) for a given tree \(j\) are based on a reduced log marginal likelihood of
\[ L(y) \propto a \log (b) - \log \Gamma (a) + \log \Gamma \left(a + \frac{n_{\ell}}{2}\right) - \left(a + \frac{n_{\ell}}{2}\right) \log\left(b + \frac{s_{\sigma,\ell}}{2\sigma_0^2}\right) \]
where
\[ n_{\ell} = \sum_{i : X_i \in \ell} 1 \]
\[ s_{\sigma,\ell} = \sum_{i : X_i \in \ell} \frac{(y_i - f(X_i))^2}{\prod_{k \neq j} \exp\left(s_k(X_i)\right)} \]
and a posterior of
\[ \mu_{\ell} \mid - \sim \text{IG}\left( a + \frac{n_{\ell}}{2} , b + \frac{s_{\sigma,\ell}}{2\sigma_0^2} \right) \]
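A draw from this inverse gamma posterior can be obtained by drawing a gamma variate with the corresponding shape and rate and inverting it; the following self-contained sketch is illustrative (the function name and signature are not the stochtree API):

#include <random>

// Illustrative only: sample mu_ell | - ~ IG(a + n/2, b + s_sigma / (2 * sigma0^2))
// by drawing X ~ Gamma(shape = a + n/2, rate = b + s_sigma / (2 * sigma0^2)) and returning 1/X.
// std::gamma_distribution is parameterized by (shape, scale), so scale = 1/rate.
double SampleLogLinearVarianceLeafParameter(int n_ell, double s_sigma_ell,
                                            double a, double b, double sigma0_sq,
                                            std::mt19937& gen) {
  double shape = a + 0.5 * n_ell;
  double rate = b + s_sigma_ell / (2.0 * sigma0_sq);
  std::gamma_distribution<double> gamma_dist(shape, 1.0 / rate);
  return 1.0 / gamma_dist(gen);
}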
Thus, as above, we implement a sufficient statistic class (LogLinearVarianceSuffStat), which tracks
data_size_t n
double weighted_sum_ei
And these values are used by the LogLinearVarianceLeafModel class in the SplitLogMarginalLikelihood, NoSplitLogMarginalLikelihood, PosteriorParameterShape, and PosteriorParameterScale methods.
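To give one example, here is a minimal sketch of how NoSplitLogMarginalLikelihood can be computed from these sufficient statistics, assuming weighted_sum_ei accumulates \(s_{\sigma,\ell}\) for the node and that the prior parameters \(a\) and \(b\) and the global variance scale \(\sigma_0^2\) are passed in; the free-function signature is illustrative, not the actual LogLinearVarianceLeafModel method signature:

#include <cmath>

// Reduced log marginal likelihood of a single node, per the formula above:
//   a*log(b) - lgamma(a) + lgamma(a + n/2) - (a + n/2) * log(b + s_sigma / (2 * sigma0^2))
// n and weighted_sum_ei correspond to LogLinearVarianceSuffStat's members for the node.
double NoSplitLogMarginalLikelihood(double n, double weighted_sum_ei,
                                    double a, double b, double global_variance) {
  double posterior_shape = a + 0.5 * n;
  double posterior_scale = b + weighted_sum_ei / (2.0 * global_variance);
  return a * std::log(b) - std::lgamma(a) + std::lgamma(posterior_shape) -
         posterior_shape * std::log(posterior_scale);
}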
enum StochTree::ModelType

Leaf models for the forest sampler:

kConstantLeafGaussian: Every leaf node has a zero-centered univariate normal prior and every leaf is constant.
kUnivariateRegressionLeafGaussian: Every leaf node has a zero-centered univariate normal prior and every leaf is a linear model, multiplying the leaf parameter by a (fixed) basis.
kMultivariateRegressionLeafGaussian: Every leaf node has a multivariate normal prior, centered around the zero vector, and every leaf is a linear model, matrix-multiplying the leaf parameters by a (fixed) basis vector.
kLogLinearVariance: Every leaf node has an inverse gamma prior and every leaf is constant.