Leaf Model API

Classes / functions for implementing leaf models. More...

Classes

class  StochTree::GaussianConstantSuffStat
 Sufficient statistic and associated operations for the Gaussian homoskedastic constant leaf outcome model. More...
 
class  StochTree::GaussianConstantLeafModel
 Marginal likelihood and posterior computation for the Gaussian homoskedastic constant leaf outcome model. More...
 
class  StochTree::GaussianUnivariateRegressionSuffStat
 Sufficient statistic and associated operations for the Gaussian homoskedastic univariate regression leaf outcome model. More...
 
class  StochTree::GaussianUnivariateRegressionLeafModel
 Marginal likelihood and posterior computation for the Gaussian homoskedastic univariate regression leaf outcome model. More...
 
class  StochTree::GaussianMultivariateRegressionSuffStat
 Sufficient statistic and associated operations for the Gaussian homoskedastic multivariate regression leaf outcome model. More...
 
class  StochTree::GaussianMultivariateRegressionLeafModel
 Marginal likelihood and posterior computation for the Gaussian homoskedastic multivariate regression leaf outcome model. More...
 
class  StochTree::LogLinearVarianceSuffStat
 Sufficient statistic and associated operations for the heteroskedastic log-linear variance model. More...
 
class  StochTree::LogLinearVarianceLeafModel
 Marginal likelihood and posterior computation for the heteroskedastic log-linear variance model. More...
 
class  StochTree::LogLinearVarianceSuffStat
 Sufficient statistic and associated operations for heteroskedastic log-linear variance model. More...
 
class  StochTree::LogLinearVarianceLeafModel
 Marginal likelihood and posterior computation for heteroskedastic log-linear variance model. More...
 

Enumerations

enum  StochTree::ModelType
 Leaf models for the forest sampler: More...
 

Detailed Description

Classes / functions for implementing leaf models.

Stochastic tree algorithms are all essentially hierarchical models with an adaptive group structure defined by an ensemble of decision trees. Each novel model is governed by its choice of leaf model: the prior placed on the leaf node parameters and the way those parameters combine with the data to form predictions and a likelihood.

To provide a thorough overview of this interface (and, importantly, how to extend it), we must introduce some mathematical notation. Any forest-based regression model involves an outcome, which we'll call \(y\), and features (or "covariates"), which we'll call \(X\). Our goal is to predict \(y\) as a function of \(X\), which we'll call \(f(X)\).

NOTE: if we have a more complicated, but still additive, model, such as \(y = X\beta + f(X)\), then we can just model \(y - X\beta = f(X)\), treating the residual \(y - X\beta\) as the outcome data, and we are back to the general setting above.

Now, since \(f(X)\) is an additive tree ensemble, we can think of it as the sum of \(b\) separate decision tree functions, where \(b\) is the number of trees in an ensemble, so that

\[ f(X) = f_1(X) + \dots + f_b(X) \]

and each decision tree function \(f_j\) has the property that features \(X\) are used to determine which leaf node an observation falls into, and then the parameters attached to that leaf node are used to compute \(f_j(X)\). The exact mechanics of this process are model-dependent, so now we introduce the "leaf node" models that stochtree supports.

Gaussian Constant Leaf Model

The most standard and common tree ensemble is a sum of "constant leaf" trees, in which a leaf node's parameter uniquely determines the prediction for all observations that fall into that leaf. For example, if leaf 2 for a tree is reached by the conditions that \(X_1 < 0.4 \; \& \; X_2 > 0.6\), then every observation whose first feature is less than 0.4 and whose second feature is greater than 0.6 will receive the same prediction. Mathematically, for an observation \(i\) this looks like

\[ f_j(X_i) = \sum_{\ell \in L} \mathbb{1}(X_i \in \ell) \mu_{\ell} \]

where \(L\) denotes the indices of every leaf node, \(\mu_{\ell}\) is the parameter attached to leaf node \(\ell\), and \(\mathbb{1}(X \in \ell)\) checks whether \(X_i\) falls into leaf node \(\ell\).

The way that we make such a model "stochastic" is by attaching to the leaf node parameters \(\mu_{\ell}\) a "prior" distribution. This leaf model corresponds to the "classic" BART model of Chipman et al (2010) as well as its "XBART" extension (He and Hahn (2023)). We assign each leaf node parameter a prior

\[ \mu \sim \mathcal{N}\left(0, \tau\right) \]

Assuming a homoskedastic Gaussian outcome likelihood (i.e. \(y_i \sim \mathcal{N}\left(f(X_i),\sigma^2\right)\)), the log marginal likelihood in this model, for the outcome data in node \(\ell\) of tree \(j\) is given by

\[ L(y) = -\frac{n_{\ell}}{2}\log(2\pi) - n_{\ell}\log(\sigma) + \frac{1}{2} \log\left(\frac{\sigma^2}{n_{\ell} \tau + \sigma^2}\right) - \frac{s_{yy,\ell}}{2\sigma^2} + \frac{\tau s_{y,\ell}^2}{2\sigma^2(n_{\ell} \tau + \sigma^2)} \]

where

\[ n_{\ell} = \sum_{i : X_i \in \ell} 1 \]

\[ s_{y,\ell} = \sum_{i : X_i \in \ell} r_i \]

\[ s_{yy,\ell} = \sum_{i : X_i \in \ell} r_i^2 \]

\[ r_i = y_i - \sum_{k \neq j} \mu_k(X_i) \]

In words, this model depends on the data for a given leaf node only through three sufficient statistics, \(n_{\ell}\), \(s_{y,\ell}\), and \(s_{yy,\ell}\), and it only depends on the other trees in the ensemble through the "partial residual" \(r_i\). The posterior distribution for node \(\ell\)'s leaf parameter is similarly defined as:

\[ \mu_{\ell} \mid - \sim \mathcal{N}\left(\frac{\tau s_{y,\ell}}{n_{\ell} \tau + \sigma^2}, \frac{\tau \sigma^2}{n_{\ell} \tau + \sigma^2}\right) \]

Now, consider the possibility that each observation carries a unique weight \(w_i\). These could be "case weights" in a survey context or individual-level variances ("heteroskedasticity"). These case weights transform the outcome distribution (and associated likelihood) to

\[ y_i \mid - \sim \mathcal{N}\left(\mu(X_i), \frac{\sigma^2}{w_i}\right) \]

This gives a modified log marginal likelihood of

\[ L(y) = -\frac{n_{\ell}}{2}\log(2\pi) - \frac{1}{2} \sum_{i : X_i \in \ell} \log\left(\frac{\sigma^2}{w_i}\right) + \frac{1}{2} \log\left(\frac{\sigma^2}{s_{w,\ell} \tau + \sigma^2}\right) - \frac{s_{wyy,\ell}}{2\sigma^2} + \frac{\tau s_{wy,\ell}^2}{2\sigma^2(s_{w,\ell} \tau + \sigma^2)} \]

where

\[ s_{w,\ell} = \sum_{i : X_i \in \ell} w_i \]

\[ s_{wy,\ell} = \sum_{i : X_i \in \ell} w_i r_i \]

\[ s_{wyy,\ell} = \sum_{i : X_i \in \ell} w_i r_i^2 \]

Finally, note that when we consider splitting leaf \(\ell\) into new left and right leaves, or pruning two nodes into a single leaf node, we compute the log marginal likelihood of the combined data and the log marginal likelihoods of the left and right leaves and compare these three values.

The terms \(\frac{n_{\ell}}{2}\log(2\pi)\), \(\sum_{i : X_i \in \ell} \log\left(\frac{\sigma^2}{w_i}\right)\), and \(\frac{s_{wyy,\ell}}{2\sigma^2}\) are such that their left and right node values will always sum to the respective value in the combined log marginal likelihood, so they can be ignored when evaluating splits or prunes and thus the reduced log marginal likelihood is

\[ L(y) \propto \frac{1}{2} \log\left(\frac{\sigma^2}{s_{w,\ell} \tau + \sigma^2}\right) + \frac{\tau s_{wy,\ell}^2}{2\sigma^2(s_{w,\ell} \tau + \sigma^2)} \]

So the GaussianConstantSuffStat class tracks a generalized version of these statistics, allowing each observation to carry a weight \(w_i \neq 1\): \(n_{\ell}\), \(s_{w,\ell}\), and \(s_{wy,\ell}\).
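A minimal sketch of those fields is shown below; the member names sum_w and sum_yw match their use in the marginal likelihood code that follows, while the struct layout and the accumulation helper are illustrative assumptions rather than the library's exact interface.

// Hedged sketch of the Gaussian constant leaf sufficient statistics; sum_w and
// sum_yw mirror their use below, everything else is illustrative.
struct GaussianConstantSuffStatSketch {
  int n = 0;            // n_l: number of observations in the leaf
  double sum_w = 0.0;   // s_{w,l}: sum of case weights
  double sum_yw = 0.0;  // s_{wy,l}: weighted sum of partial residuals
  void AddObservation(double residual, double weight) {
    n += 1;
    sum_w += weight;
    sum_yw += weight * residual;
  }
};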

And these values are used by the GaussianConstantLeafModel class in the SplitLogMarginalLikelihood, NoSplitLogMarginalLikelihood, PosteriorParameterMean, and PosteriorParameterVariance methods. To give one example, below is the implementation of SplitLogMarginalLikelihood:

// Reduced log marginal likelihood of the proposed left node
double left_log_ml = (
  -0.5*std::log(1 + tau_*(left_stat.sum_w/global_variance)) + ((tau_*left_stat.sum_yw*left_stat.sum_yw)/(2.0*global_variance*(tau_*left_stat.sum_w + global_variance)))
);

// Reduced log marginal likelihood of the proposed right node
double right_log_ml = (
  -0.5*std::log(1 + tau_*(right_stat.sum_w/global_variance)) + ((tau_*right_stat.sum_yw*right_stat.sum_yw)/(2.0*global_variance*(tau_*right_stat.sum_w + global_variance)))
);

// A split is evaluated on the sum of the two nodes' reduced log marginal likelihoods
return left_log_ml + right_log_ml;
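The posterior computations follow directly from the weighted posterior given above. A minimal sketch (standalone functions with illustrative names, not the library's actual method signatures):

// Hedged sketch: posterior mean and variance of a Gaussian constant leaf
// parameter, computed from the weighted sufficient statistics.
double GaussianConstantPosteriorMean(double sum_w, double sum_yw, double tau, double global_variance) {
  return (tau * sum_yw) / (sum_w * tau + global_variance);
}

double GaussianConstantPosteriorVariance(double sum_w, double tau, double global_variance) {
  return (tau * global_variance) / (sum_w * tau + global_variance);
}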

Gaussian Multivariate Regression Leaf Model

In this model, the tree defines a "partitioned linear model" in which leaf node parameters define regression weights that are multiplied by a "basis" \(\Omega\) to determine the prediction for an observation.

\[ f_j(X_i) = \sum_{\ell \in L} \mathbb{1}(X_i \in \ell) \Omega_i \vec{\beta_{\ell}} \]

and we assign \(\vec{\beta_{\ell}}\) a prior of

\[ \vec{\beta_{\ell}} \sim \mathcal{N}\left(\vec{\beta_0}, \Sigma_0\right) \]

where \(\vec{\beta_0}\) is typically a vector of zeros. The outcome likelihood is still

\[ y_i \sim \mathcal{N}\left(f(X_i), \sigma^2\right) \]

This gives a reduced log integrated likelihood of

\[ L(y) \propto - \frac{1}{2} \log\left(\textrm{det}\left(I_p + \frac{\Sigma_0\Omega'\Omega}{\sigma^2}\right)\right) + \frac{1}{2}\frac{y'\Omega}{\sigma^2}\left(\Sigma_0^{-1} + \frac{\Omega'\Omega}{\sigma^2}\right)^{-1}\frac{\Omega'y}{\sigma^2} \]

where \(\Omega\) is the matrix whose rows are the bases of every observation in leaf \(\ell\) and \(p\) is the number of columns of \(\Omega\) (i.e. the dimension of the basis). The posterior for \(\vec{\beta_{\ell}}\) is

\[ \vec{\beta_{\ell}} \sim \mathcal{N}\left(\left(\Sigma_0^{-1} + \frac{\Omega'\Omega}{\sigma^2}\right)^{-1}\left(\frac{\Omega'y}{\sigma^2}\right),\left(\Sigma_0^{-1} + \frac{\Omega'\Omega}{\sigma^2}\right)^{-1}\right) \]

This extends the single-tree model of Chipman et al (2002) to an ensemble of trees.

We can also enable heteroskedasticity by defining a (diagonal) covariance matrix for the outcome likelihood

\[ \Sigma_y = \text{diag}\left(\sigma^2 / w_1,\sigma^2 / w_2,\dots,\sigma^2 / w_n\right) \]

This updates the reduced log integrated likelihood to

\[ L(y) \propto - \frac{1}{2} \log\left(\textrm{det}\left(I_p + \Sigma_{0}\Omega'\Sigma_y^{-1}\Omega\right)\right) + \frac{1}{2}y'\Sigma_{y}^{-1}\Omega\left(\Sigma_{0}^{-1} + \Omega'\Sigma_{y}^{-1}\Omega\right)^{-1}\Omega'\Sigma_{y}^{-1}y \]

and a posterior for \(\vec{\beta_{\ell}}\) of

\[ \vec{\beta_{\ell}} \sim \mathcal{N}\left(\left(\Sigma_{0}^{-1} + \Omega'\Sigma_{y}^{-1}\Omega\right)^{-1}\left(\Omega'\Sigma_{y}^{-1}y\right),\left(\Sigma_{0}^{-1} + \Omega'\Sigma_{y}^{-1}\Omega\right)^{-1}\right) \]
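To make the matrix algebra concrete, here is a minimal sketch of the heteroskedastic posterior computation using Eigen. It is an illustration of the formulas above, not the library's implementation, and all names and signatures are assumptions.

#include <Eigen/Dense>
#include <utility>

// Hedged sketch: posterior mean and covariance of the leaf regression
// coefficients. Sigma0 is the prior covariance, Omega stacks the basis vectors
// of the observations in the leaf, y holds their partial residuals, w their
// case weights, and sigma2 is the global variance parameter sigma^2.
std::pair<Eigen::VectorXd, Eigen::MatrixXd> LeafRegressionPosterior(
    const Eigen::MatrixXd& Sigma0, const Eigen::MatrixXd& Omega,
    const Eigen::VectorXd& y, const Eigen::VectorXd& w, double sigma2) {
  // Diagonal of Sigma_y^{-1}: w_i / sigma^2
  Eigen::VectorXd sigma_y_inv = w / sigma2;
  // Omega' Sigma_y^{-1} Omega
  Eigen::MatrixXd gram = Omega.transpose() * sigma_y_inv.asDiagonal() * Omega;
  // Posterior covariance: (Sigma_0^{-1} + Omega' Sigma_y^{-1} Omega)^{-1}
  Eigen::MatrixXd post_cov = (Sigma0.inverse() + gram).inverse();
  // Posterior mean: post_cov * Omega' Sigma_y^{-1} y
  Eigen::VectorXd post_mean = post_cov * (Omega.transpose() * sigma_y_inv.cwiseProduct(y));
  return {post_mean, post_cov};
}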

Gaussian Univariate Regression Leaf Model

This specializes the Gaussian Multivariate Regression Leaf Model for a univariate leaf basis, which allows for several computational speedups (replacing generalized matrix operations with simple summation or sum-product operations). We simplify \(\Omega\) to \(\omega\), a univariate basis for every observation, so that \(\Omega'\Omega = \sum_{i : X_i \in \ell}\omega_i^2\) and \(\Omega'y = \sum_{i : X_i \in \ell}\omega_i r_i\). Similarly, the prior for the leaf parameter becomes univariate normal as in the Gaussian Constant Leaf Model above:

\[ \beta \sim \mathcal{N}\left(0, \tau\right) \]

Allowing for case / variance weights \(w_i\) as above, we derive a reduced log marginal likelihood of

\[ L(y) \propto \frac{1}{2} \log\left(\frac{\sigma^2}{s_{wxx,\ell} \tau + \sigma^2}\right) + \frac{\tau s_{wyx,\ell}^2}{2\sigma^2(s_{wxx,\ell} \tau + \sigma^2)} \]

where

\[ s_{wxx,\ell} = \sum_{i : X_i \in \ell} w_i \omega_i \omega_i \]

\[ s_{wyx,\ell} = \sum_{i : X_i \in \ell} w_i r_i \omega_i \]

and a posterior of

\[ \beta_{\ell} \mid - \sim \mathcal{N}\left(\frac{\tau s_{wyx,\ell}}{s_{wxx,\ell} \tau + \sigma^2}, \frac{\tau \sigma^2}{s_{wxx,\ell} \tau + \sigma^2}\right) \]
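Because the basis is scalar, these quantities reduce to running sums, and the reduced log marginal likelihood mirrors the constant-leaf code shown earlier. A minimal sketch under illustrative names (not the library's interface):

#include <cmath>

// Hedged sketch: scalar sufficient statistics for the univariate regression
// leaf and the leaf's reduced log marginal likelihood from the formula above.
struct UnivariateRegressionSuffStatSketch {
  double sum_xxw = 0.0;  // s_{wxx,l}
  double sum_yxw = 0.0;  // s_{wyx,l}
  void AddObservation(double basis, double residual, double weight) {
    sum_xxw += weight * basis * basis;
    sum_yxw += weight * residual * basis;
  }
};

double UnivariateRegressionLeafLogML(const UnivariateRegressionSuffStatSketch& s, double tau, double global_variance) {
  return -0.5 * std::log(1 + tau * (s.sum_xxw / global_variance))
         + (tau * s.sum_yxw * s.sum_yxw) / (2.0 * global_variance * (tau * s.sum_xxw + global_variance));
}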

Inverse Gamma Leaf Model

Each of the above models is a variation on a theme: a conjugate, partitioned Gaussian leaf model. The inverse gamma leaf model allows for forest-based heteroskedasticity modeling using an inverse gamma prior on the exponentiated leaf parameter, as discussed in Murray (2021). Define a variance function based on an ensemble of \(b\) trees as

\[ \sigma^2(X) = \exp\left(s_1(X) + \dots + s_b(X)\right) \]

where each tree function \(s_j(X)\) is defined as

\[ s_j(X_i) = \sum_{\ell \in L} \mathbb{1}(X_i \in \ell) \lambda_{\ell} \]

We reparameterize \(\lambda_{\ell} = \log(\mu_{\ell})\) and we place an inverse gamma prior on \(\mu_{\ell}\)

\[ \mu_{\ell} \sim \text{IG}\left(a, b\right) \]

As noted in Murray (2021), this model no longer enables the "Bayesian backfitting" simplification of conjugated Gaussian leaf models, in which sampling updates for a given tree only depend on other trees in the ensemble via their imprint on the partial residual \(r_i = y_i - \sum_{k \neq j} \mu_k(X_i)\). However, this model is part of a broader class of models with convenient "blocked MCMC" sampling updates (another important example being multinomial classification).

Under an outcome model

\[ y \sim \mathcal{N}\left(f(X), \sigma_0^2 \sigma^2(X)\right) \]

updates to \(\mu_{\ell}\) for a given tree \(j\) are based on a reduced log marginal likelihood of

\[ L(y) \propto a \log (b) - \log \Gamma (a) + \log \Gamma \left(a + \frac{n_{\ell}}{2}\right) - \left(a + \frac{n_{\ell}}{2}\right) \log \left(b + \frac{s_{\sigma,\ell}}{2\sigma_0^2}\right) \]

where

\[ n_{\ell} = \sum_{i : X_i \in \ell} 1 \]

\[ s_{\sigma,\ell} = \sum_{i : X_i \in \ell} \frac{(y_i - f(X_i))^2}{\prod_{k \neq j} \exp\left(s_k(X_i)\right)} \]

and a posterior of

\[ \mu_{\ell} \mid - \sim \text{IG}\left( a + \frac{n_{\ell}}{2} , b + \frac{s_{\sigma,\ell}}{2\sigma_0^2} \right) \]

Thus, as above, we implement a sufficient statistic class (LogLinearVarianceSuffStat), which tracks \(n_{\ell}\) and \(s_{\sigma,\ell}\).
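A minimal sketch of those fields is shown below; the member names n and weighted_sum_ei match their use in the likelihood code that follows, while the struct layout and the accumulation helper are illustrative assumptions.

// Hedged sketch of the log-linear variance sufficient statistics; n and
// weighted_sum_ei mirror their use below, everything else is illustrative.
struct LogLinearVarianceSuffStatSketch {
  int n = 0;                     // n_l: number of observations in the leaf
  double weighted_sum_ei = 0.0;  // s_{sigma,l}: squared residuals scaled by the other trees' variance contribution
  void AddObservation(double residual, double other_trees_variance) {
    n += 1;
    weighted_sum_ei += (residual * residual) / other_trees_variance;
  }
};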

And these values are used by the LogLinearVarianceLeafModel class in the SplitLogMarginalLikelihood, NoSplitLogMarginalLikelihood, PosteriorParameterShape, and PosteriorParameterScale methods. To give one example, below is the implementation of NoSplitLogMarginalLikelihood:

// Terms that depend only on the prior: a*log(b) - log(Gamma(a))
double prior_terms = a_ * std::log(b_) - boost::math::lgamma(a_);
// Posterior shape: a + n_l / 2
double a_term = a_ + 0.5 * suff_stat.n;
// Posterior scale: b + s_{sigma,l} / (2 * sigma_0^2)
double b_term = b_ + ((0.5 * suff_stat.weighted_sum_ei) / global_variance);
double log_b_term = std::log(b_term);
double lgamma_a_term = boost::math::lgamma(a_term);
// (a + n_l/2) * log(b + s_{sigma,l} / (2 * sigma_0^2))
double resid_term = a_term * log_b_term;
// log Gamma(a + n_l/2) enters positively, the term above negatively
double log_ml = prior_terms + lgamma_a_term - resid_term;
return log_ml;
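Given the posterior shape and scale, sampling a new leaf value amounts to drawing from an inverse gamma distribution. A minimal sketch using the standard gamma-then-invert approach (names and signature are illustrative, not the library's):

#include <random>

// Hedged sketch: draw mu_l ~ IG(a + n_l/2, b + s_{sigma,l}/(2*sigma_0^2)) by
// sampling a Gamma variate with that shape and rate, then inverting it.
double SampleLeafVariance(double a, double b, double n, double weighted_sum_ei,
                          double global_variance, std::mt19937& gen) {
  double shape = a + 0.5 * n;
  double rate = b + 0.5 * weighted_sum_ei / global_variance;
  // std::gamma_distribution is parameterized by shape and *scale*, so pass 1/rate
  std::gamma_distribution<double> gamma_dist(shape, 1.0 / rate);
  return 1.0 / gamma_dist(gen);
}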

Enumeration Type Documentation

◆ ModelType

Leaf models for the forest sampler:

  1. kConstantLeafGaussian: Every leaf node has a zero-centered univariate normal prior and every leaf is constant.
  2. kUnivariateRegressionLeafGaussian: Every leaf node has a zero-centered univariate normal prior and every leaf is a linear model, multiplying the leaf parameter by a (fixed) basis.
  3. kMultivariateRegressionLeafGaussian: Every leaf node has a multivariate normal prior, centered around the zero vector, and every leaf is a linear model, matrix-multiplying the leaf parameters by a (fixed) basis vector.
  4. kLogLinearVariance: Every leaf node has an inverse gamma prior and every leaf is constant.
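As an illustration of the kind of dispatch this enum enables, the hedged sketch below maps each model type to whether it requires a leaf basis; RequiresBasis is a hypothetical helper, not a StochTree function.

// Hedged sketch: the regression leaf models multiply a basis by the leaf
// parameters, while the constant and log-linear variance models do not.
bool RequiresBasis(StochTree::ModelType model_type) {
  switch (model_type) {
    case StochTree::kConstantLeafGaussian:
      return false;
    case StochTree::kUnivariateRegressionLeafGaussian:
      return true;
    case StochTree::kMultivariateRegressionLeafGaussian:
      return true;
    case StochTree::kLogLinearVariance:
      return false;
  }
  return false;
}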