StochTree 0.0.1
Classes / functions for implementing leaf models.
Classes
class StochTree::GaussianConstantSuffStat
    Sufficient statistic and associated operations for the Gaussian homoskedastic constant leaf outcome model.
class StochTree::GaussianConstantLeafModel
    Marginal likelihood and posterior computation for the Gaussian homoskedastic constant leaf outcome model.
class StochTree::GaussianUnivariateRegressionSuffStat
    Sufficient statistic and associated operations for the Gaussian homoskedastic univariate regression leaf outcome model.
class StochTree::GaussianUnivariateRegressionLeafModel
    Marginal likelihood and posterior computation for the Gaussian homoskedastic univariate regression leaf outcome model.
class StochTree::GaussianMultivariateRegressionSuffStat
    Sufficient statistic and associated operations for the Gaussian homoskedastic multivariate regression leaf outcome model.
class StochTree::GaussianMultivariateRegressionLeafModel
    Marginal likelihood and posterior computation for the Gaussian homoskedastic multivariate regression leaf outcome model.
class StochTree::LogLinearVarianceSuffStat
    Sufficient statistic and associated operations for the heteroskedastic log-linear variance model.
class StochTree::LogLinearVarianceLeafModel
    Marginal likelihood and posterior computation for the heteroskedastic log-linear variance model.
Enumerations
enum StochTree::ModelType
    Leaf models for the forest sampler.
Classes / functions for implementing leaf models.
Stochastic tree algorithms are all essentially hierarchical models with an adaptive group structure defined by an ensemble of decision trees. Each novel model is governed by two components: a LeafModel class, which defines the integrated likelihood and posterior conditional on a particular tree structure, and a SuffStat class, which tracks and accumulates the sufficient statistics that a LeafModel needs.
To provide a thorough overview of this interface (and, importantly, how to extend it), we must introduce some mathematical notation. Any forest-based regression model involves an outcome, which we'll call \(y\), and features (or "covariates"), which we'll call \(X\). Our goal is to predict \(y\) as a function of \(X\), which we'll call \(f(X)\).
NOTE: if we have a more complicated, but still additive, model, such as \(y = X\beta + f(X)\), then we can just model \(y - X\beta = f(X)\), treating the residual \(y - X\beta\) as the outcome data, and we are back to the general setting above.
Now, since \(f(X)\) is an additive tree ensemble, we can think of it as the sum of \(b\) separate decision tree functions, where \(b\) is the number of trees in an ensemble, so that
\[ f(X) = f_1(X) + \dots + f_b(X) \]
and each decision tree function \(f_j\) has the property that features \(X\) are used to determine which leaf node an observation falls into, and then the parameters attached to that leaf node are used to compute \(f_j(X)\). The exact mechanics of this process are model-dependent, so now we introduce the "leaf node" models that stochtree supports.
The most standard and common tree ensemble is a sum of "constant leaf" trees, in which a leaf node's parameter uniquely determines the prediction for all observations that fall into that leaf. For example, if leaf 2 for a tree is reached by the conditions that \(X_1 < 0.4 \; \& \; X_2 > 0.6\), then every observation whose first feature is less than 0.4 and whose second feature is greater than 0.6 will receive the same prediction. Mathematically, for an observation \(i\) this looks like
\[ f_j(X_i) = \sum_{\ell \in L} \mathbb{1}(X_i \in \ell) \mu_{\ell} \]
where \(L\) denotes the indices of every leaf node, \(\mu_{\ell}\) is the parameter attached to leaf node \(\ell\), and \(\mathbb{1}(X_i \in \ell)\) indicates whether observation \(X_i\) falls into leaf node \(\ell\).
The way that we make such a model "stochastic" is by attaching to the leaf node parameters \(\mu_{\ell}\) a "prior" distribution. This leaf model corresponds to the "classic" BART model of Chipman et al (2010) as well as its "XBART" extension (He and Hahn (2023)). We assign each leaf node parameter a prior
\[ \mu \sim \mathcal{N}\left(0, \tau\right) \]
Assuming a homoskedastic Gaussian outcome likelihood (i.e. \(y_i \sim \mathcal{N}\left(f(X_i),\sigma^2\right)\)), the log marginal likelihood in this model, for the outcome data in node \(\ell\) of tree \(j\) is given by
\[ L(y) = -\frac{n_{\ell}}{2}\log(2\pi) - n_{\ell}\log(\sigma) + \frac{1}{2} \log\left(\frac{\sigma^2}{n_{\ell} \tau + \sigma^2}\right) - \frac{s_{yy,\ell}}{2\sigma^2} + \frac{\tau s_{y,\ell}^2}{2\sigma^2(n_{\ell} \tau + \sigma^2)} \]
where
\[ n_{\ell} = \sum_{i : X_i \in \ell} 1 \]
\[ s_{y,\ell} = \sum_{i : X_i \in \ell} r_i \]
\[ s_{yy,\ell} = \sum_{i : X_i \in \ell} r_i^2 \]
\[ r_i = y_i - \sum_{k \neq j} \mu_k(X_i) \]
In words, this model depends on the data for a given leaf node only through three sufficient statistics, \(n_{\ell}\), \(s_{y,\ell}\), and \(s_{yy,\ell}\), and it only depends on the other trees in the ensemble through the "partial residual" \(r_i\). The posterior distribution for node \(\ell\)'s leaf parameter is similarly defined as:
\[ \mu_{\ell} \mid - \sim \mathcal{N}\left(\frac{\tau s_{y,\ell}}{n_{\ell} \tau + \sigma^2}, \frac{\tau \sigma^2}{n_{\ell} \tau + \sigma^2}\right) \]
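As a concrete illustration of how this posterior maps to a sampling step, below is a small self-contained sketch that draws a leaf parameter given \(n_{\ell}\), \(s_{y,\ell}\), \(\tau\), and \(\sigma^2\); the function name and signature are hypothetical and not part of the stochtree API:

#include <cmath>
#include <random>

// Illustrative only: draw mu_ell | - ~ N(tau * s_y / (n * tau + sigma^2),
//                                        tau * sigma^2 / (n * tau + sigma^2)),
// i.e. the posterior given above for a leaf's parameter.
double SampleConstantLeafParameter(int n_ell, double s_y_ell, double tau,
                                   double sigma2, std::mt19937& gen) {
  double denom = n_ell * tau + sigma2;
  std::normal_distribution<double> leaf_dist((tau * s_y_ell) / denom,
                                             std::sqrt((tau * sigma2) / denom));
  return leaf_dist(gen);
}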
Now, consider the possibility that each observation carries a unique weight \(w_i\). These could be "case weights" in a survey context or individual-level variances ("heteroskedasticity"). These case weights transform the outcome distribution (and associated likelihood) to
\[ y_i \mid - \sim \mathcal{N}\left(\mu(X_i), \frac{\sigma^2}{w_i}\right) \]
This gives a modified log marginal likelihood of
\[ L(y) = -\frac{n_{\ell}}{2}\log(2\pi) - \frac{1}{2} \sum_{i : X_i \in \ell} \log\left(\frac{\sigma^2}{w_i}\right) + \frac{1}{2} \log\left(\frac{\sigma^2}{s_{w,\ell} \tau + \sigma^2}\right) - \frac{s_{wyy,\ell}}{2\sigma^2} + \frac{\tau s_{wy,\ell}^2}{2\sigma^2(s_{w,\ell} \tau + \sigma^2)} \]
where
\[ s_{w,\ell} = \sum_{i : X_i \in \ell} w_i \]
\[ s_{wy,\ell} = \sum_{i : X_i \in \ell} w_i r_i \]
\[ s_{wyy,\ell} = \sum_{i : X_i \in \ell} w_i r_i^2 \]
Finally, note that when we consider splitting leaf \(\ell\) into new left and right leaves, or pruning two nodes into a single leaf node, we compute the log marginal likelihood of the combined data and the log marginal likelihoods of the left and right leaves and compare these three values.
The terms \(\frac{n_{\ell}}{2}\log(2\pi)\), \(\sum_{i : X_i \in \ell} \log\left(\frac{\sigma^2}{w_i}\right)\), and \(\frac{s_{wyy,\ell}}{2\sigma^2}\) are such that their left and right node values will always sum to the respective value in the combined log marginal likelihood, so they can be ignored when evaluating splits or prunes and thus the reduced log marginal likelihood is
\[ L(y) \propto \frac{1}{2} \log\left(\frac{\sigma^2}{s_{w,\ell} \tau + \sigma^2}\right) + \frac{\tau s_{wy,\ell}^2}{2\sigma^2(s_{w,\ell} \tau + \sigma^2)} \]
So the GaussianConstantSuffStat class tracks a generalized version of these three statistics (which allows for each observation to have a weight \(w_i \neq 1\)):
data_size_t n
double sum_w
double sum_yw
And these values are used by the GaussianConstantLeafModel class in the SplitLogMarginalLikelihood, NoSplitLogMarginalLikelihood, PosteriorParameterMean, and PosteriorParameterVariance methods.
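To give one example, here is a minimal sketch of how SplitLogMarginalLikelihood can be computed from these sufficient statistics, assuming the leaf prior variance \(\tau\) and the global error variance \(\sigma^2\) are passed in; the free-function signatures below are illustrative, not the actual GaussianConstantLeafModel method signatures:

#include <cmath>

// Reduced log marginal likelihood of a single node, per the formula above:
//   0.5 * log(sigma^2 / (s_w * tau + sigma^2)) + tau * s_wy^2 / (2 * sigma^2 * (s_w * tau + sigma^2))
// sum_w and sum_yw correspond to GaussianConstantSuffStat's sum_w and sum_yw members for the node.
double SuffStatLogMarginalLikelihood(double sum_w, double sum_yw,
                                     double tau, double global_variance) {
  double denom = sum_w * tau + global_variance;
  return 0.5 * std::log(global_variance / denom) +
         (tau * sum_yw * sum_yw) / (2.0 * global_variance * denom);
}

// A split's log marginal likelihood is the sum of the left and right nodes' values.
double SplitLogMarginalLikelihood(double left_sum_w, double left_sum_yw,
                                  double right_sum_w, double right_sum_yw,
                                  double tau, double global_variance) {
  return SuffStatLogMarginalLikelihood(left_sum_w, left_sum_yw, tau, global_variance) +
         SuffStatLogMarginalLikelihood(right_sum_w, right_sum_yw, tau, global_variance);
}

Grow (split) and prune proposals compare this quantity against the analogous no-split value evaluated on the combined sufficient statistics, as described above.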
In the Gaussian multivariate regression leaf model, the tree defines a "partitioned linear model" in which leaf node parameters define regression weights that are multiplied by a "basis" \(\Omega\) to determine the prediction for an observation:
\[ f_j(X_i) = \sum_{\ell \in L} \mathbb{1}(X_i \in \ell) \Omega_i \vec{\beta_{\ell}} \]
and we assign \(\beta_{\ell}\) a prior of
\[ \vec{\beta_{\ell}} \sim \mathcal{N}\left(\vec{\beta_0}, \Sigma_0\right) \]
where \(\vec{\beta_0}\) is typically a vector of zeros. The outcome likelihood is still
\[ y_i \sim \mathcal{N}\left(f(X_i), \sigma^2\right) \]
This gives a reduced log integrated likelihood of
\[ L(y) \propto - \frac{1}{2} \log\left(\textrm{det}\left(I_p + \frac{\Sigma_0\Omega'\Omega}{\sigma^2}\right)\right) + \frac{1}{2}\frac{y'\Omega}{\sigma^2}\left(\Sigma_0^{-1} + \frac{\Omega'\Omega}{\sigma^2}\right)^{-1}\frac{\Omega'y}{\sigma^2} \]
where \(\Omega\) is the matrix whose rows are the bases of every observation in leaf \(\ell\) and \(p\) is the number of columns of \(\Omega\) (the dimension of the leaf basis). The posterior for \(\vec{\beta_{\ell}}\) is
\[ \vec{\beta_{\ell}} \sim \mathcal{N}\left(\left(\Sigma_0^{-1} + \frac{\Omega'\Omega}{\sigma^2}\right)^{-1}\left(\frac{\Omega'y}{\sigma^2}\right),\left(\Sigma_0^{-1} + \frac{\Omega'\Omega}{\sigma^2}\right)^{-1}\right) \]
This is an extension of the single-tree model of Chipman et al (2002).
We can also enable heteroskedasticity by defining a (diagonal) covariance matrix for the outcome likelihood
\[ \Sigma_y = \text{diag}\left(\sigma^2 / w_1,\sigma^2 / w_2,\dots,\sigma^2 / w_n\right) \]
This updates the reduced log integrated likelihood to
\[ L(y) \propto - \frac{1}{2} \log\left(\textrm{det}\left(I_p + \Sigma_{0}\Omega'\Sigma_y^{-1}\Omega\right)\right) + \frac{1}{2}y'\Sigma_{y}^{-1}\Omega\left(\Sigma_{0}^{-1} + \Omega'\Sigma_{y}^{-1}\Omega\right)^{-1}\Omega'\Sigma_{y}^{-1}y \]
and a posterior for \(\vec{\beta_{\ell}}\) of
\[ \vec{\beta_{\ell}} \sim \mathcal{N}\left(\left(\Sigma_{0}^{-1} + \Omega'\Sigma_{y}^{-1}\Omega\right)^{-1}\left(\Omega'\Sigma_{y}^{-1}y\right),\left(\Sigma_{0}^{-1} + \Omega'\Sigma_{y}^{-1}\Omega\right)^{-1}\right) \]
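To make the matrix algebra concrete, below is a small sketch (assuming Eigen for the linear algebra and \(\vec{\beta_0} = 0\), as in the formulas above) that forms the posterior mean and covariance of \(\vec{\beta_{\ell}}\) from a leaf's basis matrix, partial residuals, weights, and prior covariance; the struct and function names are hypothetical, not the GaussianMultivariateRegressionLeafModel API:

#include <Eigen/Dense>

// Illustrative only: posterior moments of the leaf regression coefficients,
//   beta | - ~ N( (Sigma0^{-1} + Omega' Sigma_y^{-1} Omega)^{-1} (Omega' Sigma_y^{-1} y),
//                 (Sigma0^{-1} + Omega' Sigma_y^{-1} Omega)^{-1} )
// with Sigma_y = diag(sigma^2 / w_i), assuming a zero prior mean.
struct LeafPosterior {
  Eigen::VectorXd mean;
  Eigen::MatrixXd covariance;
};

LeafPosterior MultivariateLeafPosterior(const Eigen::MatrixXd& omega,   // n_leaf x p basis matrix
                                        const Eigen::VectorXd& resid,   // partial residuals in the leaf
                                        const Eigen::VectorXd& weights, // case / variance weights w_i
                                        const Eigen::MatrixXd& sigma0,  // p x p prior covariance
                                        double sigma2) {
  // Sigma_y^{-1} * Omega, exploiting the diagonal structure of Sigma_y
  Eigen::MatrixXd weighted_omega = (weights.asDiagonal() * omega) / sigma2;
  Eigen::MatrixXd precision = sigma0.inverse() + omega.transpose() * weighted_omega;
  Eigen::MatrixXd covariance = precision.inverse();
  Eigen::VectorXd mean = covariance * (weighted_omega.transpose() * resid);
  return {mean, covariance};
}

In practice one would factor the precision matrix (e.g. via an LDLT decomposition) rather than forming explicit inverses; the direct inverses here simply mirror the formulas above.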
The Gaussian univariate regression leaf model specializes the Gaussian multivariate regression leaf model to a univariate leaf basis, which allows for several computational speedups (replacing generalized matrix operations with simple summation or sum-product operations). We simplify \(\Omega\) to \(\omega\), a univariate basis for every observation, so that \(\Omega'\Omega = \sum_{i : X_i \in \ell}\omega_i^2\) and \(\Omega'y = \sum_{i : X_i \in \ell}\omega_i r_i\). Similarly, the prior for the leaf parameter becomes univariate normal, as in the Gaussian constant leaf model above:
\[ \beta \sim \mathcal{N}\left(0, \tau\right) \]
Allowing for case / variance weights \(w_i\) as above, we derive a reduced log marginal likelihood of
\[ L(y) \propto \frac{1}{2} \log\left(\frac{\sigma^2}{s_{wxx,\ell} \tau + \sigma^2}\right) + \frac{\tau s_{wyx,\ell}^2}{2\sigma^2(s_{wxx,\ell} \tau + \sigma^2)} \]
where
\[ s_{wxx,\ell} = \sum_{i : X_i \in \ell} w_i \omega_i^2 \]
\[ s_{wyx,\ell} = \sum_{i : X_i \in \ell} w_i r_i \omega_i \]
and a posterior of
\[ \beta_{\ell} \mid - \sim \mathcal{N}\left(\frac{\tau s_{wyx,\ell}}{s_{wxx,\ell} \tau + \sigma^2}, \frac{\tau \sigma^2}{s_{wxx,\ell} \tau + \sigma^2}\right) \]
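The univariate specialization reduces the matrix operations above to scalar accumulations. The sketch below accumulates \(s_{wxx,\ell}\) and \(s_{wyx,\ell}\) over a leaf's observations and forms the posterior mean and variance of \(\beta_{\ell}\); the names and signature are hypothetical, not the GaussianUnivariateRegressionSuffStat / GaussianUnivariateRegressionLeafModel API:

#include <cstddef>
#include <vector>

// Illustrative only: accumulate s_wxx = sum_i w_i * omega_i^2 and s_wyx = sum_i w_i * r_i * omega_i
// over the observations in a leaf, then form the posterior moments of beta_ell as given above.
struct UnivariateLeafPosterior {
  double mean;
  double variance;
};

UnivariateLeafPosterior UnivariateRegressionLeafPosterior(
    const std::vector<double>& basis,    // omega_i for observations in the leaf
    const std::vector<double>& resid,    // partial residuals r_i
    const std::vector<double>& weights,  // case / variance weights w_i
    double tau, double sigma2) {
  double s_wxx = 0.0, s_wyx = 0.0;
  for (std::size_t i = 0; i < basis.size(); ++i) {
    s_wxx += weights[i] * basis[i] * basis[i];
    s_wyx += weights[i] * resid[i] * basis[i];
  }
  double denom = s_wxx * tau + sigma2;
  return {tau * s_wyx / denom, tau * sigma2 / denom};
}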
Each of the above models is a variation on a theme: a conjugate, partitioned Gaussian leaf model. The log-linear variance leaf model, by contrast, allows for forest-based heteroskedasticity modeling using an inverse gamma prior on the exponentiated leaf parameter, as discussed in Murray (2021). Define a variance function based on an ensemble of \(b\) trees as
\[ \sigma^2(X) = \exp\left(s_1(X) + \dots + s_b(X)\right) \]
where each tree function \(s_j(X)\) is defined as
\[ s_j(X_i) = \sum_{\ell \in L} \mathbb{1}(X_i \in \ell) \lambda_{\ell} \]
We reparameterize \(\lambda_{\ell} = \log(\mu_{\ell})\) and we place an inverse gamma prior on \(\mu_{\ell}\)
\[ \mu_{\ell} \sim \text{IG}\left(a, b\right) \]
As noted in Murray (2021), this model no longer enables the "Bayesian backfitting" simplification of conjugate Gaussian leaf models, in which sampling updates for a given tree depend on the other trees in the ensemble only through their imprint on the partial residual \(r_i = y_i - \sum_{k \neq j} \mu_k(X_i)\). However, this model is part of a broader class of models with convenient "blocked MCMC" sampling updates (another important example being multinomial classification).
Under an outcome model
\[ y \sim \mathcal{N}\left(f(X), \sigma_0^2 \sigma^2(X)\right) \]
updates to \(\mu_{\ell}\) for a given tree \(j\) are based on a reduced log marginal likelihood of
\[ L(y) \propto a \log (b) - \log \Gamma (a) + \log \Gamma \left(a + \frac{n_{\ell}}{2}\right) - \left(a + \frac{n_{\ell}}{2}\right) \log\left(b + \frac{s_{\sigma,\ell}}{2\sigma_0^2}\right) \]
where
\[ n_{\ell} = \sum_{i : X_i \in \ell} 1 \]
\[ s_{\sigma,\ell} = \sum_{i : X_i \in \ell} \frac{(y_i - f(X_i))^2}{\prod_{k \neq j} \exp\left(s_k(X_i)\right)} \]
and a posterior of
\[ \mu_{\ell} \mid - \sim \text{IG}\left( a + \frac{n_{\ell}}{2} , b + \frac{s_{\sigma,\ell}}{2\sigma_0^2} \right) \]
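A draw from this inverse gamma posterior can be obtained by drawing a gamma variate with the corresponding shape and rate and inverting it; the following self-contained sketch is illustrative (the function name and signature are not the stochtree API):

#include <random>

// Illustrative only: sample mu_ell | - ~ IG(a + n/2, b + s_sigma / (2 * sigma0^2))
// by drawing X ~ Gamma(shape = a + n/2, rate = b + s_sigma / (2 * sigma0^2)) and returning 1/X.
// std::gamma_distribution is parameterized by (shape, scale), so scale = 1/rate.
double SampleLogLinearVarianceLeafParameter(int n_ell, double s_sigma_ell,
                                            double a, double b, double sigma0_sq,
                                            std::mt19937& gen) {
  double shape = a + 0.5 * n_ell;
  double rate = b + s_sigma_ell / (2.0 * sigma0_sq);
  std::gamma_distribution<double> gamma_dist(shape, 1.0 / rate);
  return 1.0 / gamma_dist(gen);
}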
Thus, as above, we implement a sufficient statistic class (LogLinearVarianceSuffStat), which tracks
data_size_t n
double weighted_sum_ei
And these values are used by the LogLinearVarianceLeafModel class in the SplitLogMarginalLikelihood, NoSplitLogMarginalLikelihood, PosteriorParameterShape, and PosteriorParameterScale methods.
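To give one example, here is a minimal sketch of how NoSplitLogMarginalLikelihood can be computed from these sufficient statistics, assuming weighted_sum_ei accumulates \(s_{\sigma,\ell}\) for the node and that the prior parameters \(a\) and \(b\) and the global variance scale \(\sigma_0^2\) are passed in; the free-function signature is illustrative, not the actual LogLinearVarianceLeafModel method signature:

#include <cmath>

// Reduced log marginal likelihood of a single node, per the formula above:
//   a*log(b) - lgamma(a) + lgamma(a + n/2) - (a + n/2) * log(b + s_sigma / (2 * sigma0^2))
// n and weighted_sum_ei correspond to LogLinearVarianceSuffStat's members for the node.
double NoSplitLogMarginalLikelihood(double n, double weighted_sum_ei,
                                    double a, double b, double global_variance) {
  double posterior_shape = a + 0.5 * n;
  double posterior_scale = b + weighted_sum_ei / (2.0 * global_variance);
  return a * std::log(b) - std::lgamma(a) + std::lgamma(posterior_shape) -
         posterior_shape * std::log(posterior_scale);
}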
enum StochTree::ModelType

Leaf models for the forest sampler:

kConstantLeafGaussian: Every leaf node has a zero-centered univariate normal prior and every leaf is constant.
kUnivariateRegressionLeafGaussian: Every leaf node has a zero-centered univariate normal prior and every leaf is a linear model, multiplying the leaf parameter by a (fixed) basis.
kMultivariateRegressionLeafGaussian: Every leaf node has a multivariate normal prior, centered around the zero vector, and every leaf is a linear model, matrix-multiplying the leaf parameters by a (fixed) basis vector.
kLogLinearVariance: Every leaf node has an inverse gamma prior and every leaf is constant.