BCF #

stochtree.bcf.BCFModel #

Class that handles sampling, storage, and serialization of stochastic forest models for causal effect estimation. The class takes its name from Bayesian Causal Forests, an MCMC sampler originally developed in Hahn, Murray, Carvalho (2020), but supports several sampling algorithms:

  • MCMC: The "classic" sampler defined in Hahn, Murray, Carvalho (2020). In order to run the MCMC sampler, set num_gfr = 0 (explained below) and then define a sampler according to several parameters:
    • num_burnin: the number of iterations to run before "retaining" samples for further analysis. These "burned in" samples are helpful for allowing a sampler to converge before retaining samples.
    • num_chains: the number of independent sequences of MCMC samples to generate (typically referred to in the literature as "chains")
    • num_mcmc: the number of "retained" samples of the posterior distribution
    • keep_every: after a sampler has "burned in", we run it for keep_every * num_mcmc further iterations, retaining one out of every keep_every iterations in each chain.
  • GFR (Grow-From-Root): A fast, greedy approximation of the BART MCMC sampling algorithm introduced in Krantsevich, He, and Hahn (2023). GFR sampler iterations are governed by the num_gfr parameter, and there are two primary ways to use this sampler:
    • Standalone: setting num_gfr > 0 and both num_burnin = 0 and num_mcmc = 0 will only run and retain GFR samples of the posterior. This is typically referred to as "XBART" (accelerated BART).
    • Initializer for MCMC: setting num_gfr > 0 and num_mcmc > 0 will use ensembles from the GFR algorithm to initialize num_chains independent MCMC BCF samplers, which are then run for num_mcmc iterations. This is typically referred to as "warm-start" sampling. Each of these configurations is sketched below.
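
The three sampler configurations map directly onto the num_gfr, num_burnin, num_mcmc, and general_params arguments of sample. Below is a minimal, illustrative sketch (not a definitive recipe) that assumes BCFModel is imported from the module path shown in the header and uses placeholder simulated data.

```python
import numpy as np
from stochtree.bcf import BCFModel

# Placeholder data purely for illustration.
rng = np.random.default_rng(0)
n, p = 500, 5
X = rng.uniform(size=(n, p))
Z = rng.binomial(1, 0.5, size=n)
y = X[:, 0] + Z * X[:, 1] + rng.normal(size=n)

# 1. "Classic" MCMC: no GFR iterations; burn in, then retain samples.
bcf_mcmc = BCFModel()
bcf_mcmc.sample(
    X_train=X, Z_train=Z, y_train=y,
    num_gfr=0, num_burnin=1000, num_mcmc=1000,
    general_params={"num_chains": 4, "keep_every": 5},  # 4 chains, thinned
)

# 2. Standalone GFR ("XBART"-style): retain only grow-from-root samples.
bcf_gfr = BCFModel()
bcf_gfr.sample(X_train=X, Z_train=Z, y_train=y,
               num_gfr=20, num_burnin=0, num_mcmc=0)

# 3. Warm start: GFR ensembles initialize independent MCMC chains
#    (requires num_gfr >= num_chains).
bcf_warm = BCFModel()
bcf_warm.sample(X_train=X, Z_train=Z, y_train=y,
                num_gfr=4, num_burnin=0, num_mcmc=1000,
                general_params={"num_chains": 4})
```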

In addition to enabling multiple samplers, we support a broad set of models. First, note that the original BCF model of Hahn, Murray, Carvalho (2020) is

\[\begin{equation*} \begin{aligned} y &= a(X) + b_z(X) + \epsilon\\ b_z(X) &= (b_1 Z + b_0 (1-Z)) t(X)\\ b_0, b_1 &\sim N\left(0, \frac{1}{2}\right)\\ a(X) &\sim \text{BART}()\\ t(X) &\sim \text{BART}()\\ \epsilon &\sim N(0, \sigma^2)\\ \sigma^2 &\sim IG(a, b) \end{aligned} \end{equation*}\]

for continuous outcome \(y\), binary treatment \(Z\), and covariates \(X\).

In words, there are two nonparametric mean functions, a "prognostic" function and a "treatment effect" function, each governed by a tree ensemble with a BART prior, plus an additive mean-zero Gaussian error term whose variance is given an inverse gamma prior.

The BCFModel class supports the following extensions of this model:

  • Continuous Treatment: If \(Z\) is continuous rather than binary, we define \(b_z(X) = \tau(X, Z) = Z \tau(X)\), where the "leaf model" for the \(\tau\) forest is essentially a regression on continuous \(Z\).
  • Heteroskedasticity: Rather than define \(\epsilon\) parametrically, we can let a forest \(\sigma^2(X)\) model a conditional error variance function. This can be done by setting num_trees > 0 in the variance_forest_params dictionary passed to the sample method (illustrated in the sketch below).
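
The sketch below illustrates both extensions on simulated data: a continuous treatment is passed directly as Z_train, and a heteroskedastic error model is requested by setting num_trees > 0 in variance_forest_params. The data-generating process here is an arbitrary illustration, not one taken from the package.

```python
import numpy as np
from stochtree.bcf import BCFModel

rng = np.random.default_rng(2024)
n, p = 500, 5
X = rng.uniform(size=(n, p))
Z = rng.normal(size=n)                           # continuous treatment
tau = 1.0 + X[:, 1]                              # heterogeneous treatment effect
y = X[:, 0] + Z * tau + rng.normal(scale=0.5 + X[:, 2], size=n)

bcf = BCFModel()
bcf.sample(
    X_train=X, Z_train=Z, y_train=y,
    num_gfr=10, num_mcmc=100,
    variance_forest_params={"num_trees": 50},    # enables the variance forest
)
```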

sample(X_train, Z_train, y_train, pi_train=None, rfx_group_ids_train=None, rfx_basis_train=None, X_test=None, Z_test=None, pi_test=None, rfx_group_ids_test=None, rfx_basis_test=None, num_gfr=5, num_burnin=0, num_mcmc=100, general_params=None, prognostic_forest_params=None, treatment_effect_forest_params=None, variance_forest_params=None) #

Runs a BCF sampler on the provided training set. Outcome predictions and estimates of the prognostic and treatment effect functions will be cached for the training set and (if provided) the test set.

Parameters:

Name Type Description Default
X_train array or DataFrame

Covariates used to split trees in the ensemble. Can be passed as either a matrix or dataframe.

required
Z_train array

Array of (continuous or binary; univariate or multivariate) treatment assignments.

required
y_train array

Outcome to be modeled by the ensemble.

required
pi_train array

Optional vector of propensity scores. If not provided, this will be estimated from the data.

None
rfx_group_ids_train array

Optional group labels used for an additive random effects model.

None
rfx_basis_train array

Optional basis for "random-slope" regression in an additive random effects model.

None
X_test array

Optional test set of covariates used to define "out of sample" evaluation data.

None
Z_test array

Optional test set of (continuous or binary) treatment assignments. Must be provided if X_test is provided.

None
pi_test array

Optional test set vector of propensity scores. If not provided (but X_test and Z_test are), this will be estimated from the data.

None
rfx_group_ids_test array

Optional test set group labels used for an additive random effects model. We do not currently support test set evaluation for group labels that were not in the training set (though we plan to in the near future).

None
rfx_basis_test array

Optional test set basis for "random-slope" regression in additive random effects model.

None
num_gfr int

Number of "warm-start" iterations run using the grow-from-root algorithm (He and Hahn, 2021). Defaults to 5.

5
num_burnin int

Number of "burn-in" iterations of the MCMC sampler. Defaults to 0. Ignored if num_gfr > 0.

0
num_mcmc int

Number of "retained" iterations of the MCMC sampler. Defaults to 100. If this is set to 0, GFR (XBART) samples will be retained.

100
general_params dict

Dictionary of general model parameters, each of which has a default value processed internally, so this argument is optional.

  • cutpoint_grid_size (int): Maximum number of cutpoints to consider for each feature. Defaults to 100.
  • standardize (bool): Whether or not to standardize the outcome (and store the offset / scale in the model object). Defaults to True.
  • sample_sigma2_global (bool): Whether or not to update the sigma^2 global error variance parameter based on IG(sigma2_global_shape, sigma2_global_scale). Defaults to True.
  • sigma2_global_init (float): Starting value of global variance parameter. Set internally to the outcome variance (standardized if standardize = True) if not set here.
  • sigma2_global_shape (float): Shape parameter in the IG(sigma2_global_shape, sigma2_global_scale) global error variance model. Defaults to 0.
  • sigma2_global_scale (float): Scale parameter in the IG(sigma2_global_shape, sigma2_global_scale) global error variance model. Defaults to 0.
  • variable_weights (np.array): Numeric weights reflecting the relative probability of splitting on each variable in each of the forests. Does not need to sum to 1 but cannot be negative. Defaults to np.repeat(1/X_train.shape[1], X_train.shape[1]) if not set here. Note that if the propensity score is included as a covariate in either forest, its weight will default to 1/X_train.shape[1]. A workaround if you wish to provide a custom weight for the propensity score is to include it as a column in X_train and then set propensity_covariate to 'none' and adjust keep_vars accordingly for the mu or tau forests.
  • propensity_covariate (str): Whether to include the propensity score as a covariate in either or both of the forests. Enter "none" for neither, "mu" for the prognostic forest, "tau" for the treatment forest, and "both" for both forests. If this is not "none" and a propensity score is not provided, it will be estimated from (X_train, Z_train) using BARTModel. Defaults to "mu".
  • adaptive_coding (bool): Whether or not to use an "adaptive coding" scheme in which a binary treatment variable is not coded manually as (0,1) or (-1,1) but learned via parameters b_0 and b_1 that attach to the outcome model [b_0 (1-Z) + b_1 Z] tau(X). This is ignored when Z is not binary. Defaults to True.
  • control_coding_init (float): Initial value of the "control" group coding parameter. This is ignored when Z is not binary. Default: -0.5.
  • treated_coding_init (float): Initial value of the "treated" group coding parameter. This is ignored when Z is not binary. Default: 0.5.
  • random_seed (int): Integer parameterizing the C++ random number generator. If not specified, the C++ random number generator is seeded according to std::random_device.
  • keep_burnin (bool): Whether or not "burnin" samples should be included in predictions. Defaults to False. Ignored if num_mcmc == 0.
  • keep_gfr (bool): Whether or not "warm-start" / grow-from-root samples should be included in predictions. Defaults to False. Ignored if num_mcmc == 0.
  • keep_every (int): How many iterations of the burned-in MCMC sampler should be run before forests and parameters are retained. Defaults to 1. Setting keep_every = k for some k > 1 will "thin" the MCMC samples by retaining every k-th sample, rather than simply every sample. This can reduce the autocorrelation of the MCMC samples.
  • num_chains (int): How many independent MCMC chains should be sampled. If num_mcmc = 0, this is ignored. If num_gfr = 0, then each chain is run from root for num_mcmc * keep_every + num_burnin iterations, with num_mcmc samples retained. If num_gfr > 0, each MCMC chain will be initialized from a separate GFR ensemble, with the requirement that num_gfr >= num_chains. Defaults to 1.
None
prognostic_forest_params dict

Dictionary of prognostic forest model parameters, each of which has a default value processed internally, so this argument is optional.

  • num_trees (int): Number of trees in the prognostic forest. Defaults to 250. Must be a positive integer.
  • alpha (float): Prior probability of splitting for a tree of depth 0 in the prognostic forest. Tree split prior combines alpha and beta via alpha*(1+node_depth)^-beta. Defaults to 0.95.
  • beta (float): Exponent that decreases split probabilities for nodes of depth > 0 in the prognostic forest. Tree split prior combines alpha and beta via alpha*(1+node_depth)^-beta. Defaults to 2.
  • min_samples_leaf (int): Minimum allowable size of a leaf, in terms of training samples, in the prognostic forest. Defaults to 5.
  • max_depth (int): Maximum depth of any tree in the prognostic forest. Defaults to 10. Can be overridden by setting it to -1, which removes the depth limit.
  • variable_weights (np.array): Numeric weights reflecting the relative probability of splitting on each variable in the prognostic forest. Does not need to sum to 1 but cannot be negative. Defaults to uniform over the columns of X_train if not provided.
  • sample_sigma2_leaf (bool): Whether or not to update the leaf scale variance parameter of the prognostic forest based on IG(sigma2_leaf_shape, sigma2_leaf_scale). Cannot (currently) be set to True if the forest's leaf basis has more than one column. Defaults to False.
  • sigma2_leaf_init (float): Starting value of leaf node scale parameter. Calibrated internally as 1/num_trees if not set here.
  • sigma2_leaf_shape (float): Shape parameter in the IG(sigma2_leaf_shape, sigma2_leaf_scale) leaf node parameter variance model. Defaults to 3.
  • sigma2_leaf_scale (float): Scale parameter in the IG(sigma2_leaf_shape, sigma2_leaf_scale) leaf node parameter variance model. Calibrated internally as 0.5/num_trees if not set here.
  • keep_vars (list or np.array): Vector of variable names or column indices denoting variables that should be included in the prognostic (mu(X)) forest. Defaults to None.
  • drop_vars (list or np.array): Vector of variable names or column indices denoting variables that should be excluded from the prognostic (mu(X)) forest. Defaults to None. If both drop_vars and keep_vars are set, drop_vars will be ignored.
None
treatment_effect_forest_params dict

Dictionary of treatment effect forest model parameters, each of which has a default value processed internally, so this argument is optional.

  • num_trees (int): Number of trees in the treatment effect forest. Defaults to 50. Must be a positive integer.
  • alpha (float): Prior probability of splitting for a tree of depth 0 in the treatment effect forest. Tree split prior combines alpha and beta via alpha*(1+node_depth)^-beta. Defaults to 0.25.
  • beta (float): Exponent that decreases split probabilities for nodes of depth > 0 in the treatment effect forest. Tree split prior combines alpha and beta via alpha*(1+node_depth)^-beta. Defaults to 3.
  • min_samples_leaf (int): Minimum allowable size of a leaf, in terms of training samples, in the treatment effect forest. Defaults to 5.
  • max_depth (int): Maximum depth of any tree in the treatment effect forest. Defaults to 5. Can be overridden by setting it to -1, which removes the depth limit.
  • sample_sigma2_leaf (bool): Whether or not to update the tau leaf scale variance parameter based on IG(sigma2_leaf_shape, sigma2_leaf_scale). Cannot (currently) be set to True if the treatment (Z_train) has more than one column. Defaults to False.
  • sigma2_leaf_init (float): Starting value of leaf node scale parameter. Calibrated internally as 1/num_trees if not set here.
  • sigma2_leaf_shape (float): Shape parameter in the IG(sigma2_leaf_shape, sigma2_leaf_scale) leaf node parameter variance model. Defaults to 3.
  • sigma2_leaf_scale (float): Scale parameter in the IG(sigma2_leaf_shape, sigma2_leaf_scale) leaf node parameter variance model. Calibrated internally as 0.5/num_trees if not set here.
  • keep_vars (list or np.array): Vector of variable names or column indices denoting variables that should be included in the treatment effect (tau(X)) forest. Defaults to None.
  • drop_vars (list or np.array): Vector of variable names or column indices denoting variables that should be excluded from the treatment effect (tau(X)) forest. Defaults to None. If both drop_vars and keep_vars are set, drop_vars will be ignored.
None
variance_forest_params dict

Dictionary of variance forest model parameters, each of which has a default value processed internally, so this argument is optional.

  • num_trees (int): Number of trees in the conditional variance model. Defaults to 0. Variance is only modeled using a tree / forest if num_trees > 0.
  • alpha (float): Prior probability of splitting for a tree of depth 0 in the conditional variance model. Tree split prior combines alpha and beta via alpha*(1+node_depth)^-beta. Defaults to 0.95.
  • beta (float): Exponent that decreases split probabilities for nodes of depth > 0 in the conditional variance model. Tree split prior combines alpha and beta via alpha*(1+node_depth)^-beta. Defaults to 2.
  • min_samples_leaf (int): Minimum allowable size of a leaf, in terms of training samples, in the conditional variance model. Defaults to 5.
  • max_depth (int): Maximum depth of any tree in the conditional variance model. Defaults to 10. Can be overridden by setting it to -1, which removes the depth limit.
  • leaf_prior_calibration_param (float): Hyperparameter used to calibrate the [optional] IG(var_forest_prior_shape, var_forest_prior_scale) conditional error variance model. If var_forest_prior_shape and var_forest_prior_scale are not set below, this calibration parameter is used to set these values to num_trees / leaf_prior_calibration_param^2 + 0.5 and num_trees / leaf_prior_calibration_param^2, respectively. Defaults to 1.5.
  • var_forest_leaf_init (float): Starting value of root forest prediction in conditional (heteroskedastic) error variance model. Calibrated internally as np.log(0.6*np.var(y_train))/num_trees, where y_train is the possibly standardized outcome and num_trees is the number of trees in the variance forest, if not set.
  • var_forest_prior_shape (float): Shape parameter in the [optional] IG(var_forest_prior_shape, var_forest_prior_scale) conditional error variance forest (which is only sampled if num_trees > 0). Calibrated internally as num_trees / 1.5^2 + 0.5 if not set here.
  • var_forest_prior_scale (float): Scale parameter in the [optional] IG(var_forest_prior_shape, var_forest_prior_scale) conditional error variance forest (which is only sampled if num_trees > 0). Calibrated internally as num_trees / 1.5^2 if not set here.
  • keep_vars (list or np.array): Vector of variable names or column indices denoting variables that should be included in the variance forest. Defaults to None.
  • drop_vars (list or np.array): Vector of variable names or column indices denoting variables that should be excluded from the variance forest. Defaults to None. If both drop_vars and keep_vars are set, drop_vars will be ignored.
None

Returns:

Name Type Description
self BCFModel

Sampled BCF Model.
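
As a usage sketch (with simulated, illustrative data), the call below exercises the optional test set, a user-supplied propensity score, and each of the parameter dictionaries documented above; every setting shown corresponds to an entry in those dictionaries.

```python
import numpy as np
from stochtree.bcf import BCFModel

rng = np.random.default_rng(101)
n, p = 1000, 10
X = rng.uniform(size=(n, p))
pi = 0.25 + 0.5 * X[:, 0]                            # known propensity score
Z = rng.binomial(1, pi)
y = X[:, 1] + Z * (1.0 + X[:, 2]) + rng.normal(size=n)
train, test = np.arange(800), np.arange(800, n)

bcf = BCFModel()
bcf.sample(
    X_train=X[train], Z_train=Z[train], y_train=y[train], pi_train=pi[train],
    X_test=X[test], Z_test=Z[test], pi_test=pi[test],
    num_gfr=10, num_burnin=0, num_mcmc=500,
    general_params={"random_seed": 101, "adaptive_coding": True},
    prognostic_forest_params={"num_trees": 250, "min_samples_leaf": 5},
    treatment_effect_forest_params={"num_trees": 50, "max_depth": 5},
)
```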

predict_tau(X, Z, propensity=None) #

Predict CATE function for every provided observation.

Parameters:

Name Type Description Default
X array or DataFrame

Test set covariates.

required
Z array

Test set treatment indicators.

required
propensity array

Optional test set propensities. Must be provided if propensities were provided when the model was sampled.

None

Returns:

Type Description
array

Array with as many rows as in X and as many columns as retained samples of the algorithm.
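
Continuing the illustrative fit sketched under sample above: rows of the returned array index observations and columns index retained posterior samples, so CATE summaries are computed across columns.

```python
# Posterior summaries of the CATE on the held-out data from the sketch above.
tau_samples = bcf.predict_tau(X[test], Z[test], propensity=pi[test])
tau_mean = tau_samples.mean(axis=1)                                 # posterior mean CATE
tau_lo, tau_hi = np.quantile(tau_samples, [0.025, 0.975], axis=1)   # 95% credible interval
```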

predict_variance(covariates, propensity=None) #

Predict expected conditional variance from a BART model.

Parameters:

Name Type Description Default
covariates array

Test set covariates.

required
propensity array

Test set propensity scores. Optional (not currently used in variance forests).

None

Returns:

Type Description
array

Array of predictions corresponding to the variance forest, with as many rows as in covariates and as many columns as retained samples of the algorithm.
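
A short hypothetical sketch: bcf_het and X_new are placeholders for a model that was sampled with num_trees > 0 in variance_forest_params (otherwise there is no variance forest to predict from) and for new covariate data.

```python
# Posterior mean of the conditional error variance for each observation.
sigma2_samples = bcf_het.predict_variance(X_new)
sigma2_mean = sigma2_samples.mean(axis=1)
```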

predict(X, Z, propensity=None, rfx_group_ids=None, rfx_basis=None) #

Predict outcome model components (CATE function and prognostic function) as well as overall outcome for every provided observation. Predicted outcomes are computed as yhat = mu_x + Z*tau_x where mu_x is a sample of the prognostic function and tau_x is a sample of the treatment effect (CATE) function.

Parameters:

Name Type Description Default
X array or DataFrame

Test set covariates.

required
Z array

Test set treatment indicators.

required
propensity array

Optional test set propensities. Must be provided if propensities were provided when the model was sampled.

None
rfx_group_ids array

Optional group labels used for an additive random effects model.

None
rfx_basis array

Optional basis for "random-slope" regression in an additive random effects model.

None

Returns:

Name Type Description
tau_x array

Conditional average treatment effect (CATE) samples for every observation provided.

mu_x array

Prognostic effect samples for every observation provided.

rfx (array, optional)

Random effect samples for every observation provided, if the model includes a random effects term.

yhat_x array

Outcome prediction samples for every observation provided.

sigma2_x (array, optional)

Variance forest samples for every observation provided. Only returned if the model includes a heteroskedasticity forest.
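
For a model fit without random effects or a variance forest, only the three non-optional arrays above are returned, which suggests unpacking along the lines of the sketch below (continuing the illustrative fit sketched under sample). The final line simply re-derives the documented composition yhat = mu_x + Z * tau_x for that binary, one-dimensional treatment.

```python
tau_x, mu_x, yhat_x = bcf.predict(X[test], Z[test], propensity=pi[test])

# Re-derive the documented composition yhat = mu_x + Z * tau_x,
# broadcasting the binary treatment over the sample dimension.
yhat_check = mu_x + Z[test].reshape(-1, 1) * tau_x
```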

to_json() #

Converts a sampled BCF model to a JSON string representation (which can then be saved to a file or processed using the json library)

Returns:

Type Description
str

JSON string representing model metadata (hyperparameters), sampled parameters, and sampled forests

from_json(json_string) #

Converts a JSON string to an in-memory BCF model.

Parameters:

Name Type Description Default
json_string str

JSON string representing model metadata (hyperparameters), sampled parameters, and sampled forests

required
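
A round-trip sketch combining to_json and from_json, continuing the illustrative model object from earlier: serialize a sampled model, write it to disk, and later rebuild it in a fresh object (the file name is illustrative).

```python
model_string = bcf.to_json()
with open("bcf_model.json", "w") as f:       # illustrative file name
    f.write(model_string)

restored = BCFModel()
with open("bcf_model.json", "r") as f:
    restored.from_json(f.read())
assert restored.is_sampled()
```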

from_json_string_list(json_string_list) #

Convert a list of (in-memory) JSON strings that represent BCF models to a single combined BCF model object which can be used for prediction, etc.

Parameters:

Name Type Description Default
json_string_list list of str

List of JSON strings, each of which can be parsed into an object of type JSONSerializer containing the JSON representation of a BCF model

required
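
A sketch of the intended workflow, assuming several BCF models (for example, one per parallel worker; bcf_chain_1, bcf_chain_2, and bcf_chain_3 are placeholders) were sampled on the same training data and serialized with to_json; their strings are combined into a single model object that can then be used for prediction.

```python
# Combine independently sampled BCF models into one object for prediction.
json_list = [model.to_json() for model in (bcf_chain_1, bcf_chain_2, bcf_chain_3)]

combined = BCFModel()
combined.from_json_string_list(json_list)
tau_all_chains = combined.predict_tau(X_new, Z_new, propensity=pi_new)
```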

is_sampled() #

Whether or not a BCF model has been sampled.

Returns:

Type Description
bool

True if a BCF model has been sampled, False otherwise