Run the Bayesian Causal Forest (BCF) algorithm for regularized causal effect estimation.

Usage

bcf(
  X_train,
  Z_train,
  y_train,
  propensity_train = NULL,
  rfx_group_ids_train = NULL,
  rfx_basis_train = NULL,
  X_test = NULL,
  Z_test = NULL,
  propensity_test = NULL,
  rfx_group_ids_test = NULL,
  rfx_basis_test = NULL,
  num_gfr = 5,
  num_burnin = 0,
  num_mcmc = 100,
  previous_model_json = NULL,
  previous_model_warmstart_sample_num = NULL,
  general_params = list(),
  prognostic_forest_params = list(),
  treatment_effect_forest_params = list(),
  variance_forest_params = list()
)

Arguments

X_train

Covariates used to split trees in the ensemble. May be provided either as a dataframe or a matrix. Matrix covariates will be assumed to be all numeric. Covariates passed as a dataframe will be preprocessed based on the variable types (e.g. categorical columns stored as unordered factors will be one-hot encoded, categorical columns stored as ordered factors will passed as integers to the core algorithm, along with the metadata that the column is ordered categorical).

Z_train

Vector of (continuous or binary) treatment assignments.

y_train

Outcome to be modeled by the ensemble.

propensity_train

(Optional) Vector of propensity scores. If not provided, this will be estimated from the data.

rfx_group_ids_train

(Optional) Group labels used for an additive random effects model.

rfx_basis_train

(Optional) Basis for "random-slope" regression in an additive random effects model. If rfx_group_ids_train is provided with a regression basis, an intercept-only random effects model will be estimated.

X_test

(Optional) Test set of covariates used to define "out of sample" evaluation data. May be provided either as a dataframe or a matrix, but the format of X_test must be consistent with that of X_train.

Z_test

(Optional) Test set of (continuous or binary) treatment assignments.

propensity_test

(Optional) Vector of propensity scores. If not provided, this will be estimated from the data.

rfx_group_ids_test

(Optional) Test set group labels used for an additive random effects model. We do not currently support (but plan to in the near future), test set evaluation for group labels that were not in the training set.

rfx_basis_test

(Optional) Test set basis for "random-slope" regression in additive random effects model.

num_gfr

Number of "warm-start" iterations run using the grow-from-root algorithm (He and Hahn, 2021). Default: 5.

num_burnin

Number of "burn-in" iterations of the MCMC sampler. Default: 0.

num_mcmc

Number of "retained" iterations of the MCMC sampler. Default: 100.

previous_model_json

(Optional) JSON string containing a previous BCF model. This can be used to "continue" a sampler interactively after inspecting the samples or to run parallel chains "warm-started" from existing forest samples. Default: NULL.

previous_model_warmstart_sample_num

(Optional) Sample number from previous_model_json that will be used to warmstart this BCF sampler. One-indexed (so that the first sample is used for warm-start by setting previous_model_warmstart_sample_num = 1). Default: NULL.

general_params

(Optional) A list of general (non-forest-specific) model parameters, each of which has a default value processed internally, so this argument list is optional.

cutpoint_grid_size Maximum size of the "grid" of potential cutpoints to consider in the GFR algorithm. Default: 100.
standardize Whether or not to standardize the outcome (and store the offset / scale in the model object). Default: TRUE.
sample_sigma2_global Whether or not to update the sigma^2 global error variance parameter based on IG(sigma2_global_shape, sigma2_global_scale). Default: TRUE.
sigma2_global_init Starting value of global error variance parameter. Calibrated internally as 1.0*var((y_train-mean(y_train))/sd(y_train)) if not set.
sigma2_global_shape Shape parameter in the IG(sigma2_global_shape, sigma2_global_scale) global error variance model. Default: 0.
sigma2_global_scale Scale parameter in the IG(sigma2_global_shape, sigma2_global_scale) global error variance model. Default: 0.
variable_weights Numeric weights reflecting the relative probability of splitting on each variable. Does not need to sum to 1 but cannot be negative. Defaults to rep(1/ncol(X_train), ncol(X_train)) if not set here. Note that if the propensity score is included as a covariate in either forest, its weight will default to 1/ncol(X_train). A workaround if you wish to provide a custom weight for the propensity score is to include it as a column in X_train and then set propensity_covariate to 'none' adjust keep_vars accordingly for the mu or tau forests.
propensity_covariate Whether to include the propensity score as a covariate in either or both of the forests. Enter "none" for neither, "mu" for the prognostic forest, "tau" for the treatment forest, and "both" for both forests. If this is not "none" and a propensity score is not provided, it will be estimated from (X_train, Z_train) using stochtree::bart(). Default: "mu".
adaptive_coding Whether or not to use an "adaptive coding" scheme in which a binary treatment variable is not coded manually as (0,1) or (-1,1) but learned via parameters b_0 and b_1 that attach to the outcome model [b_0 (1-Z) + b_1 Z] tau(X). This is ignored when Z is not binary. Default: TRUE.
control_coding_init Initial value of the "control" group coding parameter. This is ignored when Z is not binary. Default: -0.5.
treated_coding_init Initial value of the "treatment" group coding parameter. This is ignored when Z is not binary. Default: 0.5.
rfx_prior_var Prior on the (diagonals of the) covariance of the additive group-level random regression coefficients. Must be a vector of length ncol(rfx_basis_train). Default: rep(1, ncol(rfx_basis_train))
random_seed Integer parameterizing the C++ random number generator. If not specified, the C++ random number generator is seeded according to std::random_device.
keep_burnin Whether or not "burnin" samples should be included in the stored samples of forests and other parameters. Default FALSE. Ignored if num_mcmc = 0.
keep_gfr Whether or not "grow-from-root" samples should be included in the stored samples of forests and other parameters. Default FALSE. Ignored if num_mcmc = 0.
keep_every How many iterations of the burned-in MCMC sampler should be run before forests and parameters are retained. Default 1. Setting keep_every <- k for some k > 1 will "thin" the MCMC samples by retaining every k-th sample, rather than simply every sample. This can reduce the autocorrelation of the MCMC samples.
num_chains How many independent MCMC chains should be sampled. If num_mcmc = 0, this is ignored. If num_gfr = 0, then each chain is run from root for num_mcmc * keep_every + num_burnin iterations, with num_mcmc samples retained. If num_gfr > 0, each MCMC chain will be initialized from a separate GFR ensemble, with the requirement that num_gfr >= num_chains. Default: 1.
verbose Whether or not to print progress during the sampling loops. Default: FALSE.
probit_outcome_model Whether or not the outcome should be modeled as explicitly binary via a probit link. If TRUE, y must only contain the values 0 and 1. Default: FALSE.
rfx_working_parameter_prior_mean Prior mean for the random effects "working parameter". Default: NULL. Must be a vector whose dimension matches the number of random effects bases, or a scalar value that will be expanded to a vector.
rfx_group_parameters_prior_mean Prior mean for the random effects "group parameters." Default: NULL. Must be a vector whose dimension matches the number of random effects bases, or a scalar value that will be expanded to a vector.
rfx_working_parameter_prior_cov Prior covariance matrix for the random effects "working parameter." Default: NULL. Must be a square matrix whose dimension matches the number of random effects bases, or a scalar value that will be expanded to a diagonal matrix.
rfx_group_parameter_prior_cov Prior covariance matrix for the random effects "group parameters." Default: NULL. Must be a square matrix whose dimension matches the number of random effects bases, or a scalar value that will be expanded to a diagonal matrix.
rfx_variance_prior_shape Shape parameter for the inverse gamma prior on the variance of the random effects "group parameter." Default: 1.
rfx_variance_prior_scale Scale parameter for the inverse gamma prior on the variance of the random effects "group parameter." Default: 1.

prognostic_forest_params

(Optional) A list of prognostic forest model parameters, each of which has a default value processed internally, so this argument list is optional.

num_trees Number of trees in the ensemble for the prognostic forest. Default: 250. Must be a positive integer.
alpha Prior probability of splitting for a tree of depth 0 in the prognostic forest. Tree split prior combines alpha and beta via alpha*(1+node_depth)^-beta. Default: 0.95.
beta Exponent that decreases split probabilities for nodes of depth > 0 in the prognostic forest. Tree split prior combines alpha and beta via alpha*(1+node_depth)^-beta. Default: 2.
min_samples_leaf Minimum allowable size of a leaf, in terms of training samples, in the prognostic forest. Default: 5.
max_depth Maximum depth of any tree in the ensemble in the prognostic forest. Default: 10. Can be overridden with -1 which does not enforce any depth limits on trees.
variable_weights Numeric weights reflecting the relative probability of splitting on each variable in the prognostic forest. Does not need to sum to 1 but cannot be negative. Defaults to rep(1/ncol(X_train), ncol(X_train)) if not set here.
sample_sigma2_leaf Whether or not to update the leaf scale variance parameter based on IG(sigma2_leaf_shape, sigma2_leaf_scale).
sigma2_leaf_init Starting value of leaf node scale parameter. Calibrated internally as 1/num_trees if not set here.
sigma2_leaf_shape Shape parameter in the IG(sigma2_leaf_shape, sigma2_leaf_scale) leaf node parameter variance model. Default: 3.
sigma2_leaf_scale Scale parameter in the IG(sigma2_leaf_shape, sigma2_leaf_scale) leaf node parameter variance model. Calibrated internally as 0.5/num_trees if not set here.
keep_vars Vector of variable names or column indices denoting variables that should be included in the forest. Default: NULL.
drop_vars Vector of variable names or column indices denoting variables that should be excluded from the forest. Default: NULL. If both drop_vars and keep_vars are set, drop_vars will be ignored.
num_features_subsample How many features to subsample when growing each tree for the GFR algorithm. Defaults to the number of features in the training dataset.

treatment_effect_forest_params

(Optional) A list of treatment effect forest model parameters, each of which has a default value processed internally, so this argument list is optional.

num_trees Number of trees in the ensemble for the treatment effect forest. Default: 50. Must be a positive integer.
alpha Prior probability of splitting for a tree of depth 0 in the treatment effect forest. Tree split prior combines alpha and beta via alpha*(1+node_depth)^-beta. Default: 0.25.
beta Exponent that decreases split probabilities for nodes of depth > 0 in the treatment effect forest. Tree split prior combines alpha and beta via alpha*(1+node_depth)^-beta. Default: 3.
min_samples_leaf Minimum allowable size of a leaf, in terms of training samples, in the treatment effect forest. Default: 5.
max_depth Maximum depth of any tree in the ensemble in the treatment effect forest. Default: 5. Can be overridden with -1 which does not enforce any depth limits on trees.
variable_weights Numeric weights reflecting the relative probability of splitting on each variable in the treatment effect forest. Does not need to sum to 1 but cannot be negative. Defaults to rep(1/ncol(X_train), ncol(X_train)) if not set here.
sample_sigma2_leaf Whether or not to update the leaf scale variance parameter based on IG(sigma2_leaf_shape, sigma2_leaf_scale). Cannot (currently) be set to true if ncol(Z_train)>1. Default: FALSE.
sigma2_leaf_init Starting value of leaf node scale parameter. Calibrated internally as 1/num_trees if not set here.
sigma2_leaf_shape Shape parameter in the IG(sigma2_leaf_shape, sigma2_leaf_scale) leaf node parameter variance model. Default: 3.
sigma2_leaf_scale Scale parameter in the IG(sigma2_leaf_shape, sigma2_leaf_scale) leaf node parameter variance model. Calibrated internally as 0.5/num_trees if not set here.
delta_max Maximum plausible conditional distributional treatment effect (i.e. P(Y(1) = 1 | X) - P(Y(0) = 1 | X)) when the outcome is binary. Only used when the outcome is specified as a probit model in general_params. Must be > 0 and < 1. Default: 0.9. Ignored if sigma2_leaf_init is set directly, as this parameter is used to calibrate sigma2_leaf_init.
keep_vars Vector of variable names or column indices denoting variables that should be included in the forest. Default: NULL.
drop_vars Vector of variable names or column indices denoting variables that should be excluded from the forest. Default: NULL. If both drop_vars and keep_vars are set, drop_vars will be ignored.
num_features_subsample How many features to subsample when growing each tree for the GFR algorithm. Defaults to the number of features in the training dataset.

variance_forest_params

(Optional) A list of variance forest model parameters, each of which has a default value processed internally, so this argument list is optional.

num_trees Number of trees in the ensemble for the conditional variance model. Default: 0. Variance is only modeled using a tree / forest if num_trees > 0.
alpha Prior probability of splitting for a tree of depth 0 in the variance model. Tree split prior combines alpha and beta via alpha*(1+node_depth)^-beta. Default: 0.95.
beta Exponent that decreases split probabilities for nodes of depth > 0 in the variance model. Tree split prior combines alpha and beta via alpha*(1+node_depth)^-beta. Default: 2.
min_samples_leaf Minimum allowable size of a leaf, in terms of training samples, in the variance model. Default: 5.
max_depth Maximum depth of any tree in the ensemble in the variance model. Default: 10. Can be overridden with -1 which does not enforce any depth limits on trees.
leaf_prior_calibration_param Hyperparameter used to calibrate the IG(var_forest_prior_shape, var_forest_prior_scale) conditional error variance model. If var_forest_prior_shape and var_forest_prior_scale are not set below, this calibration parameter is used to set these values to num_trees / leaf_prior_calibration_param^2 + 0.5 and num_trees / leaf_prior_calibration_param^2, respectively. Default: 1.5.
variance_forest_init Starting value of root forest prediction in conditional (heteroskedastic) error variance model. Calibrated internally as log(0.6*var((y_train-mean(y_train))/sd(y_train)))/num_trees if not set.
var_forest_prior_shape Shape parameter in the IG(var_forest_prior_shape, var_forest_prior_scale) conditional error variance model (which is only sampled if num_trees > 0). Calibrated internally as num_trees / 1.5^2 + 0.5 if not set.
var_forest_prior_scale Scale parameter in the IG(var_forest_prior_shape, var_forest_prior_scale) conditional error variance model (which is only sampled if num_trees > 0). Calibrated internally as num_trees / 1.5^2 if not set.
keep_vars Vector of variable names or column indices denoting variables that should be included in the forest. Default: NULL.
drop_vars Vector of variable names or column indices denoting variables that should be excluded from the forest. Default: NULL. If both drop_vars and keep_vars are set, drop_vars will be ignored.
num_features_subsample How many features to subsample when growing each tree for the GFR algorithm. Defaults to the number of features in the training dataset.

Value

List of sampling outputs and a wrapper around the sampled forests (which can be used for in-memory prediction on new data, or serialized to JSON on disk).

Examples

n <- 500
p <- 5
X <- matrix(runif(n*p), ncol = p)
mu_x <- (
    ((0 <= X[,1]) & (0.25 > X[,1])) * (-7.5) + 
    ((0.25 <= X[,1]) & (0.5 > X[,1])) * (-2.5) + 
    ((0.5 <= X[,1]) & (0.75 > X[,1])) * (2.5) + 
    ((0.75 <= X[,1]) & (1 > X[,1])) * (7.5)
)
pi_x <- (
    ((0 <= X[,1]) & (0.25 > X[,1])) * (0.2) + 
    ((0.25 <= X[,1]) & (0.5 > X[,1])) * (0.4) + 
    ((0.5 <= X[,1]) & (0.75 > X[,1])) * (0.6) + 
    ((0.75 <= X[,1]) & (1 > X[,1])) * (0.8)
)
tau_x <- (
    ((0 <= X[,2]) & (0.25 > X[,2])) * (0.5) + 
    ((0.25 <= X[,2]) & (0.5 > X[,2])) * (1.0) + 
    ((0.5 <= X[,2]) & (0.75 > X[,2])) * (1.5) + 
    ((0.75 <= X[,2]) & (1 > X[,2])) * (2.0)
)
Z <- rbinom(n, 1, pi_x)
noise_sd <- 1
y <- mu_x + tau_x*Z + rnorm(n, 0, noise_sd)
test_set_pct <- 0.2
n_test <- round(test_set_pct*n)
n_train <- n - n_test
test_inds <- sort(sample(1:n, n_test, replace = FALSE))
train_inds <- (1:n)[!((1:n) %in% test_inds)]
X_test <- X[test_inds,]
X_train <- X[train_inds,]
pi_test <- pi_x[test_inds]
pi_train <- pi_x[train_inds]
Z_test <- Z[test_inds]
Z_train <- Z[train_inds]
y_test <- y[test_inds]
y_train <- y[train_inds]
mu_test <- mu_x[test_inds]
mu_train <- mu_x[train_inds]
tau_test <- tau_x[test_inds]
tau_train <- tau_x[train_inds]
bcf_model <- bcf(X_train = X_train, Z_train = Z_train, y_train = y_train, 
                 propensity_train = pi_train, X_test = X_test, Z_test = Z_test, 
                 propensity_test = pi_test, num_gfr = 10, 
                 num_burnin = 0, num_mcmc = 10)