Run the Bayesian Causal Forest (BCF) algorithm for regularized causal effect estimation.
Source:R/bcf.R
bcf.RdRun the Bayesian Causal Forest (BCF) algorithm for regularized causal effect estimation.
Usage
bcf(
X_train,
Z_train,
y_train,
propensity_train = NULL,
rfx_group_ids_train = NULL,
rfx_basis_train = NULL,
X_test = NULL,
Z_test = NULL,
propensity_test = NULL,
rfx_group_ids_test = NULL,
rfx_basis_test = NULL,
num_gfr = 5,
num_burnin = 0,
num_mcmc = 100,
previous_model_json = NULL,
previous_model_warmstart_sample_num = NULL,
general_params = list(),
prognostic_forest_params = list(),
treatment_effect_forest_params = list(),
variance_forest_params = list(),
random_effects_params = list()
)Arguments
- X_train
Covariates used to split trees in the ensemble. May be provided either as a dataframe or a matrix. Matrix covariates will be assumed to be all numeric. Covariates passed as a dataframe will be preprocessed based on the variable types (e.g. categorical columns stored as unordered factors will be one-hot encoded, categorical columns stored as ordered factors will passed as integers to the core algorithm, along with the metadata that the column is ordered categorical).
- Z_train
Vector of (continuous or binary) treatment assignments.
- y_train
Outcome to be modeled by the ensemble.
- propensity_train
(Optional) Vector of propensity scores. If not provided, this will be estimated from the data.
- rfx_group_ids_train
(Optional) Group labels used for an additive random effects model.
- rfx_basis_train
(Optional) Basis for "random-slope" regression in an additive random effects model. If
rfx_group_ids_trainis provided with a regression basis, an intercept-only random effects model will be estimated.- X_test
(Optional) Test set of covariates used to define "out of sample" evaluation data. May be provided either as a dataframe or a matrix, but the format of
X_testmust be consistent with that ofX_train.- Z_test
(Optional) Test set of (continuous or binary) treatment assignments.
- propensity_test
(Optional) Vector of propensity scores. If not provided, this will be estimated from the data.
- rfx_group_ids_test
(Optional) Test set group labels used for an additive random effects model. We do not currently support (but plan to in the near future), test set evaluation for group labels that were not in the training set.
- rfx_basis_test
(Optional) Test set basis for "random-slope" regression in additive random effects model.
- num_gfr
Number of "warm-start" iterations run using the grow-from-root algorithm (He and Hahn, 2021). Default: 5.
- num_burnin
Number of "burn-in" iterations of the MCMC sampler. Default: 0.
- num_mcmc
Number of "retained" iterations of the MCMC sampler. Default: 100.
- previous_model_json
(Optional) JSON string containing a previous BCF model. This can be used to "continue" a sampler interactively after inspecting the samples or to run parallel chains "warm-started" from existing forest samples. Default:
NULL.- previous_model_warmstart_sample_num
(Optional) Sample number from
previous_model_jsonthat will be used to warmstart this BCF sampler. One-indexed (so that the first sample is used for warm-start by settingprevious_model_warmstart_sample_num = 1). Default:NULL. Ifnum_chainsin thegeneral_paramslist is > 1, then each successive chain will be initialized from a different sample, counting backwards fromprevious_model_warmstart_sample_num. That is, ifprevious_model_warmstart_sample_num = 10andnum_chains = 4, then chain 1 will be initialized from sample 10, chain 2 from sample 9, chain 3 from sample 8, and chain 4 from sample 7. Ifprevious_model_jsonis provided butprevious_model_warmstart_sample_numis NULL, the last sample in the previous model will be used to initialize the first chain, counting backwards as noted before. If more chains are requested than there are samples inprevious_model_json, a warning will be raised and only the last sample will be used.- general_params
(Optional) A list of general (non-forest-specific) model parameters, each of which has a default value processed internally, so this argument list is optional.
cutpoint_grid_sizeMaximum size of the "grid" of potential cutpoints to consider in the GFR algorithm. Default:100.standardizeWhether or not to standardize the outcome (and store the offset / scale in the model object). Default:TRUE.sample_sigma2_globalWhether or not to update thesigma^2global error variance parameter based onIG(sigma2_global_shape, sigma2_global_scale). Default:TRUE.sigma2_global_initStarting value of global error variance parameter. Calibrated internally as1.0*var((y_train-mean(y_train))/sd(y_train))if not set.sigma2_global_shapeShape parameter in theIG(sigma2_global_shape, sigma2_global_scale)global error variance model. Default:0.sigma2_global_scaleScale parameter in theIG(sigma2_global_shape, sigma2_global_scale)global error variance model. Default:0.variable_weightsNumeric weights reflecting the relative probability of splitting on each variable. Does not need to sum to 1 but cannot be negative. Defaults torep(1/ncol(X_train), ncol(X_train))if not set here. Note that if the propensity score is included as a covariate in either forest, its weight will default to1/ncol(X_train). A workaround if you wish to provide a custom weight for the propensity score is to include it as a column inX_trainand then setpropensity_covariateto'none'adjustkeep_varsaccordingly for theprognosticortreatment_effectforests.propensity_covariateWhether to include the propensity score as a covariate in either or both of the forests. Enter"none"for neither,"prognostic"for the prognostic forest,"treatment_effect"for the treatment forest, and"both"for both forests. If this is not"none"and a propensity score is not provided, it will be estimated from (X_train,Z_train) usingstochtree::bart(). Default:"mu".adaptive_codingWhether or not to use an "adaptive coding" scheme in which a binary treatment variable is not coded manually as (0,1) or (-1,1) but learned via parametersb_0andb_1that attach to the outcome model[b_0 (1-Z) + b_1 Z] tau(X). This is ignored when Z is not binary. Default:TRUE.control_coding_initInitial value of the "control" group coding parameter. This is ignored when Z is not binary. Default:-0.5.treated_coding_initInitial value of the "treatment" group coding parameter. This is ignored when Z is not binary. Default:0.5.rfx_prior_varPrior on the (diagonals of the) covariance of the additive group-level random regression coefficients. Must be a vector of lengthncol(rfx_basis_train). Default:rep(1, ncol(rfx_basis_train))random_seedInteger parameterizing the C++ random number generator. If not specified, the C++ random number generator is seeded according tostd::random_device.keep_burninWhether or not "burnin" samples should be included in the stored samples of forests and other parameters. DefaultFALSE. Ignored ifnum_mcmc = 0.keep_gfrWhether or not "grow-from-root" samples should be included in the stored samples of forests and other parameters. DefaultFALSE. Ignored ifnum_mcmc = 0.keep_everyHow many iterations of the burned-in MCMC sampler should be run before forests and parameters are retained. Default1. Settingkeep_every <- kfor somek > 1will "thin" the MCMC samples by retaining everyk-th sample, rather than simply every sample. This can reduce the autocorrelation of the MCMC samples.num_chainsHow many independent MCMC chains should be sampled. Ifnum_mcmc = 0, this is ignored. Ifnum_gfr = 0, then each chain is run from root fornum_mcmc * keep_every + num_burniniterations, withnum_mcmcsamples retained. Ifnum_gfr > 0, each MCMC chain will be initialized from a separate GFR ensemble, with the requirement thatnum_gfr >= num_chains. Default:1. Note that ifnum_chains > 1, the returned model object will contain samples from all chains, stored consecutively. That is, if there are 4 chains with 100 samples each, the first 100 samples will be from chain 1, the next 100 samples will be from chain 2, etc... For more detail on working with multi-chain BCF models, see the multi chain vignette.verboseWhether or not to print progress during the sampling loops. Default:FALSE.probit_outcome_modelWhether or not the outcome should be modeled as explicitly binary via a probit link. IfTRUE,ymust only contain the values0and1. Default:FALSE.num_threadsNumber of threads to use in the GFR and MCMC algorithms, as well as prediction. If OpenMP is not available on a user's setup, this will default to1, otherwise to the maximum number of available threads.
- prognostic_forest_params
(Optional) A list of prognostic forest model parameters, each of which has a default value processed internally, so this argument list is optional.
num_treesNumber of trees in the ensemble for the prognostic forest. Default:250. Must be a positive integer.alphaPrior probability of splitting for a tree of depth 0 in the prognostic forest. Tree split prior combinesalphaandbetaviaalpha*(1+node_depth)^-beta. Default:0.95.betaExponent that decreases split probabilities for nodes of depth > 0 in the prognostic forest. Tree split prior combinesalphaandbetaviaalpha*(1+node_depth)^-beta. Default:2.min_samples_leafMinimum allowable size of a leaf, in terms of training samples, in the prognostic forest. Default:5.max_depthMaximum depth of any tree in the ensemble in the prognostic forest. Default:10. Can be overridden with-1which does not enforce any depth limits on trees.variable_weightsNumeric weights reflecting the relative probability of splitting on each variable in the prognostic forest. Does not need to sum to 1 but cannot be negative. Defaults torep(1/ncol(X_train), ncol(X_train))if not set here.sample_sigma2_leafWhether or not to update the leaf scale variance parameter based onIG(sigma2_leaf_shape, sigma2_leaf_scale).sigma2_leaf_initStarting value of leaf node scale parameter. Calibrated internally as1/num_treesif not set here.sigma2_leaf_shapeShape parameter in theIG(sigma2_leaf_shape, sigma2_leaf_scale)leaf node parameter variance model. Default:3.sigma2_leaf_scaleScale parameter in theIG(sigma2_leaf_shape, sigma2_leaf_scale)leaf node parameter variance model. Calibrated internally as0.5/num_treesif not set here.keep_varsVector of variable names or column indices denoting variables that should be included in the forest. Default:NULL.drop_varsVector of variable names or column indices denoting variables that should be excluded from the forest. Default:NULL. If bothdrop_varsandkeep_varsare set,drop_varswill be ignored.num_features_subsampleHow many features to subsample when growing each tree for the GFR algorithm. Defaults to the number of features in the training dataset.
- treatment_effect_forest_params
(Optional) A list of treatment effect forest model parameters, each of which has a default value processed internally, so this argument list is optional.
num_treesNumber of trees in the ensemble for the treatment effect forest. Default:50. Must be a positive integer.alphaPrior probability of splitting for a tree of depth 0 in the treatment effect forest. Tree split prior combinesalphaandbetaviaalpha*(1+node_depth)^-beta. Default:0.25.betaExponent that decreases split probabilities for nodes of depth > 0 in the treatment effect forest. Tree split prior combinesalphaandbetaviaalpha*(1+node_depth)^-beta. Default:3.min_samples_leafMinimum allowable size of a leaf, in terms of training samples, in the treatment effect forest. Default:5.max_depthMaximum depth of any tree in the ensemble in the treatment effect forest. Default:5. Can be overridden with-1which does not enforce any depth limits on trees.variable_weightsNumeric weights reflecting the relative probability of splitting on each variable in the treatment effect forest. Does not need to sum to 1 but cannot be negative. Defaults torep(1/ncol(X_train), ncol(X_train))if not set here.sample_sigma2_leafWhether or not to update the leaf scale variance parameter based onIG(sigma2_leaf_shape, sigma2_leaf_scale). Cannot (currently) be set to true ifncol(Z_train)>1. Default:FALSE.sigma2_leaf_initStarting value of leaf node scale parameter. Calibrated internally as1/num_treesif not set here.sigma2_leaf_shapeShape parameter in theIG(sigma2_leaf_shape, sigma2_leaf_scale)leaf node parameter variance model. Default:3.sigma2_leaf_scaleScale parameter in theIG(sigma2_leaf_shape, sigma2_leaf_scale)leaf node parameter variance model. Calibrated internally as0.5/num_treesif not set here.delta_maxMaximum plausible conditional distributional treatment effect (i.e. P(Y(1) = 1 | X) - P(Y(0) = 1 | X)) when the outcome is binary. Only used when the outcome is specified as a probit model ingeneral_params. Must be > 0 and < 1. Default:0.9. Ignored ifsigma2_leaf_initis set directly, as this parameter is used to calibratesigma2_leaf_init.keep_varsVector of variable names or column indices denoting variables that should be included in the forest. Default:NULL.drop_varsVector of variable names or column indices denoting variables that should be excluded from the forest. Default:NULL. If bothdrop_varsandkeep_varsare set,drop_varswill be ignored.num_features_subsampleHow many features to subsample when growing each tree for the GFR algorithm. Defaults to the number of features in the training dataset.
- variance_forest_params
(Optional) A list of variance forest model parameters, each of which has a default value processed internally, so this argument list is optional.
num_treesNumber of trees in the ensemble for the conditional variance model. Default:0. Variance is only modeled using a tree / forest ifnum_trees > 0.alphaPrior probability of splitting for a tree of depth 0 in the variance model. Tree split prior combinesalphaandbetaviaalpha*(1+node_depth)^-beta. Default:0.95.betaExponent that decreases split probabilities for nodes of depth > 0 in the variance model. Tree split prior combinesalphaandbetaviaalpha*(1+node_depth)^-beta. Default:2.min_samples_leafMinimum allowable size of a leaf, in terms of training samples, in the variance model. Default:5.max_depthMaximum depth of any tree in the ensemble in the variance model. Default:10. Can be overridden with-1which does not enforce any depth limits on trees.leaf_prior_calibration_paramHyperparameter used to calibrate theIG(var_forest_prior_shape, var_forest_prior_scale)conditional error variance model. Ifvar_forest_prior_shapeandvar_forest_prior_scaleare not set below, this calibration parameter is used to set these values tonum_trees / leaf_prior_calibration_param^2 + 0.5andnum_trees / leaf_prior_calibration_param^2, respectively. Default:1.5.variance_forest_initStarting value of root forest prediction in conditional (heteroskedastic) error variance model. Calibrated internally aslog(0.6*var((y_train-mean(y_train))/sd(y_train)))/num_treesif not set.var_forest_prior_shapeShape parameter in theIG(var_forest_prior_shape, var_forest_prior_scale)conditional error variance model (which is only sampled ifnum_trees > 0). Calibrated internally asnum_trees / 1.5^2 + 0.5if not set.var_forest_prior_scaleScale parameter in theIG(var_forest_prior_shape, var_forest_prior_scale)conditional error variance model (which is only sampled ifnum_trees > 0). Calibrated internally asnum_trees / 1.5^2if not set.keep_varsVector of variable names or column indices denoting variables that should be included in the forest. Default:NULL.drop_varsVector of variable names or column indices denoting variables that should be excluded from the forest. Default:NULL. If bothdrop_varsandkeep_varsare set,drop_varswill be ignored.num_features_subsampleHow many features to subsample when growing each tree for the GFR algorithm. Defaults to the number of features in the training dataset.
- random_effects_params
(Optional) A list of random effects model parameters, each of which has a default value processed internally, so this argument list is optional.
model_specSpecification of the random effects model. Options are "custom", "intercept_only", and "intercept_plus_treatment". If "custom" is specified, then a user-provided basis must be passed throughrfx_basis_train. If "intercept_only" is specified, a random effects basis of all ones will be dispatched internally at sampling and prediction time. If "intercept_plus_treatment" is specified, a random effects basis that combines an "intercept" basis of all ones with the treatment variable (Z_train) will be dispatched internally at sampling and prediction time. Default: "custom". If either "intercept_only" or "intercept_plus_treatment" is specified,rfx_basis_trainandrfx_basis_test(if provided) will be ignored.working_parameter_prior_meanPrior mean for the random effects "working parameter". Default:NULL. Must be a vector whose dimension matches the number of random effects bases, or a scalar value that will be expanded to a vector.group_parameters_prior_meanPrior mean for the random effects "group parameters." Default:NULL. Must be a vector whose dimension matches the number of random effects bases, or a scalar value that will be expanded to a vector.working_parameter_prior_covPrior covariance matrix for the random effects "working parameter." Default:NULL. Must be a square matrix whose dimension matches the number of random effects bases, or a scalar value that will be expanded to a diagonal matrix.group_parameter_prior_covPrior covariance matrix for the random effects "group parameters." Default:NULL. Must be a square matrix whose dimension matches the number of random effects bases, or a scalar value that will be expanded to a diagonal matrix.variance_prior_shapeShape parameter for the inverse gamma prior on the variance of the random effects "group parameter." Default:1.variance_prior_scaleScale parameter for the inverse gamma prior on the variance of the random effects "group parameter." Default:1.
Value
List of sampling outputs and a wrapper around the sampled forests (which can be used for in-memory prediction on new data, or serialized to JSON on disk).
Examples
n <- 500
p <- 5
X <- matrix(runif(n*p), ncol = p)
mu_x <- (
((0 <= X[,1]) & (0.25 > X[,1])) * (-7.5) +
((0.25 <= X[,1]) & (0.5 > X[,1])) * (-2.5) +
((0.5 <= X[,1]) & (0.75 > X[,1])) * (2.5) +
((0.75 <= X[,1]) & (1 > X[,1])) * (7.5)
)
pi_x <- (
((0 <= X[,1]) & (0.25 > X[,1])) * (0.2) +
((0.25 <= X[,1]) & (0.5 > X[,1])) * (0.4) +
((0.5 <= X[,1]) & (0.75 > X[,1])) * (0.6) +
((0.75 <= X[,1]) & (1 > X[,1])) * (0.8)
)
tau_x <- (
((0 <= X[,2]) & (0.25 > X[,2])) * (0.5) +
((0.25 <= X[,2]) & (0.5 > X[,2])) * (1.0) +
((0.5 <= X[,2]) & (0.75 > X[,2])) * (1.5) +
((0.75 <= X[,2]) & (1 > X[,2])) * (2.0)
)
Z <- rbinom(n, 1, pi_x)
noise_sd <- 1
y <- mu_x + tau_x*Z + rnorm(n, 0, noise_sd)
test_set_pct <- 0.2
n_test <- round(test_set_pct*n)
n_train <- n - n_test
test_inds <- sort(sample(1:n, n_test, replace = FALSE))
train_inds <- (1:n)[!((1:n) %in% test_inds)]
X_test <- X[test_inds,]
X_train <- X[train_inds,]
pi_test <- pi_x[test_inds]
pi_train <- pi_x[train_inds]
Z_test <- Z[test_inds]
Z_train <- Z[train_inds]
y_test <- y[test_inds]
y_train <- y[train_inds]
mu_test <- mu_x[test_inds]
mu_train <- mu_x[train_inds]
tau_test <- tau_x[test_inds]
tau_train <- tau_x[train_inds]
bcf_model <- bcf(X_train = X_train, Z_train = Z_train, y_train = y_train,
propensity_train = pi_train, X_test = X_test, Z_test = Z_test,
propensity_test = pi_test, num_gfr = 10,
num_burnin = 0, num_mcmc = 10)