forest.ForestContainer

forest.ForestContainer(
    num_trees,
    output_dimension=1,
    leaf_constant=True,
    is_exponentiated=False,
)

Container that stores sampled (and retained) tree ensembles from BART, BCF or a custom sampler.

Parameters

Name Type Description Default
num_trees int Number of trees that each forest should contain required
output_dimension int Dimension of the leaf node parameters in each tree 1
leaf_constant bool Whether the leaf node model is “constant” (i.e. prediction is simply a sum of leaf node parameters for every observation in a dataset) or not (i.e. each leaf node parameter is multiplied by a “basis vector” before being returned as a prediction). True
is_exponentiated bool Whether or not the leaf node parameters are stored in log scale (in which case, they must be exponentiated before being returned as predictions). False
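
As a rough illustration of how leaf_constant and is_exponentiated shape a prediction, the sketch below combines per-tree leaf values for a single observation in plain Python. It is not the library's C++ prediction code: the helper name combine_leaf_outputs is hypothetical, and applying the exponential to the summed forest output is an assumption made for illustration.

```python
import math

def combine_leaf_outputs(leaf_values, basis=None, exponentiate=False):
    """Combine per-tree leaf parameters for ONE observation (illustrative only).

    leaf_values: one entry per tree; each entry is a list of length
                 output_dimension (the leaf the observation falls into).
    basis:       None for a constant leaf model, otherwise a list of length
                 output_dimension that multiplies each leaf parameter vector.
    """
    total = 0.0
    for leaf in leaf_values:
        if basis is None:
            # Constant leaf: the univariate leaf parameter is added directly.
            total += leaf[0]
        else:
            # Non-constant leaf: dot product of leaf parameters and the basis.
            total += sum(b * v for b, v in zip(basis, leaf))
    # Assumption: log-scale parameters are exponentiated after summation.
    return math.exp(total) if exponentiate else total

# Three trees, constant univariate leaves.
constant_pred = combine_leaf_outputs([[0.5], [-0.25], [1.0]])
# Two trees, two-dimensional leaves multiplied by a basis vector.
basis_pred = combine_leaf_outputs([[1.0, 2.0], [0.5, -1.0]], basis=[1.0, 0.5])
```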

Methods

Name Description
predict Predict from each forest in the container, using the provided Dataset object.
predict_raw Predict raw leaf values for every forest in the container, using the provided Dataset object
predict_raw_single_forest Predict raw leaf values for a specific forest (indexed by forest_num), using the provided Dataset object
predict_raw_single_tree Predict raw leaf values for a specific tree of a specific forest (indexed by tree_num and forest_num respectively), using the provided Dataset object.
set_root_leaves Set constant (root) leaf node values for every tree in the forest indexed by forest_num.
collapse Collapse forests in this container by a pre-specified batch size.
combine_forests Collapse specified forests into a single forest
add_to_forest Add a constant value to every leaf of every tree of a given forest
multiply_forest Multiply every leaf of every tree of a given forest by a constant value
save_to_json_file Save the forests in the container to a JSON file.
load_from_json_file Load a forest container from output stored in a JSON file.
dump_json_string Dump a forest container into an in-memory JSON string (which can be directly serialized or combined with other JSON strings before serialization).
load_from_json_string Reload a forest container from an in-memory JSON string.
load_from_json_object Reload a forest container from an in-memory JSONSerializer object.
add_sample Add a new all-root ensemble to the container, with all of the leaves set to the value / vector provided
add_numeric_split Add a numeric (i.e. X[,i] <= c) split to a given tree in the ensemble
get_tree_leaves Retrieve a vector of indices of leaf nodes for a given tree in a given forest
get_tree_split_counts Retrieve a vector of split counts for every training set feature in a given tree in a given forest
get_forest_split_counts Retrieve a vector of split counts for every training set feature in a given forest
get_overall_split_counts Retrieve a vector of split counts for every training set feature, aggregated across ensembles and trees.
get_granular_split_counts Retrieve an array of split counts for every training set feature, reported separately for each forest and tree in the container
num_forest_leaves Return the total number of leaves for a given forest in the ForestContainer
sum_leaves_squared Return the total sum of squared leaf values for a given forest in the ForestContainer
is_leaf_node Whether or not a given node of a given tree in a given forest in the ForestContainer is a leaf
is_numeric_split_node Whether or not a given node of a given tree in a given forest in the ForestContainer is a numeric split node
is_categorical_split_node Whether or not a given node of a given tree in a given forest in the ForestContainer is a categorical split node
parent_node Parent node of a given node of a given tree in a given forest in the ForestContainer
left_child_node Left child node of a given node of a given tree in a given forest in the ForestContainer
right_child_node Right child node of a given node of a given tree in a given forest in the ForestContainer
node_depth Depth of a given node of a given tree in a given forest in the ForestContainer.
node_split_index Split index of a given node of a given tree in a given forest in the ForestContainer.
node_split_threshold Threshold that defines a numeric split for a given node of a given tree in a given forest in the ForestContainer.
node_split_categories Array of category indices that define a categorical split for a given node of a given tree in a given forest in the ForestContainer.
node_leaf_values Node parameter value(s) for a given node of a given tree in a given forest in the ForestContainer.
num_samples Number of forest samples in the ForestContainer.
num_nodes Number of nodes in a given tree in a given forest in the ForestContainer.
num_leaves Number of leaves in a given tree in a given forest in the ForestContainer.
num_leaf_parents Number of leaf parents (split nodes with two leaves as children) in a given tree in a given forest in the ForestContainer.
num_split_nodes Number of split nodes in a given tree in a given forest in the ForestContainer.
nodes Array of node indices in a given tree in a given forest in the ForestContainer.
leaves Array of leaf indices in a given tree in a given forest in the ForestContainer.
delete_sample Modify the ForestContainer by removing the forest sample indexed by forest_num.

predict

forest.ForestContainer.predict(dataset)

Predict from each forest in the container, using the provided Dataset object.

Parameters

Name Type Description Default
dataset Dataset Python object wrapping the “dataset” class used by C++ sampling and prediction data structures. required

Returns

Name Type Description
np.array Numpy array with (n, m) dimensions, where n is the number of observations in dataset and m is the number of samples in the forest container.

predict_raw

forest.ForestContainer.predict_raw(dataset)

Predict raw leaf values for every forest in the container, using the provided Dataset object.

Parameters

Name Type Description Default
dataset Dataset Python object wrapping the “dataset” class used by C++ sampling and prediction data structures. required

Returns

Name Type Description
np.array Numpy array with (n, k, m) dimensions, where n is the number of observations in dataset, k is the dimension of the leaf parameter, and m is the number of samples in the forest container. If k = 1, then the returned array is simply (n, m) dimensions.
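
The shape rule described above can be written down as a small helper, which may be handy when validating downstream code. This is only a sketch of the documented behavior; raw_prediction_shape is a hypothetical name, not part of the library.

```python
def raw_prediction_shape(n, k, m):
    """Expected shape of a predict_raw result, per the description above.

    n: number of observations in the dataset
    k: dimension of the leaf parameter
    m: number of forest samples in the container
    """
    if k == 1:
        # Univariate leaves: the middle axis is dropped.
        return (n, m)
    return (n, k, m)

univariate_shape = raw_prediction_shape(100, 1, 50)
multivariate_shape = raw_prediction_shape(100, 2, 50)
```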

predict_raw_single_forest

forest.ForestContainer.predict_raw_single_forest(dataset, forest_num)

Predict raw leaf values for a specific forest (indexed by forest_num), using the provided Dataset object

Parameters

Name Type Description Default
dataset Dataset Python object wrapping the “dataset” class used by C++ sampling and prediction data structures. required
forest_num int Index of the forest from which to predict. Forest indices are 0-based. required

Returns

Name Type Description
np.array Numpy array with (n, k) dimensions, where n is the number of observations in dataset and k is the dimension of the leaf parameter.

predict_raw_single_tree

forest.ForestContainer.predict_raw_single_tree(dataset, forest_num, tree_num)

Predict raw leaf values for a specific tree of a specific forest (indexed by tree_num and forest_num respectively), using the provided Dataset object.

Parameters

Name Type Description Default
dataset Dataset Python object wrapping the “dataset” class used by C++ sampling and prediction data structures. required
forest_num int Index of the forest from which to predict. Forest indices are 0-based. required
tree_num int Index of the tree from which to predict (within the forest indexed by forest_num). Tree indices are 0-based. required

Returns

Name Type Description
np.array Numpy array with (n, k) dimensions, where n is the number of observations in dataset and k is the dimension of the leaf parameter.

set_root_leaves

forest.ForestContainer.set_root_leaves(forest_num, leaf_value)

Set constant (root) leaf node values for every tree in the forest indexed by forest_num. Assumes the forest consists of all root (single-node) trees.

Parameters

Name Type Description Default
forest_num int Index of the forest for which we will set root node parameters. required
leaf_value float or np.array Constant values to which root nodes are to be set. If the trees in forest forest_num are univariate, then leaf_value must be a float, while if the trees in forest forest_num are multivariate, then leaf_value must be a np.array. required

collapse

forest.ForestContainer.collapse(batch_size)

Collapse forests in this container by a pre-specified batch size. For example, if we have a container of twenty 10-tree forests, and we specify a batch_size of 5, then this method will yield four 50-tree forests. “Excess” forests remaining after the size of a forest container is divided by batch_size will be pruned from the beginning of the container (i.e. earlier sampled forests will be deleted). This method has no effect if batch_size is larger than the number of forests in a container.

Parameters

Name Type Description Default
batch_size int Number of forests to be collapsed into a single forest required
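
The batch arithmetic described above can be checked with a small helper. This is a sketch of the bookkeeping only (how many forests result, how many trees each contains, and how many early forests are pruned); the name collapse_plan is hypothetical and the helper says nothing about how leaf values are treated during a collapse.

```python
def collapse_plan(num_forests, num_trees, batch_size):
    """Return (forests_after, trees_per_forest, forests_pruned)."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    if batch_size > num_forests:
        # Documented behavior: no effect when batch_size exceeds the
        # number of forests in the container.
        return num_forests, num_trees, 0
    # Excess forests are pruned from the beginning of the container.
    forests_pruned = num_forests % batch_size
    forests_after = num_forests // batch_size
    return forests_after, num_trees * batch_size, forests_pruned

# The example from the text: twenty 10-tree forests, batch_size of 5.
plan_even = collapse_plan(20, 10, 5)
# Twenty-two forests leave two excess forests to prune.
plan_excess = collapse_plan(22, 10, 5)
```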

combine_forests

forest.ForestContainer.combine_forests(forest_inds)

Collapse specified forests into a single forest

Parameters

Name Type Description Default
forest_inds np.array Indices of forests to be combined (0-indexed). required

add_to_forest

forest.ForestContainer.add_to_forest(forest_index, constant_value)

Add a constant value to every leaf of every tree of a given forest

Parameters

Name Type Description Default
forest_index int Index of forest whose leaves will be modified (0-indexed) required
constant_value float Value to add to every leaf of every tree of the forest at forest_index required

multiply_forest

forest.ForestContainer.multiply_forest(forest_index, constant_multiple)

Multiply every leaf of every tree of a given forest by a constant value

Parameters

Name Type Description Default
forest_index int Index of forest whose leaves will be modified (0-indexed) required
constant_multiple float Value by which to multiply every leaf of every tree of the forest at forest_index required
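
To see what these leaf-wise operations do to predictions, the sketch below models a forest with constant univariate leaves as a list of trees, each a list of leaf values. The helpers are hypothetical stand-ins that return new lists (the documented methods modify the container), and the toy data are invented for illustration.

```python
def add_to_leaves(forest, constant_value):
    """Add a constant to every leaf of every tree."""
    return [[leaf + constant_value for leaf in tree] for tree in forest]

def multiply_leaves(forest, constant_multiple):
    """Multiply every leaf of every tree by a constant."""
    return [[leaf * constant_multiple for leaf in tree] for tree in forest]

def predict_constant(forest, leaf_index_per_tree):
    """Sum the leaf value that an observation reaches in each tree."""
    return sum(tree[i] for tree, i in zip(forest, leaf_index_per_tree))

forest = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # three trees, two leaves each
leaf_ids = [0, 1, 0]                            # leaf reached in each tree

base = predict_constant(forest, leaf_ids)
# Adding c to every leaf shifts the prediction by (number of trees) * c.
shifted = predict_constant(add_to_leaves(forest, 0.5), leaf_ids)
# Multiplying every leaf by c scales the prediction by c.
scaled = predict_constant(multiply_leaves(forest, 2.0), leaf_ids)
```

Note that, under a sum-of-trees model, the shift from adding a constant grows with the number of trees in the forest, whereas the multiplication scales the prediction by the constant itself.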

save_to_json_file

forest.ForestContainer.save_to_json_file(json_filename)

Save the forests in the container to a JSON file.

Parameters

Name Type Description Default
json_filename str Name of JSON file to which forest container state will be saved. May contain absolute or relative paths. required

load_from_json_file

forest.ForestContainer.load_from_json_file(json_filename)

Load a forest container from output stored in a JSON file.

Parameters

Name Type Description Default
json_filename str Name of JSON file from which forest container state will be restored. May contain absolute or relative paths. required

dump_json_string

forest.ForestContainer.dump_json_string()

Dump a forest container into an in-memory JSON string (which can be directly serialized or combined with other JSON strings before serialization).

Returns

Name Type Description
str In-memory string containing state of a forest container.
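
One way to combine several JSON strings before serialization is to parse each one and nest it under a name in a single document. The sketch below uses only Python's json module; the payloads are placeholders standing in for dump_json_string() results, since the container's JSON schema is not described here, and the helper names are hypothetical.

```python
import json

def combine_json_strings(named_strings):
    """Merge several JSON strings into one JSON document keyed by name."""
    combined = {name: json.loads(text) for name, text in named_strings.items()}
    return json.dumps(combined)

def extract_json_string(combined_string, name):
    """Recover one component of the combined document as its own JSON string."""
    return json.dumps(json.loads(combined_string)[name])

# Placeholder payloads; real strings would come from dump_json_string().
mean_forest_json = json.dumps({"placeholder": "mean forest state"})
variance_forest_json = json.dumps({"placeholder": "variance forest state"})

bundle = combine_json_strings(
    {"mean_forest": mean_forest_json, "variance_forest": variance_forest_json}
)
restored = extract_json_string(bundle, "mean_forest")
```

A string recovered this way could then be handed back to load_from_json_string.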

load_from_json_string

forest.ForestContainer.load_from_json_string(json_string)

Reload a forest container from an in-memory JSON string.

Parameters

Name Type Description Default
json_string str In-memory string containing state of a forest container. required

load_from_json_object

forest.ForestContainer.load_from_json_object(json_object)

Reload a forest container from an in-memory JSONSerializer object.

Parameters

Name Type Description Default
json_object JSONSerializer In-memory JSONSerializer object. required

add_sample

forest.ForestContainer.add_sample(leaf_value)

Add a new all-root ensemble to the container, with all of the leaves set to the value / vector provided

Parameters

Name Type Description Default
leaf_value float or np.array Value (or vector of values) to initialize root nodes of every tree in a forest required

add_numeric_split

forest.ForestContainer.add_numeric_split(
    forest_num,
    tree_num,
    leaf_num,
    feature_num,
    split_threshold,
    left_leaf_value,
    right_leaf_value,
)

Add a numeric (i.e. X[,i] <= c) split to a given tree in the ensemble

Parameters

Name Type Description Default
forest_num int Index of the forest which contains the tree to be split required
tree_num int Index of the tree to be split required
leaf_num int Leaf to be split required
feature_num int Feature that defines the new split required
split_threshold float Value that defines the cutoff of the new split required
left_leaf_value float or np.array Value (or array of values) to assign to the newly created left node required
right_leaf_value float or np.array Value (or array of values) to assign to the newly created right node required
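
The effect of a numeric split can be sketched with a toy tree stored as a list of node dictionaries. This is not the library's data structure: the node numbering (new children appended at the end) and the convention that observations at or below the threshold go left are assumptions made for illustration, and all names are hypothetical.

```python
def make_root(leaf_value):
    """A single-node (all-root) tree."""
    return {"nodes": [{"is_leaf": True, "value": leaf_value}]}

def add_numeric_split(tree, leaf_num, feature_num, split_threshold,
                      left_leaf_value, right_leaf_value):
    """Turn leaf `leaf_num` into a split node with two new leaf children."""
    node = tree["nodes"][leaf_num]
    if not node["is_leaf"]:
        raise ValueError("only a leaf node can be split")
    left_id = len(tree["nodes"])
    right_id = left_id + 1
    tree["nodes"].append({"is_leaf": True, "value": left_leaf_value})
    tree["nodes"].append({"is_leaf": True, "value": right_leaf_value})
    node.update(is_leaf=False, feature=feature_num, threshold=split_threshold,
                left=left_id, right=right_id)

def predict_one(tree, x):
    """Route one observation (a list of feature values) to its leaf value."""
    node = tree["nodes"][0]
    while not node["is_leaf"]:
        go_left = x[node["feature"]] <= node["threshold"]
        node = tree["nodes"][node["left"] if go_left else node["right"]]
    return node["value"]

tree = make_root(0.0)
add_numeric_split(tree, leaf_num=0, feature_num=1, split_threshold=0.5,
                  left_leaf_value=-1.0, right_leaf_value=1.0)
left_pred = predict_one(tree, [9.0, 0.5])   # feature 1 equals the threshold
right_pred = predict_one(tree, [0.0, 0.7])  # feature 1 exceeds the threshold
```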

get_tree_leaves

forest.ForestContainer.get_tree_leaves(forest_num, tree_num)

Retrieve a vector of indices of leaf nodes for a given tree in a given forest

Parameters

Name Type Description Default
forest_num int Index of the forest which contains tree tree_num required
tree_num int Index of the tree for which leaf indices will be retrieved required

Returns

Name Type Description
np.array One-dimensional numpy array, containing the indices of leaf nodes in a given tree.

get_tree_split_counts

forest.ForestContainer.get_tree_split_counts(forest_num, tree_num, num_features)

Retrieve a vector of split counts for every training set feature in a given tree in a given forest

Parameters

Name Type Description Default
forest_num int Index of the forest which contains tree tree_num required
tree_num int Index of the tree for which split counts will be retrieved required
num_features int Total number of features in the training set required

Returns

Name Type Description
np.array One-dimensional numpy array with as many elements as there are features in the forest model’s training set, containing the split count for each feature for a given forest and tree.

get_forest_split_counts

forest.ForestContainer.get_forest_split_counts(forest_num, num_features)

Retrieve a vector of split counts for every training set feature in a given forest

Parameters

Name Type Description Default
forest_num int Index of the forest for which split counts will be retrieved required
num_features int Total number of features in the training set required

Returns

Name Type Description
np.array One-dimensional numpy array with as many elements as there are features in the forest model’s training set, containing the split count for each feature for a given forest (summed across every tree in the forest).

get_overall_split_counts

forest.ForestContainer.get_overall_split_counts(num_features)

Retrieve a vector of split counts for every training set feature, aggregated across ensembles and trees.

Parameters

Name Type Description Default
num_features int Total number of features in the training set required

Returns

Name Type Description
np.array One-dimensional numpy array with as many elements as there are features in the forest model’s training set, containing the split count for each feature summed across every tree of every forest in the container.

get_granular_split_counts

forest.ForestContainer.get_granular_split_counts(num_features)

Retrieve an array of split counts for every training set feature, reported separately for each forest and tree in the container.

Parameters

Name Type Description Default
num_features int Total number of features in the training set required

Returns

Name Type Description
np.array Three-dimensional numpy array, containing the number of splits a variable receives in each tree of each forest in a ForestContainer. Array will have dimensions (m,b,p) where m is the number of forests in the container, b is the number of trees in each forest, and p is the number of features in the forest model’s training dataset.
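
The relationship between the granular (m, b, p) counts and the per-forest and overall counts can be sketched with nested lists. The data below are invented, and the helper names merely mirror the methods above for readability; they are not the library's implementation.

```python
def forest_split_counts(granular, forest_num):
    """Sum the per-tree counts of one forest, giving one count per feature."""
    trees = granular[forest_num]
    num_features = len(trees[0])
    return [sum(tree[j] for tree in trees) for j in range(num_features)]

def overall_split_counts(granular):
    """Sum counts over every tree of every forest, one count per feature."""
    num_features = len(granular[0][0])
    return [
        sum(tree[j] for forest in granular for tree in forest)
        for j in range(num_features)
    ]

# Toy granular counts: m = 2 forests, b = 2 trees per forest, p = 3 features.
granular = [
    [[1, 0, 2], [0, 1, 0]],
    [[0, 0, 1], [2, 1, 0]],
]

forest0_counts = forest_split_counts(granular, 0)
forest1_counts = forest_split_counts(granular, 1)
overall_counts = overall_split_counts(granular)
```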

num_forest_leaves

forest.ForestContainer.num_forest_leaves(forest_num)

Return the total number of leaves for a given forest in the ForestContainer

Parameters

Name Type Description Default
forest_num int Index of the forest to be queried required

Returns

Name Type Description
int Number of leaves in a given forest in a ForestContainer

sum_leaves_squared

forest.ForestContainer.sum_leaves_squared(forest_num)

Return the total sum of squared leaf values for a given forest in the ForestContainer

Parameters

Name Type Description Default
forest_num int Index of the forest to be queried required

Returns

Name Type Description
float Sum of squared leaf values in a given forest in a ForestContainer

is_leaf_node

forest.ForestContainer.is_leaf_node(forest_num, tree_num, node_id)

Whether or not a given node of a given tree in a given forest in the ForestContainer is a leaf

Parameters

Name Type Description Default
forest_num int Index of the forest to be queried required
tree_num int Index of the tree to be queried required
node_id int Index of the node to be queried required

Returns

Name Type Description
bool True if node node_id in tree tree_num of forest forest_num is a leaf, False otherwise

is_numeric_split_node

forest.ForestContainer.is_numeric_split_node(forest_num, tree_num, node_id)

Whether or not a given node of a given tree in a given forest in the ForestContainer is a numeric split node

Parameters

Name Type Description Default
forest_num int Index of the forest to be queried required
tree_num int Index of the tree to be queried required
node_id int Index of the node to be queried required

Returns

Name Type Description
bool True if node node_id in tree tree_num of forest forest_num is a numeric split node, False otherwise

is_categorical_split_node

forest.ForestContainer.is_categorical_split_node(forest_num, tree_num, node_id)

Whether or not a given node of a given tree in a given forest in the ForestContainer is a categorical split node

Parameters

Name Type Description Default
forest_num int Index of the forest to be queried required
tree_num int Index of the tree to be queried required
node_id int Index of the node to be queried required

Returns

Name Type Description
bool True if node node_id in tree tree_num of forest forest_num is a categorical split node, False otherwise

parent_node

forest.ForestContainer.parent_node(forest_num, tree_num, node_id)

Parent node of a given node of a given tree in a given forest in the ForestContainer

Parameters

Name Type Description Default
forest_num int Index of the forest to be queried required
tree_num int Index of the tree to be queried required
node_id int Index of the node to be queried required

Returns

Name Type Description
int Index of the parent of node node_id in tree tree_num of forest forest_num. If node_id is a root node, returns -1.

left_child_node

forest.ForestContainer.left_child_node(forest_num, tree_num, node_id)

Left child node of a given node of a given tree in a given forest in the ForestContainer

Parameters

Name Type Description Default
forest_num int Index of the forest to be queried required
tree_num int Index of the tree to be queried required
node_id int Index of the node to be queried required

Returns

Name Type Description
int Index of the left child of node node_id in tree tree_num of forest forest_num. If node_id is a leaf, returns -1.

right_child_node

forest.ForestContainer.right_child_node(forest_num, tree_num, node_id)

Right child node of a given node of a given tree in a given forest in the ForestContainer

Parameters

Name Type Description Default
forest_num int Index of the forest to be queried required
tree_num int Index of the tree to be queried required
node_id int Index of the node to be queried required

Returns

Name Type Description
int Index of the right child of node node_id in tree tree_num of forest forest_num. If node_id is a leaf, returns -1.

node_depth

forest.ForestContainer.node_depth(forest_num, tree_num, node_id)

Depth of a given node of a given tree in a given forest in the ForestContainer.

Parameters

Name Type Description Default
forest_num int Index of the forest to be queried required
tree_num int Index of the tree to be queried required
node_id int Index of the node to be queried required

Returns

Name Type Description
int Depth of node node_id in tree tree_num of forest forest_num. The root node is defined as “depth zero.”
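
The -1 sentinels returned by the parent and child queries make it straightforward to walk a tree. The sketch below stores a toy five-node tree as three parallel lists that play the role of the parent_node, left_child_node, and right_child_node results; the node numbering is hypothetical.

```python
# Toy tree: node 0 is the root, node 2 is a split node, nodes 1, 3, 4 are leaves.
parent = [-1, 0, 0, 2, 2]
left = [1, -1, 3, -1, -1]
right = [2, -1, 4, -1, -1]

def node_depth(parent, node_id):
    """Count the steps from node_id up to the root (the root has depth zero)."""
    depth = 0
    while parent[node_id] != -1:
        node_id = parent[node_id]
        depth += 1
    return depth

def is_leaf(left, right, node_id):
    """A node with no children is a leaf."""
    return left[node_id] == -1 and right[node_id] == -1

depths = [node_depth(parent, i) for i in range(len(parent))]
leaf_flags = [is_leaf(left, right, i) for i in range(len(parent))]
```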

node_split_index

forest.ForestContainer.node_split_index(forest_num, tree_num, node_id)

Split index of a given node of a given tree in a given forest in the ForestContainer. Returns -1 if the node is a leaf.

Parameters

Name Type Description Default
forest_num int Index of the forest to be queried required
tree_num int Index of the tree to be queried required
node_id int Index of the node to be queried required

Returns

Name Type Description
int Split index of node_id in tree tree_num of forest forest_num.

node_split_threshold

forest.ForestContainer.node_split_threshold(forest_num, tree_num, node_id)

Threshold that defines a numeric split for a given node of a given tree in a given forest in the ForestContainer. Returns np.inf if the node is a leaf or a categorical split node.

Parameters

Name Type Description Default
forest_num int Index of the forest to be queried required
tree_num int Index of the tree to be queried required
node_id int Index of the node to be queried required

Returns

Name Type Description
float Threshold that defines a numeric split for node node_id in tree tree_num of forest forest_num.

node_split_categories

forest.ForestContainer.node_split_categories(forest_num, tree_num, node_id)

Array of category indices that define a categorical split for a given node of a given tree in a given forest in the ForestContainer. Returns np.array([np.inf]) if the node is a leaf or a numeric split node.

Parameters

Name Type Description Default
forest_num int Index of the forest to be queried required
tree_num int Index of the tree to be queried required
node_id int Index of the node to be queried required

Returns

Name Type Description
np.array Array of category indices that define a categorical split for node node_id in tree tree_num of forest forest_num.

node_leaf_values

forest.ForestContainer.node_leaf_values(forest_num, tree_num, node_id)

Node parameter value(s) for a given node of a given tree in a given forest in the ForestContainer. Values are stale if the node is a split node.

Parameters

Name Type Description Default
forest_num int Index of the forest to be queried required
tree_num int Index of the tree to be queried required
node_id int Index of the node to be queried required

Returns

Name Type Description
np.array Array of parameter values for node node_id in tree tree_num of forest forest_num.

num_samples

forest.ForestContainer.num_samples()

Number of forest samples in the ForestContainer.

Returns

Name Type Description
int Total number of forest samples.

num_nodes

forest.ForestContainer.num_nodes(forest_num, tree_num)

Number of nodes in a given tree in a given forest in the ForestContainer.

Parameters

Name Type Description Default
forest_num int Index of the forest to be queried required
tree_num int Index of the tree to be queried required

Returns

Name Type Description
int Total number of nodes in tree tree_num of forest forest_num.

num_leaves

forest.ForestContainer.num_leaves(forest_num, tree_num)

Number of leaves in a given tree in a given forest in the ForestContainer.

Parameters

Name Type Description Default
forest_num int Index of the forest to be queried required
tree_num int Index of the tree to be queried required

Returns

Name Type Description
int Total number of leaves in tree tree_num of forest forest_num.

num_leaf_parents

forest.ForestContainer.num_leaf_parents(forest_num, tree_num)

Number of leaf parents (split nodes with two leaves as children) in a given tree in a given forest in the ForestContainer.

Parameters

Name Type Description Default
forest_num int Index of the forest to be queried required
tree_num int Index of the tree to be queried required

Returns

Name Type Description
int Total number of leaf parents in tree tree_num of forest forest_num.

num_split_nodes

forest.ForestContainer.num_split_nodes(forest_num, tree_num)

Number of split nodes in a given tree in a given forest in the ForestContainer.

Parameters

Name Type Description Default
forest_num int Index of the forest to be queried required
tree_num int Index of the tree to be queried required

Returns

Name Type Description
int Total number of split nodes in tree tree_num of forest forest_num.
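
The node, leaf, leaf-parent, and split-node counts are related, which the sketch below illustrates on a toy tree described by child-index lists (with -1 marking a missing child). The representation and the helper name tree_summary are hypothetical.

```python
def tree_summary(left, right):
    """Count nodes, leaves, split nodes, and leaf parents in a binary tree."""
    num_nodes = len(left)
    leaves = {i for i in range(num_nodes) if left[i] == -1 and right[i] == -1}
    split_nodes = [i for i in range(num_nodes) if i not in leaves]
    # A leaf parent is a split node whose two children are both leaves.
    leaf_parents = [i for i in split_nodes
                    if left[i] in leaves and right[i] in leaves]
    return {
        "num_nodes": num_nodes,
        "num_leaves": len(leaves),
        "num_split_nodes": len(split_nodes),
        "num_leaf_parents": len(leaf_parents),
    }

# Toy tree: node 0 splits into leaf 1 and split node 2; node 2 has leaves 3, 4.
left = [1, -1, 3, -1, -1]
right = [2, -1, 4, -1, -1]
summary = tree_summary(left, right)
```

In any binary tree where every split node has exactly two children, the number of leaves is one more than the number of split nodes, which gives a quick consistency check on these counts.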

nodes

forest.ForestContainer.nodes(forest_num, tree_num)

Array of node indices in a given tree in a given forest in the ForestContainer.

Parameters

Name Type Description Default
forest_num int Index of the forest to be queried required
tree_num int Index of the tree to be queried required

Returns

Name Type Description
np.array Array of indices of nodes in tree tree_num of forest forest_num.

leaves

forest.ForestContainer.leaves(forest_num, tree_num)

Array of leaf indices in a given tree in a given forest in the ForestContainer.

Parameters

Name Type Description Default
forest_num int Index of the forest to be queried required
tree_num int Index of the tree to be queried required

Returns

Name Type Description
np.array Array of indices of leaf nodes in tree tree_num of forest forest_num.

delete_sample

forest.ForestContainer.delete_sample(forest_num)

Modify the ForestContainer by removing the forest sample indexed by forest_num.

Parameters

Name Type Description Default
forest_num int Index of the forest to be removed from the ForestContainer required