Forest API#

stochtree.forest.Forest #

In-memory Python wrapper around a C++ tree ensemble object.

Parameters:

Name Type Description Default
num_trees int

Number of trees that the forest should contain

required
output_dimension int

Dimension of the leaf node parameters in each tree

1
leaf_constant bool

Whether the leaf node model is "constant" (i.e. prediction is simply a sum of leaf node parameters for every observation in a dataset) or not (i.e. each leaf node parameter is multiplied by a "basis vector" before being returned as a prediction).

True
is_exponentiated bool

Whether or not the leaf node parameters are stored in log scale (in which case, they must be exponentiated before being returned as predictions).

False
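
To make the leaf_constant and is_exponentiated options concrete, here is a minimal pure-Python sketch of how one observation's prediction might be assembled from the leaf that each tree assigns it to. The function `combine_leaves` and its arguments are hypothetical and not part of stochtree. Treating "multiplied by a basis vector" as a dot product, and exponentiating the summed value rather than each term, are assumptions drawn from the parameter descriptions above.

```python
import math

def combine_leaves(leaf_params, basis=None, is_exponentiated=False):
    """Combine one observation's leaf parameters across trees (illustrative only).

    leaf_params: one entry per tree; each entry is a list of length
        output_dimension holding the parameters of the leaf the observation
        falls into.
    basis: None for a "constant" leaf model, otherwise a list of length
        output_dimension.
    """
    total = 0.0
    for params in leaf_params:
        if basis is None:
            # Constant leaf model: the leaf parameter is the tree's contribution.
            total += params[0]
        else:
            # Non-constant leaf model: weight each leaf parameter by the basis.
            total += sum(p * b for p, b in zip(params, basis))
    # Assumption: log-scale leaves are summed first, then exponentiated.
    return math.exp(total) if is_exponentiated else total

# Three trees with univariate constant leaves.
constant_pred = combine_leaves([[0.5], [-0.2], [0.1]])
# Two trees with two-dimensional leaves and basis (1, 2).
basis_pred = combine_leaves([[1.0, 0.5], [0.0, -0.25]], basis=[1.0, 2.0])
print(constant_pred, basis_pred)
```

With leaf_constant set to True the basis plays no role, which is why the first call passes only the leaf parameters.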

reset_root() #

Reset the forest so that every tree is a single-node (i.e. "root") tree

reset(forest_container, forest_num) #

Reset forest to the forest indexed by forest_num in forest_container

Parameters:

Name Type Description Default
forest_container ForestContainer

Stochtree object storing tree ensembles

required
forest_num int

Index of the ensemble used to reset the Forest

required

predict(dataset) #

Predict from the forest, using the provided Dataset object.

Parameters:

Name Type Description Default
dataset Dataset

Python object wrapping the "dataset" class used by C++ sampling and prediction data structures.

required

Returns:

Type Description
array

One-dimensional numpy array with length equal to the number of observations in dataset.

predict_raw(dataset) #

Predict raw leaf values for the forest, using the provided Dataset object.

Parameters:

Name Type Description Default
dataset Dataset

Python object wrapping the "dataset" class used by C++ sampling and prediction data structures.

required

Returns:

Type Description
array

Numpy array with (n, k) dimensions, where n is the number of observations in dataset and k is the dimension of the leaf parameter. If k = 1, then the returned array is simply one-dimensional with n observations.

set_root_leaves(leaf_value) #

Set constant (root) leaf node values for every tree in the forest. Assumes the forest consists of all root (single-node) trees.

Parameters:

Name Type Description Default
leaf_value float or array

Constant values to which root nodes are to be set. If the trees in the forest are univariate, then leaf_value must be a float, while if the trees in the forest are multivariate, then leaf_value must be a np.array.

required

add_numeric_split(tree_num, leaf_num, feature_num, split_threshold, left_leaf_value, right_leaf_value) #

Add a numeric split (i.e. X[:, i] <= c) to a given tree in the forest

Parameters:

Name Type Description Default
tree_num int

Index of the tree to be split

required
leaf_num int

Leaf to be split

required
feature_num int

Feature that defines the new split

required
split_threshold float

Value that defines the cutoff of the new split

required
left_leaf_value float or array

Value (or array of values) to assign to the newly created left node

required
right_leaf_value float or array

Value (or array of values) to assign to the newly created right node

required
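
The split rule can be illustrated without stochtree: an observation goes left when its value for the split feature is at or below the threshold, and right otherwise. The helper `route` below is hypothetical, and sending the `<=` case to the left child is an assumption based on the split description and the names left_leaf_value and right_leaf_value.

```python
def route(x, feature_num, split_threshold, left_leaf_value, right_leaf_value):
    """Return the leaf value one observation receives from a single numeric split."""
    if x[feature_num] <= split_threshold:
        return left_leaf_value
    return right_leaf_value

# Three observations with two features each; split on feature 0 at 0.5.
observations = [[0.2, 1.0], [0.9, 3.0], [0.5, -1.0]]
values = [route(x, 0, 0.5, -1.0, 1.0) for x in observations]
print(values)
```

Note that the third observation sits exactly on the threshold and therefore takes the left value.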

get_tree_leaves(tree_num) #

Retrieve a vector of indices of leaf nodes for a given tree in the forest

Parameters:

Name Type Description Default
tree_num int

Index of the tree for which leaf indices will be retrieved

required

Returns:

Type Description
array

One-dimensional numpy array, containing the indices of leaf nodes in a given tree.

get_tree_split_counts(tree_num, num_features) #

Retrieve a vector of split counts for every training set variable in a given tree in the forest

Parameters:

Name Type Description Default
tree_num int

Index of the tree for which split counts will be retrieved

required
num_features int

Total number of features in the training set

required

Returns:

Type Description
array

One-dimensional numpy array with as many elements as there are features in the forest model's training set, containing the split count for each feature for a given tree of the forest.

get_overall_split_counts(num_features) #

Retrieve a vector of split counts for every training set variable in the forest

Parameters:

Name Type Description Default
num_features int

Total number of features in the training set

required

Returns:

Type Description
array

One-dimensional numpy array with as many elements as there are features in the forest model's training set, containing the overall split count in the forest for each feature.

get_granular_split_counts(num_features) #

Retrieve a vector of split counts for every training set variable in the forest, reported separately for each tree

Parameters:

Name Type Description Default
num_features int

Total number of features in the training set

required

Returns:

Type Description
array

One-dimensional numpy array with as many elements as there are features in the forest model's training set, containing the split count for each feature for every tree in the forest.

num_forest_leaves() #

Return the total number of leaves in a forest

Returns:

Type Description
int

Number of leaves in a forest

sum_leaves_squared() #

Return the total sum of squared leaf values in a forest

Returns:

Type Description
float

Sum of squared leaf values in a forest

is_leaf_node(tree_num, node_id) #

Whether or not a given node of a given tree of a forest is a leaf

Parameters:

Name Type Description Default
tree_num int

Index of the tree to be queried

required
node_id int

Index of the node to be queried

required

Returns:

Type Description
bool

True if node node_id in tree tree_num is a leaf, False otherwise

is_numeric_split_node(tree_num, node_id) #

Whether or not a given node of a given tree of a forest is a numeric split node

Parameters:

Name Type Description Default
tree_num int

Index of the tree to be queried

required
node_id int

Index of the node to be queried

required

Returns:

Type Description
bool

True if node node_id in tree tree_num is a numeric split node, False otherwise

is_categorical_split_node(tree_num, node_id) #

Whether or not a given node of a given tree of a forest is a categorical split node

Parameters:

Name Type Description Default
tree_num int

Index of the tree to be queried

required
node_id int

Index of the node to be queried

required

Returns:

Type Description
bool

True if node node_id in tree tree_num is a categorical split node, False otherwise

parent_node(tree_num, node_id) #

Parent node of a given node of a given tree of a forest

Parameters:

Name Type Description Default
tree_num int

Index of the tree to be queried

required
node_id int

Index of the node to be queried

required

Returns:

Type Description
int

Index of the parent of node node_id in tree tree_num. If node_id is a root node, returns -1.

left_child_node(tree_num, node_id) #

Left child node of a given node of a given tree of a forest

Parameters:

Name Type Description Default
tree_num int

Index of the tree to be queried

required
node_id int

Index of the node to be queried

required

Returns:

Type Description
int

Index of the left child of node node_id in tree tree_num. If node_id is a leaf, returns -1.

right_child_node(tree_num, node_id) #

Right child node of a given node of a given tree of a forest

Parameters:

Name Type Description Default
tree_num int

Index of the tree to be queried

required
node_id int

Index of the node to be queried

required

Returns:

Type Description
int

Index of the right child of node node_id in tree tree_num. If node_id is a leaf, returns -1.
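
Because both child queries return -1 at a leaf, a tree can be walked using nothing but left_child_node and right_child_node. The sketch below gathers leaf indices that way. `collect_leaves` and the `ToyChildForest` stand-in are hypothetical (the stand-in only mimics the two documented method signatures), and treating node 0 as the root is an assumption.

```python
def collect_leaves(forest, tree_num, root=0):
    """Depth-first walk that gathers leaf indices using only the child queries."""
    leaves, stack = [], [root]
    while stack:
        node = stack.pop()
        left = forest.left_child_node(tree_num, node)
        right = forest.right_child_node(tree_num, node)
        if left == -1 and right == -1:
            leaves.append(node)
        else:
            # Push right first so the left subtree is visited first.
            stack.append(right)
            stack.append(left)
    return leaves

class ToyChildForest:
    """Hypothetical stand-in exposing only the child queries; not the stochtree class."""
    def __init__(self, children):
        self._children = children  # children[tree_num][node_id] -> (left, right)

    def left_child_node(self, tree_num, node_id):
        return self._children[tree_num][node_id][0]

    def right_child_node(self, tree_num, node_id):
        return self._children[tree_num][node_id][1]

# One tree: node 0 splits into 1 and 2, node 1 splits into 3 and 4.
toy = ToyChildForest([{0: (1, 2), 1: (3, 4), 2: (-1, -1), 3: (-1, -1), 4: (-1, -1)}])
leaf_ids = collect_leaves(toy, 0)
print(leaf_ids)
```

Any object exposing the same two methods, including a real Forest, could be passed in place of the stand-in.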

node_depth(tree_num, node_id) #

Depth of a given node of a given tree of a forest. Returns -1 if the node is a leaf.

Parameters:

Name Type Description Default
tree_num int

Index of the tree to be queried

required
node_id int

Index of the node to be queried

required

Returns:

Type Description
int

Depth of node node_id in tree tree_num. The root node is defined as "depth zero."
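
The "depth zero" convention for the root can be cross-checked against parent_node, which returns -1 once the root is reached. This is a minimal sketch: `depth_from_parents` and the `ToyParentForest` stand-in are hypothetical and only mimic the documented parent_node(tree_num, node_id) signature.

```python
def depth_from_parents(forest, tree_num, node_id):
    """Count the edges between node_id and the root by following parent_node."""
    depth = 0
    parent = forest.parent_node(tree_num, node_id)
    while parent != -1:
        depth += 1
        parent = forest.parent_node(tree_num, parent)
    return depth

class ToyParentForest:
    """Hypothetical stand-in exposing only parent_node; not the stochtree class."""
    def __init__(self, parents):
        self._parents = parents  # parents[tree_num][node_id] -> parent index or -1

    def parent_node(self, tree_num, node_id):
        return self._parents[tree_num][node_id]

# One tree: node 0 is the root, 1 and 2 are its children, 3 and 4 are children of 1.
toy = ToyParentForest([{0: -1, 1: 0, 2: 0, 3: 1, 4: 1}])
depths = [depth_from_parents(toy, 0, n) for n in range(5)]
print(depths)
```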

node_split_index(tree_num, node_id) #

Split index of a given node of a given tree of a forest. Returns -1 if the node is a leaf.

Parameters:

Name Type Description Default
tree_num int

Index of the tree to be queried

required
node_id int

Index of the node to be queried

required

Returns:

Type Description
int

Split index of node_id in tree tree_num.

node_split_threshold(tree_num, node_id) #

Threshold that defines a numeric split for a given node of a given tree of a forest. Returns np.inf if the node is a leaf or a categorical split node.

Parameters:

Name Type Description Default
tree_num int

Index of the tree to be queried

required
node_id int

Index of the node to be queried

required

Returns:

Type Description
float

Threshold that defines a numeric split for node node_id in tree tree_num.

node_split_categories(tree_num, node_id) #

Array of category indices that define a categorical split for a given node of a given tree of a forest. Returns np.array([np.inf]) if the node is a leaf or a numeric split node.

Parameters:

Name Type Description Default
tree_num int

Index of the tree to be queried

required
node_id int

Index of the node to be queried

required

Returns:

Type Description
array

Array of category indices that define a categorical split for node node_id in tree tree_num.

node_leaf_values(tree_num, node_id) #

Leaf node value(s) for a given node of a given tree of a forest. Values are stale if the node is a split node.

Parameters:

Name Type Description Default
tree_num int

Index of the tree to be queried

required
node_id int

Index of the node to be queried

required

Returns:

Type Description
array

Array of parameter values for node node_id in tree tree_num.
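
Several of the node queries above return sentinel values (-1, infinity, or stale leaf values) when called on the wrong kind of node, so it can help to check the node type first. The sketch below does that with the documented method names. `describe_node` and the `ToyNodeForest` stand-in are hypothetical, and reading node_split_index as the index of the split feature is an assumption.

```python
def describe_node(forest, tree_num, node_id):
    """Summarize a node, querying only the fields that are valid for its type."""
    if forest.is_leaf_node(tree_num, node_id):
        values = list(forest.node_leaf_values(tree_num, node_id))
        return f"leaf {node_id}: values={values}"
    feature = forest.node_split_index(tree_num, node_id)
    if forest.is_numeric_split_node(tree_num, node_id):
        threshold = forest.node_split_threshold(tree_num, node_id)
        return f"split {node_id}: feature {feature} <= {threshold}"
    categories = list(forest.node_split_categories(tree_num, node_id))
    return f"split {node_id}: feature {feature} in {categories}"

class ToyNodeForest:
    """Hypothetical stand-in for a one-tree forest; not the stochtree class."""
    def __init__(self):
        # Node 0 splits on feature 2 at 0.5; nodes 1 and 2 are its leaves.
        self._splits = {0: (2, 0.5)}
        self._leaves = {1: [-1.0], 2: [1.0]}

    def is_leaf_node(self, tree_num, node_id):
        return node_id in self._leaves

    def is_numeric_split_node(self, tree_num, node_id):
        return node_id in self._splits

    def node_split_index(self, tree_num, node_id):
        return self._splits[node_id][0]

    def node_split_threshold(self, tree_num, node_id):
        return self._splits[node_id][1]

    def node_split_categories(self, tree_num, node_id):
        return []  # The toy tree has no categorical splits.

    def node_leaf_values(self, tree_num, node_id):
        return self._leaves[node_id]

toy = ToyNodeForest()
summaries = [describe_node(toy, 0, n) for n in range(3)]
print("\n".join(summaries))
```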

num_nodes(tree_num) #

Number of nodes in a given tree of a forest

Parameters:

Name Type Description Default
tree_num int

Index of the tree to be queried

required

Returns:

Type Description
int

Total number of nodes in tree tree_num.

num_leaves(tree_num) #

Number of leaves in a given tree of a forest

Parameters:

Name Type Description Default
tree_num int

Index of the tree to be queried

required

Returns:

Type Description
int

Total number of leaves in tree tree_num.

num_leaf_parents(tree_num) #

Number of leaf parents (split nodes with two leaves as children) in a given tree of a forest

Parameters:

Name Type Description Default
tree_num int

Index of the tree to be queried

required

Returns:

Type Description
int

Total number of leaf parents in tree tree_num.

num_split_nodes(tree_num) #

Number of split nodes in a given tree of a forest

Parameters:

Name Type Description Default
tree_num int

Index of the tree to be queried

required

Returns:

Type Description
int

Total number of split nodes in tree tree_num.
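
In a binary tree where every split node has exactly two children, the node counts are tied together: the total is the number of leaves plus the number of split nodes, and there is one more leaf than there are split nodes. The sketch below turns that into a sanity check. `counts_are_consistent` and the `ToyCountForest` stand-in are hypothetical, and the check assumes that num_nodes counts only nodes currently in the tree.

```python
def counts_are_consistent(forest, tree_num):
    """Check the binary-tree identities relating the three node counts."""
    n_nodes = forest.num_nodes(tree_num)
    n_leaves = forest.num_leaves(tree_num)
    n_splits = forest.num_split_nodes(tree_num)
    return n_nodes == n_leaves + n_splits and n_leaves == n_splits + 1

class ToyCountForest:
    """Hypothetical stand-in exposing only the count queries; not the stochtree class."""
    def __init__(self, counts):
        self._counts = counts  # counts[tree_num] -> (nodes, leaves, split nodes)

    def num_nodes(self, tree_num):
        return self._counts[tree_num][0]

    def num_leaves(self, tree_num):
        return self._counts[tree_num][1]

    def num_split_nodes(self, tree_num):
        return self._counts[tree_num][2]

# Tree 0 is a bare root; tree 1 has two splits and three leaves.
toy = ToyCountForest([(1, 1, 0), (5, 3, 2)])
results = [counts_are_consistent(toy, t) for t in range(2)]
print(results)
```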

nodes(tree_num) #

Array of node indices in a given tree of a forest

Parameters:

Name Type Description Default
tree_num int

Index of the tree to be queried

required

Returns:

Type Description
array

Array of indices of nodes in tree tree_num.

leaves(tree_num) #

Array of leaf indices in a given tree of a forest

Parameters:

Name Type Description Default
tree_num int

Index of the tree to be queried

required

Returns:

Type Description
array

Array of indices of leaf nodes in tree tree_num.

stochtree.forest.ForestContainer #

Container that stores sampled (and retained) tree ensembles from BART, BCF or a custom sampler.

Parameters:

Name Type Description Default
num_trees int

Number of trees that each forest should contain

required
output_dimension int

Dimension of the leaf node parameters in each tree

1
leaf_constant bool

Whether the leaf node model is "constant" (i.e. prediction is simply a sum of leaf node parameters for every observation in a dataset) or not (i.e. each leaf node parameter is multiplied by a "basis vector" before being returned as a prediction).

True
is_exponentiated bool

Whether or not the leaf node parameters are stored in log scale (in which case, they must be exponentiated before being returned as predictions).

False

predict(dataset) #

Predict from each forest in the container, using the provided Dataset object.

Parameters:

Name Type Description Default
dataset Dataset

Python object wrapping the "dataset" class used by C++ sampling and prediction data structures.

required

Returns:

Type Description
array

Numpy array with (n, m) dimensions, where n is the number of observations in dataset and m is the number of samples in the forest container.
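
A common next step with an (n, m) prediction array is to average across the m retained forests to get one point estimate per observation. The sketch below shows the arithmetic on a nested list so that it runs without any dependencies. `posterior_mean` is a hypothetical helper; on an actual numpy array the same result would come from calling `.mean(axis=1)`.

```python
def posterior_mean(predictions):
    """Average an (n, m) nested list across its m columns, one per retained forest."""
    return [sum(row) / len(row) for row in predictions]

# n = 3 observations, m = 2 retained forests.
preds = [[1.0, 3.0], [2.0, 2.0], [0.0, 4.0]]
means = posterior_mean(preds)
print(means)
```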

predict_raw(dataset) #

Predict raw leaf values for every forest in the container, using the provided Dataset object.

Parameters:

Name Type Description Default
dataset Dataset

Python object wrapping the "dataset" class used by C++ sampling and prediction data structures.

required

Returns:

Type Description
array

Numpy array with (n, k, m) dimensions, where n is the number of observations in dataset, k is the dimension of the leaf parameter, and m is the number of samples in the forest container. If k = 1, then the returned array has (n, m) dimensions.
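
The shape rule for the raw predictions can be written down as a small function, which is handy when code downstream needs to handle both the univariate and multivariate cases. `raw_prediction_shape` is a hypothetical helper that simply restates the description above.

```python
def raw_prediction_shape(n, k, m):
    """Expected shape of the raw prediction array for n rows, k leaf dims, m samples."""
    # The leaf dimension is dropped when the leaf parameter is univariate.
    return (n, m) if k == 1 else (n, k, m)

univariate_shape = raw_prediction_shape(100, 1, 50)
multivariate_shape = raw_prediction_shape(100, 2, 50)
print(univariate_shape, multivariate_shape)
```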

predict_raw_single_forest(dataset, forest_num) #

Predict raw leaf values for a specific forest (indexed by forest_num), using the provided Dataset object

Parameters:

Name Type Description Default
dataset Dataset

Python object wrapping the "dataset" class used by C++ sampling and prediction data structures.

required
forest_num int

Index of the forest from which to predict. Forest indices are 0-based.

required

Returns:

Type Description
array

Numpy array with (n, k) dimensions, where n is the number of observations in dataset and k is the dimension of the leaf parameter.

predict_raw_single_tree(dataset, forest_num, tree_num) #

Predict raw leaf values for a specific tree of a specific forest (indexed by tree_num and forest_num respectively), using the provided Dataset object.

Parameters:

Name Type Description Default
dataset Dataset

Python object wrapping the "dataset" class used by C++ sampling and prediction data structures.

required
forest_num int

Index of the forest from which to predict. Forest indices are 0-based.

required
tree_num int

Index of the tree from which to predict (within the forest indexed by forest_num). Tree indices are 0-based.

required

Returns:

Type Description
array

Numpy array with (n, k) dimensions, where n is the number of observations in dataset and k is the dimension of the leaf parameter.

set_root_leaves(forest_num, leaf_value) #

Set constant (root) leaf node values for every tree in the forest indexed by forest_num. Assumes the forest consists of all root (single-node) trees.

Parameters:

Name Type Description Default
forest_num int

Index of the forest for which we will set root node parameters.

required
leaf_value float or array

Constant values to which root nodes are to be set. If the trees in forest forest_num are univariate, then leaf_value must be a float, while if the trees in forest forest_num are multivariate, then leaf_value must be a np.array.

required

save_to_json_file(json_filename) #

Save the forests in the container to a JSON file.

Parameters:

Name Type Description Default
json_filename str

Name of JSON file to which forest container state will be saved. May contain absolute or relative paths.

required

load_from_json_file(json_filename) #

Load a forest container from output stored in a JSON file.

Parameters:

Name Type Description Default
json_filename str

Name of JSON file from which forest container state will be restored. May contain absolute or relative paths.

required

dump_json_string() #

Dump a forest container into an in-memory JSON string (which can be directly serialized or combined with other JSON strings before serialization).

Returns:

Type Description
str

In-memory string containing state of a forest container.

load_from_json_string(json_string) #

Reload a forest container from an in-memory JSON string.

Parameters:

Name Type Description Default
json_string str

In-memory string containing state of a forest container.

required
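
Since the JSON string lives in memory, it can be embedded in a larger JSON document and later extracted again before being handed back to load_from_json_string. The sketch below shows that round trip with the standard json module. `bundle` and `unbundle` are hypothetical helpers, and the forest string is a placeholder: the real schema produced by dump_json_string is defined by stochtree and is not reproduced here.

```python
import json

def bundle(forest_json, metadata):
    """Embed a forest container's JSON string alongside other metadata."""
    return json.dumps({"forests": json.loads(forest_json), "metadata": metadata})

def unbundle(bundled_json):
    """Recover the forest JSON string and the metadata from a bundle."""
    payload = json.loads(bundled_json)
    return json.dumps(payload["forests"]), payload["metadata"]

# Placeholder standing in for the output of dump_json_string().
forest_json = '{"placeholder": true}'
bundled = bundle(forest_json, {"model": "bart", "num_features": 5})
restored_json, meta = unbundle(bundled)
print(json.loads(restored_json) == json.loads(forest_json), meta)
```

The recovered string is equivalent as JSON but not necessarily byte-identical to the original, since whitespace and key order may differ after re-serialization.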

add_sample(leaf_value) #

Add a new all-root ensemble to the container, with all of the leaves set to the value / vector provided

Parameters:

Name Type Description Default
leaf_value float or array

Value (or vector of values) to initialize root nodes of every tree in a forest

required

add_numeric_split(forest_num, tree_num, leaf_num, feature_num, split_threshold, left_leaf_value, right_leaf_value) #

Add a numeric split (i.e. X[:, i] <= c) to a given tree in the ensemble

Parameters:

Name Type Description Default
forest_num int

Index of the forest which contains the tree to be split

required
tree_num int

Index of the tree to be split

required
leaf_num int

Leaf to be split

required
feature_num int

Feature that defines the new split

required
split_threshold float

Value that defines the cutoff of the new split

required
left_leaf_value float or array

Value (or array of values) to assign to the newly created left node

required
right_leaf_value float or array

Value (or array of values) to assign to the newly created right node

required

get_tree_leaves(forest_num, tree_num) #

Retrieve a vector of indices of leaf nodes for a given tree in a given forest

Parameters:

Name Type Description Default
forest_num int

Index of the forest which contains tree tree_num

required
tree_num int

Index of the tree for which leaf indices will be retrieved

required

Returns:

Type Description
array

One-dimensional numpy array, containing the indices of leaf nodes in a given tree.

get_tree_split_counts(forest_num, tree_num, num_features) #

Retrieve a vector of split counts for every training set feature in a given tree in a given forest

Parameters:

Name Type Description Default
forest_num int

Index of the forest which contains tree tree_num

required
tree_num int

Index of the tree for which split counts will be retrieved

required
num_features int

Total number of features in the training set

required

Returns:

Type Description
array

One-dimensional numpy array with as many elements as there are features in the forest model's training set, containing the split count for each feature for a given forest and tree.

get_forest_split_counts(forest_num, num_features) #

Retrieve a vector of split counts for every training set feature in a given forest

Parameters:

Name Type Description Default
forest_num int

Index of the forest for which split counts will be retrieved

required
num_features int

Total number of features in the training set

required

Returns:

Type Description
array

One-dimensional numpy array with as many elements as there are features in the forest model's training set, containing the split count for each feature for a given forest (summed across every tree in the forest).

get_overall_split_counts(num_features) #

Retrieve a vector of split counts for every training set feature, aggregated across ensembles and trees.

Parameters:

Name Type Description Default
num_features int

Total number of features in the training set

required

Returns:

Type Description
array

One-dimensional numpy array with as many elements as there are features in the forest model's training set, containing the split count for each feature summed across every tree of every forest in the container.

get_granular_split_counts(num_features) #

Retrieve split counts for every training set variable, reported separately for each forest and tree in the container

Parameters:

Name Type Description Default
num_features int

Total number of features in the training set

required

Returns:

Type Description
array

Three-dimensional numpy array, containing the number of splits a variable receives in each tree of each forest in a ForestContainer. Array will have dimensions (m,b,p) where m is the number of forests in the container, b is the number of trees in each forest, and p is the number of features in the forest model's training dataset.
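
The (m, b, p) layout makes it straightforward to recover coarser summaries by summing over trees, or over both forests and trees. The sketch below does this on a nested list so that it runs without dependencies. `forest_split_counts` and `overall_split_counts` are hypothetical helpers, and the assumption that these sums match what get_forest_split_counts and get_overall_split_counts return is inferred from their descriptions rather than verified.

```python
def forest_split_counts(granular, forest_num):
    """Sum an (m, b, p) nested list over the trees of one forest, giving p counts."""
    trees = granular[forest_num]
    return [sum(tree[j] for tree in trees) for j in range(len(trees[0]))]

def overall_split_counts(granular):
    """Sum an (m, b, p) nested list over all forests and trees, giving p counts."""
    p = len(granular[0][0])
    return [sum(tree[j] for forest in granular for tree in forest) for j in range(p)]

# m = 2 forests, b = 2 trees per forest, p = 3 features.
granular = [
    [[1, 0, 2], [0, 1, 0]],
    [[2, 0, 0], [1, 1, 1]],
]
first_forest = forest_split_counts(granular, 0)
overall = overall_split_counts(granular)
print(first_forest, overall)
```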

num_forest_leaves(forest_num) #

Return the total number of leaves for a given forest in the ForestContainer

Parameters:

Name Type Description Default
forest_num int

Index of the forest to be queried

required

Returns:

Type Description
int

Number of leaves in a given forest in a ForestContainer

sum_leaves_squared(forest_num) #

Return the total sum of squared leaf values for a given forest in the ForestContainer

Parameters:

Name Type Description Default
forest_num int

Index of the forest to be queried

required

Returns:

Type Description
float

Sum of squared leaf values in a given forest in a ForestContainer

is_leaf_node(forest_num, tree_num, node_id) #

Whether or not a given node of a given tree in a given forest in the ForestContainer is a leaf

Parameters:

Name Type Description Default
forest_num int

Index of the forest to be queried

required
tree_num int

Index of the tree to be queried

required
node_id int

Index of the node to be queried

required

Returns:

Type Description
bool

True if node node_id in tree tree_num of forest forest_num is a leaf, False otherwise

is_numeric_split_node(forest_num, tree_num, node_id) #

Whether or not a given node of a given tree in a given forest in the ForestContainer is a numeric split node

Parameters:

Name Type Description Default
forest_num int

Index of the forest to be queried

required
tree_num int

Index of the tree to be queried

required
node_id int

Index of the node to be queried

required

Returns:

Type Description
bool

True if node node_id in tree tree_num of forest forest_num is a numeric split node, False otherwise

is_categorical_split_node(forest_num, tree_num, node_id) #

Whether or not a given node of a given tree in a given forest in the ForestContainer is a categorical split node

Parameters:

Name Type Description Default
forest_num int

Index of the forest to be queried

required
tree_num int

Index of the tree to be queried

required
node_id int

Index of the node to be queried

required

Returns:

Type Description
bool

True if node node_id in tree tree_num of forest forest_num is a categorical split node, False otherwise

parent_node(forest_num, tree_num, node_id) #

Parent node of a given node of a given tree in a given forest in the ForestContainer

Parameters:

Name Type Description Default
forest_num int

Index of the forest to be queried

required
tree_num int

Index of the tree to be queried

required
node_id int

Index of the node to be queried

required

Returns:

Type Description
int

Index of the parent of node node_id in tree tree_num of forest forest_num. If node_id is a root node, returns -1.

left_child_node(forest_num, tree_num, node_id) #

Left child node of a given node of a given tree in a given forest in the ForestContainer

Parameters:

Name Type Description Default
forest_num int

Index of the forest to be queried

required
tree_num int

Index of the tree to be queried

required
node_id int

Index of the node to be queried

required

Returns:

Type Description
int

Index of the left child of node node_id in tree tree_num of forest forest_num. If node_id is a leaf, returns -1.

right_child_node(forest_num, tree_num, node_id) #

Right child node of a given node of a given tree in a given forest in the ForestContainer

Parameters:

Name Type Description Default
forest_num int

Index of the forest to be queried

required
tree_num int

Index of the tree to be queried

required
node_id int

Index of the node to be queried

required

Returns:

Type Description
int

Index of the right child of node node_id in tree tree_num of forest forest_num. If node_id is a leaf, returns -1.

node_depth(forest_num, tree_num, node_id) #

Depth of a given node of a given tree in a given forest in the ForestContainer.

Parameters:

Name Type Description Default
forest_num int

Index of the forest to be queried

required
tree_num int

Index of the tree to be queried

required
node_id int

Index of the node to be queried

required

Returns:

Type Description
int

Depth of node node_id in tree tree_num of forest forest_num. The root node is defined as "depth zero."

node_split_index(forest_num, tree_num, node_id) #

Split index of a given node of a given tree in a given forest in the ForestContainer. Returns -1 if the node is a leaf.

Parameters:

Name Type Description Default
forest_num int

Index of the forest to be queried

required
tree_num int

Index of the tree to be queried

required
node_id int

Index of the node to be queried

required

Returns:

Type Description
int

Split index of node_id in tree tree_num of forest forest_num.

node_split_threshold(forest_num, tree_num, node_id) #

Threshold that defines a numeric split for a given node of a given tree in a given forest in the ForestContainer. Returns np.inf if the node is a leaf or a categorical split node.

Parameters:

Name Type Description Default
forest_num int

Index of the forest to be queried

required
tree_num int

Index of the tree to be queried

required
node_id int

Index of the node to be queried

required

Returns:

Type Description
float

Threshold that defines a numeric split for node node_id in tree tree_num of forest forest_num.

node_split_categories(forest_num, tree_num, node_id) #

Array of category indices that define a categorical split for a given node of a given tree in a given forest in the ForestContainer. Returns np.array([np.inf]) if the node is a leaf or a numeric split node.

Parameters:

Name Type Description Default
forest_num int

Index of the forest to be queried

required
tree_num int

Index of the tree to be queried

required
node_id int

Index of the node to be queried

required

Returns:

Type Description
array

Array of category indices that define a categorical split for node node_id in tree tree_num of forest forest_num.

node_leaf_values(forest_num, tree_num, node_id) #

Node parameter value(s) for a given node of a given tree in a given forest in the ForestContainer. Values are stale if the node is a split node.

Parameters:

Name Type Description Default
forest_num int

Index of the forest to be queried

required
tree_num int

Index of the tree to be queried

required
node_id int

Index of the node to be queried

required

Returns:

Type Description
array

Array of parameter values for node node_id in tree tree_num of forest forest_num.

num_nodes(forest_num, tree_num) #

Number of nodes in a given tree in a given forest in the ForestContainer.

Parameters:

Name Type Description Default
forest_num int

Index of the forest to be queried

required
tree_num int

Index of the tree to be queried

required

Returns:

Type Description
int

Total number of nodes in tree tree_num of forest forest_num.

num_leaves(forest_num, tree_num) #

Number of leaves in a given tree in a given forest in the ForestContainer.

Parameters:

Name Type Description Default
forest_num int

Index of the forest to be queried

required
tree_num int

Index of the tree to be queried

required

Returns:

Type Description
int

Total number of leaves in tree tree_num of forest forest_num.

num_leaf_parents(forest_num, tree_num) #

Number of leaf parents (split nodes with two leaves as children) in a given tree in a given forest in the ForestContainer.

Parameters:

Name Type Description Default
forest_num int

Index of the forest to be queried

required
tree_num int

Index of the tree to be queried

required

Returns:

Type Description
int

Total number of leaf parents in tree tree_num of forest forest_num.

num_split_nodes(forest_num, tree_num) #

Number of split nodes in a given tree in a given forest in the ForestContainer.

Parameters:

Name Type Description Default
forest_num int

Index of the forest to be queried

required
tree_num int

Index of the tree to be queried

required

Returns:

Type Description
int

Total number of split nodes in tree tree_num of forest forest_num.

nodes(forest_num, tree_num) #

Array of node indices in a given tree in a given forest in the ForestContainer.

Parameters:

Name Type Description Default
forest_num int

Index of the forest to be queried

required
tree_num int

Index of the tree to be queried

required

Returns:

Type Description
array

Array of indices of nodes in tree tree_num of forest forest_num.

leaves(forest_num, tree_num) #

Array of leaf indices in a given tree in a given forest in the ForestContainer.

Parameters:

Name Type Description Default
forest_num int

Index of the forest to be queried

required
tree_num int

Index of the tree to be queried

required

Returns:

Type Description
array

Array of indices of leaf nodes in tree tree_num of forest forest_num.

delete_sample(forest_num) #

Modify the ForestContainer by removing the forest sample indexed by forest_num.

Parameters:

Name Type Description Default
forest_num int

Index of the forest to be removed from the ForestContainer

required