forest.Forest

forest.Forest(
    num_trees,
    output_dimension=1,
    leaf_constant=True,
    is_exponentiated=False,
)

In-memory Python wrapper around a C++ tree ensemble object.

Parameters

Name Type Description Default
num_trees int Number of trees that each forest should contain required
output_dimension int Dimension of the leaf node parameters in each tree 1
leaf_constant bool Whether the leaf node model is “constant” (i.e. prediction is simply a sum of leaf node parameters for every observation in a dataset) or not (i.e. each leaf node parameter is multiplied by a “basis vector” before being returned as a prediction). True
is_exponentiated bool Whether or not the leaf node parameters are stored in log scale (in which case, they must be exponentiated before being returned as predictions). False
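
How leaf_constant and is_exponentiated change a prediction can be illustrated with a toy calculation. The sketch below is not the library's implementation: toy_predict and its arguments are hypothetical, and it assumes that log-scale parameters are exponentiated after being summed across trees.

```python
import math

def toy_predict(leaf_params, basis=None, is_exponentiated=False):
    """Combine the leaf parameters one observation falls into, one entry per tree.

    leaf_params: list of per-tree parameter lists (each of length output_dimension).
    basis: None for a constant-leaf model, else a list of length output_dimension.
    """
    total = 0.0
    for params in leaf_params:
        if basis is None:
            # Constant leaf: the tree contributes its (scalar) leaf parameter.
            total += params[0]
        else:
            # Non-constant leaf: parameters are multiplied by the basis vector.
            total += sum(p * b for p, b in zip(params, basis))
    # Assumption: log-scale parameters are exponentiated after summing.
    return math.exp(total) if is_exponentiated else total

constant_pred = toy_predict([[0.5], [-0.25], [1.0]])
basis_pred = toy_predict([[1.0, 2.0], [0.5, -1.0]], basis=[1.0, 3.0])
log_scale_pred = toy_predict([[0.0], [0.0]], is_exponentiated=True)
```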

Methods

Name Description
reset_root Reset forest to a forest with all single node (i.e. “root”) trees
reset Reset forest to the forest indexed by forest_num in forest_container
predict Predict from the forest, using the provided Dataset object.
predict_raw Predict raw leaf values for the forest, using the provided Dataset object
set_root_leaves Set constant (root) leaf node values for every tree in the forest.
merge_forest Create a larger forest by merging the trees of this forest with those of another forest
add_constant Add a constant value to every leaf of every tree in an ensemble. If leaves are multi-dimensional, constant_value will be added to every dimension of the leaves.
multiply_constant Multiply every leaf of every tree by a constant value. If leaves are multi-dimensional, constant_multiple will be multiplied through every dimension of the leaves.
add_numeric_split Add a numeric (i.e. X[:, i] <= c) split to a given tree in the forest
get_tree_leaves Retrieve a vector of indices of leaf nodes for a given tree in the forest
get_tree_split_counts Retrieve a vector of split counts for every training set variable in a given tree in the forest
get_overall_split_counts Retrieve a vector of split counts for every training set variable in the forest
get_granular_split_counts Retrieve a vector of split counts for every training set variable in the forest, reported separately for each tree
num_forest_leaves Return the total number of leaves in a forest
sum_leaves_squared Return the total sum of squared leaf values in a forest
is_leaf_node Whether or not a given node of a given tree of a forest is a leaf
is_numeric_split_node Whether or not a given node of a given tree of a forest is a numeric split node
is_categorical_split_node Whether or not a given node of a given tree of a forest is a categorical split node
parent_node Parent node of given node of a given tree of a forest
left_child_node Left child node of given node of a given tree of a forest
right_child_node Right child node of given node of a given tree of a forest
node_depth Depth of given node of a given tree of a forest
node_split_index Split index of given node of a given tree of a forest.
node_split_threshold Threshold that defines a numeric split for a given node of a given tree of a forest.
node_split_categories Array of category indices that define a categorical split for a given node of a given tree of a forest.
node_leaf_values Leaf node value(s) for a given node of a given tree of a forest.
num_nodes Number of nodes in a given tree of a forest
num_leaves Number of leaves in a given tree of a forest
num_leaf_parents Number of leaf parents in a given tree of a forest
num_split_nodes Number of split nodes in a given tree of a forest
nodes Array of node indices in a given tree of a forest
leaves Array of leaf indices in a given tree of a forest
is_empty When a Forest object is created, it is “empty” in the sense that none of its component trees have leaves with values.

reset_root

forest.Forest.reset_root()

Reset forest to a forest with all single node (i.e. “root”) trees

reset

forest.Forest.reset(forest_container, forest_num)

Reset forest to the forest indexed by forest_num in forest_container

Parameters

Name Type Description Default
forest_container ForestContainer Stochtree object storing tree ensembles required
forest_num int Index of the ensemble used to reset the Forest required

predict

forest.Forest.predict(dataset)

Predict from the forest, using the provided Dataset object.

Parameters

Name Type Description Default
dataset Dataset Python object wrapping the “dataset” class used by C++ sampling and prediction data structures. required

Returns

Name Type Description
np.array One-dimensional numpy array with length equal to the number of observations in dataset.

predict_raw

forest.Forest.predict_raw(dataset)

Predict raw leaf values for the forest, using the provided Dataset object.

Parameters

Name Type Description Default
dataset Dataset Python object wrapping the “dataset” class used by C++ sampling and prediction data structures. required

Returns

Name Type Description
np.array Numpy array with (n, k) dimensions, where n is the number of observations in dataset and k is the dimension of the leaf parameter. If k = 1, then the returned array is simply one-dimensional with n observations.

set_root_leaves

forest.Forest.set_root_leaves(leaf_value)

Set constant (root) leaf node values for every tree in the forest. Assumes the forest consists of all root (single-node) trees.

Parameters

Name Type Description Default
leaf_value float or np.array Constant values to which root nodes are to be set. If the trees in the forest are univariate, then leaf_value must be a float, while if the trees in the forest are multivariate, then leaf_value must be a np.array. required
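
One way to choose leaf_value, sketched here as an assumption rather than a library requirement, is to split a target starting prediction evenly across the trees so that the sum over root nodes reproduces it. The variable names and data below are illustrative only.

```python
num_trees = 4
y = [1.0, 2.0, 3.0, 6.0]

# Target starting prediction: the sample mean of the outcome.
y_bar = sum(y) / len(y)

# Each root leaf gets an equal share, so the trees sum back to y_bar.
leaf_value = y_bar / num_trees

# With a constant-leaf model, the initial prediction is the sum over trees.
initial_prediction = sum(leaf_value for _ in range(num_trees))
```

For a univariate forest, the resulting scalar is the kind of value that would be passed as leaf_value.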

merge_forest

forest.Forest.merge_forest(other_forest)

Create a larger forest by merging the trees of this forest with those of another forest

Parameters

Name Type Description Default
other_forest Forest Forest to be merged into this forest required

add_constant

forest.Forest.add_constant(constant_value)

Add a constant value to every leaf of every tree in an ensemble. If leaves are multi-dimensional, constant_value will be added to every dimension of the leaves.

Parameters

Name Type Description Default
constant_value float Value that will be added to every leaf of every tree required

multiply_constant

forest.Forest.multiply_constant(constant_multiple)

Multiply every leaf of every tree by a constant value. If leaves are multi-dimensional, constant_multiple will be multiplied through every dimension of the leaves.

Parameters

Name Type Description Default
constant_multiple float Value that will be multiplied by every leaf of every tree required
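
The effect of add_constant and multiply_constant on predictions follows from the sum-of-trees structure. The arithmetic below is a minimal sketch for a constant-leaf model using plain Python lists, not calls into the library: adding c to every leaf shifts a prediction by num_trees * c, while multiplying every leaf by m scales the prediction by m.

```python
# One observation's leaf value in each of three trees (constant-leaf model).
leaves = [0.2, -0.1, 0.4]
base = sum(leaves)

# add_constant-style update: every leaf gains c, so the sum gains num_trees * c.
c = 0.5
shifted = sum(v + c for v in leaves)

# multiply_constant-style update: every leaf is scaled, so the sum is scaled.
m = 2.0
scaled = sum(v * m for v in leaves)
```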

add_numeric_split

forest.Forest.add_numeric_split(
    tree_num,
    leaf_num,
    feature_num,
    split_threshold,
    left_leaf_value,
    right_leaf_value,
)

Add a numeric (i.e. X[:, i] <= c) split to a given tree in the forest

Parameters

Name Type Description Default
tree_num int Index of the tree to be split required
leaf_num int Leaf to be split required
feature_num int Feature that defines the new split required
split_threshold float Value that defines the cutoff of the new split required
left_leaf_value float or np.array Value (or array of values) to assign to the newly created left node required
right_leaf_value float or np.array Value (or array of values) to assign to the newly created right node required
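
What a numeric split does to a tree can be sketched with a toy data structure. ToyTree below is a hypothetical stand-in, not the library's tree class; it assumes that new child nodes are appended at the end of the node list and that observations with x[feature] <= threshold are routed to the left child.

```python
class ToyTree:
    """Hypothetical stand-in for one tree; node 0 is the root."""

    def __init__(self, root_value):
        self.left = [-1]        # -1 marks "no child", i.e. a leaf
        self.right = [-1]
        self.feature = [-1]
        self.threshold = [float("inf")]
        self.value = [root_value]

    def add_numeric_split(self, leaf_num, feature_num, split_threshold,
                          left_leaf_value, right_leaf_value):
        if self.left[leaf_num] != -1:
            raise ValueError("node is not a leaf")
        # Append two new leaves and record them as children of leaf_num.
        for child_value in (left_leaf_value, right_leaf_value):
            self.left.append(-1)
            self.right.append(-1)
            self.feature.append(-1)
            self.threshold.append(float("inf"))
            self.value.append(child_value)
        self.left[leaf_num] = len(self.value) - 2
        self.right[leaf_num] = len(self.value) - 1
        self.feature[leaf_num] = feature_num
        self.threshold[leaf_num] = split_threshold

    def predict_one(self, x):
        node = 0
        while self.left[node] != -1:
            # Assumption: rows with x[feature] <= threshold go left.
            go_left = x[self.feature[node]] <= self.threshold[node]
            node = self.left[node] if go_left else self.right[node]
        return self.value[node]

tree = ToyTree(root_value=0.0)
tree.add_numeric_split(leaf_num=0, feature_num=1, split_threshold=0.5,
                       left_leaf_value=-1.0, right_leaf_value=1.0)
```

After the split, the former root leaf is an internal node, and an observation whose feature 1 equals the threshold exactly falls on the left side under the stated assumption.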

get_tree_leaves

forest.Forest.get_tree_leaves(tree_num)

Retrieve a vector of indices of leaf nodes for a given tree in the forest

Parameters

Name Type Description Default
tree_num int Index of the tree for which leaf indices will be retrieved required

Returns

Name Type Description
np.array One-dimensional numpy array, containing the indices of leaf nodes in a given tree.

get_tree_split_counts

forest.Forest.get_tree_split_counts(tree_num, num_features)

Retrieve a vector of split counts for every training set variable in a given tree in the forest

Parameters

Name Type Description Default
tree_num int Index of the tree for which split counts will be retrieved required
num_features int Total number of features in the training set required

Returns

Name Type Description
np.array One-dimensional numpy array with one element per feature in the forest model’s training set, containing the split count for each feature for a given tree of the forest.

get_overall_split_counts

forest.Forest.get_overall_split_counts(num_features)

Retrieve a vector of split counts for every training set variable in the forest

Parameters

Name Type Description Default
num_features int Total number of features in the training set required

Returns

Name Type Description
np.array One-dimensional numpy array with one element per feature in the forest model’s training set, containing the overall split count in the forest for each feature.

get_granular_split_counts

forest.Forest.get_granular_split_counts(num_features)

Retrieve a vector of split counts for every training set variable in the forest, reported separately for each tree

Parameters

Name Type Description Default
num_features int Total number of features in the training set required

Returns

Name Type Description
np.array One-dimensional numpy array with as many elements as in the forest model’s training set, containing the split count for each feature for every tree in the forest.
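
The relationship between the per-tree, overall, and granular split counts can be sketched with plain lists. The forest below is hypothetical, and the nested-list layout (one row per tree) is chosen for readability; it is not a statement about the shape the library returns.

```python
num_features = 3
# Hypothetical forest: for each tree, the feature index used at each split node.
split_features_by_tree = [
    [0, 2, 2],   # tree 0 has three split nodes
    [],          # tree 1 is a single root node
    [1, 2],      # tree 2 has two split nodes
]

def tree_split_counts(split_features, num_features):
    """Count how many split nodes of one tree use each feature."""
    counts = [0] * num_features
    for f in split_features:
        counts[f] += 1
    return counts

# One row per tree, one column per feature.
granular = [tree_split_counts(s, num_features) for s in split_features_by_tree]

# Summing rows over trees gives the forest-wide count per feature.
overall = [sum(col) for col in zip(*granular)]
```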

num_forest_leaves

forest.Forest.num_forest_leaves()

Return the total number of leaves in a forest

Returns

Name Type Description
int Number of leaves in a forest

sum_leaves_squared

forest.Forest.sum_leaves_squared()

Return the total sum of squared leaf values in a forest

Returns

Name Type Description
float Sum of squared leaf values in a forest

is_leaf_node

forest.Forest.is_leaf_node(tree_num, node_id)

Whether or not a given node of a given tree of a forest is a leaf

Parameters

Name Type Description Default
tree_num int Index of the tree to be queried required
node_id int Index of the node to be queried required

Returns

Name Type Description
bool True if node node_id in tree tree_num is a leaf, False otherwise

is_numeric_split_node

forest.Forest.is_numeric_split_node(tree_num, node_id)

Whether or not a given node of a given tree of a forest is a numeric split node

Parameters

Name Type Description Default
tree_num int Index of the tree to be queried required
node_id int Index of the node to be queried required

Returns

Name Type Description
bool True if node node_id in tree tree_num is a numeric split node, False otherwise

is_categorical_split_node

forest.Forest.is_categorical_split_node(tree_num, node_id)

Whether or not a given node of a given tree of a forest is a categorical split node

Parameters

Name Type Description Default
tree_num int Index of the tree to be queried required
node_id int Index of the node to be queried required

Returns

Name Type Description
bool True if node node_id in tree tree_num is a categorical split node, False otherwise

parent_node

forest.Forest.parent_node(tree_num, node_id)

Parent node of given node of a given tree of a forest

Parameters

Name Type Description Default
tree_num int Index of the tree to be queried required
node_id int Index of the node to be queried required

Returns

Name Type Description
int Index of the parent of node node_id in tree tree_num. If node_id is a root node, returns -1.

left_child_node

forest.Forest.left_child_node(tree_num, node_id)

Left child node of given node of a given tree of a forest

Parameters

Name Type Description Default
tree_num int Index of the tree to be queried required
node_id int Index of the node to be queried required

Returns

Name Type Description
int Index of the left child of node node_id in tree tree_num. If node_id is a leaf, returns -1.

right_child_node

forest.Forest.right_child_node(tree_num, node_id)

Right child node of given node of a given tree of a forest

Parameters

Name Type Description Default
tree_num int Index of the tree to be queried required
node_id int Index of the node to be queried required

Returns

Name Type Description
int Index of the right child of node node_id in tree tree_num. If node_id is a leaf, returns -1.

node_depth

forest.Forest.node_depth(tree_num, node_id)

Depth of given node of a given tree of a forest. Returns -1 if the node is a leaf.

Parameters

Name Type Description Default
tree_num int Index of the tree to be queried required
node_id int Index of the node to be queried required

Returns

Name Type Description
int Depth of node node_id in tree tree_num. The root node is defined as “depth zero.”
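
The parent and child accessors, together with their -1 sentinels, are enough to walk a tree. The sketch below uses hypothetical parallel lists in place of calls to parent_node, left_child_node, and right_child_node, and derives node depth (root at depth zero) and the set of leaves from them.

```python
# Hypothetical tree stored as parallel lists indexed by node id.
# -1 means "no parent" (root) or "no child" (leaf).
parent = [-1, 0, 0, 2, 2]
left = [1, -1, 3, -1, -1]
right = [2, -1, 4, -1, -1]

def depth(node_id):
    """Count edges from node_id up to the root (root has depth zero)."""
    d = 0
    while parent[node_id] != -1:
        node_id = parent[node_id]
        d += 1
    return d

def leaf_ids(node_id=0):
    """Collect leaf ids below node_id with a depth-first walk."""
    if left[node_id] == -1:
        return [node_id]
    return leaf_ids(left[node_id]) + leaf_ids(right[node_id])

depths = [depth(i) for i in range(len(parent))]
tree_leaves = leaf_ids()
```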

node_split_index

forest.Forest.node_split_index(tree_num, node_id)

Split index of given node of a given tree of a forest. Returns -1 if the node is a leaf.

Parameters

Name Type Description Default
tree_num int Index of the tree to be queried required
node_id int Index of the node to be queried required

Returns

Name Type Description
int Split index of node_id in tree tree_num.

node_split_threshold

forest.Forest.node_split_threshold(tree_num, node_id)

Threshold that defines a numeric split for a given node of a given tree of a forest. Returns np.inf if the node is a leaf or a categorical split node.

Parameters

Name Type Description Default
tree_num int Index of the tree to be queried required
node_id int Index of the node to be queried required

Returns

Name Type Description
float Threshold that defines a numeric split for node node_id in tree tree_num.

node_split_categories

forest.Forest.node_split_categories(tree_num, node_id)

Array of category indices that define a categorical split for a given node of a given tree of a forest. Returns np.array([np.inf]) if the node is a leaf or a numeric split node.

Parameters

Name Type Description Default
tree_num int Index of the tree to be queried required
node_id int Index of the node to be queried required

Returns

Name Type Description
np.array Array of category indices that define a categorical split for node node_id in tree tree_num.

node_leaf_values

forest.Forest.node_leaf_values(tree_num, node_id)

Leaf node value(s) for a given node of a given tree of a forest. Values are stale if the node is a split node.

Parameters

Name Type Description Default
tree_num int Index of the tree to be queried required
node_id int Index of the node to be queried required

Returns

Name Type Description
np.array Array of parameter values for node node_id in tree tree_num.

num_nodes

forest.Forest.num_nodes(tree_num)

Number of nodes in a given tree of a forest

Parameters

Name Type Description Default
tree_num int Index of the tree to be queried required

Returns

Name Type Description
int Total number of nodes in tree tree_num.

num_leaves

forest.Forest.num_leaves(tree_num)

Number of leaves in a given tree of a forest

Parameters

Name Type Description Default
tree_num int Index of the tree to be queried required

Returns

Name Type Description
int Total number of leaves in tree tree_num.

num_leaf_parents

forest.Forest.num_leaf_parents(tree_num)

Number of leaf parents in a given tree of a forest

Parameters

Name Type Description Default
tree_num int Index of the tree to be queried required

Returns

Name Type Description
int Total number of leaf parents in tree tree_num.

num_split_nodes

forest.Forest.num_split_nodes(tree_num)

Number of split nodes in a given tree of a forest

Parameters

Name Type Description Default
tree_num int Index of the tree to be queried required

Returns

Name Type Description
int Total number of split nodes in tree tree_num.

nodes

forest.Forest.nodes(tree_num)

Array of node indices in a given tree of a forest

Parameters

Name Type Description Default
tree_num int Index of the tree to be queried required

Returns

Name Type Description
np.array Array of indices of nodes in tree tree_num.

leaves

forest.Forest.leaves(tree_num)

Array of leaf indices in a given tree of a forest

Parameters

Name Type Description Default
tree_num int Index of the tree to be queried required

Returns

Name Type Description
np.array Array of indices of leaf nodes in tree tree_num.

is_empty

forest.Forest.is_empty()

When a Forest object is created, it is “empty” in the sense that none of its component trees have leaves with values. There are two ways to “initialize” a Forest object. First, the set_root_leaves() method of the Forest class simply initializes every tree in the forest to a single node carrying the same (user-specified) leaf value. Second, the prepare_for_sampler() method of the ForestSampler class initializes every tree in the forest to a single node with the same value and also propagates this information through to the temporary tracking data structures in a ForestSampler object, which must be synchronized with a Forest during a forest sampler loop.

Returns

Name Type Description
bool True if the Forest has not yet been initialized with a constant root value, False if the forest has already been initialized or grown.
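
The empty and initialized states described above can be mimicked with a small stand-in class. ToyForest is hypothetical and only mirrors the described behavior of is_empty() and set_root_leaves(); it does not wrap the C++ object and does not model prepare_for_sampler().

```python
class ToyForest:
    """Hypothetical stand-in that mirrors the empty/initialized states."""

    def __init__(self, num_trees):
        self.num_trees = num_trees
        self.root_values = None  # no leaf values yet

    def is_empty(self):
        return self.root_values is None

    def set_root_leaves(self, leaf_value):
        # Every tree becomes a single root node carrying the same value.
        self.root_values = [leaf_value] * self.num_trees

toy = ToyForest(num_trees=3)
empty_before = toy.is_empty()
toy.set_root_leaves(0.1)
empty_after = toy.is_empty()
```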