Classes
class	CategorySampleTracker
	Mapping categories to the indices they contain. TODO: Add run-time checks for categories with a few observations.

class	ColumnMatrix
	Internal wrapper around `Eigen::MatrixXd` interface for multidimensional floating point data. More...

class	ColumnVector
	Internal wrapper around `Eigen::VectorXd` interface for univariate floating point data. The (frequently updated) full / partial residual used in sampling forests is stored internally as a `ColumnVector` by the sampling functions (see Forest Sampler API). More...

class	CutpointGridContainer
	Container class for FeatureCutpointGrid objects stored for every feature in a dataset. More...

class	FeatureCutpointGrid
	Computing and tracking cutpoints available for a given feature at a given node. Cutpoint bins are stored in 0-indexed fashion. More...

class	FeaturePresortPartition
	Data structure that tracks pre-sorted feature values through a tree's split lifecycle. More...

class	FeaturePresortRoot
	Data structure for presorting a feature by its values. More...

class	FeaturePresortRootContainer
	Container class for FeaturePresortRoot objects stored for every feature in a dataset. More...

class	FeatureUnsortedPartition
	Mapping nodes to the indices they contain. More...

class	ForestContainer
	Container of `TreeEnsemble` forest objects. This is the primary (in-memory) storage interface for multiple "samples" of a decision tree ensemble in `stochtree`. More...

class	ForestDataset
	API for loading and accessing data used to sample tree ensembles. The covariates / bases / weights used in sampling forests are stored internally as a `ForestDataset` by the sampling functions (see Forest Sampler API). More...

class	ForestTracker
	"Superclass" wrapper around tracking data structures for forest sampling algorithms. More...

class	GammaSampler

class	GaussianConstantLeafModel
	Marginal likelihood and posterior computation for gaussian homoskedastic constant leaf outcome model. More...

class	GaussianConstantSuffStat
	Sufficient statistic and associated operations for gaussian homoskedastic constant leaf outcome model. More...

class	GaussianMultivariateRegressionLeafModel
	Marginal likelihood and posterior computation for gaussian homoskedastic multivariate regression leaf outcome model. More...

class	GaussianMultivariateRegressionSuffStat
	Sufficient statistic and associated operations for gaussian homoskedastic multivariate regression leaf outcome model. More...

class	GaussianUnivariateRegressionLeafModel
	Marginal likelihood and posterior computation for gaussian homoskedastic univariate regression leaf outcome model. More...

class	GaussianUnivariateRegressionSuffStat
	Sufficient statistic and associated operations for gaussian homoskedastic univariate regression leaf outcome model. More...

class	GlobalHomoskedasticVarianceModel
	Marginal likelihood and posterior computation for gaussian homoskedastic constant leaf outcome model. More...

class	IGVariancePrior

class	InverseGammaSampler

class	LabelMapper
	Standalone container for the map from category IDs to 0-based indices. More...

class	LeafNodeHomoskedasticVarianceModel
	Marginal likelihood and posterior computation for gaussian homoskedastic constant leaf outcome model. More...

class	LogLinearVarianceLeafModel
	Marginal likelihood and posterior computation for heteroskedastic log-linear variance model. More...

class	LogLinearVarianceSuffStat
	Sufficient statistic and associated operations for heteroskedastic log-linear variance model. More...

class	MultivariateNormalSampler

class	MultivariateRegressionRandomEffectsModel
	Posterior computation and sampling and state storage for random effects model with a group-level multivariate basis regression. More...

class	NodeCutpointTracker
	Computing and tracking cutpoints available for a given feature at a given node. More...

class	NodeOffsetSize
	Tracking cutpoints available at a given node. More...

class	RandomEffectsContainer

class	RandomEffectsDataset
	API for loading and accessing data used to sample (additive) random effects. More...

class	RandomEffectsGaussianPrior

class	RandomEffectsRegressionGaussianPrior

class	RandomEffectsTracker
	Wrapper around data structures for random effects sampling algorithms. More...

class	SampleCategoryMapper
	Class storing sample-node map for each tree in an ensemble. TODO: Add run-time checks for categories with a few observations. More...

class	SampleNodeMapper
	Class storing sample-node map for each tree in an ensemble. More...

class	SamplePredMapper
	Class storing sample-prediction map for each tree in an ensemble. More...

class	SortedNodeSampleTracker
	Data structure for tracking observations through a tree partition with each feature pre-sorted. More...

class	Tree
	Decision tree data structure. More...

class	TreeEnsemble
	Class storing a "forest," or an ensemble of decision trees. More...

class	TreePrior

class	TreeSplit
	Representation of arbitrary tree split rules, including numeric split rules (`X[,i] <= c`) and categorical split rules (`X[,i] in {2,4,6,7}`). More...

class	UnivariateNormalSampler

class	UnsortedNodeSampleTracker
	Mapping nodes to the indices they contain. More...

Typedefs
using	SuffStatVariant = std::variant< GaussianConstantSuffStat, GaussianUnivariateRegressionSuffStat, GaussianMultivariateRegressionSuffStat, LogLinearVarianceSuffStat >
	Unifying layer for disparate sufficient statistic class types.

using	LeafModelVariant = std::variant< GaussianConstantLeafModel, GaussianUnivariateRegressionLeafModel, GaussianMultivariateRegressionLeafModel, LogLinearVarianceLeafModel >
	Unifying layer for disparate leaf model class types.
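
As a rough illustration of how a `std::variant` can serve as a unifying layer over otherwise unrelated class types, the sketch below dispatches through `std::visit`. It assumes C++17. `ConstantStat`, `RegressionStat`, `StatVariant`, `StatKind`, `MakeStat`, `AccumulateObservation` and `SampleSize` are hypothetical names introduced for this example only; they are not part of `stochtree`, and the real classes may expose different methods.

```cpp
#include <cassert>
#include <variant>

// Hypothetical stand-ins for sufficient statistic classes (not stochtree types).
struct ConstantStat {
  int n = 0;
  double sum_y = 0.0;
  void Increment(double y, double /*basis*/) { ++n; sum_y += y; }
};

struct RegressionStat {
  int n = 0;
  double sum_xy = 0.0;
  double sum_xx = 0.0;
  void Increment(double y, double basis) {
    ++n;
    sum_xy += basis * y;
    sum_xx += basis * basis;
  }
};

// One variant type covers every statistic class.
using StatVariant = std::variant<ConstantStat, RegressionStat>;

enum class StatKind { kConstant, kRegression };

// Factory: choose the concrete alternative at run time.
inline StatVariant MakeStat(StatKind kind) {
  if (kind == StatKind::kConstant) return ConstantStat{};
  return RegressionStat{};
}

// Generic code calls the same method on whichever alternative is active.
inline void AccumulateObservation(StatVariant& stat, double y, double basis) {
  std::visit([&](auto& s) { s.Increment(y, basis); }, stat);
}

inline int SampleSize(const StatVariant& stat) {
  return std::visit([](const auto& s) { return s.n; }, stat);
}
```

The caller holds a single `StatVariant` and never names the concrete class, which is the role a factory function plus a variant alias can play when the model type is only known at run time.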

Enumerations
enum	ModelType
	Leaf models for the forest sampler: More...

enum	TreeNodeType
	Tree node type.

Functions
static void	ExtractMultipleFeaturesFromMemory (std::vector< std::string > text_data, const Parser parser, std::vector< int32_t > &column_indices, Eigen::MatrixXd &data, data_size_t num_rows)
	Extract multiple features from the raw data loaded from a file into an `Eigen::MatrixXd`. Lightly modified from LightGBM's datasetloader interface to support `stochtree`'s use cases.

static void	ExtractSingleFeatureFromMemory (std::vector< std::string > text_data, const Parser parser, int32_t column_index, Eigen::VectorXd &data, data_size_t num_rows)
	Extract a single feature from the raw data loaded from a file into an `Eigen::VectorXd`. Lightly modified from LightGBM's datasetloader interface to support `stochtree`'s use cases.

template<typename container_type , typename prob_type >
void	sample_without_replacement (container_type output, prob_type p, container_type *a, int population_size, int sample_size, std::mt19937 &gen)
	Sample without replacement according to a set of probability weights. This template function is a C++ variant of numpy's implementation: https://github.com/numpy/numpy/blob/031f44252d613f4524ad181e3eb2ae2791e22187/numpy/random/_generator.pyx#L925.

static SuffStatVariant	suffStatFactory (ModelType model_type, int basis_dim=0)
	Factory function that creates a new `SuffStat` object for the specified model type.

static LeafModelVariant	leafModelFactory (ModelType model_type, double tau, Eigen::MatrixXd &Sigma0, double a, double b)
	Factory function that creates a new `LeafModel` object for the specified model type.

std::string	TreeNodeTypeToString (TreeNodeType type)
	Get string representation of TreeNodeType.

TreeNodeType	TreeNodeTypeFromString (std::string const &name)
	Get TreeNodeType from string.

bool	operator== (const Tree &lhs, const Tree &rhs)
	Comparison operator for trees.

bool	SplitTrueNumeric (double fvalue, double threshold)
	Determine whether an observation produces a "true" value in a numeric split node.

bool	SplitTrueCategorical (double fvalue, std::vector< std::uint32_t > const &category_list)
	Determine whether an observation produces a "true" value in a categorical split node.

int	NextNodeNumeric (double fvalue, double threshold, int left_child, int right_child)
	Return left or right node id based on a numeric split.

int	NextNodeCategorical (double fvalue, std::vector< std::uint32_t > const &category_list, int left_child, int right_child)
	Return left or right node id based on a categorical split.

int	EvaluateTree (Tree const &tree, Eigen::MatrixXd &data, int row)

int	EvaluateTree (Tree const &tree, Eigen::Map< Eigen::Matrix< double, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor > > &data, int row)

bool	RowSplitLeft (Eigen::MatrixXd &covariates, int row, int split_index, double split_value)
	Determine whether a given observation is "true" at a split proposed by split_index and split_value.

bool	RowSplitLeft (Eigen::MatrixXd &covariates, int row, int split_index, std::vector< std::uint32_t > const &category_list)
	Determine whether a given observation is "true" at a split proposed by split_index and category_list.

static void	VarSplitRange (ForestTracker &tracker, ForestDataset &dataset, int tree_num, int leaf_split, int feature_split, double &var_min, double &var_max)
	Compute the range of available split values for a continuous variable, given the current structure of a tree.

static bool	NodesNonConstantAfterSplit (ForestDataset &dataset, ForestTracker &tracker, TreeSplit &split, int tree_num, int leaf_split, int feature_split)
	Determines whether a proposed split creates two leaf nodes with constant values for every feature (thus ensuring that the tree cannot split further).

template<typename LeafModel , typename LeafSuffStat , typename... LeafSuffStatConstructorArgs>
static void	GFRSampleOneIter (TreeEnsemble &active_forest, ForestTracker &tracker, ForestContainer &forests, LeafModel &leaf_model, ForestDataset &dataset, ColumnVector &residual, TreePrior &tree_prior, std::mt19937 &gen, std::vector< double > &variable_weights, std::vector< int > &sweep_update_indices, double global_variance, std::vector< FeatureType > &feature_types, int cutpoint_grid_size, bool keep_forest, bool pre_initialized, bool backfitting, int num_features_subsample, LeafSuffStatConstructorArgs &... leaf_suff_stat_args)

template<typename LeafModel , typename LeafSuffStat , typename... LeafSuffStatConstructorArgs>
static void	MCMCSampleOneIter (TreeEnsemble &active_forest, ForestTracker &tracker, ForestContainer &forests, LeafModel &leaf_model, ForestDataset &dataset, ColumnVector &residual, TreePrior &tree_prior, std::mt19937 &gen, std::vector< double > &variable_weights, std::vector< int > &sweep_update_indices, double global_variance, bool keep_forest, bool pre_initialized, bool backfitting, LeafSuffStatConstructorArgs &... leaf_suff_stat_args)
	Runs one iteration of the MCMC sampler for a tree ensemble model, which consists of two steps for every tree in a forest:

Detailed Description

General-purpose data structures used for keeping track of categories in a training dataset.

SampleCategoryMapper is a simplified version of SampleNodeMapper. It is not tree-specific, since it tracks categories loaded into a training dataset, and it is not expected to change during training.

SampleCategoryMapper is used in two places:

Group random effects: mapping observations to group IDs for the purpose of computing random effects
Heteroskedasticity based on fixed categories (as opposed to partitions as in HBART by Pratola et al 2018)
- One example of this would be binary treatment causal inference with separate outcome variances for the treated and control groups (as in Krantsevich et al 2023)
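
A minimal sketch of the kind of bookkeeping described above: a forward map from each observation to its category ID, and the inverse map from each category ID to the observations it contains. `CategoryIndexSketch` and its methods are hypothetical names used only for illustration; the actual `stochtree` classes may be organized differently.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical sketch (not the stochtree classes): a forward map from
// observation index to category ID, plus the inverse map from category ID
// to the observation indices it contains.
class CategoryIndexSketch {
 public:
  explicit CategoryIndexSketch(const std::vector<int>& category_ids)
      : observation_to_category_(category_ids) {
    for (std::size_t i = 0; i < category_ids.size(); ++i) {
      category_to_observations_[category_ids[i]].push_back(i);
    }
  }

  // Forward lookup: which category does observation i belong to?
  int CategoryOf(std::size_t i) const { return observation_to_category_.at(i); }

  // Inverse lookup: which observations fall in a given category?
  const std::vector<std::size_t>& ObservationsIn(int category_id) const {
    return category_to_observations_.at(category_id);
  }

  std::size_t NumCategories() const { return category_to_observations_.size(); }

 private:
  std::vector<int> observation_to_category_;
  std::map<int, std::vector<std::size_t>> category_to_observations_;
};
```

Both maps are built once from the loaded category labels and are read-only afterwards, which matches the expectation that categories do not change during training.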

CategorySampleTracker is a simplified version of FeatureUnsortedPartition, which as above does not vary based on tree / partition and is not expected to change during training.

SampleNodeMapper is inspired by the design of the DataPartition class in LightGBM, released under the MIT license with the following copyright:

Simple container-like interfaces for samples of common models.

Data structures for enumerating potential cutpoint candidates.

This is used in the XBART family of algorithms, which samples split rules based on the log marginal likelihood of every potential cutpoint. For numeric variables with large sample sizes, it is often unnecessary to consider every unique value, so we allow for an adaptive "grid" of potential cutpoint values.

Algorithms for enumerating cutpoints take Dataset and SortedNodeSampleTracker objects as inputs, so that each feature is "pre-sorted" according to its value within a given node. The size of the adaptive cutpoint grid is set by the cutpoint_grid_size configuration parameter.

Numeric Features

When a node has fewer available observations than cutpoint_grid_size, full enumeration of unique available cutpoints is done via the EnumerateNumericCutpointsDeduplication function.

When a node has more available observations than cutpoint_grid_size, potential cutpoints are "thinned out" by considering every k-th observation, where k is implied by the number of observations and the target cutpoint_grid_size.
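
The two numeric cases can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `CandidateCutpoints` is a hypothetical function, the input is assumed to be one feature's values already sorted within the node, k is assumed to be the node size divided by the grid size and rounded up, and a node with exactly cutpoint_grid_size observations is treated as the full-enumeration case.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch, not the stochtree implementation. `sorted_values`
// holds one feature's values for the observations in a node, ascending.
inline std::vector<double> CandidateCutpoints(
    const std::vector<double>& sorted_values, std::size_t cutpoint_grid_size) {
  assert(cutpoint_grid_size > 0);
  std::vector<double> cutpoints;
  const std::size_t n = sorted_values.size();
  if (n <= cutpoint_grid_size) {
    // Small node: keep every unique value (adjacent duplicates are skipped).
    for (std::size_t i = 0; i < n; ++i) {
      if (cutpoints.empty() || sorted_values[i] != cutpoints.back()) {
        cutpoints.push_back(sorted_values[i]);
      }
    }
  } else {
    // Large node: keep every k-th observation, with k implied by the node
    // size and the target grid size (assumed here to be rounded up).
    const std::size_t k = (n + cutpoint_grid_size - 1) / cutpoint_grid_size;
    for (std::size_t i = k - 1; i < n; i += k) {
      if (cutpoints.empty() || sorted_values[i] != cutpoints.back()) {
        cutpoints.push_back(sorted_values[i]);
      }
    }
  }
  return cutpoints;
}
```

With this rounding choice the thinned grid never exceeds cutpoint_grid_size entries, and it can be smaller when the feature has many tied values.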

Ordered Categorical Features

In this case, the grid is every unique value of the ordered categorical feature in ascending order.

Unordered Categorical Features

In this case, the grid is every unique value of the unordered categorical feature, arranged in an outcome-dependent order, as described in Fisher (1958).
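
One common reading of an outcome-dependent ordering is to sort the categories by their mean outcome, after which candidate splits can be scanned along that ordering as if the feature were ordered. The sketch below assumes that reading; `OrderCategoriesByMeanOutcome` is a hypothetical helper, and the quantity the library actually averages (for example a residual rather than a raw outcome) is not confirmed by this page.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical sketch, not the stochtree implementation: order the categories
// of an unordered feature by their mean outcome.
inline std::vector<std::uint32_t> OrderCategoriesByMeanOutcome(
    const std::vector<std::uint32_t>& categories,
    const std::vector<double>& outcomes) {
  assert(categories.size() == outcomes.size());
  // Accumulate (sum, count) of the outcome per category.
  std::map<std::uint32_t, std::pair<double, std::size_t>> totals;
  for (std::size_t i = 0; i < categories.size(); ++i) {
    auto& entry = totals[categories[i]];
    entry.first += outcomes[i];
    entry.second += 1;
  }
  // Collect (mean, category) pairs and sort ascending by mean;
  // ties are broken by category value.
  std::vector<std::pair<double, std::uint32_t>> means;
  for (const auto& kv : totals) {
    means.emplace_back(
        kv.second.first / static_cast<double>(kv.second.second), kv.first);
  }
  std::sort(means.begin(), means.end());
  std::vector<std::uint32_t> ordered;
  for (const auto& m : means) ordered.push_back(m.second);
  return ordered;
}
```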

Data structures used for tracking a dataset through the tree building process.

The first category of data structure tracks observations available in nodes of a tree.
a. UnsortedNodeSampleTracker tracks the observations available in every leaf of every tree in an ensemble, in no feature-specific sort order. It is primarily designed for use in BART-based algorithms.
b. SortedNodeSampleTracker tracks the observations available in every leaf of a tree, pre-sorted separately for each feature. It is primarily designed for use in XBART-based algorithms.

The second category, SampleNodeMapper, maps observations from a dataset to leaf nodes.
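
To make the two kinds of tracking concrete, the sketch below keeps observation indices in a single array where each node owns a contiguous slice (the node-to-observations direction), alongside a per-observation node id (the observation-to-node direction). `PartitionSketch` and `NodeSliceSketch` are hypothetical names, the contiguous-slice layout is an assumption made for this example, and only a single tree with numeric `<=` splits is handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical sketch, not the stochtree classes. Each node owns a
// contiguous [begin, end) slice of the shared index array.
struct NodeSliceSketch {
  std::size_t begin;
  std::size_t end;
};

class PartitionSketch {
 public:
  explicit PartitionSketch(std::size_t n) : indices_(n), node_of_(n, 0) {
    std::iota(indices_.begin(), indices_.end(), std::size_t{0});
    slices_.push_back({0, n});  // node 0 is the root and holds everything
  }

  // Split `node` on `feature <= threshold`; returns the id of the left child
  // (the right child id is the returned value plus one).
  int Split(int node, const std::vector<double>& feature, double threshold) {
    const NodeSliceSketch s = slices_[static_cast<std::size_t>(node)];
    auto first = indices_.begin() + static_cast<std::ptrdiff_t>(s.begin);
    auto last = indices_.begin() + static_cast<std::ptrdiff_t>(s.end);
    // Move observations satisfying the rule to the front of the slice.
    auto mid = std::stable_partition(
        first, last, [&](std::size_t i) { return feature[i] <= threshold; });
    const std::size_t split_at =
        static_cast<std::size_t>(mid - indices_.begin());
    const int left = static_cast<int>(slices_.size());
    slices_.push_back({s.begin, split_at});
    slices_.push_back({split_at, s.end});
    // Update the observation-to-node map for both children.
    for (std::size_t p = s.begin; p < s.end; ++p) {
      node_of_[indices_[p]] = (p < split_at) ? left : left + 1;
    }
    return left;
  }

  std::size_t NodeSize(int node) const {
    const NodeSliceSketch& s = slices_[static_cast<std::size_t>(node)];
    return s.end - s.begin;
  }
  int NodeOf(std::size_t observation) const { return node_of_[observation]; }

 private:
  std::vector<std::size_t> indices_;
  std::vector<int> node_of_;
  std::vector<NodeSliceSketch> slices_;
};
```

Keeping each node's observations contiguous means a split only reorders the parent's slice, and the per-observation node id gives constant-time lookup of the leaf an observation currently falls in.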

SampleNodeMapper is inspired by the design of the DataPartition class in LightGBM, released under the MIT license with the following copyright:

SortedNodeSampleTracker is inspired by the "approximate" split finding method in xgboost, released under the Apache license with the following copyright:
