StochTree 0.0.1
StochTree Namespace Reference

Classes

class  CategorySampleTracker
 Mapping categories to the indices they contain. TODO: Add run-time checks for categories with a few observations.
 
class  ColumnMatrix
 Internal wrapper around Eigen::MatrixXd interface for multidimensional floating point data. More...
 
class  ColumnVector
 Internal wrapper around Eigen::VectorXd interface for univariate floating point data. The (frequently updated) full / partial residual used in sampling forests is stored internally as a ColumnVector by the sampling functions (see Forest Sampler API). More...
 
class  CutpointGridContainer
 Container class for FeatureCutpointGrid objects stored for every feature in a dataset. More...
 
class  FeatureCutpointGrid
 Computing and tracking cutpoints available for a given feature at a given node. Store cutpoint bins in 0-indexed fashion, so that if a given node has. More...
 
class  FeaturePresortPartition
 Data structure that tracks pre-sorted feature values through a tree's split lifecycle. More...
 
class  FeaturePresortRoot
 Data structure for presorting a feature by its values. More...
 
class  FeaturePresortRootContainer
 Container class for FeaturePresortRoot objects stored for every feature in a dataset. More...
 
class  FeatureUnsortedPartition
 Mapping nodes to the indices they contain. More...
 
class  ForestContainer
 Container of TreeEnsemble forest objects. This is the primary (in-memory) storage interface for multiple "samples" of a decision tree ensemble in stochtree. More...
 
class  ForestDataset
 API for loading and accessing data used to sample tree ensembles. The covariates / bases / weights used in sampling forests are stored internally as a ForestDataset by the sampling functions (see Forest Sampler API). More...
 
class  ForestTracker
 "Superclass" wrapper around tracking data structures for forest sampling algorithms. More...
 
class  GammaSampler
 
class  GaussianConstantLeafModel
 Marginal likelihood and posterior computation for gaussian homoskedastic constant leaf outcome model. More...
 
class  GaussianConstantSuffStat
 Sufficient statistic and associated operations for gaussian homoskedastic constant leaf outcome model. More...
 
class  GaussianMultivariateRegressionLeafModel
 Marginal likelihood and posterior computation for gaussian homoskedastic multivariate regression leaf outcome model. More...
 
class  GaussianMultivariateRegressionSuffStat
 Sufficient statistic and associated operations for gaussian homoskedastic multivariate regression leaf outcome model. More...
 
class  GaussianUnivariateRegressionLeafModel
 Marginal likelihood and posterior computation for gaussian homoskedastic univariate regression leaf outcome model. More...
 
class  GaussianUnivariateRegressionSuffStat
 Sufficient statistic and associated operations for gaussian homoskedastic univariate regression leaf outcome model. More...
 
class  GlobalHomoskedasticVarianceModel
 Marginal likelihood and posterior computation for gaussian homoskedastic constant leaf outcome model. More...
 
class  IGVariancePrior
 
class  InverseGammaSampler
 
class  LabelMapper
 Standalone container for the map from category IDs to 0-based indices. More...
 
class  LeafNodeHomoskedasticVarianceModel
 Marginal likelihood and posterior computation for gaussian homoskedastic constant leaf outcome model. More...
 
class  LogLinearVarianceLeafModel
 Marginal likelihood and posterior computation for heteroskedastic log-linear variance model. More...
 
class  LogLinearVarianceSuffStat
 Sufficient statistic and associated operations for heteroskedastic log-linear variance model. More...
 
class  MultivariateNormalSampler
 
class  MultivariateRegressionRandomEffectsModel
 Posterior computation and sampling and state storage for random effects model with a group-level multivariate basis regression. More...
 
class  NodeCutpointTracker
 Computing and tracking cutpoints available for a given feature at a given node. More...
 
class  NodeOffsetSize
 Tracking cutpoints available at a given node. More...
 
class  RandomEffectsContainer
 
class  RandomEffectsDataset
 API for loading and accessing data used to sample (additive) random effects. More...
 
class  RandomEffectsGaussianPrior
 
class  RandomEffectsRegressionGaussianPrior
 
class  RandomEffectsTracker
 Wrapper around data structures for random effects sampling algorithms. More...
 
class  SampleCategoryMapper
 Class storing sample-node map for each tree in an ensemble. TODO: Add run-time checks for categories with a few observations. More...
 
class  SampleNodeMapper
 Class storing sample-node map for each tree in an ensemble. More...
 
class  SamplePredMapper
 Class storing sample-prediction map for each tree in an ensemble. More...
 
class  SortedNodeSampleTracker
 Data structure for tracking observations through a tree partition with each feature pre-sorted. More...
 
class  Tree
 Decision tree data structure. More...
 
class  TreeEnsemble
 Class storing a "forest," or an ensemble of decision trees. More...
 
class  TreePrior
 
class  TreeSplit
 Representation of arbitrary tree split rules, including numeric split rules (X[,i] <= c) and categorical split rules (X[,i] in {2,4,6,7}). More...
 
class  UnivariateNormalSampler
 
class  UnsortedNodeSampleTracker
 Mapping nodes to the indices they contain. More...
 

Enumerations

enum  ModelType
 Leaf models for the forest sampler. More...
 
enum  TreeNodeType
 Tree node type.
 

Functions

static void ExtractMultipleFeaturesFromMemory (std::vector< std::string > *text_data, const Parser *parser, std::vector< int32_t > &column_indices, Eigen::MatrixXd &data, data_size_t num_rows)
 Extract multiple features from the raw data loaded from a file into an Eigen::MatrixXd. Lightly modified from LightGBM's datasetloader interface to support stochtree's use cases.
 
static void ExtractSingleFeatureFromMemory (std::vector< std::string > *text_data, const Parser *parser, int32_t column_index, Eigen::VectorXd &data, data_size_t num_rows)
 Extract a single feature from the raw data loaded from a file into an Eigen::VectorXd. Lightly modified from LightGBM's datasetloader interface to support stochtree's use cases.
 
std::string TreeNodeTypeToString (TreeNodeType type)
 Get string representation of TreeNodeType.
 
TreeNodeType TreeNodeTypeFromString (std::string const &name)
 Get TreeNodeType from string.
 
bool operator== (const Tree &lhs, const Tree &rhs)
 Comparison operator for trees.
 
bool SplitTrueNumeric (double fvalue, double threshold)
 Determine whether an observation produces a "true" value in a numeric split node.
 
bool SplitTrueCategorical (double fvalue, std::vector< std::uint32_t > const &category_list)
 Determine whether an observation produces a "true" value in a categorical split node.
 
int NextNodeNumeric (double fvalue, double threshold, int left_child, int right_child)
 Return left or right node id based on a numeric split.
 
int NextNodeCategorical (double fvalue, std::vector< std::uint32_t > const &category_list, int left_child, int right_child)
 Return left or right node id based on a categorical split.
 
int EvaluateTree (Tree const &tree, Eigen::MatrixXd &data, int row)
 
int EvaluateTree (Tree const &tree, Eigen::Map< Eigen::Matrix< double, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor > > &data, int row)
 
bool RowSplitLeft (Eigen::MatrixXd &covariates, int row, int split_index, double split_value)
 Determine whether a given observation is "true" at a split proposed by split_index and split_value.
 
bool RowSplitLeft (Eigen::MatrixXd &covariates, int row, int split_index, std::vector< std::uint32_t > const &category_list)
 Determine whether a given observation is "true" at a split proposed by split_index and category_list.
 
static void VarSplitRange (ForestTracker &tracker, ForestDataset &dataset, int tree_num, int leaf_split, int feature_split, double &var_min, double &var_max)
 Compute the range of available split values for a continuous variable, given the current structure of a tree.
 
static bool NodesNonConstantAfterSplit (ForestDataset &dataset, ForestTracker &tracker, TreeSplit &split, int tree_num, int leaf_split, int feature_split)
 Determines whether a proposed split creates two leaf nodes with constant values for every feature (thus ensuring that the tree cannot split further).
 
template<typename LeafModel , typename LeafSuffStat , typename... LeafSuffStatConstructorArgs>
static void GFRSampleOneIter (TreeEnsemble &active_forest, ForestTracker &tracker, ForestContainer &forests, LeafModel &leaf_model, ForestDataset &dataset, ColumnVector &residual, TreePrior &tree_prior, std::mt19937 &gen, std::vector< double > &variable_weights, double global_variance, std::vector< FeatureType > &feature_types, int cutpoint_grid_size, bool keep_forest, bool pre_initialized, bool backfitting, LeafSuffStatConstructorArgs &... leaf_suff_stat_args)
 
template<typename LeafModel , typename LeafSuffStat , typename... LeafSuffStatConstructorArgs>
static void MCMCSampleOneIter (TreeEnsemble &active_forest, ForestTracker &tracker, ForestContainer &forests, LeafModel &leaf_model, ForestDataset &dataset, ColumnVector &residual, TreePrior &tree_prior, std::mt19937 &gen, std::vector< double > &variable_weights, double global_variance, bool keep_forest, bool pre_initialized, bool backfitting, LeafSuffStatConstructorArgs &... leaf_suff_stat_args)
 Runs one iteration of the MCMC sampler for a tree ensemble model, which consists of two steps for every tree in a forest:
 

Detailed Description

Copyright (c) 2024 stochtree authors.

General-purpose data structures used for keeping track of categories in a training dataset.

SampleCategoryMapper is a simplified version of SampleNodeMapper. It is not tree-specific, since it tracks categories loaded into a training dataset, and it is not expected to change during training.

SampleCategoryMapper is used in two places:

  1. Group random effects: mapping observations to group IDs for the purpose of computing random effects
  2. Heteroskedasticity based on fixed categories (as opposed to partitions as in HBART by Pratola et al. 2018)
    • One example of this would be binary treatment causal inference with separate outcome variances for the treated and control groups (as in Krantsevich et al. 2023)

CategorySampleTracker is a simplified version of FeatureUnsortedPartition, which, as above, does not vary based on tree / partition and is not expected to change during training.
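
The two directions of this mapping can be illustrated with a minimal sketch. The struct and member names below (ObservationToCategory, CategoryToObservations, category_of, indices_of) are hypothetical and are not the stochtree classes; the sketch only assumes that group IDs arrive as one integer per observation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical illustration only; these are not the stochtree classes.

// Forward map: observation index -> category ID
// (the role the text assigns to SampleCategoryMapper).
struct ObservationToCategory {
  std::vector<std::int32_t> category_of;  // category_of[i] = category of observation i

  explicit ObservationToCategory(const std::vector<std::int32_t>& group_ids)
      : category_of(group_ids) {}

  std::int32_t CategoryId(std::size_t observation) const {
    assert(observation < category_of.size());
    return category_of[observation];
  }
};

// Inverse map: category ID -> indices of the observations it contains
// (the role the text assigns to CategorySampleTracker).
struct CategoryToObservations {
  std::map<std::int32_t, std::vector<std::size_t>> indices_of;

  explicit CategoryToObservations(const std::vector<std::int32_t>& group_ids) {
    for (std::size_t i = 0; i < group_ids.size(); ++i) {
      indices_of[group_ids[i]].push_back(i);
    }
  }

  // Throws std::out_of_range if the category was never observed.
  const std::vector<std::size_t>& Indices(std::int32_t category) const {
    return indices_of.at(category);
  }
};
```

Both structures are built once from the training data and only read afterwards, which matches the statement that they are not expected to change during training.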

SampleNodeMapper is inspired by the design of the DataPartition class in LightGBM, released under the MIT license with the following copyright:

Copyright (c) 2016 Microsoft Corporation. All rights reserved. Licensed under the MIT License. See LICENSE file in the project root for license information.

Copyright (c) 2024 stochtree authors. All rights reserved.

Simple container-like interfaces for samples of common models.

Data structures for enumerating potential cutpoint candidates.

This is used in the XBART family of algorithms, which samples split rules based on the log marginal likelihood of every potential cutpoint. For numeric variables with large sample sizes, it is often unnecessary to consider every unique value, so we allow for an adaptive "grid" of potential cutpoint values.

Algorithms for enumerating cutpoints take Dataset and SortedNodeSampleTracker objects as inputs, so that each feature is "pre-sorted" according to its value within a given node. The size of the adaptive cutpoint grid is set by the cutpoint_grid_size configuration parameter.

Numeric Features

When a node has fewer available observations than cutpoint_grid_size, full enumeration of unique available cutpoints is done via the EnumerateNumericCutpointsDeduplication function.

When a node has more available observations than cutpoint_grid_size, potential cutpoints are "thinned out" by considering every k-th observation, where k is implied by the number of observations and the target cutpoint_grid_size.
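
The two numeric cases can be sketched as a single function. This is a minimal illustration under stated assumptions, not the library's implementation: the function name NumericCutpointGrid is hypothetical, the input is assumed to be one feature's values within a node, already sorted in ascending order, and the stride k is assumed to be the integer quotient of the number of observations and cutpoint_grid_size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch. `sorted_values` holds one feature's values for the
// observations in a node, already sorted in ascending order.
std::vector<double> NumericCutpointGrid(const std::vector<double>& sorted_values,
                                        std::size_t cutpoint_grid_size) {
  assert(cutpoint_grid_size > 0);
  const std::size_t n = sorted_values.size();
  // Small node: visit every observation (k = 1).
  // Large node: visit every k-th observation, with k = n / cutpoint_grid_size
  // (integer division; this choice of k is an assumption of the sketch).
  const std::size_t k = (n <= cutpoint_grid_size) ? 1 : n / cutpoint_grid_size;
  std::vector<double> grid;
  for (std::size_t i = k - 1; i < n; i += k) {
    // Skip repeated values so each candidate cutpoint appears once.
    if (grid.empty() || sorted_values[i] != grid.back()) {
      grid.push_back(sorted_values[i]);
    }
  }
  return grid;
}
```

With ten distinct sorted values and a grid size of 5, the stride is 2 and the sketch keeps every second value. A real implementation would also need to decide how to treat the largest value, since splitting there leaves one child empty.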

Ordered Categorical Features

In this case, the grid is every unique value of the ordered categorical feature in ascending order.

Unordered Categorical Features

In this case, the grid is every unique value of the unordered categorical feature, arranged in an outcome-dependent order, as described in Fisher (1958).
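
One common way to obtain an outcome-dependent order is to sort the categories by the mean of the outcome among their observations. The sketch below assumes that statistic; the source does not state which one stochtree uses, and the function name OrderCategoriesByMeanOutcome is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical sketch: order the categories of an unordered feature by the
// mean of an outcome (for example, the current residual) within each category.
// Using the mean as the ordering statistic is an assumption of this sketch.
std::vector<std::uint32_t> OrderCategoriesByMeanOutcome(
    const std::vector<std::uint32_t>& categories,
    const std::vector<double>& outcome) {
  assert(categories.size() == outcome.size());
  // Accumulate (sum, count) per category.
  std::map<std::uint32_t, std::pair<double, std::size_t>> stats;
  for (std::size_t i = 0; i < categories.size(); ++i) {
    auto& entry = stats[categories[i]];
    entry.first += outcome[i];
    entry.second += 1;
  }
  // Pair each category with its mean outcome.
  std::vector<std::pair<double, std::uint32_t>> mean_and_category;
  for (const auto& kv : stats) {
    const double mean = kv.second.first / static_cast<double>(kv.second.second);
    mean_and_category.emplace_back(mean, kv.first);
  }
  // Sort ascending by mean outcome; ties are broken by category ID.
  std::sort(mean_and_category.begin(), mean_and_category.end());
  std::vector<std::uint32_t> ordered;
  for (const auto& mc : mean_and_category) {
    ordered.push_back(mc.second);
  }
  return ordered;
}
```

Once the categories are ordered this way, candidate splits can be enumerated as prefixes of the ordered list, in the same way as for an ordered categorical feature.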

Copyright (c) 2024 stochtree authors. All rights reserved. Licensed under the MIT License. See LICENSE file in the project root for license information.

Data structures used for tracking dataset through the tree building process.

The first category of data structure tracks observations available in nodes of a tree.

  a. UnsortedNodeSampleTracker tracks the observations available in every leaf of every tree in an ensemble, in no feature-specific sort order. It is primarily designed for use in BART-based algorithms.
  b. SortedNodeSampleTracker tracks the observations available in every leaf of a tree, pre-sorted separately for each feature. It is primarily designed for use in XBART-based algorithms.

The second category, SampleNodeMapper, maps observations from a dataset to leaf nodes.
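
The two kinds of tracking can be sketched together for a single tree. The struct below is a hypothetical illustration, not the stochtree or LightGBM code: it keeps one contiguous array of observation indices with a begin offset and size per node (node to observations), plus a per-observation leaf ID (observation to node), and it only handles numeric splits of the form value <= threshold.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch of an unsorted partition for one tree.
struct NodePartition {
  std::vector<std::size_t> indices;     // observation indices, grouped by node
  std::vector<std::size_t> node_begin;  // node_begin[j]: offset of node j in `indices`
  std::vector<std::size_t> node_size;   // node_size[j]: number of observations in node j
  std::vector<int> node_of;             // node_of[i]: leaf currently holding observation i

  // Start with every observation in the root (node 0).
  explicit NodePartition(std::size_t n)
      : indices(n), node_begin{0}, node_size{n}, node_of(n, 0) {
    for (std::size_t i = 0; i < n; ++i) indices[i] = i;
  }

  // Split leaf `node` on feature[i] <= threshold; returns {left_id, right_id}.
  std::pair<int, int> Split(int node, const std::vector<double>& feature, double threshold) {
    assert(node >= 0 && static_cast<std::size_t>(node) < node_begin.size());
    const std::size_t begin = node_begin[node];
    const std::size_t size = node_size[node];
    auto first = indices.begin() + static_cast<std::ptrdiff_t>(begin);
    auto last = first + static_cast<std::ptrdiff_t>(size);
    // Reorder the node's slice in place: "true" observations first.
    auto mid = std::stable_partition(
        first, last, [&](std::size_t i) { return feature[i] <= threshold; });
    const std::size_t left_size = static_cast<std::size_t>(mid - first);
    const int left = static_cast<int>(node_begin.size());
    const int right = left + 1;
    node_begin.push_back(begin);
    node_size.push_back(left_size);
    node_begin.push_back(begin + left_size);
    node_size.push_back(size - left_size);
    // Keep the observation -> leaf map in sync with the partition.
    for (auto it = first; it != mid; ++it) node_of[*it] = left;
    for (auto it = mid; it != last; ++it) node_of[*it] = right;
    return {left, right};
  }
};
```

Because each node owns a contiguous slice of the index array, splitting a leaf only reorders that slice, and the observations of any node can be read without scanning the full dataset.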

SampleNodeMapper is inspired by the design of the DataPartition class in LightGBM, released under the MIT license with the following copyright:

Copyright (c) 2016 Microsoft Corporation. All rights reserved. Licensed under the MIT License. See LICENSE file in the project root for license information.

SortedNodeSampleTracker is inspired by the "approximate" split finding method in xgboost, released under the Apache license with the following copyright:

Copyright 2015~2023 by XGBoost Contributors

Copyright (c) 2023 stochtree authors. All rights reserved. Licensed under the MIT License. See LICENSE file in the project root for license information.